Why I Chose arq and RQ Over Celery for LLM Workloads

If you’re building LLM-powered applications with FastAPI, you need a task queue. LLM API calls are slow — 2 to 30 seconds per request. You can’t block your web server on that. But the default answer in the Python world has always been Celery, and for LLM workloads, Celery is overkill.

LLM Workloads Are I/O Bound
Celery vs RQ vs arq
Memory Footprint
Rate Limiting LLM APIs
Why I Use Both arq and RQ
FastAPI Integration
When to Actually Use Celery
The Bottom Line

LLM Workloads Are I/O Bound

The first thing to understand is that LLM workloads are fundamentally I/O bound. You’re not doing heavy computation — you’re waiting for an HTTP response from OpenAI, Anthropic, or your self-hosted model. The CPU is idle while you wait. This changes everything about what you need from a task queue.

Celery was designed for a different world. It grew up around CPU-bound tasks: image processing, data crunching, report generation. Its default prefork pool runs each concurrent task in its own OS process. That makes sense when you need CPU isolation. But for I/O-bound LLM calls, you’re paying the memory overhead of multiple processes just to wait on network responses.

Aspect	CPU-Bound (Celery’s sweet spot)	I/O-Bound (LLM calls)
Bottleneck	CPU computation	Network latency
Concurrency model	Multiprocessing (OS processes)	Async I/O or threading
Memory per worker	High (each process = full Python runtime)	Low (coroutines share one process)
Typical task duration	Milliseconds to seconds	2-30 seconds
Scaling strategy	More CPU cores	More concurrent connections
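
To make the I/O-bound point concrete, here is a minimal sketch that uses only the standard library. `fake_llm_call` is a hypothetical stand-in for a real provider request (it just sleeps), and the 0.2-second latency is an arbitrary assumption. The point is that 50 waits overlap in one process instead of adding up.

```python
import asyncio
import time


async def fake_llm_call(prompt: str, latency: float = 0.2) -> str:
    """Hypothetical stand-in for an LLM request: it waits and uses no CPU."""
    await asyncio.sleep(latency)
    return f"response to: {prompt}"


async def run_batch(n: int) -> tuple[list[str], float]:
    """Run n simulated calls concurrently; return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(fake_llm_call(f"prompt {i}") for i in range(n))
    )
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = asyncio.run(run_batch(50))
    # Run one after another, 50 calls would take about 10 s. Run concurrently,
    # they finish in roughly one call's latency because the waits overlap.
    print(f"{len(results)} calls in {elapsed:.2f}s")
```

A real client spends its time on sockets rather than in `asyncio.sleep`, but the scheduling behavior is the same: the event loop parks each coroutine while it waits.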

Celery vs RQ vs arq

Feature	Celery	RQ (Redis Queue)	arq
Broker	Redis, RabbitMQ, SQS, etc.	Redis only	Redis only
Concurrency	Multiprocessing, eventlet, gevent	Multiprocessing (1 task per worker)	Native async/await
Async support	No native async (gevent/eventlet as workaround)	No (sync only)	First-class
Dependencies	Heavy (~15 transitive deps)	Minimal (~3 deps)	Minimal (~2 deps)
Setup complexity	High (broker config, result backend, serializer, etc.)	Low	Low
Rate limiting	Built-in (per-task)	Manual	Manual (but async makes it natural)
Retry logic	Built-in, configurable	Built-in, basic	Built-in, configurable
Monitoring	Flower (separate service)	rq-dashboard	arq’s built-in health checks
Task routing	Advanced (multiple queues, priority)	Basic (named queues)	Basic (named queues)
Periodic tasks	Celery Beat (separate process)	rq-scheduler (separate)	Built-in cron support
Learning curve	Steep	Gentle	Gentle

Here’s what the setup looks like for each:

# Celery: lots of configuration
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')
app.conf.update(
    result_backend='redis://localhost:6379/0',
    task_serializer='json',
    result_serializer='json',
    accept_content=['json'],
    # Route by full task name (module name + function name)
    task_routes={'tasks.score_response': {'queue': 'llm'}},
    task_default_rate_limit='10/m',
)

# retry_backoff only takes effect together with autoretry_for
@app.task(bind=True, autoretry_for=(Exception,), max_retries=3, retry_backoff=True)
def score_response(self, text):
    # This is sync: Celery runs it in a pool child process
    result = openai_client.chat.completions.create(...)
    return result.choices[0].message.content

# RQ — simple and straightforward
from redis import Redis
from rq import Queue, Retry

redis_conn = Redis()
q = Queue('llm', connection=redis_conn)

def score_response(text):
    # Plain sync function
    result = openai_client.chat.completions.create(...)
    return result.choices[0].message.content

# Enqueue
job = q.enqueue(score_response, text, retry=Retry(max=3, interval=60))

# arq — async-native, fits naturally with FastAPI
from arq import create_pool
from arq.connections import RedisSettings

async def score_response(ctx, text):
    # Native async — no thread pool, no subprocess
    result = await async_openai_client.chat.completions.create(...)
    return result.choices[0].message.content

class WorkerSettings:
    functions = [score_response]
    redis_settings = RedisSettings()
    max_jobs = 50  # 50 concurrent async tasks in ONE process

Notice the difference: arq [1] runs 50 concurrent LLM calls in a single process because they’re all just awaiting network I/O. Celery [3] with its default prefork pool would need 50 child processes for the same concurrency, unless you switch to a gevent or eventlet pool. RQ [2] would need 50 worker processes.

One important note: Celery still has no native async/await support as of 2025. The async support issue (GitHub #6552) has been open since 2020 and keeps getting deferred. You can use gevent or eventlet as workarounds, or third-party packages like celery-aio-pool, but these are hacks around a fundamentally sync architecture. arq was built async from day one — by Samuel Colvin, the same person behind Pydantic.
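
For context, this is roughly what a sync/async bridge looks like when a sync-only worker has to call an async client. It is a minimal sketch of one common pattern (`asyncio.run` per task), not one of the workarounds named above. `call_llm_async` and `score_response_sync` are hypothetical names, and the sleep stands in for the network call.

```python
import asyncio


async def call_llm_async(text: str) -> str:
    """Hypothetical async client call; the sleep stands in for network I/O."""
    await asyncio.sleep(0.01)
    return text.upper()


def score_response_sync(text: str) -> str:
    """Sync entry point that a sync-only worker task could call.

    asyncio.run creates a fresh event loop, runs the coroutine to completion,
    and closes the loop, so the calling process is blocked for the whole call.
    """
    return asyncio.run(call_llm_async(text))


if __name__ == "__main__":
    print(score_response_sync("hello"))
```

The bridge works, but each task gets its own short-lived event loop, so calls inside one worker process never overlap. That is the concurrency an async-native worker gives you for free.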

Memory Footprint

The memory difference is significant in practice:

Setup	Concurrency	Memory usage	Processes
Celery (prefork, default)	50 tasks	~2.5 GB (50 × ~50 MB)	50
Celery (gevent)	50 tasks	~500 MB (1 process + greenlets)	1
RQ	50 tasks	~2.5 GB (50 × ~50 MB)	50
arq	50 tasks	~80 MB (1 process, 50 coroutines)	1

These are rough numbers, but the order of magnitude is real. When you’re deploying on a single VPS or a small Kubernetes pod, this matters.

Rate Limiting LLM APIs

Every LLM provider has rate limits [4] — requests per minute, tokens per minute, sometimes both. If you blast 100 concurrent requests, you’ll get 429 errors. You need to throttle.

Celery has built-in rate limiting (rate_limit='10/m'), but it’s per-worker, not global. If you have 5 workers each set to 10/minute, you’re actually doing 50/minute. You need a separate mechanism for global rate limiting.

With arq, since everything runs in one process with async, you can use a simple in-process limiter. This one is a sliding window built on a lock and a deque of timestamps:

import asyncio
import time
from collections import deque

class RateLimiter:
    """Sliding window: at most max_per_minute acquisitions in any 60 seconds."""

    def __init__(self, max_per_minute: int):
        self.max_per_minute = max_per_minute
        self.timestamps: deque = deque()
        self.lock = asyncio.Lock()

    async def acquire(self):
        async with self.lock:
            while True:
                now = time.monotonic()
                # Drop timestamps that have left the 60-second window
                while self.timestamps and self.timestamps[0] <= now - 60:
                    self.timestamps.popleft()
                if len(self.timestamps) < self.max_per_minute:
                    self.timestamps.append(now)
                    return
                # Window is full: sleep until the oldest entry expires
                await asyncio.sleep(self.timestamps[0] + 60 - now)

rate_limiter = RateLimiter(max_per_minute=50)

async def score_response(ctx, text):
    await rate_limiter.acquire()
    result = await async_openai_client.chat.completions.create(...)
    return result.choices[0].message.content

Because an arq worker is a single async process, this in-process rate limiter actually works, as long as you run one worker process per provider limit. With Celery’s multiprocessing, or with several arq workers, you’d need Redis-based distributed rate limiting, which is more complexity.
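
For comparison, here is a minimal sketch of what the Redis-based approach can look like, using a fixed-window counter. The limiter only needs `incr` and `expire`, both of which redis-py’s client provides. `allow_request`, `CounterStore`, the key format, and the 120-second expiry are assumptions for illustration, and `InMemoryStore` is a stand-in so the example runs without a Redis server.

```python
import time
from typing import Optional, Protocol


class CounterStore(Protocol):
    """The two commands the limiter needs; a Redis client provides both."""

    def incr(self, name: str) -> int: ...
    def expire(self, name: str, ttl: int) -> bool: ...


def allow_request(store: CounterStore, key_prefix: str, limit_per_minute: int,
                  now: Optional[float] = None) -> bool:
    """Fixed-window counter: True if this request fits in the current minute."""
    current = time.time() if now is None else now
    window = int(current // 60)       # index of the current one-minute window
    key = f"{key_prefix}:{window}"
    count = store.incr(key)           # atomic in Redis, so workers can share it
    if count == 1:
        store.expire(key, 120)        # let old window keys clean themselves up
    return count <= limit_per_minute


class InMemoryStore:
    """Stand-in for local experiments only; it is not shared across processes."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = {}

    def incr(self, name: str) -> int:
        self.counts[name] = self.counts.get(name, 0) + 1
        return self.counts[name]

    def expire(self, name: str, ttl: int) -> bool:
        return True  # expiry is a no-op in this stand-in


if __name__ == "__main__":
    store = InMemoryStore()
    print([allow_request(store, "openai", 3, now=120.0) for _ in range(5)])
```

A fixed window is the simplest variant and allows a burst of up to twice the limit across a window boundary. That trade-off is usually acceptable for staying under a provider quota, but it is a design choice, not a requirement.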

Why I Use Both arq and RQ

arq is my default for LLM API calls — scoring, summarization, embeddings, anything that’s an async HTTP call to an LLM provider. The async-native design means I get high concurrency with minimal resources, and it fits perfectly with FastAPI’s async ecosystem.

RQ I use for simpler background tasks that are sync by nature — sending emails, generating PDF reports, running database migrations, cleanup jobs. Tasks where I don’t need high concurrency and the simplicity of “just write a regular function” is the priority.

graph LR
    API["FastAPI"] --> R["Redis"]
    R --> ARQ["arq Worker"]
    R --> RQW["RQ Worker"]
    ARQ --> LLM["LLM APIs"]
    RQW --> SYNC["Sync Tasks"]

    style API fill:#264653,stroke:#264653,color:#fff
    style R fill:#e76f51,stroke:#e76f51,color:#fff
    style ARQ fill:#2a9d8f,stroke:#2a9d8f,color:#fff
    style RQW fill:#e9c46a,stroke:#e9c46a,color:#000
    style LLM fill:#2d6a4f,stroke:#2d6a4f,color:#fff
    style SYNC fill:#f4a261,stroke:#f4a261,color:#000

Both share the same Redis instance. FastAPI enqueues to whichever queue fits the task. No RabbitMQ, no Celery Beat process, no Flower monitoring server. Just Redis, which I already need for caching and session storage.

FastAPI Integration

The integration with FastAPI [6] is clean:

from fastapi import FastAPI
from arq import create_pool
from arq.connections import RedisSettings
from arq.jobs import Job, JobStatus

app = FastAPI()

# on_event still works, but recent FastAPI releases deprecate it
# in favor of a lifespan handler
@app.on_event("startup")
async def startup():
    app.state.arq = await create_pool(RedisSettings())

@app.post("/score")
async def score(text: str):
    job = await app.state.arq.enqueue_job("score_response", text)
    return {"job_id": job.job_id}

@app.get("/score/{job_id}")
async def get_score(job_id: str):
    # Build a Job handle from the ID and the shared Redis pool
    job = Job(job_id, app.state.arq)
    if await job.status() == JobStatus.complete:
        # result() re-raises the task's exception if the job failed
        return {"score": await job.result()}
    return {"status": "processing"}

No sync/async bridge. No thread pool executor wrapping. The whole stack is async end-to-end: FastAPI → Redis → arq → async LLM client [7].

When to Actually Use Celery

Celery isn’t dead — it’s just not the right tool for every job:

Use case	Best choice	Why
LLM API calls (scoring, summarization)	arq	Async I/O, high concurrency, low memory
Simple background jobs (email, cleanup)	RQ	Dead simple, sync is fine
CPU-heavy tasks (image processing, ML training)	Celery	Multiprocessing isolates CPU work
Complex workflows (chaining, fan-out, chord)	Celery	Built-in primitives for task composition
Multi-broker (RabbitMQ + Redis + SQS)	Celery	Only option with multi-broker support
Enterprise with existing Celery infra	Celery	Migration cost isn’t worth it

The pattern I’ve settled on: arq for I/O-bound LLM work, RQ for simple sync tasks, and Celery only if I genuinely need its workflow primitives or multi-broker support.

The Bottom Line

If you’re already running FastAPI + Redis (which most LLM apps are), arq adds almost zero operational complexity. It’s just another async process reading from the same Redis. Compare that to a typical Celery deployment, which adds broker and result backend configuration, a Beat scheduler process, and a Flower dashboard.

The LLM ecosystem is I/O-bound by nature. Your tools should reflect that.

What task queue setup are you using for LLM workloads? Have you found Celery worth the overhead, or have you moved to something lighter?

References:

[1] “arq — Job queues and RPC in python with asyncio and redis.” Samuel Colvin.
[2] “RQ: Simple job queues for Python.” RQ Project.
[3] “Celery — Distributed Task Queue.” Celery Project.
[4] “Rate Limiting.” OpenAI.
[5] “Anthropic API Rate Limits.” Anthropic.
[6] “FastAPI Background Tasks.” FastAPI.
[7] “asyncio — Asynchronous I/O.” Python.