Handling Long-Running LLM Requests: A Backend Queuing Strategy

Quick Summary ⚡️

Integrating Large Language Models (LLMs) into production backends breaks traditional request-response paradigms. Standard HTTP timeouts, unpredictable generation latency (often exceeding 60 seconds), and high GPU costs make synchronous API calls dangerous at scale. This guide explores the architectural shift required to handle long-running inference tasks reliably. We will cover the transition from synchronous waiting to asynchronous job queues, compare result retrieval patterns like Server-Sent Events (SSE) vs. Webhooks, and dive into advanced strategies for priority queuing and idempotency to prevent expensive retry storms.

Table of Contents

  • The HTTP Timeout Problem
  • The Asynchronous Architecture Pattern
  • Retrieval Strategies: Polling vs. SSE vs. Webhooks
  • Idempotency & Cost-Aware Retries
  • Priority Management for GPU Resources
  • Final Thoughts

The HTTP Timeout Problem

In traditional web development, we optimize for milliseconds. A database query taking 500ms is slow; an API response taking 2 seconds is often unacceptable. However, Generative AI flips this metric on its head. A complex GPT-4 query with a large context window and extensive output tokens can easily run for 45 to 90 seconds. If your backend architecture relies on a simple synchronous HTTP connection to an LLM provider, you are inviting failure.


Most load balancers (AWS ALB, Nginx) and client browsers have default timeouts ranging from 30 to 60 seconds. If the model is still "thinking" when that timer expires, the connection is severed. The client sees a 504 Gateway Timeout, but, crucially, the LLM provider is still generating text and still billing you for the compute. This "ghost processing" wastes money and leaves your system in an inconsistent state.
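To make that failure mode concrete, here is a minimal sketch of the naive synchronous pattern. The provider URL and response shape are placeholders rather than a real API, and the httpx library is just one way to make the call; the point is that when the client-side timeout fires, nothing cancels the generation on the provider's side.

import httpx

def generate_synchronously(prompt: str) -> str:
    # Naive pattern: hold one HTTP connection open until the model finishes.
    # The URL and payload shape below are placeholders, not a real provider API.
    try:
        response = httpx.post(
            "https://llm-provider.example.com/v1/generate",
            json={"prompt": prompt},
            timeout=30.0,  # a typical client / load-balancer default
        )
        response.raise_for_status()
        return response.json()["text"]
    except httpx.TimeoutException:
        # The connection dies here, but the provider may keep generating
        # (and billing) on its side: the "ghost processing" problem.
        raise RuntimeError("Generation timed out after 30s; the result is lost")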


Furthermore, standard retry logic exacerbates this. If a client times out at 30 seconds and immediately retries, you now have two expensive GPU jobs running for the same request. To solve this, we must decouple the request ingestion from the inference processing.


The Asynchronous Architecture Pattern

The solution is to move from a synchronous request-response cycle to an asynchronous job queue model. When a client requests an LLM generation, the API does not block waiting for the answer. Instead, it places a job on a message broker (like RabbitMQ, Redis, or SQS) and immediately returns a "Job ID" to the client.


This approach involves three distinct components:

  1. Producer API: Accepts the prompt, validates input, writes a "PENDING" record to the database, enqueues the job, and returns HTTP 202 Accepted.
  2. Message Broker: Acts as a buffer, absorbing burst traffic and protecting downstream GPU limits.
  3. Worker Service: Pulls jobs, manages the long-running LLM connection, and updates the database with the result or error state.

Below is a simplified implementation of the producer (API) layer in Python, written in a FastAPI style.


import uuid
from datetime import datetime

# Endpoint to accept LLM requests
@router.post("/generate", status_code=202)
async def submit_generation_job(
    request: GenerationRequest,
    db: Session = Depends(get_db)
):
    # 1. Create a unique Job ID
    job_id = str(uuid.uuid4())

    # 2. Persist initial state to DB (vital for observability)
    new_job = Job(
        id=job_id,
        user_id=request.user_id,
        prompt_hash=hash_prompt(request.prompt),
        status="QUEUED",
        created_at=datetime.utcnow()
    )
    db.add(new_job)
    db.commit()

    # 3. Push to Queue (e.g., Redis/SQS)
    # We include the job_id so the worker knows what to update
    await queue_client.enqueue(
        queue_name="llm_inference_v1",
        message={
            "job_id": job_id,
            "prompt": request.prompt,
            "params": request.model_parameters
        }
    )

    # 4. Return immediately
    return {
        "job_id": job_id,
        "status": "QUEUED",
        "info": "Job submitted successfully. Poll /status/{job_id} for updates."
    }

By persisting the state to a primary database (like Postgres) before queuing, we ensure we have a record of every request, even if the queue fails or the worker crashes. This is a fundamental pattern in Distributed Systems known as the "Transactional Outbox" pattern (simplified here for clarity).
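For reference, the stricter version of that pattern commits the job row and an outbox row in the same database transaction, and a separate relay process publishes to the broker. A rough sketch, reusing the Job model and hash_prompt helper from above and assuming a hypothetical OutboxMessage model:

import uuid

def submit_job_transactionally(db, request):
    # Both rows are written inside one transaction: if the commit fails,
    # neither the job nor the pending broker message exists.
    job_id = str(uuid.uuid4())
    db.add(Job(id=job_id, status="QUEUED", prompt_hash=hash_prompt(request.prompt)))
    db.add(OutboxMessage(
        id=str(uuid.uuid4()),
        topic="llm_inference_v1",
        payload={"job_id": job_id, "prompt": request.prompt},
        published=False,
    ))
    db.commit()
    return job_id

# A separate relay process polls unpublished OutboxMessage rows, pushes them
# to the broker, and marks them published, so a broker outage never loses a job.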


Retrieval Strategies: Polling vs. SSE vs. Webhooks

Once the backend has the job, how does the client get the result? For LLM workloads, User Experience (UX) requirements dictate the technical choice. Users expect to see text streaming in real time to perceive the system as "fast," even if the total generation time is long.

Strategy | Pros | Cons | Best Use Case
Short Polling | Easiest to implement. Works through all firewalls. | High server load. "Chatty" network traffic. Delayed updates. | Low-volume internal tools where latency isn't critical.
Webhooks | Server pushes data only when done. Highly efficient. | Requires client to expose a public URL. Complex to debug. | Machine-to-machine (B2B) API integrations.
Server-Sent Events (SSE) | Native browser support. Unidirectional stream. Low overhead. | Connection limit issues (6 per browser domain). | The Gold Standard for LLM Chat UIs.
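
Whichever option you choose, short polling is the easiest to stand up: it only needs a status endpoint to pair with the producer shown earlier. A minimal sketch, reusing the hypothetical Job model and get_db dependency and assuming FastAPI's HTTPException:

from fastapi import HTTPException

@router.get("/status/{job_id}")
async def get_job_status(job_id: str, db: Session = Depends(get_db)):
    job = db.query(Job).filter(Job.id == job_id).first()
    if job is None:
        raise HTTPException(status_code=404, detail="Unknown job_id")
    payload = {"job_id": job.id, "status": job.status}
    # Only expose the generated text once the worker has finished
    if job.status == "COMPLETED":
        payload["result"] = job.result
    return payload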

For most consumer-facing LLM applications, Server-Sent Events (SSE) is the superior choice. Unlike WebSockets, which are bidirectional and require complex handshake protocols, SSE is designed specifically for a server pushing updates (like token chunks) to a client.


In a queued architecture, implementing SSE requires a "Pub/Sub" mechanism. The Worker generates tokens and publishes them to a Redis channel subscribed to by the API layer, which then forwards them to the open HTTP connection with the client.
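A sketch of that relay on the API side, assuming FastAPI's StreamingResponse, the redis.asyncio client, and a worker that publishes token chunks to a tokens:{job_id} channel followed by a [DONE] sentinel (the channel name and sentinel are conventions invented for this example):

import redis.asyncio as aioredis
from fastapi.responses import StreamingResponse

redis_client = aioredis.from_url("redis://localhost:6379")  # assumed broker location

@router.get("/stream/{job_id}")
async def stream_tokens(job_id: str):
    async def event_generator():
        pubsub = redis_client.pubsub()
        await pubsub.subscribe(f"tokens:{job_id}")  # channel the worker publishes to
        try:
            async for message in pubsub.listen():
                if message["type"] != "message":
                    continue
                chunk = message["data"].decode()
                if chunk == "[DONE]":  # sentinel the worker sends when generation ends
                    break
                yield f"data: {chunk}\n\n"  # SSE framing: "data: ...\n\n" per event
        finally:
            await pubsub.unsubscribe(f"tokens:{job_id}")

    return StreamingResponse(event_generator(), media_type="text/event-stream")

On the worker side, each generated chunk becomes a publish call on the same channel, so the API process never holds the GPU connection itself.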


Idempotency & Cost-Aware Retries

Retry logic for LLMs differs significantly from standard microservices. If a database query fails, retrying is cheap. If an LLM generation fails after generating 800 tokens, retrying costs real money and doubles the latency.


We must classify failures into two categories:

  • Transient Network Errors: Connection resets, 503s from the provider. These are safe to retry with exponential backoff.
  • Deterministic Errors: Content policy violations, invalid prompt structure, or context length exceeded. These must never be retried.

Furthermore, we need strong Idempotency. If a worker picks up a job, crashes, and the job re-enters the queue (a standard visibility timeout mechanism), we must ensure we don't process it twice if the first attempt actually succeeded but failed to acknowledge.


Here is a robust worker implementation pattern:


async def process_job(job_payload):
    job_id = job_payload['job_id']

    # 1. Atomic Lock / Idempotency Check
    # Ensure no other worker is processing this ID
    if not redis.set(f"lock:{job_id}", "LOCKED", nx=True, ex=300):
        logger.warning(f"Job {job_id} is already being processed.")
        return

    try:
        # 2. Check DB status to prevent re-processing completed jobs
        current_status = db.get_status(job_id)
        if current_status in ["COMPLETED", "FAILED"]:
            return

        # 3. Update State to PROCESSING
        db.update_status(job_id, "PROCESSING")

        # 4. Call LLM (the expensive part)
        response = await llm_provider.generate(
            prompt=job_payload['prompt'],
            # Intelligent timeout: longer than the HTTP timeout
            timeout=90
        )

        # 5. Save Result
        db.save_result(job_id, response)
        db.update_status(job_id, "COMPLETED")

    except RateLimitError:
        # Transient: re-queue with delay (backoff)
        queue.retry(job_payload, delay=60)
    except ContextLengthExceededError:
        # Deterministic: do NOT retry. Fail permanently.
        db.update_status(job_id, "FAILED", error="Context too long")
    except Exception as e:
        # Unknown error -> Dead Letter Queue
        logger.error(f"Critical failure: {e}")
        queue.send_to_dlq(job_payload)
    finally:
        # Release lock
        redis.delete(f"lock:{job_id}")

Priority Management for GPU Resources

Not all requests are created equal. In a production SaaS, you likely have Free Tier users and Enterprise users. If a viral event causes 10,000 free users to flood your system, your Enterprise users, who pay the bills, should not be stuck behind them in a First-In-First-Out (FIFO) queue.


Implementing a Priority Queue strategy is essential for Backend Architecture scaling. A common pattern is the "Fast Lane / Slow Lane" approach:

  • High Priority Queue: For paid users. Dedicated workers poll this queue more frequently (e.g., 80% of worker capacity).
  • Low Priority Queue: For free users. Remaining 20% of worker capacity.

However, simple priority queues can lead to "starvation", where free users never get processed. A better approach is Weighted Round Robin consumption. Configure your workers to pull from the High Priority queue 4 times for every 1 time they pull from the Low Priority queue.
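A sketch of that weighted consumption loop, assuming a queue client whose pop() method returns a job payload or None when the queue is empty (the queue names and interface are illustrative):

import asyncio

# 4:1 weighting: high-priority jobs are drained four times as often,
# but low-priority jobs always get a guaranteed slot (no starvation).
CONSUMPTION_PATTERN = ["high", "high", "high", "high", "low"]

async def worker_loop(queue_client):
    while True:
        pulled_any = False
        for tier in CONSUMPTION_PATTERN:
            job = await queue_client.pop(f"llm_inference_{tier}_priority")
            if job is None:
                continue
            pulled_any = True
            await process_job(job)  # the worker function shown earlier
        if not pulled_any:
            await asyncio.sleep(1)  # both queues empty; back off briefly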


Additionally, consider "Complexity Estimation." Before enqueuing, estimate the token count. A request asking for a 50-word summary is different from one asking for a 9,000-word blog post. Routing short tasks to a dedicated "Short-Task Queue" can significantly improve perceived throughput.
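A rough routing sketch, assuming a count_tokens helper (for example, one built on a tokenizer library such as tiktoken) and a hypothetical max_tokens field on the request:

SHORT_TASK_TOKEN_BUDGET = 1_000  # arbitrary threshold for this sketch

def choose_queue(prompt: str, max_output_tokens: int) -> str:
    # Estimate total work before enqueuing; count_tokens is an assumed helper
    estimated_tokens = count_tokens(prompt) + max_output_tokens
    if estimated_tokens <= SHORT_TASK_TOKEN_BUDGET:
        return "llm_inference_short"  # dedicated short-task queue
    return "llm_inference_long"

# In the producer: queue_name = choose_queue(request.prompt, request.max_tokens)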


Final Thoughts

Designing backend systems for AI requires unlearning the synchronous habits of the REST API era. By embracing asynchronous queues, implementing robust state management, and treating LLM interactions as expensive, fallible operations, you build a system that remains resilient under load.


The transition to this architecture adds complexity: you now have to manage workers, brokers, and state synchronization. But the tradeoff is a system that handles 60-second generation times as gracefully as a 50ms database lookup. For further reading on queue theory, I highly recommend exploring the Amazon SQS documentation on visibility timeouts and dead-letter queues, which applies to almost any broker you choose.
