
Quick Summary ⚡️
API Gateways and Rate Limiters are the gatekeepers of modern distributed systems. In a system design interview, merely knowing "what" they are is insufficient. You must demonstrate an understanding of distributed state challenges, latency impacts, and failure cascading. This guide explores advanced interview topics: implementing race-condition-free rate limiters using Redis/Lua, the "Thundering Herd" problem, the Backend-for-Frontend (BFF) pattern for security, and operational trade-offs between NGINX, Envoy, and Kong.
Table of Contents
- Architecture: Beyond the Reverse Proxy
- The Physics of Rate Limiting: Algorithms & Trade-offs
- Distributed State & The Race Condition Trap
- War Stories: Thundering Herds & Cascades
- Tool Selection: Envoy vs. NGINX vs. Kong
- Final Thoughts
In senior backend interviews, questions about API Gateways often serve as a litmus test for operational experience. The interviewer isn't just asking "How do you route traffic?" They are asking, "How do you protect the system from itself?"
A naive implementation of a gateway can become a single point of failure, a latency bottleneck, or a security liability. Similarly, a poorly designed rate limiter can allow traffic spikes that bring down your database, or, conversely, falsely block legitimate users during a sale event.
Architecture: Beyond the Reverse Proxy
A common misconception is that an API Gateway is just a fancy NGINX reverse proxy. While they share DNA, the architectural responsibilities differ significantly. A gateway is an orchestrator of cross-cutting concerns: authentication, rate limiting, observability, and protocol translation.
The Backend-for-Frontend (BFF) Pattern
One of the strongest signals of a senior engineer is the ability to discuss the BFF (Backend-for-Frontend) pattern versus a monolithic General Purpose API Gateway.
In a generic setup, you have one massive gateway serving Mobile, Web, and IoT clients. This leads to "over-fetching" (sending 50 fields to a mobile app that only needs 3) and "under-fetching" (forcing the client to make five waterfall API calls to render one screen).

Interview Question: "Why would you choose a BFF pattern over a single API Gateway, and what are the security implications?"
Answer:
You choose BFF to decouple client-specific requirements. A Mobile BFF can aggregate three downstream calls into one optimized payload, reducing radio usage, battery drain, and latency.
From a security perspective, the BFF acts as a Token Handler. The frontend (SPA/Mobile) should ideally never hold the sensitive long-lived Access Token. Instead, the BFF maintains an encrypted session (via HTTP-only cookies) with the client, and the BFF holds the actual OAuth tokens to talk to backend services. This mitigates XSS risks where tokens stored in LocalStorage are stolen.
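To make this concrete, here is a minimal sketch of a Mobile BFF aggregation endpoint. It uses FastAPI and httpx purely for illustration; the service URLs, routes, and response fields are assumptions, not a prescribed implementation.

```python
# Hypothetical Mobile BFF sketch (FastAPI + httpx). Service addresses,
# routes, and field names are illustrative assumptions.
import asyncio

import httpx
from fastapi import FastAPI, Request

app = FastAPI()

USER_SVC = "http://user-service.internal"
ORDER_SVC = "http://order-service.internal"
RECS_SVC = "http://recs-service.internal"


def access_token_for(session_id: str) -> str:
    # The client only ever holds an HTTP-only session cookie; the BFF
    # looks up the real OAuth access token server-side.
    return "token-from-server-side-session-store"


@app.get("/mobile/home")
async def mobile_home(request: Request):
    token = access_token_for(request.cookies.get("session", ""))
    headers = {"Authorization": f"Bearer {token}"}
    async with httpx.AsyncClient(timeout=2.0, headers=headers) as client:
        # Aggregate three downstream calls into one payload, saving the
        # mobile radio two extra round trips.
        user, orders, recs = await asyncio.gather(
            client.get(f"{USER_SVC}/me"),
            client.get(f"{ORDER_SVC}/orders/recent"),
            client.get(f"{RECS_SVC}/top"),
        )
    # Return only the handful of fields the home screen actually needs.
    return {
        "display_name": user.json().get("name"),
        "recent_orders": orders.json()[:3],
        "recommendations": recs.json()[:3],
    }
```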
The Physics of Rate Limiting: Algorithms & Trade-offs
When asked to "Design a Rate Limiter," dropping the name of an algorithm isn't enough. You must explain why it fits the specific constraints (burstiness vs. smoothness).
| Algorithm | How it Works | Pros | Cons |
|---|---|---|---|
| Token Bucket | Tokens are added at a fixed rate. Requests consume tokens. | Allows bursts (if bucket has tokens). Memory efficient. | Hard to enforce atomically in a distributed setup. |
| Leaky Bucket | Requests enter a queue and are processed at a constant rate. | Smooths traffic spikes completely. Stable outflow. | Bursts are slowed down, potentially increasing latency for valid users. |
| Fixed Window Counter | Count requests in windows (e.g., 12:00-12:01). Reset at boundary. | Easiest to implement. Low memory. | Edge Case: Spikes at window edges (e.g., 12:00:59 and 12:01:01) can allow 2x traffic. |
| Sliding Window Log | Keep timestamps of every request. Remove old ones. | Perfectly accurate. No boundary issues. | Expensive: High memory footprint to store all timestamps. |
Critical Insight: For most high-scale backend systems, Token Bucket or Sliding Window Counter (a hybrid approach) is preferred. Pure Sliding Window Log is too memory-intensive for systems handling millions of RPS.
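To ground the table, here is a minimal single-process Token Bucket sketch in Python. The rate and capacity numbers are illustrative, and it deliberately ignores the distributed-state problem covered next.

```python
# Minimal in-process Token Bucket sketch (illustrative numbers).
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


limiter = TokenBucket(rate=10, capacity=20)   # 10 rps, bursts up to 20
if not limiter.allow():
    print("429 Too Many Requests")
```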

Distributed State & The Race Condition Trap
This is where the interview gets real. If you run 10 instances of your API Gateway, how do they share the rate limit counter? If you use a local in-memory counter, a user can hit Instance A 10 times and Instance B 10 times, bypassing the limit.
The standard solution is a centralized store like Redis. But a naive "Read-Modify-Write" sequence against Redis creates a Race Condition.
The Scenario:
1. Process A reads count (current: 9).
2. Process B reads count (current: 9).
3. Process A increments to 10 and writes.
4. Process B increments to 10 and writes.
Result: The real count should be 11, but it's 10. You've just allowed a leak.
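In code, the naive version looks roughly like this (a sketch against a generic redis-py client; the key format and limit are illustrative). The read and the write are separate commands, and the race lives in the gap between them.

```python
# The race-prone Read-Modify-Write version (illustrative sketch).
# Two gateway instances can both read 9, both decide "under the limit",
# and both write 10 -- exactly the interleaving described above.
import redis

r = redis.Redis()

def naive_allow(user_id: str, limit: int = 10) -> bool:
    key = f"ratelimit:user:{user_id}"
    count = int(r.get(key) or 0)       # READ
    if count >= limit:
        return False
    r.set(key, count + 1, ex=60)       # MODIFY + WRITE (not atomic with the read)
    return True
```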
The Solution: Redis Lua Scripts
To solve this, we use Lua scripts to make the "Get-Check-Increment" operation atomic on the Redis server side. No other command can run while the script executes.
```lua
-- Redis Lua Script for an Atomic Token Bucket
-- KEYS[1]: The rate limit key (e.g., "ratelimit:user:123")
-- ARGV[1]: Refill rate (tokens per second)
-- ARGV[2]: Bucket capacity
-- ARGV[3]: Current timestamp (seconds)
-- ARGV[4]: Tokens to consume (usually 1)
local key = KEYS[1]
local rate = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

-- Get current state (defaults to a full bucket for first-time callers)
local payload = redis.call("hmget", key, "tokens", "last_refilled")
local last_tokens = tonumber(payload[1]) or capacity
local last_refilled = tonumber(payload[2]) or now

-- Refill proportionally to elapsed time, capped at capacity
local delta = math.max(0, now - last_refilled)
local filled_tokens = math.min(capacity, last_tokens + (delta * rate))

local allowed = false
local new_tokens = filled_tokens
if filled_tokens >= requested then
  allowed = true
  new_tokens = filled_tokens - requested
end

-- Save state with a TTL (to auto-cleanup inactive users)
redis.call("hmset", key, "tokens", new_tokens, "last_refilled", now)
redis.call("expire", key, 60)

-- Lua booleans don't survive the Redis reply conversion (false becomes nil),
-- so return an integer flag instead
return allowed and 1 or 0
```
This script ensures that high-concurrency requests don't corrupt the token counter, maintaining strict enforcement even across a distributed cluster.
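From the application side, a gateway instance might invoke the script along these lines (a redis-py sketch; the refill rate, capacity, and key naming are assumptions):

```python
# Sketch: calling the Lua token bucket from a gateway instance with redis-py.
# register_script handles EVALSHA caching with an EVAL fallback.
import time

import redis

LUA_TOKEN_BUCKET = """ ...the Lua script above... """  # paste the script body here

r = redis.Redis()
token_bucket = r.register_script(LUA_TOKEN_BUCKET)

def allow_request(user_id: str) -> bool:
    result = token_bucket(
        keys=[f"ratelimit:user:{user_id}"],
        args=[
            10,                # refill rate: 10 tokens/second (illustrative)
            20,                # bucket capacity (illustrative)
            int(time.time()),  # current timestamp
            1,                 # tokens to consume
        ],
    )
    return result == 1
```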
War Stories: Thundering Herds & Cascades
A senior engineer anticipates failure. Two common failure modes related to Gateways and Rate Limiters are the Thundering Herd and Cascading Failure.
1. The Thundering Herd Problem
Imagine your API Gateway goes down for 30 seconds. During this time, mobile clients are retrying connection failures. When the Gateway comes back up, 100,000 clients retry simultaneously.
The sudden spike (100x normal load) instantly crashes the Gateway again, or the database behind it. You are stuck in a reboot loop.
Mitigation: Jitter & Exponential Backoff
Clients must not retry at fixed intervals (1s, 2s, 3s). They must add randomness (Jitter).
Retry Interval = (Base * 2^n) + Random_Jitter
The Gateway itself should implement Load Shedding, rejecting excess requests immediately with a 503 rather than trying to queue them.
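On the client side, the retry formula above translates roughly into this sketch (the base delay, cap, and attempt count are illustrative):

```python
# Client-side retry with exponential backoff and jitter (illustrative values).
import random
import time


def backoff_with_jitter(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    # (Base * 2^n) + Random_Jitter, capped so delays don't grow unbounded.
    return min(cap, base * (2 ** attempt)) + random.uniform(0, base)


def call_with_retries(do_request, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return do_request()
        except ConnectionError:
            # Randomized sleep keeps 100,000 clients from retrying in
            # lockstep when the gateway comes back up.
            time.sleep(backoff_with_jitter(attempt))
    raise RuntimeError("gave up after retries")
```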
2. Latency Cascades
If your User Service becomes slow (e.g., 5s response time), the API Gateway holds the connection open waiting for it. Eventually, the Gateway runs out of ephemeral ports or file descriptors. Now, the Gateway cannot process requests for the Product Service either, because it's choked by the User Service connections.
Mitigation: Bulkheads & Timeouts
Use the Bulkhead Pattern. Isolate connection pools.
"User Service gets max 50 concurrent connections. If 50 are used, reject immediately. Do not impact Product Service."
Aggressive Timeouts are mandatory. Better to fail fast (500ms) than to hang the system.
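A rough sketch of the bulkhead idea using per-service semaphores in asyncio; the pool sizes and timeout are illustrative assumptions, not recommended values.

```python
# Bulkhead sketch: each downstream service gets its own bounded pool,
# so a slow User Service cannot starve Product Service traffic.
import asyncio

POOLS = {
    "user": asyncio.Semaphore(50),
    "product": asyncio.Semaphore(50),
}


async def call_downstream(service: str, coro_factory, timeout: float = 0.5):
    sem = POOLS[service]
    if sem.locked():
        # Pool exhausted: reject immediately instead of queueing.
        raise RuntimeError(f"503: {service} bulkhead full")
    async with sem:
        # Fail fast rather than holding the connection open for seconds.
        return await asyncio.wait_for(coro_factory(), timeout=timeout)
```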
Tool Selection: Envoy vs. NGINX vs. Kong
You might be asked to choose a technology stack. Avoid "fanboy" answers; focus on capability.
- NGINX: The battle-tested standard. Incredible performance, but configuration (nginx.conf) is static. Reloading config for dynamic service discovery can be tricky (though NGINX Plus solves this). Best for simple, high-performance edge ingress.
- Envoy Proxy: The cloud-native darling. Built for dynamic service discovery and "Service Mesh" (Istio) architectures. It has first-class support for gRPC, advanced load balancing (zone-aware), and observability. It is complex to manage manually.
- Kong: Built on top of NGINX (OpenResty) but adds a management API and plugins (Lua). It bridges the gap—easier to use than raw Envoy, more dynamic than raw NGINX. Great for "API Management" (selling APIs, developer portals).
Final Thoughts
The difference between a junior and a senior engineer in an API Gateway interview is the focus on failure modes. A junior describes the "Happy Path" where the request flows through. A senior describes what happens when the Redis-backed rate limiter is slow, how to handle partial failures using circuit breakers, and how to prevent a retry storm from taking down the platform.
When designing your gateway, remember: The Gateway is the first line of defense, but also the first bottleneck. Keep logic there minimal. Offload heavy lifting (like complex auth logic) to sidecars or async workers where possible, and always assume your downstream services will fail.