One of my favorite kinds of engineering story is the one where the code is technically correct and operationally wrong.

This gateway has one of those stories hiding in plain sight.

In src/app.js, the service uses express-rate-limit before apiKeyAuth. That means almost every request is subject to a single global limiter unless the key is whitelisted. It is simple, easy to add, and absolutely good enough to stop accidental overload in a small deployment.

It is also not the policy the product actually wants.

The mismatch

The business identity in this system is the API key. But because the limiter runs before auth, the effective identity is whatever the middleware defaults to, which is the client IP.
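To make that concrete, here is a minimal simulation in plain Node (no Express; the limiter, request shapes, and limit are invented for illustration, not taken from the repo) of a fixed-window limiter keyed by IP, the way a pre-auth limiter effectively behaves:

```javascript
// Toy fixed-window limiter keyed by req.ip, mimicking a limiter that
// runs before auth. Names and limits here are illustrative only.
const windowCounts = new Map();
const LIMIT = 3; // requests per window, per key

function ipKeyedLimiter(req) {
  const key = req.ip; // the only identity available pre-auth
  const count = (windowCounts.get(key) || 0) + 1;
  windowCounts.set(key, count);
  return count <= LIMIT; // true = allowed
}

// Two different customers behind the same NAT/proxy IP:
const alice = { ip: "203.0.113.7", apiKey: "alice-key" };
const bob = { ip: "203.0.113.7", apiKey: "bob-key" };

// Alice burns the whole budget...
for (let i = 0; i < 3; i++) ipKeyedLimiter(alice);

// ...and Bob is rejected despite never having sent a request.
console.log(ipKeyedLimiter(bob)); // false
```

Two paying customers, one bucket. That is the fairness problem in miniature.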

That seems minor until you think about what the product probably wants to enforce:

  • quotas per customer
  • different limits per plan
  • fair usage across a fleet of clients behind shared IPs
  • consistent budgets across multiple gateway replicas

The current limiter solves none of those.

This is what makes it such a good lesson. The code is not broken. The architecture is misaligned with the real control point.

Why this happens so often

Rate limiting is one of those features teams add under pressure. You need something fast, you need abuse protection, and the middleware makes it trivial to get started. So you put the limiter near the top of the stack and move on.

But order matters.

Middleware placement decides what identity data exists when enforcement runs. If auth has not happened yet, then you cannot rate-limit by authenticated business identity. You can only rate-limit by what you know at that point: IP, connection, or some raw header value you have not validated.
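Here is a tiny sketch of why order decides what the limiter can see. This is not the repo's code: the chain runner, apiKeyAuth, and the request shape are invented stand-ins.

```javascript
// Toy middleware chain: each middleware receives req and calls next().
function runChain(middlewares, req) {
  let i = 0;
  const next = () => { if (i < middlewares.length) middlewares[i++](req, next); };
  next();
  return req;
}

const apiKeyAuth = (req, next) => {
  // Validates the raw header and attaches a trusted identity.
  req.apiKey = req.headers["x-api-key"];
  next();
};

const seenKeys = [];
const limiter = (req, next) => {
  // Whatever identity exists *now* is all the limiter can key on.
  seenKeys.push(req.apiKey ?? req.ip);
  next();
};

const mkReq = () => ({ ip: "203.0.113.7", headers: { "x-api-key": "acme-123" } });

runChain([limiter, apiKeyAuth], mkReq()); // limiter first: only the IP exists
runChain([apiKeyAuth, limiter], mkReq()); // auth first: the API key exists

console.log(seenKeys); // ["203.0.113.7", "acme-123"]
```

Same middleware, same request, two different policies, and the only thing that changed is the order.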

This repo even documents the gap in README.md, which I love. That is exactly the kind of honest operational documentation mature teams should keep.

The scaling problem is separate and just as important

Even if the limiter were keyed correctly, in-memory counters do not scale across replicas. With multiple gateway instances, your effective quota becomes roughly the sum of the per-instance limits unless you share state.
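The arithmetic is easy to simulate (the replica count, per-instance limit, and round-robin balancer are made up for illustration):

```javascript
// Each replica keeps its own in-memory counter with the same limit.
const PER_INSTANCE_LIMIT = 100;
const REPLICAS = 3;
const counters = new Array(REPLICAS).fill(0);

// A round-robin load balancer spreads one customer's traffic evenly.
let admitted = 0;
for (let i = 0; i < 400; i++) {
  const r = i % REPLICAS;
  if (counters[r] < PER_INSTANCE_LIMIT) {
    counters[r]++;
    admitted++;
  }
}

// Intended quota: 100. Effective quota: limit × replicas.
console.log(admitted); // 300
```

Scale out the gateway and the quota silently scales out with it.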

The repo already includes rate-limit-redis as a dependency, but it is not wired into the running middleware. That is not a failure. It is evidence of an architecture in transition: from local protection to product-grade quota enforcement.

Those transitions are where most systems spend their lives.

What I would change

I would split rate limiting into two layers.

First, I would keep a relatively loose pre-auth limiter keyed by IP to defend against brute-force abuse and credential stuffing patterns.

Second, I would move the stricter quota limiter after apiKeyAuth and key it on validated API keys. In a multi-instance deployment, I would back that limiter with Redis so every replica shares the same counters.

That gets you a cleaner separation of concerns:

  • pre-auth protection for unknown callers
  • post-auth quotas for real customers
  • one identity model for business policy
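The two-layer shape can be sketched in plain Node. A Map stands in for the shared Redis store here, and the limits, key names, and auth check are all illustrative, not the repo's actual values:

```javascript
// Layer 1: loose pre-auth limiter keyed by IP (local state is fine here).
const ipCounts = new Map();
const PRE_AUTH_LIMIT = 1000;
function preAuthLimiter(req) {
  const n = (ipCounts.get(req.ip) || 0) + 1;
  ipCounts.set(req.ip, n);
  return n <= PRE_AUTH_LIMIT;
}

// Auth: validates the raw header and attaches a trusted identity.
const KNOWN_KEYS = new Set(["acme-123"]);
function apiKeyAuth(req) {
  const key = req.headers["x-api-key"];
  if (!KNOWN_KEYS.has(key)) return false;
  req.apiKey = key;
  return true;
}

// Layer 2: strict post-auth quota keyed by the *validated* API key.
// sharedStore stands in for Redis so every replica sees one counter.
const sharedStore = new Map();
const PER_KEY_QUOTA = 5;
function perKeyQuota(req) {
  const n = (sharedStore.get(req.apiKey) || 0) + 1;
  sharedStore.set(req.apiKey, n);
  return n <= PER_KEY_QUOTA;
}

function handle(req) {
  if (!preAuthLimiter(req)) return 429; // unknown caller, too hot
  if (!apiKeyAuth(req)) return 401;     // bad or missing key
  if (!perKeyQuota(req)) return 429;    // customer over quota
  return 200;
}

const req = () => ({ ip: "203.0.113.7", headers: { "x-api-key": "acme-123" } });
const statuses = Array.from({ length: 6 }, () => handle(req()));
console.log(statuses); // [200, 200, 200, 200, 200, 429]
```

In the real gateway, layer 1 stays as the existing express-rate-limit instance, layer 2 becomes a second limiter mounted after apiKeyAuth with a keyGenerator reading the validated key, and sharedStore becomes the already-present rate-limit-redis store.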

Why this is a great blog topic

This is the kind of issue senior engineers recognize immediately because it sits at the boundary between code and operations. The middleware works. The bug is in the model.

A lot of production systems have these hidden mismatches:

  • caching keyed on transport details instead of business intent
  • auth checks that happen after expensive work
  • retries that amplify failure
  • quotas enforced at the wrong trust boundary

They are hard to spot because everything still “works” until scale or abuse teaches you otherwise.

The lesson

Rate limiting is not just an HTTP concern. It is an identity concern and a distributed systems concern.

If you enforce it before you know who the caller is, you get the wrong policy. If you enforce it without shared state, you get the wrong economics. And if you do both, you can still pass staging while failing the product reality of production.

That is why this code is interesting. It captures the exact moment many systems hit: the point where a protective middleware feature has to become a real platform policy.