Retries Are Not Resilience: Adding Per-Region Circuit Breakers to a Node Gateway
A retry loop is one of the first resilience features teams add. It is also one of the first ones they overtrust.
In this gateway, src/services/httpClient.js wraps upstream requests with timeouts and bounded retries. That is good hygiene. But the more interesting decision is in src/services/valhallaService.js, where the code adds a simple circuit breaker per region.
That distinction between retrying and refusing to try matters a lot in systems that depend on partitioned upstreams.
Why retries are not enough
Retries are useful when a failure is transient. They are harmful when a dependency is already degraded and every caller keeps asking it to work harder.
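To make that concrete, here is a minimal sketch of a bounded retry with a per-attempt timeout, in the spirit of the httpClient.js wrapper described above. The function name, retry budget, and timeout values are my assumptions, not the gateway's actual code.

```javascript
// Illustrative only: a bounded retry loop with a per-attempt timeout.
// retries and timeoutMs are hypothetical defaults.
async function fetchWithRetry(doRequest, { retries = 2, timeoutMs = 2000 } = {}) {
  let lastErr;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      // Race the request against a timeout so a hung upstream
      // cannot consume the whole retry budget in one attempt.
      return await Promise.race([
        doRequest(),
        new Promise((_, reject) =>
          setTimeout(() => reject(new Error('timeout')), timeoutMs)),
      ]);
    } catch (err) {
      lastErr = err; // transient failure: try again, up to the budget
    }
  }
  throw lastErr; // bounded: stop asking a degraded dependency to work harder
}
```

The key property is the bound: after the budget is spent, the caller fails instead of piling more load onto a struggling upstream.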
In a multi-region routing service, that risk is even more specific. One Valhalla region can be unhealthy while the others are fine. If your resilience policy only thinks in terms of “the routing backend” as one thing, you either overreact or underreact.
This code takes the right unit of failure: region.
callValhalla() checks breaker state before each request. If a region has crossed a failure threshold, the gateway opens the circuit for that region for a short window and fails fast with a 503. On success, the breaker resets. On repeated failure, it opens again.
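The state machine described above is small enough to sketch in full. This is not the gateway's actual implementation; the threshold, window, and helper names are assumptions chosen to illustrate the open / fail-fast / reset cycle.

```javascript
// Hypothetical per-region breaker state; values are illustrative.
const FAILURE_THRESHOLD = 5;   // consecutive failures before opening
const OPEN_MS = 30_000;        // how long an open circuit rejects requests

const breakers = new Map();    // region -> { failures, openedAt }

function getBreaker(region) {
  if (!breakers.has(region)) breakers.set(region, { failures: 0, openedAt: 0 });
  return breakers.get(region);
}

// Open means: recently tripped and still inside the cool-down window.
// After the window passes, requests are allowed through again.
function isOpen(region, now = Date.now()) {
  const b = getBreaker(region);
  return b.openedAt !== 0 && now - b.openedAt < OPEN_MS;
}

function recordSuccess(region) {
  breakers.set(region, { failures: 0, openedAt: 0 }); // full reset
}

function recordFailure(region, now = Date.now()) {
  const b = getBreaker(region);
  b.failures += 1;
  if (b.failures >= FAILURE_THRESHOLD) b.openedAt = now; // trip the circuit
}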
That is not a complex breaker implementation, but it is exactly the level of complexity many services actually need.
What I like about this design
The best thing about this approach is that it matches the architecture. The upstreams are region-specific, so the failure domain is region-specific too.
That means:
- a bad eu cluster does not poison requests for na
- the gateway can preserve partial service availability
- operators get a cleaner mental model when debugging incidents
I also like that the retry logic and breaker logic are separate concerns. Retries live in the HTTP client. Breaker state lives in the regional routing service. Those are different policies, and the code treats them that way.
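One way to see that separation is to sketch the request path with the two policies injected as dependencies. Everything here is illustrative: the function name, the dependency shapes, and the status codes are assumptions about how such a composition might look, not the gateway's real callValhalla().

```javascript
// Hypothetical composition: retries live inside the injected httpClient,
// breaker policy lives here at the regional boundary.
async function callRegion(region, req, deps) {
  const { isOpen, recordSuccess, recordFailure, httpClient } = deps;

  // Fail fast while the region's circuit is open.
  if (isOpen(region)) {
    return { status: 503, body: { error: `region ${region} unavailable` } };
  }

  try {
    const res = await httpClient.request(req); // client handles its own retries
    recordSuccess(region);                     // healthy response resets the breaker
    return { status: 200, body: res };
  } catch (err) {
    recordFailure(region);                     // breaker accounting, not retry logic
    return { status: 502, body: { error: 'upstream failure' } };
  }
}
```

Because the breaker helpers are passed in rather than baked in, either policy can change without touching the other.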
The limitation is important
The breaker state is process-local. It lives in a Map inside the Node process.
That is an acceptable tradeoff for a lightweight gateway, but it has consequences:
- different replicas may have different views of the same unhealthy region
- recovery can look inconsistent across the fleet
- a newly started instance has no failure memory
This is the kind of tradeoff I like writing about because it shows mature engineering judgment. Not every service needs a distributed breaker. Sometimes process-local state is the right place to start because it is cheap, predictable, and easy to debug. The mistake is not starting simple. The mistake is forgetting what simple stops buying you at scale.
What else resilience needs
A breaker is only part of the story. To operate this well, I would want:
- breaker-open counters and durations by region
- request latency histograms for each upstream
- clear tagging for retry attempts and timeout failures
- dashboards that separate region health from gateway health
The gateway already has Prometheus counters and upstream timing logs. That is a good base. But once you start making resilience decisions automatically, you want the evidence trail to be as clear as the policy.
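For the breaker-open counters specifically, the evidence trail can stay dependency-free. This sketch is not the gateway's Prometheus setup; the metric name and helpers are hypothetical, but the output follows the Prometheus text exposition format that a /metrics endpoint would serve.

```javascript
// Illustrative per-region breaker-open counters; names are assumptions.
const breakerOpens = new Map(); // region -> count of breaker-open events

function recordBreakerOpen(region) {
  breakerOpens.set(region, (breakerOpens.get(region) || 0) + 1);
}

// Render in Prometheus text exposition format, one series per region,
// so dashboards can separate region health from gateway health.
function renderMetrics() {
  const lines = ['# TYPE gateway_breaker_open_total counter'];
  for (const [region, count] of breakerOpens) {
    lines.push(`gateway_breaker_open_total{region="${region}"} ${count}`);
  }
  return lines.join('\n');
}
```

Labeling by region is the point: an aggregate counter would hide exactly the per-region signal the breaker policy acts on.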
What I would change later
If the service reached a point where inconsistent breaker state across replicas became painful, I would evaluate whether the breaker should stay in app code at all. Depending on the rest of the platform, this could move into a shared store, a service mesh, or an edge proxy layer.
But I would not jump there too early. Operational complexity is also a cost. A lot of teams adopt heavyweight resilience infrastructure before they have the traffic, failure modes, or observability maturity to justify it.
The lesson
Resilience is not a box you check by adding retries. It is about deciding when to keep trying and when to stop making things worse.
This gateway gets that right in a very pragmatic way. It retries when a transient failure might recover. It opens the circuit when one region is clearly unhealthy. And it scopes the policy to the real failure boundary instead of the imaginary one. That is not flashy engineering, but it is exactly the kind that keeps production systems upright.