Crossing Oceans Without a Global Router: How We Stitched Routes Across Regional Valhalla Clusters

Routing gets surprisingly hard the moment your data stops living in one place.

In this gateway, each Valhalla cluster is region-scoped. That is a reasonable operational choice: smaller datasets, clearer blast radius, easier regional ownership. But it creates a product problem. What happens when a request spans more than one region? A user does not care that New York and London live behind different clusters. They still expect one answer.

The solution in src/services/crossRegionService.js is a good example of engineering under constraints. Instead of pretending a global router exists, the gateway composes one at request time.

The basic idea

The gateway takes a route-like request and normalizes the coordinates from locations or shape. Then it walks each adjacent pair of points and samples intermediate coordinates using interpolateGreatCircle() from src/utils/coordinates.js. Each sample is reverse-geocoded into a region, and the service groups contiguous points into region-specific segments.
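To make the sample-then-group step concrete, here is a minimal sketch. It is not the code in src/services/crossRegionService.js: this interpolateGreatCircle() is an assumed implementation of the standard great-circle interpolation formula, regionOf() is a hypothetical stand-in for the reverse-geocoding lookup, and the {lat, lon} point shape is chosen only for illustration. It groups contiguous samples by region and does not try to pick the handoff point between neighboring segments.

```javascript
// Sketch under stated assumptions; names and signatures are illustrative.
function toRad(deg) { return (deg * Math.PI) / 180; }
function toDeg(rad) { return (rad * 180) / Math.PI; }

// Return `count` evenly spaced points strictly between a and b along the
// great circle. Antipodal endpoints are not handled (the path is ambiguous).
function interpolateGreatCircle(a, b, count) {
  const [lat1, lon1, lat2, lon2] = [a.lat, a.lon, b.lat, b.lon].map(toRad);
  // Angular distance between the endpoints (haversine form).
  const h = Math.sin((lat2 - lat1) / 2) ** 2 +
    Math.cos(lat1) * Math.cos(lat2) * Math.sin((lon2 - lon1) / 2) ** 2;
  const d = 2 * Math.asin(Math.min(1, Math.sqrt(h)));
  if (d === 0) return [];
  const points = [];
  for (let i = 1; i <= count; i++) {
    const f = i / (count + 1);
    const A = Math.sin((1 - f) * d) / Math.sin(d);
    const B = Math.sin(f * d) / Math.sin(d);
    const x = A * Math.cos(lat1) * Math.cos(lon1) + B * Math.cos(lat2) * Math.cos(lon2);
    const y = A * Math.cos(lat1) * Math.sin(lon1) + B * Math.cos(lat2) * Math.sin(lon2);
    const z = A * Math.sin(lat1) + B * Math.sin(lat2);
    points.push({ lat: toDeg(Math.atan2(z, Math.hypot(x, y))), lon: toDeg(Math.atan2(y, x)) });
  }
  return points;
}

// Sample every adjacent pair of waypoints, then group contiguous points
// that resolve to the same region. regionOf(point) is hypothetical.
function segmentByRegion(waypoints, samplesPerLeg, regionOf) {
  const path = [];
  waypoints.forEach((point, i) => {
    path.push(point);
    if (i < waypoints.length - 1) {
      path.push(...interpolateGreatCircle(point, waypoints[i + 1], samplesPerLeg));
    }
  });
  const segments = [];
  for (const point of path) {
    const region = regionOf(point);
    const last = segments[segments.length - 1];
    if (last && last.region === region) last.points.push(point);
    else segments.push({ region, points: [point] });
  }
  return segments;
}

// Usage with a toy region lookup that splits the Atlantic at 30 degrees west.
const exampleSegments = segmentByRegion(
  [{ lat: 40.7, lon: -74.0 }, { lat: 51.5, lon: -0.1 }],
  8,
  (p) => (p.lon < -30 ? "us" : "eu")
);
```

With a real region lookup, the interesting cases are the ones this toy lookup hides: points over open water, and short detours through a third region that fall between two samples.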

Once those segments exist, the gateway can send each segment to the right Valhalla cluster with callValhalla(), gather the partial responses, and merge them into a single payload.

That merge step is where the implementation stops looking like a proxy and starts looking like a routing engine facade.

The part that is easy to underestimate

It is straightforward to say, “send each segment to the right region.” It is much harder to make the output feel like one route.

The merge logic in mergeLegsAndShapes() has to:

- decode and re-encode polylines
- remove duplicated boundary points
- concatenate legs in the right order
- merge locations without duplicating transition points
- sum route summary fields like time and length
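A rough sketch of the shape and summary parts of that list follows. It is not the actual mergeLegsAndShapes(): mergeTrips() and its combined top-level shape field are hypothetical, the inputs are assumed to look like the trip object of a Valhalla route response (legs with an encoded shape, plus a summary with time and length), and location merging is left out. The polyline codec is the standard encoded-polyline algorithm at six decimal digits, the precision Valhalla documents for route shapes.

```javascript
// Decode an encoded polyline into [lat, lon] pairs.
function decodePolyline(str, precision = 6) {
  const factor = 10 ** precision;
  const coords = [];
  let index = 0, lat = 0, lon = 0;
  while (index < str.length) {
    for (const axis of ["lat", "lon"]) {
      let result = 0, shift = 0, byte;
      do {
        byte = str.charCodeAt(index++) - 63;
        result |= (byte & 0x1f) << shift;
        shift += 5;
      } while (byte >= 0x20);
      // Undo the zigzag encoding of the signed delta.
      const delta = result & 1 ? ~(result >> 1) : result >> 1;
      if (axis === "lat") lat += delta; else lon += delta;
    }
    coords.push([lat / factor, lon / factor]);
  }
  return coords;
}

// Encode [lat, lon] pairs as deltas between rounded integer coordinates.
function encodePolyline(coords, precision = 6) {
  const factor = 10 ** precision;
  let out = "", prevLat = 0, prevLon = 0;
  for (const [lat, lon] of coords) {
    const iLat = Math.round(lat * factor), iLon = Math.round(lon * factor);
    for (const delta of [iLat - prevLat, iLon - prevLon]) {
      let v = delta < 0 ? ~(delta << 1) : delta << 1;
      while (v >= 0x20) {
        out += String.fromCharCode((0x20 | (v & 0x1f)) + 63);
        v >>= 5;
      }
      out += String.fromCharCode(v + 63);
    }
    prevLat = iLat; prevLon = iLon;
  }
  return out;
}

// Hypothetical merge: concatenate legs in order, drop a leading shape point
// that repeats the previous leg's last point, and sum the summary fields.
function mergeTrips(trips) {
  const shape = [];
  const legs = [];
  const summary = { time: 0, length: 0 };
  for (const trip of trips) {
    for (const leg of trip.legs) {
      const coords = decodePolyline(leg.shape);
      const prev = shape[shape.length - 1];
      const repeatsBoundary = prev !== undefined && coords.length > 0 &&
        prev[0] === coords[0][0] && prev[1] === coords[0][1];
      shape.push(...(repeatsBoundary ? coords.slice(1) : coords));
      legs.push(leg);
    }
    summary.time += trip.summary.time;
    summary.length += trip.summary.length;
  }
  return { legs, summary, shape: encodePolyline(shape) };
}
```

Comparing decoded points for exact equality only works when both sides of the boundary were routed to the identical coordinate. If the two clusters snap the handoff point to slightly different positions, a tolerance-based comparison would be needed instead.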

If any of those details are wrong, the response may still be structurally valid while being semantically broken. That is the kind of bug that survives smoke tests and shows up later as navigation weirdness, bad ETA math, or geometry glitches in clients.

Why this approach is smart

I like this design because it accepts the actual operating constraint instead of fighting it. The backend is geo-sharded. The gateway does not try to erase that internally. It uses a composition strategy that is explicit, understandable, and adaptable.

It also limits where complexity lives. The route handlers do not know how region boundaries work. The upstream service does not know how responses are stitched. The cross-region logic stays in one place, which is exactly where a future engineer would go looking for it.

The tradeoffs are real

This is not a perfect solution, and that is what makes it worth writing about.

First, region boundary detection is approximate. The code samples points along a path; it does not compute exact polygon intersections or use a full spatial indexing system. That keeps the implementation simple, but it means a boundary can only be located to within the gap between two adjacent samples, and a region crossed entirely inside that gap is never seen. Accuracy therefore depends on SAMPLE_BOUNDARY_POINTS and on how well the sampled points capture the transition.

Second, the segment requests are sequential. That keeps ordering trivial during merge, but total latency becomes the sum of the per-segment calls rather than the cost of one clean single-region request.
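The latency tradeoff can be sketched as follows. Both functions are hypothetical, and callValhalla is passed in with an assumed (region, points) signature because the real one is not shown here. The sequential version mirrors the behavior described above. The parallel version is only a thought experiment: Promise.all returns results in input order, so merge ordering would survive, but it is valid only if no segment needs the result of an earlier one.

```javascript
// Sequential dispatch: each request starts after the previous one resolves,
// so total latency is roughly the sum of the individual calls.
async function routeSegmentsSequentially(segments, callValhalla) {
  const responses = [];
  for (const segment of segments) {
    responses.push(await callValhalla(segment.region, segment.points));
  }
  return responses;
}

// Parallel dispatch: all requests start immediately. Promise.all preserves
// input order in its result array and rejects as soon as any call rejects.
async function routeSegmentsInParallel(segments, callValhalla) {
  return Promise.all(
    segments.map((segment) => callValhalla(segment.region, segment.points))
  );
}
```

Parallel dispatch also changes failure behavior: the remaining upstream calls are already in flight when the first one fails, which matters for load on the regional clusters.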

Third, the code explicitly throws on REGION_UNKNOWN for unsupported or oceanic segments when fallback is not possible. That is a good failure mode, but it highlights the product truth: a stitched route is only as good as the regional coverage and boundary inference behind it.
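One way to picture that failure mode is the sketch below. The error class, the resolveRegions() helper, and the fallback hook are all hypothetical; the only detail taken from the service is that an unresolvable segment surfaces as REGION_UNKNOWN rather than being silently routed somewhere.

```javascript
// Hypothetical error type carrying the REGION_UNKNOWN code.
class RegionUnknownError extends Error {
  constructor(segmentIndex) {
    super(`No region could be resolved for segment ${segmentIndex}`);
    this.name = "RegionUnknownError";
    this.code = "REGION_UNKNOWN";
    this.segmentIndex = segmentIndex;
  }
}

// Assumes unknown segments carry region === null. `fallback` is an assumed
// hook that returns a region name for such a segment, or null to give up.
function resolveRegions(segments, fallback = () => null) {
  return segments.map((segment, index) => {
    if (segment.region != null) return segment;
    const region = fallback(segment, index, segments);
    if (region == null) throw new RegionUnknownError(index);
    return { ...segment, region };
  });
}
```

Failing the whole request keeps a partial route from looking like a complete one, which is the property the paragraph above is praising.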

What I would improve next

If I were taking this from good to great, I would look at three upgrades.

The first is better observability around segmentation itself: number of segments, region transition count, unknown-region frequency, and merge failures. Those are the metrics that would show whether the algorithm is quietly succeeding or quietly getting lucky.
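As a small illustration, the per-request numbers could be derived directly from the segment list before dispatch. The function name, the field names, and the convention that unknown segments carry region === null are all assumptions made for this sketch; how the values reach a metrics backend is left open.

```javascript
// Hypothetical per-request segmentation stats. Assumes contiguous grouping,
// so every adjacent pair of segments is one region transition.
function segmentationStats(segments) {
  const regions = new Set();
  let unknownSegments = 0;
  let sampledPoints = 0;
  for (const segment of segments) {
    if (segment.region == null) unknownSegments += 1;
    else regions.add(segment.region);
    sampledPoints += segment.points.length;
  }
  return {
    segmentCount: segments.length,
    regionTransitions: Math.max(0, segments.length - 1),
    distinctRegions: regions.size,
    unknownSegments,
    sampledPoints,
  };
}
```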

The second is a more precise boundary strategy. Even if I did not jump all the way to polygon math, I would consider adaptive sampling so long segments or ambiguous boundaries get more resolution automatically.
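One form adaptive sampling could take is bisection: once two neighboring samples resolve to different regions, keep halving the gap between them until the transition is pinned down to a chosen tolerance. The sketch below is an illustration of that idea, not something the service does today. refineBoundary(), regionOf(), the {lat, lon} point shape, and the default tolerance are all hypothetical, and it assumes exactly one transition lies between the two inputs.

```javascript
const EARTH_RADIUS_KM = 6371;
function toRad(deg) { return (deg * Math.PI) / 180; }
function toDeg(rad) { return (rad * 180) / Math.PI; }

// Great-circle distance in kilometers (haversine formula).
function haversineKm(a, b) {
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h = Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1, Math.sqrt(h)));
}

function toUnitVector(p) {
  const lat = toRad(p.lat), lon = toRad(p.lon);
  return [Math.cos(lat) * Math.cos(lon), Math.cos(lat) * Math.sin(lon), Math.sin(lat)];
}

// Great-circle midpoint: add the two unit vectors and read the angles back.
// Not defined for antipodal points, where the sum is the zero vector.
function midpoint(a, b) {
  const va = toUnitVector(a), vb = toUnitVector(b);
  const [x, y, z] = [va[0] + vb[0], va[1] + vb[1], va[2] + vb[2]];
  return { lat: toDeg(Math.atan2(z, Math.hypot(x, y))), lon: toDeg(Math.atan2(y, x)) };
}

// Precondition: regionOf(a) !== regionOf(b). Narrows the pair until the two
// points are within toleranceKm of each other or maxSteps is reached.
function refineBoundary(a, b, regionOf, toleranceKm = 1, maxSteps = 40) {
  if (!(toleranceKm > 0)) throw new RangeError("toleranceKm must be positive");
  const regionA = regionOf(a);
  let inside = a, outside = b;
  for (let step = 0; step < maxSteps && haversineKm(inside, outside) > toleranceKm; step++) {
    const mid = midpoint(inside, outside);
    if (regionOf(mid) === regionA) inside = mid;
    else outside = mid;
  }
  return { inside, outside };
}
```

Each step costs one extra region lookup and halves the uncertainty, so the extra resolution is spent only where a transition was actually detected.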

The third is focused tests. This service really wants golden tests for geometry stitching, summary aggregation, and known border-crossing scenarios. Cross-region logic is one of those places where a failure is obvious when the output is badly wrong and much harder to spot when it is only slightly off.
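A golden test for stitched output mostly needs a comparison that tolerates floating-point noise but still catches a dropped or duplicated boundary point. The helper below is a sketch of that comparison only. Its name, the result shape (a decoded shape array of [lat, lon] pairs plus a summary with time and length), and the default tolerances are assumptions, not part of the existing test suite.

```javascript
// Compare a merged result with a stored golden fixture and return a list of
// human-readable problems. An empty list means the result matches.
function diffAgainstGolden(actual, golden, coordTolerance = 1e-6, summaryTolerance = 1e-3) {
  const problems = [];
  if (actual.shape.length !== golden.shape.length) {
    // A length mismatch usually means a boundary point was dropped or doubled.
    problems.push(`shape has ${actual.shape.length} points, expected ${golden.shape.length}`);
  } else {
    golden.shape.forEach(([lat, lon], i) => {
      const [aLat, aLon] = actual.shape[i];
      if (Math.abs(aLat - lat) > coordTolerance || Math.abs(aLon - lon) > coordTolerance) {
        problems.push(`shape[${i}] is ${aLat},${aLon}, expected ${lat},${lon}`);
      }
    });
  }
  for (const field of ["time", "length"]) {
    const delta = Math.abs(actual.summary[field] - golden.summary[field]);
    if (!(delta <= summaryTolerance)) {
      problems.push(`summary.${field} is ${actual.summary[field]}, expected ${golden.summary[field]}`);
    }
  }
  return problems;
}
```

Returning every problem instead of stopping at the first makes a failing fixture easier to read, since stitching bugs tend to show up in geometry and summary at the same time.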

The lesson

There is a recurring pattern in production systems: sometimes you cannot buy or build the globally correct primitive, so you compose a good-enough one at the edge of your system.

That is what this gateway does. It turns multiple region-limited routing engines into a single product experience, not by hiding the complexity from itself, but by owning the composition deliberately. That is the difference between a brittle workaround and an architecture decision.