<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-04-10T18:19:41+00:00</updated><id>/feed.xml</id><title type="html">Trekpoint</title><subtitle>Notes from the Trekpoint engineering team: architecture, infrastructure, APIs, and the tradeoffs behind the product.</subtitle><author><name>Bongani Mbigi</name></author><entry><title type="html">The Production Scheduler Footgun: Celery Beat in Config, Missing in Reality</title><link href="/engineering/2026/03/23/the-celery-beat-footgun.html" rel="alternate" type="text/html" title="The Production Scheduler Footgun: Celery Beat in Config, Missing in Reality" /><published>2026-03-23T10:00:00+00:00</published><updated>2026-03-23T10:00:00+00:00</updated><id>/engineering/2026/03/23/the-celery-beat-footgun</id><content type="html" xml:base="/engineering/2026/03/23/the-celery-beat-footgun.html"><![CDATA[<p>Some production problems are subtle scaling pathologies.</p>

<p>Others are simpler and more embarrassing:</p>

<p>the code says scheduled jobs exist, and the deployment does not actually run the scheduler.</p>

<p>Trek Point carries the shape of that lesson. The application defines periodic Celery work for billing-related housekeeping, but the deployment configuration we ship shows worker processes prominently and no obvious Beat process.</p>

<p>This is exactly the kind of issue that deserves to be written down because it teaches something bigger than Celery.</p>

<h2 id="why-this-kind-of-bug-is-dangerous">Why This Kind of Bug Is Dangerous</h2>

<p>It hides in plain sight.</p>

<p>The codebase can look perfectly healthy:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">beat_schedule</code> is defined</li>
  <li>task names are valid</li>
  <li>local assumptions feel fine</li>
  <li>everyone remembers that those jobs “exist”</li>
</ul>

<p>But the actual question is not whether the schedule exists in Python. The question is whether any production process is alive to interpret that schedule.</p>

<p>That is the difference between configured and operationally real.</p>
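<p>The gap is easy to make concrete. Here is a minimal sketch (the task names are invented for illustration) of a schedule that looks perfectly healthy in code while remaining inert unless a separate Beat process is alive:</p>

```python
# Hypothetical beat_schedule sketch: this dict declares intent only.
# Nothing runs these tasks unless a `celery beat` process interprets it.
from datetime import timedelta

beat_schedule = {
    "expire-old-coupons": {
        "task": "billing.tasks.expire_old_coupons",   # invented task name
        "schedule": timedelta(hours=24),
    },
    "flag-expiring-cards": {
        "task": "billing.tasks.flag_expiring_cards",  # invented task name
        "schedule": timedelta(hours=24),
    },
}

# The schedule is just data. Some process must run, e.g.:
#   celery -A app.celery beat --loglevel=info
# alongside the workers, or none of the above ever fires.
```

<p>Nothing in that file fails, lints badly, or warns. It simply waits for a process that may never exist.</p>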

<h2 id="why-teams-miss-this">Why Teams Miss This</h2>

<p>Because modern apps spread operational truth across too many places:</p>

<ul>
  <li>app config</li>
  <li>Docker compose</li>
  <li>deployment manifests</li>
  <li>process managers</li>
  <li>cloud schedulers</li>
  <li>tribal knowledge</li>
</ul>

<p>Any mismatch between those layers can survive for a long time if the tasks are not obviously user-facing every day.</p>

<p>That is especially true for maintenance work like:</p>

<ul>
  <li>expiring old coupons</li>
  <li>marking soon-to-expire credit cards</li>
  <li>cleaning up stale rows</li>
  <li>synchronizing background state</li>
</ul>

<p>These jobs often fail quietly until a support issue exposes them.</p>

<h2 id="this-is-not-really-about-celery-beat">This Is Not Really About Celery Beat</h2>

<p>The deeper lesson is that infrastructure declarations are only as real as the process topology that backs them.</p>

<p>A lot of engineering teams have a weak spot here. We review code thoroughly, but we often review deploy semantics informally. That creates a blind spot where product assumptions live in code and operational assumptions live in memory.</p>

<p>Once you have multiple process types, you need to be explicit about:</p>

<ul>
  <li>which processes must exist in every environment</li>
  <li>which ones are optional locally</li>
  <li>which health checks or alerts prove they are alive</li>
  <li>which business workflows depend on them</li>
</ul>

<p>If those questions are not answered concretely, the scheduler is just a nice idea in <code class="language-plaintext highlighter-rouge">settings.py</code>.</p>
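<p>The "health checks or alerts prove they are alive" point can be sketched with a last-successful-run heartbeat. This is a toy in-memory version with invented job names; a real deployment would record the timestamp in Redis or a metrics backend and alert when it goes stale:</p>

```python
import time

# Hypothetical heartbeat pattern: each scheduled task records a
# last-successful-run timestamp; a separate check alerts when it is stale.
_heartbeats = {}  # in-memory stand-in for Redis/metrics in this sketch

def record_heartbeat(job_name):
    _heartbeats[job_name] = time.time()

def is_stale(job_name, max_age_seconds):
    last = _heartbeats.get(job_name)
    return last is None or (time.time() - last) > max_age_seconds

record_heartbeat("expire-old-coupons")
assert not is_stale("expire-old-coupons", max_age_seconds=86400)
# A job that has never run is stale by definition:
assert is_stale("flag-expiring-cards", max_age_seconds=86400)
```

<p>The useful property is that "never ran at all" and "stopped running" produce the same alert, which is exactly the failure mode a missing Beat process creates.</p>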

<h2 id="the-product-cost-of-quietly-missing-scheduled-work">The Product Cost of Quietly Missing Scheduled Work</h2>

<p>The reason this matters is not architectural neatness. It is product behavior.</p>

<p>Scheduled billing and maintenance jobs shape user trust even when users never see the jobs directly.</p>

<p>If a cleanup task does not run, the symptoms show up elsewhere:</p>

<ul>
  <li>stale billing state</li>
  <li>expired incentives lingering too long</li>
  <li>cards not being flagged when expected</li>
  <li>support investigations that take longer because everyone assumes the automation fired</li>
</ul>

<p>Those are indirect failures, which is why they are so easy to underestimate.</p>

<h2 id="how-id-guard-against-this">How I’d Guard Against This</h2>

<p>If I were hardening this part of Trek Point, I would want at least one of the following to be true:</p>

<ul>
  <li>Beat is a first-class deploy target with clear ownership</li>
  <li>periodic jobs move to an external scheduler with explicit invocation</li>
  <li>alerts exist for scheduler liveness and last-successful-run timestamps</li>
  <li>operational docs state exactly how periodic jobs run in each environment</li>
</ul>

<p>The main goal is not “use Celery Beat correctly.” The goal is to make background business time visible.</p>

<p>That phrase sounds abstract, but it is real. Schedulers are how products express time-dependent intent:</p>

<ul>
  <li>daily</li>
  <li>hourly</li>
  <li>after expiry</li>
  <li>before renewal</li>
</ul>

<p>If no process is actually keeping that time, your business logic has a hole in it.</p>

<h2 id="why-this-makes-a-great-engineering-story">Why This Makes a Great Engineering Story</h2>

<p>I like this kind of example because it is concrete, humble, and widely relatable.</p>

<p>Every experienced team has something like this in its history:</p>

<ul>
  <li>a cron defined but never deployed</li>
  <li>a worker running without the scheduler</li>
  <li>a scheduled task surviving in one environment but not another</li>
  <li>a maintenance process everyone assumes someone else owns</li>
</ul>

<p>These stories are useful because they cut through the fiction that “configured” means “running.”</p>

<h2 id="the-lesson-id-keep">The Lesson I’d Keep</h2>

<p>Application code can declare intent. Production systems still need a living process to honor it.</p>

<p>That is the whole lesson.</p>

<p>If I could turn that into one engineering reflex, it would be this:</p>

<p>whenever a codebase declares periodic work, immediately ask: “show me the process that runs it in production.”</p>

<p>If nobody can answer quickly, you have probably found a more important issue than the code review comments in front of you.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[Some production problems are subtle scaling pathologies.]]></summary></entry><entry><title type="html">Why We Chose Kamal Over Kubernetes for a Multi-Region Routing Gateway</title><link href="/engineering/2026/03/22/why-kamal-over-kubernetes.html" rel="alternate" type="text/html" title="Why We Chose Kamal Over Kubernetes for a Multi-Region Routing Gateway" /><published>2026-03-22T10:00:00+00:00</published><updated>2026-03-22T10:00:00+00:00</updated><id>/engineering/2026/03/22/why-kamal-over-kubernetes</id><content type="html" xml:base="/engineering/2026/03/22/why-kamal-over-kubernetes.html"><![CDATA[<p>Not every production service needs a platform story big enough to impress conference slides.</p>

<p>Sometimes the most senior infrastructure decision is choosing less.</p>

<p>This gateway is a good example. It is a stateless Node service with optional Redis, clear environment-driven configuration, and a Docker-based deployment flow. The deploy config in <code class="language-plaintext highlighter-rouge">config/deploy.yml</code> uses Kamal instead of reaching immediately for Kubernetes, and I think that is exactly the kind of pragmatic decision more teams should write about.</p>

<h2 id="start-with-the-workload-not-the-trend">Start with the workload, not the trend</h2>

<p>The service does not manage durable business state. It does not run a complicated job system. It does not require service discovery across dozens of internal components. It listens on one port, calls a few upstreams, and scales horizontally in a mostly boring way.</p>

<p>That is not an insult. It is a gift.</p>

<p>Workloads like this are perfect candidates for simple deployment models because the application already has the right shape:</p>

<ul>
  <li>stateless request handling</li>
  <li>config via environment variables</li>
  <li>easy containerization</li>
  <li>no tight coupling to node-local disk</li>
</ul>

<p>The Dockerfile reflects that simplicity. The image is straightforward, startup is explicit, and the deployment config focuses on the things that matter: image registry, target host, SSL proxying, secrets, and runtime env.</p>
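<p>For flavor, here is roughly the shape such a config takes. This is a hedged sketch rather than our actual file: the service name, host, domain, registry, and variable names are all invented:</p>

```yaml
# Hypothetical Kamal deploy.yml sketch; every name and host is invented.
service: routing-gateway
image: example-org/routing-gateway

servers:
  web:
    - 203.0.113.10          # single explicit deploy target

proxy:
  ssl: true                 # SSL termination at the proxy
  host: gateway.example.com

registry:
  username: deployer
  password:
    - KAMAL_REGISTRY_PASSWORD   # pulled from the environment, not committed

env:
  secret:
    - REDIS_URL             # injected at deploy time
  clear:
    PORT: 3000
```

<p>The whole deployment story fits on one screen, which is itself an argument for this class of workload.</p>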

<h2 id="why-a-smaller-platform-can-be-the-better-platform">Why a smaller platform can be the better platform</h2>

<p>Kubernetes solves real problems. It also introduces a lot of surface area: manifests, controllers, ingress decisions, secret management conventions, observability plumbing, rollout policy, and operational overhead that teams often underestimate.</p>

<p>For a service at this stage, Kamal offers a simpler path:</p>

<ul>
  <li>container-based deploys</li>
  <li>explicit host targeting</li>
  <li>manageable secret injection</li>
  <li>enough structure for repeatability without a full control plane</li>
</ul>

<p>That is often the right tradeoff when the app itself is still evolving. You get a production deployment workflow without committing to infrastructure complexity you may not need yet.</p>

<h2 id="the-design-still-leaves-room-to-grow">The design still leaves room to grow</h2>

<p>What I like about this repo is that the service is not painted into a corner. The app is already containerized. Redis is optional and can be externalized. Config is environment driven. Health and metrics endpoints exist. Those are all good portability decisions regardless of the scheduler.</p>

<p>In other words, choosing Kamal here does not mean choosing against future scale. It means deferring platform complexity until there is evidence you need it.</p>

<p>That distinction matters. A lot of teams frame these decisions as ambition versus simplicity. The better framing is timing versus cost.</p>

<h2 id="what-you-give-up">What you give up</h2>

<p>Of course there are tradeoffs.</p>

<p>A lighter deploy model may give you less built-in support for:</p>

<ul>
  <li>sophisticated autoscaling policies</li>
  <li>standardized multi-node orchestration patterns</li>
  <li>first-class service mesh integrations</li>
  <li>broad internal platform consistency if the rest of the org is already on Kubernetes</li>
</ul>

<p>Those are real considerations. But they are only benefits if your team will actually use them and support them well.</p>

<h2 id="what-i-would-watch-for">What I would watch for</h2>

<p>The triggers that would make me revisit the platform choice are pretty clear:</p>

<ul>
  <li>traffic growth that makes rollout and capacity management significantly harder</li>
  <li>more internal dependencies and sidecars around the gateway</li>
  <li>stronger requirements for multi-region active-active operations</li>
  <li>an organizational shift toward standardized platform tooling</li>
</ul>

<p>Until then, the simpler path is often the more responsible one.</p>

<h2 id="the-lesson">The lesson</h2>

<p>Infrastructure decisions should reflect the shape of the application and the maturity of the team operating it.</p>

<p>For a stateless routing gateway, choosing Kamal over Kubernetes is not “less serious.” It is a bet that operational focus matters more than platform fashion. In my experience, that is often the senior move: build the application so it can grow, but keep the deployment story as small as reality allows.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[Not every production service needs a platform story big enough to impress conference slides.]]></summary></entry><entry><title type="html">The Stripe Webhook Decision I’d Revisit: Re-Fetching Events by ID Instead of Verifying Signatures</title><link href="/engineering/2026/03/21/the-stripe-webhook-decision-id-revisit.html" rel="alternate" type="text/html" title="The Stripe Webhook Decision I’d Revisit: Re-Fetching Events by ID Instead of Verifying Signatures" /><published>2026-03-21T10:00:00+00:00</published><updated>2026-03-21T10:00:00+00:00</updated><id>/engineering/2026/03/21/the-stripe-webhook-decision-id-revisit</id><content type="html" xml:base="/engineering/2026/03/21/the-stripe-webhook-decision-id-revisit.html"><![CDATA[<p>I like writing about decisions that were reasonable, shipped, and still worth revisiting.</p>

<p>Our Stripe webhook path in Trek Point is one of those.</p>

<p>The implementation takes a pragmatic approach:</p>

<ul>
  <li>accept a JSON payload</li>
  <li>require an event id</li>
  <li>re-fetch the event from Stripe using our secret key</li>
  <li>process the safe copy we retrieved directly from Stripe</li>
</ul>

<p>That is not a crazy design. In fact, it has some real advantages. But if I were tightening the system now, this is one of the first places I would look again.</p>

<h2 id="why-we-did-it-this-way">Why We Did It This Way</h2>

<p>The original intuition was straightforward:</p>

<p>If an inbound webhook says it is event <code class="language-plaintext highlighter-rouge">evt_123</code>, do not trust the payload body. Ask Stripe for <code class="language-plaintext highlighter-rouge">evt_123</code> directly and process that canonical version instead.</p>

<p>The appeal is obvious:</p>

<ul>
  <li>the event data comes from Stripe over authenticated API access</li>
  <li>we do not depend on the request body contents except for the id</li>
  <li>we avoid processing obviously forged payload bodies</li>
</ul>

<p>For a product team moving quickly, that feels like a clean trust model.</p>
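<p>A minimal sketch of that flow, with the Stripe client faked so the example is self-contained. Real code would call <code class="language-plaintext highlighter-rouge">stripe.Event.retrieve(event_id)</code> with the account’s secret key; the handler and dispatch names here are invented:</p>

```python
# Hedged sketch of the re-fetch-by-id trust model; fetch_event stands in
# for a real call to stripe.Event.retrieve with our secret key.

def handle_webhook(payload, fetch_event):
    """Trust only the event id from the payload; process the fetched copy."""
    event_id = payload.get("id", "")
    if not event_id.startswith("evt_"):
        return 400, None
    # The canonical copy comes from Stripe's API, not the request body.
    event = fetch_event(event_id)
    if event is None:
        return 404, None
    return 200, event

# Forged body fields are irrelevant; only the id is used for the lookup.
fake_store = {"evt_123": {"id": "evt_123", "type": "invoice.paid"}}
status, event = handle_webhook(
    {"id": "evt_123", "type": "totally.forged"}, fake_store.get
)
assert status == 200 and event["type"] == "invoice.paid"
```

<p>Note what this buys and what it does not: the processed data is authentic, but nothing here proves Stripe sent the request.</p>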

<h2 id="what-this-approach-gets-right">What This Approach Gets Right</h2>

<p>I still think there are legitimate strengths here.</p>

<h2 id="1-it-avoids-blind-trust-in-the-inbound-payload">1. It avoids blind trust in the inbound payload</h2>

<p>That is better than naively consuming whatever object arrived over HTTP.</p>

<h2 id="2-it-centralizes-on-stripes-current-view-of-the-event">2. It centralizes on Stripe’s current view of the event</h2>

<p>For some operational flows, that can be simpler than reasoning about every payload edge case locally.</p>

<h2 id="3-it-fit-our-existing-stripe-integration-model">3. It fit our existing Stripe integration model</h2>

<p>The app already talked to Stripe directly for billing operations, so the mental model was consistent.</p>

<p>Those are real advantages. This was not security negligence. It was a pragmatic trust strategy.</p>

<h2 id="why-id-still-revisit-it">Why I’d Still Revisit It</h2>

<p>The biggest reason is that “re-fetch by id” and “verify webhook authenticity” are not the same thing.</p>

<p>Webhook signature verification answers:</p>

<p>“Did Stripe send this exact request to us?”</p>

<p>Re-fetching by id answers:</p>

<p>“Does this event id exist in Stripe, and can we retrieve it with our account credentials?”</p>

<p>Those are related, but not identical, security properties.</p>

<p>The second concern is operational coupling. Our webhook handler now depends on live Stripe API retrieval during request handling. If Stripe’s API is degraded or our outbound access is impaired, the webhook path becomes more fragile than it needs to be.</p>

<p>That is not hypothetical. Production systems spend a lot of time in partial failure modes.</p>

<h2 id="the-more-questionable-choice-returning-200-on-generic-exceptions">The More Questionable Choice: Returning <code class="language-plaintext highlighter-rouge">200</code> on Generic Exceptions</h2>

<p>The design choice I am less comfortable with today is returning success on broad exceptions specifically to stop Stripe from retrying.</p>

<p>I understand why that happened. Runaway retries can amplify bad incidents, especially if the handler is crashing on a condition retries will not fix.</p>

<p>But the downside is serious:</p>

<ul>
  <li>you may acknowledge work you did not actually complete</li>
  <li>retries stop even if the failure was transient</li>
  <li>recovery becomes a manual reconciliation problem</li>
</ul>

<p>That is the kind of tradeoff teams make under production pressure, but it is also exactly the kind of behavior that deserves a second pass once the system matures.</p>
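<p>The difference between the two acknowledgement policies is easy to show side by side. This sketch is illustrative only; the function names and the list-based dead letter are invented stand-ins:</p>

```python
# Hedged sketch contrasting two webhook acknowledgement policies.

def handle_ack_on_error(process, event):
    """Current behavior: always return 200 so Stripe stops retrying."""
    try:
        process(event)
    except Exception:
        pass  # acknowledged but not completed: a manual reconciliation problem
    return 200

def handle_with_dead_letter(process, event, dead_letter):
    """Alternative: preserve failed work so retries or operators can recover."""
    try:
        process(event)
        return 200
    except Exception:
        dead_letter.append(event)  # keep the event for deliberate recovery
        return 500  # non-2xx, so Stripe will retry

def boom(event):
    raise RuntimeError("transient failure")

dead = []
assert handle_ack_on_error(boom, {"id": "evt_1"}) == 200   # work silently lost
assert handle_with_dead_letter(boom, {"id": "evt_1"}, dead) == 500
assert dead == [{"id": "evt_1"}]
```

<p>Both stop runaway retry storms eventually; only one of them remembers what it failed to do.</p>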

<h2 id="what-id-do-differently-now">What I’d Do Differently Now</h2>

<p>I would likely move toward:</p>

<ul>
  <li>verifying Stripe webhook signatures on ingress</li>
  <li>treating idempotency and replay handling as first-class concerns</li>
  <li>retrying or dead-lettering failures more deliberately instead of broadly suppressing them</li>
  <li>reserving <code class="language-plaintext highlighter-rouge">200</code>-on-error behavior for cases we can prove are non-recoverable and safely ignorable</li>
</ul>

<p>I might still keep the ability to re-fetch event details when useful, but I would not want that to be the primary authenticity model.</p>
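<p>For reference, the official Stripe SDKs implement ingress verification via <code class="language-plaintext highlighter-rouge">stripe.Webhook.construct_event</code>. The sketch below reimplements the core idea (HMAC-SHA256 over a timestamped payload, with a replay-tolerance window) so it runs standalone; the secret is invented, and real code should use the SDK helper rather than this:</p>

```python
import hashlib
import hmac
import time

# Hedged sketch of Stripe-style signature verification. The SDK's
# stripe.Webhook.construct_event implements this scheme plus header
# parsing and multi-signature handling; prefer it in production.

def sign(payload: bytes, secret: str, timestamp: int) -> str:
    signed = f"{timestamp}.".encode() + payload
    return hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()

def verify(payload: bytes, header_sig: str, timestamp: int,
           secret: str, tolerance: int = 300) -> bool:
    if abs(time.time() - timestamp) > tolerance:
        return False  # reject signatures outside the replay window
    expected = sign(payload, secret, timestamp)
    return hmac.compare_digest(expected, header_sig)

secret = "whsec_example"  # invented secret for the sketch
body = b'{"id": "evt_123"}'
ts = int(time.time())
assert verify(body, sign(body, secret, ts), ts, secret)
assert not verify(b'{"id": "evt_tampered"}', sign(body, secret, ts), ts, secret)
```

<p>This answers the question re-fetching cannot: did Stripe send this exact request to us, now?</p>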

<h2 id="why-this-is-a-good-example-of-real-engineering-tradeoffs">Why This Is a Good Example of Real Engineering Tradeoffs</h2>

<p>I like this case because it is not a “we were wrong and now we are wise” story.</p>

<p>It is a story about shipping under realistic constraints:</p>

<ul>
  <li>we needed a workable webhook trust model</li>
  <li>we wanted to reduce attack surface from untrusted payload bodies</li>
  <li>we did not want billing incidents to spiral under repeated retries</li>
</ul>

<p>Those are all reasonable concerns. The code reflects a team trying to make the system sturdy with the tools and time it had.</p>

<p>That is why these are the most useful kinds of postmortem-adjacent decisions to write about. The original implementation makes sense. It just no longer feels like the final version I would want.</p>

<h2 id="the-broader-lesson">The Broader Lesson</h2>

<p>Security and reliability decisions are often made in combination, not isolation.</p>

<p>Our Stripe webhook path is a good example:</p>

<ul>
  <li>trust model</li>
  <li>failure handling</li>
  <li>retry behavior</li>
  <li>billing consistency</li>
</ul>

<p>all show up in one small endpoint.</p>

<p>That is why I think webhook handling is a great place to study a team’s maturity. It forces you to reveal what you value most when the design is imperfect: simplicity, authenticity, resilience, or operational containment.</p>

<p>In Trek Point’s case, we picked a pragmatic path that shipped. I am glad we did. I also think it is one of the clearest places where “good enough to launch” and “what I want in the long term” are not the same answer.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[I like writing about decisions that were reasonable, shipped, and still worth revisiting.]]></summary></entry><entry><title type="html">Redis Quietly Became Our Tiny Control Plane</title><link href="/engineering/2026/03/20/redis-became-our-control-plane.html" rel="alternate" type="text/html" title="Redis Quietly Became Our Tiny Control Plane" /><published>2026-03-20T10:00:00+00:00</published><updated>2026-03-20T10:00:00+00:00</updated><id>/engineering/2026/03/20/redis-became-our-control-plane</id><content type="html" xml:base="/engineering/2026/03/20/redis-became-our-control-plane.html"><![CDATA[<p>Redis starts innocently in most web apps.</p>

<p>You add it for one thing:</p>

<ul>
  <li>a task queue broker</li>
  <li>a cache</li>
  <li>maybe rate limiting</li>
</ul>

<p>Then enough practical needs pile up and suddenly Redis is not just an infrastructure dependency. It is the place where your product stores operational intent.</p>

<p>That happened in Trek Point.</p>

<p>Redis ended up backing:</p>

<ul>
  <li>Celery broker and result backend</li>
  <li>rate limiting</li>
  <li>feature flags</li>
  <li>runtime product settings</li>
</ul>

<p>At that point it is fair to say Redis became a small control plane for the application.</p>

<h2 id="why-this-happened">Why This Happened</h2>

<p>Because it was useful.</p>

<p>That is the honest answer.</p>

<p>There are a lot of low-friction product and operational decisions that do not justify a new table, admin surface, or deployment just to change one value. Redis made those decisions easy to externalize.</p>

<p>Examples:</p>

<ul>
  <li>toggling feature availability across workers</li>
  <li>changing a free-tier GPX export limit without a code deploy</li>
  <li>keeping rate limiting shared across processes</li>
  <li>running async jobs without introducing another moving part</li>
</ul>

<p>For a small team shipping quickly, that is a great trade.</p>

<h2 id="the-best-part-of-this-pattern">The Best Part of This Pattern</h2>

<p>Redis let us centralize controls that benefit from being:</p>

<ul>
  <li>shared across processes</li>
  <li>fast to read</li>
  <li>easy to mutate operationally</li>
  <li>resilient to deploy boundaries</li>
</ul>

<p>I especially like runtime settings in this category. Being able to change something like a free-tier threshold without redeploying is not glamorous, but it is exactly the kind of leverage product teams need when they are learning.</p>

<p>That is the difference between configuration as code and configuration as a live product control.</p>
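<p>A hedged sketch of that shape, with an invented setting key and an in-memory stand-in for the Redis client:</p>

```python
# Hypothetical runtime setting with a code default; key name invented.

DEFAULTS = {"free_tier_gpx_exports": 10}

def get_setting(redis_client, name):
    try:
        raw = redis_client.get(f"setting:{name}")
    except Exception:
        raw = None  # control-plane trouble falls back to the shipped default
    if raw is None:
        return DEFAULTS[name]
    return int(raw)

class FakeRedis:
    def __init__(self, data):
        self.data = data

    def get(self, key):
        return self.data.get(key)

# An operator can raise the limit live, without a deploy:
assert get_setting(FakeRedis({}), "free_tier_gpx_exports") == 10
assert get_setting(FakeRedis({"setting:free_tier_gpx_exports": b"25"}),
                   "free_tier_gpx_exports") == 25
```

<p>The default in code is what makes the live value safe to lose, which matters more later in this post.</p>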

<h2 id="the-risk-is-blast-radius">The Risk Is Blast Radius</h2>

<p>The downside is also obvious once you say it out loud:</p>

<p>if one Redis dependency backs jobs, rate limits, feature flags, and runtime settings, then Redis trouble can degrade multiple unrelated parts of the product at once.</p>

<p>That is the real trade.</p>

<p>It is not a problem when Redis is healthy. It is a design characteristic when Redis is not.</p>

<p>In Trek Point, that means one dependency influences:</p>

<ul>
  <li>whether Celery work flows</li>
  <li>whether API clients hit limits properly</li>
  <li>whether admin-controlled feature availability is honored</li>
  <li>whether runtime quota settings fall back to defaults</li>
</ul>

<p>That is more than “just caching.”</p>

<h2 id="failure-defaults-matter-a-lot">Failure Defaults Matter a Lot</h2>

<p>One thing I appreciated in this codebase is that some Redis-backed behaviors degrade intentionally.</p>

<p>For example:</p>

<ul>
  <li>feature flags default to enabled when Redis cannot be read</li>
  <li>runtime settings fall back to code defaults</li>
</ul>

<p>Those defaults are not arbitrary. They tell you what kind of failure the team considered safer:</p>

<ul>
  <li>keep the product broadly available</li>
  <li>avoid hard failures on transient control-plane issues</li>
</ul>

<p>That is a reasonable bias for user-facing product behavior, even if it means you temporarily lose some operational precision.</p>
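<p>A minimal sketch of that fail-open flag read. The key naming and fallback behavior here are illustrative assumptions, not our exact code:</p>

```python
# Hedged sketch: feature flags default to enabled when Redis is unreadable.

def flag_enabled(redis_client, name, default=True):
    try:
        value = redis_client.get(f"feature:{name}")
    except Exception:
        return default  # control-plane trouble should not dark-launch features
    if value is None:
        return default  # unset flag: keep the product broadly available
    return value in (b"1", b"true", "1", "true")

class DownRedis:
    """Stand-in for a Redis client during an outage."""
    def get(self, key):
        raise ConnectionError("redis unavailable")

# Redis down: the flag fails open instead of failing the request.
assert flag_enabled(DownRedis(), "gpx_export") is True
```

<p>The inverse default (fail closed) is equally one line of code; the point is that someone chose, deliberately, which failure is safer.</p>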

<h2 id="why-i-still-like-this-design">Why I Still Like This Design</h2>

<p>I would not dismiss this as a hack. For a product at Trek Point’s stage, this is exactly the kind of internal platform decision that keeps velocity high.</p>

<p>Redis is doing real work here:</p>

<ul>
  <li>distributing state cheaply</li>
  <li>giving product and ops levers without schema churn</li>
  <li>supporting both infrastructure and business behavior</li>
</ul>

<p>That is not architectural laziness. That is a practical platform layer.</p>

<h2 id="where-id-draw-the-line-later">Where I’d Draw the Line Later</h2>

<p>I would keep this pattern for a while, but I would watch for signals that it is time to split responsibilities:</p>

<ul>
  <li>different availability expectations for jobs versus feature controls</li>
  <li>operational confusion about which Redis failures affect which subsystems</li>
  <li>more auditability required around settings changes</li>
  <li>a need for richer administrative history or validation</li>
</ul>

<p>That is when a tiny control plane starts wanting stronger product boundaries of its own.</p>

<h2 id="the-bigger-lesson">The Bigger Lesson</h2>

<p>Infrastructure choices often become product choices gradually.</p>

<p>Redis is a good example because it is so easy to reach for. Over time it becomes the place where your system stores shared decisions, not just shared data.</p>

<p>That is what happened for Trek Point. We did not set out to build a control plane. We set out to solve a handful of practical problems quickly and ended up with a lightweight operational substrate that the rest of the app now relies on.</p>

<p>That is worth recognizing explicitly, because once a dependency is carrying control-plane semantics, you should operate it with much more respect than a generic cache.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[Redis starts innocently in most web apps.]]></summary></entry><entry><title type="html">Designing an OpenAPI Contract for a Product That Is Not Just an API</title><link href="/engineering/2026/03/19/openapi-contract-for-a-product-not-just-an-api.html" rel="alternate" type="text/html" title="Designing an OpenAPI Contract for a Product That Is Not Just an API" /><published>2026-03-19T10:00:00+00:00</published><updated>2026-03-19T10:00:00+00:00</updated><id>/engineering/2026/03/19/openapi-contract-for-a-product-not-just-an-api</id><content type="html" xml:base="/engineering/2026/03/19/openapi-contract-for-a-product-not-just-an-api.html"><![CDATA[<p>A lot of engineering teams treat OpenAPI as paperwork. The spec gets updated at the end, the examples go stale, and everyone agrees the code is the real source of truth.</p>

<p>I think that view misses what good API contracts actually do. In gateway products, the spec is often the clearest expression of what the platform wants developers to believe.</p>

<p>That is why <code class="language-plaintext highlighter-rouge">openapi.yaml</code> in this repo is more interesting than it looks.</p>

<h2 id="this-service-is-documenting-two-products-at-once">This service is documenting two products at once</h2>

<p>The gateway exposes a native Valhalla-style API surface and a Mapbox-compatible Directions endpoint. Those are not just two routes. They are two integration stories.</p>

<p>The OpenAPI spec tells developers:</p>

<ul>
  <li>this is one hosted service</li>
  <li>here is how auth works</li>
  <li>here are the public system endpoints</li>
  <li>here is the compatibility surface you can target if you already speak Mapbox</li>
</ul>

<p>That matters because a gateway like this is not just forwarding traffic. It is trying to become the stable contract in front of a more complicated backend topology.</p>

<h2 id="why-this-is-product-work">Why this is product work</h2>

<p>Once you publish a contract, you are making a promise about more than field names.</p>

<p>You are promising things like:</p>

<ul>
  <li>which endpoints are public and which are protected</li>
  <li>what shape errors take</li>
  <li>which query parameters are worth supporting long-term</li>
  <li>how much compatibility developers can rely on</li>
</ul>

<p>In other words, your spec is part of your distribution strategy.</p>

<p>For a product trying to attract developers, that is huge. A clear contract lowers trial friction, makes integration easier, and gives people confidence that the service was designed intentionally.</p>

<h2 id="what-i-like-in-this-spec">What I like in this spec</h2>

<p>I like that <code class="language-plaintext highlighter-rouge">/health</code> and <code class="language-plaintext highlighter-rouge">/metrics</code> are explicitly marked with empty security requirements. I like that the Mapbox-style route is documented with parameter enums and examples. I like that the API key scheme is described clearly instead of being buried in prose.</p>
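<p>That pattern is worth showing, because it is easy to miss in a large spec. This sketch uses invented paths and scheme names; in OpenAPI 3, an empty <code class="language-plaintext highlighter-rouge">security</code> array on an operation overrides the global requirement:</p>

```yaml
# Hedged OpenAPI 3 sketch; scheme and path names are invented.
security:
  - ApiKeyAuth: []          # global default: API key required

paths:
  /health:
    get:
      summary: Liveness check, intentionally public
      security: []          # empty array overrides the global requirement
      responses:
        "200":
          description: Service is up
```

<p>One empty array is the difference between “we forgot auth here” and “this endpoint is public on purpose,” and the spec is the only place that distinction is legible.</p>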

<p>Those details matter because developers do not experience your architecture directly. They experience your contract.</p>

<p>I also like that the existence of a compatibility endpoint is visible in the spec itself. That sends a strong message: this service understands migration and interoperability as first-class concerns.</p>

<h2 id="where-specs-usually-drift">Where specs usually drift</h2>

<p>The hard part is keeping documentation honest when runtime behavior has nuance.</p>

<p>In this codebase, some of the most interesting nuances live outside the raw endpoint descriptions:</p>

<ul>
  <li>cross-region routing is orchestrated dynamically</li>
  <li>some Mapbox-like features are best-effort translations, not exact equivalence</li>
  <li>auth enforcement can move between app code and deployment boundary</li>
  <li>rate limiting is documented as an operational topic, not just a response code</li>
</ul>

<p>That is normal. A spec cannot hold every architectural detail. But when the gap between documented contract and lived behavior gets too large, developers stop trusting the spec.</p>

<h2 id="what-i-would-add">What I would add</h2>

<p>If I were polishing this for external adoption, I would add three things.</p>

<p>First, a short compatibility note describing which parts of the Mapbox surface are exact, which are approximate, and which are intentionally unsupported.</p>

<p>Second, clearer documentation around degraded states and upstream dependency behavior. Health endpoints are documented, but operational semantics are where production APIs often earn or lose trust.</p>

<p>Third, examples for common cross-region requests, because those are the requests that make this gateway special.</p>

<h2 id="the-lesson">The lesson</h2>

<p>OpenAPI is not just a generated artifact. It is one of the sharpest ways to communicate engineering intent.</p>

<p>In a system like this, the contract is doing real strategic work. It turns a multi-region routing architecture into something developers can adopt without learning the topology behind it. That is not paperwork. That is architecture translated into a usable product surface.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[A lot of engineering teams treat OpenAPI as paperwork. The spec gets updated at the end, the examples go stale, and everyone agrees the code is the real source of truth.]]></summary></entry><entry><title type="html">Observability in a Flask + Celery App Is Easy Until You Instrument It Twice</title><link href="/engineering/2026/03/18/observability-in-a-flask-plus-celery-app.html" rel="alternate" type="text/html" title="Observability in a Flask + Celery App Is Easy Until You Instrument It Twice" /><published>2026-03-18T10:00:00+00:00</published><updated>2026-03-18T10:00:00+00:00</updated><id>/engineering/2026/03/18/observability-in-a-flask-plus-celery-app</id><content type="html" xml:base="/engineering/2026/03/18/observability-in-a-flask-plus-celery-app.html"><![CDATA[<p>Most observability tutorials assume a simpler world than the one production Python apps actually live in.</p>

<p>They assume:</p>

<ul>
  <li>one app process</li>
  <li>one startup path</li>
  <li>one instrumentation moment</li>
  <li>one idea of request lifecycle</li>
</ul>

<p>Trek Point is not that world.</p>

<p>We have:</p>

<ul>
  <li>a Flask app factory</li>
  <li>Gunicorn-style web processes</li>
  <li>Celery workers</li>
  <li>SQLAlchemy engines that should be instrumented once</li>
  <li>requests, Redis, and task execution crossing process boundaries</li>
</ul>

<p>That means the hard part of observability is not “how do we emit spans?” It is “how do we avoid producing a noisy, misleading mess?”</p>

<h2 id="why-we-used-both-sentry-and-opentelemetry">Why We Used Both Sentry and OpenTelemetry</h2>

<p>I do not believe one tool cleanly solves all observability needs for most product teams.</p>

<p>For Trek Point:</p>

<ul>
  <li>Sentry gives us application error visibility and a familiar debugging workflow</li>
  <li>OpenTelemetry gives us a path for traces and logs across Flask, SQLAlchemy, Celery, Redis, and outbound HTTP</li>
</ul>

<p>Those tools are not redundant. They answer different questions.</p>

<p>When a request crashes, Sentry is often the fastest route to the error. When a request is merely slow, fragmented across services, or degraded somewhere in a queue-backed path, tracing becomes more valuable.</p>

<p>That division of labor is healthy.</p>

<h2 id="the-real-problem-was-instrumentation-lifecycle">The Real Problem Was Instrumentation Lifecycle</h2>

<p>What bit us conceptually was not how to turn tracing on. It was deciding when instrumentation happens.</p>

<p>In an app-factory world, <code class="language-plaintext highlighter-rouge">create_app()</code> may run more often than you think:</p>

<ul>
  <li>once for the web app</li>
  <li>again in worker contexts</li>
  <li>sometimes twice per process depending on boot paths and imports</li>
</ul>

<p>That makes “instrument everything during startup” trickier than it sounds. If you patch SQLAlchemy, Flask, Celery, requests, or Redis repeatedly, you can end up with warnings, duplicate hooks, or inconsistent runtime behavior.</p>

<p>That is why I liked the discipline in our telemetry setup: treat cross-cutting instrumentors as per-process singletons, guard them carefully, and only instrument the app itself when needed.</p>

<p>This is the kind of detail that does not show up in architecture diagrams but absolutely matters in production.</p>

<h2 id="sqlalchemy-was-a-good-example">SQLAlchemy Was a Good Example</h2>

<p>Database instrumentation is often deceptively stateful.</p>

<p>If you instrument after engines are already created, you can miss query spans from those engines entirely. If you instrument too broadly on every app startup, you can get duplicate instrumentation warnings. In a codebase with an app factory and worker imports, the timing matters.</p>

<p>That is why observability code deserves the same design care as business logic. It is not just config.</p>

<h2 id="logs-traces-and-errors-need-a-shared-mental-model">Logs, Traces, and Errors Need a Shared Mental Model</h2>

<p>One thing I try to avoid is collecting every possible signal without deciding how engineers should use them.</p>

<p>The better question is:</p>

<p>“What debugging story are we trying to support?”</p>

<p>For Trek Point, the useful story looked something like this:</p>

<ul>
  <li>an exception reaches Sentry</li>
  <li>traces show the request path, SQL timing, Redis behavior, and outbound requests</li>
  <li>task execution can be correlated when work moves from request thread to Celery</li>
  <li>logs can be exported with the same service identity into the same telemetry backend</li>
</ul>

<p>That is much better than a tool-by-tool rollout where each signal exists in isolation.</p>
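<p>The correlation piece can be as small as threading one id through task headers. A sketch with an inline stand-in for the real enqueue call (the header name and <code class="language-plaintext highlighter-rouge">enqueue</code> function are illustrative, not our actual wiring):</p>

```python
import uuid

def new_request_id():
    return uuid.uuid4().hex

def enqueue(task_fn, payload, headers):
    # Stand-in for a broker-backed dispatch such as apply_async();
    # calling the task inline keeps the data flow visible and testable.
    return task_fn(payload, headers)

def process_activity(payload, headers):
    # Worker side: pull the correlation id back out so logs and spans
    # from the task can be joined to the originating web request.
    request_id = headers.get("x-request-id", "unknown")
    return {"processed": payload, "request_id": request_id}

def handle_upload(payload):
    request_id = new_request_id()
    result = enqueue(process_activity, payload, {"x-request-id": request_id})
    return request_id, result
```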

<h2 id="production-deployment-details-matter">Production Deployment Details Matter</h2>

<p>Telemetry setup is one of those areas where local success tells you almost nothing.</p>

<p>A production-ready setup has to account for:</p>

<ul>
  <li>exporter configuration</li>
  <li>service naming</li>
  <li>sampling strategy</li>
  <li>process model</li>
  <li>whether instrumentation is safe under repeated boot</li>
</ul>

<p>I have seen plenty of teams “add OpenTelemetry” and still end up blind because the lifecycle assumptions were wrong. Instrumentation is code. It needs to be reviewed with runtime behavior in mind.</p>

<h2 id="what-id-encourage-more-teams-to-do">What I’d Encourage More Teams to Do</h2>

<p>Treat observability setup as a first-class subsystem, not a wrapper around environment variables.</p>

<p>That means:</p>

<ul>
  <li>document how each process type is instrumented</li>
  <li>guard singleton patchers carefully</li>
  <li>decide what each telemetry tool is responsible for</li>
  <li>trace the paths users actually care about, not just happy-path web requests</li>
</ul>

<p>In products like Trek Point, some of the most interesting failures happen between the request and the worker, or between the upload and the derived media. If your observability story stops at Flask requests, you are missing half the product.</p>

<h2 id="the-main-lesson">The Main Lesson</h2>

<p>The difficulty of observability in Python is rarely “can we install the package?” The difficulty is making instrumentation reflect the real execution model of the app.</p>

<p>In Trek Point, the good work was not just turning on tracing. It was being explicit about repeated startup paths, singleton instrumentation, and how web requests, database work, outbound calls, Redis, and Celery should appear as one understandable system.</p>

<p>That is what observability should do: make a multi-part product feel legible when it misbehaves.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[Most observability tutorials assume a simpler world than the one production Python apps actually live in.]]></summary></entry><entry><title type="html">Observability Before `app.listen()`: Preloading OpenTelemetry, Sentry, and Pino in Node</title><link href="/engineering/2026/03/17/observability-before-app-listen.html" rel="alternate" type="text/html" title="Observability Before `app.listen()`: Preloading OpenTelemetry, Sentry, and Pino in Node" /><published>2026-03-17T10:00:00+00:00</published><updated>2026-03-17T10:00:00+00:00</updated><id>/engineering/2026/03/17/observability-before-app-listen</id><content type="html" xml:base="/engineering/2026/03/17/observability-before-app-listen.html"><![CDATA[<p>There are two ways teams usually add observability.</p>

<p>The first is deliberate: initialize telemetry before the application boots so you capture startup, request lifecycle, and error context from the beginning.</p>

<p>The second is common: bolt things on later, discover blind spots during incidents, and slowly fill them in while promising to clean it up “next sprint.”</p>

<p>This gateway leans toward the first path, and that is one of the strongest signals that it was built with production in mind.</p>

<h2 id="the-key-design-choice">The key design choice</h2>

<p>In <code class="language-plaintext highlighter-rouge">package.json</code>, both <code class="language-plaintext highlighter-rouge">start</code> and <code class="language-plaintext highlighter-rouge">dev</code> run Node with <code class="language-plaintext highlighter-rouge">--import ./src/instrument.js</code> before starting <code class="language-plaintext highlighter-rouge">src/server.js</code>.</p>

<p>That small startup detail does a lot of work. It ensures OpenTelemetry and Sentry are initialized before the Express app is imported and before any requests are handled. If you care about capturing the full request path and startup behavior, initialization order matters.</p>

<p>Too many services get this wrong and then wonder why their traces start halfway through the stack.</p>

<h2 id="what-this-setup-includes">What this setup includes</h2>

<p><code class="language-plaintext highlighter-rouge">src/instrument.js</code> wires together three observability layers:</p>

<ul>
  <li>OpenTelemetry auto-instrumentation with an OTLP exporter to Honeycomb</li>
  <li>Sentry error and log ingestion</li>
  <li>Pino-based application and HTTP logging through <code class="language-plaintext highlighter-rouge">src/services/logger.js</code></li>
</ul>

<p>I like this combination because each tool is doing a distinct job:</p>

<ul>
  <li>traces explain request flow and dependency timing</li>
  <li>logs explain local events and debugging detail</li>
  <li>Sentry captures failures and gives operators an incident workflow</li>
</ul>

<p>The service is not trying to force one tool to solve every problem.</p>

<h2 id="the-privacy-detail-i-was-happy-to-see">The privacy detail I was happy to see</h2>

<p><code class="language-plaintext highlighter-rouge">beforeSend()</code> removes <code class="language-plaintext highlighter-rouge">x-api-key</code> and <code class="language-plaintext highlighter-rouge">authorization</code> headers before events leave the process. That is one of those details that separates “we installed Sentry” from “we thought about operating this safely.”</p>

<p>Telemetry systems are where sensitive data goes to become permanent if you are careless. Scrubbing secrets at the boundary is not glamorous work, but it is the sort of thing mature teams automate early.</p>

<h2 id="what-is-good-and-what-is-missing">What is good and what is missing</h2>

<p>The code already captures a lot of value with relatively little machinery. Prometheus metrics exist in <code class="language-plaintext highlighter-rouge">src/services/metrics.js</code>, request IDs are added in <code class="language-plaintext highlighter-rouge">src/middleware/requestContext.js</code>, upstream attempts are logged with timing in <code class="language-plaintext highlighter-rouge">src/services/httpClient.js</code>, and Express error handling is connected to Sentry.</p>

<p>But the observability story is not finished, which makes it interesting.</p>

<p>The custom metrics expose a request counter labeled by endpoint, mode, and status. That is useful, but it is only the start of a RED-style metrics view. I would want latency histograms, cache hit ratios, breaker-open counters, and region-level upstream health metrics before calling this fully mature.</p>
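<p>The gateway itself is Node, but the shape of a Prometheus-style latency histogram is language-agnostic; here it is sketched in Python with illustrative bucket bounds (cumulative <code class="language-plaintext highlighter-rouge">le</code> counts plus a running sum and count, the raw material for rate and percentile views):</p>

```python
# Prometheus-style histogram: cumulative counts per upper bound ("le"),
# plus total count and sum. Bucket bounds here are a guess, not tuned.
BUCKETS = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5]  # seconds

def new_histogram():
    return {"count": 0, "sum": 0.0, "le": {le: 0 for le in BUCKETS}}

def observe(hist, seconds):
    hist["count"] += 1
    hist["sum"] += seconds
    for le in BUCKETS:
        if seconds <= le:
            hist["le"][le] += 1  # cumulative: every bound >= value counts
```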

<h2 id="why-i-still-like-the-current-approach">Why I still like the current approach</h2>

<p>This code is a good example of choosing the right first 80 percent. It does not try to build an internal observability platform. It uses standard tools, initializes them early, and keeps the integration close to the app lifecycle.</p>

<p>That is exactly the sort of restraint I want in a service like this. Instrumentation should make the system clearer, not become a second system that also needs constant care.</p>

<h2 id="what-i-would-improve-next">What I would improve next</h2>

<p>I would add:</p>

<ul>
  <li>Prometheus histograms for request duration and upstream latency</li>
  <li>explicit cache metrics for LRU and Redis tiers</li>
  <li>region labels for upstream error counts</li>
  <li>a small dashboard that correlates health probes, breaker events, and request latency</li>
</ul>

<p>With those additions, operators could answer the important questions faster: is the gateway slow, is one region unhealthy, or is the cache just cold?</p>

<h2 id="the-lesson">The lesson</h2>

<p>Good observability is not about collecting everything. It is about collecting the right signals early enough that you can explain the system under stress.</p>

<p>Preloading telemetry before <code class="language-plaintext highlighter-rouge">app.listen()</code> is a strong architectural move because it says observability is part of the runtime contract, not an afterthought. The tools may change later. That design instinct should not.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[There are two ways teams usually add observability.]]></summary></entry><entry><title type="html">The Hidden Complexity of GPX and FIT Support in a Route-Planning Product</title><link href="/engineering/2026/03/15/the-hidden-complexity-of-gpx-and-fit.html" rel="alternate" type="text/html" title="The Hidden Complexity of GPX and FIT Support in a Route-Planning Product" /><published>2026-03-15T10:00:00+00:00</published><updated>2026-03-15T10:00:00+00:00</updated><id>/engineering/2026/03/15/the-hidden-complexity-of-gpx-and-fit</id><content type="html" xml:base="/engineering/2026/03/15/the-hidden-complexity-of-gpx-and-fit.html"><![CDATA[<p>“Supports GPX and FIT” is the kind of bullet point that looks tiny on a pricing page.</p>

<p>In a route-planning product, it is not tiny at all.</p>

<p>By the time we had Trek Point handling real uploads and exports, that one feature implied:</p>

<ul>
  <li>file validation</li>
  <li>size limits</li>
  <li>parsing multiple formats with different quirks</li>
  <li>coordinate chain extraction</li>
  <li>activity metric derivation</li>
  <li>playback data generation</li>
  <li>route enrichment through external services</li>
  <li>export constraints tied to both product and plan</li>
</ul>

<p>The feature sounds like file handling. It is really a pipeline.</p>

<h2 id="import-is-more-than-can-we-parse-the-file">Import Is More Than “Can We Parse the File”</h2>

<p>The first job is obvious: read GPX and FIT and turn them into a usable coordinate chain.</p>

<p>The second job is where the product value appears:</p>

<ul>
  <li>derive distance and duration</li>
  <li>preserve timestamps where available</li>
  <li>build playback-friendly <code class="language-plaintext highlighter-rouge">track_series</code></li>
  <li>compute start and end points</li>
  <li>normalize a route representation the rest of the app can use</li>
</ul>

<p>At that point you are no longer just accepting a file format. You are translating an external representation into your internal activity model.</p>

<p>That translation layer becomes a product boundary in its own right.</p>

<h2 id="good-route-products-always-end-up-caring-about-derived-data">Good Route Products Always End Up Caring About Derived Data</h2>

<p>A raw track is rarely enough.</p>

<p>Users expect more than “we stored your points.” They expect:</p>

<ul>
  <li>route playback</li>
  <li>charts</li>
  <li>distance summaries</li>
  <li>meaningful previews</li>
  <li>export behavior that feels consistent across imported and native data</li>
</ul>

<p>That is why Trek Point’s processing pipeline computes secondary structures instead of leaving the file as the source of truth for every UI.</p>

<p>This is one of those choices that adds work early but saves pain later. If every screen has to reinterpret GPX or FIT raw data differently, the product stops being coherent.</p>

<h2 id="decimation-was-not-just-an-optimization">Decimation Was Not Just an Optimization</h2>

<p>One of the more interesting details in the code is that we decimate coordinate chains before sending them to Valhalla to derive ordered way-type segments.</p>

<p>You could describe that as a performance tweak. It is also a product decision.</p>

<p>External services have practical input limits, latency costs, and diminishing returns. We do not need every point in a dense track to classify the route meaningfully. But if we decimate too aggressively, the route loses fidelity and downstream labeling gets worse.</p>

<p>That tension is common in map products:</p>

<ul>
  <li>too much fidelity is expensive</li>
  <li>too little fidelity is misleading</li>
</ul>

<p>There is no perfect answer. There is only a product-informed threshold.</p>
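<p>The mechanics can stay simple even when the threshold is a judgment call. A sketch of endpoint-preserving decimation (the cap below is illustrative, not Trek Point's actual limit):</p>

```python
def decimate(points, max_points=500):
    """Thin a coordinate chain to at most max_points, keeping both endpoints."""
    if len(points) <= max_points:
        return list(points)
    # Spread the kept indices evenly across the original chain.
    step = (len(points) - 1) / (max_points - 1)
    kept = [points[round(i * step)] for i in range(max_points)]
    kept[-1] = points[-1]  # guarantee the exact final point survives
    return kept
```

<p>Picking <code class="language-plaintext highlighter-rouge">max_points</code> is the product decision; the code is the easy part.</p>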

<h2 id="export-adds-a-second-layer-of-constraints">Export Adds a Second Layer of Constraints</h2>

<p>Supporting export sounds like the inverse of import. It is not.</p>

<p>Now you care about:</p>

<ul>
  <li>output format expectations</li>
  <li>coordinate count caps</li>
  <li>attachment naming</li>
  <li>browser download behavior</li>
  <li>whether the current user is allowed to export that format at all</li>
</ul>

<p>In Trek Point, FIT export from the planner is gated both by entitlement and by coordinate count. That is the right shape of guardrail. It protects the product from abuse and the user from asking the system to do something we know will be brittle.</p>

<h2 id="billing-and-file-formats-end-up-intertwined">Billing and File Formats End Up Intertwined</h2>

<p>This is one of the more “real product” aspects of the implementation.</p>

<p>Not every export format is equal from a product perspective. GPX can be part of a free-tier experience with limits. FIT and GeoJSON are stronger signals of advanced use and can sit behind premium access.</p>

<p>People sometimes criticize this kind of gating as arbitrary, but it is often a good reflection of actual cost and value:</p>

<ul>
  <li>advanced exports attract advanced users</li>
  <li>those users stress the product differently</li>
  <li>those formats often require more support and correctness guarantees</li>
</ul>

<p>The important thing is to gate capabilities consistently and transparently, not to pretend all bytes are equal.</p>

<h2 id="testing-the-edge-cases-was-worth-it">Testing the Edge Cases Was Worth It</h2>

<p>One of the tests I like in Trek Point verifies that GPX processing emits <code class="language-plaintext highlighter-rouge">time_dist</code>, because route playback depends on it. That is a perfect example of a high-value test.</p>

<p>It is not just checking parser output mechanically. It is protecting an actual user-visible capability.</p>

<p>The best tests in import/export systems are rarely “did function X return dict Y.” They are “does this uploaded artifact still support the product behavior users depend on?”</p>
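<p>To make that concrete, here is the shape of such a derived structure: cumulative (seconds, meters) pairs computed with a haversine distance. The layout and field names are illustrative, not Trek Point's actual schema:</p>

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def time_dist(points):
    """points: [(epoch_seconds, lat, lon), ...] -> cumulative (sec, meters) pairs."""
    out, total = [], 0.0
    for i, (t, lat, lon) in enumerate(points):
        if i:
            _, plat, plon = points[i - 1]
            total += haversine_m(plat, plon, lat, lon)
        out.append((t - points[0][0], round(total, 1)))
    return out
```

<p>A playback test then asserts on this structure, not on parser internals.</p>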

<h2 id="what-id-tell-teams-building-similar-features">What I’d Tell Teams Building Similar Features</h2>

<p>Do not let file-format support get trapped in a storage mindset.</p>

<p>Ask these questions early:</p>

<ul>
  <li>what is the normalized internal representation?</li>
  <li>what derived data should be computed once instead of repeatedly?</li>
  <li>what external enrichment is worth the latency and failure modes?</li>
  <li>what are the safe limits for upload and export?</li>
  <li>which tests protect product behavior, not just parser mechanics?</li>
</ul>

<p>If you answer those well, GPX and FIT support becomes a capability. If you do not, it stays a pile of parsers.</p>

<h2 id="the-lesson">The Lesson</h2>

<p>Import/export features are often marketed like interoperability checkboxes. In reality they expose how mature your product model is.</p>

<p>Trek Point’s GPX and FIT support became valuable when we stopped thinking of the files as endpoints and started treating them as inputs and outputs of a richer route and activity pipeline.</p>

<p>That is the hidden complexity. The files matter, but the product meaning you derive from them matters more.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[“Supports GPX and FIT” is the kind of bullet point that looks tiny on a pricing page.]]></summary></entry><entry><title type="html">The Rate Limiter Worked in Staging and Failed in Production: Identity, Replicas, and Shared Counters</title><link href="/engineering/2026/03/14/rate-limiter-worked-in-staging-failed-in-production.html" rel="alternate" type="text/html" title="The Rate Limiter Worked in Staging and Failed in Production: Identity, Replicas, and Shared Counters" /><published>2026-03-14T10:00:00+00:00</published><updated>2026-03-14T10:00:00+00:00</updated><id>/engineering/2026/03/14/rate-limiter-worked-in-staging-failed-in-production</id><content type="html" xml:base="/engineering/2026/03/14/rate-limiter-worked-in-staging-failed-in-production.html"><![CDATA[<p>One of my favorite engineering stories is when the code is technically correct and operationally wrong.</p>

<p>This gateway has one of those stories hiding in plain sight.</p>

<p>In <code class="language-plaintext highlighter-rouge">src/app.js</code>, the service uses <code class="language-plaintext highlighter-rouge">express-rate-limit</code> before <code class="language-plaintext highlighter-rouge">apiKeyAuth</code>. That means almost every request is subject to a single global limiter unless the key is whitelisted. It is simple, easy to add, and absolutely good enough to stop accidental overload in a small deployment.</p>

<p>It is also not the policy the product actually wants.</p>

<h2 id="the-mismatch">The mismatch</h2>

<p>The business identity in this system is the API key. But because the limiter runs before auth, the effective identity is the default one used by the middleware, which is basically the client IP.</p>

<p>That seems minor until you think about what the product probably wants to enforce:</p>

<ul>
  <li>quotas per customer</li>
  <li>different limits per plan</li>
  <li>fair usage across a fleet of clients behind shared IPs</li>
  <li>consistent budgets across multiple gateway replicas</li>
</ul>

<p>The current limiter solves none of those exactly.</p>

<p>This is what makes it such a good lesson. The code is not broken. The architecture is misaligned with the real control point.</p>

<h2 id="why-this-happens-so-often">Why this happens so often</h2>

<p>Rate limiting is one of those features teams add under pressure. You need something fast, you need abuse protection, and the middleware makes it trivial to get started. So you put the limiter near the top of the stack and move on.</p>

<p>But order matters.</p>

<p>Middleware placement decides what identity data exists when enforcement runs. If auth has not happened yet, then you cannot rate-limit by authenticated business identity. You can only rate-limit by what you know at that point: IP, connection, or some raw header value you have not validated.</p>

<p>This repo even documents the gap in <code class="language-plaintext highlighter-rouge">README.md</code>, which I love. That is exactly the kind of honest operational documentation mature teams should keep.</p>

<h2 id="the-scaling-problem-is-separate-and-just-as-important">The scaling problem is separate and just as important</h2>

<p>Even if the limiter were keyed correctly, in-memory counters do not scale across replicas. With multiple gateway instances, your effective quota becomes roughly the sum of the per-instance limits unless you share state.</p>

<p>The repo already includes <code class="language-plaintext highlighter-rouge">rate-limit-redis</code> as a dependency, but it is not wired into the running middleware. That is not a failure. It is evidence of an architecture in transition: from local protection to product-grade quota enforcement.</p>

<p>Those transitions are where most systems spend their lives.</p>

<h2 id="what-i-would-change">What I would change</h2>

<p>I would split rate limiting into two layers.</p>

<p>First, I would keep a relatively loose pre-auth limiter keyed by IP to defend against brute-force abuse and credential stuffing patterns.</p>

<p>Second, I would move the stricter quota limiter after <code class="language-plaintext highlighter-rouge">apiKeyAuth</code> and key it on validated API keys. In a multi-instance deployment, I would back that limiter with Redis so every replica shares the same counters.</p>

<p>That gets you a cleaner separation of concerns:</p>

<ul>
  <li>pre-auth protection for unknown callers</li>
  <li>post-auth quotas for real customers</li>
  <li>one identity model for business policy</li>
</ul>
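<p>The post-auth quota layer reduces to a shared counter per (key, window). A sketch of a fixed-window check, shown in Python for brevity, in which a plain dict stands in for Redis <code class="language-plaintext highlighter-rouge">INCR</code> plus <code class="language-plaintext highlighter-rouge">EXPIRE</code> and the limits are illustrative:</p>

```python
import time

WINDOW = 60    # seconds per window; illustrative
LIMIT = 100    # requests per key per window; illustrative
_counters = {} # stand-in for Redis; old windows are never cleaned here,
               # whereas EXPIRE handles that automatically in real life

def allow(api_key, now=None):
    """Fixed-window check keyed on the *validated* API key, post-auth."""
    now = time.time() if now is None else now
    bucket = (api_key, int(now // WINDOW))
    count = _counters.get(bucket, 0) + 1
    _counters[bucket] = count  # Redis equivalent: INCR bucket; EXPIRE bucket WINDOW
    return count <= LIMIT
```

<p>Because the counter lives in shared state, every replica sees the same budget, which is the property the in-memory middleware cannot give you.</p>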

<h2 id="why-this-is-a-great-blog-topic">Why this is a great blog topic</h2>

<p>This is the kind of issue senior engineers recognize immediately because it sits at the boundary between code and operations. The middleware works. The bug is in the model.</p>

<p>A lot of production systems have these hidden mismatches:</p>

<ul>
  <li>caching keyed on transport details instead of business intent</li>
  <li>auth checks that happen after expensive work</li>
  <li>retries that amplify failure</li>
  <li>quotas enforced at the wrong trust boundary</li>
</ul>

<p>They are hard to spot because everything still “works” until scale or abuse teaches you otherwise.</p>

<h2 id="the-lesson">The lesson</h2>

<p>Rate limiting is not just an HTTP concern. It is an identity concern and a distributed systems concern.</p>

<p>If you enforce it before you know who the caller is, you get the wrong policy. If you enforce it without shared state, you get the wrong economics. And if you do both, you can still pass staging while failing the product reality of production.</p>

<p>That is why this code is interesting. It captures the exact moment many systems hit: the point where a protective middleware feature has to become a real platform policy.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[One of my favorite engineering stories is when the code is technically correct and operationally wrong.]]></summary></entry><entry><title type="html">If the Queue Is Down, the Upload Still Has to Work</title><link href="/engineering/2026/03/13/if-the-queue-is-down-the-upload-still-has-to-work.html" rel="alternate" type="text/html" title="If the Queue Is Down, the Upload Still Has to Work" /><published>2026-03-13T10:00:00+00:00</published><updated>2026-03-13T10:00:00+00:00</updated><id>/engineering/2026/03/13/if-the-queue-is-down-the-upload-still-has-to-work</id><content type="html" xml:base="/engineering/2026/03/13/if-the-queue-is-down-the-upload-still-has-to-work.html"><![CDATA[<p>One of the clearest signals that a product has seen real users is how it behaves when infrastructure is unhealthy.</p>

<p>In a perfect diagram, a user uploads a GPX or FIT file, the web app stores it, a job is queued, a worker processes it, and the UI updates later. That is exactly how Trek Point is designed when everything is healthy.</p>

<p>But “when everything is healthy” is not a user experience strategy.</p>

<p>At some point Redis is unavailable, Celery workers are stale, a deploy is mid-rollout, or the queue path throws an exception at the worst moment. The question becomes:</p>

<p><strong>does the product fail like infrastructure, or degrade like software that still wants to help the user?</strong></p>

<p>We chose degradation.</p>

<h2 id="the-default-path-is-still-async">The Default Path Is Still Async</h2>

<p>For activity uploads, async processing is the right architecture.</p>

<p>Parsing files, deriving playback data, enriching route metadata, and generating previews are not work I want to do inline on every request if I can avoid it. The background path gives us:</p>

<ul>
  <li>better request latency</li>
  <li>more resilient retries and batching patterns</li>
  <li>a cleaner way to isolate heavier processing</li>
  <li>room to evolve previews and enrichment without blocking the upload flow</li>
</ul>

<p>That part is straightforward.</p>

<h2 id="the-important-part-was-the-fallback">The Important Part Was the Fallback</h2>

<p>The code that matters most is not the Celery task declaration. It is the behavior when enqueueing fails.</p>

<p>In Trek Point, if the app cannot enqueue activity processing, it falls back to processing the uploaded activity in-process. That is a deeply pragmatic choice. It is also the kind of choice some teams are initially uncomfortable with because it introduces two execution paths.</p>

<p>I still think it was the right call.</p>

<p>Why? Because the user’s mental model is:</p>

<p>“I uploaded my activity. Did it work?”</p>

<p>They are not asking:</p>

<p>“Was this operation successfully delegated to our preferred background execution substrate?”</p>
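<p>The shape of that fallback is small. A sketch with hypothetical function names, where the enqueue stand-in always fails so the degradation path is visible:</p>

```python
def process_activity(activity_id):
    # Shared implementation: identical whether invoked by the worker
    # or inline, which is what keeps the two paths from drifting.
    return {"id": activity_id, "status": "processed"}

def enqueue_processing(activity_id):
    # Stand-in for a broker-backed dispatch; here it simulates an outage.
    raise ConnectionError("broker unavailable")

def handle_upload(activity_id):
    try:
        enqueue_processing(activity_id)
        return {"id": activity_id, "status": "queued"}
    except Exception:
        # Degrade: the request gets slower, but the upload still works.
        return process_activity(activity_id)
```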

<h2 id="the-tradeoff-is-real">The Tradeoff Is Real</h2>

<p>I do not want to oversell this pattern. Fallback-to-sync is not magic.</p>

<p>It can introduce:</p>

<ul>
  <li>slower requests under failure</li>
  <li>inconsistent latency</li>
  <li>duplicated execution paths to maintain</li>
  <li>harder reasoning about partial failure</li>
  <li>the possibility that web and worker code drift if discipline slips</li>
</ul>

<p>That is all true.</p>

<p>But there is another failure mode that is often worse in product terms:</p>

<ul>
  <li>the upload looks accepted</li>
  <li>nothing processes</li>
  <li>the user gets no useful outcome</li>
  <li>support gets the ticket later</li>
</ul>

<p>For a consumer-facing or prosumer-facing product, that is often the worse trade.</p>

<h2 id="repair-paths-matter-too">Repair Paths Matter Too</h2>

<p>One thing I like in Trek Point is that we did not stop at fallback logic. We also left ourselves a repair path.</p>

<p>There is a user-facing reprocess path for activities, and there is even playback-debug UI that helps explain when a worker may be running old code or when derived playback data is missing. That is not pretty from a purist architecture perspective, but it is excellent product engineering.</p>

<p>It says:</p>

<ul>
  <li>we know distributed processing can drift</li>
  <li>we know deploys are not always perfectly synchronized</li>
  <li>we want a safe way to recover without turning every issue into an ops incident</li>
</ul>

<p>That is the kind of realism I trust in production systems.</p>

<h2 id="why-this-pattern-fit-trek-point">Why This Pattern Fit Trek Point</h2>

<p>File upload and route activity processing sit in an awkward middle ground:</p>

<ul>
  <li>too heavy to always do inline</li>
  <li>too user-visible to silently fail behind a queue</li>
</ul>

<p>That is exactly the zone where graceful degradation earns its keep.</p>

<p>If this were a purely internal analytics pipeline, I would care more about strict queue guarantees and less about inline fallback. But this is user content. People expect an immediate path from “I uploaded a file” to “I can use it.”</p>

<p>Product context should influence reliability design.</p>

<h2 id="what-i-would-tighten-over-time">What I Would Tighten Over Time</h2>

<p>The risk with these pragmatic fallbacks is not the first implementation. It is drift.</p>

<p>Over time I would want:</p>

<ul>
  <li>stronger observability around fallback frequency</li>
  <li>explicit alerts if sync fallback starts happening too often</li>
  <li>tests that assert behavioral parity between web-triggered and worker-triggered processing</li>
  <li>clear runbooks for stale workers and partial deploys</li>
</ul>

<p>The fallback should be a resilience feature, not a hidden permanent execution mode.</p>

<h2 id="the-broader-lesson">The Broader Lesson</h2>

<p>Queue-centric architecture can make teams think in infrastructure terms. Users think in outcome terms.</p>

<p>When the queue is down, the question is not whether your async design was elegant. The question is whether the product still did the most reasonable thing for the person who just tried to use it.</p>

<p>For Trek Point, that often meant this:</p>

<p>if we can still safely process the upload, do it.</p>

<p>That is not the cleanest diagram. It is a better product.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[One of the clearest signals that a product has seen real users is how it behaves when infrastructure is unhealthy.]]></summary></entry></feed>