<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-04-10T18:19:41+00:00</updated><id>/feed.xml</id><title type="html">Trekpoint</title><subtitle>Notes from the Trekpoint engineering team: architecture, infrastructure, APIs, and the tradeoffs behind the product.</subtitle><author><name>Bongani Mbigi</name></author><entry><title type="html">The Production Scheduler Footgun: Celery Beat in Config, Missing in Reality</title><link href="/engineering/2026/03/23/the-celery-beat-footgun.html" rel="alternate" type="text/html" title="The Production Scheduler Footgun: Celery Beat in Config, Missing in Reality" /><published>2026-03-23T10:00:00+00:00</published><updated>2026-03-23T10:00:00+00:00</updated><id>/engineering/2026/03/23/the-celery-beat-footgun</id><content type="html" xml:base="/engineering/2026/03/23/the-celery-beat-footgun.html"><![CDATA[<p>Some production problems are subtle scaling pathologies.</p>

<p>Others are simpler and more embarrassing:</p>

<p>the code says scheduled jobs exist, and the deployment does not actually run the scheduler.</p>

<p>Trek Point carries the shape of that lesson. The application defines periodic Celery work for billing-related housekeeping, but the deployment configuration we ship shows worker processes prominently and no obvious Beat process.</p>

<p>This is exactly the kind of issue that deserves to be written down because it teaches something bigger than Celery.</p>

<h2 id="why-this-kind-of-bug-is-dangerous">Why This Kind of Bug Is Dangerous</h2>

<p>It hides in plain sight.</p>

<p>The codebase can look perfectly healthy:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">beat_schedule</code> is defined</li>
  <li>task names are valid</li>
  <li>local assumptions feel fine</li>
  <li>everyone remembers that those jobs “exist”</li>
</ul>

<p>But the actual question is not whether the schedule exists in Python. The question is whether any production process is alive to interpret that schedule.</p>

<p>That is the difference between configured and operationally real.</p>
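<p>The gap is easy to make concrete. Here is a minimal sketch (the task names are invented for illustration) of a schedule that looks perfectly healthy in code while remaining inert unless a separate Beat process is alive:</p>

```python
# Hypothetical beat_schedule sketch: this dict declares intent only.
# Nothing runs these tasks unless a `celery beat` process interprets it.
from datetime import timedelta

beat_schedule = {
    "expire-old-coupons": {
        "task": "billing.tasks.expire_old_coupons",   # invented task name
        "schedule": timedelta(hours=24),
    },
    "flag-expiring-cards": {
        "task": "billing.tasks.flag_expiring_cards",  # invented task name
        "schedule": timedelta(hours=24),
    },
}

# The schedule is just data. Some process must run, e.g.:
#   celery -A app.celery beat --loglevel=info
# alongside the workers, or none of the above ever fires.
```

<p>Nothing in that file fails, lints badly, or warns. It simply waits for a process that may never exist.</p>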

<h2 id="why-teams-miss-this">Why Teams Miss This</h2>

<p>Because modern apps spread operational truth across too many places:</p>

<ul>
  <li>app config</li>
  <li>Docker compose</li>
  <li>deployment manifests</li>
  <li>process managers</li>
  <li>cloud schedulers</li>
  <li>tribal knowledge</li>
</ul>

<p>Any mismatch between those layers can survive for a long time if the tasks are not obviously user-facing every day.</p>

<p>That is especially true for maintenance work like:</p>

<ul>
  <li>expiring old coupons</li>
  <li>marking soon-to-expire credit cards</li>
  <li>cleaning up stale rows</li>
  <li>synchronizing background state</li>
</ul>

<p>These jobs often fail quietly until a support issue exposes them.</p>

<h2 id="this-is-not-really-about-celery-beat">This Is Not Really About Celery Beat</h2>

<p>The deeper lesson is that infrastructure declarations are only as real as the process topology that backs them.</p>

<p>A lot of engineering teams have a weak spot here. We review code thoroughly, but we often review deploy semantics informally. That creates a blind spot where product assumptions live in code and operational assumptions live in memory.</p>

<p>Once you have multiple process types, you need to be explicit about:</p>

<ul>
  <li>which processes must exist in every environment</li>
  <li>which ones are optional locally</li>
  <li>which health checks or alerts prove they are alive</li>
  <li>which business workflows depend on them</li>
</ul>

<p>If those questions are not answered concretely, the scheduler is just a nice idea in <code class="language-plaintext highlighter-rouge">settings.py</code>.</p>
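<p>The "health checks or alerts prove they are alive" point can be sketched with a last-successful-run heartbeat. This is a toy in-memory version with invented job names; a real deployment would record the timestamp in Redis or a metrics backend and alert when it goes stale:</p>

```python
import time

# Hypothetical heartbeat pattern: each scheduled task records a
# last-successful-run timestamp; a separate check alerts when it is stale.
_heartbeats = {}  # in-memory stand-in for Redis/metrics in this sketch

def record_heartbeat(job_name):
    _heartbeats[job_name] = time.time()

def is_stale(job_name, max_age_seconds):
    last = _heartbeats.get(job_name)
    return last is None or (time.time() - last) > max_age_seconds

record_heartbeat("expire-old-coupons")
assert not is_stale("expire-old-coupons", max_age_seconds=86400)
# A job that has never run is stale by definition:
assert is_stale("flag-expiring-cards", max_age_seconds=86400)
```

<p>The useful property is that "never ran at all" and "stopped running" produce the same alert, which is exactly the failure mode a missing Beat process creates.</p>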

<h2 id="the-product-cost-of-quietly-missing-scheduled-work">The Product Cost of Quietly Missing Scheduled Work</h2>

<p>The reason this matters is not architectural neatness. It is product behavior.</p>

<p>Scheduled billing and maintenance jobs shape user trust even when users never see the jobs directly.</p>

<p>If a cleanup task does not run, the symptoms show up elsewhere:</p>

<ul>
  <li>stale billing state</li>
  <li>expired incentives lingering too long</li>
  <li>cards not being flagged when expected</li>
  <li>support investigations that take longer because everyone assumes the automation fired</li>
</ul>

<p>Those are indirect failures, which is why they are so easy to underestimate.</p>

<h2 id="how-id-guard-against-this">How I’d Guard Against This</h2>

<p>If I were hardening this part of Trek Point, I would want at least one of the following to be true:</p>

<ul>
  <li>Beat is a first-class deploy target with clear ownership</li>
  <li>periodic jobs move to an external scheduler with explicit invocation</li>
  <li>alerts exist for scheduler liveness and last-successful-run timestamps</li>
  <li>operational docs state exactly how periodic jobs run in each environment</li>
</ul>

<p>The main goal is not “use Celery Beat correctly.” The goal is to make background business time visible.</p>

<p>That phrase sounds abstract, but it is real. Schedulers are how products express time-dependent intent:</p>

<ul>
  <li>daily</li>
  <li>hourly</li>
  <li>after expiry</li>
  <li>before renewal</li>
</ul>

<p>If no process is actually keeping that time, your business logic has a hole in it.</p>

<h2 id="why-this-makes-a-great-engineering-story">Why This Makes a Great Engineering Story</h2>

<p>I like this kind of example because it is concrete, humble, and widely relatable.</p>

<p>Every experienced team has something like this in its history:</p>

<ul>
  <li>a cron defined but never deployed</li>
  <li>a worker running without the scheduler</li>
  <li>a scheduled task surviving in one environment but not another</li>
  <li>a maintenance process everyone assumes someone else owns</li>
</ul>

<p>These stories are useful because they cut through the fiction that “configured” means “running.”</p>

<h2 id="the-lesson-id-keep">The Lesson I’d Keep</h2>

<p>Application code can declare intent. Production systems still need a living process to honor it.</p>

<p>That is the whole lesson.</p>

<p>If I could turn that into one engineering reflex, it would be this:</p>

<p>whenever a codebase declares periodic work, immediately ask: “show me the process that runs it in production.”</p>

<p>If nobody can answer quickly, you have probably found a more important issue than the code review comments in front of you.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[Some production problems are subtle scaling pathologies.]]></summary></entry><entry><title type="html">Why We Chose Kamal Over Kubernetes for a Multi-Region Routing Gateway</title><link href="/engineering/2026/03/22/why-kamal-over-kubernetes.html" rel="alternate" type="text/html" title="Why We Chose Kamal Over Kubernetes for a Multi-Region Routing Gateway" /><published>2026-03-22T10:00:00+00:00</published><updated>2026-03-22T10:00:00+00:00</updated><id>/engineering/2026/03/22/why-kamal-over-kubernetes</id><content type="html" xml:base="/engineering/2026/03/22/why-kamal-over-kubernetes.html"><![CDATA[<p>Not every production service needs a platform story big enough to impress conference slides.</p>

<p>Sometimes the most senior infrastructure decision is choosing less.</p>

<p>This gateway is a good example. It is a stateless Node service with optional Redis, clear environment-driven configuration, and a Docker-based deployment flow. The deploy config in <code class="language-plaintext highlighter-rouge">config/deploy.yml</code> uses Kamal instead of reaching immediately for Kubernetes, and I think that is exactly the kind of pragmatic decision more teams should write about.</p>

<h2 id="start-with-the-workload-not-the-trend">Start with the workload, not the trend</h2>

<p>The service does not manage durable business state. It does not run a complicated job system. It does not require service discovery across dozens of internal components. It listens on one port, calls a few upstreams, and scales horizontally in a mostly boring way.</p>

<p>That is not an insult. It is a gift.</p>

<p>Workloads like this are perfect candidates for simple deployment models because the application already has the right shape:</p>

<ul>
  <li>stateless request handling</li>
  <li>config via environment variables</li>
  <li>easy containerization</li>
  <li>no tight coupling to node-local disk</li>
</ul>

<p>The Dockerfile reflects that simplicity. The image is straightforward, startup is explicit, and the deployment config focuses on the things that matter: image registry, target host, SSL proxying, secrets, and runtime env.</p>
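<p>For flavor, here is roughly the shape such a config takes. This is a hedged sketch rather than our actual file: the service name, host, domain, registry, and variable names are all invented:</p>

```yaml
# Hypothetical Kamal deploy.yml sketch; every name and host is invented.
service: routing-gateway
image: example-org/routing-gateway

servers:
  web:
    - 203.0.113.10          # single explicit deploy target

proxy:
  ssl: true                 # SSL termination at the proxy
  host: gateway.example.com

registry:
  username: deployer
  password:
    - KAMAL_REGISTRY_PASSWORD   # pulled from the environment, not committed

env:
  secret:
    - REDIS_URL             # injected at deploy time
  clear:
    PORT: 3000
```

<p>The whole deployment story fits on one screen, which is itself an argument for this class of workload.</p>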

<h2 id="why-a-smaller-platform-can-be-the-better-platform">Why a smaller platform can be the better platform</h2>

<p>Kubernetes solves real problems. It also introduces a lot of surface area: manifests, controllers, ingress decisions, secret management conventions, observability plumbing, rollout policy, and operational overhead that teams often underestimate.</p>

<p>For a service at this stage, Kamal offers a simpler path:</p>

<ul>
  <li>container-based deploys</li>
  <li>explicit host targeting</li>
  <li>manageable secret injection</li>
  <li>enough structure for repeatability without a full control plane</li>
</ul>

<p>That is often the right tradeoff when the app itself is still evolving. You get a production deployment workflow without committing to infrastructure complexity you may not need yet.</p>

<h2 id="the-design-still-leaves-room-to-grow">The design still leaves room to grow</h2>

<p>What I like about this repo is that the service is not painted into a corner. The app is already containerized. Redis is optional and can be externalized. Config is environment driven. Health and metrics endpoints exist. Those are all good portability decisions regardless of the scheduler.</p>

<p>In other words, choosing Kamal here does not mean choosing against future scale. It means deferring platform complexity until there is evidence you need it.</p>

<p>That distinction matters. A lot of teams frame these decisions as ambition versus simplicity. The better framing is timing versus cost.</p>

<h2 id="what-you-give-up">What you give up</h2>

<p>Of course there are tradeoffs.</p>

<p>A lighter deploy model may give you less built-in support for:</p>

<ul>
  <li>sophisticated autoscaling policies</li>
  <li>standardized multi-node orchestration patterns</li>
  <li>first-class service mesh integrations</li>
  <li>broad internal platform consistency if the rest of the org is already on Kubernetes</li>
</ul>

<p>Those are real considerations. But they are only benefits if your team will actually use them and support them well.</p>

<h2 id="what-i-would-watch-for">What I would watch for</h2>

<p>The triggers that would make me revisit the platform choice are pretty clear:</p>

<ul>
  <li>traffic growth that makes rollout and capacity management significantly harder</li>
  <li>more internal dependencies and sidecars around the gateway</li>
  <li>stronger requirements for multi-region active-active operations</li>
  <li>an organizational shift toward standardized platform tooling</li>
</ul>

<p>Until then, the simpler path is often the more responsible one.</p>

<h2 id="the-lesson">The lesson</h2>

<p>Infrastructure decisions should reflect the shape of the application and the maturity of the team operating it.</p>

<p>For a stateless routing gateway, choosing Kamal over Kubernetes is not “less serious.” It is a bet that operational focus matters more than platform fashion. In my experience, that is often the senior move: build the application so it can grow, but keep the deployment story as small as reality allows.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[Not every production service needs a platform story big enough to impress conference slides.]]></summary></entry><entry><title type="html">The Stripe Webhook Decision I’d Revisit: Re-Fetching Events by ID Instead of Verifying Signatures</title><link href="/engineering/2026/03/21/the-stripe-webhook-decision-id-revisit.html" rel="alternate" type="text/html" title="The Stripe Webhook Decision I’d Revisit: Re-Fetching Events by ID Instead of Verifying Signatures" /><published>2026-03-21T10:00:00+00:00</published><updated>2026-03-21T10:00:00+00:00</updated><id>/engineering/2026/03/21/the-stripe-webhook-decision-id-revisit</id><content type="html" xml:base="/engineering/2026/03/21/the-stripe-webhook-decision-id-revisit.html"><![CDATA[<p>I like writing about decisions that were reasonable, shipped, and still worth revisiting.</p>

<p>Our Stripe webhook path in Trek Point is one of those.</p>

<p>The implementation takes a pragmatic approach:</p>

<ul>
  <li>accept a JSON payload</li>
  <li>require an event id</li>
  <li>re-fetch the event from Stripe using our secret key</li>
  <li>process the safe copy we retrieved directly from Stripe</li>
</ul>

<p>That is not a crazy design. In fact, it has some real advantages. But if I were tightening the system now, this is one of the first places I would look again.</p>

<h2 id="why-we-did-it-this-way">Why We Did It This Way</h2>

<p>The original intuition was straightforward:</p>

<p>If an inbound webhook says it is event <code class="language-plaintext highlighter-rouge">evt_123</code>, do not trust the payload body. Ask Stripe for <code class="language-plaintext highlighter-rouge">evt_123</code> directly and process that canonical version instead.</p>

<p>The appeal is obvious:</p>

<ul>
  <li>the event data comes from Stripe over authenticated API access</li>
  <li>we do not depend on the request body contents except for the id</li>
  <li>we avoid processing obviously forged payload bodies</li>
</ul>

<p>For a product team moving quickly, that feels like a clean trust model.</p>
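<p>A minimal sketch of that flow, with the Stripe client faked so the example is self-contained. Real code would call <code class="language-plaintext highlighter-rouge">stripe.Event.retrieve(event_id)</code> with the account’s secret key; the handler and dispatch names here are invented:</p>

```python
# Hedged sketch of the re-fetch-by-id trust model; fetch_event stands in
# for a real call to stripe.Event.retrieve with our secret key.

def handle_webhook(payload, fetch_event):
    """Trust only the event id from the payload; process the fetched copy."""
    event_id = payload.get("id", "")
    if not event_id.startswith("evt_"):
        return 400, None
    # The canonical copy comes from Stripe's API, not the request body.
    event = fetch_event(event_id)
    if event is None:
        return 404, None
    return 200, event

# Forged body fields are irrelevant; only the id is used for the lookup.
fake_store = {"evt_123": {"id": "evt_123", "type": "invoice.paid"}}
status, event = handle_webhook(
    {"id": "evt_123", "type": "totally.forged"}, fake_store.get
)
assert status == 200 and event["type"] == "invoice.paid"
```

<p>Note what this buys and what it does not: the processed data is authentic, but nothing here proves Stripe sent the request.</p>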

<h2 id="what-this-approach-gets-right">What This Approach Gets Right</h2>

<p>I still think there are legitimate strengths here.</p>

<h2 id="1-it-avoids-blind-trust-in-the-inbound-payload">1. It avoids blind trust in the inbound payload</h2>

<p>That is better than naively consuming whatever object arrived over HTTP.</p>

<h2 id="2-it-centralizes-on-stripes-current-view-of-the-event">2. It centralizes on Stripe’s current view of the event</h2>

<p>For some operational flows, that can be simpler than reasoning about every payload edge case locally.</p>

<h2 id="3-it-fit-our-existing-stripe-integration-model">3. It fit our existing Stripe integration model</h2>

<p>The app already talked to Stripe directly for billing operations, so the mental model was consistent.</p>

<p>Those are real advantages. This was not security negligence. It was a pragmatic trust strategy.</p>

<h2 id="why-id-still-revisit-it">Why I’d Still Revisit It</h2>

<p>The biggest reason is that “re-fetch by id” and “verify webhook authenticity” are not the same thing.</p>

<p>Webhook signature verification answers:</p>

<p>“Did Stripe send this exact request to us?”</p>

<p>Re-fetching by id answers:</p>

<p>“Does this event id exist in Stripe, and can we retrieve it with our account credentials?”</p>

<p>Those are related, but not identical, security properties.</p>

<p>The second concern is operational coupling. Our webhook handler now depends on live Stripe API retrieval during request handling. If Stripe’s API is degraded or our outbound access is impaired, the webhook path becomes more fragile than it needs to be.</p>

<p>That is not hypothetical. Production systems spend a lot of time in partial failure modes.</p>

<h2 id="the-more-questionable-choice-returning-200-on-generic-exceptions">The More Questionable Choice: Returning <code class="language-plaintext highlighter-rouge">200</code> on Generic Exceptions</h2>

<p>The design choice I am less comfortable with today is returning success on broad exceptions specifically to stop Stripe from retrying.</p>

<p>I understand why that happened. Runaway retries can amplify bad incidents, especially if the handler is crashing on a condition retries will not fix.</p>

<p>But the downside is serious:</p>

<ul>
  <li>you may acknowledge work you did not actually complete</li>
  <li>retries stop even if the failure was transient</li>
  <li>recovery becomes a manual reconciliation problem</li>
</ul>

<p>That is the kind of tradeoff teams make under production pressure, but it is also exactly the kind of behavior that deserves a second pass once the system matures.</p>
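<p>The difference between the two acknowledgement policies is easy to show side by side. This sketch is illustrative only; the function names and the list-based dead letter are invented stand-ins:</p>

```python
# Hedged sketch contrasting two webhook acknowledgement policies.

def handle_ack_on_error(process, event):
    """Current behavior: always return 200 so Stripe stops retrying."""
    try:
        process(event)
    except Exception:
        pass  # acknowledged but not completed: a manual reconciliation problem
    return 200

def handle_with_dead_letter(process, event, dead_letter):
    """Alternative: preserve failed work so retries or operators can recover."""
    try:
        process(event)
        return 200
    except Exception:
        dead_letter.append(event)  # keep the event for deliberate recovery
        return 500  # non-2xx, so Stripe will retry

def boom(event):
    raise RuntimeError("transient failure")

dead = []
assert handle_ack_on_error(boom, {"id": "evt_1"}) == 200   # work silently lost
assert handle_with_dead_letter(boom, {"id": "evt_1"}, dead) == 500
assert dead == [{"id": "evt_1"}]
```

<p>Both stop runaway retry storms eventually; only one of them remembers what it failed to do.</p>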

<h2 id="what-id-do-differently-now">What I’d Do Differently Now</h2>

<p>I would likely move toward:</p>

<ul>
  <li>verifying Stripe webhook signatures on ingress</li>
  <li>treating idempotency and replay handling as first-class concerns</li>
  <li>retrying or dead-lettering failures more deliberately instead of broadly suppressing them</li>
  <li>reserving <code class="language-plaintext highlighter-rouge">200</code>-on-error behavior for cases we can prove are non-recoverable and safely ignorable</li>
</ul>

<p>I might still keep the ability to re-fetch event details when useful, but I would not want that to be the primary authenticity model.</p>
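<p>For reference, the official Stripe SDKs implement ingress verification via <code class="language-plaintext highlighter-rouge">stripe.Webhook.construct_event</code>. The sketch below reimplements the core idea (HMAC-SHA256 over a timestamped payload, with a replay-tolerance window) so it runs standalone; the secret is invented, and real code should use the SDK helper rather than this:</p>

```python
import hashlib
import hmac
import time

# Hedged sketch of Stripe-style signature verification. The SDK's
# stripe.Webhook.construct_event implements this scheme plus header
# parsing and multi-signature handling; prefer it in production.

def sign(payload: bytes, secret: str, timestamp: int) -> str:
    signed = f"{timestamp}.".encode() + payload
    return hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()

def verify(payload: bytes, header_sig: str, timestamp: int,
           secret: str, tolerance: int = 300) -> bool:
    if abs(time.time() - timestamp) > tolerance:
        return False  # reject signatures outside the replay window
    expected = sign(payload, secret, timestamp)
    return hmac.compare_digest(expected, header_sig)

secret = "whsec_example"  # invented secret for the sketch
body = b'{"id": "evt_123"}'
ts = int(time.time())
assert verify(body, sign(body, secret, ts), ts, secret)
assert not verify(b'{"id": "evt_tampered"}', sign(body, secret, ts), ts, secret)
```

<p>This answers the question re-fetching cannot: did Stripe send this exact request to us, now?</p>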

<h2 id="why-this-is-a-good-example-of-real-engineering-tradeoffs">Why This Is a Good Example of Real Engineering Tradeoffs</h2>

<p>I like this case because it is not a “we were wrong and now we are wise” story.</p>

<p>It is a story about shipping under realistic constraints:</p>

<ul>
  <li>we needed a workable webhook trust model</li>
  <li>we wanted to reduce attack surface from untrusted payload bodies</li>
  <li>we did not want billing incidents to spiral under repeated retries</li>
</ul>

<p>Those are all reasonable concerns. The code reflects a team trying to make the system sturdy with the tools and time it had.</p>

<p>That is why these are the most useful kinds of postmortem-adjacent decisions to write about. The original implementation makes sense. It just no longer feels like the final version I would want.</p>

<h2 id="the-broader-lesson">The Broader Lesson</h2>

<p>Security and reliability decisions are often made in combination, not isolation.</p>

<p>Our Stripe webhook path is a good example:</p>

<ul>
  <li>trust model</li>
  <li>failure handling</li>
  <li>retry behavior</li>
  <li>billing consistency</li>
</ul>

<p>all show up in one small endpoint.</p>

<p>That is why I think webhook handling is a great place to study a team’s maturity. It forces you to reveal what you value most when the design is imperfect: simplicity, authenticity, resilience, or operational containment.</p>

<p>In Trek Point’s case, we picked a pragmatic path that shipped. I am glad we did. I also think it is one of the clearest places where “good enough to launch” and “what I want in the long term” are not the same answer.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[I like writing about decisions that were reasonable, shipped, and still worth revisiting.]]></summary></entry><entry><title type="html">Redis Quietly Became Our Tiny Control Plane</title><link href="/engineering/2026/03/20/redis-became-our-control-plane.html" rel="alternate" type="text/html" title="Redis Quietly Became Our Tiny Control Plane" /><published>2026-03-20T10:00:00+00:00</published><updated>2026-03-20T10:00:00+00:00</updated><id>/engineering/2026/03/20/redis-became-our-control-plane</id><content type="html" xml:base="/engineering/2026/03/20/redis-became-our-control-plane.html"><![CDATA[<p>Redis starts innocently in most web apps.</p>

<p>You add it for one thing:</p>

<ul>
  <li>a task queue broker</li>
  <li>a cache</li>
  <li>maybe rate limiting</li>
</ul>

<p>Then enough practical needs pile up and suddenly Redis is not just an infrastructure dependency. It is the place where your product stores operational intent.</p>

<p>That happened in Trek Point.</p>

<p>Redis ended up backing:</p>

<ul>
  <li>Celery broker and result backend</li>
  <li>rate limiting</li>
  <li>feature flags</li>
  <li>runtime product settings</li>
</ul>

<p>At that point it is fair to say Redis became a small control plane for the application.</p>

<h2 id="why-this-happened">Why This Happened</h2>

<p>Because it was useful.</p>

<p>That is the honest answer.</p>

<p>There are a lot of low-friction product and operational decisions that do not justify a new table, admin surface, or deployment just to change one value. Redis made those decisions easy to externalize.</p>

<p>Examples:</p>

<ul>
  <li>toggling feature availability across workers</li>
  <li>changing a free-tier GPX export limit without a code deploy</li>
  <li>keeping rate limiting shared across processes</li>
  <li>running async jobs without introducing another moving part</li>
</ul>

<p>For a small team shipping quickly, that is a great trade.</p>

<h2 id="the-best-part-of-this-pattern">The Best Part of This Pattern</h2>

<p>Redis let us centralize controls that benefit from being:</p>

<ul>
  <li>shared across processes</li>
  <li>fast to read</li>
  <li>easy to mutate operationally</li>
  <li>resilient to deploy boundaries</li>
</ul>

<p>I especially like runtime settings in this category. Being able to change something like a free-tier threshold without redeploying is not glamorous, but it is exactly the kind of leverage product teams need when they are learning.</p>

<p>That is the difference between configuration as code and configuration as a live product control.</p>
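<p>A hedged sketch of that shape, with an invented setting key and an in-memory stand-in for the Redis client:</p>

```python
# Hypothetical runtime setting with a code default; key name invented.

DEFAULTS = {"free_tier_gpx_exports": 10}

def get_setting(redis_client, name):
    try:
        raw = redis_client.get(f"setting:{name}")
    except Exception:
        raw = None  # control-plane trouble falls back to the shipped default
    if raw is None:
        return DEFAULTS[name]
    return int(raw)

class FakeRedis:
    def __init__(self, data):
        self.data = data

    def get(self, key):
        return self.data.get(key)

# An operator can raise the limit live, without a deploy:
assert get_setting(FakeRedis({}), "free_tier_gpx_exports") == 10
assert get_setting(FakeRedis({"setting:free_tier_gpx_exports": b"25"}),
                   "free_tier_gpx_exports") == 25
```

<p>The default in code is what makes the live value safe to lose, which matters more later in this post.</p>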

<h2 id="the-risk-is-blast-radius">The Risk Is Blast Radius</h2>

<p>The downside is also obvious once you say it out loud:</p>

<p>if one Redis dependency backs jobs, rate limits, feature flags, and runtime settings, then Redis trouble can degrade multiple unrelated parts of the product at once.</p>

<p>That is the real trade.</p>

<p>It is not a problem when Redis is healthy. It is a design characteristic when Redis is not.</p>

<p>In Trek Point, that means one dependency influences:</p>

<ul>
  <li>whether Celery work flows</li>
  <li>whether API clients hit limits properly</li>
  <li>whether admin-controlled feature availability is honored</li>
  <li>whether runtime quota settings fall back to defaults</li>
</ul>

<p>That is more than “just caching.”</p>

<h2 id="failure-defaults-matter-a-lot">Failure Defaults Matter a Lot</h2>

<p>One thing I appreciated in this codebase is that some Redis-backed behaviors degrade intentionally.</p>

<p>For example:</p>

<ul>
  <li>feature flags default to enabled when Redis cannot be read</li>
  <li>runtime settings fall back to code defaults</li>
</ul>

<p>Those defaults are not arbitrary. They tell you what kind of failure the team considered safer:</p>

<ul>
  <li>keep the product broadly available</li>
  <li>avoid hard failures on transient control-plane issues</li>
</ul>

<p>That is a reasonable bias for user-facing product behavior, even if it means you temporarily lose some operational precision.</p>
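<p>A minimal sketch of that fail-open flag read. The key naming and fallback behavior here are illustrative assumptions, not our exact code:</p>

```python
# Hedged sketch: feature flags default to enabled when Redis is unreadable.

def flag_enabled(redis_client, name, default=True):
    try:
        value = redis_client.get(f"feature:{name}")
    except Exception:
        return default  # control-plane trouble should not dark-launch features
    if value is None:
        return default  # unset flag: keep the product broadly available
    return value in (b"1", b"true", "1", "true")

class DownRedis:
    """Stand-in for a Redis client during an outage."""
    def get(self, key):
        raise ConnectionError("redis unavailable")

# Redis down: the flag fails open instead of failing the request.
assert flag_enabled(DownRedis(), "gpx_export") is True
```

<p>The inverse default (fail closed) is equally one line of code; the point is that someone chose, deliberately, which failure is safer.</p>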

<h2 id="why-i-still-like-this-design">Why I Still Like This Design</h2>

<p>I would not dismiss this as a hack. For a product at Trek Point’s stage, this is exactly the kind of internal platform decision that keeps velocity high.</p>

<p>Redis is doing real work here:</p>

<ul>
  <li>distributing state cheaply</li>
  <li>giving product and ops levers without schema churn</li>
  <li>supporting both infrastructure and business behavior</li>
</ul>

<p>That is not architectural laziness. That is a practical platform layer.</p>

<h2 id="where-id-draw-the-line-later">Where I’d Draw the Line Later</h2>

<p>I would keep this pattern for a while, but I would watch for signals that it is time to split responsibilities:</p>

<ul>
  <li>different availability expectations for jobs versus feature controls</li>
  <li>operational confusion about which Redis failures affect which subsystems</li>
  <li>more auditability required around settings changes</li>
  <li>a need for richer administrative history or validation</li>
</ul>

<p>That is when a tiny control plane starts wanting stronger product boundaries of its own.</p>

<h2 id="the-bigger-lesson">The Bigger Lesson</h2>

<p>Infrastructure choices often become product choices gradually.</p>

<p>Redis is a good example because it is so easy to reach for. Over time it becomes the place where your system stores shared decisions, not just shared data.</p>

<p>That is what happened for Trek Point. We did not set out to build a control plane. We set out to solve a handful of practical problems quickly and ended up with a lightweight operational substrate that the rest of the app now relies on.</p>

<p>That is worth recognizing explicitly, because once a dependency is carrying control-plane semantics, you should operate it with much more respect than a generic cache.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[Redis starts innocently in most web apps.]]></summary></entry><entry><title type="html">Designing an OpenAPI Contract for a Product That Is Not Just an API</title><link href="/engineering/2026/03/19/openapi-contract-for-a-product-not-just-an-api.html" rel="alternate" type="text/html" title="Designing an OpenAPI Contract for a Product That Is Not Just an API" /><published>2026-03-19T10:00:00+00:00</published><updated>2026-03-19T10:00:00+00:00</updated><id>/engineering/2026/03/19/openapi-contract-for-a-product-not-just-an-api</id><content type="html" xml:base="/engineering/2026/03/19/openapi-contract-for-a-product-not-just-an-api.html"><![CDATA[<p>A lot of engineering teams treat OpenAPI as paperwork. The spec gets updated at the end, the examples go stale, and everyone agrees the code is the real source of truth.</p>

<p>I think that view misses what good API contracts actually do. In gateway products, the spec is often the clearest expression of what the platform wants developers to believe.</p>

<p>That is why <code class="language-plaintext highlighter-rouge">openapi.yaml</code> in this repo is more interesting than it looks.</p>

<h2 id="this-service-is-documenting-two-products-at-once">This service is documenting two products at once</h2>

<p>The gateway exposes a native Valhalla-style API surface and a Mapbox-compatible Directions endpoint. Those are not just two routes. They are two integration stories.</p>

<p>The OpenAPI spec tells developers:</p>

<ul>
  <li>this is one hosted service</li>
  <li>here is how auth works</li>
  <li>here are the public system endpoints</li>
  <li>here is the compatibility surface you can target if you already speak Mapbox</li>
</ul>

<p>That matters because a gateway like this is not just forwarding traffic. It is trying to become the stable contract in front of a more complicated backend topology.</p>

<h2 id="why-this-is-product-work">Why this is product work</h2>

<p>Once you publish a contract, you are making a promise about more than field names.</p>

<p>You are promising things like:</p>

<ul>
  <li>which endpoints are public and which are protected</li>
  <li>what shape errors take</li>
  <li>which query parameters are worth supporting long-term</li>
  <li>how much compatibility developers can rely on</li>
</ul>

<p>In other words, your spec is part of your distribution strategy.</p>

<p>For a product trying to attract developers, that is huge. A clear contract lowers trial friction, makes integration easier, and gives people confidence that the service was designed intentionally.</p>

<h2 id="what-i-like-in-this-spec">What I like in this spec</h2>

<p>I like that <code class="language-plaintext highlighter-rouge">/health</code> and <code class="language-plaintext highlighter-rouge">/metrics</code> are explicitly marked with empty security requirements. I like that the Mapbox-style route is documented with parameter enums and examples. I like that the API key scheme is described clearly instead of being buried in prose.</p>
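<p>That pattern is worth showing, because it is easy to miss in a large spec. This sketch uses invented paths and scheme names; in OpenAPI 3, an empty <code class="language-plaintext highlighter-rouge">security</code> array on an operation overrides the global requirement:</p>

```yaml
# Hedged OpenAPI 3 sketch; scheme and path names are invented.
security:
  - ApiKeyAuth: []          # global default: API key required

paths:
  /health:
    get:
      summary: Liveness check, intentionally public
      security: []          # empty array overrides the global requirement
      responses:
        "200":
          description: Service is up
```

<p>One empty array is the difference between “we forgot auth here” and “this endpoint is public on purpose,” and the spec is the only place that distinction is legible.</p>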

<p>Those details matter because developers do not experience your architecture directly. They experience your contract.</p>

<p>I also like that the existence of a compatibility endpoint is visible in the spec itself. That sends a strong message: this service understands migration and interoperability as first-class concerns.</p>

<h2 id="where-specs-usually-drift">Where specs usually drift</h2>

<p>The hard part is keeping documentation honest when runtime behavior has nuance.</p>

<p>In this codebase, some of the most interesting nuances live outside the raw endpoint descriptions:</p>

<ul>
  <li>cross-region routing is orchestrated dynamically</li>
  <li>some Mapbox-like features are best-effort translations, not exact equivalence</li>
  <li>auth enforcement can move between app code and deployment boundary</li>
  <li>rate limiting is documented as an operational topic, not just a response code</li>
</ul>

<p>That is normal. A spec cannot hold every architectural detail. But when the gap between documented contract and lived behavior gets too large, developers stop trusting the spec.</p>

<h2 id="what-i-would-add">What I would add</h2>

<p>If I were polishing this for external adoption, I would add three things.</p>

<p>First, a short compatibility note describing which parts of the Mapbox surface are exact, which are approximate, and which are intentionally unsupported.</p>

<p>Second, clearer documentation around degraded states and upstream dependency behavior. Health endpoints are documented, but operational semantics are where production APIs often earn or lose trust.</p>

<p>Third, examples for common cross-region requests, because those are the requests that make this gateway special.</p>

<h2 id="the-lesson">The lesson</h2>

<p>OpenAPI is not just a generated artifact. It is one of the sharpest ways to communicate engineering intent.</p>

<p>In a system like this, the contract is doing real strategic work. It turns a multi-region routing architecture into something developers can adopt without learning the topology behind it. That is not paperwork. That is architecture translated into a usable product surface.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[A lot of engineering teams treat OpenAPI as paperwork. The spec gets updated at the end, the examples go stale, and everyone agrees the code is the real source of truth.]]></summary></entry><entry><title type="html">Observability in a Flask + Celery App Is Easy Until You Instrument It Twice</title><link href="/engineering/2026/03/18/observability-in-a-flask-plus-celery-app.html" rel="alternate" type="text/html" title="Observability in a Flask + Celery App Is Easy Until You Instrument It Twice" /><published>2026-03-18T10:00:00+00:00</published><updated>2026-03-18T10:00:00+00:00</updated><id>/engineering/2026/03/18/observability-in-a-flask-plus-celery-app</id><content type="html" xml:base="/engineering/2026/03/18/observability-in-a-flask-plus-celery-app.html"><![CDATA[<p>Most observability tutorials assume a simpler world than the one production Python apps actually live in.</p>

<p>They assume:</p>

<ul>
  <li>one app process</li>
  <li>one startup path</li>
  <li>one instrumentation moment</li>
  <li>one idea of request lifecycle</li>
</ul>

<p>Trek Point is not that world.</p>

<p>We have:</p>

<ul>
  <li>a Flask app factory</li>
  <li>Gunicorn-style web processes</li>
  <li>Celery workers</li>
  <li>SQLAlchemy engines that should be instrumented once</li>
  <li>requests, Redis, and task execution crossing process boundaries</li>
</ul>

<p>That means the hard part of observability is not “how do we emit spans?” It is “how do we avoid producing a noisy, misleading mess?”</p>

<h2 id="why-we-used-both-sentry-and-opentelemetry">Why We Used Both Sentry and OpenTelemetry</h2>

<p>I do not believe one tool cleanly solves all observability needs for most product teams.</p>

<p>For Trek Point:</p>

<ul>
  <li>Sentry gives us application error visibility and a familiar debugging workflow</li>
  <li>OpenTelemetry gives us a path for traces and logs across Flask, SQLAlchemy, Celery, Redis, and outbound HTTP</li>
</ul>

<p>Those tools are not redundant. They answer different questions.</p>

<p>When a request crashes, Sentry is often the fastest route to the error. When a request is merely slow, fragmented across services, or degraded somewhere in a queue-backed path, tracing becomes more valuable.</p>

<p>That division of labor is healthy.</p>

<h2 id="the-real-problem-was-instrumentation-lifecycle">The Real Problem Was Instrumentation Lifecycle</h2>

<p>What bit us conceptually was not how to turn tracing on. It was deciding when instrumentation happens.</p>

<p>In an app-factory world, <code class="language-plaintext highlighter-rouge">create_app()</code> may run more often than you think:</p>

<ul>
  <li>once for the web app</li>
  <li>again in worker contexts</li>
  <li>sometimes twice per process depending on boot paths and imports</li>
</ul>

<p>That makes “instrument everything during startup” trickier than it sounds. If you patch SQLAlchemy, Flask, Celery, requests, or Redis repeatedly, you can end up with warnings, duplicate hooks, or inconsistent runtime behavior.</p>

<p>That is why I liked the discipline in our telemetry setup: treat cross-cutting instrumentors as per-process singletons, guard them carefully, and only instrument the app itself when needed.</p>

<p>This is the kind of detail that does not show up in architecture diagrams but absolutely matters in production.</p>

<h2 id="sqlalchemy-was-a-good-example">SQLAlchemy Was a Good Example</h2>

<p>Database instrumentation is often deceptively stateful.</p>

<p>If you instrument after engines are already created, you can miss query spans from those engines entirely. If you instrument too broadly on every app startup, you can get duplicate instrumentation warnings. In a codebase with an app factory and worker imports, the timing matters.</p>

<p>That is why observability code deserves the same design care as business logic. It is not just config.</p>

<h2 id="logs-traces-and-errors-need-a-shared-mental-model">Logs, Traces, and Errors Need a Shared Mental Model</h2>

<p>One thing I try to avoid is collecting every possible signal without deciding how engineers should use them.</p>

<p>The better question is:</p>

<p>“What debugging story are we trying to support?”</p>

<p>For Trek Point, the useful story looked something like this:</p>

<ul>
  <li>an exception reaches Sentry</li>
  <li>traces show the request path, SQL timing, Redis behavior, and outbound requests</li>
  <li>task execution can be correlated when work moves from request thread to Celery</li>
  <li>logs can be exported with the same service identity into the same telemetry backend</li>
</ul>

<p>That is much better than a tool-by-tool rollout where each signal exists in isolation.</p>
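<p>The correlation piece can be as small as threading one id through task headers. A sketch with an inline stand-in for the real enqueue call (the header name and <code class="language-plaintext highlighter-rouge">enqueue</code> function are illustrative, not our actual wiring):</p>

```python
import uuid

def new_request_id():
    return uuid.uuid4().hex

def enqueue(task_fn, payload, headers):
    # Stand-in for a broker-backed dispatch such as apply_async();
    # calling the task inline keeps the data flow visible and testable.
    return task_fn(payload, headers)

def process_activity(payload, headers):
    # Worker side: pull the correlation id back out so logs and spans
    # from the task can be joined to the originating web request.
    request_id = headers.get("x-request-id", "unknown")
    return {"processed": payload, "request_id": request_id}

def handle_upload(payload):
    request_id = new_request_id()
    result = enqueue(process_activity, payload, {"x-request-id": request_id})
    return request_id, result
```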

<h2 id="production-deployment-details-matter">Production Deployment Details Matter</h2>

<p>Telemetry setup is one of those areas where local success tells you almost nothing.</p>

<p>A production-ready setup has to account for:</p>

<ul>
  <li>exporter configuration</li>
  <li>service naming</li>
  <li>sampling strategy</li>
  <li>process model</li>
  <li>whether instrumentation is safe under repeated boot</li>
</ul>

<p>I have seen plenty of teams “add OpenTelemetry” and still end up blind because the lifecycle assumptions were wrong. Instrumentation is code. It needs to be reviewed with runtime behavior in mind.</p>

<h2 id="what-id-encourage-more-teams-to-do">What I’d Encourage More Teams to Do</h2>

<p>Treat observability setup as a first-class subsystem, not a wrapper around environment variables.</p>

<p>That means:</p>

<ul>
  <li>document how each process type is instrumented</li>
  <li>guard singleton patchers carefully</li>
  <li>decide what each telemetry tool is responsible for</li>
  <li>trace the paths users actually care about, not just happy-path web requests</li>
</ul>

<p>In products like Trek Point, some of the most interesting failures happen between the request and the worker, or between the upload and the derived media. If your observability story stops at Flask requests, you are missing half the product.</p>

<h2 id="the-main-lesson">The Main Lesson</h2>

<p>The difficulty of observability in Python is rarely “can we install the package?” The difficulty is making instrumentation reflect the real execution model of the app.</p>

<p>In Trek Point, the good work was not just turning on tracing. It was being explicit about repeated startup paths, singleton instrumentation, and how web requests, database work, outbound calls, Redis, and Celery should appear as one understandable system.</p>

<p>That is what observability should do: make a multi-part product feel legible when it misbehaves.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[Most observability tutorials assume a simpler world than the one production Python apps actually live in.]]></summary></entry><entry><title type="html">Observability Before `app.listen()`: Preloading OpenTelemetry, Sentry, and Pino in Node</title><link href="/engineering/2026/03/17/observability-before-app-listen.html" rel="alternate" type="text/html" title="Observability Before `app.listen()`: Preloading OpenTelemetry, Sentry, and Pino in Node" /><published>2026-03-17T10:00:00+00:00</published><updated>2026-03-17T10:00:00+00:00</updated><id>/engineering/2026/03/17/observability-before-app-listen</id><content type="html" xml:base="/engineering/2026/03/17/observability-before-app-listen.html"><![CDATA[<p>There are two ways teams usually add observability.</p>

<p>The first is deliberate: initialize telemetry before the application boots so you capture startup, request lifecycle, and error context from the beginning.</p>

<p>The second is common: bolt things on later, discover blind spots during incidents, and slowly fill them in while promising to clean it up “next sprint.”</p>

<p>This gateway leans toward the first path, and that is one of the strongest signals that it was built with production in mind.</p>

<h2 id="the-key-design-choice">The key design choice</h2>

<p>In <code class="language-plaintext highlighter-rouge">package.json</code>, both <code class="language-plaintext highlighter-rouge">start</code> and <code class="language-plaintext highlighter-rouge">dev</code> run Node with <code class="language-plaintext highlighter-rouge">--import ./src/instrument.js</code> before starting <code class="language-plaintext highlighter-rouge">src/server.js</code>.</p>

<p>That small startup detail does a lot of work. It ensures OpenTelemetry and Sentry are initialized before the Express app is imported and before any requests are handled. If you care about capturing the full request path and startup behavior, initialization order matters.</p>

<p>Too many services get this wrong and then wonder why their traces start halfway through the stack.</p>

<h2 id="what-this-setup-includes">What this setup includes</h2>

<p><code class="language-plaintext highlighter-rouge">src/instrument.js</code> wires together three observability layers:</p>

<ul>
  <li>OpenTelemetry auto-instrumentation with an OTLP exporter to Honeycomb</li>
  <li>Sentry error and log ingestion</li>
  <li>Pino-based application and HTTP logging through <code class="language-plaintext highlighter-rouge">src/services/logger.js</code></li>
</ul>

<p>I like this combination because each tool is doing a distinct job:</p>

<ul>
  <li>traces explain request flow and dependency timing</li>
  <li>logs explain local events and debugging detail</li>
  <li>Sentry captures failures and gives operators an incident workflow</li>
</ul>

<p>The service is not trying to force one tool to solve every problem.</p>

<h2 id="the-privacy-detail-i-was-happy-to-see">The privacy detail I was happy to see</h2>

<p><code class="language-plaintext highlighter-rouge">beforeSend()</code> removes <code class="language-plaintext highlighter-rouge">x-api-key</code> and <code class="language-plaintext highlighter-rouge">authorization</code> headers before events leave the process. That is one of those details that separates “we installed Sentry” from “we thought about operating this safely.”</p>

<p>Telemetry systems are where sensitive data goes to become permanent if you are careless. Scrubbing secrets at the boundary is not glamorous work, but it is the sort of thing mature teams automate early.</p>

<h2 id="what-is-good-and-what-is-missing">What is good and what is missing</h2>

<p>The code already captures a lot of value with relatively little machinery. Prometheus metrics exist in <code class="language-plaintext highlighter-rouge">src/services/metrics.js</code>, request IDs are added in <code class="language-plaintext highlighter-rouge">src/middleware/requestContext.js</code>, upstream attempts are logged with timing in <code class="language-plaintext highlighter-rouge">src/services/httpClient.js</code>, and Express error handling is connected to Sentry.</p>

<p>But the observability story is not finished, which makes it interesting.</p>

<p>The custom metrics expose a request counter labeled by endpoint, mode, and status. That is useful, but it is only the start of a RED-style metrics view. I would want latency histograms, cache hit ratios, breaker-open counters, and region-level upstream health metrics before calling this fully mature.</p>
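<p>The gateway itself is Node, but the shape of a Prometheus-style latency histogram is language-agnostic; here it is sketched in Python with illustrative bucket bounds (cumulative <code class="language-plaintext highlighter-rouge">le</code> counts plus a running sum and count, the raw material for rate and percentile views):</p>

```python
# Prometheus-style histogram: cumulative counts per upper bound ("le"),
# plus total count and sum. Bucket bounds here are a guess, not tuned.
BUCKETS = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5]  # seconds

def new_histogram():
    return {"count": 0, "sum": 0.0, "le": {le: 0 for le in BUCKETS}}

def observe(hist, seconds):
    hist["count"] += 1
    hist["sum"] += seconds
    for le in BUCKETS:
        if seconds <= le:
            hist["le"][le] += 1  # cumulative: every bound >= value counts
```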

<h2 id="why-i-still-like-the-current-approach">Why I still like the current approach</h2>

<p>This code is a good example of choosing the right first 80 percent. It does not try to build an internal observability platform. It uses standard tools, initializes them early, and keeps the integration close to the app lifecycle.</p>

<p>That is exactly the sort of restraint I want in a service like this. Instrumentation should make the system clearer, not become a second system that also needs constant care.</p>

<h2 id="what-i-would-improve-next">What I would improve next</h2>

<p>I would add:</p>

<ul>
  <li>Prometheus histograms for request duration and upstream latency</li>
  <li>explicit cache metrics for LRU and Redis tiers</li>
  <li>region labels for upstream error counts</li>
  <li>a small dashboard that correlates health probes, breaker events, and request latency</li>
</ul>

<p>With those additions, operators could answer the important questions faster: is the gateway slow, is one region unhealthy, or is the cache just cold?</p>

<h2 id="the-lesson">The lesson</h2>

<p>Good observability is not about collecting everything. It is about collecting the right signals early enough that you can explain the system under stress.</p>

<p>Preloading telemetry before <code class="language-plaintext highlighter-rouge">app.listen()</code> is a strong architectural move because it says observability is part of the runtime contract, not an afterthought. The tools may change later. That design instinct should not.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[There are two ways teams usually add observability.]]></summary></entry><entry><title type="html">The Hidden Complexity of GPX and FIT Support in a Route-Planning Product</title><link href="/engineering/2026/03/15/the-hidden-complexity-of-gpx-and-fit.html" rel="alternate" type="text/html" title="The Hidden Complexity of GPX and FIT Support in a Route-Planning Product" /><published>2026-03-15T10:00:00+00:00</published><updated>2026-03-15T10:00:00+00:00</updated><id>/engineering/2026/03/15/the-hidden-complexity-of-gpx-and-fit</id><content type="html" xml:base="/engineering/2026/03/15/the-hidden-complexity-of-gpx-and-fit.html"><![CDATA[<p>“Supports GPX and FIT” is the kind of bullet point that looks tiny on a pricing page.</p>

<p>In a route-planning product, it is not tiny at all.</p>

<p>By the time we had Trek Point handling real uploads and exports, that one feature implied:</p>

<ul>
  <li>file validation</li>
  <li>size limits</li>
  <li>parsing multiple formats with different quirks</li>
  <li>coordinate chain extraction</li>
  <li>activity metric derivation</li>
  <li>playback data generation</li>
  <li>route enrichment through external services</li>
  <li>export constraints tied to both product and plan</li>
</ul>

<p>The feature sounds like file handling. It is really a pipeline.</p>

<h2 id="import-is-more-than-can-we-parse-the-file">Import Is More Than “Can We Parse the File”</h2>

<p>The first job is obvious: read GPX and FIT and turn them into a usable coordinate chain.</p>

<p>The second job is where the product value appears:</p>

<ul>
  <li>derive distance and duration</li>
  <li>preserve timestamps where available</li>
  <li>build playback-friendly <code class="language-plaintext highlighter-rouge">track_series</code></li>
  <li>compute start and end points</li>
  <li>normalize a route representation the rest of the app can use</li>
</ul>

<p>At that point you are no longer just accepting a file format. You are translating an external representation into your internal activity model.</p>

<p>That translation layer becomes a product boundary in its own right.</p>

<h2 id="good-route-products-always-end-up-caring-about-derived-data">Good Route Products Always End Up Caring About Derived Data</h2>

<p>A raw track is rarely enough.</p>

<p>Users expect more than “we stored your points.” They expect:</p>

<ul>
  <li>route playback</li>
  <li>charts</li>
  <li>distance summaries</li>
  <li>meaningful previews</li>
  <li>export behavior that feels consistent across imported and native data</li>
</ul>

<p>That is why Trek Point’s processing pipeline computes secondary structures instead of leaving the file as the source of truth for every UI.</p>

<p>This is one of those choices that adds work early but saves pain later. If every screen has to reinterpret GPX or FIT raw data differently, the product stops being coherent.</p>

<h2 id="decimation-was-not-just-an-optimization">Decimation Was Not Just an Optimization</h2>

<p>One of the more interesting details in the code is that we decimate coordinate chains before sending them to Valhalla to derive ordered way-type segments.</p>

<p>You could describe that as a performance tweak. It is also a product decision.</p>

<p>External services have practical input limits, latency costs, and diminishing returns. We do not need every point in a dense track to classify the route meaningfully. But if we decimate too aggressively, the route loses fidelity and downstream labeling gets worse.</p>

<p>That tension is common in map products:</p>

<ul>
  <li>too much fidelity is expensive</li>
  <li>too little fidelity is misleading</li>
</ul>

<p>There is no perfect answer. There is only a product-informed threshold.</p>
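<p>The mechanics can stay simple even when the threshold is a judgment call. A sketch of endpoint-preserving decimation (the cap below is illustrative, not Trek Point's actual limit):</p>

```python
def decimate(points, max_points=500):
    """Thin a coordinate chain to at most max_points, keeping both endpoints."""
    if len(points) <= max_points:
        return list(points)
    # Spread the kept indices evenly across the original chain.
    step = (len(points) - 1) / (max_points - 1)
    kept = [points[round(i * step)] for i in range(max_points)]
    kept[-1] = points[-1]  # guarantee the exact final point survives
    return kept
```

<p>Picking <code class="language-plaintext highlighter-rouge">max_points</code> is the product decision; the code is the easy part.</p>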

<h2 id="export-adds-a-second-layer-of-constraints">Export Adds a Second Layer of Constraints</h2>

<p>Supporting export sounds like the inverse of import. It is not.</p>

<p>Now you care about:</p>

<ul>
  <li>output format expectations</li>
  <li>coordinate count caps</li>
  <li>attachment naming</li>
  <li>browser download behavior</li>
  <li>whether the current user is allowed to export that format at all</li>
</ul>

<p>In Trek Point, FIT export from the planner is gated both by entitlement and by coordinate count. That is the right shape of guardrail. It protects the product from abuse and the user from asking the system to do something we know will be brittle.</p>

<h2 id="billing-and-file-formats-end-up-intertwined">Billing and File Formats End Up Intertwined</h2>

<p>This is one of the more “real product” aspects of the implementation.</p>

<p>Not every export format is equal from a product perspective. GPX can be part of a free-tier experience with limits. FIT and GeoJSON are stronger signals of advanced use and can sit behind premium access.</p>

<p>People sometimes criticize this kind of gating as arbitrary, but it is often a good reflection of actual cost and value:</p>

<ul>
  <li>advanced exports attract advanced users</li>
  <li>those users stress the product differently</li>
  <li>those formats often require more support and correctness guarantees</li>
</ul>

<p>The important thing is to gate capabilities consistently and transparently, not to pretend all bytes are equal.</p>

<h2 id="testing-the-edge-cases-was-worth-it">Testing the Edge Cases Was Worth It</h2>

<p>One of the tests I like in Trek Point verifies that GPX processing emits <code class="language-plaintext highlighter-rouge">time_dist</code>, because route playback depends on it. That is a perfect example of a high-value test.</p>

<p>It is not just checking parser output mechanically. It is protecting an actual user-visible capability.</p>

<p>The best tests in import/export systems are rarely “did function X return dict Y.” They are “does this uploaded artifact still support the product behavior users depend on?”</p>
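<p>To make that concrete, here is the shape of such a derived structure: cumulative (seconds, meters) pairs computed with a haversine distance. The layout and field names are illustrative, not Trek Point's actual schema:</p>

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def time_dist(points):
    """points: [(epoch_seconds, lat, lon), ...] -> cumulative (sec, meters) pairs."""
    out, total = [], 0.0
    for i, (t, lat, lon) in enumerate(points):
        if i:
            _, plat, plon = points[i - 1]
            total += haversine_m(plat, plon, lat, lon)
        out.append((t - points[0][0], round(total, 1)))
    return out
```

<p>A playback test then asserts on this structure, not on parser internals.</p>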

<h2 id="what-id-tell-teams-building-similar-features">What I’d Tell Teams Building Similar Features</h2>

<p>Do not let file-format support get trapped in a storage mindset.</p>

<p>Ask these questions early:</p>

<ul>
  <li>what is the normalized internal representation?</li>
  <li>what derived data should be computed once instead of repeatedly?</li>
  <li>what external enrichment is worth the latency and failure modes?</li>
  <li>what are the safe limits for upload and export?</li>
  <li>which tests protect product behavior, not just parser mechanics?</li>
</ul>

<p>If you answer those well, GPX and FIT support becomes a capability. If you do not, it stays a pile of parsers.</p>

<h2 id="the-lesson">The Lesson</h2>

<p>Import/export features are often marketed like interoperability checkboxes. In reality they expose how mature your product model is.</p>

<p>Trek Point’s GPX and FIT support became valuable when we stopped thinking of the files as endpoints and started treating them as inputs and outputs of a richer route and activity pipeline.</p>

<p>That is the hidden complexity. The files matter, but the product meaning you derive from them matters more.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[“Supports GPX and FIT” is the kind of bullet point that looks tiny on a pricing page.]]></summary></entry><entry><title type="html">The Rate Limiter Worked in Staging and Failed in Production: Identity, Replicas, and Shared Counters</title><link href="/engineering/2026/03/14/rate-limiter-worked-in-staging-failed-in-production.html" rel="alternate" type="text/html" title="The Rate Limiter Worked in Staging and Failed in Production: Identity, Replicas, and Shared Counters" /><published>2026-03-14T10:00:00+00:00</published><updated>2026-03-14T10:00:00+00:00</updated><id>/engineering/2026/03/14/rate-limiter-worked-in-staging-failed-in-production</id><content type="html" xml:base="/engineering/2026/03/14/rate-limiter-worked-in-staging-failed-in-production.html"><![CDATA[<p>One of my favorite engineering stories is when the code is technically correct and operationally wrong.</p>

<p>This gateway has one of those stories hiding in plain sight.</p>

<p>In <code class="language-plaintext highlighter-rouge">src/app.js</code>, the service uses <code class="language-plaintext highlighter-rouge">express-rate-limit</code> before <code class="language-plaintext highlighter-rouge">apiKeyAuth</code>. That means almost every request is subject to a single global limiter unless the key is whitelisted. It is simple, easy to add, and absolutely good enough to stop accidental overload in a small deployment.</p>

<p>It is also not the policy the product actually wants.</p>

<h2 id="the-mismatch">The mismatch</h2>

<p>The business identity in this system is the API key. But because the limiter runs before auth, the effective identity is the default one used by the middleware, which is basically the client IP.</p>

<p>That seems minor until you think about what the product probably wants to enforce:</p>

<ul>
  <li>quotas per customer</li>
  <li>different limits per plan</li>
  <li>fair usage across a fleet of clients behind shared IPs</li>
  <li>consistent budgets across multiple gateway replicas</li>
</ul>

<p>The current limiter solves none of those exactly.</p>

<p>This is what makes it such a good lesson. The code is not broken. The architecture is misaligned with the real control point.</p>

<h2 id="why-this-happens-so-often">Why this happens so often</h2>

<p>Rate limiting is one of those features teams add under pressure. You need something fast, you need abuse protection, and the middleware makes it trivial to get started. So you put the limiter near the top of the stack and move on.</p>

<p>But order matters.</p>

<p>Middleware placement decides what identity data exists when enforcement runs. If auth has not happened yet, then you cannot rate-limit by authenticated business identity. You can only rate-limit by what you know at that point: IP, connection, or some raw header value you have not validated.</p>

<p>This repo even documents the gap in <code class="language-plaintext highlighter-rouge">README.md</code>, which I love. That is exactly the kind of honest operational documentation mature teams should keep.</p>

<h2 id="the-scaling-problem-is-separate-and-just-as-important">The scaling problem is separate and just as important</h2>

<p>Even if the limiter were keyed correctly, in-memory counters do not scale across replicas. With multiple gateway instances, your effective quota becomes roughly the sum of the per-instance limits unless you share state.</p>

<p>The repo already includes <code class="language-plaintext highlighter-rouge">rate-limit-redis</code> as a dependency, but it is not wired into the running middleware. That is not a failure. It is evidence of an architecture in transition: from local protection to product-grade quota enforcement.</p>

<p>Those transitions are where most systems spend their lives.</p>

<h2 id="what-i-would-change">What I would change</h2>

<p>I would split rate limiting into two layers.</p>

<p>First, I would keep a relatively loose pre-auth limiter keyed by IP to defend against brute-force abuse and credential stuffing patterns.</p>

<p>Second, I would move the stricter quota limiter after <code class="language-plaintext highlighter-rouge">apiKeyAuth</code> and key it on validated API keys. In a multi-instance deployment, I would back that limiter with Redis so every replica shares the same counters.</p>

<p>That gets you a cleaner separation of concerns:</p>

<ul>
  <li>pre-auth protection for unknown callers</li>
  <li>post-auth quotas for real customers</li>
  <li>one identity model for business policy</li>
</ul>
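<p>The post-auth quota layer reduces to a shared counter per (key, window). A sketch of a fixed-window check, shown in Python for brevity, in which a plain dict stands in for Redis <code class="language-plaintext highlighter-rouge">INCR</code> plus <code class="language-plaintext highlighter-rouge">EXPIRE</code> and the limits are illustrative:</p>

```python
import time

WINDOW = 60    # seconds per window; illustrative
LIMIT = 100    # requests per key per window; illustrative
_counters = {} # stand-in for Redis; old windows are never cleaned here,
               # whereas EXPIRE handles that automatically in real life

def allow(api_key, now=None):
    """Fixed-window check keyed on the *validated* API key, post-auth."""
    now = time.time() if now is None else now
    bucket = (api_key, int(now // WINDOW))
    count = _counters.get(bucket, 0) + 1
    _counters[bucket] = count  # Redis equivalent: INCR bucket; EXPIRE bucket WINDOW
    return count <= LIMIT
```

<p>Because the counter lives in shared state, every replica sees the same budget, which is the property the in-memory middleware cannot give you.</p>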

<h2 id="why-this-is-a-great-blog-topic">Why this is a great blog topic</h2>

<p>This is the kind of issue senior engineers recognize immediately because it sits at the boundary between code and operations. The middleware works. The bug is in the model.</p>

<p>A lot of production systems have these hidden mismatches:</p>

<ul>
  <li>caching keyed on transport details instead of business intent</li>
  <li>auth checks that happen after expensive work</li>
  <li>retries that amplify failure</li>
  <li>quotas enforced at the wrong trust boundary</li>
</ul>

<p>They are hard to spot because everything still “works” until scale or abuse teaches you otherwise.</p>

<h2 id="the-lesson">The lesson</h2>

<p>Rate limiting is not just an HTTP concern. It is an identity concern and a distributed systems concern.</p>

<p>If you enforce it before you know who the caller is, you get the wrong policy. If you enforce it without shared state, you get the wrong economics. And if you do both, you can still pass staging while failing the product reality of production.</p>

<p>That is why this code is interesting. It captures the exact moment many systems hit: the point where a protective middleware feature has to become a real platform policy.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[One of my favorite engineering stories is when the code is technically correct and operationally wrong.]]></summary></entry><entry><title type="html">If the Queue Is Down, the Upload Still Has to Work</title><link href="/engineering/2026/03/13/if-the-queue-is-down-the-upload-still-has-to-work.html" rel="alternate" type="text/html" title="If the Queue Is Down, the Upload Still Has to Work" /><published>2026-03-13T10:00:00+00:00</published><updated>2026-03-13T10:00:00+00:00</updated><id>/engineering/2026/03/13/if-the-queue-is-down-the-upload-still-has-to-work</id><content type="html" xml:base="/engineering/2026/03/13/if-the-queue-is-down-the-upload-still-has-to-work.html"><![CDATA[<p>One of the clearest signals that a product has seen real users is how it behaves when infrastructure is unhealthy.</p>

<p>In a perfect diagram, a user uploads a GPX or FIT file, the web app stores it, a job is queued, a worker processes it, and the UI updates later. That is exactly how Trek Point is designed when everything is healthy.</p>

<p>But “when everything is healthy” is not a user experience strategy.</p>

<p>At some point Redis is unavailable, Celery workers are stale, a deploy is mid-rollout, or the queue path throws an exception at the worst moment. The question becomes:</p>

<p><strong>does the product fail like infrastructure, or degrade like software that still wants to help the user?</strong></p>

<p>We chose degradation.</p>

<h2 id="the-default-path-is-still-async">The Default Path Is Still Async</h2>

<p>For activity uploads, async processing is the right architecture.</p>

<p>Parsing files, deriving playback data, enriching route metadata, and generating previews are not work I want to do inline on every request if I can avoid it. The background path gives us:</p>

<ul>
  <li>better request latency</li>
  <li>more resilient retries and batching patterns</li>
  <li>a cleaner way to isolate heavier processing</li>
  <li>room to evolve previews and enrichment without blocking the upload flow</li>
</ul>

<p>That part is straightforward.</p>

<h2 id="the-important-part-was-the-fallback">The Important Part Was the Fallback</h2>

<p>The code that matters most is not the Celery task declaration. It is the behavior when enqueueing fails.</p>

<p>In Trek Point, if the app cannot enqueue activity processing, it falls back to processing the uploaded activity in-process. That is a deeply pragmatic choice. It is also the kind of choice some teams are initially uncomfortable with because it introduces two execution paths.</p>

<p>I still think it was the right call.</p>

<p>Why? Because the user’s mental model is:</p>

<p>“I uploaded my activity. Did it work?”</p>

<p>They are not asking:</p>

<p>“Was this operation successfully delegated to our preferred background execution substrate?”</p>
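<p>The shape of that fallback is small. A sketch with hypothetical function names, where the enqueue stand-in always fails so the degradation path is visible:</p>

```python
def process_activity(activity_id):
    # Shared implementation: identical whether invoked by the worker
    # or inline, which is what keeps the two paths from drifting.
    return {"id": activity_id, "status": "processed"}

def enqueue_processing(activity_id):
    # Stand-in for a broker-backed dispatch; here it simulates an outage.
    raise ConnectionError("broker unavailable")

def handle_upload(activity_id):
    try:
        enqueue_processing(activity_id)
        return {"id": activity_id, "status": "queued"}
    except Exception:
        # Degrade: the request gets slower, but the upload still works.
        return process_activity(activity_id)
```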

<h2 id="the-tradeoff-is-real">The Tradeoff Is Real</h2>

<p>I do not want to oversell this pattern. Fallback-to-sync is not magic.</p>

<p>It can introduce:</p>

<ul>
  <li>slower requests under failure</li>
  <li>inconsistent latency</li>
  <li>duplicated execution paths to maintain</li>
  <li>harder reasoning about partial failure</li>
  <li>the possibility that web and worker code drift if discipline slips</li>
</ul>

<p>That is all true.</p>

<p>But there is another failure mode that is often worse in product terms:</p>

<ul>
  <li>the upload looks accepted</li>
  <li>nothing processes</li>
  <li>the user gets no useful outcome</li>
  <li>support gets the ticket later</li>
</ul>

<p>For a consumer-facing or prosumer-facing product, that is often the worse trade.</p>

<h2 id="repair-paths-matter-too">Repair Paths Matter Too</h2>

<p>One thing I like in Trek Point is that we did not stop at fallback logic. We also left ourselves a repair path.</p>

<p>There is a user-facing reprocess path for activities, and there is even playback-debug UI that helps explain when a worker may be running old code or when derived playback data is missing. That is not pretty from a purist architecture perspective, but it is excellent product engineering.</p>

<p>It says:</p>

<ul>
  <li>we know distributed processing can drift</li>
  <li>we know deploys are not always perfectly synchronized</li>
  <li>we want a safe way to recover without turning every issue into an ops incident</li>
</ul>

<p>That is the kind of realism I trust in production systems.</p>

<h2 id="why-this-pattern-fit-trek-point">Why This Pattern Fit Trek Point</h2>

<p>File upload and route activity processing sit in an awkward middle ground:</p>

<ul>
  <li>too heavy to always do inline</li>
  <li>too user-visible to silently fail behind a queue</li>
</ul>

<p>That is exactly the zone where graceful degradation earns its keep.</p>

<p>If this were a purely internal analytics pipeline, I would care more about strict queue guarantees and less about inline fallback. But this is user content. People expect an immediate path from “I uploaded a file” to “I can use it.”</p>

<p>Product context should influence reliability design.</p>

<h2 id="what-i-would-tighten-over-time">What I Would Tighten Over Time</h2>

<p>The risk with these pragmatic fallbacks is not the first implementation. It is drift.</p>

<p>Over time I would want:</p>

<ul>
  <li>stronger observability around fallback frequency</li>
  <li>explicit alerts if sync fallback starts happening too often</li>
  <li>tests that assert behavioral parity between web-triggered and worker-triggered processing</li>
  <li>clear runbooks for stale workers and partial deploys</li>
</ul>

<p>The fallback should be a resilience feature, not a hidden permanent execution mode.</p>

<h2 id="the-broader-lesson">The Broader Lesson</h2>

<p>Queue-centric architecture can make teams think in infrastructure terms. Users think in outcome terms.</p>

<p>When the queue is down, the question is not whether your async design was elegant. The question is whether the product still did the most reasonable thing for the person who just tried to use it.</p>

<p>For Trek Point, that often meant this:</p>

<p>if we can still safely process the upload, do it.</p>

<p>That is not the cleanest diagram. It is a better product.</p>]]></content><author><name>Bongani Mbigi</name></author><category term="engineering" /><summary type="html"><![CDATA[One of the clearest signals that a product has seen real users is how it behaves when infrastructure is unhealthy.]]></summary></entry></feed>