If the Queue Is Down, the Upload Still Has to Work
One of the clearest signals that a product has seen real users is how it behaves when infrastructure is unhealthy.
In a perfect diagram, a user uploads a GPX or FIT file, the web app stores it, a job is queued, a worker processes it, and the UI updates later. That is exactly how Trek Point is designed when everything is healthy.
But “when everything is healthy” is not a user experience strategy.
At some point Redis is unavailable, Celery workers are stale, a deploy is mid-rollout, or the queue path throws an exception at the worst moment. The question becomes:
does the product fail like infrastructure, or degrade like software that still wants to help the user?
We chose degradation.
The Default Path Is Still Async
For activity uploads, async processing is the right architecture.
Parsing files, deriving playback data, enriching route metadata, and generating previews are all tasks I would rather not run inline on every request. The background path gives us:
- better request latency
- more resilient retries and batching patterns
- a cleaner way to isolate heavier processing
- room to evolve previews and enrichment without blocking the upload flow
That part is straightforward.
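To make the shape of that background work concrete, here is a minimal sketch of the processing steps as a pipeline. All of the helper names (`parse_activity_file`, `derive_playback_data`, `generate_preview`) are illustrative assumptions with stubbed bodies, not Trek Point's actual code:

```python
# Illustrative pipeline for the background processing steps.
# All names and bodies are stand-ins, not Trek Point's real implementation.

def parse_activity_file(raw: bytes) -> dict:
    # The real system would parse GPX/FIT here; this stub just splits points.
    return {"points": raw.decode().split(",")}

def derive_playback_data(parsed: dict) -> dict:
    # Stand-in for deriving playback data from the parsed route.
    parsed["playback"] = list(enumerate(parsed["points"]))
    return parsed

def generate_preview(parsed: dict) -> str:
    # Stand-in for preview generation.
    return f"preview:{len(parsed['points'])} points"

def process_activity(raw: bytes) -> dict:
    # The full chain a worker (or, later, the inline fallback) would run.
    parsed = derive_playback_data(parse_activity_file(raw))
    parsed["preview"] = generate_preview(parsed)
    return parsed
```

The useful property is that `process_activity` is a plain function: nothing in it knows whether it was invoked by a worker or by the web process, which is what makes the fallback discussed below cheap to add.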
The Important Part Was the Fallback
The code that matters most is not the Celery task declaration. It is the behavior when enqueueing fails.
In Trek Point, if the app cannot enqueue activity processing, it falls back to processing the uploaded activity in-process. That is a deeply pragmatic choice. It is also the kind of choice some teams are initially uncomfortable with because it introduces two execution paths.
I still think it was the right call.
Why? Because the user’s mental model is:
“I uploaded my activity. Did it work?”
They are not asking:
“Was this operation successfully delegated to our preferred background execution substrate?”
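In code, the pattern is small. This is a hedged sketch, not Trek Point's actual upload handler; the function names and return values are assumptions, and the `enqueue` callable stands in for something like a Celery task's `.delay()`:

```python
# Hypothetical sketch of enqueue-with-fallback. Names are illustrative,
# not Trek Point's real code.
import logging

logger = logging.getLogger(__name__)

def handle_upload(activity_id: int, enqueue, process_inline) -> str:
    """Prefer the async path; if enqueueing fails (broker down,
    mid-deploy, stale workers), process the upload in-process."""
    try:
        enqueue(activity_id)            # e.g. process_activity.delay(id)
        return "queued"
    except Exception:
        logger.warning(
            "enqueue failed; processing activity %s inline", activity_id
        )
        process_inline(activity_id)     # same processing code, run now
        return "processed_inline"
```

A quick usage example under a simulated broker outage:

```python
def broken_enqueue(_id):
    raise ConnectionError("redis unavailable")

handle_upload(42, broken_enqueue, lambda _id: None)  # -> "processed_inline"
```

The key design choice is that both paths call the same processing code; the fallback is a different dispatch, not a different implementation.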
The Tradeoff Is Real
I do not want to oversell this pattern. Fallback-to-sync is not magic.
It can introduce:
- slower requests under failure
- inconsistent latency
- duplicated execution paths to maintain
- harder reasoning about partial failure
- the possibility that web and worker code drift if discipline slips
That is all true.
But there is another failure mode that is often worse in product terms:
- the upload looks accepted
- nothing processes
- the user gets no useful outcome
- support gets the ticket later
For a consumer-facing or prosumer-facing product, that is often the worse trade.
Repair Paths Matter Too
One thing I like in Trek Point is that we did not stop at fallback logic. We also left ourselves a repair path.
There is a user-facing reprocess path for activities, and there is even a playback-debug UI that helps explain when a worker may be running old code or when derived playback data is missing. That is not pretty from a purist architecture perspective, but it is excellent product engineering.
It says:
- we know distributed processing can drift
- we know deploys are not always perfectly synchronized
- we want a safe way to recover without turning every issue into an ops incident
That is the kind of realism I trust in production systems.
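A repair path like this can be very small if reprocessing is idempotent. The sketch below is an assumption about the shape, not Trek Point's actual code; the point is that reprocessing drops possibly-stale derived data and recomputes it from the stored upload:

```python
# Hypothetical reprocess path. Names are illustrative assumptions.

def reprocess_activity(activity: dict, derive) -> dict:
    """Recompute derived data from the original upload.

    Safe to run repeatedly: stale derived data is discarded first,
    so a worker that ran old code can always be corrected later.
    """
    activity.pop("playback", None)        # drop possibly-stale output
    activity["playback"] = derive(activity["raw"])
    return activity
```

Because it only reads the stored raw upload, the same entry point can back a user-facing "reprocess" button and an operator's bulk repair script.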
Why This Pattern Fit Trek Point
File upload and route activity processing sit in an awkward middle ground:
- too heavy to always do inline
- too user-visible to silently fail behind a queue
That is exactly the zone where graceful degradation earns its keep.
If this were a purely internal analytics pipeline, I would care more about strict queue guarantees and less about inline fallback. But this is user content. People expect an immediate path from “I uploaded a file” to “I can use it.”
Product context should influence reliability design.
What I Would Tighten Over Time
The risk with these pragmatic fallbacks is not the first implementation. It is drift.
Over time I would want:
- stronger observability around fallback frequency
- explicit alerts if sync fallback starts happening too often
- tests that assert behavioral parity between web-triggered and worker-triggered processing
- clear runbooks for stale workers and partial deploys
The fallback should be a resilience feature, not a hidden permanent execution mode.
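Tracking fallback frequency does not need heavy machinery. One way to sketch it, assuming a simple in-process sliding window (the class name and thresholds here are illustrative, and a real deployment would more likely emit a metric to Prometheus or similar):

```python
# Sketch of fallback-frequency tracking: alert when sync fallback fires
# more than max_fallbacks times in a sliding window. All names and
# thresholds are illustrative assumptions.
import time
from collections import deque

class FallbackMonitor:
    def __init__(self, window_seconds: float = 300, max_fallbacks: int = 10):
        self.window = window_seconds
        self.max_fallbacks = max_fallbacks
        self.events = deque()  # timestamps of recent fallbacks

    def record_fallback(self, now=None):
        # Call this from the sync-fallback branch of the upload handler.
        self.events.append(now if now is not None else time.monotonic())

    def should_alert(self, now=None):
        # Expire events outside the window, then compare to the threshold.
        now = now if now is not None else time.monotonic()
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) > self.max_fallbacks
```

Wiring `record_fallback` into the fallback branch is what turns "fallback as resilience feature" into something you can verify, rather than a mode you discover by accident months later.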
The Broader Lesson
Queue-centric architecture can make teams think in infrastructure terms. Users think in outcome terms.
When the queue is down, the question is not whether your async design was elegant. The question is whether the product still did the most reasonable thing for the person who just tried to use it.
For Trek Point, that often meant this:
if we can still safely process the upload, do it.
That is not the cleanest diagram. It is a better product.