If the Queue Is Down, the Upload Still Has to Work
One of the clearest signals that a product has seen real users is how it behaves when infrastructure is unhealthy.
In a perfect diagram, a user uploads a GPX or FIT file, the web app stores it, a job is queued, a worker processes it, and the UI updates later. That is exactly how Trek Point is designed when everything is healthy.
But “when everything is healthy” is not a user experience strategy.
At some point Redis is unavailable, Celery workers are stale, a deploy is mid-rollout, or the queue path throws an exception at the worst moment. The question becomes:
does the product fail like infrastructure, or degrade like software that still wants to help the user?
We chose degradation.
The Default Path Is Still Async
For activity uploads, async processing is the right architecture.
Parsing files, deriving playback data, enriching route metadata, and generating previews are all tasks I would rather not run inline on every request. The background path gives us:
- better request latency
- more resilient retries and batching patterns
- a cleaner way to isolate heavier processing
- room to evolve previews and enrichment without blocking the upload flow
That part is straightforward.
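To make the shape of that background work concrete, here is a minimal sketch of the processing steps as a pipeline. All of the helper names (`parse_activity_file`, `derive_playback_data`, `generate_preview`) are illustrative assumptions with stubbed bodies, not Trek Point's actual code:

```python
# Illustrative pipeline for the background processing steps.
# All names and bodies are stand-ins, not Trek Point's real implementation.

def parse_activity_file(raw: bytes) -> dict:
    # The real system would parse GPX/FIT here; this stub just splits points.
    return {"points": raw.decode().split(",")}

def derive_playback_data(parsed: dict) -> dict:
    # Stand-in for deriving playback data from the parsed route.
    parsed["playback"] = list(enumerate(parsed["points"]))
    return parsed

def generate_preview(parsed: dict) -> str:
    # Stand-in for preview generation.
    return f"preview:{len(parsed['points'])} points"

def process_activity(raw: bytes) -> dict:
    # The full chain a worker (or, later, the inline fallback) would run.
    parsed = derive_playback_data(parse_activity_file(raw))
    parsed["preview"] = generate_preview(parsed)
    return parsed
```

The useful property is that `process_activity` is a plain function: nothing in it knows whether it was invoked by a worker or by the web process, which is what makes the fallback discussed below cheap to add.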
The Important Part Was the Fallback
The code that matters most is not the Celery task declaration. It is the behavior when enqueueing fails.
In Trek Point, if the app cannot enqueue activity processing, it falls back to processing the uploaded activity in-process. That is a deeply pragmatic choice. It is also the kind of choice some teams are initially uncomfortable with because it introduces two execution paths.
I still think it was the right call.
Why? Because the user’s mental model is:
“I uploaded my activity. Did it work?”
They are not asking:
“Was this operation successfully delegated to our preferred background execution substrate?”
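In code, the pattern is small. This is a hedged sketch, not Trek Point's actual upload handler; the function names and return values are assumptions, and the `enqueue` callable stands in for something like a Celery task's `.delay()`:

```python
# Hypothetical sketch of enqueue-with-fallback. Names are illustrative,
# not Trek Point's real code.
import logging

logger = logging.getLogger(__name__)

def handle_upload(activity_id: int, enqueue, process_inline) -> str:
    """Prefer the async path; if enqueueing fails (broker down,
    mid-deploy, stale workers), process the upload in-process."""
    try:
        enqueue(activity_id)            # e.g. process_activity.delay(id)
        return "queued"
    except Exception:
        logger.warning(
            "enqueue failed; processing activity %s inline", activity_id
        )
        process_inline(activity_id)     # same processing code, run now
        return "processed_inline"
```

A quick usage example under a simulated broker outage:

```python
def broken_enqueue(_id):
    raise ConnectionError("redis unavailable")

handle_upload(42, broken_enqueue, lambda _id: None)  # -> "processed_inline"
```

The key design choice is that both paths call the same processing code; the fallback is a different dispatch, not a different implementation.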
The Tradeoff Is Real
I do not want to oversell this pattern. Fallback-to-sync is not magic.
It can introduce:
- slower requests under failure
- inconsistent latency
- duplicated execution paths to maintain
- harder reasoning about partial failure
- the possibility that web and worker code drift if discipline slips
That is all true.
But there is another failure mode that is often worse in product terms:
- the upload looks accepted
- nothing processes
- the user gets no useful outcome
- support gets the ticket later
For a consumer-facing or prosumer-facing product, that is often the worse trade.
Repair Paths Matter Too
One thing I like in Trek Point is that we did not stop at fallback logic. We also left ourselves a repair path.
There is a user-facing reprocess path for activities, and there is even a playback-debug UI that helps explain when a worker may be running old code or when derived playback data is missing. That is not pretty from a purist architecture perspective, but it is excellent product engineering.
It says:
- we know distributed processing can drift
- we know deploys are not always perfectly synchronized
- we want a safe way to recover without turning every issue into an ops incident
That is the kind of realism I trust in production systems.
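A repair path like this can be very small if reprocessing is idempotent. The sketch below is an assumption about the shape, not Trek Point's actual code; the point is that reprocessing drops possibly-stale derived data and recomputes it from the stored upload:

```python
# Hypothetical reprocess path. Names are illustrative assumptions.

def reprocess_activity(activity: dict, derive) -> dict:
    """Recompute derived data from the original upload.

    Safe to run repeatedly: stale derived data is discarded first,
    so a worker that ran old code can always be corrected later.
    """
    activity.pop("playback", None)        # drop possibly-stale output
    activity["playback"] = derive(activity["raw"])
    return activity
```

Because it only reads the stored raw upload, the same entry point can back a user-facing "reprocess" button and an operator's bulk repair script.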
Why This Pattern Fit Trek Point
File upload and route activity processing sit in an awkward middle ground:
- too heavy to always do inline
- too user-visible to silently fail behind a queue
That is exactly the zone where graceful degradation earns its keep.
If this were a purely internal analytics pipeline, I would care more about strict queue guarantees and less about inline fallback. But this is user content. People expect an immediate path from “I uploaded a file” to “I can use it.”
Product context should influence reliability design.
What I Would Tighten Over Time
The risk with these pragmatic fallbacks is not the first implementation. It is drift.
Over time I would want:
- stronger observability around fallback frequency
- explicit alerts if sync fallback starts happening too often
- tests that assert behavioral parity between web-triggered and worker-triggered processing
- clear runbooks for stale workers and partial deploys
The fallback should be a resilience feature, not a hidden permanent execution mode.
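Tracking fallback frequency does not need heavy machinery. One way to sketch it, assuming a simple in-process sliding window (the class name and thresholds here are illustrative, and a real deployment would more likely emit a metric to Prometheus or similar):

```python
# Sketch of fallback-frequency tracking: alert when sync fallback fires
# more than max_fallbacks times in a sliding window. All names and
# thresholds are illustrative assumptions.
import time
from collections import deque

class FallbackMonitor:
    def __init__(self, window_seconds: float = 300, max_fallbacks: int = 10):
        self.window = window_seconds
        self.max_fallbacks = max_fallbacks
        self.events = deque()  # timestamps of recent fallbacks

    def record_fallback(self, now=None):
        # Call this from the sync-fallback branch of the upload handler.
        self.events.append(now if now is not None else time.monotonic())

    def should_alert(self, now=None):
        # Expire events outside the window, then compare to the threshold.
        now = now if now is not None else time.monotonic()
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) > self.max_fallbacks
```

Wiring `record_fallback` into the fallback branch is what turns "fallback as resilience feature" into something you can verify, rather than a mode you discover by accident months later.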
The Broader Lesson
Queue-centric architecture can make teams think in infrastructure terms. Users think in outcome terms.
When the queue is down, the question is not whether your async design was elegant. The question is whether the product still did the most reasonable thing for the person who just tried to use it.
For Trek Point, that often meant this:
if we can still safely process the upload, do it.
That is not the cleanest diagram. It is a better product.