The Stripe Webhook Decision I’d Revisit: Re-Fetching Events by ID Instead of Verifying Signatures
I like writing about decisions that were reasonable, shipped, and still worth revisiting.
Our Stripe webhook path in Trek Point is one of those.
The implementation takes a pragmatic approach:
- accept a JSON payload
- require an event id
- re-fetch the event from Stripe using our secret key
- process the safe copy we retrieved directly from Stripe
That is not a crazy design. In fact, it has some real advantages. But if I were tightening the system now, this is one of the first places I would look again.
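To make the flow concrete, here is a minimal sketch of a handler with that shape. It is not the actual Trek Point code. The names `handle_webhook`, `retrieve_event`, and `process` are hypothetical, and the Stripe lookup is passed in as a function so the example runs without network access or the Stripe library.

```python
import json
from typing import Any, Callable, Dict

Event = Dict[str, Any]


def handle_webhook(
    body: bytes,
    retrieve_event: Callable[[str], Event],  # assumption: wraps an authenticated Stripe lookup
    process: Callable[[Event], None],        # assumption: the app's own billing logic
) -> int:
    """Return an HTTP status code for one inbound webhook request."""
    try:
        payload = json.loads(body)
    except ValueError:
        return 400  # body is not valid JSON

    event_id = payload.get("id") if isinstance(payload, dict) else None
    if not isinstance(event_id, str) or not event_id:
        return 400  # an event id is the only thing we require from the body

    # Everything in the body except the id is discarded. The event we act on
    # is the copy fetched with our own credentials.
    event = retrieve_event(event_id)
    process(event)
    return 200
```

Note that the only check applied to the request itself is that it parses and carries an id. Nothing here establishes who sent it.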
Why We Did It This Way
The original intuition was straightforward:
If an inbound webhook says it is event evt_123, do not trust the payload body. Ask Stripe for evt_123 directly and process that canonical version instead.
The appeal is obvious:
- the event data comes from Stripe over authenticated API access
- we do not depend on the request body contents except for the id
- we avoid processing obviously forged payload bodies
For a product team moving quickly, that feels like a clean trust model.
What This Approach Gets Right
I still think there are legitimate strengths here.
1. It avoids blind trust in the inbound payload
That is better than naively consuming whatever object arrived over HTTP.
2. It centralizes on Stripe’s current view of the event
For some operational flows, that can be simpler than reasoning about every payload edge case locally.
3. It fit our existing Stripe integration model
The app already talked to Stripe directly for billing operations, so the mental model was consistent.
Those are real advantages. This was not security negligence. It was a pragmatic trust strategy.
Why I’d Still Revisit It
The biggest reason is that “re-fetch by id” and “verify webhook authenticity” are not the same thing.
Webhook signature verification answers:
“Did Stripe send this exact request to us?”
Re-fetching by id answers:
“Does this event id exist in Stripe, and can we retrieve it with our account credentials?”
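Those are related, but not identical, security properties. With re-fetching alone, anyone who can reach the endpoint and supply a real event id can trigger processing of that event, whether or not Stripe sent the request.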
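For comparison, this is roughly what the signature check involves. Stripe sends a `Stripe-Signature` header containing a timestamp (`t=`) and one or more `v1=` signatures, each an HMAC-SHA256 of `"{timestamp}.{raw body}"` keyed with the endpoint's signing secret. The sketch below uses only the standard library so the mechanics are visible. The function name and the `now` parameter are my own, and in a real handler I would expect to use the official library's helper (`stripe.Webhook.construct_event` in Python) rather than hand-rolled code.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(
    payload: bytes,
    header: str,
    secret: str,
    tolerance: int = 300,          # seconds; 5 minutes matches Stripe's library default
    now: Optional[float] = None,   # hypothetical hook so the clock can be fixed in tests
) -> bool:
    """Check a Stripe-Signature header against the raw request body."""
    timestamp = None
    candidates = []
    for part in header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)

    if timestamp is None or not candidates:
        return False
    try:
        signed_at = int(timestamp)
    except ValueError:
        return False

    # Reject requests whose timestamp is outside the tolerance window,
    # which limits how long a captured request can be replayed.
    current = time.time() if now is None else now
    if abs(current - signed_at) > tolerance:
        return False

    signed_payload = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    # Constant-time comparison against each v1 signature in the header.
    return any(
        hmac.compare_digest(expected.encode(), candidate.encode())
        for candidate in candidates
    )
```

The important detail is that the check runs over the raw request bytes, before any JSON parsing. It also needs no outbound call to Stripe, which matters for the next concern.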
The second concern is operational coupling. Our webhook handler now depends on live Stripe API retrieval during request handling. If Stripe’s API is degraded or our outbound access is impaired, the webhook path becomes more fragile than it needs to be.
That is not hypothetical. Production systems spend a lot of time in partial failure modes.
The More Questionable Choice: Returning 200 on Generic Exceptions
The design choice I am less comfortable with today is returning success on broad exceptions specifically to stop Stripe from retrying.
I understand why that happened. Runaway retries can amplify bad incidents, especially if the handler is crashing on a condition retries will not fix.
But the downside is serious:
- you may acknowledge work you did not actually complete
- retries stop even if the failure was transient
- recovery becomes a manual reconciliation problem
That is the kind of tradeoff teams make under production pressure, but it is also exactly the kind of behavior that deserves a second pass once the system matures.
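One way to narrow the blanket behavior is to decide the response code per failure class instead of per handler. The sketch below assumes two hypothetical exception types, `PermanentError` and `TransientError`, that the processing code would raise. Neither exists in our codebase today, and the specific status codes are a judgment call. It relies on the fact that Stripe retries deliveries that do not receive a 2xx response.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("webhooks")


class PermanentError(Exception):
    """Hypothetical: a failure that retrying the same event cannot fix."""


class TransientError(Exception):
    """Hypothetical: a failure that may succeed on a later attempt."""


def respond(process: Callable[[Dict[str, Any]], None], event: Dict[str, Any]) -> int:
    """Run the processor and choose a status code based on how it failed."""
    try:
        process(event)
    except PermanentError:
        # Acknowledge so Stripe stops retrying, but leave a trail for follow-up.
        logger.exception("dropping event %s: not recoverable", event.get("id"))
        return 200
    except TransientError:
        # Non-2xx: let Stripe's retry schedule redeliver the event.
        logger.warning("transient failure for event %s", event.get("id"))
        return 503
    except Exception:
        # Unknown failures are not acknowledged by default.
        logger.exception("unexpected failure for event %s", event.get("id"))
        return 500
    return 200
```

The default flips: an unclassified exception is treated as unfinished work rather than as success.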
What I’d Do Differently Now
I would likely move toward:
- verifying Stripe webhook signatures on ingress
- treating idempotency and replay handling as first-class concerns
- retrying or dead-lettering failures more deliberately instead of broadly suppressing them
- reserving 200-on-error behavior for cases we can prove are non-recoverable and safely ignorable
I might still keep the ability to re-fetch event details when useful, but I would not want that to be the primary authenticity model.
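Letting retries through only works if processing the same event twice is harmless, so the idempotency piece deserves a sketch of its own. This one records each event id in a table with a uniqueness constraint and runs the handler in the same transaction. I use SQLite here purely so the example is self-contained; the table name and the `make_store` and `process_once` helpers are hypothetical.

```python
import sqlite3
from typing import Callable


def make_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database with a table of already-processed event ids."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    conn.commit()
    return conn


def process_once(
    conn: sqlite3.Connection, event_id: str, handler: Callable[[], None]
) -> bool:
    """Run handler at most once per event id. Returns True if it ran."""
    # The connection context manager commits on success and rolls back if
    # an exception escapes, so a failed handler leaves no "processed" row.
    with conn:
        try:
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
        except sqlite3.IntegrityError:
            return False  # duplicate delivery or replay: already handled
        handler()
    return True
```

Because a failing handler rolls back the marker row, a later retry of the same event gets another attempt. The caveat is that only work done through the same database connection is rolled back. Side effects elsewhere, such as emails or calls to other services, would need their own safeguards.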
Why This Is a Good Example of Real Engineering Tradeoffs
I like this case because it is not a “we were wrong and now we are wise” story.
It is a story about shipping under realistic constraints:
- we needed a workable webhook trust model
- we wanted to reduce attack surface from untrusted payload bodies
- we did not want billing incidents to spiral under repeated retries
Those are all reasonable concerns. The code reflects a team trying to make the system sturdy with the tools and time it had.
That is why these are the most useful kinds of postmortem-adjacent decisions to write about. The original implementation makes sense. It just no longer feels like the final version I would want.
The Broader Lesson
Security and reliability decisions are often made in combination, not isolation.
Our Stripe webhook path is a good example:
- trust model
- failure handling
- retry behavior
- billing consistency
all show up in one small endpoint.
That is why I think webhook handling is a great place to study a team’s maturity. It forces you to reveal what you value more when the design is imperfect: simplicity, authenticity, resilience, or operational containment.
In Trek Point’s case, we picked a pragmatic path that shipped. I am glad we did. I also think it is one of the clearest places where “good enough to launch” and “what I want in the long term” are not the same answer.