// Pay on day one with disciplined retry semantics, or pay double on day 800 when a duplicate webhook bankrupts a customer.
You can ship a distributed system without idempotency. You cannot operate one without it. The gap between those two facts is where most production fires live.
The idempotency tax is the amount of engineering work you have to spend, eventually, to make sure that doing the same thing twice produces the same result as doing it once. You can pay it up front, when the system is small and the semantics are tractable, or you can pay it on the day you do a partial rollback in the middle of a payment run and double-bill a customer.
The interest rate is roughly 10x.
What goes wrong without it
Three categories of incident, in roughly the order they show up:
-
Webhook delivery. Stripe, GitHub, Slack — they all retry. They all warn you in their docs. They all eventually deliver the same event twice. If your handler isn't idempotent, your downstream state diverges from theirs in ways that are very hard to recover from after the fact.
-
Background jobs. Your queue retries on failure. Your queue retries on worker timeout. Your queue retries on the day someone deploys a fix and the in-flight job gets re-enqueued. None of these are bugs. They are the queue working correctly. Your handler is the part that has to be ready.
-
User-initiated actions. Someone clicks "Submit Order" twice because the spinner was slow. Someone refreshes a POST. A mobile client retries because the radio dropped. If submitting an order is not idempotent, you have a refund-team problem instead of an engineering problem, and the refund team is more expensive.
How to pay the tax cheaply, on day one
The pattern is boring and well-understood:
- Every operation that mutates state takes an
idempotency_key. The client generates it. The server stores it. - The server stores the result of the first successful execution against that key. On retry, it returns the stored result without re-executing.
- The key has a TTL — minutes for webhook delivery, hours for user-initiated, days for batch jobs.
- The key lives in the same transaction as the mutation, or it doesn't count.
That's the entire pattern. It takes about a day to retrofit into a small service. It takes about a quarter to retrofit into a large one. It takes the rest of your career to retrofit into a system that's been running for three years without it.
The senior-engineer version of this take
I have never once regretted adding idempotency. I have repeatedly regretted not adding it sooner. There is no clever architecture that gets you out of paying this tax. There is only the version of you that pays it on day one and the version of you that pays it on the day a customer's checkbook is on the line.
Pay it on day one.