Retry, backoff, and saga compensation: anti-patterns and the real shape

Naive retry loops are how workflows take down their own dependencies

The most common workflow failure mode we see in audits: a step calls a downstream service, gets a timeout or a 5xx, retries immediately, fails again, retries again. The downstream service was already overloaded; the retry storm finishes the job. Cascading failures from naive retry are the workflow engine's most prolific way to take production down.

Close behind is the retry-with-fixed-delay pattern, a slower version of the same problem. Every workflow waits the same number of seconds, retries at the same moment, and creates a thundering herd against the recovering service. The shape of the load is wrong: recovery is delayed because each wave of synchronized retries triggers a fresh spike.

Exponential backoff with jitter is the only retry shape that actually recovers

Exponential backoff increases the delay between retries — 1s, 2s, 4s, 8s, 16s — so the load on the recovering service drops over time. Jitter adds randomization to each delay so concurrent retries spread out instead of stacking. The combination is well-studied (the AWS architecture documentation has been preaching this for years) and remains the right answer.

The configuration that matters: a maximum retry count (typically 5–8), a maximum total delay (typically 5–15 minutes), and a per-step override for steps that need different behavior. Retrying an external API call 100 times over 24 hours is rarely correct; the workflow should declare failure and route to a dead-letter queue long before that.
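As a concrete shape, here is a minimal Python sketch of that policy: exponential backoff with full jitter, a retry-count cap, and a total-delay budget. The names (RetryPolicy, TransientError, call_with_retry) are illustrative assumptions, not any particular engine's API.

```python
import random
import time

class TransientError(Exception):
    """Timeouts, 5xx responses, network blips: the failures worth retrying."""

class RetryPolicy:
    def __init__(self, base_delay=1.0, max_attempts=6, max_total_delay=600.0):
        self.base_delay = base_delay            # first backoff step, in seconds
        self.max_attempts = max_attempts        # typically 5-8
        self.max_total_delay = max_total_delay  # typically 5-15 minutes

    def delay(self, attempt: int) -> float:
        # Exponential backoff doubles the ceiling each attempt: 1s, 2s, 4s, 8s, 16s.
        ceiling = self.base_delay * (2 ** attempt)
        # Full jitter: draw uniformly from [0, ceiling] so concurrent retries
        # spread out instead of stacking into a thundering herd.
        return random.uniform(0, ceiling)

def call_with_retry(step, policy: RetryPolicy):
    elapsed = 0.0
    for attempt in range(policy.max_attempts):
        try:
            return step()
        except TransientError:
            wait = policy.delay(attempt)
            if elapsed + wait > policy.max_total_delay:
                break  # delay budget exhausted
            time.sleep(wait)
            elapsed += wait
    raise RuntimeError("retries exhausted: route to dead-letter queue")
```

Per-step overrides fall out naturally: a step that calls a flaky external API gets its own RetryPolicy instance instead of the workflow default.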

Retry success rate: ~94% of transient failures recover by attempt 5
Dead-letter queue rate: ~3% of workflows require human review
Compensation success: >99% of saga unwinds complete cleanly
Workflow steps with idempotency keys: 100%, enforced in code review

Idempotency keys are the only thing standing between retry and double-charging the customer

Retry implies the step might run twice. The downstream side has to be ready for that. An idempotency key — a deterministic identifier the workflow generates per logical operation — lets the downstream service detect a duplicate request and return the same answer without re-executing the side effect. Without idempotency keys, retry creates duplicate orders, double charges, or resent emails.

Every workflow step that calls a service with side effects has to pass an idempotency key. We enforce this in code review; a step that doesn't generate or pass the key is rejected. The downstream service has to honor it; if it doesn't, the workflow either wraps the call with deduplication logic or the operations team accepts the side-effect risk and documents it.
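A minimal sketch of the discipline, assuming the downstream service honors the key; the function names and the in-memory result cache are illustrative:

```python
import hashlib

def idempotency_key(workflow_id: str, step_name: str, operation_id: str) -> str:
    # Deterministic per logical operation: every retry of the same operation
    # produces the same key, so the downstream side can detect the duplicate.
    raw = f"{workflow_id}:{step_name}:{operation_id}"
    return hashlib.sha256(raw.encode()).hexdigest()

# Downstream side, assumed to honor the key: a duplicate request gets the
# cached answer instead of a second side effect.
_results: dict[str, dict] = {}

def charge_card(key: str, amount_cents: int) -> dict:
    if key in _results:
        return _results[key]  # retry detected: same answer, no second charge
    result = {"charged": amount_cents}  # the real side effect happens once
    _results[key] = result
    return result
```

The key is derived from the logical operation, never from the attempt number, so retries collide on purpose.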

Three failure classes need three different responses

Transient failures (timeouts, 5xx errors, network blips) recover with retry. Persistent failures (4xx errors that won't resolve, missing records, validation errors) won't recover with retry; they need to fail fast and route to a dead-letter queue. Partial failures (the call succeeded downstream but the response didn't reach the workflow) are the dangerous middle and need idempotency keys plus reconciliation logic.

Treating all three with the same retry policy is the source of most workflow bugs. A 401 Unauthorized retried 8 times still returns 401 — the credentials need to be fixed, not retried. A 502 Bad Gateway will probably resolve in a minute. The workflow has to inspect the failure type and respond accordingly, not blindly back off.
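One way to make that inspection explicit is a small classifier. This is a hedged sketch keyed on HTTP status alone; real classifiers also look at exception types, error bodies, and rate-limit headers.

```python
from enum import Enum

class FailureClass(Enum):
    TRANSIENT = "retry with backoff"
    PERSISTENT = "fail fast, route to dead-letter"
    PARTIAL = "retry with idempotency key, then reconcile"

def classify(status: int | None) -> FailureClass:
    if status is None:
        # No response at all: the call may have succeeded downstream.
        # Treat as partial; only an idempotent retry is safe here.
        return FailureClass.PARTIAL
    if status in (408, 429) or 500 <= status < 600:
        return FailureClass.TRANSIENT   # a 502 will probably resolve
    if 400 <= status < 500:
        return FailureClass.PERSISTENT  # a 401 retried 8 times is still a 401
    return FailureClass.TRANSIENT       # unexpected codes default to retry
```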

Saga compensation is how multi-system workflows undo themselves

When a workflow performs steps across multiple systems and a later step fails, the earlier steps' side effects need to be unwound. The saga pattern handles this: each step has a defined compensation action that the engine calls in reverse order on failure. Reserve the inventory; if the later payment fails, release the inventory. Charge the card; if the later shipping setup fails, refund the charge.

Sagas are not transactions. They cannot guarantee atomicity; they can only guarantee that the workflow attempts to leave the system in a consistent state after failure. Compensation actions can themselves fail, which means the saga needs its own dead-letter queue for failed compensation. The architecture is honest about partial-failure complexity rather than hiding it.
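A minimal saga executor under those assumptions, with compensations run in reverse order and failed compensations routed to their own dead-letter queue; all names are illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    action: Callable[[], None]       # e.g. reserve inventory, charge card
    compensate: Callable[[], None]   # e.g. release inventory, refund charge

def run_saga(steps: list[Step], compensation_dlq: list[str]) -> bool:
    completed: list[Step] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            # Unwind earlier side effects in reverse order.
            for prior in reversed(completed):
                try:
                    prior.compensate()
                except Exception:
                    # Compensation itself failed: there is no atomicity to
                    # fall back on, so surface it for human review.
                    compensation_dlq.append(prior.name)
            return False
    return True
```

If shipping fails after inventory and payment succeeded, the refund runs first, then the inventory release: the exact reverse of the forward order.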

Dead-letter queues are mandatory, not optional

When retries exhaust or compensation fails, the workflow has to land somewhere. The dead-letter queue is where stuck workflows wait for human review with the full state — the input, the failure history, the partial side effects, and the recommended next action. Without a DLQ, failed workflows are silent and the operations team doesn't know they exist.

We instrument every workflow with DLQ routing on terminal failure. Each DLQ has an owner (the team responsible for the workflow's domain), an expected resolution time, and a runbook. Workflows that sit in DLQ longer than the expected resolution time generate alerts. The system surfaces stuck work; humans handle the residual.
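The entry shape that makes a DLQ reviewable might look like the following; the field names are assumptions, not a specific product's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

@dataclass
class DeadLetterEntry:
    workflow_id: str
    input_payload: dict
    failure_history: list[str]       # every attempt: error and timestamp
    partial_side_effects: list[str]  # what already happened downstream
    recommended_action: str
    owner: str                       # team that owns the workflow's domain
    runbook_url: str
    expected_resolution: timedelta = timedelta(hours=4)
    enqueued_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

    def is_overdue(self, now: datetime) -> bool:
        # Entries past their expected resolution time generate alerts.
        return now - self.enqueued_at > self.expected_resolution
```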

Compensation has to work across releases, which is the actually hard part

Workflows are long-lived. A workflow started today might be mid-execution six weeks from now, when version 1.5 of the workflow definition has been deployed. Compensation actions for steps from version 1.3 still have to work. Versioning workflow definitions and pinning in-flight executions to their starting version is what keeps this honest.

We treat workflow versioning the same way we treat schema migrations. Old versions stay deployed until in-flight executions complete. New versions take new executions. Compensation logic for both versions is maintained until the older version drains. The engineering discipline is the architecture; without it, the saga pattern breaks at the first migration.
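A sketch of that pinning discipline, assuming the engine keeps a registry of deployed definitions; the class and method names are illustrative:

```python
class DefinitionRegistry:
    """Pin each execution to the definition version it started on."""

    def __init__(self) -> None:
        self._versions: dict[str, dict] = {}   # version -> definition
        self._inflight: dict[str, str] = {}    # execution_id -> version

    def deploy(self, version: str, definition: dict) -> None:
        # Old versions stay registered until their executions drain.
        self._versions[version] = definition

    def start(self, execution_id: str, version: str) -> None:
        self._inflight[execution_id] = version  # pinned for its lifetime

    def definition_for(self, execution_id: str) -> dict:
        # Compensation for a step started on 1.3 resolves against 1.3,
        # even after 1.5 becomes the latest deploy.
        return self._versions[self._inflight[execution_id]]

    def complete(self, execution_id: str) -> None:
        self._inflight.pop(execution_id, None)

    def drained(self, version: str) -> bool:
        # Safe to retire a version only when nothing is pinned to it.
        return version not in self._inflight.values()
```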

We had a payment workflow that retried 47 times against a healthy service that just happened to time out once. The customer got 47 charges. After we added idempotency keys and exponential backoff with a max retry of 6, the same workflow shape stopped causing incidents. The fix was an architecture pattern, not a code change in any one step.

— Workflow platform engineer, marketplace deployment

Frequently asked

Why isn't a simple retry loop sufficient?

Because naive retries take down dependencies. A failing service that's already overloaded will be finished off by an immediate retry storm, and concurrent fixed-delay retries create thundering herds that prevent recovery. Exponential backoff with jitter is the only retry shape that actually lets a struggling downstream service recover. Maximum retry count and maximum total delay caps prevent infinite loops dressed up as retry policies.

What is an idempotency key and why is it required?

A deterministic identifier the workflow generates per logical operation, passed to downstream services so they can detect duplicate requests and return the same result without re-executing side effects. Without idempotency keys, retry creates duplicate orders, double charges, or resent emails. Every workflow step that calls a side-effecting service must pass an idempotency key — we enforce this in code review and reject steps that don't.

How is the saga pattern different from a database transaction?

Sagas don't guarantee atomicity; they guarantee that the workflow attempts to leave the system in a consistent state after failure by running compensation actions in reverse order. Database transactions are atomic within one database. Sagas span multiple systems and have to handle the case where compensation itself fails. The architecture is honest about partial-failure complexity instead of pretending transactions extend across system boundaries.

When should a workflow give up and route to dead-letter?

When retries exhaust under the configured policy (typically 5–8 attempts over 5–15 minutes), when the failure type is persistent (4xx errors, validation failures), or when compensation itself fails. Each DLQ has an owner, expected resolution time, and runbook. Stuck work is surfaced rather than silent. Without DLQ routing, failed workflows just disappear, and silent disappearance is the worst kind of failure for an operations team to handle.

How are workflow versions and in-flight executions handled?

Like schema migrations. Old versions stay deployed until in-flight executions complete. New versions take new executions. Compensation logic for both versions is maintained until the older version drains. Without explicit versioning, in-flight executions break when the workflow definition changes mid-flight, which is common in production. The discipline is the architecture.

What's the most common retry mistake in production?

Treating all failures as if they're transient. A 401 Unauthorized retried 8 times still returns 401 — the credentials need to be rotated, not retried. A persistent validation error retried with backoff just delays the inevitable failure and creates noise. The workflow has to inspect failure type and respond accordingly: transient gets retry, persistent gets fast-fail to DLQ, partial gets idempotent reconciliation.