
Timeouts, Retries, Idempotency, And Backpressure
Series
System Architecture Field Guide
10 of 12 in the series
A field guide for engineers moving into system ownership, focused on the decisions that make systems safer to change, easier to understand, and less fragile under real product pressure.
Failure handling is architecture. Timeouts, retries, idempotency, and backpressure decide whether a system degrades safely or turns a small problem into a wider incident.
Many production incidents start with a small failure.
One dependency gets slow. One queue falls behind. One provider returns intermittent errors. One database table starts locking more than usual.
The system does not fail because something went wrong.
Something always goes wrong.
The system fails because every caller reacts in a way that makes the original problem bigger.
That is why timeouts, retries, idempotency, and backpressure are architecture.
They decide whether failure stays local or spreads.
Timeouts Are A Budget, Not A Guess
A timeout says how long one part of the system is allowed to wait for another.
Too short, and the system fails healthy requests. Too long, and callers pile up, threads stay occupied, queues grow, and users wait for work that may never finish.
The common mistake is giving every dependency a generous timeout because it feels safer:
Checkout request budget: 2 seconds
Payment timeout: 10 seconds
Inventory timeout: 10 seconds
Fraud timeout: 10 seconds
That does not create reliability.
It creates a request that can never meet its user-facing budget.
A better model starts from the outside:
User-facing checkout budget: 2 seconds
Cart validation: 150ms
Inventory check: 250ms
Payment authorization: 800ms
Fraud decision: 300ms
Application overhead: 500ms
The numbers are examples, not universal guidance.
Timeout decisions should fit inside the experience the product promises.
If a dependency cannot reliably respond inside the budget, the architecture has to change. Maybe the workflow needs a pending state. Maybe the dependency needs caching. Maybe the operation should become async. Maybe the product cannot promise an immediate answer.
Timeouts reveal product truth.
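One way to keep per-dependency timeouts inside the user-facing budget is to carry a single deadline through the request and derive each call's timeout from whatever is left. The sketch below is a minimal illustration under that assumption; `Deadline`, `timeout_for`, and `STEP_CAPS` are hypothetical names, and the values simply mirror the example budget.

```python
import time


class Deadline:
    """Tracks how much of a request's total time budget is left."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        """Seconds left in the overall budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, step_cap_seconds):
        """Timeout for one call: the step's own cap or what is left, whichever is smaller."""
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("request budget exhausted")
        return min(step_cap_seconds, left)


# Per-step caps taken from the example budget (illustrative values only).
STEP_CAPS = {
    "cart_validation": 0.150,
    "inventory_check": 0.250,
    "payment_authorization": 0.800,
    "fraud_decision": 0.300,
}
```

A handler would create `Deadline(2.0)` when the checkout request arrives and pass `deadline.timeout_for(STEP_CAPS["payment_authorization"])` to whatever client makes the payment call. If earlier steps ran slow, later steps get less time instead of silently pushing the request past its budget.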
Retries Can Heal Or Harm
Retries are useful when failure is temporary.
A network blip, a short provider outage, a transient database conflict, or a leader election can recover if the caller tries again.
Retries are harmful when every caller retries at the same time and overloads a struggling dependency.
This is the retry storm:
Provider slows down
Callers time out
Callers retry immediately
Provider receives more traffic
Provider gets slower
More callers time out
More callers retry
The system turns a slowdown into a self-inflicted outage.
Retries need rules:
- retry only errors that are safe to retry
- use backoff instead of immediate loops
- add jitter so callers do not retry together
- cap the number of attempts
- respect Retry-After when the provider gives it
- stop retrying when the caller's timeout budget is exhausted
- observe retry rate as a production signal
A retry is not a harmless second chance.
It is extra load sent into a system that may already be unhealthy.
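The rules above can be sketched as a small retry helper. This is an illustration, not a reference implementation: `RetryableError` and `call_with_retries` are hypothetical names, the defaults are arbitrary, and the `sleep`, `clock`, and `rng` parameters exist only so the behavior can be tested without real waiting.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical error type for failures that are safe to retry."""

    def __init__(self, message="", retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after  # seconds, if the provider sent Retry-After


def call_with_retries(operation, *, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      budget_seconds=2.0, sleep=time.sleep, clock=time.monotonic,
                      rng=random.random):
    """Call operation(), retrying RetryableError with capped, jittered backoff."""
    deadline = clock() + budget_seconds
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RetryableError as err:
            if attempt == max_attempts:
                raise  # attempt cap reached
            # Full jitter: a random delay between 0 and the exponential cap.
            backoff_cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay = rng() * backoff_cap
            if err.retry_after is not None:
                delay = max(delay, err.retry_after)  # never retry sooner than asked
            if clock() + delay >= deadline:
                raise  # waiting would exceed the caller's budget
            sleep(delay)
```

Any exception other than `RetryableError` propagates immediately, which is how the "retry only errors that are safe to retry" rule shows up in this sketch. Counting how often the `except` branch runs gives the retry rate worth watching in production.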
Idempotency Makes Repetition Safe
A retry repeats an operation.
That is safe only if repeating the operation does not repeat the side effect.
This is where idempotency matters.
Suppose a mobile client sends a payment request:
POST /payments
Idempotency-Key: checkout_789_attempt_1
The server creates a payment attempt and stores the key with the result:
| idempotency_key | result |
|---|---|
| checkout_789_attempt_1 | payment_authorized |

A small table is often enough to make the behavior explicit:
CREATE TABLE idempotency_keys (
    key TEXT PRIMARY KEY,
    operation TEXT NOT NULL,
    request_hash TEXT NOT NULL,
    response_body JSONB,
    status TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL
);
If the client times out and sends the same request again, the server should not create a second charge. It should return the stored result for the same key.
That is the difference between retrying safely and charging a customer twice.
Idempotency should be designed around the business operation, not around the HTTP request alone.
Ask:
- what side effect must not happen twice?
- who creates the idempotency key?
- how long is the key retained?
- what request fields must match for the same key?
- what result should be returned on duplicate attempts?
- how do we detect conflicting reuse of a key?
- how does this work across mobile retry, server retry, and provider retry?
If the answer is unclear, retries are not safe yet.
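Several of these questions become concrete once the lookup is written down. The sketch below uses an in-memory dictionary as a stand-in for the idempotency_keys table; `IdempotencyStore`, `IdempotencyConflict`, and the request shape are hypothetical. It is single-threaded on purpose: a real service would also need an atomic insert (for example a unique constraint) and an in-progress status so two concurrent attempts cannot both run the handler.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """The same key was reused with a different request payload."""


class IdempotencyStore:
    """In-memory stand-in for the idempotency_keys table."""

    def __init__(self):
        self._records = {}  # key -> (request_hash, response)

    def execute(self, key, request, handler):
        """Run handler(request) once per key; replay the stored response afterwards."""
        request_hash = hashlib.sha256(
            json.dumps(request, sort_keys=True).encode("utf-8")
        ).hexdigest()
        record = self._records.get(key)
        if record is not None:
            stored_hash, stored_response = record
            if stored_hash != request_hash:
                raise IdempotencyConflict(f"key {key!r} reused with a different request")
            return stored_response  # duplicate attempt: no new side effect
        response = handler(request)  # the side effect happens here, once
        self._records[key] = (request_hash, response)
        return response
```

Calling `store.execute("checkout_789_attempt_1", {"amount": 1000}, charge)` twice invokes `charge` once and returns the same result both times. Reusing the key with a different amount raises `IdempotencyConflict` instead of silently returning a result for a different request.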
Duplicate Handling Is Not Only For Payments
Payments make idempotency obvious because duplicate charges are painful.
The same principle appears everywhere:
- creating an order
- reserving inventory
- sending a receipt
- applying a coupon
- publishing an event
- processing a webhook
- consuming a queue message
- creating a support ticket
Many systems accidentally rely on "this probably won't happen twice."
In distributed systems, that is a weak promise.
The caller may retry. The message broker may redeliver. The mobile client may come back online and resend. A webhook provider may deliver the same event again. A job may restart after partial work.
Idempotency is how the system says:
"Repeated delivery is allowed. Repeated side effects are not."
Backpressure Protects The System From Itself
Backpressure means the system can tell callers, producers, or upstream components:
"Slow down. I cannot safely accept more work right now."
Without backpressure, overload spreads silently.
Queues grow. Memory fills. Latency rises. Retries increase. Workers fall further behind. Eventually users experience failures far away from the original bottleneck.
Backpressure can show up as:
- rate limits
- queue limits
- connection limits
- worker concurrency limits
- load shedding
- 429 Too Many Requests
- 503 Service Unavailable
- producer throttling
- circuit breakers
Backpressure is not about rejecting work casually.
It is about rejecting, delaying, or shedding work intentionally before the system collapses.
An honest failure is often safer than accepting work the system cannot complete.
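A queue limit is probably the simplest form of this to show in code. The sketch below assumes a single process with an in-memory queue; `BoundedIntake` and the choice of returning bare status codes are illustrative, not a prescribed interface.

```python
import queue


class BoundedIntake:
    """Accepts work into a fixed-size queue and rejects it when the queue is full."""

    def __init__(self, max_pending):
        self._queue = queue.Queue(maxsize=max_pending)
        self.rejected = 0

    def submit(self, job):
        """Return an HTTP-style status: 202 if queued, 429 if shed."""
        try:
            self._queue.put_nowait(job)
        except queue.Full:
            self.rejected += 1  # rejected work count is a signal worth exporting
            return 429
        return 202

    def take(self):
        """Called by a worker to fetch the next job (raises queue.Empty if none)."""
        return self._queue.get_nowait()
```

The important property is that `submit` never blocks and never lets the backlog grow without bound. The caller gets an immediate, explicit answer it can act on, rather than a request that sits in memory until it times out.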
Circuit Breakers Are A Coordination Tool
A circuit breaker stops calling a dependency when failures cross a threshold.
That can protect both sides:
- the caller stops wasting time on calls likely to fail
- the dependency gets a chance to recover
- the system can use fallback behavior where appropriate
But circuit breakers are not magic.
They require product decisions.
If payment authorization is unavailable, can checkout continue? Probably not.
If the recommendation service is unavailable, can the page render without recommendations? Probably yes.
If receipt email delivery is unavailable, can the order still complete? Usually yes, if receipt delivery is retried later.
The fallback is the architecture decision.
The circuit breaker only enforces it.
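To make the split between the breaker and the fallback concrete, here is a deliberately small sketch. It assumes a single-threaded caller, counts consecutive failures, and allows one trial call after a cool-down; `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are hypothetical and not taken from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_seconds=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_seconds = reset_seconds
        self._clock = clock
        self._failures = 0       # consecutive failures seen so far
        self._opened_at = None   # None means the circuit is closed

    def call(self, operation, fallback=None):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_seconds:
                # Open: skip the dependency entirely.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("dependency circuit is open")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None
        return result
```

For the recommendation case, a caller might pass `fallback=lambda: []` so the page renders without recommendations while the circuit is open. For payment authorization, passing no fallback makes the failure explicit, which matches the product decision that checkout cannot continue.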
A Practical Failure-Control Table
Use this table to avoid treating every failure with the same tool.
| Problem | Useful Control | Main Risk If Missing |
|---|---|---|
| Dependency is slow | Timeout budget | Callers pile up and user requests exceed their budget. |
| Dependency has transient errors | Retry with backoff and jitter | Temporary failures become user-visible too quickly. |
| Operation may be repeated | Idempotency key or deduplication store | Duplicate side effects corrupt user or business state. |
| Queue is growing faster than workers can process | Backpressure or autoscaling | Queue age and latency keep growing until memory fills or the work is no longer useful. |
| Dependency is failing heavily | Circuit breaker | Callers keep sending load into an unhealthy system. |
| Provider asks callers to slow down | Respect Retry-After or rate-limit headers | The client worsens overload and may be throttled harder. |
| Non-critical work competes with critical work | Priority or load shedding | Important traffic fails because optional work consumed capacity. |
The control is not the architecture by itself.
The architecture is deciding which control fits the business behavior.
Make Failure Visible
Failure controls are only useful if the team can see them.
Important signals include:
- timeout rate by dependency
- retry rate by caller and endpoint
- idempotency-key reuse and conflict rate
- queue depth and queue age
- circuit breaker open count
- rate-limit responses
- rejected work count
- downstream latency percentiles
- user-visible failure rate
The most important signal is often the ratio between original work and repeated work.
If retries become a large share of traffic, the system may be fighting itself.
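The ratio itself is simple arithmetic once original and retried requests are counted separately. A minimal sketch, assuming both counters already exist in the metrics system; `retry_share` is a hypothetical helper name.

```python
def retry_share(original_requests, retried_requests):
    """Fraction of total traffic that is repeated work (0.0 when there is no traffic)."""
    total = original_requests + retried_requests
    if total == 0:
        return 0.0
    return retried_requests / total
```

With 900 original requests and 100 retries, the share is 0.1: one request in ten is repeated work. What counts as too high depends on the system, so any alert threshold would need to come from its own baseline.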
A Small Example: Webhook Processing
Webhook processing is a quiet place where retry behavior often becomes dangerous.
A payment provider may send the same webhook more than once. The first delivery may time out. The second may arrive while the first is still being processed. A replay may arrive hours later.
A safe shape is:
Provider sends payment.succeeded webhook
Webhook endpoint validates signature
Endpoint stores event_id before side effects
Worker processes event idempotently
Worker records processed, duplicate, or failed state
Operations can see stuck webhook events
If the provider retries, the system checks event_id before creating another side effect. If processing fails after the event is stored, the worker can retry from the stored record. If the event is malformed, it can be rejected without poisoning the whole queue.
The system also needs visibility around these failure controls. Operators should know:
- how many webhook events are pending
- how old the oldest pending event is
- how many retries are happening
- how many duplicate events were ignored
- which event types fail most often
- whether processing lag affects user-visible payment state
This is the pattern: accept repeated delivery, prevent repeated side effects, and make delayed work visible.
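The store-then-process shape can be sketched with SQLite from the Python standard library. Everything here is an assumption made for illustration: the `webhook_events` table, its columns, and the function names are hypothetical, and signature validation is left out. The primary key on `event_id` is what turns a repeated delivery into a no-op.

```python
import sqlite3


def open_event_store():
    """Create an in-memory store with one row per webhook event_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE webhook_events ("
        " event_id TEXT PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )
    return conn


def receive_event(conn, event_id, payload):
    """Store the event before any side effect. Returns False for a duplicate delivery."""
    with conn:
        cursor = conn.execute(
            "INSERT OR IGNORE INTO webhook_events (event_id, payload, status)"
            " VALUES (?, ?, 'pending')",
            (event_id, payload),
        )
    return cursor.rowcount == 1


def process_pending(conn, handler):
    """Run handler once per pending event and record processed or failed."""
    rows = conn.execute(
        "SELECT event_id, payload FROM webhook_events WHERE status = 'pending'"
    ).fetchall()
    for event_id, payload in rows:
        try:
            handler(payload)
            status = "processed"
        except Exception:
            status = "failed"
        with conn:
            conn.execute(
                "UPDATE webhook_events SET status = ? WHERE event_id = ?",
                (status, event_id),
            )
```

Counting rows by `status` gives the pending, processed, and failed numbers operators need. One gap remains in this sketch: if the worker crashes after `handler` runs but before the status update, the event stays pending and will be handled again, so the handler's own side effects still need to be idempotent.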
Where To Go Deeper
The synchronous vs asynchronous communication article helps decide whether the work belongs in the request path or later.
The API design article covers rate-limit contracts and retryable error shapes.
The observability article covers the signals needed to operate failure behavior.
The change-safety article covers how to test risky behavior when staging cannot perfectly copy production.
The Kafka Mastery branch goes deeper into delivery semantics, consumer-side idempotency, retries, dead-letter topics, and the outbox pattern.
Summary
Timeouts, retries, idempotency, and backpressure are not library settings.
They are system behavior.
Timeouts define how long the system is willing to wait. Retries decide when failure gets another chance. Idempotency makes repeated attempts safe. Backpressure protects the system from accepting work it cannot handle.
Used well, they keep failure contained.
Used carelessly, they turn small failures into large incidents.