Timeouts, Retries, Idempotency, And Backpressure

Many production incidents start with a small failure.

One dependency gets slow. One queue falls behind. One provider returns intermittent errors. One database table starts locking more than usual.

The system does not fail because something went wrong.

Something always goes wrong.

The system fails because every caller reacts in a way that makes the original problem bigger.

That is why timeouts, retries, idempotency, and backpressure are architecture.

They decide whether failure stays local or spreads.

Timeouts Are A Budget, Not A Guess

A timeout says how long one part of the system is allowed to wait for another.

Too short, and the system fails healthy requests. Too long, and callers pile up, threads stay occupied, queues grow, and users wait for work that may never finish.

The common mistake is giving every dependency a generous timeout because it feels safer:

```plaintext
Checkout request budget: 2 seconds
Payment timeout: 10 seconds
Inventory timeout: 10 seconds
Fraud timeout: 10 seconds
```

That does not create reliability.

It creates a request that can never meet its user-facing budget.

A better model starts from the outside:

```plaintext
User-facing checkout budget: 2 seconds
  Cart validation: 150ms
  Inventory check: 250ms
  Payment authorization: 800ms
  Fraud decision: 300ms
  Application overhead: 500ms
```

The numbers are examples, not universal guidance.

Timeout decisions should fit inside the experience the product promises.
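One way to keep per-dependency timeouts inside the outer budget is to carry a deadline through the request and derive each call's timeout from whatever time is left. The sketch below is illustrative only: the `Deadline` class, the `STEP_CAPS` names, and the injectable `clock` parameter are assumptions made for this example, and the caps simply mirror the example numbers.

```python
import time


class Deadline:
    """Tracks how much of a request's total time budget is left."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, step_cap_seconds):
        """Timeout for one call: the step's own cap or the time left, whichever is smaller."""
        remaining = self.remaining()
        if remaining <= 0:
            raise TimeoutError("request budget exhausted")
        return min(step_cap_seconds, remaining)


# Hypothetical per-step caps in seconds, copied from the example budget.
STEP_CAPS = {
    "cart_validation": 0.150,
    "inventory_check": 0.250,
    "payment_authorization": 0.800,
    "fraud_decision": 0.300,
}

deadline = Deadline(budget_seconds=2.0)
payment_timeout = deadline.timeout_for(STEP_CAPS["payment_authorization"])
```

The value returned by `timeout_for` would be passed to whatever client makes the call. If earlier steps ran slow, later steps get less time instead of silently pushing the request past its budget.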

If a dependency cannot reliably respond inside the budget, the architecture has to change. Maybe the workflow needs a pending state. Maybe the dependency needs caching. Maybe the operation should become async. Maybe the product cannot promise an immediate answer.

Timeouts reveal product truth.

Retries Can Heal Or Harm

Retries are useful when failure is temporary.

A network blip, a short provider outage, a transient database conflict, or a leader election can recover if the caller tries again.

Retries are harmful when every caller retries at the same time and overloads a struggling dependency.

This is the retry storm:

```plaintext
Provider slows down
Callers time out
Callers retry immediately
Provider receives more traffic
Provider gets slower
More callers time out
More callers retry
```

The system turns a slowdown into a self-inflicted outage.

Retries need rules:

retry only errors that are safe to retry
use backoff instead of immediate loops
add jitter so callers do not retry together
cap the number of attempts
respect Retry-After when the provider gives it
stop retrying when the caller's timeout budget is exhausted
observe retry rate as a production signal

A retry is not a harmless second chance.

It is extra load sent into a system that may already be unhealthy.
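Several of the rules above can be combined in a small helper. This is a minimal sketch, not a production client: `TransientError`, `backoff_delay`, and `call_with_retries` are hypothetical names, the default values are arbitrary, and `Retry-After` handling is left out to keep it short. It retries only errors marked as retryable, uses exponential backoff with full jitter, caps attempts, and stops when the next wait would overrun the caller's budget.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are safe to retry."""


def backoff_delay(attempt, base=0.1, cap=2.0, rng=random.random):
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(operation, max_attempts=3, budget_seconds=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Retry `operation` on TransientError with capped attempts and a time budget."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    deadline = clock() + budget_seconds
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            delay = backoff_delay(attempt)
            out_of_attempts = attempt == max_attempts - 1
            out_of_budget = clock() + delay >= deadline
            # Give up instead of sending more load once either limit is hit.
            if out_of_attempts or out_of_budget:
                raise
            sleep(delay)
```

Any exception other than `TransientError` propagates on the first attempt, which is the "retry only errors that are safe to retry" rule in code form.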

Idempotency Makes Repetition Safe

A retry repeats an operation.

That is safe only if repeating the operation does not repeat the side effect.

This is where idempotency matters.

Suppose a mobile client sends a payment request:

```http
POST /payments
Idempotency-Key: checkout_789_attempt_1
```

The server creates a payment attempt and stores the key with the result:

```plaintext
idempotency_key           result
checkout_789_attempt_1    payment_authorized
```

A small table is often enough to make the behavior explicit:

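```sql
CREATE TABLE idempotency_keys (
  key TEXT PRIMARY KEY,            -- idempotency key supplied with the request
  operation TEXT NOT NULL,         -- business operation the key applies to
  request_hash TEXT NOT NULL,      -- hash of the request fields that must match on reuse
  response_body JSONB,             -- stored result returned on duplicate attempts
  status TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL  -- supports key retention and expiry
);
```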

If the client times out and sends the same request again, the server should not create a second charge. It should return the stored result for the same key.

That is the difference between retrying safely and charging a customer twice.
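The lookup-then-store behavior can be sketched in a few lines. This is an in-memory illustration under stated assumptions: `PaymentService`, `IdempotencyConflict`, and the `charges_created` counter are hypothetical, the dictionary stands in for the `idempotency_keys` table, and there is no handling of concurrent requests or in-progress state, which a real implementation would need (for example through a unique constraint on the key).

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different request body."""


class PaymentService:
    """In-memory stand-in for the idempotency_keys table; not safe for concurrent use."""

    def __init__(self):
        self._records = {}  # idempotency key -> (request_hash, stored result)
        self.charges_created = 0

    def create_payment(self, idempotency_key, request):
        request_hash = hashlib.sha256(
            json.dumps(request, sort_keys=True).encode("utf-8")
        ).hexdigest()

        existing = self._records.get(idempotency_key)
        if existing is not None:
            stored_hash, stored_result = existing
            if stored_hash != request_hash:
                raise IdempotencyConflict(idempotency_key)
            return stored_result  # duplicate attempt: no new side effect

        self.charges_created += 1  # the side effect that must not repeat
        result = {"status": "payment_authorized", "amount": request["amount"]}
        self._records[idempotency_key] = (request_hash, result)
        return result
```

Calling `create_payment` twice with the same key and body returns the same result and leaves `charges_created` at 1. Reusing the key with a different body raises `IdempotencyConflict` rather than guessing which request the client meant.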

Idempotency should be designed around the business operation, not around the HTTP request alone.

Ask:

what side effect must not happen twice?
who creates the idempotency key?
how long is the key retained?
what request fields must match for the same key?
what result should be returned on duplicate attempts?
how do we detect conflicting reuse of a key?
how does this work across mobile retry, server retry, and provider retry?

If the answer is unclear, retries are not safe yet.

Duplicate Handling Is Not Only For Payments

Payments make idempotency obvious because duplicate charges are painful.

The same principle appears everywhere:

creating an order
reserving inventory
sending a receipt
applying a coupon
publishing an event
processing a webhook
consuming a queue message
creating a support ticket

Many systems accidentally rely on "this probably won't happen twice."

In distributed systems, that is a weak promise.

The caller may retry. The message broker may redeliver. The mobile client may come back online and resend. A webhook provider may deliver the same event again. A job may restart after partial work.

Idempotency is how the system says:

"Repeated delivery is allowed. Repeated side effects are not."

Backpressure Protects The System From Itself

Backpressure means the system can tell callers, producers, or upstream components:

"Slow down. I cannot safely accept more work right now."

Without backpressure, overload spreads silently.

Queues grow. Memory fills. Latency rises. Retries increase. Workers fall further behind. Eventually users experience failures far away from the original bottleneck.

Backpressure can show up as:

rate limits
queue limits
connection limits
worker concurrency limits
load shedding
429 Too Many Requests
503 Service Unavailable
producer throttling
circuit breakers

Backpressure is not about rejecting work casually.

It is about rejecting, delaying, or shedding work intentionally before the system collapses.

An honest failure is often safer than accepting work the system cannot complete.
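A queue limit is one of the simplest forms of backpressure to show in code. The sketch below assumes a hypothetical `BoundedIntake` wrapper around Python's standard `queue.Queue`; the 202 and 429 return values imitate HTTP status codes and are illustrative, not tied to any framework.

```python
import queue


class BoundedIntake:
    """Accepts work only while the queue has room; otherwise tells the caller to back off."""

    def __init__(self, max_pending):
        # max_pending is assumed to be positive; queue.Queue treats 0 as unbounded.
        self._queue = queue.Queue(maxsize=max_pending)
        self.rejected = 0

    def submit(self, job):
        """Return an HTTP-style status: 202 if queued, 429 if the queue is full."""
        try:
            self._queue.put_nowait(job)
        except queue.Full:
            self.rejected += 1  # count rejections so they show up as a signal
            return 429
        return 202

    def take(self):
        """Called by a worker to pull the next job; raises queue.Empty if there is none."""
        return self._queue.get_nowait()
```

The producer learns immediately that the system is full, instead of discovering it later through rising latency or memory pressure.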

Circuit Breakers Are A Coordination Tool

A circuit breaker stops calling a dependency when failures cross a threshold.

That can protect both sides:

the caller stops wasting time on calls likely to fail
the dependency gets a chance to recover
the system can use fallback behavior where appropriate

But circuit breakers are not magic.

They require product decisions.

If payment authorization is unavailable, can checkout continue? Probably not.

If the recommendation service is unavailable, can the page render without recommendations? Probably yes.

If receipt email delivery is unavailable, can the order still complete? Usually yes, if receipt delivery is retried later.

The fallback is the architecture decision.

The circuit breaker only enforces it.
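To make the split between mechanism and decision concrete, here is a minimal single-threaded sketch. The `CircuitBreaker` class, its thresholds, and the injectable `clock` are assumptions for illustration. It counts consecutive failures, opens after a threshold, and lets one trial call through after a cooldown. The `fallback` argument is where the product decision lives.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures and allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._cooldown:
                return fallback()  # open: skip the dependency entirely
            # Cooldown elapsed: fall through and let this call probe the dependency.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```

For a recommendations call, `fallback` might return an empty list so the page still renders. For payment authorization there may be no acceptable fallback, in which case the fallback would surface a clear failure to the user.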

A Practical Failure-Control Table

Use this table to avoid treating every failure with the same tool.

| Problem | Useful Control | Main Risk If Missing |
| --- | --- | --- |
| Dependency is slow | Timeout budget | Callers pile up and user requests exceed their budget. |
| Dependency has transient errors | Retry with backoff and jitter | Temporary failures become user-visible too quickly. |
| Operation may be repeated | Idempotency key or deduplication store | Duplicate side effects corrupt user or business state. |
| Queue is growing faster than workers can process | Backpressure or autoscaling | Latency grows until the system falls behind. |
| Dependency is failing heavily | Circuit breaker | Callers keep sending load into an unhealthy system. |
| Provider asks callers to slow down | Respect `Retry-After` or rate-limit headers | The client worsens overload and may be throttled harder. |
| Non-critical work competes with critical work | Priority or load shedding | Important traffic fails because optional work consumed capacity. |

The control is not the architecture by itself.

The architecture is deciding which control fits the business behavior.

Make Failure Visible

Failure controls are only useful if the team can see them.

Important signals include:

timeout rate by dependency
retry rate by caller and endpoint
idempotency-key reuse and conflict rate
queue depth and queue age
circuit breaker open count
rate-limit responses
rejected work count
downstream latency percentiles
user-visible failure rate

The most important signal is often the ratio between original work and repeated work.

If retries become a large share of traffic, the system may be fighting itself.
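The ratio itself is simple arithmetic over two counters. In the sketch below the function names and the 20% alert threshold are hypothetical; the right threshold depends on the system's normal retry rate.

```python
def retry_share(original_requests, retry_requests):
    """Fraction of total outbound calls that were retries (0.0 when there is no traffic)."""
    total = original_requests + retry_requests
    return retry_requests / total if total else 0.0


def retries_need_attention(original_requests, retry_requests, threshold=0.2):
    """True when retries exceed the share of traffic chosen as an alert line."""
    return retry_share(original_requests, retry_requests) > threshold
```

With 700 original calls and 300 retries, `retry_share` returns 0.3, so nearly a third of the load on the dependency is repeated work.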

A Small Example: Webhook Processing

Webhook processing is a quiet place where retry behavior often becomes dangerous.

A payment provider may send the same webhook more than once. The first delivery may time out. The second may arrive while the first is still being processed. A later replay may arrive hours later.

A safe shape is:

```plaintext
Provider sends payment.succeeded webhook
Webhook endpoint validates signature
Endpoint stores event_id before side effects
Worker processes event idempotently
Worker records processed, duplicate, or failed state
Operations can see stuck webhook events
```

If the provider retries, the system checks event_id before creating another side effect. If processing fails after the event is stored, the worker can retry from the stored record. If the event is malformed, it can be rejected without poisoning the whole queue.
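The store-first, process-later shape can be sketched as follows. This is an in-memory illustration under stated assumptions: `WebhookInbox` and its methods are hypothetical names, the dictionary stands in for a durable table with a unique constraint on `event_id`, and signature validation is assumed to have happened before `receive` is called.

```python
class WebhookInbox:
    """In-memory stand-in for a durable event table keyed by the provider's event_id."""

    def __init__(self):
        self.events = {}  # event_id -> {"payload": ..., "state": ...}
        self.duplicates_ignored = 0

    def receive(self, event_id, payload):
        """Store the event before any side effect; report whether it was new."""
        if event_id in self.events:
            self.duplicates_ignored += 1
            return "duplicate"
        self.events[event_id] = {"payload": payload, "state": "pending"}
        return "accepted"

    def process_pending(self, handler):
        """Run the handler once per pending or failed event and record the outcome."""
        for record in self.events.values():
            if record["state"] == "processed":
                continue
            try:
                handler(record["payload"])
                record["state"] = "processed"
            except Exception:
                record["state"] = "failed"  # stays stored, so a later run can retry it
```

Because a failed event is retried from the stored record, the handler itself still has to be idempotent: it may have completed part of its work before failing.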

The system also needs visibility around the failure controls.

Operators should know:

how many webhook events are pending
how old the oldest pending event is
how many retries are happening
how many duplicate events were ignored
which event types fail most often
whether processing lag affects user-visible payment state

This is the pattern: accept repeated delivery, prevent repeated side effects, and make delayed work visible.

Where To Go Deeper

The synchronous vs asynchronous communication article helps decide whether the work belongs in the request path or later.

The API design article covers rate-limit contracts and retryable error shapes.

The observability article covers the signals needed to operate failure behavior.

The change-safety article covers how to test risky behavior when staging cannot perfectly copy production.

The Kafka Mastery branch goes deeper into delivery semantics, consumer-side idempotency, retries, dead-letter topics, and the outbox pattern.

Summary

Timeouts, retries, idempotency, and backpressure are not library settings.

They are system behavior.

Timeouts define how long the system is willing to wait. Retries decide when failure gets another chance. Idempotency makes repeated attempts safe. Backpressure protects the system from accepting work it cannot handle.

Used well, they keep failure contained.

Used carelessly, they turn small failures into large incidents.

Morteza Taghdisi


System Architecture Field Guide
