Morteza Taghdisi

Architecture & Platform Thinking, May 29, 2026

Timeouts, Retries, Idempotency, And Backpressure

Series

System Architecture Field Guide

10 of 12 in the series

Failure handling is architecture. Timeouts, retries, idempotency, and backpressure decide whether a system degrades safely or turns a small problem into a wider incident.

Tags: architecture, reliability, distributed-systems, idempotency, production

Many production incidents start with a small failure.

One dependency gets slow. One queue falls behind. One provider returns intermittent errors. One database table starts locking more than usual.

The system does not fail because something went wrong.

Something always goes wrong.

The system fails because every caller reacts in a way that makes the original problem bigger.

That is why timeouts, retries, idempotency, and backpressure are architecture.

They decide whether failure stays local or spreads.

Timeouts Are A Budget, Not A Guess

A timeout says how long one part of the system is allowed to wait for another.

Too short, and the system fails healthy requests. Too long, and callers pile up, threads stay occupied, queues grow, and users wait for work that may never finish.

The common mistake is giving every dependency a generous timeout because it feels safer:

plaintext
Checkout request budget: 2 seconds
Payment timeout: 10 seconds
Inventory timeout: 10 seconds
Fraud timeout: 10 seconds

That does not create reliability.

It creates a request that can never meet its user-facing budget.

A better model starts from the outside:

plaintext
User-facing checkout budget: 2 seconds
  Cart validation: 150ms
  Inventory check: 250ms
  Payment authorization: 800ms
  Fraud decision: 300ms
  Application overhead: 500ms

The numbers are examples, not universal guidance.

Timeout decisions should fit inside the experience the product promises.
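One way to keep per-dependency timeouts inside the outer budget is to carry a single deadline through the request and derive each call's timeout from whatever time is left. The sketch below is a minimal illustration, not a definitive implementation. `Deadline`, `STEP_CAPS`, and `call_step` are hypothetical names, and the caps simply reuse the example numbers above.

```python
import time


class Deadline:
    """Tracks how much of a request's total time budget is left."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        # Never report a negative amount of time.
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, step_cap_seconds):
        """Timeout for one call: its own cap, or what is left if that is smaller."""
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("request budget exhausted before the call started")
        return min(step_cap_seconds, left)


# Per-step caps from the example budget, in seconds (illustrative only).
STEP_CAPS = {
    "cart_validation": 0.150,
    "inventory_check": 0.250,
    "payment_authorization": 0.800,
    "fraud_decision": 0.300,
}


def call_step(deadline, step, call):
    """Run one dependency call with a timeout that fits the remaining budget."""
    timeout = deadline.timeout_for(STEP_CAPS[step])
    return call(timeout=timeout)
```

A handler would create `Deadline(2.0)` once per request and pass it to every step. A slow early step then shrinks the time available to later steps instead of silently extending the whole request.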

If a dependency cannot reliably respond inside the budget, the architecture has to change. Maybe the workflow needs a pending state. Maybe the dependency needs caching. Maybe the operation should become async. Maybe the product cannot promise an immediate answer.

Timeouts reveal product truth.

Retries Can Heal Or Harm

Retries are useful when failure is temporary.

A network blip, a short provider outage, a transient database conflict, or a leader election can recover if the caller tries again.

Retries are harmful when every caller retries at the same time and overloads a struggling dependency.

This is the retry storm:

plaintext
Provider slows down
Callers time out
Callers retry immediately
Provider receives more traffic
Provider gets slower
More callers time out
More callers retry

The system turns a slowdown into a self-inflicted outage.

Retries need rules:

  • retry only errors that are safe to retry
  • use backoff instead of immediate loops
  • add jitter so callers do not retry together
  • cap the number of attempts
  • respect Retry-After when the provider gives it
  • stop retrying when the caller's timeout budget is exhausted
  • observe retry rate as a production signal

A retry is not a harmless second chance.

It is extra load sent into a system that may already be unhealthy.
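Several of the rules above fit in a small helper. The following is a sketch under stated assumptions: `RetryableError` is a hypothetical marker for failures the caller has decided are safe to retry, the delays and limits are placeholders, and the jitter strategy shown is "full jitter" (a random delay between zero and an exponentially growing ceiling). It does not handle Retry-After.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def retry_with_backoff(operation, *, max_attempts=4, base_delay=0.1,
                       max_delay=2.0, budget_seconds=2.0,
                       sleep=time.sleep, clock=time.monotonic,
                       rng=random.random):
    """Call operation(), retrying RetryableError with capped, jittered backoff."""
    deadline = clock() + budget_seconds
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RetryableError:
            if attempt == max_attempts:
                raise  # attempt cap reached
            # Exponential ceiling: base, 2x base, 4x base, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: spread callers across the whole interval.
            delay = rng() * ceiling
            if clock() + delay >= deadline:
                raise  # waiting would exceed the caller's budget
            sleep(delay)
```

Any other exception type propagates immediately, which keeps the "retry only errors that are safe to retry" rule explicit at the call site.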

Idempotency Makes Repetition Safe

A retry repeats an operation.

That is safe only if repeating the operation does not repeat the side effect.

This is where idempotency matters.

Suppose a mobile client sends a payment request:

http
POST /payments
Idempotency-Key: checkout_789_attempt_1

The server creates a payment attempt and stores the key with the result:

plaintext
idempotency_key           result
checkout_789_attempt_1    payment_authorized

A small table is often enough to make the behavior explicit:

sql
CREATE TABLE idempotency_keys (
  key TEXT PRIMARY KEY,
  operation TEXT NOT NULL,
  request_hash TEXT NOT NULL,
  response_body JSONB,
  status TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL
);

If the client times out and sends the same request again, the server should not create a second charge. It should return the stored result for the same key.

That is the difference between retrying safely and charging a customer twice.
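The replay behavior can be sketched with an in-memory stand-in for the `idempotency_keys` table. This is an illustration only: `IdempotencyStore`, `IdempotencyConflict`, and `execute` are hypothetical names, the request is assumed to be JSON-serializable, and the dictionary is not safe under concurrent requests. A real service would more likely rely on the table's primary-key constraint inside a transaction.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """The same key was reused with a different request body."""


class IdempotencyStore:
    """In-memory stand-in for the idempotency_keys table (single process only)."""

    def __init__(self):
        self._rows = {}

    def execute(self, key, request, handler):
        # Hash a canonical form of the request so reuse can be compared.
        request_hash = hashlib.sha256(
            json.dumps(request, sort_keys=True).encode("utf-8")
        ).hexdigest()
        row = self._rows.get(key)
        if row is not None:
            if row["request_hash"] != request_hash:
                raise IdempotencyConflict(key)
            # Same key, same request: replay the stored result, no new side effect.
            return row["response"]
        response = handler(request)
        self._rows[key] = {"request_hash": request_hash, "response": response}
        return response
```

Calling `execute` twice with `checkout_789_attempt_1` and the same body runs the handler once and returns the stored result the second time. Reusing the key with a different body raises a conflict instead of silently charging again.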

Idempotency should be designed around the business operation, not around the HTTP request alone.

Ask:

  • what side effect must not happen twice?
  • who creates the idempotency key?
  • how long is the key retained?
  • what request fields must match for the same key?
  • what result should be returned on duplicate attempts?
  • how do we detect conflicting reuse of a key?
  • how does this work across mobile retry, server retry, and provider retry?

If the answer is unclear, retries are not safe yet.

Duplicate Handling Is Not Only For Payments

Payments make idempotency obvious because duplicate charges are painful.

The same principle appears everywhere:

  • creating an order
  • reserving inventory
  • sending a receipt
  • applying a coupon
  • publishing an event
  • processing a webhook
  • consuming a queue message
  • creating a support ticket

Many systems accidentally rely on "this probably won't happen twice."

In distributed systems, that is a weak promise.

The caller may retry. The message broker may redeliver. The mobile client may come back online and resend. A webhook provider may deliver the same event again. A job may restart after partial work.

Idempotency is how the system says:

"Repeated delivery is allowed. Repeated side effects are not."

Backpressure Protects The System From Itself

Backpressure means the system can tell callers, producers, or upstream components:

"Slow down. I cannot safely accept more work right now."

Without backpressure, overload spreads silently.

Queues grow. Memory fills. Latency rises. Retries increase. Workers fall further behind. Eventually users experience failures far away from the original bottleneck.

Backpressure can show up as:

  • rate limits
  • queue limits
  • connection limits
  • worker concurrency limits
  • load shedding
  • 429 Too Many Requests
  • 503 Service Unavailable
  • producer throttling
  • circuit breakers

Backpressure is not about rejecting work casually.

It is about rejecting, delaying, or shedding work intentionally before the system collapses.

An honest failure is often safer than accepting work the system cannot complete.
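A queue limit with explicit rejection is one of the simplest forms of backpressure. The sketch below assumes a single-process intake in front of a pool of workers. `BoundedIntake` is a hypothetical name, and the choice of 202 for accepted work and 429 for shed work is an assumption about how the service reports the outcome to callers.

```python
import queue


class BoundedIntake:
    """Accepts work only while the pending queue has room."""

    def __init__(self, max_pending):
        if max_pending <= 0:
            # queue.Queue treats maxsize <= 0 as unbounded, which defeats the point.
            raise ValueError("max_pending must be positive")
        self._queue = queue.Queue(maxsize=max_pending)
        self.rejected = 0  # worth exporting as a metric

    def submit(self, job):
        """Return an HTTP-style status: 202 if accepted, 429 if shed."""
        try:
            self._queue.put_nowait(job)
        except queue.Full:
            self.rejected += 1
            return 429
        return 202

    def take(self):
        """Called by a worker to pick up the oldest pending job."""
        return self._queue.get_nowait()
```

The limit makes overload visible at the edge, where the caller can slow down, rather than deep inside the system as unbounded queue growth.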

Circuit Breakers Are A Coordination Tool

A circuit breaker stops calling a dependency for a while once failures cross a threshold, then lets a trial call through to check whether it has recovered.

That can protect both sides:

  • the caller stops wasting time on calls likely to fail
  • the dependency gets a chance to recover
  • the system can use fallback behavior where appropriate

But circuit breakers are not magic.

They require product decisions.

If payment authorization is unavailable, can checkout continue? Probably not.

If the recommendation service is unavailable, can the page render without recommendations? Probably yes.

If receipt email delivery is unavailable, can the order still complete? Usually yes, if receipt delivery is retried later.

The fallback is the architecture decision.

The circuit breaker only enforces it.
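The split between decision and enforcement shows up clearly in code. The following is a minimal sketch, assuming a breaker that counts consecutive failures and uses a fixed cool-down before letting one probe call through. `CircuitBreaker` and `CircuitOpenError` are hypothetical names, the thresholds are placeholders, and the class is not thread-safe. The `fallback` argument is where the product decision lives.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and no fallback was provided."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_seconds=30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock
        self._failures = 0       # consecutive failures seen so far
        self._opened_at = None   # None means the circuit is closed

    def call(self, operation, fallback=None):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                # Open: skip the dependency entirely.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("dependency calls are suspended")
            # Cool-down elapsed: let this one call through as a probe.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                # Open the circuit, or re-open it after a failed probe.
                self._opened_at = self._clock()
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A recommendations call might pass `fallback=lambda: []` so the page renders without them. A payment authorization call would pass no fallback, so the failure surfaces to the caller.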

A Practical Failure-Control Table

Use this table to avoid treating every failure with the same tool.

| Problem | Useful Control | Main Risk If Missing |
| --- | --- | --- |
| Dependency is slow | Timeout budget | Callers pile up and user requests exceed their budget. |
| Dependency has transient errors | Retry with backoff and jitter | Temporary failures become user-visible too quickly. |
| Operation may be repeated | Idempotency key or deduplication store | Duplicate side effects corrupt user or business state. |
| Queue is growing faster than workers can process | Backpressure or autoscaling | Latency grows until the system falls behind. |
| Dependency is failing heavily | Circuit breaker | Callers keep sending load into an unhealthy system. |
| Provider asks callers to slow down | Respect Retry-After or rate-limit headers | The client worsens overload and may be throttled harder. |
| Non-critical work competes with critical work | Priority or load shedding | Important traffic fails because optional work consumed capacity. |

The control is not the architecture by itself.

The architecture is deciding which control fits the business behavior.

Make Failure Visible

Failure controls are only useful if the team can see them.

Important signals include:

  • timeout rate by dependency
  • retry rate by caller and endpoint
  • idempotency-key reuse and conflict rate
  • queue depth and queue age
  • circuit breaker open count
  • rate-limit responses
  • rejected work count
  • downstream latency percentiles
  • user-visible failure rate

The most important signal is often the ratio between original work and repeated work.

If retries become a large share of traffic, the system may be fighting itself.
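As a rough illustration of that ratio, the share of repeated work can be computed from two counters. The function name and the idea of alerting on a threshold are assumptions for the sketch. The right threshold depends on the system.

```python
def retry_share(original_requests, retried_requests):
    """Fraction of total traffic that is retries rather than first attempts."""
    total = original_requests + retried_requests
    if total == 0:
        return 0.0
    return retried_requests / total


# Example: 900 first attempts and 100 retries means 10% of traffic is repeated work.
share = retry_share(900, 100)
```

Tracking this per caller and per dependency makes it easier to see which path is amplifying load.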

A Small Example: Webhook Processing

Webhook processing is a quiet place where retry behavior often becomes dangerous.

A payment provider may send the same webhook more than once. The first delivery may time out. The second may arrive while the first is still being processed. A later replay may arrive hours later.

A safe shape is:

plaintext
Provider sends payment.succeeded webhook
Webhook endpoint validates signature
Endpoint stores event_id before side effects
Worker processes event idempotently
Worker records processed, duplicate, or failed state
Operations can see stuck webhook events

If the provider retries, the system checks event_id before creating another side effect. If processing fails after the event is stored, the worker can retry from the stored record. If the event is malformed, it can be rejected without poisoning the whole queue.
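The store-then-process shape can be sketched with SQLite from the Python standard library. This is an illustration under assumptions: the `webhook_events` table, its columns, and the function names are hypothetical, and signature validation is left out. Because a crash between the handler and the status update would cause the event to be processed again, the handler itself is assumed to be idempotent.

```python
import sqlite3


def open_store():
    """Create an in-memory event store (a real system would use a durable database)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE webhook_events ("
        " event_id TEXT PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )
    return db


def receive(db, event_id, payload):
    """Store the event before any side effect. Returns False for a duplicate."""
    with db:  # commits on success, rolls back on error
        cursor = db.execute(
            "INSERT OR IGNORE INTO webhook_events (event_id, payload, status)"
            " VALUES (?, ?, 'pending')",
            (event_id, payload),
        )
    # rowcount is 0 when the primary key already existed and the insert was ignored.
    return cursor.rowcount == 1


def process_pending(db, handler):
    """Run the handler for each pending event and record the outcome."""
    rows = db.execute(
        "SELECT event_id, payload FROM webhook_events WHERE status = 'pending'"
    ).fetchall()
    for event_id, payload in rows:
        try:
            handler(payload)
            status = "processed"
        except Exception:
            status = "failed"  # left visible for operators or a later retry
        with db:
            db.execute(
                "UPDATE webhook_events SET status = ? WHERE event_id = ?",
                (status, event_id),
            )
```

The endpoint can acknowledge the provider as soon as `receive` returns, whether the event was new or a duplicate. The worker then runs `process_pending` separately.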

The system also needs visibility around these failure controls. Operators should know:

  • how many webhook events are pending
  • how old the oldest pending event is
  • how many retries are happening
  • how many duplicate events were ignored
  • which event types fail most often
  • whether processing lag affects user-visible payment state

This is the pattern: accept repeated delivery, prevent repeated side effects, and make delayed work visible.

Where To Go Deeper

The synchronous vs asynchronous communication article helps decide whether the work belongs in the request path or later.

The API design article covers rate-limit contracts and retryable error shapes.

The observability article covers the signals needed to operate failure behavior.

The change-safety article covers how to test risky behavior when staging cannot perfectly copy production.

The Kafka Mastery branch goes deeper into delivery semantics, consumer-side idempotency, retries, dead-letter topics, and the outbox pattern.

Summary

Timeouts, retries, idempotency, and backpressure are not library settings.

They are system behavior.

Timeouts define how long the system is willing to wait. Retries decide when failure gets another chance. Idempotency makes repeated attempts safe. Backpressure protects the system from accepting work it cannot handle.

Used well, they keep failure contained.

Used carelessly, they turn small failures into large incidents.