
Observability That Changes Architecture Decisions
Series
System Architecture Field Guide
11 of 12 in the series
A field guide for engineers moving into system ownership, focused on the decisions that make systems safer to change, easier to understand, and less fragile under real product pressure.
Observability is not something you add after implementation. It changes what architecture is safe to ship because it defines what the team can see, debug, and operate.
Observability is often treated like a layer added after the system is built.
The feature ships. Then someone adds logs. Then someone builds a dashboard. Then an incident happens, and the team discovers the dashboard answers the wrong questions.
That is the wrong order.
Observability should change architecture decisions before the system ships.
Not because every system needs a giant monitoring setup. It does not. Observability matters because it defines what the team can know when the system is slow, partially broken, expensive, or confusing users.
If a team cannot see the behavior that matters, it cannot safely own the system.
The Real Question Is Not "Do We Have Logs?"
Logs are useful, but they are not the goal.
The real question is:
"Can we understand what users are experiencing and why the system is behaving that way?"
That question is harder than "do we have logs?" because production problems rarely stay inside one process.
A checkout flow may involve:
- mobile app
- API gateway
- order service
- payment service
- fraud provider
- notification service
- database
- queue
- analytics pipeline
If checkout gets slow, one service log may not tell the story.
The user may see a spinner. The API may return 200 OK. The payment provider may be slow. The queue may be building up. The mobile app may time out before the backend finishes. Support may only know that customers are complaining.
Observability has to connect those pieces enough for the team to reason.
Start With The Questions Operators Need To Answer
Before choosing tools or dashboards, write the questions the team must answer during stress.
For a payment flow, those questions might be:
- Are users failing to complete checkout?
- Is the failure affecting all users or one tenant, region, app version, or payment method?
- Is the backend failing, or are clients timing out?
- Which dependency is slow or unavailable?
- Are retries helping or making pressure worse?
- Are payments being charged without receipts?
- Can support find the state of a specific payment?
- Is this a new deploy, traffic spike, vendor issue, or data problem?
Those questions shape the architecture.
If the team needs to know whether one tenant is affected, telemetry needs tenant context. If the team needs to connect client timeouts to backend work, requests need correlation IDs. If support needs payment state, the system needs stable identifiers and searchable operational data.
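As a rough illustration of carrying that context, the sketch below attaches a correlation ID and tenant ID to every structured log line. The `X-Correlation-ID` header name, the field names, and the helper functions are assumptions for this example, not something the article prescribes.

```python
import json
import uuid


def correlation_id_from(headers: dict[str, str]) -> str:
    """Reuse the caller's correlation ID if present, otherwise mint a new one.

    The header name is an assumption for this sketch.
    """
    return headers.get("X-Correlation-ID") or str(uuid.uuid4())


def log_event(event: str, *, correlation_id: str, tenant_id: str, **fields) -> str:
    """Serialize one structured log line that is searchable by request or tenant."""
    record = {
        "event": event,
        "correlation_id": correlation_id,
        "tenant_id": tenant_id,
        **fields,
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical request from a mobile client that already set a correlation ID.
headers = {"X-Correlation-ID": "req-123"}
cid = correlation_id_from(headers)
line = log_event(
    "checkout.payment_requested",
    correlation_id=cid,
    tenant_id="tenant-42",
    app_version="4.18.0",
)
print(line)
```

If the client sends the same ID it logs locally, a client-side timeout and the backend work it triggered can be found with one search.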
Observability is not only a dashboard. It is information design for operating the system.
Signals Change The Boundary
When a system boundary is created, observability has to cross it.
Suppose an API accepts an order and publishes a message:
    POST /orders
      -> create order
      -> publish OrderCreated
      -> return 202 Accepted

That design may be correct. But now the user-visible behavior continues after the request returns.
The architecture needs signals for both parts, the synchronous request and the asynchronous work that follows it:
- request accepted
- message published
- consumer received it
- downstream work completed
- downstream work failed
- user-visible status updated
If the only metric is the POST /orders success rate, the system can look healthy while the order workflow is stuck behind the queue.
The boundary changed the observability requirement.
That is why observability belongs in architecture review. A new queue, cache, dependency, SDK, mobile release, or service boundary changes what the team must be able to see.
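A minimal sketch of what "seeing across the boundary" could look like: given workflow events from both sides of the queue, find orders that were accepted but never reached a terminal state. The stage names, the `WorkflowEvent` type, and the 30-second threshold are illustrative assumptions.

```python
from dataclasses import dataclass

# Stage names are illustrative labels for the signals on both sides of the queue.
TERMINAL_STAGES = {"downstream_completed", "downstream_failed"}


@dataclass(frozen=True)
class WorkflowEvent:
    order_id: str
    stage: str
    at_seconds: float  # seconds since an arbitrary epoch


def stuck_orders(events: list[WorkflowEvent], now: float, max_age: float) -> list[str]:
    """Return orders accepted more than max_age seconds ago with no terminal stage."""
    accepted_at: dict[str, float] = {}
    finished: set[str] = set()
    for e in events:
        if e.stage == "request_accepted":
            accepted_at[e.order_id] = e.at_seconds
        elif e.stage in TERMINAL_STAGES:
            finished.add(e.order_id)
    return sorted(
        oid for oid, t in accepted_at.items()
        if oid not in finished and now - t > max_age
    )


events = [
    WorkflowEvent("o-1", "request_accepted", 0.0),
    WorkflowEvent("o-1", "message_published", 0.1),
    WorkflowEvent("o-1", "downstream_completed", 2.0),
    WorkflowEvent("o-2", "request_accepted", 1.0),   # published, never finished
    WorkflowEvent("o-2", "message_published", 1.1),
    WorkflowEvent("o-3", "request_accepted", 55.0),  # too recent to call stuck
]
print(stuck_orders(events, now=60.0, max_age=30.0))
```

All three requests returned 202, so a request-level success metric would report 100%. Only the workflow view shows that `o-2` never completed.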
Logs, Metrics, And Traces Do Different Jobs
A useful observability design does not treat logs, metrics, and traces as interchangeable.
Metrics answer trend questions:
- is latency rising?
- is error rate above normal?
- is queue age growing?
- is one tenant using more capacity than expected?
Logs answer event questions:
- what happened for this request?
- what state did this worker see?
- why was this payment rejected?
- which validation failed?
Traces answer path questions:
- where did the request spend time?
- which dependency was slow?
- which service called which service?
- where did the workflow stop?
The architecture decision is not "use all three everywhere."
The decision is which signal is needed for the failure mode you care about.
If a background job silently falls behind, queue age may matter more than request traces. If a mobile checkout fails only for one app version, client-side error categories may matter more than backend CPU. If a tenant is noisy, tenant-level metrics may matter more than aggregate dashboards.
Good observability starts from failure modes, not from tool checklists.
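The noisy or failing tenant case can be made concrete with a small sketch. The traffic numbers and tenant names below are hypothetical; the point is only that an aggregate can hide a failure that a per-tenant breakdown exposes.

```python
from collections import defaultdict


def error_rates(requests: list[tuple[str, bool]]) -> tuple[float, dict[str, float]]:
    """Return (aggregate error rate, error rate per tenant).

    Each request is (tenant_id, succeeded).
    """
    totals: dict[str, int] = defaultdict(int)
    errors: dict[str, int] = defaultdict(int)
    for tenant, ok in requests:
        totals[tenant] += 1
        if not ok:
            errors[tenant] += 1
    aggregate = sum(errors.values()) / len(requests) if requests else 0.0
    per_tenant = {t: errors[t] / totals[t] for t in totals}
    return aggregate, per_tenant


# Hypothetical traffic: one large healthy tenant, one small tenant failing every request.
requests = [("big-co", True)] * 990 + [("small-co", False)] * 10
aggregate, per_tenant = error_rates(requests)
print(f"aggregate={aggregate:.1%}")
print(per_tenant)
```

An aggregate error rate of 1% may sit below any alert threshold while one tenant is completely down, which is why the failure mode should pick the dimension.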
SLOs Turn Observability Into A Product Conversation
Dashboards can become wallpaper.
SLOs force a sharper question:
"What level of behavior do users need from this system?"
For example:
    99.5% of checkout attempts should either complete or return a user-actionable failure within 8 seconds.

This is more useful than "payment service uptime is 99.9%" because it describes a user-facing outcome.
It also changes architecture decisions.
If checkout must complete or fail clearly within 8 seconds, then dependencies need timeout budgets. Retries must fit inside the user flow. Slow paths may need fallback states. Non-critical work should move out of the critical path. Client and backend telemetry need to agree on what a checkout attempt means.
An SLO is not only an alerting rule. It is a design constraint.
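To show how the example SLO turns into something measurable, here is a minimal sketch that classifies checkout attempts as good or bad and computes how much of the 8-second deadline is left for the next dependency call. The outcome labels and the `CheckoutAttempt` type are assumptions; a real system would derive them from its own telemetry.

```python
from dataclasses import dataclass

SLO_TARGET = 0.995          # 99.5%, from the example SLO
SLO_DEADLINE_SECONDS = 8.0  # from the example SLO


@dataclass(frozen=True)
class CheckoutAttempt:
    outcome: str             # assumed labels: "completed", "actionable_failure", "unknown"
    duration_seconds: float


def is_good(attempt: CheckoutAttempt) -> bool:
    """An attempt is good if it ended with a clear result inside the deadline."""
    clear = attempt.outcome in {"completed", "actionable_failure"}
    return clear and attempt.duration_seconds <= SLO_DEADLINE_SECONDS


def slo_compliance(attempts: list[CheckoutAttempt]) -> float:
    """Fraction of attempts that met the SLO; 1.0 when there is no traffic."""
    if not attempts:
        return 1.0
    return sum(is_good(a) for a in attempts) / len(attempts)


def remaining_budget(deadline: float, elapsed: float) -> float:
    """Time left for the next dependency call inside the user-facing deadline."""
    return max(0.0, deadline - elapsed)


attempts = (
    [CheckoutAttempt("completed", 1.2)] * 990
    + [CheckoutAttempt("actionable_failure", 3.0)] * 6
    + [CheckoutAttempt("completed", 9.5)] * 3   # succeeded, but too slowly
    + [CheckoutAttempt("unknown", 2.0)] * 1     # user saw no clear result
)
compliance = slo_compliance(attempts)
print(f"{compliance:.3%} meets target: {compliance >= SLO_TARGET}")
print(remaining_budget(SLO_DEADLINE_SECONDS, elapsed=5.5))
```

Note that a clear, fast failure counts as good and a slow success counts as bad. That is the design constraint: timeouts and retries have to be budgeted against the same deadline the user experiences.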
Observability Should Include Cost
Production behavior is not only reliability.
Cost is also a system signal.
A design can be functionally correct and financially surprising. A cache miss pattern can increase database load. A retry loop can multiply external API calls. A debug log can increase storage costs. A search index can grow faster than the primary data. A tenant can consume capacity far beyond the plan.
Architectural observability should help answer:
- what does this feature cost per tenant, request, job, or workflow?
- did a deploy change usage patterns?
- are retries increasing downstream spend?
- is one customer creating noisy-neighbor pressure?
- is observability itself becoming expensive?
Cost signals do not need to be perfect on day one. But if a design has obvious cost risk, waiting for the bill is not observability.
It is archaeology.
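A cost signal can start very small. The sketch below attributes external API spend to tenants and measures how much retries multiply the call volume. The per-call price and the input shape are assumptions; substitute the real vendor rate and whatever usage records the system already has.

```python
from collections import defaultdict

# Assumed price per external API call; replace with the real vendor rate.
COST_PER_CALL = 0.002


def cost_by_tenant(calls: list[tuple[str, int]]) -> dict[str, float]:
    """Sum external-call cost per tenant.

    Each entry is (tenant_id, attempts), where attempts = 1 + retries.
    """
    totals: dict[str, float] = defaultdict(float)
    for tenant, attempts in calls:
        totals[tenant] += attempts * COST_PER_CALL
    return dict(totals)


def retry_amplification(calls: list[tuple[str, int]]) -> float:
    """Ratio of calls actually sent to calls that would be sent with no retries."""
    if not calls:
        return 1.0
    return sum(attempts for _, attempts in calls) / len(calls)


# Hypothetical usage: tenant-b's requests each needed three retries.
calls = [("tenant-a", 1), ("tenant-a", 1), ("tenant-b", 4), ("tenant-b", 4)]
print(cost_by_tenant(calls))
print(retry_amplification(calls))
```

An amplification factor that jumps after a deploy is the kind of signal that shows up weeks before the invoice does.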
Design For Debuggability, Not Just Alerting
Alerting tells the team something needs attention.
Debuggability helps the team understand what to do next.
An alert like this is weak:
    Payment errors are high.

A better alert gives direction:

    Checkout user failures above SLO.
    Most failures are PAYMENT_PROVIDER_TIMEOUT.
    Affected region: eu-west.
    Started after deploy 2026.05.28.4.
    Top impacted app version: 4.18.0.

That alert is possible only if the system was designed to emit useful dimensions.
Good debug signals often include:
- stable error codes
- correlation IDs
- app or client version
- tenant or account ID where safe
- dependency name
- region
- deploy version
- retry count
- queue age
- user-visible state
The goal is not to tag everything. The goal is to make the first 15 minutes of an incident less blind.
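A directional alert is mostly a group-by over dimensions that failure events already carry. The sketch below is one hedged way to build that summary; the event shape and the `summarize_failures` helper are hypothetical, and the dimension names come from the list of debug signals.

```python
from collections import Counter


def summarize_failures(failures: list[dict[str, str]], dimensions: list[str]) -> dict[str, str]:
    """For each dimension, report the most common value among failure events."""
    summary: dict[str, str] = {}
    for dim in dimensions:
        counts = Counter(f[dim] for f in failures if dim in f)
        if counts:
            value, count = counts.most_common(1)[0]
            summary[dim] = f"{value} ({count}/{len(failures)})"
    return summary


# Hypothetical failure events emitted with stable error codes and dimensions.
failures = [
    {"error_code": "PAYMENT_PROVIDER_TIMEOUT", "region": "eu-west", "app_version": "4.18.0"},
    {"error_code": "PAYMENT_PROVIDER_TIMEOUT", "region": "eu-west", "app_version": "4.18.0"},
    {"error_code": "PAYMENT_PROVIDER_TIMEOUT", "region": "eu-west", "app_version": "4.17.2"},
    {"error_code": "CARD_DECLINED", "region": "us-east", "app_version": "4.18.0"},
]
summary = summarize_failures(failures, ["error_code", "region", "app_version"])
for dim, top in summary.items():
    print(f"{dim}: {top}")
```

The code is trivial on purpose. The hard part is architectural: every failure event has to carry these fields before the incident starts.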
A Small Example: Receipt Delivery
Suppose an order system accepts payment and sends a receipt asynchronously.
The naive dashboard shows:
- API requests
- payment errors
- worker errors
That is a start, but it misses the user outcome.
The better observability design tracks the workflow:
    order.accepted
    payment.confirmed
    receipt.job.enqueued
    receipt.delivery.succeeded
    receipt.delivery.failed
    receipt.visible_to_user

Now the team can answer:
- are orders accepted but receipts not enqueued?
- are receipts enqueued but not delivered?
- is the provider failing?
- are retries growing?
- are users contacting support before delivery completes?
This changes the architecture. The system needs stable order IDs, receipt IDs, failure categories, and workflow state transitions. It may need a repair path if receipts fail after payment succeeds.
The observability decision revealed a product behavior the architecture had to own.
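With those workflow events keyed by a stable order ID, the gap questions become simple set operations. This is a sketch under the assumption that events arrive as `(order_id, event_name)` pairs; the storage and transport are deliberately left out.

```python
WORKFLOW = [
    "order.accepted",
    "payment.confirmed",
    "receipt.job.enqueued",
    "receipt.delivery.succeeded",
    "receipt.visible_to_user",
]


def funnel(events: list[tuple[str, str]]) -> dict[str, int]:
    """Count distinct orders that reached each workflow step."""
    reached: dict[str, set[str]] = {step: set() for step in WORKFLOW}
    for order_id, name in events:
        if name in reached:
            reached[name].add(order_id)
    return {step: len(ids) for step, ids in reached.items()}


def paid_without_receipt(events: list[tuple[str, str]]) -> list[str]:
    """Orders with a confirmed payment but no receipt visible to the user."""
    paid = {o for o, name in events if name == "payment.confirmed"}
    visible = {o for o, name in events if name == "receipt.visible_to_user"}
    return sorted(paid - visible)


# Hypothetical events: o-1 completes, o-2 fails delivery, o-3 is never enqueued.
events = [(("o-1"), step) for step in WORKFLOW]
events += [
    ("o-2", "order.accepted"),
    ("o-2", "payment.confirmed"),
    ("o-2", "receipt.job.enqueued"),
    ("o-2", "receipt.delivery.failed"),
    ("o-3", "order.accepted"),
    ("o-3", "payment.confirmed"),
]
print(funnel(events))
print(paid_without_receipt(events))
```

The list returned by `paid_without_receipt` is also the input a repair path would need, which is how the observability design and the recovery design end up sharing identifiers.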
For a useful first dashboard, the team does not need fifty charts. It needs signals that answer decisions:
| Signal | Question It Answers | Decision It Enables |
|---|---|---|
| receipt.job.oldest_age_seconds | Are receipts stuck or merely delayed? | Pause rollout, scale workers, or investigate provider failures. |
| receipt.delivery.failure_rate by provider | Is one provider causing the issue? | Fail over, throttle, or open provider incident. |
| receipt.visible_to_user.missing_count | Did backend success become user confusion? | Prioritize repair and support communication. |
| receipt.retry_count by error code | Are retries helping or amplifying load? | Tune retry policy or move failures to manual repair. |
| Support contacts tagged missing_receipt | Is the technical delay becoming a user problem? | Change user messaging or shorten escalation path. |
This is the line between observability and chart collection. A chart earns its place when it changes what the team does next.
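As one example of a signal tied to a decision, the oldest-job age can be computed from enqueue timestamps and mapped to a coarse state. The thresholds and the `classify` helper below are assumptions for illustration; a real team would pick values from its own delivery expectations.

```python
def oldest_age_seconds(enqueued_at: list[float], now: float) -> float:
    """Age of the oldest job still waiting; 0.0 when the queue is empty."""
    if not enqueued_at:
        return 0.0
    return max(0.0, now - min(enqueued_at))


def classify(age: float, delayed_after: float = 60.0, stuck_after: float = 600.0) -> str:
    """Map queue age to a coarse state. Thresholds are assumed, not recommended values."""
    if age >= stuck_after:
        return "stuck"
    if age >= delayed_after:
        return "delayed"
    return "healthy"


# Hypothetical enqueue timestamps (seconds) for receipt jobs still pending.
pending = [1_000.0, 1_240.0, 1_290.0]
age = oldest_age_seconds(pending, now=1_300.0)
print(age, classify(age))
```

Queue depth alone cannot separate a busy queue from a stalled one. Oldest age can, which is why it maps directly to the "pause rollout or scale workers" decision.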
A Practical Observability Checklist
Before shipping a system boundary or workflow, ask:
- What user-visible outcome are we trying to protect?
- What questions will support and on-call need to answer?
- Which boundaries does the workflow cross?
- What stable identifiers connect those boundaries?
- What are the critical success, failure, and delayed states?
- Which metrics show health over time?
- Which logs explain individual failures?
- Which traces or workflow events show where time was spent?
- Which dimensions matter: tenant, region, app version, dependency, deploy?
- What should alert, and what should only be visible for debugging?
- What cost signals could surprise us later?
This checklist is not about buying an observability platform.
It is about designing a system the team can understand when it matters.
Where To Go Deeper
The Kafka Mastery series goes deeper into Kafka-specific observability: lag, rebalancing, consumer behavior, and debugging event-driven systems.
Use that branch when the architecture decision involves Kafka and the implementation details matter.
Summary
Observability is not an afterthought.
It is part of the architecture because it decides what the team can know about the system under pressure.
If a design creates a new boundary, dependency, queue, cache, client behavior, or asynchronous workflow, it also creates new questions. The architecture is incomplete until the system can answer the important ones.
Good observability does not make failure disappear.
It makes failure understandable enough that teams can act.