
What Architects Actually Decide
Series
System Architecture Field Guide
1 of 12 in the series
A field guide for engineers moving into system ownership, focused on the decisions that make systems safer to change, easier to understand, and less fragile under real product pressure.
Architecture is not mainly about picking technologies. It is about deciding boundaries, tradeoffs, ownership, risk, and how a system can keep changing safely.
Most developers first meet architecture as a diagram.
There are boxes for services, arrows for calls, a database at the bottom, maybe a queue somewhere in the middle. It looks clean. It also hides most of the real work.
Architecture is not mainly about drawing the boxes. It is about deciding what the boxes mean, who owns them, how they fail, how they change, and what tradeoffs the team is accepting by drawing the boundary there.
That is the shift from developer thinking to architect thinking.
A developer often asks, "How do I implement this feature?"
An architect has to ask a second question: "What system behavior does this feature create over time?"
Both questions matter. The first one ships the feature. The second one decides whether the feature becomes easy to operate, hard to change, expensive to scale, risky to secure, or painful for the next team that touches it.
| Developer Question | Architect Question |
|---|---|
| How do I implement this feature? | What system behavior does this create over time? |
| Does this work locally? | How does this fail, recover, and get operated? |
| Which technology should we use? | Which tradeoff fits our constraints? |
| Can this service scale? | Where are the real bottlenecks and ownership seams? |
| Can we ship this faster? | Can we ship this safely and keep changing it later? |
The Same Problem Looks Different At System Level
Imagine a team needs to add receipts to an ordering system.
At feature level, the work sounds simple:
- Add a receipt endpoint.
- Save receipt data.
- Send an email.
- Show receipt status in the app.
That is a valid implementation view. But it is not enough.
At system level, the same feature creates a different set of questions:
- Is receipt generation part of the order transaction, or does it happen after the order is accepted?
- What happens if the email provider is down?
- Can the receipt be generated twice?
- Which system owns the receipt data?
- Can mobile clients read old and new receipt formats during rollout?
- What should support teams see when receipt delivery fails?
- What does the audit trail need to prove later?
None of these questions are theoretical. They decide where the boundary goes, what data model is safe, what retries are allowed, what gets observed, and who gets paged when the system behaves badly.
That is architecture.
It is not a more senior version of coding. It is a different layer of responsibility.
Here is the practical difference in a design review:
| Feature Request | Developer Output | Architect Output |
|---|---|---|
| Send receipts after checkout. | Endpoint, table, email call, UI state. | Boundary between order and notification, retry behavior, receipt ownership, failure status, support visibility. |
| Add payment status to mobile. | New API field and client rendering. | Stable product status, compatibility window, old app behavior, provider-state hiding, telemetry by app version. |
| Split billing from checkout. | New service and moved code. | Migration sequence, data ownership, rollback path, dual-write period, contract tests, operational owner. |
The architect output is not "more documents." It is the set of decisions that prevents the feature from becoming a production puzzle later.
Architects Decide Boundaries
The most important architecture decision is often not the technology. It is the boundary.
A boundary says:
- this team owns this behavior
- this data belongs here
- this API is the contract
- this failure should not break that flow
- this part can change without forcing that part to change
Bad boundaries create permanent coordination costs.
Suppose a mobile app needs payment status. The backend team exposes the internal PaymentAttempt object directly:
{
  "attemptId": "att_123",
  "processorState": "CAPTURE_PENDING",
  "retryCount": 2,
  "gatewayCode": "PENDING_SETTLEMENT"
}

This is easy to ship because it reuses an internal model. It is also a weak boundary.
The mobile app now knows processor states, retry behavior, and gateway language. If the backend changes payment providers, the client contract may break. If different providers use different states, the app may need logic it should never own.
A stronger boundary gives the client the product concept it needs:
{
  "paymentId": "pay_123",
  "status": "processing",
  "message": "Your payment is being confirmed.",
  "canRetry": false
}

The backend still has internal payment attempts. The client gets a stable product contract.
The architecture decision is not "REST or GraphQL" yet. It is: "What concept should cross this boundary?"
The wrong concept creates coupling. The right concept gives both sides room to change.
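One way to picture that boundary is a small translation layer between the internal attempt and the client contract. The sketch below is illustrative only: the class names, the `payment_id` field on the internal model, and the state-to-status table are assumptions made for the example, not part of any real payment system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaymentAttempt:
    """Internal model: free to change when providers or retry logic change."""
    attempt_id: str
    payment_id: str        # assumed link back to the product-level payment
    processor_state: str
    retry_count: int
    gateway_code: str


@dataclass(frozen=True)
class PaymentStatusView:
    """Client-facing contract: only the product concept crosses the boundary."""
    payment_id: str
    status: str
    message: str
    can_retry: bool


# Hypothetical mapping from processor states to product statuses.
_VIEW_BY_PROCESSOR_STATE = {
    "CAPTURE_PENDING": ("processing", "Your payment is being confirmed.", False),
    "CAPTURED": ("succeeded", "Your payment was received.", False),
    "DECLINED": ("failed", "Your payment could not be completed.", True),
}

_DEFAULT_VIEW = ("processing", "Your payment is being confirmed.", False)


def to_client_view(attempt: PaymentAttempt) -> PaymentStatusView:
    # Unknown processor states fall back to a safe product status
    # instead of leaking provider vocabulary to the client.
    status, message, can_retry = _VIEW_BY_PROCESSOR_STATE.get(
        attempt.processor_state, _DEFAULT_VIEW
    )
    return PaymentStatusView(attempt.payment_id, status, message, can_retry)
```

If the backend later changes providers, only the mapping table should need to change. The client keeps reading `status`, `message`, and `can_retry`.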
Architects Decide Tradeoffs
Every architecture decision spends something.
Sometimes it spends money. Sometimes it spends simplicity. Sometimes it spends delivery speed, operational safety, team independence, or future flexibility.
The hard part is not finding a perfect design. There is no perfect design. The hard part is naming what the team is choosing and what the team is accepting.
That shows up everywhere: API compatibility, database migrations, mobile release windows, observability cost, service boundaries, and build-vs-buy decisions.
For example, dropping a database column can look like a simple cleanup. But if old application code, background jobs, reports, or mobile clients still read it, the real decision is not only "Can we remove this column?" It is "What compatibility window do we need before removal is safe?"
ALTER TABLE users DROP COLUMN full_name;

That one line may be technically valid and architecturally unsafe at the same time.
The architect's job is not to always pick the safest or most flexible path. Sometimes the direct change is acceptable. Sometimes the simpler system is the better system. The job is to know which risk the team is taking and whether that risk matches the product, team, and stage of the system.
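The compatibility-window question can be turned into an explicit check before the column is dropped. The following is a minimal sketch, assuming the team can record when each consumer last read the column (for example from query logs or client telemetry). `ColumnReader`, `blockers_for_drop`, and the 30-day default are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ColumnReader:
    name: str            # e.g. "nightly report job", "mobile app v3"
    last_read: datetime  # last time this reader was observed using the column


def blockers_for_drop(readers, now, quiet_period=timedelta(days=30)):
    """Return the names of readers seen using the column within the quiet period.

    An empty result means no known reader has touched the column recently.
    It does not prove that no unknown reader exists.
    """
    cutoff = now - quiet_period
    return sorted(r.name for r in readers if r.last_read >= cutoff)
```

The length of the quiet period is itself the tradeoff. A longer window delays the cleanup, while a shorter one accepts more risk that a slow-moving consumer, such as an old mobile release, still depends on the column.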
The next article focuses on tradeoffs directly. For this opener, the important point is simpler: architects make the cost of a decision visible before the system pays it later.
Architects Decide Ownership
A system without clear ownership becomes hard to operate even if the code is well written.
Ownership answers practical questions:
- who changes this API?
- who approves breaking changes?
- who watches the dashboard?
- who responds when this queue backs up?
- who decides whether this service can be deprecated?
- who understands the data well enough to repair it?
Architecture that ignores ownership often creates shared systems nobody fully owns.
A shared notification service sounds efficient. Every team can send email, push, and SMS through one place. But if the service owns delivery while product teams own the meaning of each notification, failures get blurry.
When a receipt email fails, is that a notification-platform incident or an order-system incident?
The answer cannot be discovered during the outage. It has to be designed before the outage.
One healthy split might be:
- the order system owns the decision that a receipt must be sent
- the notification platform owns delivery mechanics
- the order system owns user-facing receipt status
- the notification platform owns provider retries and delivery telemetry
- both systems agree on failure events and escalation rules
That is not just process. It is architecture.
The ownership model changes the API, the events, the dashboards, the alerts, and the runbook.
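As a rough illustration of how that split might show up in code, the sketch below models one agreed failure event and two functions, one per owner. The event fields, reason codes, and status strings are assumptions invented for this example. A real contract would be negotiated between the two teams.

```python
from dataclasses import dataclass
from enum import Enum


class Owner(Enum):
    ORDER_SYSTEM = "order-system"
    NOTIFICATION_PLATFORM = "notification-platform"


@dataclass(frozen=True)
class DeliveryFailed:
    """Failure event both systems agree on (hypothetical shape)."""
    order_id: str
    channel: str                      # "email", "push", or "sms"
    reason: str                       # hypothetical reason codes, see below
    provider_retries_exhausted: bool


# Reasons rooted in delivery mechanics belong to the notification platform.
PLATFORM_REASONS = {"provider_unavailable", "provider_timeout", "rate_limited"}


def escalation_owner(event: DeliveryFailed) -> Owner:
    # Decided at design time, so nobody has to argue about it during an outage.
    if event.reason in PLATFORM_REASONS:
        return Owner.NOTIFICATION_PLATFORM
    # Everything else (bad recipient, missing order data) goes to the order system.
    return Owner.ORDER_SYSTEM


def receipt_status_for_user(event: DeliveryFailed) -> str:
    # The order system owns what the user sees, whoever gets paged.
    return "failed" if event.provider_retries_exhausted else "delayed"
```

The escalation rule and the user-facing status are deliberately separate functions: the platform can change how it retries without changing what the order system shows the customer.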
Architects Decide Failure Behavior
Working software is not the same as operable software.
A feature can pass tests and still fail badly in production because nobody decided what should happen under partial failure.
Consider a checkout flow that calls:
- inventory
- payment
- receipt generation
- notification delivery
- analytics
If analytics is slow, should checkout wait?
If receipt generation fails, should the order be rejected?
If the notification provider is down, should payment be reversed?
If payment succeeds but the app times out, can the user safely retry?
These are architecture questions because they define user-visible behavior under stress.
The implementation might use timeouts, retries, queues, idempotency keys, circuit breakers, or fallback states. Those are tools. The decision comes first:
- which paths are critical?
- which paths can be delayed?
- which failures are safe to retry?
- which operations must be idempotent?
- which state must be repairable later?
- which failures should users see?
Architects do not design for failure because they are pessimistic. They design for failure because every real system has dependencies, networks, deploys, humans, and time.
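To make those decisions concrete, here is a minimal sketch under a few assumptions: inventory and payment are treated as the critical path, receipt, notification, and analytics work is handed to a queue, and an in-memory dictionary stands in for a durable idempotency store. The `Checkout` class and its collaborators are hypothetical. A production version would also need to release the inventory reservation when the charge fails.

```python
from dataclasses import dataclass, field


@dataclass
class CheckoutResult:
    order_id: str
    status: str
    needs_repair: list = field(default_factory=list)  # work to re-enqueue later


class Checkout:
    def __init__(self, reserve_inventory, charge_payment, enqueue):
        self._reserve = reserve_inventory
        self._charge = charge_payment
        self._enqueue = enqueue
        self._results = {}  # idempotency key -> result (stand-in for a durable store)

    def place_order(self, idempotency_key: str, order_id: str) -> CheckoutResult:
        # A retried request with the same key returns the stored result,
        # so a client timeout cannot cause a second charge.
        if idempotency_key in self._results:
            return self._results[idempotency_key]

        # Critical path: an exception here propagates and the order is not accepted.
        self._reserve(order_id)
        self._charge(order_id)

        result = CheckoutResult(order_id, "accepted")

        # Deferrable path: failures are recorded for later repair, not surfaced
        # as a failed order.
        for task in ("receipt", "notification", "analytics"):
            try:
                self._enqueue(task, order_id)
            except Exception:
                result.needs_repair.append(task)

        self._results[idempotency_key] = result
        return result
```

Which steps sit above or below the "accepted" line is the architecture decision. The code only records it.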
Architects Decide Change Paths
A system is not designed once. It is changed repeatedly.
That is why architecture has to care about sequencing.
The question is not only "What should the final design look like?" It is also "How do we get there without freezing delivery or breaking production?"
Suppose a team wants to split billing out of a monolith.
The naive plan is to create a billing service and move the billing code.
The architectural plan asks more:
- which data moves first?
- which API becomes the boundary?
- which writes must stay in the monolith during migration?
- which reads can move safely?
- how will old jobs and reports keep working?
- what does rollback mean after data starts moving?
- how do we know the new service is correct?
The migration path is part of the architecture.
A design that looks clean only after a risky big-bang rewrite is not necessarily a good design. Many real architecture decisions are valuable because they let the team move in smaller, safer steps.
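One common way to take such a step is a shadow read: the existing code path stays authoritative while the new service is queried and compared in the background. The sketch below assumes two read functions that return comparable values. `BillingReadRouter` and its parameters are illustrative names, not an existing API.

```python
class BillingReadRouter:
    """Serve billing reads while comparing the monolith with the new service."""

    def __init__(self, legacy_read, new_read, serve_from_new=False):
        self._legacy_read = legacy_read
        self._new_read = new_read
        self._serve_from_new = serve_from_new
        self.mismatches = []  # (invoice_id, description) pairs for review

    def get_invoice(self, invoice_id):
        legacy = self._legacy_read(invoice_id)
        try:
            candidate = self._new_read(invoice_id)
        except Exception as exc:
            # A failing new service must not break the existing read path.
            self.mismatches.append((invoice_id, f"new read failed: {exc}"))
            return legacy

        if candidate != legacy:
            # On disagreement the monolith stays authoritative.
            self.mismatches.append((invoice_id, "values differ"))
            return legacy

        return candidate if self._serve_from_new else legacy
```

The mismatch list is one possible answer to "how do we know the new service is correct?" Flipping `serve_from_new` back to `False` is a cheap rollback for reads. Rollback for writes is a harder question that this sketch does not address.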
Architects Decide What Must Be Visible
Observability is often treated as something added after implementation.
That is too late.
If a design creates a new boundary, asynchronous workflow, retry loop, cache, or external dependency, it also creates new questions the team must be able to answer:
- are requests failing or just slow?
- is one tenant affected or every tenant?
- is the queue growing?
- are retries helping or making pressure worse?
- are old mobile app versions hitting a deprecated endpoint?
- is cost rising because usage grew or because a design changed?
If the system cannot answer those questions, the architecture is incomplete.
This does not mean every service needs a giant dashboard. It means every important design decision should include the signals that make it operable.
The architect asks: "What will we need to know when this fails at 2 a.m.?"
That question changes the design earlier than most teams expect.
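As a small illustration, the sketch below records requests with the labels needed to answer a few of the questions above: which app versions still hit an endpoint, how often it fails, and how much of the traffic is retries. It uses an in-memory counter purely for demonstration. The class name, label names, and the `/v1/receipts` path in the usage note are assumptions, and a real system would emit these through its metrics library.

```python
from collections import Counter


class EndpointSignals:
    def __init__(self):
        self.requests = Counter()  # (endpoint, app_version, outcome) -> count
        self.retries = Counter()   # endpoint -> number of retried requests

    def record(self, endpoint, app_version, outcome, attempt=1):
        self.requests[(endpoint, app_version, outcome)] += 1
        if attempt > 1:
            self.retries[endpoint] += 1

    def versions_hitting(self, endpoint):
        """Which app versions still call this endpoint?"""
        return sorted({v for (e, v, _outcome) in self.requests if e == endpoint})

    def failure_rate(self, endpoint):
        """Fraction of recorded requests to this endpoint with outcome 'error'."""
        total = sum(n for (e, _v, _o), n in self.requests.items() if e == endpoint)
        failed = sum(
            n for (e, _v, o), n in self.requests.items()
            if e == endpoint and o == "error"
        )
        return failed / total if total else 0.0
```

For example, calling `versions_hitting("/v1/receipts")` before deprecating that endpoint shows whether old mobile releases would be affected.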
Architects Decide When Not To Add Architecture
Architecture is not a license to add more structure.
Sometimes the best decision is to keep the system plain.
A modular monolith can be better than five services. A relational database can be better than a pile of specialized stores. A synchronous API can be better than an event flow. A checklist can be better than a platform.
The point is not to avoid complexity forever. The point is to spend complexity only where the system earns it.
Before adding a new service, queue, cache, framework, platform layer, or abstraction, the architect should ask:
- what problem does this reduce?
- what new failure mode does it create?
- who will operate it?
- how will we test it?
- what will it cost?
- what will be harder after this?
- can a simpler boundary solve the same problem?
Good architecture often feels boring from the outside. That is not a weakness. It usually means the complexity is being kept where it belongs.
The Trunk And The Branches
This series is the trunk of the architecture map.
The trunk is about decisions:
- where the boundary belongs
- what tradeoff is acceptable
- what failure behavior the product needs
- what must be observable
- who owns the system after it ships
- how the system can change later
Some topics need deeper branches.
SDK architecture has implementation depth around public API surface, initialization, error models, distribution, and mobile constraints. That depth belongs in the Mobile SDK Design series.
Kafka has depth around event contracts, delivery semantics, idempotency, outbox patterns, testing, and observability. That depth belongs in the Kafka Mastery series.
Persistence has depth around transactions, entity lifecycle, performance traps, and production transaction patterns. That depth belongs in the JPA in Production series.
This series does not pretend to teach all of that detail every time. It will teach when the topic matters, what decision is being made, and what risk the team is accepting. Then it will point to a branch when implementation depth is useful.
The trunk is not a shortcut around depth. It is the map that tells you which depth matters.
A Practical Decision Checklist
When a feature starts to look architectural, use these questions to identify which decisions need to be made:
- What user or business behavior are we protecting?
- What boundary are we creating or changing?
- Which team owns the behavior after it ships?
- Which data becomes the source of truth?
- Which failure behavior will users notice?
- Which part of the system must remain compatible during rollout?
- Which signals must exist before this can be operated?
- Which decision is reversible, and which one is hard to undo?
- Which deeper topic belongs in a branch article instead of this decision note?
These questions do not replace experience. They slow down the part of the work where teams often move too fast.
Summary
Architects do not only choose technologies.
They decide boundaries, tradeoffs, ownership, failure behavior, change paths, and visibility. They decide what the system is allowed to become.
That is why architecture is not separate from delivery. It shapes whether delivery stays safe as the system grows.
The developer question is still necessary: "How do we implement this?"
The architect question adds the part that keeps the system healthy over time: "What does this decision make easier, harder, safer, riskier, cheaper, or more expensive for everyone who has to live with it later?"
That is the question this series will keep coming back to.