Morteza Taghdisi

Writing9 min read
Abstract technical illustration representing a staged database migration across live production traffic
Architecture & Platform ThinkingMay 28, 2026

Database Migrations Without Breaking Production

Series

System Architecture Field Guide

9 of 12 in the series

Article 9 of 12

A database migration is not just a schema change. It is a production rollout that must account for old code, live traffic, background jobs, rollback limits, and data repair.

databasemigrationsarchitecturereliabilityproduction

A database migration looks harmless when it is one line.

sql
ALTER TABLE users DROP COLUMN full_name;

The database may accept it. The migration may pass locally. The code review may look small.

Production is where that one line becomes a system decision.

Old application code may still read the column. A background job may write it. A report may depend on it. A mobile API may still return it. A rollback may need it. A support tool may break without it.

That is why database migrations are architecture.

Not every migration is dangerous. Adding a nullable column is usually simple. Renaming a field that live code depends on is not. Changing a data model under traffic is not the same as changing a local schema.

The architecture question is:

"How do we change the data model while old code, live traffic, and rollback paths still exist?"

The Naive Migration

Imagine a system stores a user's name in one column:

sql
CREATE TABLE users (
  id UUID PRIMARY KEY,
  full_name TEXT NOT NULL
);

The product now needs separate first and last names.

The naive migration is direct:

sql
ALTER TABLE users
  DROP COLUMN full_name,
  ADD COLUMN first_name TEXT NOT NULL,
  ADD COLUMN last_name TEXT NOT NULL;

This is tidy in a migration file. It is risky in a running system.

The old application code may still read full_name. The new columns do not have values for existing rows. A rollback to the previous application version will fail because the old column is gone. Any job, export, integration, or support tool that expects full_name breaks immediately.

The problem is not SQL syntax.

The problem is sequencing.

Production Has More Than One Version

During a migration, production rarely has one clean version of the world.

For a while, the system may contain:

  • old application code
  • new application code
  • old background jobs
  • new background jobs
  • old mobile clients
  • cached data
  • reporting queries
  • support tooling
  • in-flight requests
  • rollback candidates

The schema has to survive that mixed state.

This is why safe migrations often look less elegant than the final model. They carry temporary duplication so the system can move in steps.

That temporary duplication is not a smell by itself. It is the price of changing a live system safely.

Expand And Contract

The usual safe pattern is expand and contract.

First, expand the schema without breaking old code:

sql
ALTER TABLE users
  ADD COLUMN first_name TEXT,
  ADD COLUMN last_name TEXT;

Old code can still use full_name. New code can start writing first_name and last_name.

Then update the application to write both models:

ts
await db.users.update(userId, {
  full_name: `${firstName} ${lastName}`,
  first_name: firstName,
  last_name: lastName,
})

This is not beautiful. It is intentional.

The system is now compatible with old readers and new readers.

Next, backfill existing data:

sql
UPDATE users
SET
  first_name = split_part(full_name, ' ', 1),
  last_name = substring(full_name from position(' ' in full_name) + 1)
WHERE first_name IS NULL
  AND last_name IS NULL;

That example is simplified. Real names are messy. The point is not the parsing logic. The point is that backfill is a production operation, not a quick local cleanup.

After the backfill, move reads to the new columns. Then observe. Then stop writing full_name. Only after rollback no longer needs the old column should the system contract:

sql
ALTER TABLE users DROP COLUMN full_name;

The final schema is clean. The path there is deliberately staged.

PhaseWhat ChangesOld Code Safe?Rollback Still Realistic?
Expand schemaAdd nullable new columns.Yes.Yes.
Dual writeWrite old and new shapes.Yes.Mostly yes, if both shapes stay correct.
BackfillFill new columns for existing rows.Yes, if the old shape remains.Sometimes, but data repair may be needed.
Switch readsRead from the new columns.Yes, if old writes continue.Yes, if dual writes continue.
Stop old writesMake the new model primary.Usually no.Limited. Forward fix may be safer.
Contract schemaDrop the old column.No.Usually no, unless restore or repair is planned.

The table is intentionally conservative. A migration becomes dangerous when the team crosses from "old code is still safe" into "old code cannot run anymore" without noticing.

Backfills Are Production Work

A backfill is not just a script.

It reads and writes real data under real load. It can lock rows, fill logs, increase replication lag, slow down application queries, and create partial progress states.

A safe backfill should be designed like operational work:

  • run in batches
  • be idempotent
  • track progress
  • be resumable
  • limit write pressure
  • avoid long locks
  • expose metrics
  • support pause and rollback strategy where possible

Instead of one large update:

sql
UPDATE users SET migrated = true;

prefer batch-oriented work:

sql
UPDATE users
SET migrated = true
WHERE id IN (
  SELECT id
  FROM users
  WHERE migrated = false
  ORDER BY id
  LIMIT 1000
);

Even this is only a sketch. The real implementation depends on the database, table size, indexes, replication model, and workload.

The architecture decision is whether the migration can run safely while the product keeps serving traffic.

Rollback Has Limits

Rollback is easy to say and hard to mean.

If a deployment changes code only, rollback may be straightforward.

If a migration changes data shape, rollback may not restore the old world.

Once new code writes to first_name and last_name, rolling back to code that only understands full_name may lose information unless dual writes continue. Once the old column is dropped, rollback cannot magically bring it back with correct data.

This means migration planning has to separate:

  • application rollback
  • schema rollback
  • data repair
  • forward fix

Sometimes the safest rollback is not reversing the migration. It is keeping compatibility layers in place long enough to roll the application back safely.

Before running a migration, ask:

  • can old code still run after this schema change?
  • can new code run before this backfill finishes?
  • what happens if the deploy fails halfway?
  • what data would be lost if we roll back?
  • is rollback realistic, or do we need a forward fix plan?

Rollback is not a checkbox. It is a design constraint.

Verification Matters More Than Hope

A migration is not done when the script finishes.

It is done when the system proves the new model is correct enough to rely on.

Verification can include:

  • row counts
  • null checks
  • consistency checks between old and new columns
  • sampling
  • application metrics
  • error rates
  • support reports
  • consumer usage of old fields

For a dual-write period, compare old and new values:

sql
SELECT COUNT(*)
FROM users
WHERE full_name IS NOT NULL
  AND (first_name IS NULL OR last_name IS NULL);

Then compare behavior, not only data. Are API responses correct? Are background jobs still working? Are reports using the new model? Are old clients still safe?

The database can tell you whether rows changed. The system has to tell you whether behavior is still correct.

Background Jobs And Reports Count Too

Many migration failures come from code paths outside the main request flow.

A web endpoint may be updated correctly while a nightly export still reads the old column. A background job may write only the old model. A support dashboard may break because it uses a raw query. An analytics pipeline may silently stop receiving a field.

Before changing schema, identify consumers:

  • application reads
  • application writes
  • background jobs
  • scheduled tasks
  • reports
  • exports
  • support tools
  • data pipelines
  • mobile or partner APIs

If the only migration plan is "update the service code," it is probably incomplete.

The system is bigger than the service.

Make The Migration Visible

A migration should be observable while it runs.

Useful signals include:

  • migration progress
  • batch duration
  • rows processed
  • rows skipped
  • error count
  • database CPU and locks
  • replication lag
  • application error rate
  • old-field read usage
  • old-field write usage

Without visibility, the team has only two states: "seems fine" and "incident."

That is not enough.

A safe migration gives operators a way to pause, inspect, and continue.

A Practical Migration Checklist

Before running a production migration, ask:

  1. Is this change backward-compatible with old application code?
  2. Can new code run before the migration is fully complete?
  3. Which jobs, reports, APIs, and tools read or write this data?
  4. Does the migration need expand and contract?
  5. Is the backfill batched, resumable, and idempotent?
  6. What metrics show migration progress and pressure?
  7. What verification proves the new model is safe to read from?
  8. What does application rollback mean during each phase?
  9. What data repair plan exists if the migration partially succeeds?
  10. When is it safe to remove the old schema?

These questions are practical because migration failures are practical.

They happen in jobs, reports, rollback paths, old clients, and data repair scripts. The architecture has to include those paths.

Where To Go Deeper

The JPA in Production series goes deeper into transaction mechanics, persistence behavior, and production transaction patterns.

Use that branch when the migration risk depends on how application code reads, writes, flushes, or rolls back data.

Summary

A database migration is not just a schema edit.

It is a production rollout across code, data, jobs, reports, clients, rollback paths, and operational visibility.

Safe migrations usually look less elegant than the final schema because they carry temporary compatibility. That is not accidental complexity. It is the cost of changing a live system without breaking it.

The architect's job is to design the path, not only the destination.