Database Migrations Without Breaking Production

A database migration looks harmless when it is one line.

sql

ALTER TABLE users DROP COLUMN full_name;

The database may accept it. The migration may pass locally. The code review may look small.

Production is where that one line becomes a system decision.

Old application code may still read the column. A background job may write it. A report may depend on it. A mobile API may still return it. A rollback may need it. A support tool may break without it.

That is why database migrations are architecture.

Not every migration is dangerous. Adding a nullable column is usually simple. Renaming a field that live code depends on is not. Changing a data model under traffic is not the same as changing a local schema.

The architecture question is:

"How do we change the data model while old code, live traffic, and rollback paths still exist?"

The Naive Migration

Imagine a system stores a user's name in one column:

sql

CREATE TABLE users (
  id UUID PRIMARY KEY,
  full_name TEXT NOT NULL
);

The product now needs separate first and last names.

The naive migration is direct:

sql

ALTER TABLE users
  DROP COLUMN full_name,
  ADD COLUMN first_name TEXT NOT NULL,
  ADD COLUMN last_name TEXT NOT NULL;

This is tidy in a migration file. It is risky in a running system.

The old application code may still read full_name. The new columns do not have values for existing rows. A rollback to the previous application version will fail because the old column is gone. Any job, export, integration, or support tool that expects full_name breaks immediately.

The problem is not SQL syntax.

The problem is sequencing.

Production Has More Than One Version

During a migration, production rarely has one clean version of the world.

For a while, the system may contain:

old application code
new application code
old background jobs
new background jobs
old mobile clients
cached data
reporting queries
support tooling
in-flight requests
rollback candidates

The schema has to survive that mixed state.

This is why safe migrations often look less elegant than the final model. They carry temporary duplication so the system can move in steps.

That temporary duplication is not a smell by itself. It is the price of changing a live system safely.

Expand And Contract

The usual safe pattern is expand and contract.

First, expand the schema without breaking old code:

sql

ALTER TABLE users
  ADD COLUMN first_name TEXT,
  ADD COLUMN last_name TEXT;

Old code can still use full_name. New code can start writing first_name and last_name.

Then update the application to write both models:

await db.users.update(userId, {
  full_name: `${firstName} ${lastName}`,
  first_name: firstName,
  last_name: lastName,
})

This is not beautiful. It is intentional.

The system is now compatible with old readers and new readers.

Next, backfill existing data:

sql

UPDATE users
SET
  first_name = split_part(full_name, ' ', 1),
  last_name = substring(full_name from position(' ' in full_name) + 1)
WHERE first_name IS NULL
  AND last_name IS NULL;

That example is simplified. Real names are messy. The point is not the parsing logic. The point is that backfill is a production operation, not a quick local cleanup.

After the backfill, move reads to the new columns. Then observe. Then stop writing full_name. Only after rollback no longer needs the old column should the system contract:

sql

ALTER TABLE users DROP COLUMN full_name;

The final schema is clean. The path there is deliberately staged.

Phase	What Changes	Old Code Safe?	Rollback Still Realistic?
Expand schema	Add nullable new columns.	Yes.	Yes.
Dual write	Write old and new shapes.	Yes.	Mostly yes, if both shapes stay correct.
Backfill	Fill new columns for existing rows.	Yes, if the old shape remains.	Sometimes, but data repair may be needed.
Switch reads	Read from the new columns.	Yes, if old writes continue.	Yes, if dual writes continue.
Stop old writes	Make the new model primary.	Usually no.	Limited. Forward fix may be safer.
Contract schema	Drop the old column.	No.	Usually no, unless restore or repair is planned.

The table is intentionally conservative. A migration becomes dangerous when the team crosses from "old code is still safe" into "old code cannot run anymore" without noticing.

Backfills Are Production Work

A backfill is not just a script.

It reads and writes real data under real load. It can lock rows, fill logs, increase replication lag, slow down application queries, and create partial progress states.

A safe backfill should be designed like operational work:

run in batches
be idempotent
track progress
be resumable
limit write pressure
avoid long locks
expose metrics
support pause and rollback strategy where possible

Instead of one large update:

sql

UPDATE users SET migrated = true;

prefer batch-oriented work:

sql

UPDATE users
SET migrated = true
WHERE id IN (
  SELECT id
  FROM users
  WHERE migrated = false
  ORDER BY id
  LIMIT 1000
);

Even this is only a sketch. The real implementation depends on the database, table size, indexes, replication model, and workload.

The architecture decision is whether the migration can run safely while the product keeps serving traffic.

Rollback Has Limits

Rollback is easy to say and hard to mean.

If a deployment changes code only, rollback may be straightforward.

If a migration changes data shape, rollback may not restore the old world.

Once new code writes to first_name and last_name, rolling back to code that only understands full_name may lose information unless dual writes continue. Once the old column is dropped, rollback cannot magically bring it back with correct data.

This means migration planning has to separate:

application rollback
schema rollback
data repair
forward fix

Sometimes the safest rollback is not reversing the migration. It is keeping compatibility layers in place long enough to roll the application back safely.

Before running a migration, ask:

can old code still run after this schema change?
can new code run before this backfill finishes?
what happens if the deploy fails halfway?
what data would be lost if we roll back?
is rollback realistic, or do we need a forward fix plan?

Rollback is not a checkbox. It is a design constraint.

Verification Matters More Than Hope

A migration is not done when the script finishes.

It is done when the system proves the new model is correct enough to rely on.

Verification can include:

row counts
null checks
consistency checks between old and new columns
sampling
application metrics
error rates
support reports
consumer usage of old fields

For a dual-write period, compare old and new values:

sql

SELECT COUNT(*)
FROM users
WHERE full_name IS NOT NULL
  AND (first_name IS NULL OR last_name IS NULL);

Then compare behavior, not only data. Are API responses correct? Are background jobs still working? Are reports using the new model? Are old clients still safe?

The database can tell you whether rows changed. The system has to tell you whether behavior is still correct.

Background Jobs And Reports Count Too

Many migration failures come from code paths outside the main request flow.

A web endpoint may be updated correctly while a nightly export still reads the old column. A background job may write only the old model. A support dashboard may break because it uses a raw query. An analytics pipeline may silently stop receiving a field.

Before changing schema, identify consumers:

application reads
application writes
background jobs
scheduled tasks
reports
exports
support tools
data pipelines
mobile or partner APIs

If the only migration plan is "update the service code," it is probably incomplete.

The system is bigger than the service.

Make The Migration Visible

A migration should be observable while it runs.

Useful signals include:

migration progress
batch duration
rows processed
rows skipped
error count
database CPU and locks
replication lag
application error rate
old-field read usage
old-field write usage

Without visibility, the team has only two states: "seems fine" and "incident."

That is not enough.

A safe migration gives operators a way to pause, inspect, and continue.

A Practical Migration Checklist

Before running a production migration, ask:

Is this change backward-compatible with old application code?
Can new code run before the migration is fully complete?
Which jobs, reports, APIs, and tools read or write this data?
Does the migration need expand and contract?
Is the backfill batched, resumable, and idempotent?
What metrics show migration progress and pressure?
What verification proves the new model is safe to read from?
What does application rollback mean during each phase?
What data repair plan exists if the migration partially succeeds?
When is it safe to remove the old schema?

These questions are practical because migration failures are practical.

They happen in jobs, reports, rollback paths, old clients, and data repair scripts. The architecture has to include those paths.

Where To Go Deeper

The JPA in Production series goes deeper into transaction mechanics, persistence behavior, and production transaction patterns.

Use that branch when the migration risk depends on how application code reads, writes, flushes, or rolls back data.

Summary

A database migration is not just a schema edit.

It is a production rollout across code, data, jobs, reports, clients, rollback paths, and operational visibility.

Safe migrations usually look less elegant than the final schema because they carry temporary compatibility. That is not accidental complexity. It is the cost of changing a live system without breaking it.

The architect's job is to design the path, not only the destination.

Morteza Taghdisi