The deploy that takes your app down is rarely the code — it's the migration that ran with it. Someone renames a column; for the thirty seconds between "schema changed" and "every server is on the new release," the old code is still querying a column that no longer exists. The fix isn't a faster deploy. Zero-downtime database migrations come from making each schema change backwards compatible, so old and new code can run against the same database at once.
The takeaway up front: never change the schema and the code that depends on it in the same step. Split every breaking change into three deploys — expand the schema additively, migrate the data and shift the code, then contract by removing the old shape once nothing reads it. That's the expand-contract pattern, and it separates a routine migration from a five-minute outage.
Why migrations cause downtime
Two distinct things go wrong during a schema change, and they share one root cause: a change that assumes the world flips from old to new in a single instant. It never does. The two failures need different fixes.
The first is the deploy gap. A deploy is never instantaneous: for some window, old and new instances both serve traffic. If the migration runs first and removes something the old code still needs — a column, a table, a NOT NULL guarantee — every request hitting an old instance fails until the rollout finishes. Roll back, and now the new code is broken against the reverted schema. You're trapped.
The second is the lock. Some DDL statements take a lock that blocks reads or writes on the table while they run. On a small table that's a blink. On a large, busy one the statement can wait behind in-flight transactions, then hold the lock long enough that queries pile up, connections saturate, and the app stalls — the same dead-end as connection pool exhaustion, reached from another direction. Knowing which statements lock, and for how long, is half the job.
The expand-contract pattern
Expand-contract (also called parallel change) replaces one breaking change with a sequence of safe, additive ones — each independently deployable and reversible.
- Expand. Make a purely additive schema change. Add the new column, table, or index. Nothing is removed, so old code keeps working untouched.
- Migrate. Backfill existing data into the new shape, and deploy code that writes to both old and new while reading from whichever you trust. Old instances still know only the old shape — fine, because it still exists.
- Contract. Once every instance is on the new code and nothing references the old shape, remove it — drop the column, delete the table, or tighten the constraint.
The cost is honest — a one-line change becomes three deploys — but at no point does running code depend on a shape that isn't there, so any single step rolls back cleanly with no broken intermediate state.
Worked example: renaming a column safely
A rename is the canonical trap: ALTER TABLE ... RENAME COLUMN is atomic at the database but catastrophic for the deploy gap — the instant it runs, old code referencing the old name breaks. Here's the safe version, renaming users.signup_date to users.created_at.
Step 1 — Expand. Add the new column — no backfill yet, no constraints that could fail.
ALTER TABLE users ADD COLUMN created_at timestamptz;
Step 2 — Migrate. Deploy code that writes both columns on every insert and update, so new rows stay consistent no matter which instance handled them. Then backfill old rows in batches — never one giant UPDATE, which locks the table and bloats a single transaction:
-- Backfill in bounded chunks to avoid a long lock and a huge transaction.
UPDATE users
SET created_at = signup_date
WHERE created_at IS NULL
AND id IN (
SELECT id FROM users WHERE created_at IS NULL LIMIT 5000
);
-- Repeat until zero rows are updated.
Once the backfill is complete and verified, deploy code that reads from created_at. Both columns now hold the same data, so a rollback is harmless.
Step 3 — Contract. When no running code references signup_date, drop it:
ALTER TABLE users DROP COLUMN signup_date;
Three deploys instead of one — but the app never queried a missing column.
Know which statements lock
You can't reason about safety without knowing what your specific database does. Behaviour differs by engine and version, so confirm against your own docs and test on a production-sized copy. The general shape on modern PostgreSQL:
- Cheap and safe. Adding a nullable column with no default is effectively instant. Dropping a column is fast (the data is reclaimed later). Renaming is a metadata-only change — the danger is the deploy gap, not the lock.
- Cheap if you're careful. Adding a column with a constant default is fast on recent PostgreSQL but can rewrite the whole table on older versions.
CREATE INDEX CONCURRENTLYavoids the write-blocking lock a plainCREATE INDEXtakes — at the cost of running outside a transaction and occasionally needing a retry. - Dangerous on a big table. Adding a
NOT NULLconstraint or a foreign key forces a full validation scan that holds a lock. The safe path is two-phase: add the constraint asNOT VALIDso it applies only to new rows, thenVALIDATE CONSTRAINTseparately, which takes a weaker lock.
-- Add the constraint without scanning existing rows...
ALTER TABLE orders
ADD CONSTRAINT orders_user_fk FOREIGN KEY (user_id)
REFERENCES users (id) NOT VALID;
-- ...then validate separately, under a lighter lock.
ALTER TABLE orders VALIDATE CONSTRAINT orders_user_fk;
The rule of thumb: any statement that must read or rewrite every existing row can lock for a long time. Split it so the expensive scan runs under the weakest lock available, or in batches.
Making it routine, not heroic
Zero-downtime migrations shouldn't depend on one engineer remembering all of this. Bake it into how you ship.
- Decouple migrations from app boot. Don't run migrations as a side effect of an instance starting. Make them an explicit, ordered step in the deploy pipeline, so a failed migration doesn't crash-loop your fleet.
- Set a lock timeout. Configure a short
lock_timeout(andstatement_timeout) for migrations. Far better for one to fail fast and be retried than to silently queue behind a long transaction and freeze the table. - Backfill in batches, always. Any data change touching a large table goes in bounded chunks with a pause between them, so you never hold a long lock or balloon a transaction's undo log.
- Make "is this backwards compatible?" a review question. The one check that catches most outages: can the code currently in production survive this schema change without a redeploy? If the answer is no, it must be split into expand and contract steps.
None of this needs special tooling — plain SQL run in the right order, behind a deploy step with a lock timeout, gets you most of the way. Online schema-change tools earn their keep only at a scale where even safe DDL is too slow.
FAQ
How do I rename a database column without downtime?
Don't rename it in place — the atomic RENAME breaks old code during the deploy gap. Instead, add the new column, deploy code that writes both columns, backfill existing rows in batches, then switch reads to the new column. Only once every instance is on the new code and nothing references the old name do you drop it.
Do I really need three deploys for one schema change?
Only for a breaking change against a live database. Additive changes — a nullable column, a concurrent index — are a single safe deploy. The three-step process applies specifically to removals, renames, and tightened constraints, where one step would leave running code depending on a shape that no longer matches the schema.
Will adding a column lock my table?
It depends on the engine, version, and what you add. On modern PostgreSQL, adding a nullable column (and often one with a constant default) is near-instant; the expensive cases are constraints and changes that scan or rewrite every row. Check your database's docs for your version and test on a production-sized table before assuming a statement is cheap.
When should I reach for an online schema-change tool?
Only when plain, careful SQL is too slow for your scale — typically very large tables where even a metadata change or concurrent index build is disruptive. For most applications, expand-contract plus a lock timeout and batched backfills is enough. Reach for heavier tooling to solve a measured problem.
Next step
Pull up the next migration in your branch and ask the question that prevents most deploy outages: can the code already running in production survive this change with no redeploy? If it can, ship it. If it can't — a rename, a drop, a new NOT NULL — split it into expand, migrate, and contract, give each step its own deploy, and put a lock timeout on the migration. Your schema changes stop being the scariest part of the release. More pragmatic engineering guides are at theappcode.net.