Handling migrations that require downtime

I work on a project that has a few tables that hit the 100M-1B range of records in PostgreSQL. I pushed a change that marked a column as NOT NULL. I used the approach of:

  1. Create a NOT VALID check constraint asserting the column is not null.
  2. Validate the constraint.
  3. Put the application into maintenance mode.
  4. Alter the table column to be not null (locks the table), drop the constraint.
  5. Bring application out of maintenance mode.
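For concreteness, here is a sketch of the SQL behind those steps, composed as Python strings (the table, column, and constraint names are made up for illustration):

```python
# Hypothetical names for illustration.
TABLE = "app_order"
COLUMN = "customer_id"
CONSTRAINT = f"{COLUMN}_not_null"

# Step 1: cheap to add, since NOT VALID skips the table scan.
step_1 = (
    f"ALTER TABLE {TABLE} ADD CONSTRAINT {CONSTRAINT} "
    f"CHECK ({COLUMN} IS NOT NULL) NOT VALID;"
)

# Step 2: scans the table, but holds only a weak lock, so normal
# reads and writes carry on.
step_2 = f"ALTER TABLE {TABLE} VALIDATE CONSTRAINT {CONSTRAINT};"

# Step 4: SET NOT NULL takes an ACCESS EXCLUSIVE lock, hence the
# maintenance window.
step_4a = f"ALTER TABLE {TABLE} ALTER COLUMN {COLUMN} SET NOT NULL;"
step_4b = f"ALTER TABLE {TABLE} DROP CONSTRAINT {CONSTRAINT};"
```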

The constraint work reduced the amount of time the table would be locked, but it still resulted in some downtime. For one of our servers, it was 15 minutes.

The problem I ran into is that our deployment process automatically runs some migrations as part of the deploy. I misidentified the step 4 migration, so while I had performed steps 1 and 2 correctly, the next deploy inadvertently triggered the table-locking migration, causing 503s.

So my question is: what processes, and ideally automation, do folks use to prevent locking database migrations from running without explicit consent?

One idea I had is to make our safe-migration logic also check each migration's operations against a known list of operations that may only run outside of a deployment process. That way it'd require a person to actively choose when to run that migration.
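As a rough illustration of that idea (everything here is hypothetical, not an existing API), a deploy-time guard could compare each migration's operation types against a blocklist:

```python
# Hypothetical stand-ins for Django operation classes; in a real
# project these would be the entries on each Migration.operations list.
class AlterField: ...
class AddIndex: ...
class RunSQL: ...

# Operation types that may take long-held locks on big tables and so
# must be run manually, outside the deploy (illustrative list).
BLOCKED_DURING_DEPLOY = {"AlterField", "RemoveField", "RunSQL"}

def unsafe_operations(operations):
    """Return the names of operations that need a manually scheduled run."""
    return [
        type(op).__name__
        for op in operations
        if type(op).__name__ in BLOCKED_DURING_DEPLOY
    ]

# A deploy script could refuse to continue if any such operation is found:
found = unsafe_operations([AlterField(), AddIndex(), RunSQL()])
```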

Hi Tim

First, on PostgreSQL 12+, it’s possible to “hoist” a NOT NULL check constraint into a NOT NULL column definition without locking the table, in three migrations:

  1. Add a check constraint like X IS NOT NULL with the NOT VALID clause on the check constraint
  2. Use VALIDATE CONSTRAINT to make that constraint valid without locking the table
  3. Use SET NOT NULL to set the column to “NOT NULL” without locking the table, as it will see the valid matching check constraint and reason that the table does not need re-scanning. From the ALTER TABLE docs:

SET NOT NULL may only be applied to a column provided none of the records in the table contain a NULL value for the column. Ordinarily this is checked during the ALTER TABLE by scanning the entire table; however, if a valid CHECK constraint is found which proves no NULL can exist, then the table scan is skipped.
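Spelled out as SQL (one statement per migration file; the table and column names are invented), the three steps look like this:

```python
# Hypothetical names; each statement would live in its own migration.
TABLE = "app_order"
COLUMN = "customer_id"

# Migration 1: no table scan thanks to NOT VALID.
migration_1 = (
    f"ALTER TABLE {TABLE} ADD CONSTRAINT {COLUMN}_not_null "
    f"CHECK ({COLUMN} IS NOT NULL) NOT VALID;"
)

# Migration 2: scans the table, but under a weak lock, so reads and
# writes continue.
migration_2 = f"ALTER TABLE {TABLE} VALIDATE CONSTRAINT {COLUMN}_not_null;"

# Migration 3: on PostgreSQL 12+ this sees the valid check constraint
# and skips the table scan, per the docs quoted above.
migration_3 = f"ALTER TABLE {TABLE} ALTER COLUMN {COLUMN} SET NOT NULL;"
```

After step 3 the check constraint is redundant and can be dropped at leisure.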

That said, this is quite far from what Django does by default (at the moment). One problem with using VALIDATE CONSTRAINT is that it doesn't help if it runs in the same transaction as the one adding the constraint, so if Django supported it, it would need to generate multiple migration files. #31653 will add Django migration operations that use VALIDATE CONSTRAINT.

Of course, there are many other ways of locking up a DB with migrations. I normally rely on a bit of manual inspection plus a statement timeout. I would aim for a statement timeout of 10-30s for your normal app code, and maybe 1-5 minutes for migrations, depending on how much downtime you're able to handle (maybe that's also only 10 seconds, although some migration statements, like VALIDATE CONSTRAINT, take ages but don't lock anything).
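For the app-side timeout, one way (a sketch, assuming the psycopg/libpq backend; the database name is made up) is to pass statement_timeout through Django's DATABASES options:

```python
# Sketch: a server-side statement timeout for all app connections.
# "-c statement_timeout=30s" is a standard libpq startup option.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "exampledb",  # made-up example name
        "OPTIONS": {
            "options": "-c statement_timeout=30s",
        },
    }
}

# Migration sessions can then override it, e.g. by issuing
# "SET statement_timeout = '5min';" (or 0 for VALIDATE CONSTRAINT,
# which runs long but doesn't block other queries).
```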

Another project I’ve seen but not used is django-migration-linter, which tries to verify that all your migrations are safe.

Hope that helps - Adam

Hi Adam, thanks for the response. I did indeed try to hoist a NOT NULL check constraint, but I think I didn’t actually achieve it, because I ran it as two migrations rather than three. I’ll have to test the migration again to confirm where I messed up.

I’ll take a look at the migration linter project, thanks for sharing!

Strange, I didn’t see a significant improvement in migration time when running the ALTER TABLE and DROP (check) CONSTRAINT in a single migration versus two. When I tried running the ALTER TABLE without the constraint existing at all, the migration took around 3 minutes, compared to around 4 with the hoisted-check-constraint approach.