Let's talk zero-downtime migrations

Hi there, djangonauts!

I have a simple idea in mind that I want to discuss. As a preface, I would like to say that, in my opinion, zero-downtime migrations are almost impossible to support in a universal way. For any realistic implementation, these migrations will be a mix of manually written Python and SQL.

Let's use a simplified model: any migration can be split into:

  • a non-destructive part that can be rolled out on a running system right away
  • a destructive part that remains pending until the deployment is finished

For example, deleting a field is a purely destructive operation:

operations = []
pending_ops = [
   migrations.RemoveField("Issue", "source"),
]

To apply the desired pending migrations, one would use an API endpoint: /migrations/apply?no=001

Or, even better, we could have a page for this in the Django admin that lists all pending migrations and lets you apply them. When a pending migration is applied, that fact is saved to the database.
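
A minimal sketch of that bookkeeping (entirely hypothetical; the model name and fields are made up for illustration):

from django.db import models


# Hypothetical record of which pending operations have been applied;
# nothing like this exists in Django today.
class AppliedPendingMigration(models.Model):
    app_label = models.CharField(max_length=255)
    name = models.CharField(max_length=255)
    applied_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        unique_together = [("app_label", "name")]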

One can also add pending_ops to an already applied migration by modifying the migration source.

Of course, when migrations are applied to an empty database (for example, in tests), pending_ops should be applied right after the regular operations.
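
Putting the proposal together, a full migration file could look like this (a hypothetical sketch: pending_ops is the proposed attribute and does not exist in Django; the app and field names are made up):

from django.db import migrations


class Migration(migrations.Migration):
    dependencies = [("tracker", "0007_issue_source")]

    # Applied immediately, like any normal migration.
    operations = []

    # Proposed attribute: held back until explicitly applied via the
    # endpoint/admin page, except on an empty database (e.g. in tests),
    # where it would run right after `operations`.
    pending_ops = [
        migrations.RemoveField(model_name="issue", name="source"),
    ]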

Tagging @charettes as the specialist in the field :slight_smile:

And the others @KenWhitesell @carltongibson @andrewgodwin

Hi @pwtail

Thanks for making a detailed post on the forum.

The idea for “split migrations” has been implemented at least twice already in third-party packages. There's @charettes's django-syzygy, and @ryanhiebert's django-safemigrate. I've not used either myself, but they seem fairly robust and long-lived.

I don’t think we’re ready to integrate this technique into Django for a few reasons:

  1. Many hosting platforms don't make it easy to run both pre-deploy and post-rollout commands. Typically only pre-deploy commands are supported, such as Heroku's “release phase”.
  2. It makes rollbacks harder to manage.
  3. In many cases, migrations can be made zero-downtime using the built-in tools, such as always adding new fields as nullable, using special operations like the PostgreSQL-specific AddIndexConcurrently (see the sketch after this list), or the blue-green pattern covered by Mariusz recently.
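
For reference, here's what the PostgreSQL-specific operation from point 3 looks like in a migration file (the app, model, and index names are placeholders):

from django.contrib.postgres.operations import AddIndexConcurrently
from django.db import migrations, models


class Migration(migrations.Migration):
    # CREATE INDEX CONCURRENTLY cannot run inside a transaction.
    atomic = False

    dependencies = [("tracker", "0008_auto")]

    operations = [
        AddIndexConcurrently(
            "issue",
            models.Index(fields=["source"], name="issue_source_idx"),
        ),
    ]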

Also, the existing packages aren't particularly popular (each has ~50 GitHub stars and few contributors), so it would seem there's little demand in general.

If you still want to pursue this idea, I think it would be best to try those packages and see how well they work for you. More eyeballs and fixing of edge cases will help reveal how generally applicable the concept is.

I don't think Django can add an HTTP API or admin page for applying migrations, or for running any other management command, really. It would be very hard to secure the endpoint in a way that's suitable for all projects. Also, web servers, load balancers, and hosting platforms enforce HTTP timeouts, but migrations can take an arbitrary amount of time. Cutting off a migration due to a timeout is generally a bad idea; it can lead to half-applied changes on databases without transactional DDL, like MySQL, or to hanging queries.


Well, I think yours is more detailed :slight_smile:

I don’t want to fix edge cases, I want to throw the ideas in!

Well, you can use websockets then

Thank you @adamchainz, waiting for the others…

Since I’ve been specifically tagged here, I’ll acknowledge seeing the post, but must admit that zero-downtime migrations is not a topic that has ever hit my radar. I have never worked in an environment that didn’t provide for a “maintenance window”.

I've looked at the post by @felixxm, and it seems all the magic like SeparateDatabaseAndState wasn't actually required. One could just add the deletion migration at some point later, without even getting a warning from Django. Like this (the migration from step 3 is sketched after the list):

  • remove the field/model from the source
  • deploy the app
  • makemigrations
  • deploy again
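
For illustration, the migration that step 3's makemigrations would generate is a plain RemoveField (app, model, and field names are assumed here):

from django.db import migrations


class Migration(migrations.Migration):
    dependencies = [("tracker", "0009_previous")]

    operations = [
        migrations.RemoveField(
            model_name="issue",
            name="source",
        ),
    ]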

That’s true. In a team setting it’s not typically feasible to prevent others from running makemigrations and committing the results. The technique in the post is mostly about orchestrating individual changes so that they don’t affect others like this.


In a team setting it’s not typically feasible to prevent others from running makemigrations and committing the results.

For django-safemigrate, it was important to me that I be able to enforce that makemigrations would produce no changes for precisely this reason.

From my personal experience (solo dev, using migrations full time for 3 years now, doing roughly one migration per working day), almost all of the transitory downtime I get on deploys comes from adding a column that has no db_default and isn't nullable.

A simple check command I can run in my deploy script that fails the deploy if I’m in one of those situations seems like it would handle most of this. Adding the same check for deleting columns should handle another good chunk.

I don’t think we need to overcomplicate this more than that honestly. If we can ship this simple check command, we can massively improve the situation and then wait a few months and see what problems remain.
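
A minimal sketch of such a check command (my own illustration, not an existing tool; db_default only exists since Django 5.0, so the getattr guard treats fields on older versions as not having one):

from django.core.management.base import BaseCommand, CommandError
from django.db import connections
from django.db.migrations.executor import MigrationExecutor
from django.db.migrations.operations import AddField
from django.db.models.fields import NOT_PROVIDED


class Command(BaseCommand):
    help = "Fail if an unapplied migration adds a non-nullable column without a db_default."

    def handle(self, *args, **options):
        executor = MigrationExecutor(connections["default"])
        # All migrations that would run on the next `migrate`.
        plan = executor.migration_plan(executor.loader.graph.leaf_nodes())
        unsafe = []
        for migration, backwards in plan:
            for op in migration.operations:
                if not isinstance(op, AddField):
                    continue
                has_db_default = (
                    getattr(op.field, "db_default", NOT_PROVIDED) is not NOT_PROVIDED
                )
                if not op.field.null and not has_db_default:
                    unsafe.append(f"{migration.app_label}.{migration.name}: {op.name}")
        if unsafe:
            raise CommandError("Unsafe AddField operations:\n" + "\n".join(unsafe))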


A simple check command I can run in my deploy script that fails the deploy if I’m in one of those situations seems like it would handle most of this. Adding the same check for deleting columns should handle another good chunk.

@boxed I'd welcome your feedback on django-syzygy in this case.

In the case of field addition, it will make sure to always insert an intermediary AddField(db_default) (and a post-deploy AlterField(<without-db_default>) if you used default), and it uses an equivalent approach for column removal.
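
As I understand that description, the staged operations would be roughly equivalent to the following (my own sketch, not django-syzygy's actual output; model and field names are made up, and db_default requires Django 5.0+):

# Pre-deploy migration: add the column with a database-level default so
# existing rows and still-running old code keep working during the rollout.
operations = [
    migrations.AddField(
        model_name="issue",
        name="source",
        field=models.CharField(max_length=100, db_default=""),
    ),
]

# Post-deploy migration: drop the db_default once every instance writes
# the column itself.
operations = [
    migrations.AlterField(
        model_name="issue",
        name="source",
        field=models.CharField(max_length=100),
    ),
]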

The only thing needed is to change your deployment workflow to run migrate --pre-deploy, then deploy, and then migrate.

You can also configure your CI to prevent you from merging changes that are not safely deployable in stages by setting up a feature-branch check that does:

  1. git checkout <target-branch>
  2. ./manage.py migrate, which brings the CI database to the currently deployed state
  3. git checkout <feature-branch>
  4. ./manage.py migrate --pre-deploy, which will exit with an error code if you are trying to merge schema changes that cannot be safely applied in a rolling deployment

I saw it linked before and tried reading the README, but honestly I didn't understand it. I think the README might need to be reworked. Your description above didn't really click for me either, I'm afraid.

I've just come across this thread in the course of looking into how others approach this problem. I'm operating in an environment where it'd be advantageous to minimise downtime.

I also totally see this as a more advanced use case that necessarily invites a degree of complexity that might (justifiably!) be too confusing or intimidating to bother 'the masses' with. It seems reasonable to me for this to be locked away as a third-party thing, assuming that Django can be appropriately hooked to allow for a third-party solution, which it seemingly can be!

For what it’s worth @charettes, the README is understandable to me :stuck_out_tongue:.


Coming a bit late to the party, but I was toying with an idea along similar lines here. Just cross-referencing; discussion can continue here.