I work for a company with a very large Django monolith weighing in at over 4 million lines of code. This includes ~6,500 migration files across a large number of applications. A full migration run takes over 15 minutes on a developer’s machine.
I wanted to build some tooling to help with database schema changes that runs in CI. This would involve comparing the “end state” database schema that Django ends up creating after manage.py migrate is executed on a clean database to the schema that is currently used in production.
Running the full suite of migrations is not feasible for this use case, and I was thinking that it isn’t really required. Django surely can compute the SQL used to create the tables on an empty database without iteratively constructing/altering them via a series of migrations?
I ended up creating this incredibly hacky proof of concept: dump_schema.py · GitHub. It’s a management command that essentially “virtually” squashes the migrations by loading the existing on-disk ones and detecting the changes from an empty state.
I’m not sure if this is the best way to do this at all, and the code is rather crappy, but I came across a few surprising things along the way that I thought I would start a discussion around.
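For the curious, the core of that proof of concept boils down to something like this (a rough sketch against Django’s internal, undocumented migration APIs, not the actual gist code):

# Diff an empty ProjectState against the state produced by loading
# every on-disk migration, yielding "end state" operations directly.
from django.db import connections
from django.db.migrations.autodetector import MigrationAutodetector
from django.db.migrations.loader import MigrationLoader
from django.db.migrations.state import ProjectState

def virtually_squashed_changes(database="default"):
    loader = MigrationLoader(connections[database])
    autodetector = MigrationAutodetector(
        ProjectState(),          # "from": an empty database
        loader.project_state(),  # "to": state after all on-disk migrations
    )
    # Returns {app_label: [Migration, ...]} describing the end state
    # without replaying thousands of intermediate operations.
    return autodetector.changes(graph=loader.graph)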
Performance
I was surprised to find that the dominating factor when running migrations is not the actual queries executed but reloading models. Of the migrate command’s total 1072-second runtime, only 58 seconds is spent within the cursor.execute call, while 782 seconds is spent within reload_model. I’ve uploaded some sanitised cProfile CSV files here: 1_total_time_spent.csv · GitHub
I’ve also uploaded a graphical tree of the profile.
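For anyone wanting to reproduce this kind of measurement, a profile like this can be captured along the following lines (a sketch, not the exact harness I used):

import cProfile
import pstats
from django.core.management import call_command

# Profile a full migrate run against a clean database and dump the
# stats for later analysis or visualisation.
with cProfile.Profile() as profiler:
    call_command("migrate")
pstats.Stats(profiler).sort_stats("cumulative").dump_stats("migrate.prof")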
Interfaces
I’ve found that at a certain size Django migrations become an absolutely critical thing to monitor and a pretty dangerous thing overall. Getting a handle on this likely necessitates some custom tooling, as databases (and the way they are used) can be very bespoke.
However, right now our interfaces don’t make this tooling very easy to build. For example, if you want a structured view of manage.py migrate --plan, you need to parse the human-readable output and build the structure yourself. Or if you want to extract the SQL statements generated by a set of new migrations within a commit, you need to work out which migrations have been added, then repeatedly call sqlmigrate and store the output.
Migration “stuff” often involves external tools or custom processing to do safely at scale, and we don’t currently have interfaces that make it easy to add those without creating custom management commands and using internal migration calls. Having a structured JSON output option for migrate would go a long way towards helping with this.
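To illustrate the gap: getting even the unapplied-migration plan as structured data today means reaching into internals, roughly like this (a sketch; MigrationExecutor is undocumented and can change between releases):

from django.db import connections
from django.db.migrations.executor import MigrationExecutor

def migration_plan_as_data(database="default"):
    executor = MigrationExecutor(connections[database])
    targets = executor.loader.graph.leaf_nodes()
    # Each plan entry is (Migration, backwards); flatten to plain data.
    return [
        {"app": migration.app_label, "name": migration.name, "backwards": backwards}
        for migration, backwards in executor.migration_plan(targets)
    ]

A structured output option on migrate could emit exactly this kind of shape without anyone needing to touch the internals.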
Flexibility
At a certain point you might want to do something really specific, like my exploration above, and you may need to dive into the migration Python API. The migration internals are actually really quite nice once you grok how they work together. The use case I was exploring is obviously not something directly supported, so I was definitely fighting parts of the interface. However, they do currently seem quite coupled to migrations on disk, and I’m not sure they have to be?
One example of some custom tooling that I’m exploring, and that uses the internals, is a tool to automatically detect which locks Postgres holds for a given statement within a migration. It’s definitely not ready for prime time, but it’s one example of a use case that would be good to enable.
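The first step of that tooling is just getting at the SQL a single migration would execute, which again means going through the internals, roughly what sqlmigrate does under the hood (sketch only):

from django.db import connections
from django.db.migrations.loader import MigrationLoader

def sql_for_migration(app_label, name, database="default"):
    connection = connections[database]
    loader = MigrationLoader(connection)
    migration = loader.get_migration_by_prefix(app_label, name)
    # Project state just before this migration, so apply() emits its SQL.
    state = loader.project_state((app_label, migration.name), at_end=False)
    with connection.schema_editor(collect_sql=True, atomic=migration.atomic) as editor:
        migration.apply(state, editor, collect_sql=True)
    return editor.collected_sql  # statements to run and inspect via pg_locks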
Perhaps we could explore opening up a stripped-down, simple interface that we could document publicly, to help people who are not familiar with Django’s internals but still want to build similar tooling? It shouldn’t be expansive or expose any internals, but it could cover a few simple use cases like:
# Get all unapplied migrations
migrations.get_unapplied_migrations()
# Get all changes introduced by a series of apps or migration file paths
migrations.diff(changes=[("app_name", "foo"), ("app_name", "bar")])
migrations.diff(changes=["path_to_new_migration"])
# Virtually squash all migrations and return a new set of migrations
migration_plan = migrations.squash(...)
# Remove a specific type of operation from the plan
migration_plan.remove_operations(AddIndex, RemoveIndex)
RunPython
I think that the schema_editor argument passed to RunPython functions may have been a mistake. There currently isn’t a way to tell the difference between a RunPython operation that “inserts some rows” and one that “does some crazy database modifications”.
I’ve never actually seen a RunPython operation that does schema modifications; in my experience it’s been used exclusively for data loading of some kind.
Data loading is another, wider topic that I might write my thoughts down on, and one I think we can nail with some changes, but a good start would be the ability to differentiate between a RunPython that performs schema alterations and one that does not.
Perhaps we could consider adding a new DataMigration operation distinct from RunPython that does not pass in the schema_editor to the user-supplied function?
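As a strawman, something like it can even be prototyped today as a thin wrapper (DataMigration is a hypothetical name, not an existing Django API):

from django.db import migrations

class DataMigration(migrations.RunPython):
    """Like RunPython, but the user-supplied callables receive only
    `apps`, making it explicit that the operation touches data and
    never the schema."""

    def __init__(self, code, reverse_code=None, **kwargs):
        def wrap(fn):
            if fn is None:
                return None
            # Adapt fn(apps) to RunPython's fn(apps, schema_editor).
            return lambda apps, schema_editor: fn(apps)
        super().__init__(wrap(code), wrap(reverse_code), **kwargs)

A real version would need proper deconstruct()/serialisation support, but it shows the shape of the API.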
Consensus
If any of these thoughts resonate and we can reach consensus on any improvements to make here, then I can spend time developing them, or even flesh some out at the DjangoCon hackathon.
