Migration performance, interfaces and flexibility

I work for a company with a very large Django monolith weighing in at over 4 million lines of code. This includes ~6,500 migration files across a large number of applications. A full migration run on a developer's machine takes over 15 minutes.

I wanted to build some tooling to help with database schema changes that runs in CI. This would involve comparing the “end state” database schema that Django ends up creating after manage.py migrate is executed on a clean database to the schema that is currently used in production.

Running the full suite of migrations is not feasible for this use case, and I was thinking that it isn't really required. Surely Django can compute the SQL used to create the tables on an empty database without iteratively constructing/altering them via a series of migrations?

I ended up creating this incredibly hacky proof of concept: dump_schema.py · GitHub. It’s a management command that essentially “virtually” squashes the migrations by loading existing on-disk ones then detecting changes from an empty state.
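The gist effectively diffs an empty project state against the state produced by the on-disk migrations; the essence of why this can be fast is that the loader can compute that end state purely in memory. A heavily simplified sketch of the idea (not the actual gist, and leaning on undocumented internals):

# Sketch only: render the end-state models from the on-disk migrations and
# collect the CREATE TABLE SQL without executing anything.
from django.db import connection
from django.db.migrations.loader import MigrationLoader

loader = MigrationLoader(connection)
# Project state after every on-disk migration, computed purely in memory.
final_apps = loader.project_state().apps

with connection.schema_editor(collect_sql=True, atomic=False) as editor:
    for model in final_apps.get_models():
        if model._meta.proxy or not model._meta.managed:
            continue
        editor.create_model(model)

# Foreign key constraints end up in deferred SQL, flushed when the editor exits.
print("\n".join(editor.collected_sql))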

I’m not sure if this is the best way to do this at all, and the code is rather crappy, but I came across a few surprising things during this that I thought I would start a discussion around.

Performance
I was surprised to find that the dominant factor when running migrations is not the actual queries executed but instead reloading models. Of the total 1,072-second runtime of the migrate command, only 58 seconds are spent within cursor.execute calls and 782 seconds are spent within reload_model. I’ve uploaded some sanitised cProfile CSV files here: 1_total_time_spent.csv · GitHub

And a graphical tree here:

Profile tree
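For anyone wanting to reproduce this kind of profile, something along these lines works (not necessarily the exact invocation I used):

python -m cProfile -o migrate.prof manage.py migrate
python -c "import pstats; pstats.Stats('migrate.prof').sort_stats('cumulative').print_stats(30)"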

Interfaces

I’ve found that at a certain size, Django migrations become an absolutely critical thing to monitor and a pretty dangerous thing overall. Getting a handle on this likely necessitates some custom tooling, as databases (and the way they are used) can be very bespoke.

However, right now our interfaces don’t make building this tooling very easy. For example, if you want to extract a structured view of manage.py migrate --plan, you need to parse the human-readable output and build the structure yourself. Or if you want to extract the SQL statements generated by a set of new migrations within a commit, you need to work out which migrations have been added, then repeatedly call sqlmigrate and store the output.

Migration “stuff” is something that often involves external tools or custom processing to do safely at scale, and we don’t currently have interfaces that make it easy to add those without creating custom management commands and using internal migration calls. Having a structured JSON output option for migrate would go a long way towards helping with this.
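Purely as an illustration of the kind of thing I mean (this option doesn't exist today), migrate --plan could emit something like:

[
  {
    "app": "app_name",
    "name": "0042_add_foo_index",
    "applied": false,
    "operations": ["AddIndex"]
  }
]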

Flexibility

At a certain point you might want to do something really specific, like my exploration above, and you may need to dive into the migration Python API. The migration internals are actually really quite nice once you grok how they work together. The use case I was exploring is obviously not something directly supported, so I was definitely fighting parts of the interface. However, they do currently seem quite coupled to migrations on disk, and I’m not sure they have to be?

One example of some custom tooling that I’m exploring, and that uses the internals, is one that automatically detects which locks Postgres holds for a given statement within a migration. It’s definitely not ready for prime time, but it’s just one example of a use case that would be good to enable.
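The rough idea (Postgres-specific, and a simplified sketch rather than the actual tool) is to run the statement inside a transaction, inspect pg_locks for the current backend, and roll back:

# Sketch: see which locks a single DDL statement takes, then roll back.
from django.db import connection, transaction

statement = 'ALTER TABLE "app_foo" ADD COLUMN "bar" integer'  # hypothetical example

with transaction.atomic():
    with connection.cursor() as cursor:
        cursor.execute(statement)
        cursor.execute(
            "SELECT locktype, relation::regclass::text, mode"
            " FROM pg_locks WHERE pid = pg_backend_pid()"
        )
        for locktype, relation, mode in cursor.fetchall():
            print(locktype, relation, mode)
    # Undo the statement so this is side-effect free.
    transaction.set_rollback(True)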

Perhaps we could explore opening up a stripped-down, simple interface that we can document publicly, to help people who are not familiar with Django but who also want to build similar tooling? It shouldn’t be expansive or expose any internals, but it could cover a few simple use cases like:

# Get all unapplied migrations
migrations.get_unapplied_migrations()
# Get all changes introduced by a series of apps or migration file paths
migrations.diff(changes=[("app_name", "foo"), ("app_name", "bar")])
migrations.diff(changes=["path_to_new_migration"])
# Virtually squash all migrations and return a new set of migrations
migration_plan = migrations.squash(...)
# Remove a specific type of operation from the plan
migration_plan.remove_operations(AddIndex, RemoveIndex)

RunPython

I think that the schema_editor argument passed to RunPython functions may have been a mistake. There currently isn’t a way to distinguish between a RunPython operation that “inserts some rows” and one that “does some crazy database modifications”.

I’ve never actually seen a RunPython operation that does schema modifications - it’s been exclusively used for data loading of some kind in my experience.

Data loading is another, wider topic that I might write down my thoughts on, and one I think we can nail with some changes, but I think it would be a good start to be able to differentiate between a RunPython that does schema alterations and one that does not.

Perhaps we could consider adding a new DataMigration operation distinct from RunPython that does not pass in the schema_editor to the user-supplied function?
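As a very rough sketch of what I mean (built on top of RunPython's current internals, so purely illustrative rather than a concrete proposal):

# Sketch only: a hypothetical DataMigration operation that reuses RunPython
# but only hands the user function the historical apps registry.
from django.db import migrations, router


class DataMigration(migrations.RunPython):
    def database_forwards(self, app_label, schema_editor, from_state, to_state):
        if router.allow_migrate(schema_editor.connection.alias, app_label, **self.hints):
            # Only the historical apps registry is passed, no schema_editor.
            self.code(from_state.apps)

    def database_backwards(self, app_label, schema_editor, from_state, to_state):
        if self.reverse_code and router.allow_migrate(
            schema_editor.connection.alias, app_label, **self.hints
        ):
            self.reverse_code(from_state.apps)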

Consensus
If any of these thoughts resonate and we can get consensus on any improvements we could make here, then I can spend time developing these, or even flesh some of them out at the DjangoCon hackathon.


Regarding the reload_model issue, it has recently been discussed in this thread: Migrating backwards takes orders of magnitude longer than migrating forwards.

It is a well-known problem on which we’ve made some progress over the past few years by making sure that built-in Operation subclasses only operate from ModelState and not from rendered models.

This means that under normal circumstances, generating the project state up to a certain point (e.g. thousands of migrations already applied), which is basically a series of state_forwards calls, is relatively fast, as no model rendering takes place.

Things begin to get slow once you apply your migration operations against the database though, as the schema editor still operates on rendered models. This means that from that point on, models are rendered, need to be invalidated on each state_forwards, and are then re-rendered on the follow-up database_forwards.
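In other words (a simplified illustration, not the exact code paths):

# Building up project state is cheap: it only mutates ModelState objects.
from django.db.migrations.state import ProjectState

state = ProjectState()
for migration, _backwards in plan:          # "plan" as produced by the executor
    state = migration.mutate_state(state)   # state_forwards only, no rendering

# The expensive part is rendering real model classes, which the schema
# editor needs once operations start hitting the database.
apps = state.apps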

It is a problem we’ve been trying to solve for a while, and it has also been publicly discussed.


Very interesting proof of concept. In my case I have deleted (or renamed, to keep the history) some migrations and rebuilt them into a single migration when they started to give me problems. I know it’s not recommended, but sometimes it’s what deadlines allow.
I like this kind of approach.
Of course, in my case I’ve never had to do this in a big project; they’re all internal tools that, honestly, are small.


We take a similar approach. We only have about 10 installations of our primary application (4 production, 4 staging, and 2 for experimentation and research). They’re all kept close enough that at times, after the production instances are updated, we “collapse” or “rebase” (our internal term) our migrations to keep the numbers under control. We try to always keep the number of migrations at under 100. (If we ever run into a situation where we would need to recreate a database for a given structure, we pull the proper instance from git.)

Glad to read this, Ken. I thought it was a bad practice caused by my lack of knowledge, but if your team takes a similar approach, I assume it’s very difficult not to follow this path.

I was aware of the general performance issues and some previous discussions, but I hadn’t seen that talk. Thanks for the link, I’ll give it a watch. I didn’t want to focus this entirely on the performance of migrations (nor come across as criticising previous work in this area!); I only wanted to show that the overhead for us was larger than I expected. I instead wanted to focus more on the flexibility aspect, with the main thesis being:

Databases are tricky and often require bespoke tooling. Implementing this tooling in core might not be appropriate. Let’s make it easier to build this tooling when needed?

Thinking more about it overnight, perhaps exposing a simple-ish migrations Python interface isn’t the way to go. Perhaps we could add an inspectmigrations management command that does a few useful things, optionally outputting the data as JSON:

  1. Shows a full migration plan and the operations within the plan
  2. Shows the rendered SQL for multiple migrations
  3. Allows generating a new migration plan with specific operation types omitted (e.g. skipping all AddIndex or RemoveIndex operations)
  4. Supports referencing migrations as file paths instead of only (app, name) tuples
    • This is useful in CI, where you might just want to throw a list of “all changes that match a migration glob expression” into the command
    • e.g. run all migrations excluding “migration files added in this branch”, or show me the SQL generated for this list of migration files (some hypothetical invocations are sketched below)
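To make the shape of this concrete, some entirely hypothetical invocations (none of these flags exist today, the names are made up for illustration):

python manage.py inspectmigrations --json --plan
python manage.py inspectmigrations --json --sql src/app_name/migrations/0042_add_index.py
python manage.py inspectmigrations --json --exclude-operations AddIndex,RemoveIndex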

With this it should be possible to compose a command that can answer questions like:

  • Show me all the CREATE/DROP TABLE DDL statements in a set of new migrations
  • Show me any CREATE/DROP INDEX statements that are not AddIndexConcurrently/RemoveIndexConcurrently
  • Show me the complete rendered end-state table DDL for all models across all migrations
  • Give me information on all the indexes created on all models

Then pipe the output into another command that does something bespoke with it (e.g. add a label to a pull request, detect overlapping indexes, request additional approval, run some extra tests, trigger a task to sacrifice a goat to the Postgres migration gods, etc.).

(By another Kraken employee :wink:)

Aren’t these already answerable with sqlmigrate? Perhaps it could gain some flexibility to allow ranges of migrations; that would help with showing all the SQL that the migrations in a PR will run.

I’ve also spent quite a bit of time on this in my various workplaces, and have wished for a more extensible migrate command. So this topic is very interesting to me!

For example, the excellent library django-deprecate-fields currently relies on quite a hacky solution to do safe deletion of database fields, and it would be really interesting to see how one could extend the default migration command for such use cases.

I guess you can get a bit of the way there with sqlmigrate, but some information is currently lost and not very structured.

For example, say that we wanted to know which migrations added indexes without using the Concurrently variants. We could call sqlmigrate repeatedly and pipe it to grep, like:

for CHANGED_PATH in $ALL_CHANGES
do
    if manage.py sqlmigrate $CHANGED_PATH | rg -q "CREATE INDEX \""; then
        echo $CHANGED_PATH >> bad_migrations.txt
    fi
done

Then collect the results and do something :tm:. But this seems hacky and could be slow.

If we are able to output something structured, then anyone could throw together something like the example below, which looks for any AddIndex or RemoveIndex operations in changed files:

python manage.py inspectmigrations --json $ALL_CHANGES \
  | jq '.[] | select(any(.operations[]; .name == "AddIndex" or .name == "RemoveIndex")) | .file_path'

And add that to a custom GitHub Action (or any other CI system) to add an annotation to the pull request with a warning:

GitHub Actions YAML
on:
  pull_request:
    paths:
    - "src/**/migrations/*.py"

jobs:
  sqlmigrate:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - id: changed-files
      uses: tj-actions/changed-files@v36
      with:
        files: src/**/migrations/*.py
    - id: bad-migrations
      run: |
        FILES=$(python manage.py inspectmigrations --json ${{ steps.changed-files.outputs.all_changed_files }} \
          | jq -c '[.[] | select(any(.operations[]; .name == "AddIndex" or .name == "RemoveIndex")) | .file_path]')
        echo "BAD_MIGRATIONS=$FILES" >> "$GITHUB_OUTPUT"
    - uses: actions/github-script@v6
      with:
        script: |
          const files = JSON.parse('${{steps.bad-migrations.outputs.BAD_MIGRATIONS}}');
          for (const path of files) {
            core.error('Dangerous territory! Get this migration double checked.', {file: path, startLine: 1}) 
          }        

This assumes we can output something like the below from a built-in management command:

[
  {
    "file_path": "src/migrations/0001_initial.py",
    "operations": [
      {
        "name": "AddIndex",
        "args": {"name": "FooIndex"}
        "sql": "CREATE INDEX \"FooIndex\" ..."
      }
    ]
  }
]

Obviously you could do the same or better with a linting rule, but my overall view is that being able to get structured data about migrations and migration state would enable custom, business-specific tooling to be built more quickly than it can be now, without needing to delve into any internals, parse human-readable output, or add bespoke one-off commands for specific use cases (like sqlmigrate arguably is).


Hmm, I’m not convinced that a management command to dump data in JSON is a great solution. It requires knowing ahead of time which details are useful for the various checks you’re going to build, and JSON has limited expressiveness (for example with cyclic relationships).

Why not write a management command and read the data directly from the migrations classes? We could document parts of the migrations API as public as they are found to be necessary.

Yeah I think you’re right. Perhaps documenting some APIs for migrations is the more flexible solution.

For context, one task I’m working on now is identifying indexes that have been created via RunSQL migrations but are not present in the model state, because pytest-django doesn’t create them when it creates the database via syncdb without running the full set of migrations.

I’ve hacked through this with some copy-paste from migrate, but there could be some nice opportunities here for a limited but somewhat flexible API we could expose?
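Roughly, the hack boils down to something like this (a simplified sketch using internals; the raw diff is noisy because it also picks up Django-generated index and constraint names):

# Sketch: compare index names Django knows about from migration state with
# the index names that actually exist in the database (e.g. added via RunSQL).
from django.db import connection
from django.db.migrations.loader import MigrationLoader

loader = MigrationLoader(connection)
state_index_names = {
    index.name
    for model_state in loader.project_state().models.values()
    for index in model_state.options.get("indexes", [])
}

db_index_names = set()
with connection.cursor() as cursor:
    for table in connection.introspection.table_names(cursor):
        for name, details in connection.introspection.get_constraints(cursor, table).items():
            if details["index"]:
                db_index_names.add(name)

# Indexes present in the database but unknown to the model/migration state.
print(sorted(db_index_names - state_index_names))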

I’m not sure we should document the current migrations API, but instead I think we should build some high-level interfaces on top of the current implementation.

Yeah, due to issues like this, I think skipping migrations and syncdb’ing in tests is unnecessarily dangerous. I mentioned it in Speed Up Your Django Tests, which Kraken has a copy of that you should be able to access.

It should be possible to save and load an SQL dump at the start of tests, before migrate runs, and then apply only any new migrations. I haven’t ever got around to doing that myself, but pytest-django covers some people doing that, I think even in the docs.

Sure, suggest what that can be when the time is right…

Would squashing migrations help? Maybe convert your current fresh DB install into one single migration. A set of 4,000 migrations can often contain a ton of redundant and unnecessary operations.