Feature proposal: faster fixture loading via loaddata command

Currently the “loaddata” management command calls obj.save() for each deserialized object in a fixture. This save first attempts an UPDATE statement and, if no row is affected, falls back to an INSERT.

I propose to add two optional flags to the loaddata command:

  1. --force-insert: passes force_insert=True to the save() call. This reduces load time by ~50%, but increases the risk of the load failing if a record already exists.

  2. --bulk_create: groups the records in the fixture by model and inserts each group with a single bulk_create() call. For large fixtures, I have achieved a 1000-fold improvement in loading time. It has a number of risks (already described in the bulk_create documentation), notably that it skips some routines covered in the model’s save() method (see the sketch below).
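
To make the grouping concrete, here is a minimal sketch of the idea (illustrative only, not the actual PoC code; names are placeholders):

```python
# Minimal sketch of the --bulk_create idea: collect deserialized objects per
# model, then insert each group at once instead of one save() per object.
from collections import defaultdict

from django.core import serializers


def load_fixture_with_bulk_create(fixture_file, using="default"):
    grouped = defaultdict(list)
    for deserialized in serializers.deserialize("json", fixture_file, using=using):
        # deserialized.object is the unsaved model instance.
        grouped[deserialized.object.__class__].append(deserialized.object)

    for model, objects in grouped.items():
        # One INSERT (or a few batches) per model instead of one query per row.
        # Note: bulk_create() skips save() and pre/post_save signals, and any
        # m2m data from the fixture needs separate handling.
        model.objects.using(using).bulk_create(objects)
```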

Both flags are intended for people who know that the above issues will not be a problem for them, and neither should be enabled by default. I currently have a PoC ready as a Django app, which basically subclasses the Command class inside loaddata.py. I achieved very significant improvements in load time, and the added code is minimal (~20 lines).
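
To give an idea of the shape of the PoC (the code below is a simplified sketch, not the exact PoC code; the fast-path save logic is omitted because it depends on loaddata internals):

```python
# Simplified sketch of an app-level management command that subclasses the
# built-in loaddata Command and adds the new flag. The actual fast-path save
# logic is omitted here.
from django.core.management.commands.loaddata import Command as LoadDataCommand


class Command(LoadDataCommand):
    def add_arguments(self, parser):
        super().add_arguments(parser)
        parser.add_argument(
            "--force-insert",
            action="store_true",
            help="Save each deserialized object with force_insert=True.",
        )

    def handle(self, *fixture_labels, **options):
        self.force_insert = options["force_insert"]
        super().handle(*fixture_labels, **options)
```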

Would this functionality be interesting to add to the main branch?


This sounds promising to me. I think the biggest risk here is that someone will use these features carelessly, but that can be mitigated with good documentation and we have the same tradeoff in force_insert and bulk_create already.

Thanks. Should I proceed by creating a ticket in Trac with my branch linked?

Yes, I think so. You should link to this forum thread too. Maybe the triage team will think this needs more discussion here first.

OK, I created ticket 35975 and made a PR. Let's hope for the best.


@JorisBenschop you had raised a similar ticket quite recently, thank you for creating this discussion. I think the comment from Simon still stands :+1:

Given this is a performance-related new feature, I suggest your proposal come equipped with some details about what kind of improvements users should expect (profiles and benchmarks, instead of solely claiming the current approach is inefficient), backed by steps to reproduce, as well as a PoC that properly deals with other features of the serde framework such as natural keys, and a plan for how to deal with backends that don’t support ignore_conflicts. It might even be a good opportunity to augment our performance tracking system with serde benchmarks.


Sarah, I really do not understand your question, as I described the improvement in much more detail in the ticket that you closed. The theoretical speedup is 50% for single-record inserts or, using bulk_create, a reduction of nearly 100%. But I have now also included exact benchmark results in the Trac ticket. On small models, the improvement is ~50% for force-insert and ~90% for bulk_create. I expect more complex models to show larger improvements.

The serde framework seems to be specifically for measuring performance when no changes are made to the API. In this case, there is an optional added parameter that, if passed to an older version, would raise an exception. So without conditional logic (which I do not see being used anywhere in the serde benchmarks), I cannot add a test here.

Could you please clarify what you are asking of me?

I’m sorry you feel like something was thrown at you, that you feel dismissed, and that you believe I haven’t put effort into this.

I can assure you I did read the new ticket, old ticket, PR and this forum discussion.
“Closing” doesn’t mean the ticket is deleted; it either means “we don’t currently have the information or the community backing to action this” or “there are reasons why we shouldn’t do this”. This one falls into the former category - no energy or work is lost :slight_smile:

Previously, you added stats on speedups, but they were not backed with steps to reproduce.
I see you have since updated the ticket description, and I believe the model and fixtures are the same as what’s in the PR - so thank you, I imagine that’s something we can try to reproduce.

As a side note, we also have docs on optimization, which mention django-asv.

“In this case, there is an optional added parameter that, if passed to an older version, would raise an exception.”

It should be possible to run benchmarks locally against your code.
These benchmarks wouldn’t be merged into django-asv before the feature is merged into Django.
In general, writing benchmarks in django-asv for loading fixtures would be great :tada:
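
Something along these lines could be a starting point (a rough sketch; the class name, fixture name, and setup step are assumptions, and django-asv has its own conventions for models and configuration):

```python
# Rough sketch of an asv-style benchmark for fixture loading. asv times any
# method whose name starts with "time_"; setup() runs before each timing.
from django.core.management import call_command


class LoadDataBenchmarks:
    def setup(self):
        # Assumed setup step: ensure the benchmark models' tables exist and a
        # fixture file (e.g. "large_fixture.json") is discoverable before
        # timing starts.
        pass

    def time_loaddata_large_fixture(self):
        call_command("loaddata", "large_fixture.json", verbosity=0)
```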

I believe the “serde framework” refers to serialization rather than the framework we use for benchmarking.


I also got the impression from @charettes that this might not be so simple.
Did you look into natural keys and how to deal with backends that don’t support ignore_conflicts?


I’m really sorry, but I do not really follow the blocking nature of natural keys here. The --force-insert option only skips the UPDATE statement and does not interfere with any other code. The bulk_create option does skip the model’s save() method, but that method is already skipped by the loaddata command, as described in the current documentation (illustrated below).
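
To illustrate the documented behaviour I mean (a generic example, not code from the PR): a custom save() like the one below is already not executed during fixture loading today, with or without the new flags.

```python
# Generic illustration (not from the PR): loaddata saves objects in "raw"
# mode, so a custom save() like this is already skipped during fixture
# loading - the fixture must contain the final field values itself.
from django.db import models
from django.utils.text import slugify


class Article(models.Model):
    title = models.CharField(max_length=100)
    slug = models.SlugField(blank=True)

    def save(self, *args, **kwargs):
        if not self.slug:
            self.slug = slugify(self.title)  # not executed by loaddata
        super().save(*args, **kwargs)
```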

What I propose is that this potential risk is added to the documentation. To my understanding, loaddata is only used in a development context. These optional parameters do not cover 100% of use cases, but is that really a minimum requirement? I think these optional parameters would be beneficial in the majority of cases. For small fixtures the optimizations may also not be beneficial (only microsecond improvements).

As for ignore_conflicts, I think a similar case can be made: the parameters are meant for a development context, where the user is expected to understand the content of the fixtures. Perhaps we could add more text about the risks to the documentation?

Could you please help me understand why natural-key support and ignore_conflicts (which may not even be affected) are critical blockers for a merge?

The main thing is we need to make sure we investigate, test and document any limitations.

Folks expect that if something is in Django core, it should always work. Otherwise they’d raise a bug report (and these bugs would become release-blocker, top-priority bugs).

Hence, we need to be careful when adding new things in that we are clear about cases where it might not work or work as expected (and evaluate if that’s ok or if we need to do something about it).

I don’t have the answers here, and I would like to hear Simon’s opinion as he closed the original ticket.

Yes, that makes sense, thanks. I can add a bunch of tests to trigger these issues (missing FK, duplicate PK, etc.) to ensure graceful handling?
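
For example, something along these lines (model, fixture name, and the exact exception are placeholders; the real tests depend on how the flag surfaces the database error):

```python
# Placeholder sketch of a "duplicate pk" test; model, fixture, and the exact
# exception raised depend on how the new flag surfaces the database error.
from django.core.management import call_command
from django.db import IntegrityError
from django.test import TestCase

from .models import Article


class ForceInsertConflictTests(TestCase):
    def test_duplicate_pk_fails_loudly(self):
        Article.objects.create(pk=1, title="already here")
        with self.assertRaises(IntegrityError):
            call_command(
                "loaddata", "article_pk_1.json", force_insert=True, verbosity=0
            )
```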


Yes, that would be great - natural keys too (see “Serializing Django objects” in the Django documentation).
It doesn’t all have to be tests added to Django. Sharing a test project with lots of different fixtures that the PR can be tested against can also work.

Or maybe you can do something fancy like having the current tests re-run with the new options everywhere and see if anything breaks?

I’m going to need some advice here. From the PR I gather that there is a reluctance to add new fixtures, but testing these edge cases does require adding quite a lot of (large?) fixtures. I can probably mock/patch my way into adding fixtures dynamically from code (json.dump, mocking the location where fixtures are searched for, etc.), but that makes the tests large and opaque. What do you prefer?

I made a suggestion of looking into having the current tests re-run with these new options (this might require subclassing, using self.subTest, or having test cases added dynamically with a suffix).
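
A rough sketch of the subclassing variant (this assumes the existing tests go through django.core.management.call_command and uses the proposed bulk_create option name; the details would need adjusting against the real test suite):

```python
# Sketch of re-running the existing fixture tests with the new option forced
# on. Assumes the stock tests call management.call_command(); "bulk_create"
# is the hypothetical new option name.
from unittest import mock

from django.core import management

from .tests import FixtureLoadingTests  # the existing test case


class BulkCreateFixtureLoadingTests(FixtureLoadingTests):
    def setUp(self):
        super().setUp()
        original = management.call_command

        def call_command(name, *args, **kwargs):
            if name == "loaddata":
                kwargs.setdefault("bulk_create", True)
            return original(name, *args, **kwargs)

        patcher = mock.patch.object(management, "call_command", call_command)
        patcher.start()
        self.addCleanup(patcher.stop)
```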

I added generic tests to monitor stdout for loaddata, and more in-depth tests for the use of natural keys. I also created new tests that mimic FixtureLoadingTests (without some tests that load fixture1 and test for specific compression or output), but with either force_insert or bulk_create enabled. I’ve added this to a separate file for easy removal.

All tests pass except the combination of bulk_create and prefetch_related data. To my understanding this is a limitation of bulk_create, and I cannot fix this except in documentation.

I’ve tried most of the day to add these new tests without repeating code, but in general that creates really confusing code. So for now it just shows that the tests pass; I still need to figure out a way to reduce the code repetition (no idea how yet).

As for ignore_conflicts: this option is off by default and is not activated by adding --bulk_create to loaddata. So I have not yet been able to find a case where this would be a problem, but I am of course open to suggestions.

Sorry to bother you, but I fear I have lost momentum. Is there anything else I can do to move this ticket forward?

Hello @JorisBenschop, thank you for all the energy and dedication you are putting into this forum thread. I can clearly see that you care about this feature and that it is important to you.

That said, after reading the thread from top to bottom, I fear that a couple of things might be overlooked:

  • the high-level use case being solved, and
  • the applicability or value of solving that use case in Django as a core framework.

For the former, I think it’s fair to describe the use case as:

When developing a Django web service, and when loading a very large fixture, the fixture loading takes a lot of time.

For the latter, I honestly fail to see the immediate value in adding these new flags to Django Core. Django is a web framework aimed at solving common scenarios in a robust, secure, and stable manner. So far, given the use case description and the lack of community support for this proposal, I feel that this need arises from a niche use case (which doesn’t mean it’s not valid, just niche).

Because of the above, along with previous suggestions and comments, I agree with Simon’s proposal that the best path forward at this stage is to create a third-party app with an improved and faster loaddata command (or a version of it), and see if that gains community traction.

I understand that this may not be the answer you were hoping for, but I want to emphasize that this approach aligns with Django’s established processes for evaluating and deciding on new feature requests. I hope this helps clarify the reasoning behind the ticket closing decision.