However, one of the slowest parts of the whole bulk-insert process is still (quite surprisingly) creating the Python objects before the actual insert. This was already discussed in other threads:
(For additional reference, see this SO thread for a benchmark of creating objects in Python, or this blog post for a report of huge performance gains by avoiding creating model instances when fetching lots of data.)
So I was thinking about an option of inserting data with dictionaries. Since we already have .values() for fetching rows as dictionaries instead of objects, inserting data the same way seems like a natural fit. It could be handled directly in the existing bulk_create method (allowing an arbitrary mix of dictionaries and objects in the list): we would simply take the data from the dictionary instead of from the object, and insert the model defaults for any missing keys.
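To make the proposal concrete, here is a rough sketch of how such a call could look (Book and its fields are made up for illustration; mixing dicts into bulk_create is not an existing Django API, only the suggested extension):

# Hypothetical usage of the proposed API
Book.objects.bulk_create([
    Book(title="A", pages=100),      # regular model instance
    {"title": "B", "pages": 200},    # plain dict with the same fields
    {"title": "C"},                  # missing keys would fall back to model defaults
])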
I’ve been experimenting with this, and have prepared an example implementation as a proof of concept: Faster bulk_create using dictionaries · adamsol/django@ed1ad9c · GitHub. The speedup in my tests was between 1.6x and 2x, depending on the model. It seems that models with db_default benefit the most, as we additionally avoid creating DatabaseDefault instances for each object.
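To illustrate what I mean by that, a model with database-computed defaults might look like this (a made-up example, not one of the models from my benchmark):

from django.db import models
from django.db.models.functions import Now

class Event(models.Model):
    name = models.CharField(max_length=100)
    status = models.CharField(max_length=20, db_default="new")   # literal database default
    created_at = models.DateTimeField(db_default=Now())          # expression-based default

Each field with db_default currently gets a DatabaseDefault placeholder assigned when an instance is constructed, which is exactly the per-object work that dictionaries skip.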
Does this sound like something worth implementing in Django?
It’s of course possible to build such a helper function outside of Django (which I have done in a project I’m working on). Nonetheless, it would be convenient to have a faster method of inserting data built into the framework - especially for DB migrations, which tend to be difficult to test automatically, so relying on imported custom functions can easily leave them broken after some code refactoring.
@adamsol I have not looked at your approach yet, but I want to give you a few more pointers on what I have tried so far and why.
For django-computedfields I tested different approaches to cutting ORM functionality, with these results:
This tested SELECTs with model instances vs. dictionaries retrieved via .values(). The round trip with updates would still create model instances, so the benefit there is lower, at roughly a 1.5x speedup. For fully dictionary-based handling I expect the benefit to be somewhere around a 2-3x speedup for Postgres. That is at least what my tests with copy_insert implementations indicate: idea - should the postgres copy path get a copy_insert/create method? · Issue #4 · netzkolchose/django-fast-update · GitHub
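For reference, the two SELECT shapes being compared are roughly these (model and field names are placeholders):

# Full model instances: each row pays the Model.__init__ cost.
instances = list(SomeModel.objects.all())

# Plain dictionaries via .values(): no model instances are created at all.
rows = list(SomeModel.objects.values('id', 'name'))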
I have not had time yet to write everything down into neatly tested library code, as I got distracted by a few psycopg issues and patched those first.
Maybe! I agree it fits with how .values() can return dictionaries; the symmetry of allowing dicts in some operations is appealing. However, it would be a big scope change, since it would logically lead to bulk_update also accepting dictionaries.
I would also like to see attempts to optimize Model.__init__ so this is less of a problem. It does a lot of work; some improvements have been made already, and maybe there are more to be found.
I’m not sure if much can be done on the Django side here, since object creation overhead comes from Python itself - as benchmarked in the SO answer that I linked earlier. It’s getting better in newer Python versions (my measurements were on 3.13), but dictionaries should still win convincingly in most cases.
I found that we don’t need to create DatabaseDefault instances per model instance, leading to this ~12% optimization:
Nice, so now the advantage of avoiding objects will diminish a little, but 1.6x-1.8x should still be achievable.
Yeah, objects cannot be as fast as plain dicts. But the thread is not quite an apples-to-apples comparison, as it’s using dataclasses, which have generated code that may not be as efficient as a vanilla class can be.
You’re right, after benchmarking further, I can see that Model.__init__ is actually the main culprit, and Python’s object creation overhead is less significant. The following script:
# Run inside a configured Django project (e.g. from manage.py shell).
import time

from django.db import models

def measure(f):
    t = time.perf_counter()
    f()
    print(f'{(time.perf_counter() - t):.3f}')

class A:
    def __init__(self, id):
        self.id = id

class B(models.Model):
    class Meta:
        app_label = 'test'

N = 100_000
measure(lambda: [{'id': i} for i in range(N)])  # plain dicts
measure(lambda: [A(id=i) for i in range(N)])    # vanilla Python class
measure(lambda: [B(id=i) for i in range(N)])    # Django model
gives results like:
0.018
0.051
0.236
But I guess this doesn’t change much regarding the dictionary idea.