However, one of the slowest parts of the whole bulk-insert process is still (quite surprisingly) creating the Python objects before the actual insert. This was already discussed in other threads:
(For additional reference, see this SO thread for a benchmark of creating objects in Python, or this blog post for a report of huge performance gains by avoiding creating model instances when fetching lots of data.)
So I was thinking about an option of inserting data with dictionaries. Since we already have .values() for fetching rows as dictionaries instead of objects, inserting data the same way should fit as well. It could be handled directly in the existing bulk_create method (allowing an arbitrary mix of dictionaries and objects in the list). We can simply take data from the dictionary instead of from the object, and for missing keys we can insert the model defaults.
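To make the idea concrete, here is a minimal, framework-free sketch of the extraction step: each row may be a dict or an object, and missing dict keys fall back to a per-field default. The `Field` class and `row_values` helper are hypothetical illustrations, not Django internals.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Field:
    """Hypothetical stand-in for a model field with a default."""
    attname: str
    default: Any = None

def row_values(row, fields):
    """Extract column values from either a dict or an object."""
    if isinstance(row, dict):
        # Missing keys fall back to the field default.
        return [row.get(f.attname, f.default) for f in fields]
    return [getattr(row, f.attname, f.default) for f in fields]

fields = [Field('id'), Field('name', default='unnamed')]
print(row_values({'id': 1}, fields))  # [1, 'unnamed']
```

The same dispatch could live inside bulk_create's value-collection loop, so dicts and objects can be freely mixed in one call.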
I’ve been experimenting with this, and have prepared an example implementation as a proof of concept: Faster bulk_create using dictionaries · adamsol/django@ed1ad9c · GitHub. The speedup in my tests was between 1.6x and 2x, depending on the model. It seems that models with db_default benefit the most, as we additionally avoid creating DatabaseDefault instances for each object.
Does this sound like something worth implementing in Django?
It’s of course possible to build such a helper function outside of Django (which I have done in a project I’m working on). Nonetheless, it would be convenient to have a faster method of inserting data built into the framework - especially for DB migrations, as they tend to be difficult to test automatically, so importing and using custom functions can easily lead to them breaking after some code refactoring.
@adamsol I have not looked at your approach yet, but I want to give you a few more pointers on what I have tried so far and why.
For django-computedfields I tested different approaches to cutting down ORM functionality, with these results:
This tested SELECTs with model instances vs. dictionaries retrieved via .values(). The roundtrip with updates still creates model instances, so the benefit is lower, at roughly a 1.5x speedup. For fully dictionary-based handling I expect the benefit to be somewhere in the 2-3x range for Postgres. This at least is indicated by my tests with the copy_insert implementations here: idea - should the postgres copy path get a copy_insert/create method? · Issue #4 · netzkolchose/django-fast-update · GitHub
I have not had time yet to turn everything into neatly tested library code, as I got distracted by a few psycopg issues and patched those first.
Maybe! I agree it fits with how .values() can return dictionaries—the symmetry of allowing dicts in some operations is appealing. However, it would be a big scope change, since it would logically lead to bulk_update also accepting dictionaries.
I would also like to see attempts to optimize Model.__init__ so this is less of a problem. It does a lot of work; some improvements have already been made, and there may be more yet to find.
I’m not sure if much can be done on the Django side here, since object creation overhead comes from Python itself - as benchmarked in the SO answer that I linked earlier. It’s getting better in newer Python versions (my measurements were on 3.13), but dictionaries should still win convincingly in most cases.
I found that we don’t need to create DatabaseDefault instances per model instance, leading to this ~12% optimization:
Nice, so now the advantage of avoiding objects will diminish a little, but 1.6x-1.8x should still be achievable.
Yeah, objects cannot be as fast as plain dicts. But the thread is not quite an apples-to-apples comparison, as it uses dataclasses, whose generated code may not be as efficient as a hand-written vanilla class.
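The dataclass-vs-vanilla-class difference is easy to check directly. A rough micro-benchmark sketch (class names here are made up for illustration; absolute numbers will vary by Python version and machine):

```python
import timeit
from dataclasses import dataclass

class Plain:
    """Hand-written vanilla class."""
    def __init__(self, a, b):
        self.a = a
        self.b = b

@dataclass
class Data:
    """Dataclass with a generated __init__."""
    a: int
    b: int

# Time 100k instantiations of each.
t_plain = timeit.timeit(lambda: Plain(1, 2), number=100_000)
t_data = timeit.timeit(lambda: Data(1, 2), number=100_000)
print(f'plain class: {t_plain:.3f}s  dataclass: {t_data:.3f}s')
```

Either way, both variants only move the baseline; a plain dict literal skips attribute assignment entirely and stays ahead of both.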
You’re right, after benchmarking further, I can see that Model.__init__ is actually the main culprit, and Python’s object creation overhead is less significant. The following script:
from django.db import models
import time

def measure(f):
    t = time.perf_counter()
    f()
    print(f'{(time.perf_counter() - t):.3f}')

class A:
    def __init__(self, id):
        self.id = id

class B(models.Model):
    class Meta:
        app_label = 'test'

N = 100_000
measure(lambda: [{'id': i} for i in range(N)])
measure(lambda: [A(id=i) for i in range(N)])
measure(lambda: [B(id=i) for i in range(N)])
gives results like these (times in seconds):
0.018
0.051
0.236
But I guess this doesn’t change much regarding the dictionary idea.
@adamsol I had a quick look at your demo implementation - it looks pretty straightforward, nice.
Still, I stumbled over the value adaptation you are doing in your approach:
# your direct approach
value = obj[field.attname]
# vs. django's adaptation
value = field_pre_save(obj)
I think this will lead to different behavior for certain field types and values, as some field types do advanced value adaptation in their pre_save logic (e.g. JSONField’s None behavior). Furthermore, the docs mention this as the way to do value adaptations.
While I like the idea of skipping per-field adaptation (it accounts for a lot of runtime on the Postgres backend), I think this needs to be discussed: whether it should take that route (then with a documentation hint that people have to adapt values beforehand in their dicts), or whether some basic adaptation to flatten out DB differences should still be done here.
If I’m seeing correctly, all adaptations actually happen in field_prepare (prepare_value), which my code still calls. My code indeed doesn’t call field_pre_save (pre_save_val), but that function is only for things like auto_now and files. I think I skipped it because pre_save_val is intertwined with calling getattr on the object, which cannot work in the context of dictionaries. Also, something similar already happens for raw queries: there is a condition in pre_save_val that avoids calling pre_save for them. So generally some decision would be required here: whether bulk_create via dictionaries is supposed to behave more like a raw query or a standard query. Either some more changes in the code would be necessary, or the difference would need to be documented.
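To illustrate why pre_save is hard to reconcile with dicts, here is a hypothetical stand-in (not Django code) for an auto_now-style field: its pre_save both reads and writes attributes on the instance, which a plain dict does not have in the same form.

```python
import datetime

class AutoNowField:
    """Illustrative stand-in for an auto_now field; not Django's API."""
    attname = 'updated'

    def pre_save(self, obj, add):
        # Compute the value and write it back onto the instance,
        # mirroring how auto_now fields mutate the model object.
        value = datetime.datetime.now(datetime.timezone.utc)
        setattr(obj, self.attname, value)
        return value

class FakeInstance:
    updated = None

field = AutoNowField()
inst = FakeInstance()
value = field.pre_save(inst, add=True)
print(inst.updated is value)  # True: the instance was mutated in place
```

With a dict row there is no instance to mutate, so either pre_save must be skipped (raw-query semantics) or a dict-aware equivalent would have to be added.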