bulk_create has a questionable side-effect on its input

Current behavior

As of now, bulk_create accepts an iterable of Model instances and converts it to a list internally. If it performs an insert query with a RETURNING clause, it then sets the fields of the Model instances to the values returned from that clause. This is correct in itself, however it is done directly on the objects passed as input.
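
For illustration, a minimal sketch of what this means for the caller (assuming a hypothetical MyModel with a name field and an auto pk, on a backend such as Postgres where the insert returns values):

objs = [MyModel(name='a'), MyModel(name='b')]
returned = MyModel.objects.bulk_create(objs)
# The returned list contains the very same objects that were passed in; the values
# from the RETURNING clause (e.g. the pk) are set on the input instances directly.
assert returned[0] is objs[0]
assert objs[0].pk is not None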

Potential problem

While it might be insignificant for current bulk_create use-cases, such behavior could cause more problems for conditional upserts, support for which is being worked on (ticket #34277).

An example

Provided we use Postgres as the database backend and have defined a model

from django.db.models import Model, IntegerField

class ExampleModel(Model):
    key_1 = IntegerField()
    key_2 = IntegerField()

    conditional_property = IntegerField(null=True)
    data_property_1 = IntegerField()
    data_property_2 = IntegerField()

    class Meta:
        unique_together = ['key_1', 'key_2']

Running the following code

from django.db.models import Q

ExampleModel.objects.create(key_1=1337, key_2=37, data_property_1=42, data_property_2=0)
ExampleModel.objects.create(key_1=1338, key_2=38, data_property_1=9000, data_property_2=1, conditional_property=5)

create_instances = [
    ExampleModel(key_1=1337, key_2=37, data_property_1=0, data_property_2=42),
    ExampleModel(key_1=1338, key_2=38, data_property_1=1, data_property_2=9000),
]
ExampleModel.objects.bulk_create(
    create_instances,
    update_conflicts=True,
    update_fields=['data_property_1'],
    unique_fields=['key_1', 'key_2'],
    # condition refers to the conditional-upsert support proposed in ticket #34277
    condition=Q(conditional_property__isnull=True),
)
results = [
    (
        result.data_property_1,
        result.data_property_2,
    )
    for result in ExampleModel.objects.bulk_create(
        create_instances,
        update_conflicts=True,
        update_fields=['data_property_2'],
        unique_fields=['key_1', 'key_2'],
    )
]

would result in results containing the following values:

[(0, 0), (9000, 1)]

In other words, after the first bulk_create is performed, the data_property_1 field is updated to the value from the input, after which every input instance gets updated from the actual database values (if we consider these columns as db_returning_fields for the sake of this example, which they certainly could be).

This causes the second bulk_create to simply pass the values obtained from the database and set on the input list right back to the database, which effectively loses the changes we wanted to write in the second bulk_create, since our input values have been overwritten by database data.
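
To make this concrete, between the two calls the first input instance already reflects the database row rather than the values we constructed it with (under the same assumption that these columns are db_returning_fields):

create_instances[0].data_property_1  # 0, the value the first call just wrote
create_instances[0].data_property_2  # 0, the database value, no longer the 42 we set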

Further discussion setup

I believe such behavior is implemented in bulk_create so that it can preserve the original input order while partitioning the set into records with and without a pk and working on them separately.

One way to avoid this side-effect would be to make a deepcopy of the input objects and operate on that instead. However, this could be costly for large batches, since it would double the amount of memory needed for the insert, so this approach should be treated with caution or discarded as inefficient.
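
For completeness, the same idea can be applied on the caller side today; a minimal sketch (carrying the same memory caveat, and leaving the proposed condition argument aside) that keeps the original instances untouched:

import copy

# Pass deep copies so the database-returned values land on the copies, while the
# original instances keep the values we constructed them with.
ExampleModel.objects.bulk_create(
    copy.deepcopy(create_instances),
    update_conflicts=True,
    update_fields=['data_property_1'],
    unique_fields=['key_1', 'key_2'],
)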

With that said, I am not sure at the moment what other easy fixes there would be.

As another measure, the documentation could clarify and warn about this behavior, just as it warns about potential queryset evaluation due to the input being converted to a list internally.

I will be happy to collaborate and provide further info if needed; as of now I am not sure this deserves a separate ticket, but I will create one if it does.

Isn’t this example quite contrived? The two data_property fields could both be updated in the one call, if I understand correctly.

In general, if you’re going to bulk_create() a set of model instances, you’d normally only call it once. I think it’s fair to allow updating other fields on conflict, as long as it’s documented appropriately.
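
A sketch of that single-call shape, using the model and instances from the example above (leaving the proposed condition argument aside):

ExampleModel.objects.bulk_create(
    create_instances,
    update_conflicts=True,
    update_fields=['data_property_1', 'data_property_2'],
    unique_fields=['key_1', 'key_2'],
)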

UPDATE: I now see that in my case it is indeed possible to avoid performing bulk_create twice; thank you for letting me see this through. I do, however, think that the documentation should be clearer about the side-effect itself, especially since bulk_create returns the result explicitly while silently updating the objects in the input list. My case might not be the only one, and this behavior is generally not expected or documented anywhere. I will be happy to work on updating the documentation in this case to resolve the issue.

What follows is my old draft of a post which I no longer stand by, but I will keep it to illustrate my thought process as a user.

previous post draft

I would disagree with it being contrived, since it comes from an actual business case I am tasked with. Even though the choice of this implementation is on me, I think it is a valid one, if not the best one, given the inputs I have.

Let me provide a more real-life example to illustrate.

Suppose we have a model:

from django.db.models import DateTimeField, IntegerField, Model

class Entry(Model):
    data = IntegerField()

    unique_key = IntegerField(unique=True)
    created_at = DateTimeField(auto_now_add=True)
    deleted_at = DateTimeField(blank=True, null=True)

The deleted_at field serves as a soft-delete mechanism, i.e. any Entry with this field set is considered deleted while still being present in the database.

If we create an Entry with the same unique_key as a deleted row has, we need to update its created_at timestamp along with the rest of the fields (data in this case).

If the deleted_at field is null, we only need to update the latter, without resetting created_at.

This is the task I am trying to perform, and the implementation that is broken by the side-effect is the following:

from django.db.models import Q

Entry.objects.bulk_create(
    instances,
    update_conflicts=True,
    update_fields=['created_at', 'deleted_at'],
    unique_fields=['unique_key'],
    condition=Q(deleted_at__isnull=False),
)
created_instances = Entry.objects.bulk_create(
    instances,
    update_conflicts=True,
    update_fields=['data'],
    unique_fields=['unique_key'],
)

Yeah, I 100% agree on better docs.