bulk_create has a questionable side-effect on its input

Current behavior

As of now, bulk_create accepts an iterable of Model instances and converts it to a list internally. If it performs an insert query with a RETURNING clause, it then sets the fields of the Model instances to the values returned from that clause. This is correct in itself, however it is done directly on the objects passed as input.
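
For illustration, a minimal sketch of what this means for the caller (assuming a hypothetical MyModel with a name field and an auto pk, on a backend such as Postgres where the insert returns values):

objs = [MyModel(name='a'), MyModel(name='b')]
returned = MyModel.objects.bulk_create(objs)
# The returned list contains the very same objects that were passed in; the values
# from the RETURNING clause (e.g. the pk) are set on the input instances directly.
assert returned[0] is objs[0]
assert objs[0].pk is not None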

Potential problem

While it might be insignificant for current bulk_create use-cases, such behavior could cause more problems for conditional upserts, support for which is being worked on (ticket #34277).

An example

Provided we use Postgres as the database backend and have defined a model

from django.db.models import Model, IntegerField

class ExampleModel(Model):
    key_1 = IntegerField()
    key_2 = IntegerField()

    conditional_property = IntegerField(null=True)
    data_property_1 = IntegerField()
    data_property_2 = IntegerField()

    class Meta:
        unique_together = ['key_1', 'key_2']

Running the following code

from django.db.models import Q

ExampleModel.objects.create(key_1=1337, key_2=37, data_property_1=42, data_property_2=0)
ExampleModel.objects.create(key_1=1338, key_2=38, data_property_1=9000, data_property_2=1, conditional_property=5)

create_instances = [
    ExampleModel(key_1=1337, key_2=37, data_property_1=0, data_property_2=42),
    ExampleModel(key_1=1338, key_2=38, data_property_1=1, data_property_2=9000),
]
ExampleModel.objects.bulk_create(
    create_instances,
    update_conflicts=True,
    update_fields=['data_property_1'],
    unique_fields=['key_1', 'key_2'],
    # condition refers to the conditional-upsert support proposed in ticket #34277
    condition=Q(conditional_property__isnull=True),
)
results = [
    (
        result.data_property_1,
        result.data_property_2,
    )
    for result in ExampleModel.objects.bulk_create(
        create_instances,
        update_conflicts=True,
        update_fields=['data_property_2'],
        unique_fields=['key_1', 'key_2'],
    )
]

would result in results containing the following values:

[(0, 0), (9000, 1)]

In other words, after the first bulk_create is performed, the data_property_1 field is updated to the value from the input, after which every input instance gets updated from the actual database values (if we consider these columns as db_returning_fields for the sake of this example, which they certainly could be).

This causes the second bulk_create to simply pass the values obtained from the database and set on the input list right back to the database, which effectively loses the changes we wanted to write in the second bulk_create, since our input values have been overwritten by database data.
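
To make this concrete, between the two calls the first input instance already reflects the database row rather than the values we constructed it with (under the same assumption that these columns are db_returning_fields):

create_instances[0].data_property_1  # 0, the value the first call just wrote
create_instances[0].data_property_2  # 0, the database value, no longer the 42 we set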

Further discussion setup

I believe such behavior is implemented in bulk_create so that it can preserve the original input order while partitioning the set into records with and without a pk and working on them separately.

One way to avoid this side-effect would be to make a deepcopy of the input objects and operate on that instead. However, this could be costly for large batches, since it would double the amount of memory needed for the insert, so this approach should be treated with caution or discarded as inefficient.
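
For completeness, the same idea can be applied on the caller side today; a minimal sketch (carrying the same memory caveat, and leaving the proposed condition argument aside) that keeps the original instances untouched:

import copy

# Pass deep copies so the database-returned values land on the copies, while the
# original instances keep the values we constructed them with.
ExampleModel.objects.bulk_create(
    copy.deepcopy(create_instances),
    update_conflicts=True,
    update_fields=['data_property_1'],
    unique_fields=['key_1', 'key_2'],
)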

With that said, I am not sure at the moment what other easy fixes there would be.

As another measure, the documentation could clarify and warn about this behavior, just as it warns about potential queryset evaluation due to the input being converted to a list internally.

I will be happy to collaborate and provide further info if needed; as of now I am not sure this deserves a separate ticket, but I will create one if it does.

Isn’t this example quite contrived? The two data_property fields could both be updated in the one call, if I understand correctly.

In general, if you’re going to bulk_create() a set of model instances, you’d normally only call it once. I think it’s fair to allow updating other fields on conflict, as long as it’s documented appropriately.
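
A sketch of that single-call shape, using the model and instances from the example above (leaving the proposed condition argument aside):

ExampleModel.objects.bulk_create(
    create_instances,
    update_conflicts=True,
    update_fields=['data_property_1', 'data_property_2'],
    unique_fields=['key_1', 'key_2'],
)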

UPDATE: I now see that in my case it is indeed possible to avoid performing bulk_create twice; thank you for letting me see this through. I do, however, think that the documentation should be clearer about the side-effect itself, especially since bulk_create returns the result explicitly while silently updating the objects in the input list. My case might not be the only one, and this behavior is generally not expected or documented anywhere. I will be happy to work on updating the documentation in this case to resolve the issue.

What follows is my old draft of a post which I no longer stand by, but I will keep it to illustrate my thought process as a user.

previous post draft

I would disagree with it being contrived, since it comes from an actual business case I am tasked with. Even though the choice of this implementation is on me, I think it is a valid one, if not the best one, given the inputs I have.

Let me provide a more real-life example to illustrate.

Suppose we have a model:

from django.db.models import DateTimeField, IntegerField, Model

class Entry(Model):
    data = IntegerField()

    unique_key = IntegerField(unique=True)
    created_at = DateTimeField(auto_now_add=True)
    deleted_at = DateTimeField(blank=True, null=True)

The deleted_at field serves as a soft-delete mechanism, i.e. any Entry with this field set is considered deleted while still being present in the database.

If we create an Entry with the same unique_key as a deleted row has, we need to update its created_at timestamp along with the rest of the fields (data in this case).

If the deleted_at field is null, we only need to update the latter, without resetting created_at.

This is the task I am trying to perform, and the implementation that is broken by the side-effect is the following:

from django.db.models import Q

Entry.objects.bulk_create(
    instances,
    update_conflicts=True,
    update_fields=['created_at', 'deleted_at'],
    unique_fields=['unique_key'],
    condition=Q(deleted_at__isnull=False),
)
created_instances = Entry.objects.bulk_create(
    instances,
    update_conflicts=True,
    update_fields=['data'],
    unique_fields=['unique_key'],
)

Yeah, I 100% agree on better docs.