Automatically setting update_fields when saving a model instance

nickaddison · January 10, 2024, 2:59am

After running into concurrency issues where my Django model instances were being overwritten by previous values due to .save() not having update_fields specified, I began refactoring to specify update_fields wherever possible. This process proved more difficult than anticipated for complex functions and during the process I had a thought - if we can reliably know which fields have changed in-memory during thread execution (tracked on the model instance), can we automatically set update_fields?

This would make concurrency much safer as even if the instance gets “overwritten”, only the overlapping fields are an issue.
Though I can see how this could be undesirable in some cases, it seems like majority of cases would benefit from having update_fields restricted to only fields which have been updated.

Here’s a draft of a BaseModel that can hopefully be used to achieve this functionality. Appreciate any insight into problematic side-effects this may cause.
Note: this accounts for deferred fields and does not incur additional database queries.

from django.db import models


class _Null:
    """A class to represent a Null type."""

    def __bool__(self):
        return False


NULL = _Null()


class BaseModel(models.Model):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.__original_values = self._get_loaded_field_values()

    def _get_loaded_field_names(self):
        concrete_fields = self._meta.concrete_fields
        normalized_deferred_fields = []
        for deferred_field in self.get_deferred_fields():
            if deferred_field not in concrete_fields and deferred_field.endswith("_id"):
                deferred_field = deferred_field[:-3]
            normalized_deferred_fields.append(deferred_field)
        return [
            field.name
            for field in concrete_fields
            if field.name not in normalized_deferred_fields
        ]

    def _get_loaded_field_values(self):
        field_values = {}
        for field_name in self._get_loaded_field_names():
            foreign_key_id = getattr(self, field_name + "_id", NULL)
            if foreign_key_id is not NULL:
                field_values[field_name] = foreign_key_id
            else:
                field_values[field_name] = getattr(self, field_name, NULL)
        return field_values

    def _get_update_fields(self):
        update_fields = []
        for field_name, current_field_value in self._get_loaded_field_values().items():
            original_field_value = self.__original_values.get(field_name, NULL)
            if (
                current_field_value is not NULL
                and current_field_value != original_field_value
            ):
                update_fields.append(field_name)
        return update_fields

    def refresh_from_db(self, *args, **kwargs):
        super().refresh_from_db(*args, **kwargs)
        self.__original_values = self._get_loaded_field_values()

    def save(self, *args, update_fields=None, **kwargs):
        """
        Automatically set `update_fields` (if not explicitly specified)
        based on which fields have been modified
        """
        creating = self._state.adding is True
        if not creating and update_fields is None:
            update_fields = self._get_update_fields()
        super().save(*args, update_fields=update_fields, **kwargs)
        self.__original_values = self._get_loaded_field_values()

washeck · January 10, 2024, 8:41am

I’d be very happy to see improvements in this area. I’m long time Django user and I did not realize how big this problem is until we received many support tickets about pattern IMHO common to many projects: our users changed their password in UI but it was reset to the old one because we had a very long cron job running with

for u in User.objects.all():
   do_something()
   u.save()

I can imagine that your approach hits many edge cases but if Django cannot solve this automatically, at least the docs should guide users to always pass update_fields to save (which we actually did in our project - enforce update_field to be present on every update).

megaman821 · January 11, 2024, 8:17pm

Always using update fields can lead to data inconsistency. If I had an inventory record that had quantity and name fields. I read it has 12 Pet Rocks and I change it to 12 Monster Trucks. At the same time someone else is readying 12 Pet Rocks and changes it to 15 Pet Rocks. After both partial updates go through we end up with 15 Monster Trucks. I would rather have last write wins in that scenario.

nickaddison · January 11, 2024, 11:01pm

@megaman821 That would be a good example of an undesirable case. I think it can be described more generally as

initial_state = Model.objects.get()
final_state = do_stuff(initial_state)
final_state.save()

Where another process is running concurrently to change to a different final state. If update fields are restricted then you can end up with a combination of final states which could be an invalid or unexpected state. Which is definitely fair criticism.
We could always change the usage to handle both scenarios depending on what you want. For example, we could do something like .save(automatic_update_fields=True) or in the reverse default .save(update_fields=["__all__"]).

nickaddison · March 7, 2024, 12:32am

Update: I have been running this on some of my models in production for almost 2 months now without any known issues/drawbacks so far and has eliminated the concurrency problems (in the rarer case when fields are overlapping I use database locks to maintain data consistency)

kierangillblueberry · May 14, 2024, 1:28pm

@nickaddison Thanks for posting your solution to this problem! Have you made any changes to your original BaseModel since your first post?

We’re using a similar solution but with a couple of differences:

_get_loaded_field_values is return {f.name: getattr(self, f.name) for f in self._meta.fields}
Our version of _get_update_fields is…

    # self._original_state is computed with _get_loaded_field_values
    self._changed_fields = [
        f.name for f in self._meta.fields if getattr(self, f.name) != self._original_state.get(f.name)
    ]

The main differences are:

We don’t do the same NULL comparison as you (current_field_value is not NULL).
We use self._meta.fields instead of self._meta.concrete_fields.
We don’t filter out fields that fall into normalized_deferred_fields.

Could you help us understand each of these changes? Why did you add them / what is the use case?

Thanks for posting your solution on this forum. It’s nice to know we’re not the only ones running into this situation!

Topic		Replies	Views
Why doesn't django only UPDATE the fields that are changed? ORM	2	907	March 5, 2024
Hook for extending `update_fields` in save() overrides ORM	6	1041	December 22, 2023
compare model data before save Using the ORM	2	644	January 13, 2024
Where does Model.save() make a list of fields? Using Django	1	669	November 23, 2019
Force update of an existing model Using the ORM	6	1403	February 8, 2023

Automatically setting update_fields when saving a model instance

Related topics