Add support for atomic upserts

Hello Community!

I have a feature request / enhancement that I’d like to discuss here. It comes from a situation I encountered when I took over maintenance of a Wagtail CMS package called wagtail-ab-testing. To give a quick overview of what it does: it supports CMS editors in testing two versions of the same Wagtail page, keeping track of analytics like views and conversions.

I noticed a raw SQL query is used to insert or update (UPSERT) a record atomically (link to the relevant source code, if you want to have a look yourself).

This raw query initially confused me: why is it there? It turns out that Django does not support atomic upserts without taking a lock on the row, which hurts performance. That matters here because this is quite a hot code path.

The raw query (for PostgreSQL) looks somewhat like this:

INSERT INTO %s (ab_test_id, version, date, hour, participants, conversions)
VALUES (%%s, %%s, %%s, %%s, %%s, %%s)
ON CONFLICT (ab_test_id, version, date, hour)
  DO UPDATE SET participants = %s.participants + %%s, conversions = %s.conversions + %%s;

There are a few things to note here:

  • There is a unique constraint on the combination of the ab_test_id, version, date, and hour columns, which prevents duplicate records from being created.
  • The participants and conversions columns are updated atomically: they are incremented by the given values, not set directly. This is important for atomicity; we don’t want multiple concurrent database calls to overwrite each other.
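To make those two points concrete, the same increment-on-conflict semantics can be reproduced with Python’s built-in sqlite3 module (SQLite supports this ON CONFLICT syntax since 3.24.0; the table and values here are simplified stand-ins for the example above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE hourly_log (
        ab_test_id INTEGER, version TEXT, date TEXT, hour INTEGER,
        participants INTEGER, conversions INTEGER,
        UNIQUE (ab_test_id, version, date, hour)
    )
""")

upsert = """
    INSERT INTO hourly_log (ab_test_id, version, date, hour, participants, conversions)
    VALUES (?, ?, ?, ?, ?, ?)
    ON CONFLICT (ab_test_id, version, date, hour)
      DO UPDATE SET participants = participants + excluded.participants,
                    conversions = conversions + excluded.conversions
"""
# The first call inserts the row; the second hits the unique constraint
# and atomically increments the counters instead of overwriting them.
con.execute(upsert, (1, "control", "2024-01-01", 9, 3, 1))
con.execute(upsert, (1, "control", "2024-01-01", 9, 2, 1))

row = con.execute(
    "SELECT participants, conversions FROM hourly_log"
).fetchone()
print(row)  # (5, 2)
```

In the DO UPDATE clause, an unqualified column name refers to the existing row and `excluded.` refers to the row that failed to insert, which is exactly the read-modify-write the database performs atomically.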

A common way to ‘create or update’ in Django is the update_or_create method, but this internally takes a lock on the row when updating, which is not acceptable here: as mentioned before, this is a hot code path with many concurrent requests, so we cannot afford a row lock. For reference, here’s the relevant Django source code that takes the lock.

The other alternative I researched is bulk_create with update_conflicts=True, but this won’t accept F expressions, which are necessary for the atomicity guarantee.

Why I would like Django to support this feature

I feel Django’s ORM would be strengthened if it allowed developers to make efficient queries like this out of the box, with no need for raw SQL or utility packages (for PostgreSQL, for example, there is django-pg-upsert).

Considerations

Database-vendor-specific SQL: not all databases use the same syntax, and some impose limitations.

Quoting from what I wrote in #35793

PostgreSQL and SQLite support this syntax since PostgreSQL 9.5 and SQLite 3.24.0 respectively. MySQL and MariaDB do not support it natively, but it can be emulated with the INSERT … ON DUPLICATE KEY UPDATE syntax together with a unique index on the columns.

Oracle apparently supports something similar using the vastly different MERGE statement syntax. That might be a bit of a challenge.

How this would fit in the ORM. I don’t feel qualified to make detailed suggestions here. My, perhaps naive, take is that update_or_create already gets very close now that the create_defaults parameter was added in Django 5.0. Perhaps we could add an argument instructing Django to perform an actual UPSERT query instead of emulating one by taking a lock?


Hope to hear thoughts on this. Is there appetite to add support for this in Django?


I think it would be worthwhile to investigate if this restriction could be removed.

As for changing update_or_create, I think #34996 (Enhance update_or_create() method with upsert SQL.) is relevant.

Not sure what the solution could be in Django (considering backwards compatibility and support for different databases), but I just want to +1 here because I recently had to resort to similar custom SQL for the same reason. It is missing in Django.

Also check out django-postgres-extra.

I also think upsert support would be great built into django.

It feels like making update_fields also accept a dict[str, Expression] (today it only accepts list[str]) would be the most straight forward way to implement this feature.

Today doing

AbTestHourlyLog.objects.bulk_create(
    [AbTestHourlyLog(...)],
    update_conflicts=True,
    unique_fields=["ab_test", "version", "date", "hour"],
    update_fields=["participants", "conversions"],
)

Results in something like

INSERT INTO abtesthourlylog (ab_test_id, version, date, hour, participants, conversions)
VALUES (%s, %s, %s, %s, %s, %s)
ON CONFLICT (ab_test_id, version, date, hour)
  DO UPDATE SET participants = EXCLUDED.participants, conversions = EXCLUDED.conversions;
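To show why that existing behavior is not enough for the counters use case, here is a standalone sketch using sqlite3 (which supports the same excluded keyword): setting a column to the excluded value replaces the counters with the incoming values instead of accumulating them.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE hourly_log (
        ab_test_id INTEGER, version TEXT, date TEXT, hour INTEGER,
        participants INTEGER, conversions INTEGER,
        UNIQUE (ab_test_id, version, date, hour)
    )
""")

# The equivalent of update_fields=["participants", "conversions"]:
# on conflict, the existing row is overwritten with the incoming values.
overwrite = """
    INSERT INTO hourly_log VALUES (?, ?, ?, ?, ?, ?)
    ON CONFLICT (ab_test_id, version, date, hour)
      DO UPDATE SET participants = excluded.participants,
                    conversions = excluded.conversions
"""
con.execute(overwrite, (1, "control", "2024-01-01", 9, 3, 1))
con.execute(overwrite, (1, "control", "2024-01-01", 9, 2, 1))

row = con.execute(
    "SELECT participants, conversions FROM hourly_log"
).fetchone()
print(row)  # (2, 1) -- the second call replaced the counters; nothing accumulated
```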

Looking at how bulk_create, SQLInsertCompiler, and on_conflict_suffix_sql are implemented, it seems relatively straightforward to add support for

AbTestHourlyLog.objects.bulk_create(
    [AbTestHourlyLog(...)],
    update_conflicts=True,
    unique_fields=["ab_test", "version", "date", "hour"],
    update_fields={
        "participants": F("participants") + participants_incr,
        "conversions": F("conversions") + conversions_incr,
    },
)

To result in

INSERT INTO abtesthourlylog (ab_test_id, version, date, hour, participants, conversions)
VALUES (%s, %s, %s, %s, %s, %s)
ON CONFLICT (ab_test_id, version, date, hour)
  DO UPDATE SET participants = abtesthourlylog.participants + %s, conversions = abtesthourlylog.conversions + %s;

In order to avoid peppering the logic with isinstance(update_fields, list) checks, we could immediately turn list[str] into dict[str, F] in QuerySet.bulk_create, keeping the list form as a shorthand and, more importantly, preserving backward compatibility:

if isinstance(update_fields, list):
    update_fields = {
        update_field: F(update_field)
        for update_field in update_fields
    }

All the downstream logic would then only have to expect update_fields: dict[str, Expression] | None.
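As a quick sanity check of that normalization idea, here is a standalone sketch (the F class below is a minimal stand-in for django.db.models.F, and normalize_update_fields is a hypothetical helper mirroring the suggested shim in QuerySet.bulk_create):

```python
from dataclasses import dataclass


@dataclass
class F:  # stand-in for django.db.models.F, just for this sketch
    name: str


def normalize_update_fields(update_fields):
    # Hypothetical helper: turn the legacy list form into the dict form
    # so downstream logic only ever sees dict[str, Expression] | None.
    if isinstance(update_fields, list):
        update_fields = {field: F(field) for field in update_fields}
    return update_fields


# The legacy list form keeps its meaning ("set column to the inserted value"):
print(normalize_update_fields(["participants", "conversions"]))
# {'participants': F(name='participants'), 'conversions': F(name='conversions')}

# The new dict form passes through unchanged:
print(normalize_update_fields({"participants": F("participants")}))
```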


I think we can (re-)open a ticket for this based on @charettes’s draft.

TIL that the ORM already supports upserts!

Great to see the positive responses so far!

Looks like I forgot to copy over the note from my draft that should preface my opening post. I can’t seem to find the edit button for my post; hopefully a moderator can edit it and add this note:

this was originally raised as #35793 on the issue tracker

I’ve re-opened #35793.

I think that makes a lot of sense!

Not to hijack the thread – let me know if this should be a separate discussion. Do you think it would make sense for us to change the for_save parameter in BaseExpression.resolve_expression() to allow differentiating between INSERT vs UPDATE queries?

Currently they’re set to True in both cases, but I can imagine it being useful to resolve the expression differently when used in bulk_create(update_conflicts=True) and update_or_create().

For example: @adzshaf and I are currently working on adding JSONSet() and JSONRemove() DB functions. We’d like to change how the expression is resolved depending on whether it’s an INSERT or an UPDATE query. See the example hack and this PR comment for more context.

P.s. thanks again for writing down your thoughts here and on the ticket – big fan of your ORM work!


I have a question: while working on this, I learned that F expressions are not supported inside bulk_create, and that F works on the basis that the SQL is generated when save() is hit. So is this still the approach we should go for? I need some advice on it.

@YashRaj1506

The work required here is to adapt the bulk_create logic to support any resolvable being provided through update_fields values (F itself being a resolvable).

The suggestion of turning update_fields: list into update_fields: dict[str, Expression] is merely a suggested top-level simplification to maintain backward compatibility with the current supported signature.


Okay, I get it now! Thanks for the clarification; I’ll continue working on it.
