Add support for atomic upserts

Hello Community!

I have a feature request / enhancement that I’d like to discuss here. It comes from a situation I encountered when I took over maintenance of a Wagtail CMS package called wagtail-ab-testing. To give a quick overview of what it does: it supports CMS editors in testing two versions of the same Wagtail page, keeping track of analytics like views and conversions.

I noticed a raw SQL query is used to insert or update (UPSERT) a record atomically (link to the relevant source code, if you want to have a look yourself).

This raw query initially confused me: why is it there? It turns out that Django does not support atomic upserts without taking a lock on the row, which hurts performance. That matters here because this is quite a hot code path.

The raw query (for PostgreSQL) looks somewhat like this:

INSERT INTO %s (ab_test_id, version, date, hour, participants, conversions)
VALUES (%%s, %%s, %%s, %%s, %%s, %%s)
ON CONFLICT (ab_test_id, version, date, hour)
  DO UPDATE SET participants = %s.participants + %%s, conversions = %s.conversions + %%s;

There are a few things to note here:

  • There is a unique constraint on the combination of the ab_test_id, version, date, and hour columns, which prevents duplicate records from being created.
  • The participants and conversions columns are updated atomically: they are incremented by the given values, not set directly. This is important for atomicity; we don’t want multiple concurrent database calls to overwrite each other.
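To make those two points concrete, the same increment-on-conflict semantics can be reproduced with Python’s built-in sqlite3 module (SQLite supports this ON CONFLICT syntax since 3.24.0; the table and values here are simplified stand-ins for the example above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE hourly_log (
        ab_test_id INTEGER, version TEXT, date TEXT, hour INTEGER,
        participants INTEGER, conversions INTEGER,
        UNIQUE (ab_test_id, version, date, hour)
    )
""")

upsert = """
    INSERT INTO hourly_log (ab_test_id, version, date, hour, participants, conversions)
    VALUES (?, ?, ?, ?, ?, ?)
    ON CONFLICT (ab_test_id, version, date, hour)
      DO UPDATE SET participants = participants + excluded.participants,
                    conversions = conversions + excluded.conversions
"""
# The first call inserts the row; the second hits the unique constraint
# and atomically increments the counters instead of overwriting them.
con.execute(upsert, (1, "control", "2024-01-01", 9, 3, 1))
con.execute(upsert, (1, "control", "2024-01-01", 9, 2, 1))

row = con.execute(
    "SELECT participants, conversions FROM hourly_log"
).fetchone()
print(row)  # (5, 2)
```

In the DO UPDATE clause, an unqualified column name refers to the existing row and `excluded.` refers to the row that failed to insert, which is exactly the read-modify-write the database performs atomically.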

A common way to ‘create or update’ in Django is the update_or_create method, but this internally takes a lock on the row when updating, which is not acceptable here: as mentioned before, this is a hot code path with many concurrent requests, so we cannot afford a row lock. For reference, here’s the relevant Django source code that takes the lock.

The other alternative I researched is bulk_create with update_conflicts=True, but this won’t accept F expressions, which are necessary for the atomicity guarantee.

Why I would like Django to support this feature

I feel Django’s ORM would be strengthened if it allowed developers to make efficient queries like this out of the box, with no need for raw SQL or utility packages (for PostgreSQL, for example, there is django-pg-upsert).

Considerations

Database-vendor-specific SQL: not all databases use the same syntax, and some impose limitations.

Quoting from what I wrote in #35793

PostgreSQL and SQLite support this syntax since PostgreSQL 9.5 and SQLite 3.24.0 respectively. MySQL and MariaDB do not support it natively, but it can be emulated with the INSERT … ON DUPLICATE KEY UPDATE syntax together with a unique index on the columns.

Oracle apparently supports something similar using the vastly different MERGE statement syntax. That might be a bit of a challenge.

How this would fit in the ORM. I don’t feel qualified to make detailed suggestions here. My, perhaps naive, take is that update_or_create already gets very close now that the create_defaults parameter was added in Django 5.0. Perhaps we could add an argument instructing Django to perform an actual UPSERT query instead of emulating one by taking a lock?


Hope to hear thoughts on this. Is there appetite to add support for this in Django?


I think it would be worthwhile to investigate if this restriction could be removed.

As for changing update_or_create, I think #34996 (Enhance update_or_create() method with upsert SQL.) is relevant.

Not sure what the solution could be in Django (considering backwards compatibility and support for different databases), but I just want to +1 here because I recently had to resort to similar custom SQL for the same reason. It is missing in Django.

Also check out django-postgres-extra.

I also think upsert support would be great built into django.

It feels like making update_fields also accept a dict[str, Expression] (today it only accepts list[str]) would be the most straight forward way to implement this feature.

Today doing

AbTestHourlyLog.objects.bulk_create(
    [AbTestHourlyLog(...)],
    update_conflicts=True,
    unique_fields=["ab_test", "version", "date", "hour"],
    update_fields=["participants", "conversions"],
)

Results in something like

INSERT INTO abtesthourlylog (ab_test_id, version, date, hour, participants, conversions)
VALUES (%s, %s, %s, %s, %s, %s)
ON CONFLICT (ab_test_id, version, date, hour)
  DO UPDATE SET participants = EXCLUDED.participants, conversions = EXCLUDED.conversions;
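To show why that existing behavior is not enough for the counters use case, here is a standalone sketch using sqlite3 (which supports the same excluded keyword): setting a column to the excluded value replaces the counters with the incoming values instead of accumulating them.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE hourly_log (
        ab_test_id INTEGER, version TEXT, date TEXT, hour INTEGER,
        participants INTEGER, conversions INTEGER,
        UNIQUE (ab_test_id, version, date, hour)
    )
""")

# The equivalent of update_fields=["participants", "conversions"]:
# on conflict, the existing row is overwritten with the incoming values.
overwrite = """
    INSERT INTO hourly_log VALUES (?, ?, ?, ?, ?, ?)
    ON CONFLICT (ab_test_id, version, date, hour)
      DO UPDATE SET participants = excluded.participants,
                    conversions = excluded.conversions
"""
con.execute(overwrite, (1, "control", "2024-01-01", 9, 3, 1))
con.execute(overwrite, (1, "control", "2024-01-01", 9, 2, 1))

row = con.execute(
    "SELECT participants, conversions FROM hourly_log"
).fetchone()
print(row)  # (2, 1) -- the second call replaced the counters; nothing accumulated
```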

Looking at how bulk_create, SQLInsertCompiler, and on_conflict_suffix_sql are implemented, it seems relatively straightforward to add support for

AbTestHourlyLog.objects.bulk_create(
    [AbTestHourlyLog(...)],
    update_conflicts=True,
    unique_fields=["ab_test", "version", "date", "hour"],
    update_fields={
        "participants": F("participants") + participants_incr,
        "conversions": F("conversions") + conversions_incr,
    },
)

To result in

INSERT INTO abtesthourlylog (ab_test_id, version, date, hour, participants, conversions)
VALUES (%s, %s, %s, %s, %s, %s)
ON CONFLICT (ab_test_id, version, date, hour)
  DO UPDATE SET participants = abtesthourlylog.participants + %s, conversions = abtesthourlylog.conversions + %s;

In order to avoid peppering the logic with isinstance(update_fields, list) checks, we could immediately turn list[str] into dict[str, F] in QuerySet.bulk_create, keeping the list form as a shorthand and, more importantly, preserving backward compatibility:

if isinstance(update_fields, list):
    update_fields = {
        update_field: F(update_field)
        for update_field in update_fields
    }

All the downstream logic would then only have to expect update_fields: dict[str, Expression] | None.
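As a quick sanity check of that normalization idea, here is a standalone sketch (the F class below is a minimal stand-in for django.db.models.F, and normalize_update_fields is a hypothetical helper mirroring the suggested shim in QuerySet.bulk_create):

```python
from dataclasses import dataclass


@dataclass
class F:  # stand-in for django.db.models.F, just for this sketch
    name: str


def normalize_update_fields(update_fields):
    # Hypothetical helper: turn the legacy list form into the dict form
    # so downstream logic only ever sees dict[str, Expression] | None.
    if isinstance(update_fields, list):
        update_fields = {field: F(field) for field in update_fields}
    return update_fields


# The legacy list form keeps its meaning ("set column to the inserted value"):
print(normalize_update_fields(["participants", "conversions"]))
# {'participants': F(name='participants'), 'conversions': F(name='conversions')}

# The new dict form passes through unchanged:
print(normalize_update_fields({"participants": F("participants")}))
```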


I think we can (re-)open a ticket for this based on @charettes’s draft.

TIL that the ORM already supports upserts!

Great to see the positive responses so far!

Looks like I forgot to copy over the note from my draft that should preface my opening post. I can’t seem to find the edit button for my post; hopefully a moderator can edit it and add this note:

this was originally raised as #35793 on the issue tracker

I’ve re-opened #35793.

I think that makes a lot of sense!

Not to hijack the thread – let me know if this should be a separate discussion. Do you think it would make sense for us to change the for_save parameter in BaseExpression.resolve_expression() to allow differentiating between INSERT vs UPDATE queries?

Currently they’re set to True in both cases, but I can imagine it being useful to resolve the expression differently when used in bulk_create(update_conflicts=True) and update_or_create().

For example: @adzshaf and I are currently working on adding JSONSet() and JSONRemove() DB functions. We’d like to change how the expression is resolved depending on whether it’s an INSERT or an UPDATE query. See the example hack and this PR comment for more context.

P.s. thanks again for writing down your thoughts here and on the ticket – big fan of your ORM work!


I have a question: while working on this, I learned that F expressions are not supported inside bulk_create, and that F works on the basis that the SQL is generated when save() is hit. So is this still the approach we should go for? I need some advice on it.

@YashRaj1506

The work required here is to adapt the bulk_create logic to support any resolvable being provided through update_fields values (F itself being a resolvable).

The suggestion of turning update_fields: list into update_fields: dict[str, Expression] is merely a suggested top-level simplification to maintain backward compatibility with the current supported signature.


Okay, I get it now! Thanks for the clarification; I’ll continue working on it.
