Towards a more efficient `bulk_update` with `VALUES` and `UPDATE FROM`

Hey! Long time no see! My current employer is running into some performance issues with `bulk_update`, and I figured I’d take a look. When I initially added `bulk_update` I went with the simplest approach possible: a big `CASE WHEN` statement. This was because it was universally supported with a common syntax on all of our backends.
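For context, here’s a minimal sketch of what that `CASE WHEN` strategy looks like at the SQL level, using Python’s built-in `sqlite3` purely for illustration (the table and values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blog_posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO blog_posts VALUES (?, ?)",
    [(1, "first"), (2, "second"), (3, "third")],
)

# One WHEN branch per updated row - roughly the shape of SQL that a
# CASE WHEN based bulk_update emits. The statement grows linearly with
# the number of rows being updated.
conn.execute(
    """
    UPDATE blog_posts
    SET title = CASE
        WHEN id = 1 THEN 'post 1 (edited)'
        WHEN id = 3 THEN 'post 3 (edited)'
        ELSE title
    END
    WHERE id IN (1, 3)
    """
)

print(conn.execute("SELECT id, title FROM blog_posts ORDER BY id").fetchall())
# → [(1, 'post 1 (edited)'), (2, 'second'), (3, 'post 3 (edited)')]
```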

Specific databases have much more efficient ways of doing this, and I planned to add database-specific updates at some point (ticket). But I never got round to it (until now?).

All of the database-specific syntaxes rely on some form of static `VALUES` expression. For example, the syntax on Postgres and SQLite is almost identical. I haven’t tried MySQL, but it looks to work similarly (with a slightly different `VALUES` syntax), and the same goes for Oracle, although it’s annoying to implement there.

-- Postgres:
UPDATE blog_posts
SET title = v.title
FROM (
  VALUES (1, 'post 2'), (3, 'post 3')
) AS v(id, title)
WHERE blog_posts.id = v.id;

-- SQLite:
-- Annoyingly, SQLite doesn't support named columns
-- with `VALUES` expressions. We name these with a sub-select:
UPDATE blog_posts
SET title = v.title
FROM (
  SELECT "column1" AS "id", "column2" AS "title"
  FROM (VALUES (1, 'post 2'), (3, 'post 3'))
) AS v
WHERE blog_posts.id = v.id;
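For anyone wanting to try this locally, the SQLite form above runs as-is through Python’s `sqlite3` module, provided the underlying SQLite library is 3.33.0 or newer (when `UPDATE ... FROM` was added). The table and data here are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blog_posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO blog_posts VALUES (?, ?)",
    [(1, "first"), (2, "second"), (3, "third")],
)

# SQLite names the VALUES columns "column1", "column2", ...,
# so the sub-select gives them usable aliases.
# Requires SQLite >= 3.33.0 for UPDATE ... FROM.
conn.execute(
    """
    UPDATE blog_posts
    SET title = v.title
    FROM (
        SELECT "column1" AS "id", "column2" AS "title"
        FROM (VALUES (1, 'post 2'), (3, 'post 3'))
    ) AS v
    WHERE blog_posts.id = v.id
    """
)

print(conn.execute("SELECT id, title FROM blog_posts ORDER BY id").fetchall())
# → [(1, 'post 2'), (2, 'second'), (3, 'post 3')]
```

Unlike the `CASE WHEN` form, the row data sits in a single `VALUES` table, so the statement stays flat no matter how many rows are updated.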

I’ve been prototyping how to add this to Django, and I’ve got something half-working on sqlite. The relevant code is here (it’s horrible, I know). I’d love some feedback on the general approach before I spend any more time on it.

I added two new expressions: Values and ValueTuple, which can be used to construct `VALUES ('foo'), ('bar')`-style expressions.

In order to make this work on sqlite, and potentially other backends, it seems easiest to try to add a new “kind” of Table type: ValuesTable. This gives us some nice properties: we can validate the column types/counts against the base table, and it trivially allows us to surface this via a queryset method:

# Combine with values_list to select a subset of data
Note.objects.values("pk", "note").select_values(
    [(1, "test"), (2, "test2")]
).all()

# Or even do this, to get back fully-formed `Note` instances?
Note.objects.select_from_literals([(1, "test"), (2, "test2")])

And in doing so, we can treat this as a subquery and re-use a bunch of our subquery/alias logic. In order for this to work, we need to support some kind of UPDATE ... FROM in SQLUpdateCompiler:

# A bulk-update implementation
# No idea about the specific syntax

SomeModel.objects.update_from(
    note=F('source__note'),  # ?
    source=SomeModel.objects.values("pk", "note").select_values([
        (1, "foobar"), (2, "some new note")
    ])
)

I haven’t dug into this, but it seems that doing this might also unlock the ability to use related fields in `.update()` - i.e. `Note.objects.update(note=F('tag__name'))` - by pushing the joins into the `UPDATE ... FROM` and returning a set of values:

SomeModel.objects.update_from(
    note=F('related_field__content'),
    source=SomeModel.objects.values("pk", "related_field__content")
)
UPDATE some_model
SET note = v.content
FROM (
    SELECT "some_model"."id" AS "id",
           "related_table"."content" AS "content"
    FROM some_model
    INNER JOIN related_table ...
) AS v
WHERE some_model.id = v.id;
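The underlying SQL idea can be exercised directly today. Here’s a runnable sketch with `sqlite3` and a made-up note/tag schema standing in for the real models; the sub-select pulls only from the related table, to sidestep restrictions some backends place on referencing the target table again in `FROM` (requires SQLite >= 3.33.0):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE note (
        id INTEGER PRIMARY KEY,
        note TEXT,
        tag_id INTEGER REFERENCES tag(id)
    );
    INSERT INTO tag VALUES (1, 'django'), (2, 'sql');
    INSERT INTO note VALUES (10, 'old', 1), (11, 'old', 2);
    """
)

# The related table's data is surfaced as a named sub-select in FROM,
# which is roughly what Note.objects.update(note=F('tag__name'))
# would need to compile down to.
conn.execute(
    """
    UPDATE note
    SET note = v.name
    FROM (SELECT id, name FROM tag) AS v
    WHERE note.tag_id = v.id
    """
)

print(conn.execute("SELECT id, note FROM note ORDER BY id").fetchall())
# → [(10, 'django'), (11, 'sql')]
```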

Essentially, it boils down to this: I can probably hack something very Postgres-specific into `bulk_update` fairly easily, and I might well do that as an initial change.

But it seems that adding some kind of “table from VALUES” expression would unlock a few nice things down the line, and bulk_update could be trivially built off these expressions/abstractions?

Is this something we’d like to add? Are there any glaring issues or roadblocks that I’m not seeing?

Hey Tom, glad to see you back!

I’m excited to see you pick up the work from #29771 and #31202 which should likely be merged under the same ticket?

I haven’t had the chance to look at your proposed code in depth, but my immediate feedback would be that we should stay away from adding (yet more) queryset methods (in your case select_from_literals, update_from, and select_values) to resolve these problems, and instead focus these tickets on making bulk_update fast by using the best available strategy depending on the kind of values provided (literals, expressions).

In other words, I don’t think that these tickets were accepted under the purpose of augmenting the queryset interface to support new query building patterns but to enhance what’s under the hood of the current implementation of bulk_update. I personally would have expected the changes to be entirely localized to the QuerySet.bulk_update method and either to SQLUpdateCompiler or to a distinct SQLBulkUpdateCompiler class.

I could see the work here land in per-backend phases (e.g. having the new compiler default to the CASE/WHEN implementation) if getting all of them to work in a single unit of work proves to be too challenging. Just having a SQL compiler avoid the (crazy) amount of resolving that currently takes place for these expressions and generate raw SQL should already go a long way in making things faster.

Thanks for the feedback! Yeah, I was perhaps a bit too into the weeds when I wrote the initial post :joy:.

I’ve taken a step back and have got an initial implementation for SQLite working. Before I spend time making it work on the other backends, I’d appreciate any insight you can give on the high-level approach :heart: - I’d like to hedge against implementing this for all backends in a totally incorrect way (or worse, missing a far simpler approach!) and then having to re-do it.

I added two new expression types, RowTupleValues and RowTuple, to generate the correct VALUES ... expressions on different backends. This seems OK I think.

bulk_update produces a list of RowTupleValues from the given instance fields, and then passes it to sql.BulkUpdateQuery for compilation + execution.

I added a SQLBulkUpdateCompiler class, which would provide a simple implementation of `UPDATE ... FROM VALUES` for each backend. The syntax is quite DB-specific, and I’m not sure this is the right approach, as we’d end up with one DB-specific compiler for each backend in core (we currently only have them for MySQL). Maybe that’s perfectly fine though?

I also didn’t want to jam this into the existing SQLUpdateCompiler even though it does something similar.

I then thought that the base SQLBulkUpdateCompiler implementation could contain the bulk of the logic from the current bulk_update method: i.e. it would translate the given RowTupleValues into a CASE WHEN statement, then execute it as a standard UPDATE query.

But this logic seems quite different from the other compilation classes, so I’m wondering if I should keep the fallback logic in the bulk_update method and only use the specialized SQLBulkUpdateCompiler if the backend supports it? This also seems a bit iffy.

The performance numbers so far are pretty good: updating 10k simple models with 2 fields goes from 0.67 seconds to 0.14 seconds :rocket:

I went ahead and cracked on, and landed on this: Fixed #29771 -- Support database-specific syntax for bulk_update by orf · Pull Request #18844 · django/django · GitHub

I need to properly crunch the numbers, but it’s about 4x as fast as the current implementation across all backends (except Oracle, which is not implemented yet).

I managed to generalize the implementation across all backends, with only a small bit of MySQL-specific handling.
