Could bulk_update() aggregate the pk's in WHEN clauses?

Hi, I initially created a ticket for this, but it was rejected and I was referred here.

In bulk_update(), there is simple code that appends a WHEN clause for each per-object field value.

With a small change (list → defaultdict), instead of emitting a WHEN for each pk, which is very inefficient IMO, an aggregated WHEN ("t"."id" IN (14679095, 14679134, 14679335, ...)) THEN ... could be produced for objects that share the same value.

Current code:

    for field in fields:
        when_statements = []
        for obj in batch_objs:
            attr = getattr(obj, field.attname)
            if not hasattr(attr, "resolve_expression"):
                attr = Value(attr, output_field=field)
            when_statements.append(When(pk=obj.pk, then=attr))

Modified code:

    for field in fields:
        when_updates = defaultdict(list)
        for obj in batch_objs:
            attr = getattr(obj, field.attname)
            if not hasattr(attr, "resolve_expression"):
                attr = Value(attr, output_field=field)
            when_updates[attr].append(obj.pk)
        when_statements = [
            When(pk__in=pks, then=attr) if len(pks) > 1 else When(pk=pks[0], then=attr)
            for attr, pks in when_updates.items()
        ]

That could massively reduce the query length. This works for me in PostgreSQL. Can anyone think of a reason not to do this?
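
For illustration, here is roughly the SQL shape each version produces. This is a made-up example (a hypothetical Item model with an integer status field, three rows of which two share the same new value), not real query output:

    # Hypothetical example: "Item" model with an integer "status" field.
    from django.db.models import Case, Value, When

    # Per-object WHENs (current behaviour) produce roughly:
    #   SET "status" = CASE WHEN "id" = 1 THEN 5 WHEN "id" = 2 THEN 5
    #                       WHEN "id" = 3 THEN 7 END
    Item.objects.filter(pk__in=[1, 2, 3]).update(
        status=Case(
            When(pk=1, then=Value(5)),
            When(pk=2, then=Value(5)),
            When(pk=3, then=Value(7)),
        )
    )

    # Aggregated WHENs produce roughly:
    #   SET "status" = CASE WHEN "id" IN (1, 2) THEN 5 WHEN "id" = 3 THEN 7 END
    Item.objects.filter(pk__in=[1, 2, 3]).update(
        status=Case(
            When(pk__in=[1, 2], then=Value(5)),
            When(pk=3, then=Value(7)),
        )
    )
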
Thanks, Vláďa

Hello @tuttle, thanks for opening a discussion on the subject.

One thing you have to keep in mind here is that BaseExpression.__hash__ (which is necessary for your when_updates[attr]) is not cheap by any means. We actually just uncovered and resolved a significant performance regression in it, but it still remains relatively expensive. There could be ways to circumvent this problem by storing non-expression values (e.g. literals such as 42) directly and only turning them into Value at when_statements creation time, but that seems like a lot of work.
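
A minimal sketch of what that workaround could look like, reusing the fields / batch_objs names from the snippet above and assuming the literal values are hashable (this is not the actual bulk_update() implementation):

    from collections import defaultdict

    from django.db.models import Value, When

    for field in fields:
        when_statements = []
        literal_pks = defaultdict(list)  # raw literal value -> [pk, ...]
        for obj in batch_objs:
            attr = getattr(obj, field.attname)
            if hasattr(attr, "resolve_expression"):
                # Real expressions keep one WHEN per object; hashing them is
                # the expensive part we are trying to avoid.
                when_statements.append(When(pk=obj.pk, then=attr))
            else:
                # Group by the cheap-to-hash literal value; it is wrapped in
                # Value() only once per group, below.
                literal_pks[attr].append(obj.pk)
        for value, pks in literal_pks.items():
            when_statements.append(
                When(pk__in=pks, then=Value(value, output_field=field))
            )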

Given that the current complaint about bulk_update is how long it takes to generate the query, I’m worried that making it do even more work at query creation time might not be desirable (see #31202 (Bulk update suffers from poor performance with large numbers of models and columns) and #29771 (Support database-specific syntax for bulk_update)).

IMO users will prefer chains of filter(pk__in=pks).update(field=something) over bulk_update when dealing with updates that have a small cardinality of values, which is the only case where this optimization would make sense in the first place (if there is no overlap of updated values, the same chain of when_statements will be generated anyway, at the cost of hashing many expressions).
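
For illustration, that pattern could look roughly like this, where objs is the list of modified instances and the Item model and status field are made up:

    from collections import defaultdict

    # Group objects by their new value, then issue one UPDATE per distinct
    # value instead of a single CASE-heavy bulk_update().
    pks_by_value = defaultdict(list)
    for obj in objs:
        pks_by_value[obj.status].append(obj.pk)

    for value, pks in pks_by_value.items():
        Item.objects.filter(pk__in=pks).update(status=value)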

You mention that the query length is massive, but is that actually a problem for query planners? The ORM often generates verbose SQL, but what actually matters, particularly in the case of bulk_update, is how fast the query performs.

Do you have any benchmarks demonstrating that this would make updates significantly faster? For which backends, and to what extent? Does it only apply when dealing with a certain number of fields or distinct values? Would it still be relevant if we eventually switch to using database-specific syntax instead?

These are questions that should be answered to a certain extent before such a ticket is considered.
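
As a starting point, a rough measurement along these lines would already help answer the first question (Item and status are hypothetical names, and connection.queries is only populated when DEBUG is True):

    import time

    from django.db import connection, reset_queries

    def time_bulk_update(objs, batch_size=1000):
        # Time a single bulk_update() call and report the total length of
        # the SQL it generated.
        reset_queries()
        start = time.perf_counter()
        Item.objects.bulk_update(objs, ["status"], batch_size=batch_size)
        elapsed = time.perf_counter() - start
        sql_length = sum(len(q["sql"]) for q in connection.queries)
        return elapsed, sql_length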

Thank you, Simon, for a quick and clear answer. :slight_smile: I have no such benchmark, nor do I have answers to these questions; I just asked out of curiosity. And it’s true that what matters is the speed of query execution, not whether I find the query short and tidy.