Django 4.2 behavior change when using ArrayAgg on unnested ArrayField (PostgreSQL-specific)

Greetings, fellows.

I am upgrading my app to Django 4.2. So far, I stumbled onto a single issue with PostgreSQL-specific queries.

Here is a queryset snippet that has a different behaviour between Django 4.1.9 and Django 4.2+:

from django.db import models
from django.contrib.postgres.fields import ArrayField
from django.contrib.postgres.aggregates import ArrayAgg

class Offer(models.Model):
	keywords = ArrayField(
        models.TextField(),
        default=list,
        blank=True,
    )

    @classmethod
    def flatten_keywords(cls):
    	return cls.objects.all().annotate(keyword_lists=models.Func(models.F("keywords"), function="unnest")).values_list("keyword_lists", flat=True).aggregate(keyword_list=ArrayAgg("keyword_lists", distinct=True))

Function flatten_keywords performs a following working query in Django 4.1.9:

SELECT ARRAY_AGG(DISTINCT "keyword_lists" ) FROM (SELECT unnest("camps_offer"."keywords") AS "keyword_lists" FROM "camps_offer" subquery

However, if I upgrade to Django 4.2.0 (or 4.2.2), the statement changes:

SELECT ARRAY_AGG(DISTINCT unnest("camps_offer"."keywords") ) AS "keyword_list" FROM "camps_offer"

It is followed by the error and a hint:

ERROR:  aggregate function calls cannot contain set-returning function calls at character 27
HINT:  You might be able to move the set-returning function into a LATERAL FROM item.

I am struggling to cherry-pick the change that led to it, can’t see anything related to it in the changelog.

Did anyone happen to have this as well, or know what was changed?

Hello @amureki!

Django 4.2 shipped changes that avoids doing subquery pushdown on aggregation when unnecessary as these can be harmful performance wise in a lot of cases.

What it failed to account to is that when aggregate expressions refer to certain expressions the pushdown must still take place.

An example of that fixed in Django 4.2.2 is reference to subqueries but the problem you are facing is similar in essence; an aggregate referencing an expression that produces rows.

Could you confirm that using the following code instead addresses your issue?

class Unnest(Func):
    subquery = True
    function = "unnest"

cls.objects.annotate(
    keyword_lists=Unnest("keywords"),
).values_list(
    "keyword_lists", flat=True
).aggregate(
    keyword_list=ArrayAgg("keyword_lists", distinct=True)
)

If it does it makes me believe we might need an Expression flag to denote either that it returns rows or maybe simply a flag that denotes that it can only be summarized by reference. The latter would be the case for aggregates, subqueries, and functions like unnest.

2 Likes

Hey Simon!

Thank you for the reply and extensive explanation.
This is good to know, I definitely missed this change.

Your example indeed addresses the issue. I am honestly a bit far from the Django database internals architecture to justify whether the patch in such logic is needed and what would be the right place for it. But it feels like such pushdown certainly makes sense in some cases.

The solution by @charettes worked well until Django 4.2.7 release. :sweat_smile:

Now, this code fails again with the similar error (code logic did not change, but Django version):

aggregate function calls cannot contain set-returning function calls
LINE 1: SELECT ARRAY_AGG(DISTINCT unnest("camps_offer"."keywords") )...
                                  ^
HINT:  You might be able to move the set-returning function into a LATERAL FROM item.

I am not sure yet, where the regression comes from, I see only #34798 (Subquery wrapping is required in QuerySet.aggregate() for aggregates referencing nested subquery.) – Django that might affect this.

Is there any chance I could get your brain here again, @charettes ? :pray:
Otherwise, I see just two options - to go raw SQL or switch to Python-based evaluations, which would be much slower.

Best,
Rust

Hello @amureki!

The solution for Django 4.2.7+ should be as simple as switching from subquery = True to contains_subquery = True when defining Unnest

class Unnest(Func):
    contains_subquery = True
    function = "unnest"
1 Like

Oh, I should pay more attention to what is happening. This totally makes sense as I see it now.
Many thanks again for the suggestion, and please forgive me for the noise.

:sparkles:

np, I wanted to get back to this thread after the patch to let you know but it slipped off my radar.

1 Like

I just upgraded django from 4.1.7 to 5.0.3 and had a similar situation: using unnest function with an aggregate, which resulted in the same error. The solution of @charettes works.

Discussions about introducing an Expression.set_returning: bool = False flag to allow the ORM to identify user-defined Func subclasses that may return more than one row are taking place in this accepted ticket if y’all want to chime in.

It would avoid (ab)using the .contains_subquery flag and more importantly account for cases where a set-returning annotation is not referenced by the eventuated aggregation and document how it should be used.

1 Like