Weird Coalesce Group by

In the documentation ​https://docs.djangoproject.com/en/4.2/topics/db/aggregation/#order-of-annotate-and-values-clauses, the .values() call dictates the group by if you have an .annotation() with aggregation function (e.g. sum).

Somehow if you proceed to add more field annotations without any aggregation function, this isn’t included in the group by in the generated SQL query. For some reason, if you have a field annotation with coalesce, it gets added in the group by.

Snippet:

from decimal import Decimal

from django.db import models
from django.db.models import OuterRef, Subquery, Sum
from django.db.models.functions import Coalesce

class A(models.Model):
    name = models.CharField(max_length=100)
    amount = models.DecimalField(max_digits=12, decimal_places=2)

class B(models.Model):
    name = models.CharField(max_length=100)
    foo = models.CharField(max_length=200)

class BSnapshot(models.Model):
    version_name = models.TextField()
    name = models.CharField(max_length=100)
    foo = models.CharField(max_length=200)

def run():
    B.objects.bulk_create([
        B(name='Alice', foo='live_alice_foo'),
        B(name='Bob', foo='live_bob_foo'),
        B(name='Eve', foo='live_eve_foo'),
    ])

    BSnapshot.objects.bulk_create([
        BSnapshot(version_name='v1', name='Alice', foo='snap_v1_alice'),
        BSnapshot(version_name='v2', name='Alice', foo='snap_v2_alice'),
        BSnapshot(version_name='v1', name='Charlie', foo='snap_v1_charlie'),
    ])

    A.objects.bulk_create([
        A(name='Alice', amount=Decimal('10.50')),
        A(name='Alice', amount=Decimal('5.25')),
        A(name='Bob', amount=Decimal('7.00')),
        A(name='Charlie', amount=Decimal('3.00')),
        A(name='Dennis', amount=Decimal('4.00')),  # no B or snapshot -> will be filtered out
    ])

    version_name = 'v1'
    live_foo = B.objects.filter(name=OuterRef('name'))
    snapshot_foo = BSnapshot.objects.filter(name=OuterRef('name'), version_name=version_name)

    foo = Coalesce(
        Subquery(snapshot_foo.values('foo')[:1]),
        Subquery(live_foo.values('foo')[:1])
    )

    version_1 = (
        A.objects
        .values('name')
        .order_by('name')
        .annotate(amount_sum=Sum('amount'))
        .annotate(foo=foo)
        .filter(foo__isnull=False)
    )
    
    version_2 = (
        A.objects
        .values('name')
        .order_by('name')
        .annotate(amount_sum=Sum('amount'))
        .annotate(foo=Subquery(snapshot_foo.values('foo')[:1]))
        .filter(foo__isnull=False)
    )
 
    print(version_1.query.group_by)
    print(version_2.query.group_by)
    list(version_1)
    list(version_2)

The group by for both querysets is only Col(app_a, app.A.name),

While the generated SQL is

SELECT "app_a"."name" AS "name",
       (CAST(SUM("app_a"."amount") AS NUMERIC)) AS "amount_sum",
       COALESCE(
                  (SELECT U0."foo" AS "foo"
                   FROM "app_bsnapshot" U0
                   WHERE (U0."name" = ("app_a"."name")
                          AND U0."version_name" = 'v1')
                   LIMIT 1),
                  (SELECT U0."foo" AS "foo"
                   FROM "app_b" U0
                   WHERE U0."name" = ("app_a"."name")
                   LIMIT 1)) AS "foo"
FROM "app_a"
WHERE COALESCE(
                 (SELECT U0."foo" AS "foo"
                  FROM "app_bsnapshot" U0
                  WHERE (U0."name" = ("app_a"."name")
                         AND U0."version_name" = 'v1')
                  LIMIT 1),
                 (SELECT U0."foo" AS "foo"
                  FROM "app_b" U0
                  WHERE U0."name" = ("app_a"."name")
                  LIMIT 1)) IS NOT NULL
GROUP BY 1,
         3
ORDER BY 1 ASC
SELECT "app_a"."name" AS "name",
       (CAST(SUM("app_a"."amount") AS NUMERIC)) AS "amount_sum",

  (SELECT U0."foo" AS "foo"
   FROM "app_bsnapshot" U0
   WHERE (U0."name" = ("app_a"."name")
          AND U0."version_name" = 'v1')
   LIMIT 1) AS "foo"
FROM "app_a"
WHERE
    (SELECT U0."foo" AS "foo"
     FROM "app_bsnapshot" U0
     WHERE (U0."name" = ("app_a"."name")
            AND U0."version_name" = 'v1')
     LIMIT 1) IS NOT NULL
GROUP BY 1
ORDER BY 1 ASC

Tested the code on both 4.2 and 5.2. I was redirected here for clarification if this is a bug after filing a bug ticket.

Update on this one, deep dived on the source code and it’s adding the columns here from the compiler code django/django/db/models/sql/compiler.py at 1167cd1d639c3fee69dbdef351d31e8a17d1fedf · django/django · GitHub

        # Note that even if the group_by is set, it is only the minimal
        # set to group by. So, we need to add cols in select, order_by, and
        # having into the select in any case.
        selected_expr_positions = {}
        for ordinal, (expr, _, alias) in enumerate(select, start=1):
            if alias:
                selected_expr_positions[expr] = ordinal
            # Skip members of the select clause that are already explicitly
            # grouped against.
            if alias in group_by_refs:
                continue
            expressions.extend(expr.get_group_by_cols())

As long as the expression has more than one source expression, an annotation gets added into the group by django/django/db/models/expressions.py at 1167cd1d639c3fee69dbdef351d31e8a17d1fedf · django/django · GitHub

     def get_group_by_cols(self):
        if not self.contains_aggregate:
            return [self]
        cols = []
        for source in self.get_source_expressions():
            cols.extend(source.get_group_by_cols())
        return cols

Looks like a weird behavior to me but definitely can have a performance impact if suddenly your non aggregate annotations become included in the group by.

To me the only surprising part here is that the ORM doesn’t warn you or explicitly errors out when you try to augment the GROUP BY by augmenting the SELECT clause with your annotate call. This has been brought up a few times in the past (#23383).

Looks like a weird behavior to me but definitely can have a performance impact if suddenly your non aggregate annotations become included in the group by.

This is just part of the SQL standard, all SELECT’ed non-aggregate function must be included in the GROUP BY. Some database backends like Postgres allow for set of grouped by expressions to be reduced to the primary keys of tables referenced but if an explicit grouping is specified this optimization is obviously not applied.

If your goal here is to alias an expression to be referenced in your WHERE clause without SELECT’ing it then you must use alias and not annotate as the moment you SELECT expressions that are not aggregate functions they must be included in the GROUP BY as they might have an incidence on the results.

Also, annotating a correlated subquery doesn’t follow the exact same rule (see version_1) due to the fact that such expressions can always be grouped by their external references (name in this case which is already part of the GROUP BY). It’s not a rule that can be generalized to all expressions though and only happen to work in your case because you are already grouping by name which is the only external reference of correlated subqueries.

It might be possible to further optimize SQLCompiler.get_group_by to introspect complex expression annotations and exclude them from the GROUP BY when their source expressions grouping expressions are already covered by the explicit grouping but it would likely require some backend specific logic as MySQL is pretty bad at resolving functional dependency.

I gave the above a shot with this patch

diff --git a/django/db/models/sql/compiler.py b/django/db/models/sql/compiler.py
index 0e483dc4f6..d4567383f3 100644
--- a/django/db/models/sql/compiler.py
+++ b/django/db/models/sql/compiler.py
@@ -7,7 +7,7 @@
 from django.core.exceptions import EmptyResultSet, FieldError, FullResultSet
 from django.db import DatabaseError, NotSupportedError
 from django.db.models.constants import LOOKUP_SEP
-from django.db.models.expressions import ColPairs, F, OrderBy, RawSQL, Ref, Value
+from django.db.models.expressions import Col, ColPairs, F, OrderBy, RawSQL, Ref, Value
 from django.db.models.fields import AutoField, composite
 from django.db.models.functions import Cast, Random
 from django.db.models.lookups import Lookup
@@ -197,6 +197,24 @@ def get_group_by(self, select, order_by):
         return result

     def collapse_group_by(self, expressions, having):
+        # XXX: This will only work on backends that have decent functional
+        # dependency inference support as it's not mandated by the SQL
+        # standard(read not MySQL).
+        cols = {expr for expr in expressions if isinstance(expr, Col)}
+        if cols:
+            # Expressions that are only referencing columns already part
+            # of the group by don't have to be explicitly group'ed by as
+            # it might be quite expensive to do so.
+            expressions = [
+                expr
+                for expr in expressions
+                if (
+                    isinstance(expr, Col)
+                    or not set(
+                        self.query._gen_cols([expr], include_external=True)
+                    ).issubset(cols)
+                )
+            ]
         # If the database supports group by functional dependence reduction,
         # then the expressions can be reduced to the set of selected table
         # primary keys as all other columns are functionally dependent on them.
diff --git a/tests/aggregation/tests.py b/tests/aggregation/tests.py
index bd33a532b3..4d457f63ce 100644
--- a/tests/aggregation/tests.py
+++ b/tests/aggregation/tests.py
@@ -2527,6 +2527,18 @@ def test_order_by_in_subquery(self):

         self.assertEqual(expected_values, values)

+    def test_functional_dependency_optimization(self):
+        # Expressions that are functionnally dependent on columns
+        # that are explicitly grouped by should be elided from the
+        # GROUP BY on backends that support it.
+        with self.assertNumQueries(1) as ctx:
+            list(
+                Author.objects.values("name", name_lower=Lower("name")).annotate(
+                    book_books=Count("book")
+                )
+            )
+        self.assertEqual(ctx[0]["sql"].split("GROUP BY ")[-1], "1")
+

 class AggregateAnnotationPruningTests(TestCase):
     @classmethod

which passes the aggregation and queries test suites but it obviously require streamlining and further investigation to determine the impact it would have on backends such as Oracle and MySQL.

It would also be great if you could share a query plan that demonstrate that Postgres is not smart enough to skip doing the GROUP BY over the COALESCE in the place since it’s able to infer the functional dependency of the subqueries on name (otherwise it would crash asking for an explicit exclusion of the COALESCE in the group).

Last note on the subject.

The above functional dependent eliding optimization for the GROUP BY can only be applied for deterministic expressions (for example it wouldn’t work for RANDOM(seed)) which the ORM has no way to introspect for the time being.