Excess group by columns, causing problems with window functions

JoelFeiner · January 11, 2024, 11:24pm

Using Django 4.2.7 and sqlite3 as the backend.

There are two problems here, so bear with me. One is that Django will generate group by columns for every field in the model, regardless of what is actually selected, if there as at least one aggregate in the query. This creates problems because it groups by the id column and that effectively undoes all grouping. Example (in Django shell plus):

str(Person.objects.annotate(latest_dob=Max('date_of_birth')).values('first_name', 'latest_dob').query)

Produces:

'SELECT "org_person"."first_name", MAX("org_person"."date_of_birth") AS "latest_dob" FROM "org_person" GROUP BY "org_person"."id", "org_person"."created_on", "org_person"."first_name", "org_person"."middle_name", "org_person"."last_name", "org_person"."prefix", "org_person"."suffix", "org_person"."date_of_birth"'

Because of this, instead of having a row per unique first name, there is a row for every entry. This is not desired at all.

I found a trick on StackOverflow that involves changing the query to look like this:

str(Person.objects.annotate(dummy_group_by_value=models.Value(1)).values('dummy_group_by_value').annotate(latest_dob=Max('date_of_birth')).values('first_name', 'latest_dob').query)

The resulting SQL is now correct (and produces the correct number of rows when run):

'SELECT "org_person"."first_name", MAX("org_person"."date_of_birth") AS "latest_dob" FROM "org_person" GROUP BY "org_person"."first_name"'

This works fine until I introduce window functions. It attempts to group by the window function, which is invalid:

str(Person.objects.annotate(dummy_group_by_value=Value(1)).values('dummy_group_by_value').annotate(rn=Window(RowNumber(), partition_by=F('last_name'), order_by='first_name'), latest_dob=Max('date_of_birth')).values('first_name', 'latest_dob', 'rn').query)

Which produces this SQL:

'SELECT "org_person"."first_name", ROW_NUMBER() OVER (PARTITION BY "org_person"."last_name" ORDER BY "org_person"."first_name") AS "rn", MAX("org_person"."date_of_birth") AS "latest_dob" FROM "org_person" GROUP BY 2, "org_person"."first_name"'

Note the GROUP BY 2, where 2 refers to the window function column. It is not possible to do this in SQLite or any database that I’m aware of. And running it generates an error “misuse of window function”. It is perfectly okay not to group by the column at all and indeed, that’s what I want to happen.

I can “fix” this by removing the trick. The window function field is no longer grouped by, but all of the other group by columns show back up (including id) and the results are no longer correct.

If I can solve the extraneous group by columns problem without using the trick, I feel like this window function problem goes away as well. I’m honestly not sure why Django has to add the extra columns. I explicitly specify which fields I want and which ones are aggregated. It should only included non-aggregated, non-window function fields explicitly listed in .values() for the group by. If there is no .values(), then I suppose it can add all the group by fields, though in that case, I’m not sure what the results are supposed to mean anyway. It feels broken no matter what.

charettes · January 12, 2024, 12:01am

Hello @JoelFeiner,

Your first problem is definitely a misuse of the ORM. Please refer to the documentation about values and annotate when dealing with aggregate functions.

The values method must be called before calling annotate with an aggregate function to produce an explicit grouping

Person.objects.values("first_name").annotate(
   latest_dob=Max("date_of_birth")
)

As for the window function it’s not clear to me what you are trying to accomplish. The fact that the group by is performed against the selected column seems like a bug which I’m unsure how to solve but I’m surprised to see that SQLite doesn’t require that last_name be part of GROUP BY as other backends like Postgres and MySQL do.

That seems like a deviation of the standard and I’m not surprised the ORM doesn’t support it.

The reason why the GROUP BY bug against window function is so hard to solve is because the explicit grouping interface through values(*group_by).annotate() usage has always meant explicit group by values members but in the case of window functions this cannot be done for obvious reasons but I’m not convinced adding an implicit group by last_name would have either.

I think that at the very least the ORM should error out loudly when trying to explicitly group by window functions to denote that it doesn’t support it. The only way I can see to add support for it would be to have the explicit grouping logic performed by values(*group_by).annotate() ignore any group_by members that are window function but raise a warning or error out if the expressions referenced by them are not explicitly part of group_by (e.g. in your case last_name).

JoelFeiner · January 12, 2024, 12:40am

According to the documentation, values before annotate with aggregates sets up values to specify the columns to group by. I will try this when I am next able to test. I must admit this reliance on the ordering of calls to a function that normally does something else (specifying what fields to return) is both surprising and confusing. Why is there not an explicit group_by method?

As for window functions, it looks like they need to be nested in an aggregate in this situation. Obviously this won’t work with row number but works with min/max/etc. As such, this problem can be considered resolved.

JoelFeiner · January 12, 2024, 3:24pm

Okay, it looks like the .values().annotate() mechanism causes some issues if everything is an annotation, which is possible.

For example, say you have:

Person.objects.annotate(max_dob=Max('date_of_birth'), max_last_name=Max('last_name')).values('max_dob', 'max_last_name')

Since there’s nothing to put in a .values() before the call to .annotate(), it goes ahead and generates a group by of every column. If I put a call to .values() with no arguments before .annotate(), it makes no difference.

Now you might say that I should use .aggregate() for this. But .aggregate() doesn’t return a query set. It immediately executes the query and returns a dictionary. Thus, it is not composable (for example, it cannot be used in a subquery). That is a requirement in my project, so .aggregate() is not a solution here.

Here’s another example, this time where everything is an annotation, but not all annotations are aggregates:

Person.objects.annotate(max_dob=Max('date_of_birth'), full_name=Concat(F('first_name'), Value(' '), F('last_name')).values('max_date_of_birth', 'full_name')

It will still generate a group by for every column, which is again not desired. I want to group by the full name.

In this case, I can “fix” the issue by putting expression’s constituent fields first_name and last_name in a call to .values() before the .annotate(): Person.objects.values('first_name', 'last_name').annotate(...). In this version, it groups by first_name and last_name, which is semantically okay in this particular case. However, it really ought to group by the concatenation expression as a whole. For expressions that produce values that might end up being the same for different inputs (e.g., doing math of some sort), it would be wrong to group by the input columns and right to group by the expression. I can get it to do the right thing using the “trick” from my first post, but I cannot get it to do that using the sanctioned approach.

But…perhaps I’m still missing something with the correct approach. How can I get the desired behavior when there is nothing to put in a .values() before a .annotate()? It seems like no matter what, I don’t actually have control over what is grouped and what isn’t, and because of that, I have to fight with the ORM to do the right thing. Contrary to my tone in the previous message, I don’t actually want to fight with the ORM. I want it to represent the query correctly.

JoelFeiner · January 12, 2024, 5:46pm

Aggregating window functions is currently broken and there is a ticket: #32277 (Support nested aggregates in window expressions.) – Django. The suggested workaround is to use a subquery, which is probably acceptable.

Topic		Replies	Views
Unable to use group by correctly ORM	4	3229	September 22, 2019
Group By Multiple Values Using the ORM	6	1872	March 20, 2024
Django ORM Group by issue Using Django	3	971	August 3, 2021
How to get all columns in a table by apply group on couple of fields in django ORM Using the ORM	2	1498	January 13, 2023
How Query class work? Django Internals	3	585	May 10, 2022

Excess group by columns, causing problems with window functions

Related topics