Using annotation to annotate

Hello everybody,

So I’ve been strungling a couple days trying to make this out. I’ve read several times now the Aggregation and Complex Queryset help pages in Django docs and gone through several stackoverflow topics, but none of them pointed to my exact issue.

I’m trying to annotate a query using a previous annotation on the same query. Is this doable?

Let me place an example. I have these models:

class Author(models.Model):
     id = models.AutoField(primary_key=True)
     name = models.CharField(max_length=50)

class Book(models.Model):
     id = models.AutoField(primary_key=True)
     author = models.ForeignKey(Author, on_delete=models.CASCADE)

class Page(models.Model):
    book = models.ForeignKey(Book, on_delete=models.CASCADE)
    words = models.FloatField()

I want to get all the books that belong to each author that have pages with less than 100 words, so I prepared the following query:

 Author.objects.values('book').annotate(book_min_pages=Min('book__page__words')).annotate(short_books=Count('book', filter=Q(book_min_pages__lt=100)))

This sentence looks perfectly fine when executed from Django, but MariaDB will complain saying:

django.db.utils.ProgrammingError: (1111, 'Invalid use of group function')

And SQLite will complain as well:

django.db.utils.OperationalError: misuse of aggregate function MIN()

I think this is because the use of Count() and Min(), if I look into the generated SQL, there is something like this:

COUNT(CASE WHEN MIN(`page`.`words`) < 100.0 THEN `book`.`id` ELSE NULL END) AS `short_books`

On the other side, I have been able to achieve what I want using Subquery, but not only it’s completely ugly, it’s also blatanly slow, taking more than 6 seconds for a few hundred thousand records, so I’m sure there must be another way.

Does anyone have any better idea?

Django probably shouldn’t allow construction of a QuerySet using an aggregate function like that in the wrong place.

I think you can achieve what you want by thinking about querying the books first, and then from that the authors (untested):

short_books = Book.objects.annotate(min_page_length=Min("page__words")).filter(min_page_length__lt=100).values_list('author_id', flat=True)
Author.objects.filter(id__in=short_books.values_list('author_id', flat=True))

The __in lookup with another QuerySet becomes a subquery.

Thank you for your answer. Nice alternative, but, isn’t this the same as?:

short_books_subquery = Book.objects.filter(author__id=OuterRef('id')).annotate(min_page_lenght=Min('page__words')).filter('min_page_lenght__lt=100)
Author.objects.annotate(short_books=SubqueryCount(books_subquery))

For this, I would be using a special SubQuery subclass:

class SubQueryCount(Subquery):
   output_field = models.IntegerField()

   def __init__(self, *args, **kwargs):
       Subquery.__init__(self, *args, **kwargs)
       self.queryset = self.queryset.annotate(cnt=Count("*")).values("cnt")
       self.queryset.query.set_group_by()  # values() adds a GROUP BY we don't want here

Besides, your example doesn’t really give all the authors, but only the ones that have books with short pages :slight_smile:

This shouldn’t be necessary.

That’s what I thought you wanted :sweat_smile:

I think what you want can be done with:

short_books = Book.objects.annotate(min_page_length=Min("page__words")).filter(min_page_length__lt=100).values_list('author_id', flat=True)
Author.objects.annotate(
    short_books=Q(id__in=short_books.values_list('author_id', flat=True))
)

Possibly - I can’t remember - the inner Q might need wrapping in an ExpressionWrapper(..., output_field=BooleanField())

Hello Adam,

Thanks again for your help (and happy forum anniversary - btw).

I’ve tried what you suggested, but it returns the short books from ALL the authors, not only for the current one (that would be done using an OuterRef as I said earlier with the SubqueryCount).

Still couldn’t figure it out.

I’ve been following this from the sideline, trying to see if I can help in any way. But I’m stuck trying to figure out what it is you’re looking for. So far, I’ve seen what reads to me to be three different descriptions of what you want this query to return.

Where I am at now is this:

  • Given an author, return all books for that author containing at least one page with less than 100 words.

Is this a correct summary of your desired query? (Because, if so, Adam’s last reply would work if you applied a filter on the Author queryset.)

If not, can you provide a more precise statement of what you’d like to see?

Hello Ken,
Thank you for replying as well. So let clarify what I really need:

  • The list of all the authors annotated with the count of books written by them that have at least one page with less than 100 hundred words.

Let me put another example: we have players, games and sessions (being session time spent playing without interruption), so it would be: list of all players annotated with the games they played that had a session of less 100 minutes.

Is this a correct summary of your desired query? (Because, if so, Adam’s last reply would work if you applied a filter on the Author queryset.)

How do I do that without the Subquery as I said?

Cool - yes, that’s clear. So I’m going to assume that if an Author has no book, they will still show up on this list, with an annotation value of 0.

Let me think about that for a little bit… It’s possible that it might not be able to be done without a subquery. If I were doing this in SQL, I’d probably try to “invert” the query to where I’m selecting a count of books by author and annotating with the count of books having at least one page < 100 words, group by author.

(Note, my use of the term “invert” here is a very informal term when I’m thinking about twisting a query around to look at it from another angle. It’s not anything formal - just the way I think about it.)

This is an interesting puzzle. I’m going to toss this into my test environment and see what I can come up with.

So I’ve taken an hour or so to look at this, and I’m not finding any way to do this without a subquery.

If you have an SQL statement that does this without a subquery, I’d like to see it - and possibly work it backward into the ORM.

The issue (in my mind at least) is that you’re grouping within groups. You’re wanting to annotate by Author, with an aggregation by Book. This says to me that it’s basically n+1 queries. For each Book, you need to determine whether or not there’s at least one page < 100 words. Then you need to count the number of those results by author. Everything that I can think of, and all the web-searching I’ve done, only show this as being done with a subquery.

You might be able to speed this up some by adding some indexes - an index on words would probably help a lot to avoid a table scan of Page - but that’s as much a guess as anything else.

One more thought. Depending upon the circumstances around the data being collected, I’d give serious consideration to adding a column to the book table with the count of “short pages”. It might not be that much overhead for what you gain. (You mentioned this in two contexts, one of books, where the pages aren’t likely going to change after being entered; the other being games and sessions, where the data might be much less static. In this situation, my recommendations would vary based upon the underlying conditions.)

Hey Ken,

Thank you very much for taking so much time having a look into it.

You’re wanting to annotate by Author, with an aggregation by Book. This says to me that it’s basically n+1 queries.

Yes, this is exactly the issue. Even with indexes, this is taking long enough to be unusable.

If you have an SQL statement that does this without a subquery, I’d like to see it - and possibly work it backward into the ORM.

I have a Django ORM that you should do this, back actually when it’s translated SQL complains giving the error I pointed in the first message of this thread.

One more thought. Depending upon the circumstances around the data being collected, I’d give serious consideration to adding a column to the book table with the count of “short pages”. It might not be that much overhead for what you gain. (You mentioned this in two contexts, one of books, where the pages aren’t likely going to change after being entered; the other being games and sessions, where the data might be much less static. In this situation, my recommendations would vary based upon the underlying conditions.)

The data is much more similar to the “games” example than to the “books” example, as you said, it’s more dinamic. So, I would have no problem in adding an extra column with the derived data called “minimum words in one page” for books or “shorter session” for games, but I guess I would have to do it in SQL ussing triggers, right? No Django-ORM way to do this, I guess.

Two ways come to mind, again depending upon your specific use-case.

  1. Override the save method on the Page object to update the corresponding Book object
  2. Register a listener on the post_save signal for the Page object.

However, both of these have “gaps” in that there are conditions where your code won’t be triggered. (Bulk deletes being just one of them. Again, I’d look carefully at the overall environment to determine whether or not they’re appropriate to use.) But an actual database trigger is probably going to be more reliable and likely to be more performant as well. Whether or not that step is necessary is also context dependent.

1 Like

Great, very interesting. So thank you the support!

I think the best option might be the DB trigger.