Simple annotation of related count using OuterRef

This is a suggestion to make it easier to annotate using count and suqueries.

Considering this example with three models, where I want to annotate a queryset with the count of related objects in the other two models.

class Discussion(models.Model):
  title = models.CharField(max_length=200)
  ...

class Comment(models.Model):
  discussion = models.ForeignKey(Discussion, models.PROTECT)
  ...

class Reaction(models.Model):
  discussion = models.ForeignKey(Discussion, models.PROTECT)
  ...

Now, if I annotate this queryset using models.Count, I will get table joins at the database level, resulting in a potentially huge response from the server. This seems to happen even if I add .distinct().

Discussion.objects.annotate(comment_count=models.Count('comment'), reaction_count=models.Count('reaction'))

Let’s say I evaluate this on 100 discussions, with each discussion having on average 1000 comments and as many reactions. That makes my database (I’m using psql) return 100x1000x1000 rows. Using this with .distinct avoids loading all rows as objects in Django, but still generates those rows in db, possibly causing an error.

To solve this, I’ve done this little dance:

Discussion.objects.annotate(
  comment_count=Comment.objects
    .filter(discussion=models.OuterRef("id"))
    .annotate(count=models.Func(models.F("id"), function="Count"))
    .values("count")
)

This works like a charm, but it’s quite a mouthful! Now, if I try this, I get an error:

Discussion.objects.annotate(
  comment_count=Comment.objects
    .filter(discussion=models.OuterRef("id"))
    .count()
)
>>> ValueError: This queryset contains a reference to an outer query and may only be used in a subquery.

What if we made that work, so that QuerySet.count() returned a queryset that can be used for annotations, much like I did above, when called on a QuerySet using OuterRef?

  1. This would provide a simpler method to avoid row bombs from using models.Count.
  2. It does not interfere with existing codebases, because doing that currently causes an exception.
  3. It could probably be implemented in a way that is userful for more types of annotations(?).

Am I missing something? Is there already a better way of achieving this?

I should add:
All this is assuming that modifying models.Count() is a no-go for compat reasons.

What if we made that work, so that QuerySet.count() returned a queryset that can be used for annotations, much like I did above, when called on a QuerySet using OuterRef?

That has been attempted and results in a lot of complications with regards to type handling (int is a literal, querysets are lazy by definition) and would only address the subset of the problem that relates to Count instead of all aggregate functions.

Work towards allowing any aggregate function to be used as a subquery is being tracked in this ticket Add support for aggregation through subqueries.

Until this gets fixed there seems to be a few third party packages that cover this use case such as this one

World class support! Thanks for the quick response, @charettes.

django-sql-utils basically solves the problem in a pretty neat way. Better than what I was suggesting.

While looking at what the documentation says about the annoying models.Count() problems, i found this ticket about multiple table annotation. This seems to be a pretty common pitfall while using Django. Perhaps it’s time to consider changing this in Django core. People will keep making the same mistake. Having a well documented and easy solution to this would help a lot of people, I think.

Three ideas:

  1. Breaking change: Modify models.Count to make a subquery instead.
  2. Add a subquery boolean parameter to models.Count. (i.e. models.Count(‘field_name’, subquery=True))
  3. Add SubqueryCount, similar to django-sql-utils.

Even though the docs warns about this problem, it doesn’t really offer a solution.

I appreciate the candid feedback but there’s a reason why these tickets have been opened for so long (I’ll throw #2361 into the mix).

Similar ideas have been brought up over and and over again in the past two decades but as you might expect the devils in the details.

We can’t break backward compatibility without a clear deprecation route and unless multiple multi-valued relationships are involved not using a subquery is the best solution as most database backends will perform worst when aggregation through a subquery is used.

For example, given a query of the form

Author.objects.filter(
    books__pages__gte=100,
).annotate(
    total_pages=Sum("books__pages"),
)

it is much more efficient to do

SELECT
   author.*,
   SUM(book.pages) AS total_pages
FROM
   author
INNER JOIN book ON (book.author_id = author.id)
WHERE book.page >= 100
GROUP BY author.id

than

SELECT
   author.*,
   (
        SELECT SUM(page)
        FROM book
        WHERE author_id = author.id
    ) AS total_pages
FROM
   author
WHERE EXISTS(
     SELECT 1
     FROM book
     WHERE author_id = author.id AND page >= 100
)

Query planners are just better at introspecting JOIN conditions than subqueries so we can’t systematically use them; it should only be done when not using them would result in the wrong results being returned and we should most likely use normal aggregation for all non-multi valued aggregation and the first multi-valued relationship. To take your provided set of models that would mean doing something like

SELECT
   discussion.*,
   COUNT(comment.id) AS comment_count,
   (
       SELECT COUNT(*)
       FROM reaction
       WHERE discussion_id = discussion.id
   ) AS reaction_count
FROM discussion
LEFT OUTER JOIN comment ON (comment.discussion_id = discussion_id)
GROUP BY discussion.id

Even though the docs warns about this problem, it doesn’t really offer a solution.

The two main challenges remain in order to achieve that are

  1. Add a way to the ORM to support aggregation through subqueries. That’s tracked by #28296 and hasn’t received an update in more than two years. Django is a volunteer led project and features only land if there are users who build them, review them, and stick around so they are adequately supported through the years.
  2. Emit runtime warnings when aggregation on more than one multi-valued relationship is performed and point at using as_subquery instead.
  3. Potentially find a way to teach the ORM to do 2. automatically but that’s a 10/10 on the difficulty scale as it requires not only JOIN eliding (which the ORM doesn’t have today) but JOIN transplant as relationship reference would need to be moved to the subqueries.

Given the state of the contributions to the ORM I think 1. and 2. are realistic but I’d be surprised if we ever get to 3.