Avoiding RawSQL by controlling related object annotation

Hi everyone, I hope to get some help to translate a query into Django ORM. I tried but failed to come up with a ORM-only, fast query.

Here’s a simplified version of the problem I am facing.

I have 4 entities.

class Collection(models.Model):
    items = models.ManyToManyField(Item, related_name='collections')

class Item(models.Model):
    pass 

class Atom(models.Model):
    items = models.ManyToManyField(to='core.Item', related_name='atoms')

class Status(model.Model):
    atom = models.ForeignKey(to='core.Atom', related_name='statuses')
    status = models.CharField(max_length=50, choices=(('done', 'done'), ('pending', 'pending'),))
    user = models.CharField(max_length=50)

I would like to annotate each collection with the number of item that is done. An item is done if all its atoms are done, i.e. the statuses related to the atoms are ‘done’.

I can for example get this done in raw SQL using the following query that takes just a few milliseconds.

with items as (
	select 
		ci.id,
		count(ca.id) as atoms_total_count, 
		count(ca.id) filter(where ca.status is not null) as atoms_done_count
	from core_item ci
	join core_atom_items cai on cai.item_id = ci.id 
	join core_atom ca on ca.id = cai.atom_id 
    left join core_status cs on cs.atom_id = ca.id
    where cs.user = '123'  -- this is helping performance I believe.
	group by ci.id
)
select 
	cc.*, 
	count(i.id) as items_total_count,
	count(i.id) filter(where atoms_total_count = atoms_seen_count) as items_done_count,
from core_collection cc 
join core_collection_items cci on cci.collection_id = cc.id 
join items i on i.id = cci.item_id 
group by cc.id

Here’s what I have tried so far on the ORM side:

items = Item.objects.filter(collections=OuterRef("pk")).order_by()
items = items.values(
    "collections",
    atoms_total_count=Count('atoms', filter=Q(atoms__statuses_user='123')),
    atoms_done_count=Count(
        'atoms', 
        # Here the filter condition will have to scan the entire cartesian product of all tables?
        filter=Q(
            atoms__statuses_user='123', 
            atoms__statuses_status='done'
        )
    )
)
items_done = items_qs.filter(atoms_total_count=F('atoms_done_count')).values("pk")

queryset = Collection.objects.annotate(
    items_total_count=Count('items', distinct=True),
    items_done=Count(
        'items',
        filter=Q(items__in=items_done)
        distinct=True
    )
)

Unfortunately,

  1. I am not sure this is the right way to proceed with the ORM.
  2. this is extremely slow. I believe this is due to the fact that the where clause in the with clause is not generated by Django’s ORM, hence the DB has to scan all collections x items x atoms x statuses.

I am afraid this is a limitation of Django’s ORM. However, not being very familiar with it I hope there might be a way around it.

Welcome @zermelozf !

I don’t have a specific answer for you, but I do have a couple ideas you might wish to investigate.

  • Take a look at django-cte. While I don’t know that it is suitable for your specific requirements here, I have seen it used as a solution in what I believe to be very similar situations.

  • A Subquery, with possibly a 'Case` clause, might also be useful here.

  • Rather than making conjectures about where the performance issues are with your queries, it might be worth looking at what the actual query is being executed and using explain (or the explain() function) to see what’s happening.

1 Like

Thank you @KenWhitesell !

I tried django-cte as you suggested and was able to reduce the query time to about 200ms down from more than 10 seconds.

Here’s what I used to get there:

items = With(
    models.Item.objects.annotate(
        atoms_total_count=Count('atoms'),
        atoms_seen_count=Count('atoms',filter=Q(atoms__statuses='done')),
    ), name='items_cte'
)
collections_annotated = With(
    items.queryset().values("collections").annotate(
        cid=F('collections'),
        items_total_count=Count('id'),
        items_done_count=Count(
            'id', filter=Q(atoms_total_count=F('atoms_seen_count')))
    ), name='collections_cte'
)

collections = collections_annotated.join(qs, id=collections_annotated.col.cid) \
    .with_cte(items) \
    .with_cte(collections_annotated) \
    .annotate(
        items_total_count=collections_annotated.col.items_total_count,
        items_done_count=collections_annotated.col.items_done_count
    )

Which produced the following SQL:

WITH RECURSIVE "items_cte" AS (
    SELECT "core_item"."id",
          COUNT("core_atom_items"."atom_id") AS "atoms_total_count",
          COUNT("core_atom_items"."atom_id") FILTER (WHERE "core_status"."status" = 'done' AND "core_status"."learner_id" = '123') AS "atoms_done_count",
    FROM "core_item"
    LEFT OUTER JOIN "core_atom_items" ON ("core_item"."id" = "core_atom_items"."item_id")
    LEFT OUTER JOIN "core_atom" ON ("core_atom_items"."atom_id" = "core_atom"."id")
    LEFT OUTER JOIN "core_status" ON ("core_atom"."id" = "core_status"."atom_id")
    GROUP BY "core_item"."id"
), "collections_cte" AS (
    SELECT "core_collection_items"."collection_id",
          "core_collection_items"."collection_id" AS "cid",
          COUNT("items_cte"."id") AS "items_total_count",
          COUNT("items_cte"."id") FILTER (WHERE "items_cte"."atoms_total_count" = "items_cte"."atoms_seen_count") AS "items_completed_count",
   FROM "items_cte"
   LEFT OUTER JOIN "core_collection_items" ON ("items_cte"."id" = "core_collection_items"."item_id")
   GROUP BY "core_collection_items"."collection_id")
SELECT DISTINCT "core_collection"."id",
                "core_collection"."owner_id",
                "core_collection"."name",
                "core_collection"."description",
                "core_collection"."image",
                "core_collection"."access_type",
                "collections_cte"."items_total_count" AS "items_total_count",
                "collections_cte"."items_done_count" AS "items_done_count",
FROM "core_collection"
INNER JOIN "collections_cte" ON "core_collection"."id" = "collections_cte"."cid"

That is good enough for now IMO although it might require a bit more work for optimization.