Join on subquery

Hi!

I have these two models:

class Creditor(models.Model):
    name = models.CharField(max_length=300, unique=True)


class Order(BaseModel):
    creditor = models.ForeignKey(
        Creditor,
        on_delete=models.CASCADE,
        related_name="orders",
    )
    value = models.FloatField()

I would like to annotate each Order with its creditor_size (the sum of all order values for its Creditor).
This can be done in a fairly straightforward way with this query:

creditor_size_subquery = Subquery(
    models.Creditor.objects
    .filter(pk=OuterRef("creditor"))
    .annotate(size=Coalesce(Sum("orders__value"), 0.0, output_field=FloatField()))
    .values("size")
)

models.Order.objects.select_related("creditor").annotate(
    creditor_size=creditor_size_subquery
)

Problem is, for only 37,000 order rows this query takes 3 minutes. The raw SQL shows why:

SELECT "order"."id",
       (SELECT COALESCE(SUM(U1."value"), 0.0) AS "size"
        FROM "creditor" U0
                 LEFT OUTER JOIN "order" U1 ON (U0."id" = U1."creditor_id")
        WHERE U0."id" = "order"."creditor_id"
        GROUP BY U0."id") AS "creditor_size"
FROM "order"
         LEFT OUTER JOIN "creditor" ON ("order"."creditor_id" = "creditor"."id")
WHERE "order"."hidden_at" IS NULL
ORDER BY "order"."created_at" DESC

The query above contains a correlated subquery that is executed once per row, which is what slows it down: with 37,000 orders, the subquery runs 37,000 times.

The solution is to calculate the aggregation once and attach the results via a join; the query below runs in less than 1 second:

SELECT o.id, o.creditor_id, subquery.creditor_size
FROM "order" o
         LEFT JOIN
     (
         SELECT creditor_id, SUM(value) AS creditor_size
         FROM "order"
         GROUP BY creditor_id
     ) subquery ON (subquery.creditor_id = o.creditor_id)
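To illustrate the shape of this derived-table join, here is a self-contained sqlite3 sketch on toy data (table and column names simplified from the models above, not taken from any real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, creditor_id INTEGER, value REAL);
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 20.0), (3, 2, 5.0);
""")

# The per-creditor sum is computed once in the derived table,
# then joined onto every order row.
rows = conn.execute("""
    SELECT o.id, o.creditor_id, sub.creditor_size
    FROM orders AS o
    LEFT JOIN (
        SELECT creditor_id, SUM(value) AS creditor_size
        FROM orders
        GROUP BY creditor_id
    ) sub ON sub.creditor_id = o.creditor_id
    ORDER BY o.id
""").fetchall()

print(rows)  # [(1, 1, 30.0), (2, 1, 30.0), (3, 2, 5.0)]
```

Both orders of creditor 1 get the same precomputed total (30.0), instead of recomputing the sum per row.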

Is it possible to achieve this with the Django ORM? Basically, join on a subquery and use values from that subquery.


Hey!
Golden shovel award goes to me this time. I've spent the last few days googling how to join a table with a subquery, and it looks like there is no way (correct me if I'm wrong).
My scenario is quite similar to the one Caio provides. However, my query is even more complicated, with some CASE WHEN expressions based on columns retrieved from the grouped table. Long story short: in the query generated by the Django ORM, the same grouping query appears about 10 times. Even for a relatively small dataset (not exceeding 10K rows in the biggest table), execution time for such a query reaches 10 seconds, while the equivalent query with a subquery in the FROM clause returns instantly.
Am I condemned to use raw() and lose all the perks of having an ORM? Unfortunately I'm on DRF, so I would also lose the out-of-the-box filtering via query parameters.

Sorry for the late response; leaving my comment here to help others. Since that day, I've come to realize that I was using the wrong tool for the problem (analytics/data-engineering pipelines). In general, you should use your ORM only to retrieve instances of your models. I was distorting the ORM to populate dashboards, so my Order query in the end returned nothing like an order, but something like "city, product, count, sum".

These analytical queries should be done in pure SQL, and depending on the volume and complexity, you may need to go very deep into analytics/data-engineering topics: star schemas, analytical (OLAP) databases, and an infinitude of others.

Hope I can prevent others from building an entire dashboard with the ORM like I did.

I disagree with the overgeneralization and your characterization of the ORM here.

We regularly use the ORM to create dashboards and reports for tables consisting of millions of rows.

Yes, part of doing this is ensuring that the tables are appropriately designed (indexes, partitioning, etc) and that the database engine is suitably configured (buffers, query space, etc), but I don’t think we’ve run into more than one or two situations where we’ve found it necessary or preferable to use raw SQL.

Going back to your original question regarding the query to calculate the sum of all the orders for the Creditor of an Order, that query is much more simply written as:

results = Order.objects.annotate(size=Sum('creditor__orders__value'))

Which generates the query:

SELECT "te_order"."id", "te_order"."creditor_id", "te_order"."value", SUM(T3."value") AS "size"
FROM "te_order"
INNER JOIN "te_creditor" ON ("te_order"."creditor_id" = "te_creditor"."id")
LEFT OUTER JOIN "te_order" T3 ON ("te_creditor"."id" = T3."creditor_id")
GROUP BY "te_order"."id"

Still very inefficient, even with indexes. Suppose I have only 2 Creditors and millions of Orders per creditor: the produced query falls into the same problem, because the same calculation is performed again for each order, millions of times.

That's why I wanted to join on a subquery: the subquery calculates the sum per Creditor ONCE, and then I join that value onto each order row. A subquery or a CTE will do the job.

something like:

WITH creditor_total AS (
    SELECT creditor_id, SUM(value) AS creditor_total
    FROM orders
    GROUP BY creditor_id
)
SELECT o.*, ct.creditor_total
FROM orders o
JOIN creditor_total ct ON ct.creditor_id = o.creditor_id

creditor_total is calculated ONCE, instead of being recomputed for every row of orders.

I stand by my point: if your model query does not return instances of the respective model, that is a misuse of the ORM. I was starting with Order.objects…(+90 lines of queries directly in the ORM) and the final query returned something like City, Product, count(orders), sum(value). So my Order queryset was not returning Order instances, only aggregations.

Let's not be afraid of raw SQL. The ORM is for CRUD in a transactional context; for the analytical context, the tooling and techniques are completely different.

I have no problem with using raw SQL when it is needed. I've even posted here in a few places where the ORM may not be the best answer for a specific situation.

However, I do have an issue with the blanket statements asserting that the ORM isn't usable, suitable, or appropriate in cases where it demonstrably is.

These statements:

are generalizations that may be true in your case.

However, it is misrepresenting the state and quality of the ORM by trying to assert that they are true in all cases, when there are many people productively and beneficially using the ORM in exactly those situations.

My bias toward the ORM isn't based on the fact that it allows us to not write SQL. We focus on the ORM because of its integration into Django and Python. Its extensibility through model methods and custom managers and querysets has allowed us to build component libraries that make very complex queries quite manageable. We derive a lot of benefit from being able to reuse these types of components throughout our system.

Coming back to the issue regarding the query being described: if the ratio of the cardinality is that extreme between the two models, my first reaction would be one of two options. I'd either look at the django-cte package to allow for that expression, or I'd implement the Creditor SUM calculation as a custom queryset and let Django manage the reference to the data within the Python code rather than in the database. It's still two queries within the database either way, and my decision between the two would likely depend upon whether or not that SUM calculation is something being used in multiple locations throughout the code.
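The second option (two queries, with the join done in Python) can be sketched roughly as follows. The aggregate rows would come from a single query such as `Order.objects.values("creditor").annotate(total=Sum("value"))` in a real project; here they are replaced with plain dicts so the sketch runs standalone:

```python
# Totals per creditor, as one aggregate query would return them.
# (Hypothetical stand-in for the result of
#  Order.objects.values("creditor").annotate(total=Sum("value")).)
creditor_totals = [
    {"creditor": 1, "total": 30.0},
    {"creditor": 2, "total": 5.0},
]

# The second query is just the plain order list (Order.objects.all()).
orders = [
    {"id": 1, "creditor_id": 1, "value": 10.0},
    {"id": 2, "creditor_id": 1, "value": 20.0},
    {"id": 3, "creditor_id": 2, "value": 5.0},
]

# Build the lookup once...
totals_by_creditor = {row["creditor"]: row["total"] for row in creditor_totals}

# ...then attach the total to each order, O(1) per row.
for order in orders:
    order["creditor_size"] = totals_by_creditor.get(order["creditor_id"], 0.0)

print([o["creditor_size"] for o in orders])  # [30.0, 30.0, 5.0]
```

The sum per creditor is computed once in the database, and the per-row "join" happens as a dict lookup in application code.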