Why does union() limit follow-up operations?

From the documentation:

In addition, only LIMIT, OFFSET, COUNT(*), ORDER BY, and specifying columns (i.e. slicing, count(), exists(), order_by(), values()/values_list()) are allowed on the resulting QuerySet. Further, databases place restrictions on what operations are allowed in the combined queries. For example, most databases don’t allow LIMIT or OFFSET in the combined queries.

Could someone explain why this limitation exists, and whether it’s possible it will be fixed in a future release? I often find the union() operator necessary, but after I use it I need to force the execution of the query, and then build a fresh Queryset with the results of the first query, so the resulting queryset supports all the normal operators. I’d love to be able to use union() but also have the resulting queryset support all the standard operators. I use Postgres primarily so I’m mostly interested in answers specific to that backend.

I suggest you have a look at this ticket to learn more about the situation.

The TL;DR is that Django’s layer in charge of accumulating operations that compose a queryset must be bound to a single model at a time. There is no intermediary abstraction that represents a relation or a table (subset of fields, union of different tables) so supporting cases like

Author.objects.filter(
    name__startswith="Leon"
).union(
    Author.objects.filter(
        publisher=some_publisher
    )
).annotate(
    books_count=Count("books")
).filter(books_count__gt=10)

is relatively straightforward while ones that mix models or have SELECT schema that differs from each others (e.g. values, only, select_related) must be treated with a lot of care.

The problems is less in supporting the happy path but in making sure that union disallows all (at least most) of incompatible mixing of Queryset instances which has a very large surface API.

1 Like

OK that makes sense. So basically UNION lets you combine queries in ways that aren’t bound to a particular model, and the Queryset abstraction can’t support that. If there was a way to detect that a UNION stuck to a particular model, it might be OK to lift the restrictions, but it’s hard to detect that given the API surface area.

I have a follow-up question. I’ve previously worked around this problem with a pattern like

u1 = User.objects.filter(...)
u2 = User.objects.filter(...)
user_ids = set(u1.union(u2).values_list('id', flat=True))
result = User.objects.filter(id__in=user_ids)

# more filtering works now
result.filter(...)

which works, but requires an extra DB query.

Recently I experimented with not forcing the db query:

u1 = User.objects.filter(...)
u2 = User.objects.filter(...)
user_ids = u1.union(u2).values_list('id', flat=True)
result = User.objects.filter(id__in=user_ids)

# more filtering works now
result.filter(...)

which also seems to work, and doesn’t require an extra round-trip to the DB. Is the problem avoided if you do the UNION in a sub-select like this?

If by problem you mean circumventing the limitation of supporting follow up operations yes it does circumvent that at the cost of possible query planner optimization barriers and other issue that might arise in using IN (<subquery>) when the subquery can get quite large.

From my experience UNION is most often useful to help query planners figure out how to efficiently deal with complex filtering predicates that include OR clauses by manually breaking down the problem to target specific indices so the IN (<subquery>) approach might help with that at the cost of an extra hash join.

Here’s an example of how Postgres is able to push down predicates to union subqueries when using a FROM clause over the union compared to a an IN clause.

EXPLAIN SELECT * FROM author WHERE id IN (
  SELECT id FROM author WHERE id > 10
  UNION
  SELECT id FROM author WHERE id < 10
) AND id > 100
Hash Join (cost=117.48..138.95 rows=567 width=4)
  Hash Cond: (author_1.id = author.id)
  -> HashAggregate (cost=75.48..92.48 rows=1700 width=4)
        Group Key: author_1.id
        -> Append (cost=10.74..71.23 rows=1700 width=4)
              -> Bitmap Heap Scan on author author_1 (cost=10.74..31.37 rows=850 width=4)
                    Recheck Cond: (id > 10)
                    -> Bitmap Index Scan on author_pkey (cost=0.00..10.53 rows=850 width=0)
                          Index Cond: (id > 10)
              -> Bitmap Heap Scan on author author_2 (cost=10.74..31.37 rows=850 width=4)
                    Recheck Cond: (id < 10)
                    -> Bitmap Index Scan on author_pkey (cost=0.00..10.53 rows=850 width=0)
                          Index Cond: (id < 10)
  -> Hash (cost=31.37..31.37 rows=850 width=4)
        -> Bitmap Heap Scan on author (cost=10.74..31.37 rows=850 width=4)
              Recheck Cond: (id > 100)
              -> Bitmap Index Scan on author_pkey (cost=0.00..10.53 rows=850 width=0)
                    Index Cond: (id > 100)
EXPLAIN SELECT * FROM (
    SELECT * FROM author WHERE id > 10
    UNION
    SELECT * FROM author WHERE id < 10
) WHERE id > 100

Notice the absence of Hash and further bitmap heap scans.

HashAggregate  (cost=57.11..65.74 rows=863 width=4)
  Group Key: author.id
  ->  Append  (cost=12.87..54.95 rows=863 width=4)
        ->  Bitmap Heap Scan on author  (cost=12.87..35.62 rows=850 width=4)
              Recheck Cond: ((id > 10) AND (id > 100))
              ->  Bitmap Index Scan on author_pkey  (cost=0.00..12.65 rows=850 width=0)
                    Index Cond: ((id > 10) AND (id > 100))
        ->  Bitmap Heap Scan on author author_1  (cost=4.29..15.02 rows=13 width=4)
              Recheck Cond: ((id < 10) AND (id > 100))
              ->  Bitmap Index Scan on author_pkey  (cost=0.00..4.29 rows=13 width=0)
                    Index Cond: ((id < 10) AND (id > 100))

If by problem you mean circumventing the limitation of supporting follow up operations yes it does circumvent that

OK that’s good to know.

From my experience UNION is most often useful to help query planners figure out how to efficiently deal with complex filtering predicates that include OR clauses

The main issue I run into that UNION solves is this, i.e. OR conditions involving multiple tables (due to a select-related table being involved in a condition for example), which causes the Postgres query planner to resort to scanning one or both tables. In my experience the OR version might take a minute, and the UNION version might take 1ms, but then I need to use some work-around to make the Queryset use-able again (using one of the approaches I mentioned above).

That adds up with my previous use case for it. In this case your best option is likely the __in workaround until proper support for subquery pushdown of union lands.

You might want to subscribe to the ticket above about adding support for follow up operations and this one to add support for subquery wrapping which could allow you to do use qs.union(other_qs).subquery().filter() explicitly.

1 Like