Problem with union - order_by - values

In a project of mine with Django 5.1 I have the following (simplified) models:

class MainModel(models.Model):
    name = models.CharField(max_length=8)


class SecondaryModel(models.Model):
    main = models.ForeignKey(MainModel, on_delete=models.CASCADE)
    value = models.IntegerField()
    timestamp = models.DateTimeField()

    class Meta:
        constraints = (
            models.UniqueConstraint(
                fields=('main', 'value'),
                name='secondary_uniq'),
        )

What I want to achieve is to retrieve the MainModel which was the latest to be assigned one of a given set of SecondaryModel value.

Initially I had this implemented via a series of subqueries over SecondaryModel to ensure to get for each value the latest record which was associated with it:

def query_with_subquery(data):
    secondary_qs = SecondaryModel.objects.none()
    for elem in data:
        secondary_qs |= (
            SecondaryModel.objects.filter(
                value=elem['value'],
                timestamp__gte=elem['timestamp']
            ).filter(
                timestamp__gte=SecondaryModel.objects.filter(
                    value=elem['value'],
                ).values('timestamp').order_by('-timestamp')[:1]
            )
        )

    secondary_qs = secondary_qs.order_by('-timestamp')
    return MainModel.objects.get(pk=secondary_qs.values('main')[:1])

I found that this query is very expensive in my real-life database, then I wanted to change the query by hacking the SQL UNION, because to get the element which is the latest to have a particular value I just need to filter for that value, sort by time descending and get the first element. When having multiple values I just neet to UNION the queries for each value and then perform an ulterior sort by time to get the result that I am looking for. What I want to achieve is the following subquery:

SELECT main_id
FROM
    (
        SELECT * FROM (
            SELECT main_id, timestamp
            FROM test_union_secondarymodel
            WHERE value = 1
            ORDER BY timestamp DESC
            LIMIT 1
        )
        UNION
        SELECT * FROM (
            SELECT main_id, timestamp
            FROM test_union_secondarymodel
            WHERE value = 2
            ORDER BY timestamp DESC
            LIMIT 1
        )
    )
ORDER BY timestamp DESC;

To do this I written this method:

def query_with_union(data):
    secondary_qs = SecondaryModel.objects.none()
    for elem in data:
        secondary_qs = secondary_qs.union(
            SecondaryModel.objects.filter(
                value=elem['value'],
                timestamp__gte=elem['timestamp'],
            )
            .order_by('-timestamp')
            .values('main', 'timestamp')[:1]
        )
    
    # required since if there is only 1 query the UNION is not performed
    if len(data) > 1:
        secondary_qs = secondary_qs.order_by('-timestamp')

    return MainModel.objects.get(pk=secondary_qs.values('main')[:1])

But when I run this code with more than one query inside the UNION I get the following error:

 File "/tmp/test-union/test_union/tests.py", line 27, in query_with_union
    return MainModel.objects.get(pk=secondary_qs.values('main')[:1])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...                               
  File "/tmp/test-union/.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py", line 557, in compile                                         
    sql, params = node.as_sql(self, self.connection)           
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                         
  File "/tmp/test-union/.venv/lib/python3.12/site-packages/django/db/models/sql/query.py", line 1263, in as_sql                                            
    sql, params = self.get_compiler(connection=connection).as_sql()          
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^          
  File "/tmp/test-union/.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py", line 747, in as_sql                                          
    extra_select, order_by, group_by = self.pre_sql_setup(                   
                                       ^^^^^^^^^^^^^^^^^^^                   
  File "/tmp/test-union/.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py", line 85, in pre_sql_setup                                    
    order_by = self.get_order_by()                                           
               ^^^^^^^^^^^^^^^^^^^
  File "/tmp/test-union/.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py", line 500, in get_order_by                                    
    raise DatabaseError(                                                     
django.db.utils.DatabaseError: ORDER BY term does not match any column in the result set. 

What can I do to obtain the query I would like to produce?
Why does the ORM complains that there the timestamp column is not in the result set only when I call .values('main')?

Here is a sample repository with a test-case which uses both approaches GitHub - sevdog/test-union-query: Tests for usage of union query in django

Hey there!

The problem you are running into is surfaced through this code path.

The reason you are running into this problem is due to a few constraining properties of your solution.

First, as you likely know given the if len(data) > 1 branch, the local ordering of inner queries (the members of the UNION) is ignored by most backends and a global ordering must be specified.

Secondly, your union subquery must return a single column to match against MainModel.pk in the get call.

Lastly, because some backends have trouble resolving by global ORDER BY alias name (e.g. ambiguity of ordering of some aliases in members of the union) and due to some re-aliasing limitation of the ORM the global ordering of unions is performed by SELECT position (e.g. ORDER BY 1). This means that if a column referenced in order_by is not a member of the values / SELECT mask then you’ll get this specific error.

Another way of looking at this problem is that order_by is resolved after the SELECT clause is compiled so what you’re asking the ORM by doing order_by('-timestamp').values('main') is similar to asking

SELECT * FROM
(
    SELECT main_id
    FROM test_union_secondarymodel
    WHERE value = 1
    ORDER BY timestamp DESC
    LIMIT 1
    UNION
    SELECT main_id
    FROM test_union_secondarymodel
    WHERE value = 2
    ORDER BY timestamp DESC
    LIMIT 1
)
ORDER BY timestamp DESC  -- Invalid; timestamp is not part of the SELECT mask
LIMIT 1

Given you must only return only one column though you are kind of in a pickle as including timestamp in the select mask will prevent usage of the subquery as a lookup.

There is an alternative solution to this problem though thanks to Django 4.2+ support for filtering against window functions

def query_with_window(data):
    values = []
    filter_ = Q()
    for elem in data:
        filter_ |= Q(
            value=elem["value"],
            timestamp__gte=elem["timestamp"],
        )
    secondary_qs = (
        SecondaryModel.objects.filter(filter_)
        .annotate(
            value_rownumber=Window(
                RowNumber(), partition_by="value", order_by="-timestamp"
            )
        )
        .filter(value_rownumber=1)
        .order_by("-timestamp")
    )
    return MainModel.objects.get(pk=secondary_qs.values("main")[:1])

If window functions don’t make the cut performance wise there might an alternative but it’s a bit more involved so I’ll wait for your reply before getting into it.

1 Like

Thank you @charettes, with window function the returned result are those which I expected when using the initial subquery and the overall cost of the operation has decreased (which was my main goal, since this operation is a bit frequent).

I did not know that .order_by was evaluated after the build of the SELECT clause.
When using a basic queryset (without any union/intersect/difference) is always possible to perform sorting on a field which is not present in the values/values_list, which is also the reason to have .alias method. So I belive that the problem in the ORM could be how those queryset are handled.

However the implementation with UNION would have been faster and cheaper (in my setup with indexes and other elements). These are the summaried values of EXPLAIN ANALYZE on those queries in an environment with 34M records of SecondaryModel and 573K records of MainModel:

Query Cost Time
SubQuery (initial) cost=10335.70…10337.92 rows=1 width=79 actual time=70.135…70.136 rows=0 loops=1
Window Query cost=637.23…639.45 rows=1 width=79 actual time=70.256…70.259 rows=1 loops=1
Union Query cost=544.52…546.74 rows=1 width=79 actual time=2.682…2.687 rows=1 loops=1

I know that a lower cost means lower CPU/disk usage and UNION and WINDOW have very similar costs, however the UNION is 20 times faster. Thus I belive that it deserves more effort into porting it to my codebase.

EDIT: found out that by writing better some conditions the initial SubQuery would also produce a cost around 800 (which is fair enough compared the the previous 10k), also further tests showed that the timing are similar. I still would like to perform it with UNION as a personal achievement and to satisfy my taste.

Since my initial problem with the .values inside the components of UNION I tried removing that to see what would happen. I got an error which currently I cannot reproduce in the dummy repository which I setup for sharing (if I got more time I will try to adjust the setup to see how to reproduce it).

These are the latest frames of the stacktrace:

            ...
            if len(data) > 1:
                secondary_qs = secondary_qs.order_by('-timestamp')

            secondary_qs = secondary_qs.values('main')[:1]
            with suppress(MainModel.DoesNotExist):
                return MainModel.objects.filter(customer=customer).get(pk=secondary_qs)
../.venv/lib/python3.12/site-packages/django/db/models/query.py:645: in get
    num = len(clone)
        args       = ()
        clone      = <[FieldError("Cannot resolve keyword '' into field. Choices are: main, main_id, id, timestamp, type, value") raised in repr()] QuerySet object at 0x797a86982360>
        kwargs     = <[FieldError("Cannot resolve keyword '' into field. Choices are: main, main_id, id, timestamp, type, value") raised in repr()] dict object at 0x797a86aed540>
        limit      = 21
        self       = <QuerySet [<MainModel: 9a99e1c4-adea-46a0-ad0c-67f93340a3a3>, <MainModel: bd64a09d-0e2c-4f71-841b-819434fdcb58>, <MainModel: e9afd7ac-589b-4510-8d52-e4989ba0c56e>, <MainModel: 753d4f5b-f406-400d-88f0-18f9708ebf66>, <MainModel: 753d4f5b-f406-401d-88f0-18f9708ebf66>]>
../.venv/lib/python3.12/site-packages/django/db/models/query.py:382: in __len__
    self._fetch_all()
        self       = <[FieldError("Cannot resolve keyword '' into field. Choices are: main, main_id, id, timestamp, type, value") raised in repr()] QuerySet object at 0x797a86982360>
../.venv/lib/python3.12/site-packages/django/db/models/query.py:1928: in _fetch_all
    self._result_cache = list(self._iterable_class(self))
        self       = <[FieldError("Cannot resolve keyword '' into field. Choices are: main, main_id, id, timestamp, type, value") raised in repr()] QuerySet object at 0x797a86982360>
../.venv/lib/python3.12/site-packages/django/db/models/query.py:91: in __iter__
    results = compiler.execute_sql(
        compiler   = <SQLCompiler model=MainModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using='default'>
        db         = 'default'
        queryset   = <[FieldError("Cannot resolve keyword '' into field. Choices are: main, main_id, id, timestamp, type, value") raised in repr()] QuerySet object at 0x797a86982360>
        self       = <django.db.models.query.ModelIterable object at 0x797a86980890>
../.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py:1561: in execute_sql
    sql, params = self.as_sql()
        chunk_size = 100
        chunked_fetch = False
        result_type = 'multi'
        self       = <SQLCompiler model=MainModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using='default'>
../.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py:775: in as_sql
    self.compile(self.where) if self.where is not None else ("", [])
        combinator = None
        distinct_fields = []
        distinct_params = []
        extra_select = []
        f_params   = []
        features   = <django.db.backends.postgresql.features.DatabaseFeatures object at 0x797a8b20dee0>
        for_update_part = None
        from_      = ['"test_union_main"']
        group_by   = []
        order_by   = []
        refcounts_before = {'test_union_main': 2, 'portal_customer': 0}
        self       = <SQLCompiler model=MainModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using='default'>
        with_col_aliases = False
        with_limit_offset = True
        with_limits = True
../.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py:557: in compile
    sql, params = node.as_sql(self, self.connection)
        node       = <WhereNode: (AND: RelatedExact(Col(test_union_main, test_union.MainModel.customer), 50001), Exact(Col(test_union_main, test_union.MainModel.id), <django.db.models.sql.query.Query object at 0x797a86982090>))>
        self       = <SQLCompiler model=MainModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using='default'>
        vendor_impl = None
../.venv/lib/python3.12/site-packages/django/db/models/sql/where.py:151: in as_sql
    sql, params = compiler.compile(child)
        child      = Exact(Col(test_union_main, test_union.MainModel.id), <django.db.models.sql.query.Query object at 0x797a86982090>)
        compiler   = <SQLCompiler model=MainModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using='default'>
        connection = <DatabaseWrapper vendor='postgresql' alias='default'>
        empty_needed = 1
        full_needed = 2
        params     = [50001]
        result     = ['"test_union_main"."customer_id" = %s']
        result_params = [50001]
        self       = <WhereNode: (AND: RelatedExact(Col(test_union_main, test_union.MainModel.customer), 50001), Exact(Col(test_union_main, test_union.MainModel.id), <django.db.models.sql.query.Query object at 0x797a86982090>))>
        sql        = '"test_union_main"."customer_id" = %s'
../.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py:557: in compile
    sql, params = node.as_sql(self, self.connection)
        node       = Exact(Col(test_union_main, test_union.MainModel.id), <django.db.models.sql.query.Query object at 0x797a86982090>)
        self       = <SQLCompiler model=MainModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using='default'>
        vendor_impl = None
../.venv/lib/python3.12/site-packages/django/db/models/lookups.py:385: in as_sql
    return super().as_sql(compiler, connection)
        __class__  = <class 'django.db.models.lookups.Exact'>
        compiler   = <SQLCompiler model=MainModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using='default'>
        connection = <DatabaseWrapper vendor='postgresql' alias='default'>
        self       = Exact(Col(test_union_main, test_union.MainModel.id), <django.db.models.sql.query.Query object at 0x797a86982090>)
../.venv/lib/python3.12/site-packages/django/db/models/lookups.py:246: in as_sql
    rhs_sql, rhs_params = self.process_rhs(compiler, connection)
        compiler   = <SQLCompiler model=MainModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using='default'>
        connection = <DatabaseWrapper vendor='postgresql' alias='default'>
        lhs_sql    = '"test_union_main"."id"'
        params     = []
        self       = Exact(Col(test_union_main, test_union.MainModel.id), <django.db.models.sql.query.Query object at 0x797a86982090>)
../.venv/lib/python3.12/site-packages/django/db/models/lookups.py:121: in process_rhs
    sql, params = compiler.compile(value)
        compiler   = <SQLCompiler model=MainModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using='default'>
        connection = <DatabaseWrapper vendor='postgresql' alias='default'>
        self       = Exact(Col(test_union_main, test_union.MainModel.id), <django.db.models.sql.query.Query object at 0x797a86982090>)
        value      = <django.db.models.sql.query.Query object at 0x797a86982090>
../.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py:557: in compile
    sql, params = node.as_sql(self, self.connection)
        node       = <django.db.models.sql.query.Query object at 0x797a86982090>
        self       = <SQLCompiler model=MainModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using='default'>
        vendor_impl = None
../.venv/lib/python3.12/site-packages/django/db/models/sql/query.py:1263: in as_sql
    sql, params = self.get_compiler(connection=connection).as_sql()
        compiler   = <SQLCompiler model=MainModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using='default'>
        connection = <DatabaseWrapper vendor='postgresql' alias='default'>
        self       = <django.db.models.sql.query.Query object at 0x797a86982090>
../.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py:762: in as_sql
    result, params = self.get_combinator_sql(
        combinator = 'union'
        extra_select = []
        features   = <django.db.backends.postgresql.features.DatabaseFeatures object at 0x797a8b20dee0>
        for_update_part = None
        group_by   = []
        order_by   = [(OrderBy(Ref(__orderbycol2, Col(U0, test_union.SecondaryModel.timestamp)), descending=True),
  ('"__orderbycol2" DESC', [], False))]
        refcounts_before = {'U0': 4, 'U1': 1, 'U2': 0}
        self       = <SQLCompiler model=SecondaryModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using=None>
        with_col_aliases = False
        with_limit_offset = True
        with_limits = True
../.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py:598: in get_combinator_sql
    part_sql, part_args = compiler.as_sql(with_col_aliases=True)
        all        = False
        combinator = 'union'
        compiler   = <SQLCompiler model=SecondaryModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using=None>
        compilers  = [<SQLCompiler model=SecondaryModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using=None>,
 <SQLCompiler model=SecondaryModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using=None>]
        features   = <django.db.backends.postgresql.features.DatabaseFeatures object at 0x797a8b20dee0>
        parts      = ()
        self       = <SQLCompiler model=SecondaryModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using=None>
../.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py:762: in as_sql
    result, params = self.get_combinator_sql(
        combinator = 'union'
        extra_select = []
        features   = <django.db.backends.postgresql.features.DatabaseFeatures object at 0x797a8b20dee0>
        for_update_part = None
        group_by   = []
        order_by   = []
        refcounts_before = {'U0': 5, 'U1': 1, 'U2': 0}
        self       = <SQLCompiler model=SecondaryModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using=None>
        with_col_aliases = True
        with_limit_offset = False
        with_limits = True
../.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py:591: in get_combinator_sql
    compiler.query.set_values(
        all        = False
        combinator = 'union'
        compiler   = <SQLCompiler model=SecondaryModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using=None>
        compilers  = [<SQLCompiler model=SecondaryModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using=None>,
 <SQLCompiler model=SecondaryModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using=None>]
        features   = <django.db.backends.postgresql.features.DatabaseFeatures object at 0x797a8b20dee0>
        parts      = ()
        self       = <SQLCompiler model=SecondaryModel connection=<DatabaseWrapper vendor='postgresql' alias='default'> using=None>
../.venv/lib/python3.12/site-packages/django/db/models/sql/query.py:2500: in set_values
    self.add_fields(field_names, True)
        annotation_names = []
        extra_names = []
        field      = '__orderbycol2'
        field_names = ['main', '__orderbycol2']
        fields     = ('main', '__orderbycol2')
        selected   = frozenset({'main', '__orderbycol2'})
        self       = <django.db.models.sql.query.Query object at 0x797a86983170>
../.venv/lib/python3.12/site-packages/django/db/models/sql/query.py:2195: in add_fields
    join_info = self.setup_joins(
        alias      = 'U0'
        allow_m2m  = True
        cols       = [Col(U0, test_union.SecondaryModel.main)]
        field_names = ['main', '__orderbycol2']
        final_alias = 'U0'
        join_info  = JoinInfo(final_field=<django.db.models.fields.related.ForeignKey: main>, targets=(<django.db.models.fields.UUIDField: id>,), opts=<Options for MainModel>, joins=['U0', 'U1'], path=[PathInfo(from_opts=<Options for SecondaryModel>, to_opts=<Options for MainModel>, target_fields=(<django.db.models.fields.UUIDField: id>,), join_field=<django.db.models.fields.related.ForeignKey: main>, m2m=False, direct=True, filtered_relation=None)], transform_function=<function Query.setup_joins.<locals>.final_transformer at 0x797a86955d00>)
        joins      = ['U0']
        name       = '__orderbycol2'
        opts       = <Options for SecondaryModel>
        self       = <django.db.models.sql.query.Query object at 0x797a86983170>
        target     = <django.db.models.fields.related.ForeignKey: main>
        targets    = (<django.db.models.fields.related.ForeignKey: main>,)
../.venv/lib/python3.12/site-packages/django/db/models/sql/query.py:1863: in setup_joins
    path, final_field, targets, rest = self.names_to_path(
        alias      = 'U0'
        allow_many = True
        can_reuse  = None
        final_transformer = <function Query.setup_joins.<locals>.final_transformer at 0x797a86955c60>
        joins      = ['U0']
        last_field_exception = FieldError("Cannot resolve keyword '' into field. Choices are: main, main_id, id, timestamp, type, value")
        names      = ['', 'orderbycol2']
        opts       = <Options for SecondaryModel>
        pivot      = 1
        self       = <django.db.models.sql.query.Query object at 0x797a86983170>

The colalias calculated here is getting considered as a lookup for a field since it contains a double underscore.

If you really need to stick to using UNION I’d suggest something like

from django.db.models import Subquery

class SubqueryMask(Subquery):
    def __init__(self, queryset, mask):
        self.mask = mask
        super().__init__(queryset)

    def as_sql(self, compiler, connection, **kwargs):
        sql, params = super().as_sql(compiler, connection, **kwargs)
        mask_sql = ", ".join(
            connection.ops.quote_name(field) for field in self.mask
        )
        return f"SELECT {mask_sql} FROM ({sql})", params

def query_with_union(data):
    secondary_qs = SecondaryModel.objects.none()
    for elem in data:
        secondary_qs = secondary_qs.union(
            SecondaryModel.objects.filter(
                value=elem["value"],
                timestamp__gte=elem["timestamp"],
            )
            .order_by("-timestamp")
            .values("main", "timestamp")[:1]
        )

    if len(data) > 1:
        secondary_qs = secondary_qs.order_by("-timestamp")

    return MainModel.objects.get(pk=SubqueryMask(secondary_qs[:1], ["main_id"]))

Which should allow to keep timestamp in the SELECT mask of UNION members for ordering purpose while excluding it when looking up against it.

1 Like