Joining a set-returning function and aggregating its output

I have an application currently running on Django 4.2 and PostgreSQL 15, with the following model structure (obviously there’s much more than this, but this is the relevant bit for the purposes of this question):

class Artifact(models.Model):
    category = models.CharField()
    data = models.JSONField(default=dict, blank=True)

class Collection(models.Model):
    category = models.CharField()

class CollectionItem(models.Model):
    parent_collection = models.ForeignKey(Collection)
    artifact = models.ForeignKey("Artifact", null=True)

Artifact holds artifacts of many different categories (e.g. code packages), and its data field has a category-dependent structure, validated at the Python level using Pydantic. The structure can be quite complex, but for any given category we know how it will look. For the case I’m looking at here, the data model is something like this:

class DebianSourcePackage(pydantic.BaseModel):
    name: str
    version: str
    type: Literal["dpkg"]
    dsc_fields: dict[str, Any]

So far so good; we’ve had all the above fairly established for some time, along with quite a lot of things on top of that (so making fundamental changes to the data model isn’t on the table). Now I’m trying to build logic to build a particular text-based index format from one of these collections, which essentially needs to pick out values from dsc_fields above and glue them together in particular ways. For performance reasons I don’t want to iterate over all the artifacts and all their fields in Python; I want to have the database do most of the work. I’m perfectly happy for this to be PostgreSQL-specific.

A cut-down version of the sort of query I’m after that works in PostgreSQL looks like this:

SELECT
    db_collectionitem.id,
    string_agg(dsc_data.key || ': ' || dsc_data.value, E'\n')
FROM
    db_collectionitem
    JOIN db_artifact ON db_collectionitem.artifact_id = db_artifact.id
    -- "ON true" looks odd, but seems correct for an explicit join with a function.
    JOIN jsonb_each_text(db_artifact.data->'dsc_fields') AS dsc_data ON true
WHERE db_collectionitem.parent_collection = 1
GROUP BY db_collectionitem.id;

The query is going to be more complex than this in practice, such as ordering the fields in the output and handling certain keys in different ways that might involve some more joins; so to make this compose nicely I’d really prefer to use the ORM for this rather than having to resort to raw query sets or similar. However, I can’t figure out how to make it work. jsonb_each_text is a set-returning function - it returns a set of (key, value) records - and as such I believe it needs to go in the FROM clause. Django Expression objects can be used as SELECT output expressions or in the WHERE clause, but I can’t find a way to use them here. Django does have some support for set-returning functions, but only as of 5.2 which I won’t be in a position to upgrade to for at least a few months (and I also haven’t checked whether it would be enough to solve this problem).

Is there any semi-reasonable way to add an extra join like this in 4.2? I don’t mind writing a moderate amount of custom code for this, but after several hours of trying to wrap my head around this today I’m not sure I’m much further on. I don’t even mind much if it involves poking at internals for the time being - we’ll upgrade to Django 5.2 eventually and can clean things up then. Things I’ve tried so far:

  • the obvious sort of thing with annotate (doesn’t work; that puts the annotating expression in SELECT’s output clause)
  • custom lookups/transforms (unclear how I’d attach one bearing in mind that the structure of Artifact.data is variable, and in any case it wasn’t clear I’d be able to do this sort of join that way)
  • manually calling query.join or something like that (in desperation; but django.db.models.sql.datastructures.Join doesn’t seem to be able to emit non-table joins anyway)

Am I barking up the wrong tree somehow? And if this just isn’t possible at all with Django 4.2, what’s the least intrusive way to inject a bit of raw SQL for this? Thanks in advance!

Hello @cjwatson!

There’s effectively no supported way to JOIN against a set returning function.

You should either use QuerySet.raw of give a shot at creating a subclass of sql.Join that implements as specialized as_sql method and inject it into your queryset’s .query.alias_map and call ref_alias.

Thanks for the hint!

This sort of approach seems to be roughly working (though it’d be nowhere near Django’s quality standards):

from typing import Any

from django.contrib.postgres.aggregates import StringAgg
from django.db.backends.base.base import BaseDatabaseWrapper
from django.db.models import Func
from django.db.models.functions import Concat
from django.db.models.sql.constants import INNER
from django.db.models.sql.datastructures import BaseTable
from django.db.models.sql.compiler import SQLCompiler


class FunctionJoin(BaseTable):
    """
    Join a set-returning function.

    See:
    https://forum.djangoproject.com/t/joining-a-set-returning-function-and-aggregating-its-output/40930

    Although this is really a join, ``Query.join`` accepts either ``Join``
    or ``BaseTable``, and we subclass ``BaseTable`` instead since it's less
    inconvenient.
    """

    def __init__(self, table_name: str, alias: str | None, function: Func) -> None:
        super().__init__(table_name=table_name, alias=alias)
        self.function = function

    def as_sql(self, compiler: SQLCompiler, connection: BaseDatabaseWrapper) -> tuple[str, list[str | int]]:
        alias_str = f" {self.table_name}" if self.table_alias is None else f" {self.table_alias}"
        sql, params = compiler.compile(self.function)
        return f"{INNER} {sql}{alias_str} ON true", list(params)

    def relabeled_clone(self, change_map: dict[str | None, str]) -> "FunctionJoin":
        return self.__class__(self.table_name, change_map.get(self.table_alias, self.table_alias), self.function)

    @property
    def identity(self) -> tuple[Any, ...]:
        return self.__class__, self.table_name, self.table_alias, *self.function.identity


class JSONBEachText(Func):
    function = "jsonb_each_text"
    arity = 1


qs = CollectionItem.objects.filter(id__in=(1, 2, 3, 4), artifact__category="debian:source-package")
qs.query.join(FunctionJoin("dsc_data", None, JSONBEachText(RawSQL("db_artifact.data->'dsc_fields'", ()))))
qs.values("id", stanza=StringAgg(Concat(RawSQL("dsc_data.key", ()), Value(": "), RawSQL("dsc_data.value", ())), delimiter="\n"))

I might be able to reduce the RawSQL calls a bit (this is literally just the first thing I got to work), but at least they’re contained in a way that doesn’t get in the way of composability.