I have an application currently running on Django 4.2 and PostgreSQL 15, with the following model structure (obviously there’s much more than this, but this is the relevant bit for the purposes of this question):
class Artifact(models.Model):
category = models.CharField()
data = models.JSONField(default=dict, blank=True)
class Collection(models.Model):
category = models.CharField()
class CollectionItem(models.Model):
parent_collection = models.ForeignKey(Collection)
artifact = models.ForeignKey("Artifact", null=True)
Artifact
holds artifacts of many different categories (e.g. code packages), and its data
field has a category-dependent structure, validated at the Python level using Pydantic. The structure can be quite complex, but for any given category we know how it will look. For the case I’m looking at here, the data model is something like this:
class DebianSourcePackage(pydantic.BaseModel):
name: str
version: str
type: Literal["dpkg"]
dsc_fields: dict[str, Any]
So far so good; we’ve had all the above fairly established for some time, along with quite a lot of things on top of that (so making fundamental changes to the data model isn’t on the table). Now I’m trying to build logic to build a particular text-based index format from one of these collections, which essentially needs to pick out values from dsc_fields
above and glue them together in particular ways. For performance reasons I don’t want to iterate over all the artifacts and all their fields in Python; I want to have the database do most of the work. I’m perfectly happy for this to be PostgreSQL-specific.
A cut-down version of the sort of query I’m after that works in PostgreSQL looks like this:
SELECT
db_collectionitem.id,
string_agg(dsc_data.key || ': ' || dsc_data.value, E'\n')
FROM
db_collectionitem
JOIN db_artifact ON db_collectionitem.artifact_id = db_artifact.id
-- "ON true" looks odd, but seems correct for an explicit join with a function.
JOIN jsonb_each_text(db_artifact.data->'dsc_fields') AS dsc_data ON true
WHERE db_collectionitem.parent_collection = 1
GROUP BY db_collectionitem.id;
The query is going to be more complex than this in practice, such as ordering the fields in the output and handling certain keys in different ways that might involve some more joins; so to make this compose nicely I’d really prefer to use the ORM for this rather than having to resort to raw query sets or similar. However, I can’t figure out how to make it work. jsonb_each_text
is a set-returning function - it returns a set of (key, value)
records - and as such I believe it needs to go in the FROM
clause. Django Expression
objects can be used as SELECT
output expressions or in the WHERE
clause, but I can’t find a way to use them here. Django does have some support for set-returning functions, but only as of 5.2 which I won’t be in a position to upgrade to for at least a few months (and I also haven’t checked whether it would be enough to solve this problem).
Is there any semi-reasonable way to add an extra join like this in 4.2? I don’t mind writing a moderate amount of custom code for this, but after several hours of trying to wrap my head around this today I’m not sure I’m much further on. I don’t even mind much if it involves poking at internals for the time being - we’ll upgrade to Django 5.2 eventually and can clean things up then. Things I’ve tried so far:
- the obvious sort of thing with
annotate
(doesn’t work; that puts the annotating expression inSELECT
’s output clause) - custom lookups/transforms (unclear how I’d attach one bearing in mind that the structure of
Artifact.data
is variable, and in any case it wasn’t clear I’d be able to do this sort of join that way) - manually calling
query.join
or something like that (in desperation; butdjango.db.models.sql.datastructures.Join
doesn’t seem to be able to emit non-table joins anyway)
Am I barking up the wrong tree somehow? And if this just isn’t possible at all with Django 4.2, what’s the least intrusive way to inject a bit of raw SQL for this? Thanks in advance!