Converting Postgres SQL to Django ORM query

I want to convert the following query to Django

select file.id, array_agg(fileregion.defect_ids) from  file 
left outer join (select file_id, jsonb_object_keys(defects) as defect_ids
from ai_fileregion ) as fileregion
on fileregion.file_id=file.id group by file.id

Following are my Django models

class File(Base):
    class Meta:
        db_table = "file"


class FileRegion(Base):
    file = models.ForeignKey(File, on_delete=models.PROTECT, related_name='file_regions')
    # defects is of the form {'<defectid1>': {}, <defectid2>: {}}
    defects = JSONField(default=dict)
    class Meta:
        db_table = "ai_fileregion"

Aim is to basically get a list of records in which one field is the file_id and another field has an array of aggregated keys from defects json field for that file_id.

Note: Each file may have more than one entry in the ai_fileregion table.

Note: There will be more chaining added to this queryset. So, getting results in a different format to server and modifying it at the application level is not an option.

There is a Django expression for array_agg. You should be able to create your own expression for jsonb_object_keys following the docs here or you might be able to use a Func.

Once you have those created, you might be able to use them in annotate() calls like other aggregates. I’m not sure if it’ll all work as I haven’t done something like this and I’m not sure how ArrayAgg works in conjunction with a Subquery instance. That said, I would expect it to look like the following:

from django.db.models import Func, OuterRef, Subquery
from django.db.postgres.aggregates import ArrayAgg

defect_ids = FileRegion.objects.filter(
    file=OuterRef('id')
).annotate(
    defect_ids=Func(F('defects'), function='jsonb_object_keys')
).values('defect_ids')

files = File.objects.annotate(defect_ids=Subquery(defect_ids[:1])).annotate(
    defect_array=ArrayAgg('defect_ids')
)

Another possibility would be:

file_regions = FileRegion.objects.select_related('file').annotate(
    defect_ids=Func(F('defects'), function='jsonb_object_keys')
).annotate(
    defect_array=ArrayAgg('defect_ids')
)

The select_related('file') would pull the file so you could access it as file_regions.first().file

@CodenameTim, Thank you so much for giving a detailed answer. I still have a few queries in the approach you suggested.

I know for sure that the second approach doesn’t work because postgres doesn’t support set returning function as argument to aggregate functions. It will throw an error saying aggregate function calls cannot contain set-returning function calls

As for the first approach, The problem is that each file can have multiple file regions. Also, when we use jsonb_object_keys it will split all keys as different records. So, when we use defect_ids[:1] in the subquery, will it not loose the information about the other records from defect_ids?

There’s definitely a chance of that, I can’t say for certain it will work. You can inspect the query with print(qs.query) to verify it’s generating a query that will work for you.

@CodenameTim, Yes, I verified it and it’s happening as I said earlier. Any other suggestions on how I could approach this?