I’m trying to use the Django ORM to produce the following SQL query:
python: 3.9.16
django: 3.2.19
value = text[]
SELECT CAST(x AS float)
FROM public.collection_basic_arrays
CROSS JOIN unnest(value) AS t(x)
WHERE x::float BETWEEN 0.001 AND 0.01
However, the ORM is translating it to the following SQL query:
SELECT CAST(unnest("collection_basic_arrays"."value") AS double precision) AS "value_unnest"
FROM "collection_basic_arrays"
WHERE
(CAST(unnest("collection_basic_arrays"."value") AS double precision) > 0.001
AND CAST(unnest("collection_basic_arrays"."value") AS double precision) <= 0.01)
ERROR: set-returning functions are not allowed in WHERE
LINE 5: AND CAST(unnest("collection_basic_arrays"."value") AS doubl...
^
SQL state: 0A000
Character: 219
I’m using a custom Func together with Django’s built-in functions, and I’m trying to obtain the query result with the following code:
from django.db.models.functions import Cast
from django.db import models
from django.db.models import Func, Q


class Unnest(Func):
    function = 'unnest'


def execute(self):
    value_key = 'value'
    annotate = {}
    filter = {}
    annotate[value_key + '_unnest'] = Unnest(value_key)
    value_key += '_unnest'
    if value_key in annotate:
        conversion = Cast(annotate[value_key], models.FloatField())
        if conversion is not None:
            annotate[value_key] = conversion
    else:
        conversion = Cast(value_key, models.FloatField())
        if conversion is not None:
            annotate[value_key] = conversion
            value_key += '_conv'
    filter[value_key + '__gt'] = 0.001
    filter[value_key + '__lte'] = 0.01
    data = MyModel.objects.annotate(**annotate).filter(Q(**filter)).values(value_key)
    query_str = str(data.query)
    return data
Is there any way to obtain the expected SQL query using Django ORM? I’d appreciate any suggestions or solutions you may have.
Note that this does not return the cartesian product as your cross join does - no clue if you rely on that somehow. It will still scale badly for a big table and/or many array values per entry, as it forces the planner into sequential array evaluation. If that’s an issue, you should probably normalize the array values themselves.
Maybe also reshape the .extra(...) into a custom Lookup, as extra() is heading for deprecation.
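For reference, the “Lookup instead of extra()” idea would roughly look like the sketch below. The class name, lookup name, and the column in the usage comment are made up, and a transform like this only moves the CAST into reusable ORM syntax - it does not lift the set-returning-function restriction from above.

from django.db.models import FloatField, TextField, Transform

class AsFloat(Transform):
    # illustrative only: casts the left-hand side to double precision
    lookup_name = 'asfloat'
    function = 'CAST'
    template = '%(function)s(%(expressions)s AS double precision)'
    output_field = FloatField()

TextField.register_lookup(AsFloat)

# hypothetical usage:
# MyModel.objects.filter(some_text_column__asfloat__gt=0.001)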
Let me explain better: I need to return all the values of the array together with a datetime field from the same row. So if the array has 250 values, the result has 250 records, each with the same datetime and a different value - that’s why I necessarily need unnest. But I only want the values that satisfy the “greater than” and “less than or equal to” conditions, and each condition can also appear on its own.
I had thought of two raw queries in Postgres:
SELECT CAST(x AS float), updateat
FROM public.collection_basic_arrays
CROSS JOIN unnest(value) AS t(x)
WHERE x::float BETWEEN 0.001 AND 0.01;
SELECT CAST(x AS float), updateat
FROM public.collection_basic_arrays, LATERAL unnest(value) as x
WHERE x::float BETWEEN 0.001 AND 0.01;
The second should perform better.
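For completeness, running the second variant directly from Django with bind parameters would look roughly like this (a sketch, using the table above):

from django.db import connection

sql = """
    SELECT CAST(x AS double precision), updateat
    FROM public.collection_basic_arrays, LATERAL unnest(value) AS x
    WHERE x::double precision BETWEEN %s AND %s
"""
with connection.cursor() as cursor:
    cursor.execute(sql, [0.001, 0.01])
    rows = cursor.fetchall()  # list of (value, updateat) tuples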
The table consists of 3 fields:
id::integer, updateat::timestamp with time zone, value::text[].
updateat is indexed to optimize filtering on it; value is not, because the array always contains different values.
Well, to make it short - with Django’s built-in ORM, things like an explicit cross join or lateral join are not easily possible with the high-level syntax. You’d have to resort to raw queries (which are still easy to use).
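Something along these lines (an untested sketch, assuming the model is called MyModel - note that .raw() needs the primary key column in the select list):

data = MyModel.objects.raw(
    """
    SELECT id, updateat, CAST(x AS double precision) AS value_conv
    FROM public.collection_basic_arrays
    CROSS JOIN LATERAL unnest(value) AS t(x)
    WHERE x::double precision BETWEEN %s AND %s
    """,
    [0.001, 0.01],
)
# every returned object carries .updateat plus the extra .value_conv attribute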
For explicit joins you might have better luck with django-cte (I have not yet looked into this package myself, but it is on my TODO list for one of my projects as a more expressive ORM interface).
Back to your data - I don’t know your usage scenario, your business-critical query load (inserts & updates? the range filtering here?), or how big the table may grow. To avoid running into performance hell later on, here are a few remarks:
array value filtering >> inserts or updates: since your range check is numerical, maybe store the values as numbers in the first place to avoid the casting.
low selectivity on array value filtering per row: here the cross join will perform really badly, lateral somewhat better. If the range borders never change, maybe filter already during insert & update so that only matching values get stored in the first place. Then filtering reduces to a datetime match with annotated unnested array values, which works again with the Django ORM (see the sketch after this list).
high selectivity of date filtering: that’s the worst case, as it forces the DBMS into a sequential scan over lots of rows, each of which needs array unnesting and is already quite big with 250 entries. If you hit the point where the unnesting becomes a bottleneck, a more space-consuming table layout is needed - e.g. save the values separately, one numerical value per row, put an index on that column, and use normal filtering. Postgres won’t choke on tables > 100M rows, but needs proper indexing to stay fast. If you have other predicates to work with, you can also shard such a big table (e.g. by value ranges, by date range, or by some outer constraint not yet reflected here).
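To illustrate that second remark - once only matching values get stored, the whole thing is an ordinary datetime filter plus an unnest annotation in the SELECT list, which Postgres does allow (a rough sketch, reusing the Unnest Func from your first post; start and end are placeholder datetimes):

from django.db import models
from django.db.models import Func
from django.db.models.functions import Cast

class Unnest(Func):
    function = 'unnest'

data = (
    MyModel.objects
    .filter(updateat__gte=start, updateat__lt=end)
    .annotate(value_conv=Cast(Unnest('value'), models.FloatField()))
    .values('updateat', 'value_conv')
)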
Long story short - if your table grows big with your current array approach, performance will suffer very quickly.
Hi @jerch,
Sorry for the late reply, but I was busy on another project.
The DB grows exponentially day by day: in 6 months I have 7,126,406 rows of 250-element arrays. Obviously there are other fields used in query filtering - specifically three more indexed fields, x, y, z - which are added to the lateral query I was talking about above.
I can’t go for the first option because the threshold values could change, but I’m considering adding an index to the “value” column. However, I’m not sure it makes sense, because each row might contain different values.
Honestly - that’s nothing you should try to filter with array unnesting as shown above. Note that you wrote “the conditions can also exist alone” - no, please don’t try that; such a query will not finish in reasonable time anymore.
With this amount of data over time you are in realms where you should normalize things properly, and in a second step you might need to denormalize the data with helper lookup tables (or views) to speed up a particular query again.
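As a rough illustration of what “normalize properly” could mean here (model and field names are made up):

from django.db import models

class ArrayValue(models.Model):
    # one row per former array element instead of a text[] column on the parent
    parent = models.ForeignKey('MyModel', related_name='values',
                               on_delete=models.CASCADE)
    value = models.FloatField(db_index=True)

# the range filter then becomes a plain indexed lookup:
# ArrayValue.objects.filter(value__gt=0.001, value__lte=0.01).values('parent__updateat', 'value')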
Hi @jerch,
For the moment I was trying Django’s extra() method, but it doesn’t seem to format the query correctly.
python: 3.9.16
django: 3.2.19
filter = {
    'x': x,
    'y': y,
    'z': z
}
tables = [MyModel._meta.db_table]
if unnest:
    tables.append('LATERAL unnest(value) AS value_unnest')
casted_value = 'CAST(value_unnest AS double precision)'
select = {'value_conv': '%s'}
select_params = (casted_value, )
discard_min = discards.get('discard_min', None)
discard_max = discards.get('discard_max', None)
where = []
if discard_min is not None and discard_max is not None:
    where.append('%s BETWEEN %s AND %s')
    params = (casted_value, discard_min, discard_max)
else:
    if discard_min is not None:
        where.append('%s > %s')
        params = (casted_value, discard_min,)
    if discard_max is not None:
        where.append('%s <= %s')
        params = (casted_value, discard_max,)
data = MyModel.objects.extra(
    select=select,
    where=where,
    params=params,
    tables=tables,
    select_params=select_params
).filter(Q(**filter)).values('updateat', 'value_conv').last()
But this code produces
SELECT (%s) AS "value_conv", "collection_basic_arrays"."updateat"
FROM "collection_basic_arrays" , "LATERAL unnest(value) AS value_unnest"
WHERE ((%s BETWEEN %s AND %s)
AND "collection_basic_arrays"."x" = %s
AND "collection_basic_arrays"."y" = %s
AND "collection_basic_arrays"."z" = %s)
ORDER BY "collection_basic_arrays"."id" DESC LIMIT 1
Filled:
SELECT (CAST(value_unnest AS double precision)) AS "value_conv", "collection_basic_arrays"."updateat"
FROM "collection_basic_arrays" , "LATERAL unnest(value) AS value_unnest"
WHERE ((CAST(value_unnest AS double precision) BETWEEN 0.001 AND 0.01)
AND "collection_basic_arrays"."x" = x
AND "collection_basic_arrays"."y" = y
AND "collection_basic_arrays"."z" = z)
ORDER BY "collection_basic_arrays"."id" DESC LIMIT 1
ERROR: relation "LATERAL unnest(value) AS value_unnest" does not exist
LINE 2: FROM "collection_basic_arrays", "LATERAL unnest(value) AS va...
^
SQL state: 42P01
Character: 134
I don’t know what to answer - I think you are patching at the wrong end of the issue. Your current unnest approaches are all doomed to turn into a big showstopper later on in your business model, once the sheer amount of data flows in and the first filtering happens.
I strongly suggest going two steps back and rethinking the DB schema. Also do some DB profiling with different index types. 7M * 250 values in 6 months is no easy cake anymore, especially if you force the DB into sequential scans by unfiltered unnesting. And most likely the app won’t run for only 6 months, so the performance will degrade more and more over time.
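For the profiling part, a quick way to compare plans from within Django is EXPLAIN ANALYZE over the candidate queries - e.g. against the normalized layout sketched above (the table name myapp_arrayvalue is hypothetical):

from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        'EXPLAIN (ANALYZE, BUFFERS) '
        'SELECT value FROM myapp_arrayvalue WHERE value > %s AND value <= %s',
        [0.001, 0.01],
    )
    for (line,) in cursor.fetchall():
        print(line)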