Issue
We are using model_cls.objects.raw
to fetch data from a Postgres Database and convert the data back to Django model instances.
However, after some experiments, we realized Django RawQuerySet
will fetch all data at once even we are using it as a Generator or call the .iterator
function. It makes streaming records from the database impossible.
What we want to do
We want to use some Postgres JSONB feature here like jsonb_path_query_array
and jsonb_path_exists
.
Method 1: Hack the RawQuerySet
We found some gist online to “hack” the RawQuerySet class’s internal cursor object like this:
But neither of these are sustainable.
Method 2: Use connection.cursor.execute
to construct the cursor flow
From this post: postgresql - How can I use server-side cursors with django and psycopg2? - Stack Overflow
with transaction.atomic(), connection.cursor() as cur:
cur.execute("""
DECLARE mycursor CURSOR FOR
SELECT *
FROM giant_table
""")
while True:
cur.execute("FETCH 1000 FROM mycursor")
chunk = cur.fetchall()
if not chunk:
break
for row in chunk:
process_row(row)
However, with this we can only get a raw data record, it is not in Django model format. The major blocker here is there is no exposed API in Django for us to convert a record to a Django model instance (with all the type set correctly). For example the JSON data field data coming from this method will be a string instead of a dict.
Note: a brand new model instance Model(**dict)
might contain different data type comparing to the model instance coming out from the Django queryset (Saved instance).
Method 3: Use model_cls.objects.extra
to build the raw query
From the documentation, we can see we can customize a Queryset
with the .extra
function. This way we can avoid using the RawQuerySet
.
We haven’t tested if this can solve the memory issue yet, but based on the documentation it should help. However it is listed as “to be deprecated”.
Method 4: Combination of cur.execute
and model_cls.objects.raw
Extending from Method 2, we declare a PG cursor with cur.execute
, but within the while loop we fetch the actual data with model_cls.objects.raw
.
The idea is like this:
with transaction.atomic(), connection.cursor() as cursor:
# Prepare a cursor
cursor.execute(f"DECLARE c CURSOR WITHOUT HOLD FOR {query}", params)
prev_size = chunksize
while prev_size >= chunksize:
# Use model_cls.objects.raw instead to obtain the Django model instance
raw_queryset = model_cls.objects.raw(f"FETCH FORWARD {chunksize} FROM c")
count = 0
# Use raw_queryset.iterator instead to avoid caching
# We assume the queryset won't be reused within the same request session
for model_instance in raw_queryset.iterator():
yield model_instance
count += 1
prev_size = count
Method 4 seems to be the best, but it might not be ideal if we need to maintain our own DECLARE CURSOR
logic.
Any advice?