RawQuerySet Memory Issue

Issue
We are using model_cls.objects.raw to fetch data from a Postgres Database and convert the data back to Django model instances.

However, after some experiments, we realized Django RawQuerySet will fetch all data at once even we are using it as a Generator or call the .iterator function. It makes streaming records from the database impossible.

What we want to do
We want to use some Postgres JSONB feature here like jsonb_path_query_array and jsonb_path_exists.

Method 1: Hack the RawQuerySet

We found some gist online to “hack” the RawQuerySet class’s internal cursor object like this:

But neither of these are sustainable.

Method 2: Use connection.cursor.execute to construct the cursor flow

From this post: postgresql - How can I use server-side cursors with django and psycopg2? - Stack Overflow

with transaction.atomic(), connection.cursor() as cur:
    cur.execute("""
        DECLARE mycursor CURSOR FOR
        SELECT *
        FROM giant_table
    """)
    while True:
        cur.execute("FETCH 1000 FROM mycursor")
        chunk = cur.fetchall()
        if not chunk:
            break
        for row in chunk:
            process_row(row)

However, with this we can only get a raw data record, it is not in Django model format. The major blocker here is there is no exposed API in Django for us to convert a record to a Django model instance (with all the type set correctly). For example the JSON data field data coming from this method will be a string instead of a dict.

Note: a brand new model instance Model(**dict) might contain different data type comparing to the model instance coming out from the Django queryset (Saved instance).

Method 3: Use model_cls.objects.extra to build the raw query
From the documentation, we can see we can customize a Queryset with the .extra function. This way we can avoid using the RawQuerySet.

We haven’t tested if this can solve the memory issue yet, but based on the documentation it should help. However it is listed as “to be deprecated”.

Method 4: Combination of cur.execute and model_cls.objects.raw

Extending from Method 2, we declare a PG cursor with cur.execute, but within the while loop we fetch the actual data with model_cls.objects.raw.
The idea is like this:


with transaction.atomic(), connection.cursor() as cursor:
    # Prepare a cursor
    cursor.execute(f"DECLARE c CURSOR WITHOUT HOLD FOR {query}",  params)

    prev_size = chunksize
    while prev_size >= chunksize:
        # Use model_cls.objects.raw instead to obtain the Django model instance
        raw_queryset = model_cls.objects.raw(f"FETCH FORWARD {chunksize} FROM c")
        count = 0
        # Use raw_queryset.iterator instead to avoid caching
        # We assume the queryset won't be reused within the same request session
        for model_instance in raw_queryset.iterator():
            yield model_instance
            count += 1
        prev_size = count

Method 4 seems to be the best, but it might not be ideal if we need to maintain our own DECLARE CURSOR logic.

Any advice?

First, I strongly suggest you ignore any SO question (or answer) more than 2 years old. (Actually, I suggest completely ignoring SO unless you’re already at a knowledge level to know what bad answers to avoid.) The newest of those two questions are targeted at Django 1.8, and we’re currently on 3.2 - and there have been a lot of changes in JSON support since then.

Beyond that, you can create custom functions for the ORM to use when constructing SQL. You can create a class that subclasses Func, that can then be used in a normal ORM query.
Example:

class JsonbPathQueryArray(Func):
    function = 'jsonb_path_query_array'
    arity = 2

(Note: That’s the exact definition that we’re currently using in production in one of our systems.)

See Func() expressions

Since you would be using the ORM in this case, you can also use iterator() on that queryset to enforce server-side cursors, allowing you to iterate over the responses in a more controlled environment.

2 Likes

Thanks for pointing me this direction. I did some research on the Django Model Expression. and here is what I got. (No need to refactor all the existing RawSQL to obtain a QuerySet object instead of the RawQuerySet object).

    # Mark the where_clause as an Queryset Annotation
    queryset = model.objects.annotate(condition=RawSQL(where_clause, params=params))
    # Filter on that annotation
    queryset = queryset.filter(condition=True)

    yield from queryset.iterator(chunk_size=chunk_size)

I haven’t tested the memory footprint of this yet, but it looks good to me.