Proposal: Querying objects in batches for large collections

Looping through a large collection with the ORM is inefficient because a single query can load a huge number of objects into memory. When I’m doing a backfill on tables with millions of rows, I have to slice the query into multiple chunks to control how many objects are pulled from the database at once.

I came up with this helper that I use in many apps:

import logging

from django.db.models import QuerySet

logger = logging.getLogger(__name__)


def batched_query(queryset: QuerySet, batch_size=1000):
    """
    Execute a queryset in batches in order to reduce the size of the results that
    have to fit in memory. The output is a generator that can be used in a simple
    for loop.

    Example:

    ```
    for entry in batched_query(User.objects.all()):
        entry.name = entry.name.lower()
        entry.save()
    ```
    """
    name = queryset.model.__name__
    total = queryset.count()
    for start in range(0, total, batch_size):
        end = min(start + batch_size, total)
        logger.debug(f"batched_query {name} processing {start} - {end} of {total}")
        for entry in queryset[start:end]:
            yield entry

I think it would be a great addition to Django’s core and I’d love to hear your feedback.

If we integrated it into the ORM, it could have a different API, something like:

for user in User.objects.all().in_batches(1000):
    user.name = user.name.lower()
    user.save()

What do you think?


I might be wrong, but is this like .iterator() with a chunk_size? See the QuerySet API reference in the Django documentation.
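For reference, here is a minimal sketch of how that existing API is used (assuming the same User model as above). Unlike the proposed helper, .iterator() streams rows from a single query rather than re-querying per batch:

# Django's built-in QuerySet.iterator() avoids caching the whole result set;
# chunk_size controls how many rows are fetched from the database driver at a time.
for user in User.objects.all().iterator(chunk_size=1000):
    user.name = user.name.lower()
    user.save()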

Walking through an entire table in batches using OFFSET and LIMIT is not going to perform well for any reasonably sized table: the database has to scan and discard every row before the requested offset, so each successive chunk gets slower. You can read more about why this is the case here.
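For example (a quick check assuming the stock auth User model; the printed SQL is approximate), a sliced queryset like the one in the helper above compiles straight to LIMIT/OFFSET, so every chunk has to skip past all the earlier rows:

# str(queryset.query) shows roughly the SQL Django will run for a slice.
print(User.objects.all()[100000:101000].query)
# SELECT ... FROM "auth_user" LIMIT 1000 OFFSET 100000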

The right way to do so is going to be heavily dependent on the database you are using and on whether the data set you are iterating over needs to be filtered or ordered in a specific way.

When you don’t care about filtering or ordering, the most efficient way is usually to walk the b-tree of the primary key with logic of the form:

# This assumes an integer primary key but any will do.
chunk_size = 1000
queryset = User.objects.order_by("pk")
last_pk = 0
while True:
    chunk = list(queryset.filter(pk__gt=last_pk)[:chunk_size])
    if not (chunk_len := len(chunk)):
        break
    yield from chunk
    if chunk_len < chunk_size:
        break
    last_pk = chunk[-1].pk

This will generate queries of the form

SELECT * FROM user WHERE id < $last_id LIMIT 1000

Such a query can be executed efficiently, since the range of matching rows can be located via id < $last_id, particularly on backends that cluster the data on disk by primary key.

All that to say: even if I’m not convinced of the necessity of adding such a feature to core, if we did so we most definitely wouldn’t want to use OFFSET / LIMIT.


@charettes correcting me on SQL is becoming a trend :joy: I put you in the 0.1% of developers who actually understand RDBMSs at scale. I’ll try this change in our apps soon after the holidays and I can’t wait to see the improvements! However, this demonstrates why I think it should be added to the ORM, for the rest of us who may not be well-versed in these mystical arts!


There might be a way to come up with a solution that works in most cases, but I fear the maintenance burden of trying to define and maintain such a solution.

Factors that must be considered are:

  1. The ordering must be total. This means that the set of fields used as the cursor must define a total ordering, i.e. be unique together: 'pk' is total by definition, 'updated_at' is not, while ('updated_at', 'pk') is (see the sketch after this list).
  2. Inserts must monotonically increase the cursor. In other words, there cannot be rows inserted into the table with a value lower than where the cursor currently is, otherwise you’ll miss them. You get this for free with auto-incrementing primary keys, and it’s possible to get it for other data types as well (e.g. ULIDs instead of random UUIDs).
  3. An index must be defined on the cursor, and ideally the data is clustered by it so there are fewer pages to seek when fetching a chunk.
  4. If the queryset is filtered solely against members of the cursor, that’s fine; otherwise you’re likely better off doing a full table scan by ID ranges instead. That is, instead of doing WHERE id > $last_id LIMIT N it might be preferable to walk the cursor entirely in ranges: WHERE id BETWEEN $chunk_n_min AND $chunk_n_max.
  5. If the table is very sparse for the chosen cursor (e.g. the filtered rows represent a small percentage of the full table), you might want to define an upper bound for the cursor so that rows are returned faster and you avoid scanning far ahead before reaching the limit every time.
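For illustration only, here is a minimal sketch of a keyset cursor over a composite ('updated_at', 'pk') ordering, as mentioned in factor 1 (the updated_at field and the helper name are hypothetical, and none of the other factors above are handled):

from django.db.models import Q


def batched_by_cursor(queryset, chunk_size=1000):
    # Walk a composite cursor ("updated_at", "pk"); adding "pk" makes the
    # ordering total because the primary key is unique.
    queryset = queryset.order_by("updated_at", "pk")
    last = None
    while True:
        chunk_qs = queryset
        if last is not None:
            last_updated_at, last_pk = last
            # Rows strictly "after" the cursor in (updated_at, pk) order.
            chunk_qs = chunk_qs.filter(
                Q(updated_at__gt=last_updated_at)
                | Q(updated_at=last_updated_at, pk__gt=last_pk)
            )
        chunk = list(chunk_qs[:chunk_size])
        if not chunk:
            break
        yield from chunk
        last = (chunk[-1].updated_at, chunk[-1].pk)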

This is one thing that is very hard to do well without having details about the topology and size of the data across all backends.

My package Django-MySQL has a “smart iteration” feature that is like a more advanced version of the snippet from @charettes. It copies its quite complicated strategy from the pt-online-schema-change tool: it dynamically adjusts its chunk size to target a certain time per chunk, and pauses when the database server has too much load.
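Roughly, usage looks like this (a sketch from memory; the exact API and options are in the django-mysql documentation, and it assumes the model opts into Django-MySQL’s QuerySet extensions):

# Requires the model's default manager to use django_mysql's extended QuerySet,
# e.g. by inheriting from django_mysql.models.Model.
for user in User.objects.iter_smart():
    user.name = user.name.lower()
    user.save()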

The feature is a lot of code. Whilst I’ve used it a bunch in production, it’s not fit for all use cases. It struggles with sparse tables, can’t handle non-PK iteration, and probably isn’t defensive enough. Extending it for other situations would be a major task, let alone other databases. So basically it’s in the situation that @charettes just shared: it’s hard to do well in all situations.

id > $last_id :wink:


Honestly, I used it so much in Rails that I naively thought it was simple to implement.

Should we still accept tickets that are really hard to do as aspirational tickets? Some tickets are open for years while people think about a solution.

In my humble opinion, yes. One person may eventually get interested in the ticket and manage to solve it. I’ve seen tickets that have been around for more than 10 years and are now making progress.

Sorry if this wasn’t clear but I’m volunteering for the implementation! I was validating the idea here first before creating the ticket.

Super, glad to hear it!

@betaflag I think this is a really good idea and I’d happily use it :slight_smile: