TL;DR: how can I delete a large number of objects efficiently in an incremental way?
I need to periodically delete millions of rows from a Django-managed table (which lives in PostgreSQL, if that matters). These objects have a pre_delete signal receiver (for most objects it does nothing; for some it makes an HTTP request to another service). It is fine if the objects are not all deleted at the same time. Initially I had a cron job doing something like this:
for o in Model.objects.filter(is_expired=True):
    o.delete()
This turned out to be impractical: it consumed huge amounts of memory and took forever, running one query per object. Then I tried a bulk delete:
Model.objects.filter(is_expired=True).delete()
Contrary to what the documentation suggests, the pre_delete signal does seem to be emitted (which is good, but maybe the documentation needs fixing?). However, it still consumes a lot of memory and, when it fails, nothing is deleted and I need to start from scratch.
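For reference, the pre_delete receiver I mentioned looks roughly like this (a simplified sketch; needs_notification, the URL and the receiver name are made up for illustration):

import requests
from django.db.models.signals import pre_delete
from django.dispatch import receiver

@receiver(pre_delete, sender=Model)
def on_model_pre_delete(sender, instance, **kwargs):
    # Does nothing for most objects; a few need another service to be told
    # before the row disappears.
    if instance.needs_notification:
        requests.post(
            "https://other-service.example/cleanup",
            json={"id": instance.pk},
            timeout=10,
        )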
I ended up resolving my problem by batching:
from django.core.paginator import Paginator

paginator = Paginator(Model.objects.filter(is_expired=True).reverse(), batch_size)
for page_number in reversed(paginator.page_range):
    ids = paginator.page(page_number).object_list.values('id')
    Model.objects.filter(id__in=ids).delete()
This works well but is not very nice. I need to double-reverse to be able to delete the whole set while iterating over it. I need to build a separate QuerySet for each batch as DELETE does not support LIMIT/OFFSET.
Is there a better way to do this?
Django’s delete() already batches the DELETEs, but AFAICT the batch size is hardcoded and there is no way to run it outside of a transaction. Would it make sense to implement these features in Django? I imagine something like Model.objects.filter(is_expired=True).delete(allow_partial=True, batch_size=8012)
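To be explicit about what I have in mind, the semantics would be roughly equivalent to this helper (purely hypothetical; the name and behaviour are just my suggestion):

from django.db import transaction

def delete_in_batches(queryset, batch_size=1000):
    # Delete the queryset in independent batches so that a failure only
    # rolls back (at most) the current batch instead of everything.
    pks = queryset.values_list("pk", flat=True)
    while True:
        batch = list(pks[:batch_size])  # re-evaluated on every iteration
        if not batch:
            break
        with transaction.atomic():
            queryset.model.objects.filter(pk__in=batch).delete()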
hello there!
I think you might have misread the documentation:

The delete() method does a bulk delete and does not call any delete() methods on your models. It does, however, emit the pre_delete and post_delete signals for all deleted objects (including cascaded deletions).
On the topic of deletion itself.
I’m pretty sure this will not do what you want and will be relatively expensive given how poorly LIMIT / OFFSET behaves on large data sets.
First, it won’t do what you expect (deleting all the rows): as you page forward through the results (assuming you have defined a total order on them), each page you fetch is computed against a set from which the previous pages have already been deleted.
Say you have 10,000 rows matching is_expired, ordered by their primary key. The first iteration finds the first batch_size rows and deletes them. The thing is that when the second iteration runs, the second page has become the first and the third page has become the second, so you’ll skip over what was originally the second page. In other words this approach will only delete half the records, as all the initially even-numbered pages are skipped.
A better approach would be to keep fetching and deleting records until none are left, letting the database figure out the most efficient way to do so. Something like
chunk_size = 1000
queryset = (
    Model.objects.filter(
        is_expired=True,
    )
    .order_by()
    .values_list("pk", flat=True)
)
while True:
    deleted, _ = Model.objects.filter(
        pk__in=queryset[:chunk_size]
    ).delete()
    if not deleted:
        break
This will perform queries of the form

SELECT * FROM model WHERE id IN (
    SELECT id FROM model WHERE is_expired LIMIT ?
)

DELETE FROM model WHERE id IN ?

for each iteration if you have any pre_delete or post_delete signal receivers registered or other models pointing at your model with ForeignKeys, or

DELETE FROM model WHERE id IN (
    SELECT id FROM model WHERE is_expired LIMIT ?
)

otherwise.
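If you want to see which of the two shapes you are getting in practice, one way (a sketch using Django’s test utilities; Model, queryset and chunk_size are taken from the snippet above) is to capture the executed queries:

from django.db import connection
from django.test.utils import CaptureQueriesContext

# Run a single chunk and print the SQL Django actually issued for it.
with CaptureQueriesContext(connection) as ctx:
    Model.objects.filter(pk__in=queryset[:chunk_size]).delete()

for query in ctx.captured_queries:
    print(query["sql"])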
Thank you for your reply.
I think you might have misread the documentation
Ah. On this page it says:
Keep in mind that this will, whenever possible, be executed purely in SQL, and so the delete() methods of individual object instances will not necessarily be called during the process.
That got me confused I guess.
First it won’t do what you expect (deleting all the rows)
If you delete forward, it indeed behaves the way you describe. That is why I delete from the back: it does delete all the rows as expected.
A better approach would be to keep fetching and deleting records until none are left
That is an interesting idea that I might try. Thanks.
I like how the queryset[:chunk_size] approach makes the code simpler. I see how it’s supposed to be faster too, but in practice for my use case it doesn’t change performance much, as at this point the bottleneck is more on the deletions themselves.
If that is the case, it likely means you either have other models referring to the one you’re chunk-deleting, so the dependency graph has to be fetched into memory, or have at least one pre_delete or post_delete signal receiver attached to your model; either one forces the fast deletion path (DELETE FROM model WHERE id IN (...)) to be disabled.
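If you want to check which of the two it is, a quick sketch (Model stands for the model being deleted):

from django.db.models.signals import pre_delete, post_delete

# True if any receiver is connected for this model; their mere presence
# means Django has to fetch the rows in order to emit the signals.
print(pre_delete.has_listeners(Model))
print(post_delete.has_listeners(Model))

# Reverse relations pointing at this model; cascading over them also
# requires collecting the dependent objects in memory.
for relation in Model._meta.related_objects:
    print(relation.related_model)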