Optimizing Celery task by making fewer queries

I save the hashes of each FileField upload in my model, and I compare new uploads to the other hashes to prevent any duplicate files.

I have a celery task to remove files which are past their retention. It automatically deletes a file from the disk and database if 7 days passed and a file was uploaded with 7 days retention, etc.

But if many entries in the database are pointing to the same file, I only want to delete the entry from the database and not the file itself from the disk.

This is my solution, but I know that it’s not great because it makes an additional query for each file.

@shared_task
def delete_expired_files():
“”“Automatically delete files past their retention date.”“”

expired_files = UploadFile.objects.annotate(
    expired=F("uploaded_at") + F("retention")
).filter(expired__lt=timezone.now())

for file in expired_files:
    if file.retention != timedelta(days=-1):
        count = UploadFile.objects.filter(file=file.file).count()

        if count == 1: # Unique file, delete on disk and database
            file.delete()
            delete_file(file.file.path)

        else: # Non unique file, delete database entry
            file.delete()

I am a beginner to Django, so I can’t create complicated queries, and I have used ChatGPT to help me, but I have not been successful implementing the suggestions. But it points to it being possible to make a query that checks if every file within expired_files is referenced anywhere else in the database without a for loop.

Additional notes I’ve written:

# Filter for files that have expired
# Check that the file associated with an expired entry is only present once in the entire database
# If the file is only once in the database, and expired delete it from disk and database
# If the file is in the database multiple times, but an entry related to it is expired, delete the entry

Question:

What if you have multiple entries for a file, and all entries for that file are expired?

What database engine are you using? (PostgreSQL hopefully)

What I’m thinking of is a query that aggregates count by file name, annotated by the number of expired files by that name. Then you could build a queryset for all the entries to be deleted and issue a single delete operation on that queryset.

1 Like

Thank you for pointing this out, I had to take some time to think about this and I wound up rewriting the function. It is still making a request for every distinct file, which could be optimized, but you may agree that this is better. Please point out any logic errors again.

@shared_task
def delete_expired_files():
“”“Automatically delete files past their retention date.”“”
# Filter for files which have a retention date BEFORE now
now = timezone.now()

expired_files = UploadFile.objects.annotate(
    expired=F("uploaded_at") + F("retention")
    ).filter(expired__lt=now,
    retention__gt=timedelta(seconds=-1)
    
# Make a set of distinct files that are expired
distinct_expired_files = set(file.file for file in expired_files)

# Delete all expired files from the database
expired_files.delete()

# Check for each file if it still has any references in the database
for file in distinct_expired_files:
    if not UploadFile.objects.filter(file=file).exists():
        delete_file(file.path)

As of now I’m using sqlite3, I plan on using PostgreSQL when I am finished with my website.

This sounds great, but I changed a lot of code, so let me know if you have a new take on this. A batch deletion is what I was trying to attain myself, but could not piece together a query to do it.