I save a hash of each FileField upload in my model, and I compare new uploads against the stored hashes to prevent duplicate files.
I have a Celery task that removes files which are past their retention period. For example, a file uploaded with a 7-day retention is automatically deleted from both the disk and the database once 7 days have passed.
But if several entries in the database point to the same file, I only want to delete the expired entry from the database and leave the file itself on disk.
This is my solution, but I know that it’s not great because it makes an additional query for each file.
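from datetime import timedelta

from celery import shared_task
from django.db.models import F
from django.utils import timezone

# UploadFile and delete_file come from my own app (imports not shown).

@shared_task
def delete_expired_files():
    """Automatically delete files past their retention date."""
    # An entry is expired once uploaded_at + retention lies in the past.
    expired_files = UploadFile.objects.annotate(
        expired=F("uploaded_at") + F("retention")
    ).filter(expired__lt=timezone.now())
    for file in expired_files:
        # Entries with a retention of -1 days are skipped.
        if file.retention != timedelta(days=-1):
            # One extra query per expired entry: how many rows share this file?
            count = UploadFile.objects.filter(file=file.file).count()
            if count == 1:  # Unique file, delete on disk and database
                file.delete()
                delete_file(file.file.path)
            else:  # Non-unique file, delete database entry only
                file.delete()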
I am a beginner with Django, so I struggle to write complicated queries. I have used ChatGPT to help me, but I have not managed to get its suggestions working. They do suggest, however, that it is possible to write a query that checks whether each file within expired_files is referenced anywhere else in the database, without a for loop.
Additional notes I’ve written:
# Filter for entries that have expired
# Check whether the file associated with an expired entry is present only once in the entire database
# If the file is present only once in the database and its entry is expired, delete it from disk and database
# If the file is in the database multiple times and one of its entries is expired, delete only that entry
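
To make the logic in these notes concrete, here is a minimal plain-Python sketch of the decision, outside of Django. It is only an illustration under assumptions: `plan_deletions`, the `(entry_id, file_name)` pairs and the sample data are hypothetical stand-ins for the UploadFile rows, not code from the project. The idea is to count references per file once, instead of running one query per expired entry.

```python
from collections import Counter


def plan_deletions(entries, expired_ids):
    """Decide which rows and which files on disk can be removed.

    entries: iterable of (entry_id, file_name) pairs, one per database row
             (hypothetical stand-in for UploadFile rows).
    expired_ids: set of entry ids that are past their retention.
    """
    entries = list(entries)
    # Total number of rows that reference each file.
    total_refs = Counter(name for _, name in entries)
    # Number of expired rows that reference each file.
    expired_refs = Counter(
        name for entry_id, name in entries if entry_id in expired_ids
    )

    # Every expired row is removed from the database.
    rows_to_delete = sorted(
        entry_id for entry_id, _ in entries if entry_id in expired_ids
    )
    # A file leaves the disk only if no non-expired row still points at it.
    files_to_delete = {
        name for name, n in expired_refs.items() if n == total_refs[name]
    }
    return rows_to_delete, files_to_delete


# Made-up sample data: a.pdf is shared by an expired and a live entry,
# b.pdf is unique, c.pdf is shared by two entries that are both expired.
entries = [(1, "a.pdf"), (2, "a.pdf"), (3, "b.pdf"), (4, "c.pdf"), (5, "c.pdf")]
rows, files = plan_deletions(entries, expired_ids={1, 3, 4, 5})
print(rows)
print(sorted(files))
```

Comparing the expired count with the total count also covers a case the notes do not mention: when every entry that references a file expires in the same run, the file can be removed from disk as well. In Django, both counts could presumably come from one grouped query using `values("file")` with `Count` and its `filter` argument, but that part is an untested assumption here.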