Accessing FileField data in a model form in order to parse pdf text

bixbyr · May 24, 2024, 6:33pm

I have a class Document which can have multiple DocumentFiles associated with it. I’m using a model form to create my DocumentForm instances, which is itself wrapped in a formset.

from django.db import models


class DocumentFile(models.Model):
    document = models.ForeignKey("Document", on_delete=models.SET_NULL, null=True, blank=True, related_name="files")
    file = models.FileField(upload_to="%Y/%m/%d/", blank=False, null=False)
    document_text = models.TextField(blank=True, null=True)

If the user uploads something like a PDF or a DOCX, I’d like to pass the raw file contents to an external library that extracts the text, so that I can store it in document_text and make it searchable later on.

All of the examples I have seen of processing uploaded file contents do it in the view and use request.FILES. That sounds OK, but I’m uploading multiple files at once with my formset and don’t want to mix up which file goes with which form. Is there a way to access the file contents in the clean method of my form, to cleanly (pun intended) separate the logic from the view?

Sort of related question while we are at it: if the user uploads a file larger than 2.5MB will I still be able to access the contents of it in memory? From the docs it sounds like there is a risk of chunking if I’m accessing it in memory, but I need to pass the whole thing into pypdf (or whatever library). If I could force writing the file to temp storage on disk whether it is big or small that would also be an option, I just don’t want to have two different cases.
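
To make the "one code path for small and large files" idea concrete, here is a minimal sketch. It assumes the uploaded object follows Django's UploadedFile interface (`seek()` and `chunks()`), which both the in-memory and the temporary-file variants provide. `read_upload_bytes` and `FakeUpload` are hypothetical names made up for this illustration; `FakeUpload` only stands in for a real upload so the snippet runs without Django.

```python
import io


def read_upload_bytes(uploaded):
    """Return the full contents of an upload-like object as bytes.

    Assumes the object follows Django's UploadedFile interface:
    seek() to rewind and chunks() to iterate over the data.
    """
    uploaded.seek(0)  # rewind in case something already read from it
    data = b"".join(uploaded.chunks())
    uploaded.seek(0)  # leave it rewound so a later save still sees all bytes
    return data


class FakeUpload(io.BytesIO):
    """Hypothetical stand-in for an uploaded file, for demonstration only."""

    def chunks(self, chunk_size=4):
        # Yield the contents in small pieces, as a chunked upload would.
        self.seek(0)
        while True:
            chunk = self.read(chunk_size)
            if not chunk:
                break
            yield chunk


if __name__ == "__main__":
    upload = FakeUpload(b"%PDF-1.7 example bytes")
    raw = read_upload_bytes(upload)
    # A parser such as pypdf would be handed io.BytesIO(raw) at this point.
    print(len(raw))
```

In a form, the helper would presumably be called on `self.cleaned_data["file"]` inside `clean()`, so each form in the formset only ever touches its own file. If always writing uploads to disk is preferred, setting `FILE_UPLOAD_HANDLERS` to only `django.core.files.uploadhandler.TemporaryFileUploadHandler` should do that, though the helper above does not depend on it.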
