I have a class Document which can have multiple DocumentFiles associated with it. I’m using a model form to create my DocumentForm instances, which is itself wrapped in a formset.
class DocumentFile(models.Model):
    document = models.ForeignKey("Document", on_delete=models.SET_NULL, null=True, blank=True, related_name="files")
    file = models.FileField(upload_to="%Y/%m/%d/", blank=False, null=False)
    document_text = models.TextField(blank=True, null=True)
If the user uploads something like a PDF or a DOCX, I’d like to pass the raw file contents to an external library that extracts the text, so that I can store it in document_text and make it searchable later on.
All of the examples I have seen of processing uploaded file contents do it in the view and use request.FILES. That sounds OK, but I’m uploading multiple files at once with my formset and don’t want to mix up which file goes with which form. Is there a way I can access the file contents in the clean method of my form to cleanly (pun intended) separate the logic from the view?
Sort of related question while we are at it: if the user uploads a file larger than 2.5MB will I still be able to access the contents of it in memory? From the docs it sounds like there is a risk of chunking if I’m accessing it in memory, but I need to pass the whole thing into pypdf (or whatever library). If I could force writing the file to temp storage on disk whether it is big or small that would also be an option, I just don’t want to have two different cases.
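As a sketch of the "always write to disk" option, the FILE_UPLOAD_HANDLERS setting can list only the temporary-file handler. This is a settings fragment under the assumption that no other part of the project relies on in-memory uploads.

```python
# settings.py (sketch): drop the in-memory handler so every upload,
# large or small, is streamed to a temporary file on disk.
FILE_UPLOAD_HANDLERS = [
    "django.core.files.uploadhandler.TemporaryFileUploadHandler",
]
```

With only this handler active, each upload should arrive as a TemporaryUploadedFile, so temporary_file_path() is available for libraries that want a path. Separately, calling read() with no arguments on an uploaded file returns the entire contents whether it is held in memory or on disk; chunking only comes into play when chunks() is used explicitly.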