Can't parse InMemoryUploadedFile object in Django using API

Hello everybody!

Hope you are doing well.
I am currently trying to fix a bug in my code.
What I am trying to do: process, parse, and read a PDF file WITHOUT saving the file.
That is, I would parse the PDF, get its contents using an API, and then store the result in a model in the database (SQLite for now).

HTML File:

<div class="create-wrapper">
    <form class="create-wrapper-form" name="create-form" action="{% url 'blog:create' %}" method="post" enctype="multipart/form-data">
        {% csrf_token %}
        <div>
            <label class="select-file" for="id_doc">Select File</label>
            {{ form.doc }}
        </div>
        <button class="create-button" type="submit">Create</button>
    </form>
</div>

Models:

class Blog(models.Model):
    doc = models.FileField(default='', blank=True)
    doc_text = models.TextField(blank=True)

Views.py file:

def create(request):
    if request.method == 'POST':

        form = forms.CreateBlog(request.POST, request.FILES)

        if form.is_valid(): 
            instance = form.save(commit=False)
            instance.author = request.user
            instance.save()
            return redirect('blog:list')
    else:     
        form = forms.CreateBlog()
    return render(request, 'workspace/create.html', {'form':form})

I have tried parsing the PDF in the model (getting the doc via self.doc and passing it to a function that parses it), but this did not work. I also tried handling it in the view using request.FILES[] and file.read(). I went through every Stack Overflow answer and link I could find, but whatever parsing I try fails with an error such as “FileField object error / InMemoryUploadedFile error” or an “encoding, Unicode, continuation byte” error.

Recap: what do I want to do?

  • Get the file
  • Parse the file for the API call (it accepts bytes, but even then I get errors such as “has no decode”)
  • Get the text that the API call extracted from the PDF and save that result, BUT DO NOT save the file.

Here is an example of what I tried:

import io

from tika import parser

decoded_file = file.read().decode('utf-8')
io_string = io.StringIO(decoded_file)
# parser.from_file() is the Apache Tika API I am using to extract text.
# THIS PART HAS TO WORK, but it didn't because of some parsing or file-format error.
read_pdf = parser.from_file(io_string)
return read_pdf['content']

Your help would be greatly appreciated!

Thanks!

If you’re not looking to save this file, don’t associate the field with a Model. (Make sure your form - if it is a model form - doesn’t have a file field. The form field for the file should not be part of any model.)

See the docs on Uploaded Files and Upload Handlers for the API to reference the file having been uploaded.
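
As a rough sketch of reading the upload without attaching it to a model: Django's UploadedFile objects provide chunks(), so the data can be collected in memory. The helper name read_upload_bytes, the size limit, and the FakeUpload stand-in (used only so the snippet runs outside Django) are assumptions for illustration, not part of the original code.

```python
import io


def read_upload_bytes(uploaded_file, max_bytes=10 * 1024 * 1024):
    """Return the raw bytes of an uploaded file without writing it to disk.

    `uploaded_file` is expected to behave like Django's UploadedFile,
    i.e. to provide chunks(). The size limit is an illustrative assumption.
    """
    buffer = io.BytesIO()
    for chunk in uploaded_file.chunks():
        buffer.write(chunk)
        if buffer.tell() > max_bytes:
            raise ValueError("upload is larger than the allowed maximum")
    return buffer.getvalue()


class FakeUpload:
    """Stand-in for request.FILES['doc'] so the sketch runs outside Django."""

    def __init__(self, data, chunk_size=4):
        self._data = data
        self._chunk_size = chunk_size

    def chunks(self):
        # Yield the data piece by piece, as an uploaded file would.
        for start in range(0, len(self._data), self._chunk_size):
            yield self._data[start:start + self._chunk_size]


pdf_bytes = read_upload_bytes(FakeUpload(b"%PDF-1.4 example"))
```

In a view this would presumably be called as read_upload_bytes(request.FILES['doc']), with only the extracted text assigned to a model field such as doc_text.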

Thanks for the quick reply @KenWhitesell. I will try that

I have tried that, and it failed.

Here is the updated code. Django keeps the temporary upload as an InMemoryUploadedFile.

<div class="create-wrapper">
    <form class="create-wrapper-form" name="create-form" action="{% url 'blog:create' %}" method="post" enctype="multipart/form-data">
        {% csrf_token %}
        <div>
            <label class="select-file" for="id_doc">Select File</label>
            <input id="id_doc" name="doc" type="file">
        </div>
        <button class="create-button" type="submit">Create</button>
    </form>
</div>

Views.py:

def create(request):
    if request.method == 'POST':

        form = forms.CreateBlog(request.POST)

        file = request.FILES['doc'].read() 
        read_pdf = parser.from_file(file)
        print(read_pdf['content'])  

        if form.is_valid(): 
            instance = form.save(commit=False)
            instance.author = request.user
            instance.save()
            return redirect('workspace:list')
    else:     
        form = forms.CreateBlog()
    return render(request, 'workspace/create.html', {'form':form})

I get the same error as I did before: ‘InMemoryUploadedFile’ object has no attribute ‘decode’

Parsing an InMemoryUploadedFile is the number one problem, whether or not it goes through the models. I just can’t find a way to parse it.

Please post the complete traceback you’re receiving, along with the complete view and any user-written functions being executed by that view.

Something is trying to call “decode” on the file rather than on the data within the file.

There is an error when passing the file into the Apache Tika PDF parser API.

HTML:

<div class="create-wrapper"> 
        <form class="create-wrapper-form" name="create-form" action="{% url 'workspace:create' %}" method="post" enctype="multipart/form-data">
            {% csrf_token %}
            <div class="create-container">
                    <label class="select-file" for="id_doc">Select File</label>
                    <input id="id_doc" name="doc" type="file">   
            </div>
            <button class="create-button" type="submit">Create</button>
        </form>  
    </div>

Views.py:

def stude_create(request):
    if request.method == 'POST':

        form = forms.CreateStude(request.POST)

        file = request.FILES['doc'].read() 
        read_pdf = parser.from_file(file)
        print(read_pdf['content'])  

        if form.is_valid(): 
            instance = form.save(commit=False)
            instance.author = request.user
            instance.save()
            return redirect('workspace:list')
    else:     
        form = forms.CreateStude()
    return render(request, 'workspace/stude_create.html', {'form':form})

Models:
I did not include it in the models, as you suggested.

First, when a traceback is requested here, please post the traceback from the process running runserver (or runserver_plus) or what’s logged by uwsgi (or gunicorn) - not the debug page from the browser.

However, what I can see is that the error is being thrown at this line:

    read_pdf = parser.from_file(file)

I’m not familiar with this api nor do I know what your reference to parser is - but I’m going to guess from the name of the method (from_file) that that method is expecting a file object to be passed to it.

However, the line previous to that:

    file = request.FILES['doc'].read()

Shows that file doesn’t contain a “file-like” object - it probably contains a bytes object. (I don’t believe it contains a Python string - I’m fairly certain it would be a bytes object.)

You need to verify your API usage to ensure you’re passing the right data into that method - or to look at it another way, that you find the right method for the data you have.
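
One way to check what is actually being handed to a parsing method is to classify the value first. The sketch below is only an illustration: classify_input is a hypothetical helper, and io.BytesIO stands in for request.FILES['doc'].

```python
import io
import os


def classify_input(value):
    """Roughly classify what kind of input a parsing API is being given."""
    if isinstance(value, (str, os.PathLike)):
        return "path"
    if isinstance(value, (bytes, bytearray)):
        return "bytes"
    if hasattr(value, "read"):
        return "file-like"
    return "unknown"


# io.BytesIO stands in for request.FILES['doc'] (an assumption for this sketch).
uploaded = io.BytesIO(b"%PDF-1.4 sample data")

kind_of_upload = classify_input(uploaded)            # the upload object itself
kind_of_contents = classify_input(uploaded.read())   # what read() hands back
kind_of_name = classify_input("report.pdf")          # a file name or path

# After read() the position is at the end, so a second read() returns b"".
# Rewind with seek(0) before handing the same object to anything else.
second_read = uploaded.read()
uploaded.seek(0)
after_rewind = uploaded.read()
```

A method that expects a path, a bytes value, or a file-like object will generally only accept that one kind, so the three cases are worth keeping apart.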

The API I am using is the Apache Tika API, which uses bytes to parse data. The data format is correct, but I am getting all these encoding errors. If you want, you can test it by running pip install tika and then from tika import parser.

Here is the full traceback.

Internal Server Error: /workspace/create/
Traceback (most recent call last):
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/site-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/site-packages/django/contrib/auth/decorators.py", line 21, in _wrapped_view
    return view_func(request, *args, **kwargs)
  File "/Users/person/Desktop/Startup/studE_v1.0.0/stude/workspace/views.py", line 69, in stude_create
    read_pdf = parser.from_file(file)
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/site-packages/tika/parser.py", line 40, in from_file
    output = parse1(service, filename, serverEndpoint, headers=headers, config_path=config_path, requestOptions=requestOptions)
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/site-packages/tika/tika.py", line 327, in parse1
    path, file_type = getRemoteFile(urlOrPath, TikaFilesPath)
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/site-packages/tika/tika.py", line 762, in getRemoteFile
    urlp = urlparse(urlOrPath)
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/urllib/parse.py", line 389, in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/urllib/parse.py", line 125, in _coerce_args
    return _decode_args(args) + (_encode_result,)
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/urllib/parse.py", line 109, in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/urllib/parse.py", line 109, in <genexpr>
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 10: ordinal not in range(128)

Would you happen to know of a different API for extracting text from a PDF that would get this to work? I have tried others, but they failed with the same ASCII decode error. I even tried reformatting the data and ignoring decode errors, but that failed too.

The main problem is that nothing is able to encode/decode an InMemoryUploadedFile object or offers a friendly format for it.

Thanks for the help by the way!

Once you execute the following line:

    file = request.FILES['doc'].read()

The variable file is not a file object. It’s a bytes object. Are you absolutely sure that a method named from_file is going to work when passed a bytes object? Can you point to any documentation saying so?
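
The traceback above shows from_file handing its argument to urllib.parse.urlparse, which expects a URL or path. Both errors in this thread can be reproduced with the standard library alone. This is a sketch: the sample bytes are made up to resemble a PDF header, and io.BytesIO stands in for the uploaded file object.

```python
import io
from urllib.parse import urlparse

# Made-up bytes that resemble the start of a PDF file.
pdf_bytes = b"%PDF-1.4\n%\xc2\xa5\xc2\xb1\xc3\xab\n"

# Case 1: bytes. urlparse tries to decode them as ASCII and fails on 0xc2.
try:
    urlparse(pdf_bytes)
    bytes_error = None
except UnicodeDecodeError as exc:
    bytes_error = exc

# Case 2: a file-like object. It has no decode() method at all.
try:
    urlparse(io.BytesIO(pdf_bytes))
    file_error = None
except AttributeError as exc:
    file_error = exc

# Case 3: a str path, which is the kind of value urlparse is meant for.
parsed = urlparse("/tmp/example.pdf")
```

So the same call produces “has no attribute 'decode'” for the upload object and an ASCII UnicodeDecodeError for its bytes; neither one is a path.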

Actually, you are correct, my bad: it can’t process bytes, only file objects. I thought it did.
Tika in action: Parsing PDFs in Python with Tika - GeeksforGeeks
Would you happen to know how to turn it into a file object, or at least something other than an InMemoryUploadedFile object, for processing?
Thanks!

Do you have any actual documentation for this library other than blog posts? I would expect that other APIs would be identified. (The blog post you referenced only shows passing in a file name, not a file-like object - but since you’re not intending on saving this as a file in the file system, that’s not very helpful.)

Yes of course. Here are some links on Apache Tika parser interface:
https://tika.apache.org/1.11/parser.html

https://tika.apache.org/2.0.0/formats.html

At this point, this isn’t really a Django question. You might find more direct assistance in a group that supports that library.

To be honest, I don’t really care which API I use, but rather how to parse the file. Once I get the file using file = request.FILES['doc'], I am not sure how to turn it into a file object or something readable.

This line:

    file = request.FILES['doc']

gives you a “file-like” object. (Not an actual file in the file system)

This line:

    file = request.FILES['doc'].read()

gives you the bytes object with the data from the file.

Saving that file as a field in a model would give you a “real” file-system-based file - but you started out with the stipulation that you didn’t want to do that.

This handles the Django side of it.

Well, if that’s the case: when I passed request.FILES["doc"] into the API, it said it can’t read an InMemoryUploadedFile object, as it is not a file object.

Correct, because the from_file method accepts a file name parameter, not a file object.
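
If a path-based method is the only option, one possible workaround is a short-lived temporary file that is deleted as soon as parsing finishes, so nothing is kept. This is a sketch under assumptions: parse_via_temp_file and fake_parse_path are hypothetical names, and fake_parse_path merely stands in for a path-based call such as parser.from_file. The tika package also appears to offer parser.from_buffer for passing data directly, which would be worth checking in its own documentation.

```python
import os
import tempfile


def parse_via_temp_file(data, parse_path, suffix=".pdf"):
    """Write `data` to a temporary file, call parse_path(path), then delete it."""
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(data)
        tmp.close()  # close first so the parser can open the path on any platform
        return parse_path(tmp.name)
    finally:
        tmp.close()          # harmless if already closed
        os.unlink(tmp.name)  # the file never outlives this call


def fake_parse_path(path):
    """Hypothetical stand-in for a parser that only accepts a file name."""
    with open(path, "rb") as handle:
        return {"content": handle.read().decode("latin-1"), "path": path}


result = parse_via_temp_file(b"%PDF-1.4 demo", fake_parse_path)
```

In the view, data would be request.FILES['doc'].read(), and only result['content'] would be stored.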

OK, so I guess I would need to find an API that accepts a file object instead?
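
For an API that does accept a binary file object, the upload from request.FILES['doc'] is already file-like and can usually be passed as is; if only the bytes are left, io.BytesIO wraps them back into a file-like object. A minimal sketch, where read_pdf_version is a hypothetical stand-in for such an API (some PDF libraries, pypdf for example, are documented as accepting streams, but that should be verified for whichever library is chosen):

```python
import io


def read_pdf_version(fileobj):
    """Hypothetical stand-in for a parser that accepts a binary file object."""
    header = fileobj.read(8)
    fileobj.seek(0)  # leave the stream rewound for whoever uses it next
    if not header.startswith(b"%PDF-"):
        raise ValueError("not a PDF")
    return header[5:8].decode("ascii")


data = b"%PDF-1.4\nrest of the document"  # e.g. request.FILES['doc'].read()
stream = io.BytesIO(data)                 # bytes wrapped as a file-like object
version = read_pdf_version(stream)
```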