Can't parse InMemoryUploadedFile object in Django using API

I get an error when passing an uploaded file into the Apache Tika PDF parser API.

HTML:

<div class="create-wrapper"> 
        <form class="create-wrapper-form" name="create-form" action="{% url 'workspace:create' %}" method="post" enctype="multipart/form-data">
            {% csrf_token %}
            <div class="create-container">
                    <label class="select-file" for="id_doc">Select File</label>
                    <input id="id_doc" name="doc" type="file">   
            </div>
            <button class="create-button" type="submit">Create</button>
        </form>  
    </div>

Views.py:

def stude_create(request):
    if request.method == 'POST':

        form = forms.CreateStude(request.POST)

        file = request.FILES['doc'].read() 
        read_pdf = parser.from_file(file)
        print(read_pdf['content'])  

        if form.is_valid(): 
            instance = form.save(commit=False)
            instance.author = request.user
            instance.save()
            return redirect('workspace:list')
    else:     
        form = forms.CreateStude()
    return render(request, 'workspace/stude_create.html', {'form':form})

Models:
I did not include it in the models, as you suggested.

First, when a traceback is requested here, please post the traceback from the process running runserver (or runserver_plus) or what’s logged by uwsgi (or gunicorn) - not the debug page from the browser.

However, what I can see is that the error is being thrown at this line:

    read_pdf = parser.from_file(file)

I’m not familiar with this api nor do I know what your reference to parser is - but I’m going to guess from the name of the method (from_file) that that method is expecting a file object to be passed to it.

However, the line previous to that:

    file = request.FILES['doc'].read()

Shows that file doesn’t contain a “file-like” object - it probably contains a byte object. (I don’t believe it contains a Python string - I’m fairly certain it would be a byte object.)

You need to verify your API usage to ensure you’re passing the right data into that method - or to look at it another way, that you find the right method for the data you have.
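To make the distinction concrete, here is a minimal sketch of the difference between a file-like object and the bytes returned by calling .read() on it. It uses io.BytesIO as a stand-in for the uploaded file, which is an assumption made only so the example runs without Django.

```python
import io

# io.BytesIO stands in for Django's uploaded-file object (an assumption for
# this sketch); both are file-like objects that expose a .read() method.
uploaded = io.BytesIO(b"%PDF-1.4 example bytes")

# Same shape as: file = request.FILES['doc'].read()
data = uploaded.read()

print(type(uploaded).__name__)  # the file-like object
print(type(data).__name__)      # the raw bytes read from it

# Only the file-like object can be read from; the bytes object cannot.
print(hasattr(uploaded, "read"), hasattr(data, "read"))
```

Once .read() has been called, the variable holds the file's contents rather than anything that behaves like a file or a file name.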

The API I am using is Apache Tika API which uses bytes to parse data. The problem is that the data format is correct but I am getting all these encoding errors. If you want you can test it by doing pip install tika and then from tika import parser.

Here is the full traceback.

Internal Server Error: /workspace/create/
Traceback (most recent call last):
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/site-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/site-packages/django/contrib/auth/decorators.py", line 21, in _wrapped_view
    return view_func(request, *args, **kwargs)
  File "/Users/person/Desktop/Startup/studE_v1.0.0/stude/workspace/views.py", line 69, in stude_create
    read_pdf = parser.from_file(file)
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/site-packages/tika/parser.py", line 40, in from_file
    output = parse1(service, filename, serverEndpoint, headers=headers, config_path=config_path, requestOptions=requestOptions)
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/site-packages/tika/tika.py", line 327, in parse1
    path, file_type = getRemoteFile(urlOrPath, TikaFilesPath)
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/site-packages/tika/tika.py", line 762, in getRemoteFile
    urlp = urlparse(urlOrPath)
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/urllib/parse.py", line 389, in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/urllib/parse.py", line 125, in _coerce_args
    return _decode_args(args) + (_encode_result,)
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/urllib/parse.py", line 109, in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
  File "/Users/person/.pyenv/versions/3.9.0/lib/python3.9/urllib/parse.py", line 109, in <genexpr>
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 10: ordinal not in range(128)

Would you happen to know a way to extract text from PDF using a different API you are aware of by any chance to get this to work? I have tried others but failed because of this decode error and ASCII. I tried even formatting using this and doing something like ignore but failed.

The main problem is that nothing seems able to decode, or offer a friendly format for, an InMemoryUploadedFile object.

Thanks for the help by the way!

Once you execute the following line:

    file = request.FILES['doc'].read()

The variable file is not a file object. It’s a bytes object. Are you absolutely sure that a method named from_file is going to work when passed a bytes object? Can you point to any documentation saying so?
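The traceback posted earlier is consistent with this. Its last frames are in urllib's urlparse, which tries to decode a bytes argument as ASCII. The sketch below reproduces that failure using only the standard library. The sample bytes are a hypothetical stand-in for the start of a PDF, not the actual uploaded file.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the first bytes of a PDF: an ASCII header line
# followed by a binary comment line that contains non-ASCII bytes.
pdf_bytes = b"%PDF-1.4\n%\xc2\xa5\xc2\xb1\xc3\xab\n"


def try_urlparse(value):
    """Return the UnicodeDecodeError raised by urlparse, or None if it succeeds."""
    try:
        urlparse(value)
    except UnicodeDecodeError as exc:
        return exc
    return None


# Passing file contents: urlparse decodes bytes as ASCII and fails.
error = try_urlparse(pdf_bytes)
print(error)

# Passing a path string (what a path-or-URL parameter expects) parses fine.
print(try_urlparse("/tmp/example.pdf"))
```

In other words, the decode error is not about the PDF's encoding. It is a side effect of file contents being treated as a path or URL.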

Actually, you are correct, my bad. It can't process bytes, only file objects. I thought it could.
Tika in action: Parsing PDFs in Python with Tika - GeeksforGeeks
Would you happen to know how to turn it into a file object, or at least something other than an InMemoryUploadedFile object, for processing?
Thanks!

Do you have any actual documentation for this library other than blog posts? I would expect that other APIs would be identified. (The blog post you referenced only shows passing in a file name, not a file-like object - but since you’re not intending on saving this as a file in the file system, that’s not very helpful.)

Yes of course. Here are some links on Apache Tika parser interface:
https://tika.apache.org/1.11/parser.html

https://tika.apache.org/2.0.0/formats.html

At this point, this isn’t really a Django question. You might find more direct assistance in a group that supports that library.

To be honest, I don't really care which API I use, only how to parse the file. Once I get the file using file = request.FILES['doc'], I am not sure how to turn it into a file object or something readable.

This line:

    file = request.FILES['doc']

gives you a “file-like” object. (Not an actual file in the file system)

This line:

    file = request.FILES['doc'].read()

gives you the bytes object with the data from the file.

Saving that file as a field in a model would give you a “real” file-system-based file - but you started out with the stipulation that you didn’t want to do that.

This handles the Django side of it.

Well, if that's the case: when I passed request.FILES["doc"] into the API, it complained that it had received an InMemoryUploadedFile object, not a file object.

Correct, because the from_file method accepts a file name parameter, not a file object.
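If a path-based method is what you have, one way to bridge the gap without adding a model field is to copy the upload to a temporary file and pass that file's name. This is a rough sketch, not a tested integration. The helper name save_upload_to_temp is hypothetical, and io.BytesIO again stands in for the uploaded file.

```python
import io
import os
import tempfile


def save_upload_to_temp(uploaded, suffix=".pdf"):
    """Copy a file-like object to a temporary file and return the file's path.

    `uploaded` only needs a .read(size) method. The caller is responsible
    for deleting the file afterwards.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        # Copy in blocks so a large upload is never held in memory at once.
        while True:
            block = uploaded.read(64 * 1024)
            if not block:
                break
            tmp.write(block)
        return tmp.name


# io.BytesIO stands in for request.FILES['doc'] (an assumption for this sketch).
upload = io.BytesIO(b"%PDF-1.4 example")
path = save_upload_to_temp(upload)
try:
    # A path-based parser (such as the from_file call in the view) would be
    # given `path` here; this sketch just reads the file back instead.
    with open(path, "rb") as fh:
        round_trip = fh.read()
finally:
    os.remove(path)  # clean up the temporary file

print(round_trip)
```

The try/finally matters here: with delete=False the temporary file stays on disk until it is removed explicitly.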

Ok so I would need to find an API that accepts a file object instead I guess?

That is one of your three options, yes.

Ok awesome. What are the other 2 options, other than saving it?

See my earlier response - Can't parse InMemoryUploadedFile object in Django using API - #17 by KenWhitesell

Ok awesome. Thank you so much! I will reply again if nothing works. Hopefully this is the last time.
Thanks @KenWhitesell

Just to add a note here: Ken is nudging you to look at the documentation, but you might be looking at the wrong documentation.

You linked the Apache Tika API documentation, however you are using the tika-python package which wraps around it. Have a look here and here, I suspect this will help combined with the info that Ken already shared.
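As far as I recall, the tika-python README describes a parser.from_buffer function alongside parser.from_file, intended for content you already hold in memory. Please treat that as an assumption and check it against the package's own documentation. The sketch below keeps the parser as a parameter so it runs without tika installed. extract_text and fake_parse_buffer are hypothetical names used only for illustration.

```python
def extract_text(data, parse_buffer):
    """Pass raw bytes to a buffer-based parser and return its 'content' value.

    `parse_buffer` is any callable that takes bytes and returns a dict with
    a 'content' key, which is the shape the view in this thread reads from.
    """
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("expected bytes, got %s" % type(data).__name__)
    parsed = parse_buffer(data)
    return parsed.get("content")


def fake_parse_buffer(data):
    """Stub parser for this sketch only; it does not parse anything."""
    return {"metadata": {}, "content": "%d bytes received" % len(data)}


text = extract_text(b"%PDF-1.4 example", fake_parse_buffer)
print(text)
```

In the view, and assuming from_buffer behaves as described, the call would look something like extract_text(request.FILES['doc'].read(), parser.from_buffer), which keeps the existing .read() line and swaps only the parser method.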

Awesome, thanks @awtimmering