Hello everybody!
Hope you are doing well.
I am currently trying to fix a bug in my code.
What I am trying to do: read and parse an uploaded PDF file WITHOUT saving the file.
The idea is to send the PDF to an API that extracts its contents, and then store the extracted text in a model in the database (SQLite for now).
HTML File:
<div class="create-wrapper">
  <form class="create-wrapper-form" name="create-form" action="{% url 'blog:create' %}" method="post" enctype="multipart/form-data">
    {% csrf_token %}
    <div>
      <label class="select-file" for="id_doc">Select File</label>
      {{ form.doc }}
    </div>
    <button class="create-button" type="submit">Create</button>
  </form>
</div>
Models:
from django.db import models

class Blog(models.Model):
    doc = models.FileField(default='', blank=True)
    doc_text = models.TextField(blank=True)
Views.py file:
from django.shortcuts import redirect, render

from . import forms  # assuming forms.py lives in the same app

def create(request):
    if request.method == 'POST':
        form = forms.CreateBlog(request.POST, request.FILES)
        if form.is_valid():
            instance = form.save(commit=False)
            instance.author = request.user
            instance.save()
            return redirect('blog:list')
    else:
        form = forms.CreateBlog()
    return render(request, 'workspace/create.html', {'form': form})
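For the view above, one way to keep the text but not the file is to fill in doc_text and blank out doc before calling save(). The following is only a minimal sketch: save_text_only and the extract_text callable are hypothetical names, and it rests on the assumption that an empty FileField value gives Django nothing to write to storage, which is worth verifying against your Django version. The stand-in objects at the bottom exist only so the sketch runs without Django.

```python
from types import SimpleNamespace


def save_text_only(instance, uploaded_file, extract_text):
    """Store the extracted text on the model instance and leave the file field empty."""
    instance.doc_text = extract_text(uploaded_file)
    # Assumption: with an empty FileField value there is no file for Django to store.
    instance.doc = ""
    instance.save()
    return instance


# Stand-ins so the sketch runs without Django: a fake instance and a fake extractor.
saved = []
fake_instance = SimpleNamespace(doc="upload.pdf", doc_text="")
fake_instance.save = lambda: saved.append(fake_instance.doc_text)
result = save_text_only(fake_instance, object(), lambda f: "extracted text")
```

In the view this would be called between form.save(commit=False) and the redirect, with request.FILES['doc'] as the uploaded file, in place of the direct instance.save() call.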
I tried parsing the PDF in the model, by passing self.doc to a function and parsing it there, but this did not work. I also tried handling it in the view using request.FILES[] and file.read(). I went through every Stack Overflow answer and link I could find, but every parsing approach I try fails with some error: “FileField object” errors, “InMemoryUploadedFile” errors, or encoding errors (“Unicode”, “invalid continuation byte”).
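The encoding errors described above are what you would expect when binary PDF data is decoded as text. A minimal sketch of that failure, using made-up bytes that only imitate the start of a PDF (a version header followed by a binary comment line), not a real file:

```python
# Illustrative bytes only: a PDF version header followed by a binary comment line.
pdf_start = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"

try:
    pdf_start.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    # exc.reason holds the short message, exc.start the offset of the offending byte.
    decode_error = (exc.reason, exc.start)

print(decode_error)
```

The point of the sketch is that the bytes returned by file.read() should be kept as bytes and handed to the parser unchanged, rather than decoded first.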
Recap of what I want to do:
- Get the file
- Pass the file to the API call, which accepts bytes (even this fails, with errors such as “has no attribute 'decode'”)
- Take the text that the API call extracted from the PDF and save that result, BUT DO NOT save the file.
Here is an example of what I tried:
import io
from tika import parser

def extract_text(file):  # file is the uploaded file taken from request.FILES
    decoded_file = file.read().decode('utf-8')
    io_string = io.StringIO(decoded_file)
    # parser.from_file() is the Apache Tika call I am using to extract text.
    # THIS PART HAS TO WORK, but it fails with a parsing or file-format error.
    read_pdf = parser.from_file(io_string)
    return read_pdf['content']
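A possible direction for the attempt above, as a hedged sketch rather than a confirmed fix: tika-python also provides parser.from_buffer(), which takes the file contents directly, so the raw bytes from file.read() can be passed without decoding them or wrapping them in StringIO. The names extract_pdf_text, parse_buffer and fake_parse_buffer are hypothetical and exist only for this illustration. The parser is injectable so the sketch can run without a Tika server.

```python
import io


def extract_pdf_text(uploaded_file, parse_buffer=None):
    """Return the text of an uploaded PDF without writing the file to disk."""
    if parse_buffer is None:
        # Imported lazily so this module loads even where tika is not installed.
        from tika import parser
        parse_buffer = parser.from_buffer

    # A PDF is binary data: keep the bytes as bytes, do not call .decode().
    raw_bytes = uploaded_file.read()
    parsed = parse_buffer(raw_bytes)
    # "content" can be None when no text was extracted, so fall back to "".
    return (parsed.get("content") or "").strip()


# Stand-in for Tika so the sketch can be exercised without a Tika server.
def fake_parse_buffer(data):
    assert isinstance(data, bytes)
    return {"metadata": {}, "content": "\n\nHello PDF\n"}


fake_upload = io.BytesIO(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n")
text = extract_pdf_text(fake_upload, parse_buffer=fake_parse_buffer)
```

In a Django view the uploaded_file argument would be request.FILES['doc'], and the returned string is what would go into doc_text.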
Your help would be greatly appreciated!
Thanks!