Recommended Method to Process Files in Pandas

I am working on converting my collection of Python scrips to Django versions - most of these scrips would load files from a directory and be run from the terminal.

Now, instead of having files loaded from a terminal interface, I want to allow for a user to select files from a webform and then django will get those processed by pandas.

I have been reading the documentation for file uploads and can’t quite determine the best way to do it.

For reference, I don’t want anything stored to a database, ideally I would just have the files loaded into memory or a temporary directory on the server.

The process is: two files are selected (usually xlsx or csv) and then processed by pandas. Once this is complete, I don’t know the best way to take this dataframe and output it to a file that the browser will treat as a file download.

What I have so far:

html
<form class="" action="" method="post" enctype="multipart/form-data" >
        {% csrf_token %}
        <h3>Identify New Products</h3>
        <input type="file" name="sheet1" id="current_sheet" style="display:none;"/>
        <label for="current_sheet" class="btn btn-secondary">Attach Current Vendor Sheet</label>
        <input type="file" name="sheet2" id="new_sheet" style="display:none;"/>
        <label for="new_sheet" class="btn btn-secondary">Attach New Vendor Sheet</label>
        <input class="btn btn-primary" type="submit" value="Submit">
</form>
views
def newskus(request):
#get files from form and define for pandas
    if request.method == 'POST':
        current_sheet = request.FILES['current_sheet']
        new_sheet = request.FILES['new_sheet']

#set up pandas variables and define process
        current_colname = "VPN"
        new_colname = "PartNo"

        df_current = pd.read_excel(current_sheet, index_col=0, dtype=str)
        df_new = pd.read_excel(new_sheet, index_col=0, dtype=str)
        set_current = set(df_current[current_colname])
        df_output = pd.DataFrame(columns=df_new.columns)

        for i in range(len(df_new.index)):
            if not df_new[new_colname][i] in set_current:
                df_output.loc[len(df_output)] = df_new.iloc[i]

#dataframe is complete, now we have to output to a file for download
# ...
    return render(request, 'newskus.html', {})

I searches on the Django documentation site didn’t yield any good results, and most of what I have found searching online is for much older versions of Django.

Some of this may depend upon the operating system that your server is running on.
Generally speaking, you’ll want to create a temporary file into which you’ll write the data. If you know it’s going to be small enough, you can do it as a memory-based file.
But either way, you’re probably going to want to return the response as a FileResponse object rather than as a rendered template.

However, keep in mind that any files you create are not going to be automatically deleted. The file needs to exist through the duration of the response, so you can’t delete it in the view. You are likely going to want to track the files being created in a database to be able to perform periodic cleanup.

Yeah, I didn’t find the best way recommended to output a file from pandas, so I will look into your link.

I also found by digging in the documentation that request.files looks to have been replaced by httprequest.files so I will have to look into that too.

I think I had read anything over 2.5MB will go into the temporary files directory and not memory? I am going to guess that some files will be over that limit, so will have to consider that.

I was just going to empty out the temp folders on a weekly basis, but what you recommend sounds smart. I should look into dumping something into the admin accessible database so I can see what is going on and remove files.

Thanks!

That’s on the upload side, and it’s configurable - see File Uploads | Django documentation | Django and Settings | Django documentation | Django

I think you might be misinterpreting what you’re reading. HttpRequest is the object class, but the instance you receive in the view is traditionally named request.

Thanks for the correction!

Since request.FILES returns a dictionary like object for all files uploaded, like {name: UploadedFile object} the key is to get into each object and iterate through to avoid the annoying MultiValueDictKeyError that I am getting.

So, I know the name as it is passed through the form, but I am a bit confused on how to use chunks to iterate through a file to load it into a dataframe.

Assuming you have a file that can be chunked (aka file.multiple_chunks is true) how do you iterate through it to read the file?

I was thinking something like

  if request.method == 'POST':
      current_sheet = request.FILES['current_sheet']
      if current_sheet.multiple_chunks():
          for chunk in current_sheet.chunks():
              pass #read file somehow
      else:
          current_sheet = current_sheet.read()

Sorry for the followup

I’m not really sure you can - at least not effectively. I don’t know the format of the data you’re reading or how you load it, but the data you read as chunks is not necessarily line oriented. You’re going to get chunks of a consistent size, which isn’t likely to align with the expectations of whatever you’re using to process it.

But note - you don’t need to read it as chunks. As long as you know that the size is reasonable for your purposes, you can use read for the entire file. (Also note that if this is “line-oriented” data, you can use readlines.) See the notes at the bottom of The File class docs

No apologies necessary - this is a conversation on a topic, not an “ask one question / get one answer” site.

Try the code below - seems to be successful at uploading excel sheets into a directory for me. I think there are some checks you can do after the upload is completed to then load the files into pandas.

Granted it doesn’t load the file(s) into memory, but it can dump the file into a directory of your choosing that you can clean out regularly or until you can think of something better. I am still new to all of this myself.

html for a template:

<body>  
    <form method="POST" class="post-form" enctype="multipart/form-data">  
            {% csrf_token %}  
            {{ form }}  
            <button type="submit" class="save btn btn-default">Save</button>  
    </form>  
</body>

forms.py file for a form

from django import forms  

class UploadForm(forms.Form):  
    current_file = forms.FileField()
    new_file = forms.FileField()

here is a function that will save the file to a path you declare:


def handle_uploaded_file(f):  
    with open('path to an upload folder goes here'+f.name, 'wb+') as destination:  
        for chunk in f.chunks():  
            destination.write(chunk)

and then make sure you have a view that points to the template that was created above

From here you can work on importing the uploaded file into pandas.

I would guess that if you have multiple people using this at the same time, you would probably want to add some sort of identifier to the filename or have some sort of protection so that someone else doesn’t get their file overwritten with someone else’s - but then again check the docs as I have no idea!