Handle user uploaded zip file, validate zip and normalize it to zipfile.ZipFile: Discussion.

Hello! I am working on a wallpapers site project. I am developing this feature where the admin can perform “Bulk Upload” by uploading a zip file full of wallpapers. I’ve written a form that does it, below is simplified version:

class BulkUploadForm(forms.Form):
    zip_file = forms.FileField(
        label='Wallpapers Zip',
        required=True,
        max_length=64,
        allow_empty_file=False,
        help_text=f"Upload a zip file containing wallpapers that does not exceed {settings.MAX_BULK_UPLOAD_SIZE // mb} MB."
    )

    def clean_zip_file(self) -> zipfile.ZipFile:
        uploaded_file = cast(UploadedFile, self.cleaned_data['zip_file'])

        if uploaded_file.name is None or PurePath(uploaded_file.name).suffix != '.zip':
            raise ValidationError(
                "Only '.zip' extension is allowed.",
                code='invalid_zip_file_extension'
            )

        if uploaded_file.file is None:
            raise ValidationError(
                "zip file is required.",
                code='zip_file_required'
            )

        try:
            zip_file = zipfile.ZipFile(uploaded_file)
        except zipfile.BadZipfile as err:
            raise ValidationError(
                'Invalid zip file.',
                code='invalid_zip_file'
            )
        
        if (bad_file:=zip_file.testzip()) is not None:
            raise ValidationError(
                "Bad file found in zip: %(bad_file)s",
                code='bad_file_found_in_zip_file',
                params=dict(bad_file=bad_file)
            )

        return zip_file

This project is open source and hosted on my Github account.

I wanted some suggestion from community and optimizations tips for this type of “Large File Task”. This is because the zip file will typically have size around 300 to 400 MB in my application.

Please tell me if above implementation contains any bugs, memory leaks, security mistake.

One more question, exactly at which point the underlying Uploaded file (either TemporaryUploadedFile or InMemoryUploadedFile) is destroyed and release resources, assuming that I performs something in a view like below:

def bulk_upload(request: HttpRequest) -> HttpResponse:

    if request.method == 'POST':

        form = BulkUploadForm(request.POST, request.FILES)

        if form.is_valid():
            zip_file = cast(zipfile.ZipFile, form.cleaned_data['zip_file'])
            upload_process = BulkUploadProcess.upload_procedures.bulk_upload(zipfile.Path(zip_file))
            return render(request, 'success.html')
        else:
            return render(request, 'app/bulk_upload.html', dict(form=form))

    return render(request, 'app/bulk_upload.html', dict(form=BulkUploadForm()))

And bulk_upload function is following:

def bulk_upload(self, zip_file_path: ZipPath) -> "BulkUploadProcess":
        paths = []
        extensions = get_file_extensions_for_image_format(ImageFormat.JPEG)
        glob_patterns = tuple([f'**/*{extension}' for extension in extensions])

        for glob_pattern in glob_patterns: paths += [*zip_file_path.glob(glob_pattern)]
        
        group_result = cast(GroupResult, group(
                chain(

                    save_wallpaper.s(file.at, str(file.root.filename)),
                    
                    generate_and_save_dummy_wallpaper.s(),

                ) 
                for file in paths
            )(countdown=10)
        )
        group_result.save() # type: ignore[attr-defined]
        result_id = cast(str, group_result.id) # type: ignore[attr-defined]
        return BulkUploadProcess.objects.create(uuid=uuid.UUID(result_id))

The bulk_upload function uses celery to perform group task of chains. Each chain has two function: save_wallpaper and generate_and_save_dummy_wallpaper. The chain first saves wallpaper to database, then it prepares a “dummy” wallpaper to show on frontend. The dummy wallpaper will be small in size thus faster to load; for example original wallpaper can be 4-5 MB while dummy is 400 to 500 KB. When user downloads wallpaper, they get original wallpaper.

1 Like

I believe that this code could be vulnerable to any of a number of “zip-file attacks” such as a zip bomb. (Also see the referenced article at The Most Clever 'Zip Bomb' Ever Made Explodes a 46MB File to 4.5 Petabytes for some additional background.)

Also see https://nvd.nist.gov/vuln/detail/cve-2024-0450 and Decompression pitfalls

I’ve came up with a potential solution to avoid zip bomb for my code. Can you please tell if it works?

  • We constraint name of the zip.

  • We constraint file size of the zip.

  • We then loop through all files using zipfile.Path with at parameter set to / so only top level files are considered

    • Inside the loop: We constraint name and extension of the file.
    • Inside the loop: we maintain a counter which if exceeds some number (let’s say 50) - we raise error; so now we constrained the number of top level files.
    • Inside the loop: we accumulate sizes of considered files which if exceeds then raise error; so we now constrained the size of decompressed data.
  • We prohibit use of glob().

  • We prohibit use of extractall().

  • We keep list of considered filenames to latter use to validate as image file and save file to database and media root.

  • After everything is done, we manually use close on file to ensure that its destroyed from memory or temporary file system.

Is there any way to go stricter?

From my small reading session about zip bombs, they work because we extract the nested zip files as well. We allow un-sanitized and unsafe file names. We don’t count total decompressed size before decompressing it. All of which the above code safely handles.

Am I missing anything? It should be easy correct? at the end - its commonly used on whatsapp, emails and web-apps in general, so I guess people must’ve solved zip bombs and related zip attacks by writing code like above.

It’s tough for me to evaluate a description in the absence of seeing the actual code. The details really matter in a situation like this.

Note: I’m far from being an expert on these issues. We don’t do anything allowing for the upload of a zip file - all files are expected to be uploaded individually.

(post deleted by author)

First of all, thank you so much for guiding me to look for things that are typical pitfalls for a backend beginner like me and giving time to read my post.

I have some curious questions after reading to you responses, It would be great if you can reply again.

I enjoyed your last reply because you used the word “expert”, So naturally there is perspective gap between us - I say it because you have a lot of experience thus you saw this problem and concluded that you are not an expert to state “The solution” for my problem. On the other hand, I stated “The solution” by vaguely writing some steps and thought “that’s it” because from my perspective, I don’t know about any hidden variables.

My curious questions are:

  • why anyone needs to be an expert in “data compression science” to just handle zip file upload anyways? Isn’t it simple? - don’t allow bad filenames, calculate size before decompress, don’t use functions which has indefinite behaviour like glob() and extractall(), only work with top level files inside zip, never go downstairs.

  • If zip files indeed needs expertise then what are alternatives? Is it worth giving time to be an expert in zip - I am asking because I am pre-final year college student and soon seeking for job opportunities for backend developer job with python.

  • Do you have references to some books or articles series related to backend development in general so that I also get to know about “hidden/unknown variables” of system?

  • How do you get skill in this field? My current approach is “read docs, do some projects, stay in community to read new thing and stay attached with my work”.

First, to clarify, this all doesn’t really have anything to do with “data compression science”. It’s more an issue of the implementation of the tools and utilities that implement compression algorithms.

I don’t need to understand how the zip algorithm works - I just need to be aware that the zip file format has a number of unconstrained edge-cases creating the possibility of malicious files being created, and, that the libraries working with those files have flaws.

For example, the zip file format has been around since about 1990. One might have thought that it has been made pretty reliable in that time. However, the CVE I referenced above was published last year.

No, it really isn’t. Or, to be clear - it’s easy to provide a general description that appears to cover all the issues. But the details matter. It’s rarely the concept that is the problem, but rather, it’s the implementation that ends up being lacking.

Maintaining a focus on security is truly a full-time job. But, it doesn’t actually mean that there’s a need to implement every possible remediation to an identified vulnerability - all decisions regarding security controls should be made in the context of a risk assessment.

Using your zip file situation as an example, you should consider your user base.

If your system is designed as an internal system for corporate users, you can, in many cases, ignore things like this, because your corporate policies should make it extremely undesirable for people to try and abuse the system.

Or, if it is a public site, but is expected to be used by people who want to use it, they probably don’t want to cause harm either.

So you want to decide whether the convenience of allowing users to upload zip files is worth the risk of allowing them to do so - and what steps should you take to mitigate those risks.

(And it’s not just zip files either, You can find published exploits of things like PDFs, Excel spreadsheets, Word documents, etc.)

The general category is “Secure Software Development”. There’s a ton of material out there. (I don’t have any specific recommendations because my formal training in that area was more than 25 years ago.)

For web application security, and as a starting point of awareness in the topic, I can suggest https://owasp.org/ and their Top Ten project
(There’s a lot of overlap between this and “Information Security”, with this latter topic focusing more on the policy and procedural aspects - in particular, the evaluation of risks to determine how you’re going to respond to those risks.)

In the general case, you should try to learn about the basic types of exploits. You should at least be comfortable with the concepts and principles of things like buffer overflow and SQL injection attacks (among others).