Issue when saving files with long names containing dot in Django 5.1

Yesterday, I opened a ticket around an issue I encountered as we upgraded to Django 5.1, where we started to see SuspiciousFileOperation when the user uploaded a file with a name that contained a dot and which was too long for the corresponding FileField. We increased our field max_length as a quick fix, but we’re got a lot of FileFields in our project, and every week there seems to be a new instance of this problem.

When the file name is too long, Django tries to truncate the file name as follows:

  1. Split the extension and the extensionless name (file root)
  2. Truncating the extensionless name
  3. Appending the extension back

That’s been part of Django for a long time and worked for single extensions.

In 5.1, a long standing bug was fixed to support multiple extensions (e.g. .tar.gz). However, this had the side effect of treating anything after the first dot to be part of the extensions splitting, so john.doe - driving license.jpg would get the extensions .doe - driving license.jpg and file root would be john.

The problem is that later on, in the Django method to find a name, we truncate the file root based on the total name length:

# Truncate file_root if max_length exceeded.
truncation = len(name) - max_length
if truncation > 0:
    file_root = file_root[:-truncation]

In the earlier example it’s too much truncation, with a max length of e.g. 20, it gives a truncation of 9 (31 - 20), so when we do file_root[:-9], we get nothing left, which raises the seen exception.

The ticket was closed as won’t fix, with a comment that the suggested fix is too fragile and doesn’t belong in Django, which I 100% agree on.

However, I would like to discuss the possibilities to explore another way to fix this problem, as this has been working for a very long time, and it’s not uncommon to have a dot in the file name seems like a common use case.

First comment from #23759

I think a period is a valid character in a filename, so we cannot simply assume everything to the right of the first period is the file extension

That still seems right.

However, this:

>>> from pathlib import Path
>>> p = Path("a-file-name.with.dots.txt")
>>> p.suffixes
['.with', '.dots', '.txt']

From #35818:

Implementing any other algorithm would necessitate a comprehensive list of file extensions, which may not be practical (or possible!).

I think we probably could though manage a small list of most common file extensions (probably like 20 would be enough), making that a class attribute for easy subclassing, and then falling back to Path.suffixes if we don’t have a match.

:thinking:

Another idea: only allow a small number of characters for suffix parts, maybe just 3 for “tar”. Longer bits are assumed part of the name:

In [10]: def split_stem_suffix(name):
    ...:     bits = name.split('.')[::-1]
    ...:     bit_iter = iter(bits)
    ...:     suffixes = []
    ...:     stems = []
    ...:     for bit in bit_iter:
    ...:         if len(bit) <= 3:
    ...:             suffixes.append(bit)
    ...:         else:
    ...:             stems.append(bit)
    ...:             break
    ...:     stems.extend(bit_iter)
    ...:     suffix = ".".join(suffixes[::-1])
    ...:     stem = ".".join(stems[::-1])
    ...:     return stem, suffix
    ...:

In [11]: split_stem_suffix("John.Doe compressed.tar.gz")
Out[11]: ('John.Doe compressed', 'tar.gz')

That is surprisingly close from my initial implementation, I went with 5 (including the dot) in mine :smiley:

I wish Python exposed a higher level API to get file extensions, but I guess it’s much more complicated problem than it looks like…

1 Like

@adamchainz @browniebroke This is bound to go wrong in some case right?

It makes me think that the first-pass here should be a hook on Storage that folks can use to provide their own strategy. WDYT?

1 Like

Yes, looking at the list of extensions on Wikipedia, it feels like there is no “right” answer, some extensions can be quite long it turns out (.torrent, .numbers)…

Yes, I think it would be nice to make it easier to customise the strategy in user-land, right now the method users need to override is quite long and complex. In the context of an application, the variety of file extensions is much smaller than in the wider context of the framework, and users can afford to have a more specialised strategy.

Yes, let’s add a hook. But I do think we can provide more reasonable behaviour.

Right. We should probably adjust it so that the first suffix can be any length, then these long suffixes will be supported for most cases:

In [2]: def split_stem_suffix(name):
   ...:     bits = name.split('.')[::-1]
   ...:     if len(bits) == 1:
   ...:         return name, ""
   ...:     bit_iter = iter(bits)
   ...:     suffixes = [next(bit_iter)]
   ...:     stems = []
   ...:     for bit in bit_iter:
   ...:         if len(bit) <= 3:
   ...:             suffixes.append(bit)
   ...:         else:
   ...:             stems.append(bit)
   ...:             break
   ...:     stems.extend(bit_iter)
   ...:     suffix = ".".join(suffixes[::-1])
   ...:     stem = ".".join(stems[::-1])
   ...:     return stem, suffix
   ...:

In [3]: split_stem_suffix("ubuntu.torrent")
Out[3]: ('ubuntu', 'torrent')

In [4]: split_stem_suffix("John.Doe compressed.tar.gz")
Out[4]: ('John.Doe compressed', 'tar.gz')
1 Like

With my Django Fellow hat on, I would say:

  1. Any custom “path splitting” logic added to this class could potentially introduce a future DoS risk in Storage.get_available_name, so I would advise against adding split_stem_suffix to the base Storage class.
  2. The base Storage class should ideally be filesystem-ignorant, and any further customizations should be evaluated for FilesystemStorage. (I understand this isn’t strictly the case at present in the Django code base, but we definitely shouldn’t make it worse.)
  3. Previous discussions have highlighted the need for redesigning the Storage API to enhance its extensibility, and we should consider this when making decisions on the topic. Related ticket is #35607. I have some concrete thoughts on this and maybe we can schedule a chat or open office hours or similar to go over this?

(We could also incorporate into the chat Document StorageHandler as public API. which is in my inbox, waiting, with a big red PENDING label.)

I’ll try to chime in on your points:

  1. Good shout to consider the security aspect of the change. Just so I understand, are you worried that it might be too easy for users to shoot themselves in the foot with such a hook or did I miss your point? Unless I misunderstood, I don’t think the plan is to change the current Django logic, just to extract it.

    The current way I have is to copy a ~40 lines long method into my subclass to change a couple of lines. It’s arguably worse in terms of security: if tomorrow Django fixes a security vulnerability in another lines of that method, I might miss it and remain vulnerable.

  2. I don’t think I mention that before, but since FilesystemStorage is coming up, I’ve got this bug on production using S3Storage from django-storages, which I think inherit the base Storage class.

  3. This ticket #35607 seems like a wider initiative, and a step in the right direction. Is there a pull request or branch somewhere?

Generally speaking, I don’t know much about other Django shortcoming here, or what an ideal world would look like.

Coming back to my original issue, I’m not sure how to move this forward. Is there a slim chance that a hook might be added? Does this need to be reviewed by more (I already got the attention of 3 of you) people from the security team?

@nessita can you outline how you think/see/imagine the wider storage API affecting this? That seems a much wider issue than the current one, which is about the strategy used to generate file names. Granted the storage API is a bit tangled, as old things often are; I don’t have a clear picture in mind of how we might progress that.

It feels like here specifically we’ve introduced a breaking change, affecting long standing behaviour with no great workaround. Dots in file names are pretty common. We either need to revert that or (more likely) give folks an easy escape hatch. That should hit 5.2 (assuming we don’t take this as a regression.):thinking:

It would be great if you could outline your wider perspective. Thanks. :pray:

FWIW, it looks like a clear regression to me, so I would be +1 to find a fix ASAP for 5.1, even if we redesign the Storage API a bit later (targeting 5.2).

@carltongibson I’ll invest some time this week to think how I would envision a feasible API rework to improve the current API status.

@browniebroke I’ll work on my current WIP PRs about Storage API and get back to you with concrete answers.

@claudep I’m on the fence on treating this as a release blocker since a “simple” work around is to increase the max_length of the FieldField (which by default is 100, one could argue this is a too low for real life file names). The current issue is more noticeable/annoying when there are dots “early” in the file name and everything after it is as long or longer than the field’s max_length, so increasing this value would suffice.

Update: I started updating my old branches with possible API improvements but I haven’t been able to come up with a clear proposal. I will keep working on this, but my initial estimate was definitely too optimistic.

Thanks for the update @nessita.

In the short run, do you think we can proceed with the smaller hook suggestion?

@claudep thinks it should be treated as a regression. I’m not far off thinking that. (I’m not sure what the others would say.)

I’m unclear if @adamchainz has a definite and we should add this logic proposal (and thoughts on whether that should be thinking about 5.1 or 5.2)?

I do think we need to do something here before 5.2 as we will then hit the LTS updaters.

For 5.1, at the least we should add a mention of the breaking change in the release notes (which makes me thing that adding at least the hook, even if without “better logic” is needed).

:thinking: