Impact analysis for bypassing get_valid_filename

We are an enterprise application, and it is important for us to retain filenames exactly as provided by the clients. However, Django’s FileField renames every file during save using the django.utils.text.get_valid_filename function.

Here are some sample filenames that we want to upload:

  • Autorización de referencias.pdf
  • Certificado%20de%20Estatuto%20Actualizado.pdf

The function’s documentation says:

Return the given string converted to a string that can be used for a clean filename. Remove leading and trailing spaces; convert other spaces to underscores; and remove anything that is not an alphanumeric, dash, underscore, or dot.
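To make the effect on the two sample names concrete, here is a minimal sketch that follows the documented behaviour using only the standard library. The function approx_valid_filename is a hypothetical stand-in written for illustration; it is an assumption that it matches Django closely enough for these inputs, and it is not Django’s actual code.

```python
import re


def approx_valid_filename(name: str) -> str:
    """Hypothetical approximation of the documented behaviour (not Django's code)."""
    # Remove leading/trailing spaces, then turn the remaining spaces into underscores.
    cleaned = str(name).strip().replace(" ", "_")
    # Drop anything that is not a (Unicode) alphanumeric, underscore, dash, or dot.
    return re.sub(r"[^-\w.]", "", cleaned)


samples = [
    "Autorización de referencias.pdf",
    "Certificado%20de%20Estatuto%20Actualizado.pdf",
]

for original in samples:
    print(repr(original), "->", repr(approx_valid_filename(original)))
```

Under this approximation, accented letters survive because they count as alphanumerics, while the percent signs are stripped, so each "%20" collapses to "20" rather than to a space or an underscore.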

Given that our cloud storage supports all file names, what is the impact of bypassing get_valid_filename on Django’s internal working? Do you see any major risks?

I’m confident that there must be a concrete reason to introduce filename standardization and don’t want to mess around with it before understanding the impact of this change.

Probably the biggest risk that comes to mind is a race condition between two different people uploading files with the same name within a short time window, combined with Django being unable to rely upon every file storage system to handle that situation in a sane manner. (Or the same person uploading the same file at the same time in two different browser tabs.)
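As an illustration of the collision problem, here is a rough sketch of one way a local filesystem can avoid two uploads silently overwriting each other: create the file exclusively and fall back to a numbered name if it already exists. The helper save_without_overwrite and its "_1", "_2" naming scheme are assumptions made for this example; a cloud object store may not offer an equivalent atomic create, which is exactly the concern.

```python
import os


def save_without_overwrite(directory: str, name: str, data: bytes) -> str:
    """Write data under `name`, or under a numbered variant if that name is taken.

    Returns the name that was actually used (hypothetical helper, local disk only).
    """
    stem, ext = os.path.splitext(name)
    candidate = name
    counter = 0
    while True:
        path = os.path.join(directory, candidate)
        try:
            # O_EXCL makes creation fail if the file exists, so only one of
            # several concurrent writers can claim a given name.
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
        except FileExistsError:
            counter += 1
            candidate = f"{stem}_{counter}{ext}"
            continue
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        return candidate
```

Note that the second uploader ends up with a different stored name than the one submitted, so "retain the filename exactly" and "never overwrite" cannot both hold unless the original name is kept somewhere else, for example in a separate database column.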

(There may be other risks as well.)

Does it really support all file names? Does it handle file names with linefeed characters? Nulls? Multibyte Unicode? Unlimited file name length?
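If you do bypass the cleaning step, you would probably still want some checks of your own for the cases listed above. The following is only a sketch: filename_problems is a hypothetical helper, and the 255-byte limit is an assumption that you would need to replace with whatever your storage backend actually documents.

```python
import unicodedata

# Assumption: a placeholder limit; check the real limit of your storage backend.
MAX_NAME_BYTES = 255


def filename_problems(name: str) -> list[str]:
    """Return a list of reasons why `name` looks unsafe to store as-is."""
    problems = []
    if not name or name in {".", ".."}:
        problems.append("empty or dot-only name")
    # Category "Cc" covers control characters, including NUL and linefeed.
    if any(unicodedata.category(ch) == "Cc" for ch in name):
        problems.append("control character")
    if "/" in name or "\\" in name:
        problems.append("path separator")
    # Length is measured in encoded bytes, since multibyte characters count for more.
    if len(name.encode("utf-8")) > MAX_NAME_BYTES:
        problems.append("too long")
    return problems


print(filename_problems("Autorización de referencias.pdf"))
print(filename_problems("bad\nname.pdf"))
```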

Keep in mind that the name of a file is not an intrinsic property of the file, and that the data being submitted is subject to being altered by the client before the upload occurs.

Also, one of your sample file names appears to be URL encoded. How are you going to know whether the file name was submitted with embedded spaces or whether the original file was actually stored under a URL-encoded name? Guess wrong, and you’re returning a file name that differs from the one that was submitted.
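The ambiguity is easy to demonstrate with the standard library. In this small sketch, the two variable names are made up for illustration; the point is that two different submitted names become indistinguishable once either is decoded.

```python
from urllib.parse import unquote

# Two distinct names a client could plausibly have on disk.
name_with_percent = "Certificado%20de%20Estatuto%20Actualizado.pdf"
name_with_spaces = "Certificado de Estatuto Actualizado.pdf"

# Before decoding they are clearly different strings.
print(name_with_percent == name_with_spaces)

# After decoding, both map to the same name, so the original form is lost.
print(unquote(name_with_percent) == unquote(name_with_spaces))
```

Whichever policy you choose (always decode, or never decode), one of these two clients gets back a name that differs from what they uploaded, unless the exact submitted string is stored alongside the file.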