Unicode has many special code points that make it possible to create strings that look identical but have different encodings. Examples include (demonstrated in the snippet after this list):
bidi overrides (e.g. '\u202enimda' looks the same as 'admin')
zero-width characters (e.g. 'ad\u200bmin' looks the same as 'admin')
unicode normalization forms (e.g. '\u00C7xx' looks the same as '\u0043\u0327xx')
or just ASCII whitespace that gets collapsed when displayed in HTML ('admin user ' looks the same as 'admin user')
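Here is a minimal Python sketch of the examples above; plain CPython, nothing Django-specific:

```python
import unicodedata

# Each pair renders (near-)identically in many contexts, yet compares unequal.
pairs = [
    ("\u202enimda", "admin"),        # bidi override reverses the display order
    ("ad\u200bmin", "admin"),        # zero-width space is invisible
    ("\u00c7xx", "\u0043\u0327xx"),  # precomposed vs. decomposed C-cedilla
    ("admin user ", "admin user"),   # trailing space collapses in HTML
]

for a, b in pairs:
    print(f"{a!r} == {b!r} -> {a == b}")  # False for every pair

# Unicode normalization only fixes the normalization-form case:
a, b = "\u00c7xx", "\u0043\u0327xx"
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True
```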
This is an issue for unique CharFields, because two different unique values can look exactly the same.
This is not an issue for unique SlugFields, because they restrict the character set in such a way that these attacks are not possible.
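For reference, SlugField's default validator (django.core.validators.validate_slug) only accepts ASCII letters, digits, hyphens and underscores, so none of the tricks above get through:

```python
from django.core.exceptions import ValidationError
from django.core.validators import validate_slug

validate_slug("admin-user")  # passes silently: only [-a-zA-Z0-9_] is allowed

try:
    validate_slug("ad\u200bmin")
except ValidationError:
    print("rejected")  # the zero-width space is outside the allowed set
```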
However, SlugFields might not be appropriate for every use case. So I wonder whether it would be possible to add a separate field that restricts the available characters (and maybe also does some normalization) but is more permissive than SlugField.
I am not sure about the exact restrictions that should apply. RFC 8264 could serve as a reference (or at least inspiration).
This is an interesting idea! I think it's actually a great candidate for a third-party package. This will give it plenty of space to iterate and settle on a great API without being held to Django's very slow pace and high backwards-compatibility requirements.
Hi @xim I understand your concern, however keep in mind that the validation performed at the Python level in Django is not the same as the validation performed at the database level. It is also possible to put any character in a SlugField, because validation is only performed by forms in Django (i.e. mymodel.slug = '\u202enimda'; mymodel.save() goes through without complaint).
If you want to prevent this and you are sure that every input is handled by a form, you can simply add a RegexValidator (or another kind of validator) to your CharField. You may also try adding a CheckConstraint which ensures at the database level that the field does not contain any unwanted character.
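Roughly like this, as a sketch; the model name, constraint name, and the printable-ASCII policy are just placeholders to adapt, and the regex lookup inside the CheckConstraint assumes a backend that supports it, such as PostgreSQL (on Django 5.1+ the check= argument is spelled condition=):

```python
from django.core.validators import RegexValidator
from django.db import models

# Form-level validation: runs in ModelForms and on full_clean().
printable_ascii = RegexValidator(
    regex=r"^[ -~]+$",  # space through tilde, i.e. printable ASCII only
    message="Only printable ASCII characters are allowed.",
)

class Account(models.Model):
    name = models.CharField(max_length=50, unique=True,
                            validators=[printable_ascii])

    class Meta:
        constraints = [
            # Database-level enforcement: also catches plain .save() calls
            # that bypass form validation entirely.
            models.CheckConstraint(
                check=models.Q(name__regex=r"^[ -~]+$"),
                name="account_name_printable_ascii",
            ),
        ]
```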
You have a messaging system where display names are unique. Alice gets a message from their friend Bob, asking for some personal information. Only it turns out the message did not actually come from Bob, but from someone whose display name looks identical to Bob's.
I understand what homograph attacks are. What I am asking about is the specific attack scenario, particularly one that would realistically apply to Django and justify changes or additions in the core. The example given does not seem problematic since usernames are already restricted to alphanumeric characters. Whether this is worth fixing depends on the likelihood of a practical attack vector, and at the moment I do not see anything that could be easily abused.
OP's suggestion sounds like something that could just be added via a custom validator, either in code or as a package, but not like something for core.
I second what @Lily-Foote said: this is a great idea for a third-party package, but not for a general-purpose web framework with high backwards-compatibility constraints.
Reasons:
Unicode is very convoluted and ever-changing, with new codepoints and clustering rules; there is no chance to get a stable subset covering most of those vectors (a high maintenance burden, as you would constantly be fighting against hackers' creativity)
it needs lots of subrules to partially mitigate different attacks (e.g. excluding certain codepoint ranges with a regexp, or explicit collision checks against a dataset)
usage is always highly dependent on the application domain, e.g. you cannot simply exclude Cyrillic codepoints if the app is supposed to serve a Russian website, so a library solution must be highly configurable as to which subrules to apply
exclusion / sanity checks could be offered at different stages:
widget input in the browser
form validation
model validation
on the database level (e.g. a CHECK constraint on PostgreSQL)
Depending on how much of that you want to cover, a library solution will most likely end up as its own toolbox / framework.
Yes, the complexity of Unicode is a problem. I still hope that there could be some useful default behavior, e.g. excluding control characters and doing NFKC normalization or something. But it is probably better to experiment in a library first.
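Something in this direction, for example; clean_identifier is a made-up name, and this is only illustrative, not a vetted security measure:

```python
import unicodedata

def clean_identifier(value: str) -> str:
    """Reject control/format characters, then normalize to NFKC."""
    for ch in value:
        # Cc = control characters; Cf = format characters, which include
        # the bidi overrides (U+202E) and zero-width characters (U+200B).
        if unicodedata.category(ch) in ("Cc", "Cf"):
            raise ValueError(f"disallowed character {ch!r}")
    return unicodedata.normalize("NFKC", value)

print(clean_identifier("\u0043\u0327xx"))  # 'Çxx', same as NFKC of '\u00C7xx'
clean_identifier("ad\u200bmin")            # raises ValueError
```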
Another idea could be to add a check for CharField(unique=True). But I am not sure about the signal-to-noise ratio there, either.
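A sketch of such a check, in case anyone wants to prototype it; the warning id and the SlugField exemption are my own invention, and it would have to be registered from an AppConfig.ready(). As said, it would fire on plenty of legitimate fields (including django.contrib.auth's username), which is exactly the signal-to-noise worry:

```python
from django.apps import apps
from django.core import checks
from django.db import models

@checks.register(checks.Tags.models)
def warn_on_unique_charfields(app_configs, **kwargs):
    """Warn about unique CharFields that place no restriction on characters."""
    warnings = []
    for model in apps.get_models():
        for field in model._meta.get_fields():
            if (
                isinstance(field, models.CharField)
                and not isinstance(field, models.SlugField)
                and field.unique
            ):
                warnings.append(checks.Warning(
                    "Unique CharField may admit visually identical but "
                    "distinct values (homograph collisions).",
                    obj=field,
                    id="myapp.W001",  # made-up id
                ))
    return warnings
```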
I hope to have more time for this soon and create a little library. If and when that happens, I will post a link. Thanks for all the ideas and feedback!