Unicode has many special code points that make it possible to create strings that look identical but have different encodings. Examples include (demonstrated in the snippet after this list):
bidi overrides (e.g. '\u202enimda' looks the same as 'admin')
zero-width characters (e.g. 'ad\u200bmin' looks the same as 'admin')
unicode normalization forms (e.g. '\u00C7xx' looks the same as '\u0043\u0327xx')
or just ASCII whitespace that gets collapsed when displayed in HTML ('admin user ' looks the same as 'admin user')
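Here is a minimal Python sketch of the examples above; plain CPython, nothing Django-specific:

```python
import unicodedata

# Each pair renders (near-)identically in many contexts, yet compares unequal.
pairs = [
    ("\u202enimda", "admin"),        # bidi override reverses the display order
    ("ad\u200bmin", "admin"),        # zero-width space is invisible
    ("\u00c7xx", "\u0043\u0327xx"),  # precomposed vs. decomposed C-cedilla
    ("admin user ", "admin user"),   # trailing space collapses in HTML
]

for a, b in pairs:
    print(f"{a!r} == {b!r} -> {a == b}")  # False for every pair

# Unicode normalization only fixes the normalization-form case:
a, b = "\u00c7xx", "\u0043\u0327xx"
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True
```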
This is an issue for unique CharFields, because two different unique values can look exactly the same.
This is not an issue for unique SlugFields, because they restrict the character set in such a way that these attacks are not possible.
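For reference, SlugField's default validator (django.core.validators.validate_slug) only accepts ASCII letters, digits, hyphens and underscores, so none of the tricks above get through:

```python
from django.core.exceptions import ValidationError
from django.core.validators import validate_slug

validate_slug("admin-user")  # passes silently: only [-a-zA-Z0-9_] is allowed

try:
    validate_slug("ad\u200bmin")
except ValidationError:
    print("rejected")  # the zero-width space is outside the allowed set
```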
However, SlugFields might not be appropriate for every use case. So I wonder whether it would be possible to add a separate field that restricts the available characters (and maybe also does some normalization) but is more permissive than SlugField.
I am not sure about the exact restrictions that should apply. RFC 8264 could serve as a reference (or at least inspiration).
This is an interesting idea! I think it's actually a great candidate for a third-party package. This will give it plenty of space to iterate and settle on a great API without being held to Django's very slow pace and high backwards-compatibility requirements.
Hi @xim I understand your concern, however keep in mind that the validation performed at the Python level in Django is not the same as the validation performed at the database level. It is also possible to put any character in a SlugField, because validation is only performed by forms in Django (i.e. mymodel.slug = '\u202enimda'; mymodel.save() goes through without complaint).
If you want to prevent this and you are sure that every input is handled by a form, you can simply add a RegexValidator (or another kind of validator) to your CharField. You may also try adding a CheckConstraint which ensures at the database level that the field does not contain any unwanted character.
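Roughly like this, as a sketch; the model name, constraint name, and the printable-ASCII policy are just placeholders to adapt, and the regex lookup inside the CheckConstraint assumes a backend that supports it, such as PostgreSQL (on Django 5.1+ the check= argument is spelled condition=):

```python
from django.core.validators import RegexValidator
from django.db import models

# Form-level validation: runs in ModelForms and on full_clean().
printable_ascii = RegexValidator(
    regex=r"^[ -~]+$",  # space through tilde, i.e. printable ASCII only
    message="Only printable ASCII characters are allowed.",
)

class Account(models.Model):
    name = models.CharField(max_length=50, unique=True,
                            validators=[printable_ascii])

    class Meta:
        constraints = [
            # Database-level enforcement: also catches plain .save() calls
            # that bypass form validation entirely.
            models.CheckConstraint(
                check=models.Q(name__regex=r"^[ -~]+$"),
                name="account_name_printable_ascii",
            ),
        ]
```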
You have a messaging system where display names are unique. Alice gets a message from their friend Bob, asking for some personal information. Only it turns out the message did not actually come from Bob, but from someone whose display name looks identical to Bob's.
I understand what homograph attacks are. What I am asking about is the specific attack scenario, particularly one that would realistically apply to Django and justify changes or additions in the core. The example given does not seem problematic since usernames are already restricted to alphanumeric characters. Whether this is worth fixing depends on the likelihood of a practical attack vector, and at the moment I do not see anything that could be easily abused.
OP's suggestion sounds like something that could just be added via a custom validator, either in code or as a package, but not like something for core.
I second what @Lily-Foote said: this is a great idea for a third-party package, but not for a general-purpose web framework with high backwards-compatibility constraints.
Reasons:
Unicode is very convoluted and ever-changing, with new codepoints and clustering rules; there is no chance to get a stable subset covering most of those vectors (a high maintenance burden, as you would constantly be fighting against hackers' creativity)
it needs lots of subrules to partially mitigate different attacks (e.g. excluding certain codepoint ranges with a regexp, or explicit collision checks against a dataset)
usage is always highly dependent on the application domain, e.g. you cannot simply exclude Cyrillic codepoints if the app is supposed to serve a Russian website, so a library solution must be highly configurable as to which subrules to apply
exclusion / sanity checks could be offered at different stages:
widget input in the browser
form validation
model validation
on the database level (e.g. a CHECK constraint on PostgreSQL)
Depending on how much of that you want to cover, a library solution will most likely end up as its own toolbox / framework.
Yes, the complexity of Unicode is a problem. I still hope that there could be some useful default behavior, e.g. excluding control characters and doing NFKC normalization or something. But it is probably better to experiment in a library first.
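Something in this direction, for example; clean_identifier is a made-up name, and this is only illustrative, not a vetted security measure:

```python
import unicodedata

def clean_identifier(value: str) -> str:
    """Reject control/format characters, then normalize to NFKC."""
    for ch in value:
        # Cc = control characters; Cf = format characters, which include
        # the bidi overrides (U+202E) and zero-width characters (U+200B).
        if unicodedata.category(ch) in ("Cc", "Cf"):
            raise ValueError(f"disallowed character {ch!r}")
    return unicodedata.normalize("NFKC", value)

print(clean_identifier("\u0043\u0327xx"))  # 'Çxx', same as NFKC of '\u00C7xx'
clean_identifier("ad\u200bmin")            # raises ValueError
```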
Another idea could be to add a check for CharField(unique=True). But I am not sure about the signal-to-noise ratio there, either.
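A sketch of such a check, in case anyone wants to prototype it; the warning id and the SlugField exemption are my own invention, and it would have to be registered from an AppConfig.ready(). As said, it would fire on plenty of legitimate fields (including django.contrib.auth's username), which is exactly the signal-to-noise worry:

```python
from django.apps import apps
from django.core import checks
from django.db import models

@checks.register(checks.Tags.models)
def warn_on_unique_charfields(app_configs, **kwargs):
    """Warn about unique CharFields that place no restriction on characters."""
    warnings = []
    for model in apps.get_models():
        for field in model._meta.get_fields():
            if (
                isinstance(field, models.CharField)
                and not isinstance(field, models.SlugField)
                and field.unique
            ):
                warnings.append(checks.Warning(
                    "Unique CharField may admit visually identical but "
                    "distinct values (homograph collisions).",
                    obj=field,
                    id="myapp.W001",  # made-up id
                ))
    return warnings
```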
I hope to have more time for this soon and create a little library. If and when that happens, I will post a link. Thanks for all the ideas and feedback!