These two EmailValidator cleanup tickets have been open since 2016. The first is about simplifying the EmailValidator regular expressions, and the second would support non-ASCII email address internationalization (EAI, like Надія@example.ua). But the discussion on the two has become intertwined, and they’ve accumulated a series of stalled or abandoned PRs.
- #26423 (Make EmailValidator use HTML5 validation rather than more complicated regular expressions) – Django
- #27029 (Make EmailValidator accept non-ASCII characters in local part) – Django
After discussion here, I’m hoping to either update (or replace) them with a consensus decision that would allow PRs to land, or to close them wontfix. (Or maybe one of each.)
[A “local-part” is the username part of an email address, before the @.]
Simplifying EmailValidator regular expression
As currently written, ticket-26423 would replace the EmailValidator regexps with the HTML5 valid email address check. But the HTML5 rule:
- is ASCII-only, so would block internationalized domain names (IDNs) such as editor@עִתוֹן.example.il (which Django has long allowed). Handling IDNs as recommended in HTML5 would require a new library.[1]
- deliberately disallows quoted-string local-parts, such as "Lipinski, Hubert"@ccmail.example.net.[2] EmailValidator currently allows these, so this change would require a deprecation strategy that isn’t described in the ticket.
- doesn’t allow IP address domain literals such as admin@[10.10.0.1] (which were specifically added to Django in ticket-16166).
- doesn’t help with EAI. (WHATWG has been discussing this.)
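To make the list above concrete, here is a minimal sketch that runs the example addresses through a Python translation of the pattern from the HTML Standard's "valid email address" definition. The names `HTML5_EMAIL_RE` and `html5_valid` are made up for this illustration and are not part of Django.

```python
import re

# Python translation of the HTML Standard's "valid email address" pattern.
# The ^ and $ anchors are dropped because fullmatch() is used instead.
HTML5_EMAIL_RE = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
)


def html5_valid(address):
    """Return True if the whole address matches the HTML5 pattern."""
    return HTML5_EMAIL_RE.fullmatch(address) is not None


examples = [
    "user@example.com",                       # plain ASCII address
    "editor@עִתוֹן.example.il",                 # IDN domain
    '"Lipinski, Hubert"@ccmail.example.net',  # quoted-string local-part
    "admin@[10.10.0.1]",                      # IP address domain literal
    "Надія@example.ua",                       # EAI local-part
]
for address in examples:
    print(address, html5_valid(address))
```

Only the first address passes; the other four are rejected, one for each of the points listed.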
In the time since ticket-26423 was created, EmailValidator has been simplified somewhat by borrowing DomainNameValidator’s regexps after the @. (But it still has a gnarly regexp for the local-part, plus a simpler one for IP address domain literals.)
Some options:
1. Use just the local-part half of the HTML5 rule, and keep the current EmailValidator logic for the domain.
   The HTML5 local-part check can be implemented efficiently with set operations, avoiding regular expressions. Supporting EAI would require a separate (but straightforward) extension, described in the next section.
   Since this would reject some emails that EmailValidator currently allows, it would need a deprecation strategy.
2. Just check for an @ with some characters on either side and get rid of the regexps.[3]
   Several people have suggested this, including in the original developers list discussion. (But mostly in comments and PR feedback on the other EAI ticket.)
   This simplified check supports EAI. It allows all emails that EmailValidator already accepts, so maybe wouldn’t require deprecation? But it also permits some invalid emails, so maybe would need deprecation?
3. Check for an @, but only allow certain characters on either side.
   This is essentially what HTML5 does, but involves inventing our own rules for the allowable characters and/or patterns. (I anticipate a lot of debate on the details.)
   Depending on the specific decisions, it may or may not support EAI, might or might not require deprecation, and could involve new regular expressions or might be implementable without them.
4. Stick with what we have and close ticket-26423 wontfix.
   The current regexps are ugly, but not as bad as they used to be, and are reasonably well tested by now.
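As a rough illustration of the set-operations idea in the first option, the HTML5 local-part rule could be checked along these lines. This is only a sketch: `HTML5_LOCAL_CHARS` and `is_html5_local_part` are hypothetical names, and the character set is copied from the local-part half of the HTML5 pattern.

```python
import string

# Characters the HTML5 rule permits before the @ (ASCII letters, digits,
# and the punctuation listed in the HTML5 pattern, including ".").
HTML5_LOCAL_CHARS = frozenset(
    string.ascii_letters + string.digits + ".!#$%&'*+/=?^_`{|}~-"
)


def is_html5_local_part(user_part):
    """Return True if user_part is non-empty and uses only allowed characters."""
    # issuperset() accepts any iterable, so the str is checked char by char.
    return bool(user_part) and HTML5_LOCAL_CHARS.issuperset(user_part)


print(is_html5_local_part("first.last+tag"))       # True
print(is_html5_local_part('"Lipinski, Hubert"'))   # False: quotes, comma, space
print(is_html5_local_part("Надія"))                # False: non-ASCII
```

Like the HTML5 rule itself, this accepts leading, trailing, and consecutive dots, which the current dot-atom regexp does not.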
Supporting EAI non-ASCII local-parts
ticket-27029 would update the EmailValidator to permit non-ASCII characters in the local-part. This is one component of supporting EAI in Django. (It’s worth noting that use of EAI addresses is still quite low. But an important barrier to adoption is the lack of EAI support in web technologies.)
We’d follow RFC 6532 section 3.2, and update EmailValidator to allow UTF8-non-ascii characters everywhere alphabetic characters are currently valid in the local-part.
- With Option 1 above (check HTML5 valid local-part), we'd include UTF8-non-ascii in the set of allowable local-part characters. (There are some efficient ways to do this using set arithmetic and other non-regexp str operations.)
- With Option 2 above (just check for @), EAI is already supported. We'd close ticket-27029 as duplicate.
- With Option 3 above (roll our own simplified rules), more discussion would be needed depending on the specific rules we come up with.
- With Option 4 above (keep existing EmailValidator regular expressions), we'd update EmailValidator.user_regex to include UTF8-non-ascii, in both the dot-atom and quoted-string sections. (There are some PRs that have come close.)
- Or with any option for ticket-26423, we could instead decide not to support non-ASCII local-parts (for now), and close ticket-27029 wontfix.
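For the Option 1 case, one possible shape of the set-arithmetic extension is sketched below. It assumes UTF8-non-ascii can be approximated as "any code point above U+007F" in a Python str; the names `ASCII_LOCAL_CHARS` and `is_eai_local_part` are hypothetical, and the sketch deliberately ignores the security questions discussed under "Other wrinkles".

```python
import string

# ASCII characters allowed by the HTML5 local-part rule.
ASCII_LOCAL_CHARS = frozenset(
    string.ascii_letters + string.digits + ".!#$%&'*+/=?^_`{|}~-"
)


def is_eai_local_part(user_part):
    """Allow the HTML5 ASCII set plus any non-ASCII character (assumption)."""
    if not user_part:
        return False
    # Set difference removes every allowed ASCII character in one step.
    leftover = set(user_part) - ASCII_LOCAL_CHARS
    # Anything left must be non-ASCII; disallowed ASCII (space, quotes,
    # control characters) fails here.
    return all(ord(ch) > 0x7F for ch in leftover)


print(is_eai_local_part("Надія"))        # True
print(is_eai_local_part("first.last"))   # True
print(is_eai_local_part("first last"))   # False: space is disallowed ASCII
```

A real implementation would probably also want to decide what to do about non-ASCII control characters and unpaired surrogates, which this sketch lets through.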
Other wrinkles
- Deprecation: Any option that needs a deprecation strategy would probably involve temporarily adding a new setting to select new vs. deprecated behavior.
- Make a new validator: Several comments suggest putting the new behavior in a new validator with a different name (SimpleEmailValidator, UnicodeEmailValidator), so users could opt in without breaking existing code. I think that defeats the purpose of ticket-26423. Given the current adoption of EAI, it might be reasonable for ticket-27029, but it feels confusing to me. (I suppose we could also make it an init option, like DomainNameValidator’s accept_idna.)
- SMTP backend doesn’t handle EAI: Once ticket-27029 is implemented, you could end up with valid email addresses that Django can’t email. (That’s ticket-35714.) I think this is worth mentioning in the release notes, but shouldn’t block progress on EAI. (It is a valid email address.)
- EAI security concerns: Allowing Unicode local-parts might introduce new security issues similar to the one fixed in Django 3.0.1. Unicode is complicated.
  In ticket-27029, Collin Anderson pointed out some recommendations from UTS39 that could be helpful. We might also want to apply their recommended normalizations in forms.fields.EmailField.
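To give a feel for why Unicode local-parts need care, here is a small sketch of two ways that visibly different strings can collapse to the same value. The example strings are invented, and NFKC is used purely as an illustration; which normalizations would actually be appropriate for EmailField is an open question, not something settled here.

```python
import unicodedata

# "mıke" spelled with U+0131 LATIN SMALL LETTER DOTLESS I
dotless = "m\u0131ke"
# "ａdmin" spelled with U+FF41 FULLWIDTH LATIN SMALL LETTER A
fullwidth = "\uff41dmin"

# The raw strings are different from their ASCII lookalikes...
print(dotless == "mike")      # False
print(fullwidth == "admin")   # False

# ...but uppercasing maps the dotless i to a plain ASCII "I",
print(dotless.upper() == "mike".upper())   # True

# and NFKC compatibility normalization folds the fullwidth letter to "a".
print(unicodedata.normalize("NFKC", fullwidth) == "admin")   # True
```

Any code that compares or looks up addresses after case mapping or normalization, while storing or emailing the original, could treat two different mailboxes as the same one.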
[1] A few paragraphs earlier, HTML5 recommends converting IDNs to ASCII before the valid email address check. WHATWG’s domain to ASCII algorithm specifies UTS46 encoding, which isn’t fully supported in any production Python library (though “somebody” seems to be working on it). In any case, Django tends to resist adding new dependencies.
[2] The email spec RFC 5322 and predecessors have always allowed quoted-string local-parts. I think they’re mainly intended for gateways into non-Internet email systems with different username formats. I’ve never seen one in real-world use, and they’re a popular source of obscure bugs in various email implementations. (Also, don’t confuse this with a friendly/human-readable “display-name,” which is also often a quoted-string but is separate from the local-part.)
[3] Specifically, the simplified EmailValidator would keep the length check, keep the rsplit() on @, but replace everything else with a check that both user_part.strip() and domain_part.strip() are non-empty.
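The simplified check described in this note might look roughly like the following. It is written as a standalone function returning a bool rather than as a Django validator; `is_plausible_email` and `MAX_EMAIL_LENGTH` are hypothetical names, and the 320-character limit is an assumption made for this sketch.

```python
MAX_EMAIL_LENGTH = 320  # assumed limit for this sketch


def is_plausible_email(value):
    """Length check, split on the last @, require non-blank text on both sides."""
    if not value or "@" not in value or len(value) > MAX_EMAIL_LENGTH:
        return False
    # rsplit keeps any earlier @ (e.g. inside a quoted-string) in user_part.
    user_part, domain_part = value.rsplit("@", 1)
    return bool(user_part.strip()) and bool(domain_part.strip())


for address in [
    "Надія@example.ua",
    '"Lipinski, Hubert"@ccmail.example.net',
    "admin@[10.10.0.1]",
    "@example.com",
    "user@ ",
    "not an email @ all",
]:
    print(address, is_plausible_email(address))
```

The first three addresses pass and the next two fail. The last one also passes, which is the trade-off mentioned under option 2: the check admits some strings that are not valid addresses.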