EmailValidator simplification and international email addresses

These two EmailValidator cleanup tickets have been open since 2016. The first is about simplifying the EmailValidator regular expressions, and the second would support non-ASCII email address internationalization (EAI, like Надія@example.ua). But the discussion on the two has become intertwined, and they’ve accumulated a series of stalled or abandoned PRs.

After discussion here, I’m hoping to either update (or replace) them with a consensus decision that would allow PRs to land, or to close them wontfix. (Or maybe one of each.)

[A “local-part” is the username part of an email address, before the @.]

Simplifying EmailValidator regular expression

As currently written, ticket-26423 would replace the EmailValidator regexps with the HTML5 valid email address check. But the HTML5 rule:

  • is ASCII-only, so would block internationalized domain names (IDNs) such as editor@עִתוֹן.example.il (which Django has long allowed). Handling IDNs as recommended in HTML5 would require a new library.[1]
  • deliberately disallows quoted-string local-parts, such as "Lipinski, Hubert"@ccmail.example.net.[2] EmailValidator currently allows these, so this change would require a deprecation strategy that isn’t described in the ticket.
  • doesn’t allow IP address domain literals such as admin@[10.10.0.1] (which were specifically added to Django in ticket-16166).
  • doesn’t help with EAI. (WHATWG has been discussing this.)
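To make those rejections concrete, here is a rough Python sketch of the HTML5 check. The pattern is transcribed from the regular expression in the HTML Standard's "valid email address" definition; the names HTML5_EMAIL_RE and html5_valid are just for illustration and are not part of Django.

```python
import re

# Transcribed from the HTML Standard's "valid email address" regular expression.
# (The names here are illustrative only; nothing in Django is called this.)
HTML5_EMAIL_RE = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
)


def html5_valid(address):
    """Return True if the whole address matches the HTML5 pattern."""
    return HTML5_EMAIL_RE.fullmatch(address) is not None


# The example addresses from the list above, plus one plain ASCII address.
results = {
    address: html5_valid(address)
    for address in [
        "user@example.com",
        "editor@עִתוֹן.example.il",
        '"Lipinski, Hubert"@ccmail.example.net',
        "admin@[10.10.0.1]",
        "Надія@example.ua",
    ]
}
```

Only the plain ASCII address passes; the IDN, quoted-string, domain literal, and EAI examples are all rejected.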

In the time since ticket-26423 was created, EmailValidator has been simplified somewhat by borrowing DomainNameValidator’s regexps after the @. (But it still has a gnarly regexp for the local-part, plus a simpler one for IP address domain literals.)

Some options:

  1. Use just the local-part half of the HTML5 rule, and keep the current EmailValidator logic for the domain.

    The HTML5 local-part check can be implemented efficiently with set operations, avoiding regular expressions. Supporting EAI would require a separate (but straightforward) extension, described in item 1 of the next section.

    Since this would reject some emails that EmailValidator currently allows, it would need a deprecation strategy.

  2. Just check for an @ with some characters on either side and get rid of the regexps.[3]

    Several people have suggested this, including in the original developers list discussion. (But mostly in comments and PR feedback on the other EAI ticket.)

    This simplified check supports EAI. It allows all emails that EmailValidator already accepts, so maybe wouldn’t require deprecation? But it also permits some invalid emails, so maybe would need deprecation?

  3. Check for an @, but only allow certain characters on either side.

    This is essentially what HTML5 does, but involves inventing our own rules for the allowable characters and/or patterns. (I anticipate a lot of debate on the details.)

    Depending on the specific decisions, it may or may not support EAI, might or might not require deprecation, and could involve new regular expressions or might be implementable without them.

  4. Stick with what we have and close ticket-26423 wontfix.

    The current regexps are ugly, but not as bad as they used to be, and are reasonably well tested by now.
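For option 1, the set-based local-part check might look something like this. It is a minimal sketch, assuming the allowed characters are the ones in the local-part half of the HTML5 rule; HTML5_LOCAL_CHARS and is_html5_local_part are made-up names, not existing Django code.

```python
import string

# Characters the HTML5 rule permits before the "@" (assumed from the HTML Standard).
HTML5_LOCAL_CHARS = frozenset(
    string.ascii_letters + string.digits + ".!#$%&'*+/=?^_`{|}~-"
)


def is_html5_local_part(local_part):
    """True if local_part is non-empty and uses only HTML5-allowed characters."""
    # issuperset() accepts any iterable, so the str is checked character by character.
    return bool(local_part) and HTML5_LOCAL_CHARS.issuperset(local_part)
```

Like the HTML5 rule itself, this accepts leading, trailing, and consecutive dots, and it rejects quoted-string local-parts because the double quote, comma, and space are not in the set.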

Supporting EAI non-ASCII local-parts

ticket-27029 would update the EmailValidator to permit non-ASCII characters in the local-part. This is one component of supporting EAI in Django. (It’s worth noting that use of EAI addresses is still quite low. But an important barrier to adoption is the lack of EAI support in web technologies.)

We’d follow RFC 6532 section 3.2, and update EmailValidator to allow UTF8-non-ascii characters everywhere alphabetic characters are currently valid in the local-part.

  1. With Option 1 above (check HTML5 valid local-part), we'd include UTF8-non-ascii in the set of allowable local-part characters. (There are some efficient ways to do this using set arithmetic and other non-regexp str operations.)
  2. With Option 2 above (just check for @), EAI is already supported. We'd close ticket-27029 as duplicate.

  3. With Option 3 above (roll our own simplified rules), more discussion would be needed depending on the specific rules we come up with.

  4. With Option 4 above (keep existing EmailValidator regular expressions), we'd update EmailValidator.user_regex to include UTF8-non-ascii, in both the dot-atom and quoted-string sections. (There are some PRs that have come close.)

  5. Or with any option for ticket-26423, we could instead decide not to support non-ASCII local-parts (for now), and close ticket-27029 wontfix.
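As a rough illustration of item 1, a set-based local-part check could be extended for EAI along these lines. This is only a sketch: it treats every character above U+007F as UTF8-non-ascii without any further filtering, and the names (ASCII_LOCAL_CHARS, is_valid_local_part, allow_non_ascii) are hypothetical.

```python
import string

# ASCII characters allowed by the HTML5 local-part rule (assumed from the HTML Standard).
ASCII_LOCAL_CHARS = frozenset(
    string.ascii_letters + string.digits + ".!#$%&'*+/=?^_`{|}~-"
)


def is_valid_local_part(local_part, allow_non_ascii=True):
    """Check a local-part against the ASCII allow-list, optionally permitting non-ASCII."""
    if not local_part:
        return False
    # Set difference leaves only the characters outside the ASCII allow-list.
    leftovers = set(local_part) - ASCII_LOCAL_CHARS
    if not allow_non_ascii:
        return not leftovers
    # Anything left over must be non-ASCII; disallowed ASCII (space, quotes, ...) fails.
    return all(ord(ch) > 0x7F for ch in leftovers)
```

The allow_non_ascii flag is there only to show how an opt-in or opt-out could be wired up. It is not a proposal for the actual parameter name.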

Other wrinkles

  • Deprecation: Any option that needs a deprecation strategy would probably involve temporarily adding a new setting to select new vs. deprecated behavior.

  • Make a new validator: Several comments suggest putting the new behavior in a new validator with a different name (SimpleEmailValidator, UnicodeEmailValidator), so users could opt in without breaking existing code. I think that defeats the purpose of ticket-26423. Given the current adoption of EAI, it might be reasonable for ticket-27029, but it feels confusing to me. (I suppose we could also make it an init option, like DomainNameValidator’s accept_idna.)

  • SMTP backend doesn’t handle EAI: Once ticket-27029 is implemented, you could end up with valid email addresses that Django can’t email. (That’s ticket-35714.) I think this is worth mentioning in the release notes, but shouldn’t block progress on EAI. (It is a valid email address.)

  • EAI security concerns: Allowing Unicode local-parts might introduce new security issues similar to the one fixed in Django 3.0.1. Unicode is complicated.

    In ticket-27029, Collin Anderson pointed out some recommendations from UTS39 that could be helpful. We might also want to apply their recommended normalizations in forms.fields.EmailField.


  1. A few paragraphs earlier, HTML5 recommends converting IDNs to ASCII before the valid email address check. WHATWG’s domain to ASCII algorithm specifies UTS46 encoding, which isn’t fully supported in any production Python library (though “somebody” seems to be working on it). In any case, Django tends to resist adding new dependencies. ↩︎

  2. The email spec RFC 5322 and predecessors have always allowed quoted-string local-parts. I think they’re mainly intended for gateways into non-Internet email systems with different username formats. I’ve never seen one in real-world use, and they’re a popular source of obscure bugs in various email implementations. (Also, don’t confuse this with a friendly/human-readable “display-name,” which is also often a quoted-string but is separate from the local-part.) ↩︎

  3. Specifically, the simplified EmailValidator would keep the length check, keep the rsplit() on @, but replace everything else with a check that both user_part.strip() and domain_part.strip() are non-empty. ↩︎
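The simplified check described in footnote 3 could be sketched as a standalone function like this. It is not the real EmailValidator (which is a class that raises ValidationError); is_plausible_email is a made-up name, and the 320 limit is my assumption about the current length check.

```python
MAX_LENGTH = 320  # Assumed to match the length limit in Django's current EmailValidator.


def is_plausible_email(value):
    """Option 2: an "@" with something non-blank on either side, within the length limit."""
    if not value or "@" not in value or len(value) > MAX_LENGTH:
        return False
    # Split on the last "@", so a quoted local-part containing "@" stays intact.
    user_part, domain_part = value.rsplit("@", 1)
    return bool(user_part.strip()) and bool(domain_part.strip())
```

This accepts Надія@example.ua and "Lipinski, Hubert"@ccmail.example.net alike, and rejects only inputs such as "@example.com", "user@", or a bare " @ ".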


It’s worth pointing out explicitly that current Django behavior is an inconsistent mix of the HTML5 validation rule and the laxer EmailValidator. A Django EmailField form field applies EmailValidator during server-side form validation, but it also triggers a compliant user agent to apply the HTML5 validation rule during client-side form validation. (EmailField defaults to EmailInput as its widget, which defaults its HTML input type to "email", which gets client-side HTML5 email validation.)

It probably is not ideal that currently if you “behave” – by obeying client-side validation rules – there are email addresses that will be rejected, which would be accepted if you “misbehaved” and just sent the form payload regardless of what the client-side validation suggests.

Also, for the record, I think Django should probably just go all in on the HTML5 validation rule, since that’s what compliant user agents will be applying.

Just to be clear, literally following HTML5 requires pulling in a third-party UTS46 library. Or dropping support for IDNs. Both of those seem like non-starters.

Mostly following HTML5, but without new dependencies, would be option 1 or option 3 above. (And that’s why I’m bringing this up.)

As far as I can tell, the shortcoming of the idna package is just that it doesn’t support Unicode 16’s redefinition of UTS 46. But the latest released Python version – 3.13 – itself seems to still be on Unicode 15 (specifically 15.1.0). So does lack of Unicode 16 support in idna actually make a difference that would matter to us?

Also, if there’s a worry about adding a dependency, is there a way to get incrementally closer to where we’d want to be from just using Python’s own built-in punycode codec even though it’s not the latest-and-greatest version?

And separately from the IDNA issues, I do think Django should phase out support for some of the more unusual constructs you can use in email addresses. Losing support for quoted-string, IP addresses, etc. is a good thing in my opinion.

I don’t have many opinions about the right side (the domain side), but it would be nice to keep validating the domain name if possible without making things too complicated. And, yes, maybe excluding IP addresses by default would be nice?

For the local part, I’d personally suggest keeping the default validator to be ASCII-only (maybe removing quotes) and making Unicode local-part available as an opt-in, not on by default.

ICANN’s Universal Acceptance Steering Group (UASG) says to do Option #2 for the local part: don’t validate at all. I personally don’t like it because of all of the wrinkles, but it is what it is. This would be easy to implement and I think Django should provide it as an opt-in option. If it’s not on by default then I don’t see it as a big problem. If/when browsers or HTML allow Unicode local-parts then I think it makes sense to change the default to match that.

I’m guessing Django probably shouldn’t be in the business of coming up with character restrictions (Option #3). We can still have character restrictions by default in the ASCII validator. I really wish there was a sane email standard that had restricted characters, NFC or NFKC normalization, and case-insensitivity like domain names, but I don’t think that standard exists. UTS39 is the closest thing I’ve seen, and a UTS39EmailValidator option would be kinda nice, but I don’t think it would be easy to implement. Edit: or maybe we just implement the easy parts of UTS39? Filtering Identifier_Status=Allowed is pretty easy because it’s included in unicodedata, and can even be done using regex. Mixed scripts/numbers data is not included in unicodedata.
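For anyone not familiar with the normalization forms mentioned here, a small sketch using Python's standard unicodedata module shows the difference. The sample strings are my own, and this only illustrates what NFC and NFKC do; it is not a recommendation for which one (if any) an email field should apply.

```python
import unicodedata

decomposed = "Jose\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT
composed = "Jos\u00e9"     # the same text using precomposed U+00E9

# NFC composes the two spellings into the same string.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed

# Mathematical italic "med" (U+1D45A, U+1D452, U+1D451).
math_italic = "\U0001D45A\U0001D452\U0001D451"

# NFKC also folds compatibility characters, so the styled letters become plain ASCII.
nfkc_folded = unicodedata.normalize("NFKC", math_italic)

# NFC leaves compatibility characters alone.
nfc_kept = unicodedata.normalize("NFC", math_italic)
```

NFKC is the more aggressive of the two: it rewrites the mathematical letters as "med", while NFC keeps them as distinct characters.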

I live in a country where accents are part of everyday life. Even though I agree that using UTF-8 in domains or email addresses is one of the worst ideas ever, the reality is that I’m a big no on anything that would be ASCII-only by default.

I think that’s been a concern in past tickets that wanted to add the idna package.

Even if a new dependency were OK, the idna package only implements the “preprocessing” part of UTS46. WHATWG has a specific example that makes it clear preprocessing alone isn’t good enough. (I’m working on a complete Python uts46 implementation, but it’s very new.)

Not really. Python’s built-in idna codec implements IDNA 2003, which rejects some newer valid domains. (Real-world example: މިހާރު.com.)

(The punycode codec will encode almost anything to ASCII—it’s kind of like base64. IDNA 2003, IDNA 2008 and UTS46 are all IDN encodings that use Punycode, an xn-- prefix, and a bunch of additional validation. Although HTML5’s type=email language is a bit imprecise, it’s calling for some sort of “IDN” encoding, and almost certainly means the “domain to ASCII” algorithm from WHATWG’s URL Standard—a.k.a. UTS46 non-transitional.)
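To illustrate the difference using only Python's built-in codecs (the sample label and the try_idna_2003 helper are mine, chosen for illustration):

```python
label = "b\u00fccher"  # "bücher"

# The punycode codec is just the raw encoding: no xn-- prefix, no validation.
raw_punycode = label.encode("punycode")

# It will also happily encode text that is not a domain name at all.
not_a_domain = "not a domain!".encode("punycode")

# The idna codec applies IDNA 2003 to each label and adds the xn-- prefix.
idna_2003 = (label + ".example").encode("idna")


def try_idna_2003(domain):
    """Return the IDNA 2003 encoding, or the UnicodeError if the codec rejects it."""
    try:
        return domain.encode("idna")
    except UnicodeError as exc:
        return exc


# The real-world example mentioned above; the result depends on the codec's rules.
thaana_result = try_idna_2003("މިހާރު.com")
```

Here raw_punycode is b"bcher-kva" and idna_2003 is b"xn--bcher-kva.example". I've deliberately not stated an expected value for thaana_result; inspect it on your own Python version.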

Substituting algorithms is what I’d call “mostly following HTML5.” If we go that route, we’d be losing at least some of the benefits of offloading these decisions to WHATWG. Which puts it under “option 3.”

This has come up in some of the past discussions, too. It’s a pretty compelling argument, that there’s an “impedance mismatch” of sorts between Django’s rendered <input type=email> and the server-side EmailValidator. And that mismatch will cause confusion (and bug reports).

The more I’ve looked into it, there are a bunch of problems that convince me we shouldn’t be trying to replicate HTML5’s valid email address in Django:

  • Practical difficulties of getting an accurate Python implementation without adding dependencies.

  • EmailValidator is not just for HTML form inputs. It’s also the default validator for the model EmailField, and users and third party libraries could be using it in all kinds of other contexts where HTML5’s concept of valid email address isn’t applicable.[1]

  • The three major browser engines don’t actually agree on the HTML5 rules. (And it’s been that way since at least 2016.) When an email input contains an IDN:

    • Chrome posts UTS46 encoded ASCII[2]
    • Firefox validates against UTS46, but posts the original Unicode IDN
    • Safari rejects IDNs in email inputs altogether, unless you use novalidate

    [Given these inconsistencies, I’d argue Django should switch the EmailInput widget from <input type=email> to <input type=text inputmode=email>. But that’s a whole different discussion.][3]

Considering all that, I’m turning pretty negative on any variation of the HTML5 rules in the EmailValidator.


  1. E.g., I have a Django app that imports received email messages and parses the From address into a DB EmailField. The HTML5 rules could break this for some addresses that are provably “valid” in that email was actually sent and received from them (with a valid signature). ↩︎

  2. It gets worse: Chrome actually uses obsolete UTS46 transitional processing for email inputs, which WHATWG disallowed in 2017, and which is different from how Chrome handles IDNs in URLs. And with novalidate, Chrome might post the raw Unicode value or might encode it with UTS46 anyway, depending on the particular IDN. ↩︎

  3. Also, there’s probably a bug that if you have an IDN email, Django’s password reset form will only work properly in the same browser you used for registration. ↩︎

The more I look into this, the more Option 2 (not validating the local-part at all) seems to make the most sense.

The reality of email is that interpretation of the local-part is entirely up to the receiving mailer. Only example.com knows whether these are the same, different, or invalid email recipients:

medmunds@example.com
MEdmunds@example.com
M.Edmunds@example.com
medmunds+django@example.com

(If you substitute gmail.com, they’ll all go to me. If you substitute outlook.com, some will and some won’t. There’s no standard for email aliases.)

Even without Unicode, there’s no way to know which local-parts will be “valid” at the receiving end. The best you can do is determine whether it’s something that can be validly sent over SMTP.

And the SMTP rules can’t be 100% represented in a regular expression. Django’s current EmailValidator comes pretty close. It’s good enough. But “non-empty” is probably almost as good, and is a lot simpler to maintain. (Either way, you’ll find out for sure when you try to send it.)

Unicode doesn’t change any of that. Yes, there are confusables (though we have those in ASCII too: google vs googIe). And there are multiple ways to spell the same characters. And abusable script combinations. That kind of thing matters a lot at the receiving end. If you’re issuing email addresses to users, then UTS39 is helpful and important.

But Django isn’t issuing addresses. It’s not responsible for deciding whether Gmail should let someone create an account named 𝑚𝑒𝑑𝑚𝑢𝑛𝑑𝑠 (spelled with mathematical symbols). If Gmail ever does do that (and I sure hope they don’t), then 𝑚𝑒𝑑𝑚𝑢𝑛𝑑𝑠@gmail.com would, in fact, be a valid email address, and Django’s EmailValidator should allow it.

And since Django can’t know what rules Gmail or Outlook or example.com or anyone else enforces around local-parts, trying to validate the local-part is a losing game.
