Add support for internationalized email addresses #10522

aphillips · 2024-07-25T22:42:22Z

Replaces #5799.

Changes the ABNF and surrounding text to include support for non-ASCII characters on both the left and right side of the email address. Discussion in #4562 includes the genesis of these changes.

At least two implementers are interested (and none opposed):
- Firefox (edit by @annevk)
- …
Tests are written and can be reviewed and commented upon at:
- …
Implementation bugs are filed:
- Chrome: …
- Firefox: …
- Safari: …

(See WHATWG Working Mode: Changes for more details.)

/input.html ( diff )
/references.html ( diff )

- update branch - change 'valid host string' to 'valid domain string'

annevk · 2024-08-15T11:33:44Z

I would like to second the comment @hsivonen made in #5799 (comment). In particular I think that we need to replace the ABNF with two algorithms:

Is this input valid? This takes the email address, splits it on @ and attempts to validate both sides. This cannot rely on "valid domain string" as that is not a primitive browsers have. It will need to use the host parser.
Convert this input for submission. This needs to do the normalization @hsivonen discusses on the host side.

I suspect what we want is that 1 is a parsing job resulting in failure or a data structure and 2 is a serialization job of that data structure. That seems ideal.

WebKit can be considered supportive of fixing this and I'd also be willing to help nail something down along these lines, but the discussion in #4562 has made me rather unmotivated. 🙁

hsivonen · 2024-08-15T12:44:50Z

@annevk , We don't necessarily need two algorithms but one algorithm that converts input to submittable form but can fail. Also, we need clarity on what the value exposes to JS (when the conversion succeeds and when it fails) and what form pattern matches on.

@aphillips , What's the purpose of hosting "." out of utext when localpart is defined as in the PR? AFAICT, the ABNF would mean the same thing if "." was in utext. Am I reading the ABNF wrong or is it intentional that the ABNF doesn't end up ruling out leading, trailing, or consecutive dots? (In this comment, I'm not taking a position on whether it should.)

hsivonen · 2024-08-15T13:20:29Z

The algorithm could look like this:

Find the last instance of character U+0040 @ in input. If there isn’t one, return failure.

Let the part of input before the last U+0040 @ character be local and let the part after the last U+0040 @ character be domain.

If local is the empty string, return failure.

If local contains any of
Unpaired surrogates
U+0020 or below
U+007F
U+0022 ”
U+0028 (
U+0029 )
U+002C ,
U+003A :
U+003B ;
U+003C <
U+003E >
U+0040 @
U+005B [
U+005C
U+005D ]
return failure.

Let normalizedDomain be the result of https://url.spec.whatwg.org/#concept-domain-to-ascii with domain and true.

If normalizedDomain is failure, return failure.

Let normalizedLocal be local normalized to Unicode Normalization Form C.

Return the concatenation of normalizedLocal, U+0040 @, and normalizedDomain.

hsivonen · 2024-08-15T13:29:28Z

I think pattern should match on the submittable string.

hsivonen · 2024-08-16T08:51:54Z

Per discussion with @zcorpan , the algorithm should probably return failure if the label in the TLD position consists entirely of ASCII digits.

annevk · 2024-08-16T11:23:03Z

Oh, I missed that you don't run the host parser on the final argument. Why is that?

hsivonen · 2024-08-19T13:48:16Z

Oh, I missed that you don't run the host parser on the final argument. Why is that?

Two reasons:

Host parsing runs percent decoding, which seems inappropriate for the email address validation use case.
Host parsing and RFC 5321 disagree on how IPv4 addresses are represented, and it would be weird to support IPv6 but not IPv4 after the @ sign. Since there aren't strong use cases for either flavor of IP addresses after the @ sign in the HTML form context, it seems simpler to support neither.

Rejecting domains whose TLD consists entirely of ASCII digits ends up rejecting things that look like URL-style IPv4 addresses that are wrong in email addresses.

hsivonen · 2024-08-20T08:32:17Z

Demo of the above-described algorithm for experimentation

Code

The "Presentable" display isn't (at least at this time) a suggestion for browsers to change the user-entered visible contents of the field to the "Presentable" form, but it's what a server could derive from the submittable value by running ToUnicode (or mere labelwise Punycode decode) on the domain.

The point is to offer an opportunity to experiment with different input to see if the rejections seem appropriate.

Key differences compared to the current state of the PR:

Algorithmic rather than BNF.
The local part is normalized to NFC.
The submittable representation of the domain part is ASCII.
If the TLD is all ASCII digits, reject.

Notable non-differences:

What's valid in the local part is intended to match the current state of the PR.
What's valid in the domain part is intended to match the current state of the PR with the exception of the digit TLD check.

hsivonen · 2024-08-20T08:33:57Z

Oh, and let's surface the only comment I put in the code:

Do we want to support TLDs with MX records or require at least one more level in the domain name?

annevk · 2024-08-20T08:43:58Z

I looked a bit more at not using the host parser:

We end up passing beStrict is true (aside: we'd have to make this a named argument if we're going to invoke this from outside of URL in this way). This ends up preventing domains such as test@test_test.example.com from sending email. This does not seem great to me. It would also be the first instance where we have a normative dependency on beStrct for user agents, I think.
We end up missing the forbidden code point check. That's okay as it's redundant with beStrict, but see above.
We end up not percent-decoding. Seems okay.
I think the TLD check you mention should be https://url.spec.whatwg.org/#ends-in-a-number-checker for parity.

I also see the host parser has an assert on non-empty input. Not sure if ToASCII relies on that, but we probably want to double check input such as test@ or @ is handled properly.

hsivonen · 2024-08-20T08:56:15Z

I looked a bit more at not using the host parser:

We end up passing beStrict is true (aside: we'd have to make this a named argument if we're going to invoke this from outside of URL in this way). This ends up preventing domains such as test@test_test.example.com from sending email. This does not seem great to me.

Not catching various non-DNS characters as typos when the domain is supposed to be for DNS naming isn't great, either. In the Gecko case, it would be easy to apply an ASCII deny list that's the STD3 ASCII deny list with the exception of allowing the underscore if underscore in domains is a thing that's relevant to input type=email. (From DNS usage that I've seen so far, the underscore seems to occur in various DNS-based protocols that look up stuff stored in TXT records, not MX records.)

It would also be the first instance where we have a normative dependency on beStrct for user agents, I think.

Part of what I'm trying to demo is showing how little code this demo needs when using the libraries that Gecko uses. That is, in terms of Gecko-developer-facing code, beStrict=true is not onerous. (Not sure what the binary size impact is if new combinations of code get inlined, but the code is supposed to be designed to be defensive against excessive repetition caused by inlining.)

I also see the host parser has an assert on non-empty input.

This is implied by beStrict=true by the way of VerifyDnsLength.

Not sure if ToASCII relies on that, but we probably want to double check input such as test@ or @ is handled properly.

As can be seen from my demo, these get rejected.

hsivonen · 2024-08-20T15:12:39Z

AFAICT, an underscore in the domain part currently results in .validity reporting typeMismatch: true in Firefox, Safari, and Chrome. As far as I'm aware, this isn't a source of complaints, so it seems OK to keep using the STD3 ASCII deny list.

aphillips · 2024-08-22T23:23:36Z

@hsivonen asked:

@aphillips , What's the purpose of hosting "." out of utext when localpart is defined as in the PR? AFAICT, the ABNF would mean the same thing if "." was in utext. Am I reading the ABNF wrong or is it intentional that the ABNF doesn't end up ruling out leading, trailing, or consecutive dots? (In this comment, I'm not taking a position on whether it should.)

I think it's a mistake, since backtracking through various specs leads to localpart using something like 1*utext *("." *utext). I'm liking what you propose elsewhere on the thread.

hsivonen · 2024-09-17T06:54:39Z

@hsivonen asked:

@aphillips , What's the purpose of hosting "." out of utext when localpart is defined as in the PR? AFAICT, the ABNF would mean the same thing if "." was in utext. Am I reading the ABNF wrong or is it intentional that the ABNF doesn't end up ruling out leading, trailing, or consecutive dots? (In this comment, I'm not taking a position on whether it should.)

I think it's a mistake, since backtracking through various specs leads to localpart using something like 1*utext *("." *utext).

I now see that the BNF oddity comes from the pre-existing BNF in the spec. It seems prudent not to reject ASCII local parts that the ASCII-only formulation currently in the spec accepts, and the BNF in this PR and the algorithmic formulation in my previous comment here accomplish that.

I'm liking what you propose elsewhere on the thread.

Great!

@annevk said:

I think the TLD check you mention should be https://url.spec.whatwg.org/#ends-in-a-number-checker for parity.

I'm OK with this.

aphillips added 4 commits July 25, 2024 15:17

Apply changes requested.

d557d04

- update branch - change 'valid host string' to 'valid domain string'

Add missing ref and missing href

0966db0

Remove extra whitespace

816ee0f

Change 2119-keyword in note

1abc9b3

This was referenced Jul 31, 2024

Upcoming WHATNOT meeting on 2024-07-25 #10496

Closed

Upcoming WHATNOT meeting on 2024-08-01 #10534

Closed

annevk added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label Aug 15, 2024

annevk mentioned this pull request Aug 15, 2024

Fix #4562: add support for internationalized email addresses #5799

Closed

3 tasks

w3cbot mentioned this pull request Aug 15, 2024

Add support for internationalized email addresses w3c/i18n-activity#1898

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for internationalized email addresses #10522

Add support for internationalized email addresses #10522

aphillips commented Jul 25, 2024 •

edited by pr-preview bot

Loading

annevk commented Aug 15, 2024

hsivonen commented Aug 15, 2024 •

edited

Loading

hsivonen commented Aug 15, 2024

hsivonen commented Aug 15, 2024 •

edited

Loading

hsivonen commented Aug 16, 2024

annevk commented Aug 16, 2024

hsivonen commented Aug 19, 2024

hsivonen commented Aug 20, 2024

hsivonen commented Aug 20, 2024

annevk commented Aug 20, 2024

hsivonen commented Aug 20, 2024

hsivonen commented Aug 20, 2024

aphillips commented Aug 22, 2024

hsivonen commented Sep 17, 2024

Add support for internationalized email addresses #10522

Are you sure you want to change the base?

Add support for internationalized email addresses #10522

Conversation

aphillips commented Jul 25, 2024 • edited by pr-preview bot Loading

annevk commented Aug 15, 2024

hsivonen commented Aug 15, 2024 • edited Loading

hsivonen commented Aug 15, 2024

hsivonen commented Aug 15, 2024 • edited Loading

hsivonen commented Aug 16, 2024

annevk commented Aug 16, 2024

hsivonen commented Aug 19, 2024

hsivonen commented Aug 20, 2024

hsivonen commented Aug 20, 2024

annevk commented Aug 20, 2024

hsivonen commented Aug 20, 2024

hsivonen commented Aug 20, 2024

aphillips commented Aug 22, 2024

hsivonen commented Sep 17, 2024

aphillips commented Jul 25, 2024 •

edited by pr-preview bot

Loading

hsivonen commented Aug 15, 2024 •

edited

Loading

hsivonen commented Aug 15, 2024 •

edited

Loading