Add support for internationalized email addresses #10522
- update branch - change 'valid host string' to 'valid domain string'
I would like to second the comment @hsivonen made in #5799 (comment). In particular I think that we need to replace the ABNF with two algorithms:
I suspect what we want is that 1 is a parsing job resulting in failure or a data structure and 2 is a serialization job of that data structure. That seems ideal. WebKit can be considered supportive of fixing this and I'd also be willing to help nail something down along these lines, but the discussion in #4562 has made me rather unmotivated. 🙁
@annevk, we don't necessarily need two algorithms, but rather one algorithm that converts input to submittable form and can fail. Also, we need clarity on what the @aphillips, what's the purpose of hosting
The algorithm could look like this:

1. Find the last instance of character U+0040 @ in input. If there isn’t one, return failure.
2. Let the part of input before the last U+0040 @ character be local and let the part after the last U+0040 @ character be domain.
3. If local is the empty string, return failure.
4. If local contains any of
5. Let normalizedDomain be the result of https://url.spec.whatwg.org/#concept-domain-to-ascii with domain and true.
6. If normalizedDomain is failure, return failure.
7. Let normalizedLocal be local normalized to Unicode Normalization Form C.
8. Return the concatenation of normalizedLocal, U+0040 @, and normalizedDomain.
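For experimentation, the steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the proposed spec text: `to_submittable`, `domain_to_ascii`, and `FORBIDDEN_LOCAL_CHARS` are hypothetical names; Python's built-in `idna` codec (IDNA 2003) stands in for the URL Standard's domain to ASCII with beStrict=true, so its accept/reject behaviour will differ in detail; and the set of disallowed local-part characters is left empty because that list is cut off in this thread.

```python
import unicodedata
from typing import Optional

# Placeholder only: the list of disallowed local-part characters is truncated
# in the thread, so it is deliberately left empty here rather than guessed.
FORBIDDEN_LOCAL_CHARS: frozenset = frozenset()


def domain_to_ascii(domain: str) -> Optional[str]:
    """Stand-in for the URL Standard's domain to ASCII (assumption).

    Uses Python's IDNA 2003 codec, which is not UTS #46 and is not strict.
    Treating an empty result as failure is also an assumption of this sketch.
    """
    try:
        result = domain.encode("idna").decode("ascii")
    except UnicodeError:
        return None
    return result if result else None


def to_submittable(value: str) -> Optional[str]:
    """Return the submittable form of value, or None for failure."""
    at = value.rfind("@")  # last U+0040 @
    if at == -1:
        return None
    local, domain = value[:at], value[at + 1:]
    if local == "":
        return None
    if any(ch in FORBIDDEN_LOCAL_CHARS for ch in local):
        return None
    normalized_domain = domain_to_ascii(domain)
    if normalized_domain is None:
        return None
    normalized_local = unicodedata.normalize("NFC", local)
    return normalized_local + "@" + normalized_domain
```

Returning `None` for failure keeps the single-algorithm shape discussed above: one function that either yields the submittable string or fails.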
I think
Per discussion with @zcorpan, the algorithm should probably return failure if the label in the TLD position consists entirely of ASCII digits.
Oh, I missed that you don't run the host parser on the final argument. Why is that?
Two reasons:
Rejecting domains whose TLD consists entirely of ASCII digits ends up rejecting things that look like URL-style IPv4 addresses, which are wrong in email addresses.
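A minimal sketch of that all-digits TLD check, assuming it runs on the already-ASCII domain. The function name `tld_is_all_ascii_digits` is hypothetical, and ignoring one trailing dot is an assumption of this sketch, not something settled in the thread.

```python
def tld_is_all_ascii_digits(ascii_domain: str) -> bool:
    """True if the label in the TLD position consists only of ASCII digits."""
    labels = ascii_domain.split(".")
    # Assumption: a single trailing dot (root label) is ignored.
    if len(labels) > 1 and labels[-1] == "":
        labels.pop()
    last = labels[-1]
    # Explicit range check: str.isdigit() would also accept non-ASCII digits.
    return last != "" and all("0" <= ch <= "9" for ch in last)
```

With this check, `user@192.168.0.1` would be rejected because its last label is `1`, while `user@example.c0m` would not.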
Demo of the above-described algorithm for experimentation

The "Presentable" display isn't (at least at this time) a suggestion for browsers to change the user-entered visible contents of the field to the "Presentable" form, but it's what a server could derive from the submittable value by running ToUnicode (or mere labelwise Punycode decode) on the domain. The point is to offer an opportunity to experiment with different input to see if the rejections seem appropriate.

Key differences compared to the current state of the PR:
Notable non-differences:
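The "mere labelwise Punycode decode" option mentioned for the "Presentable" display could look something like the sketch below. The function name `presentable` is hypothetical, and this is not the demo's actual code; it uses Python's built-in `punycode` codec and performs no ToUnicode validation.

```python
def presentable(submittable: str) -> str:
    """Decode xn-- labels in the domain of a submittable address for display."""
    at = submittable.rfind("@")
    if at == -1:
        raise ValueError("not a submittable address: no @")
    local, domain = submittable[:at], submittable[at + 1:]
    labels = []
    for label in domain.split("."):
        if label.lower().startswith("xn--"):
            # Strip the ACE prefix and decode the remainder as Punycode.
            label = label[4:].encode("ascii").decode("punycode")
        labels.append(label)
    return local + "@" + ".".join(labels)
```

The local part is passed through untouched, since only the domain was converted to ASCII when producing the submittable value.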
Oh, and let's surface the only comment I put in the code: Do we want to support TLDs with MX records or require at least one more level in the domain name?
I looked a bit more at not using the host parser:
I also see the host parser has an assert on non-empty input. Not sure if ToASCII relies on that, but we probably want to double check input such as
Not catching various non-DNS characters as typos when the domain is supposed to be for DNS naming isn't great, either. In the Gecko case, it would be easy to apply an ASCII deny list that's the STD3 ASCII deny list with the exception of allowing the underscore if underscore in domains is a thing that's relevant to
Part of what I'm trying to demo is showing how little code this demo needs when using the libraries that Gecko uses. That is, in terms of Gecko-developer-facing code, beStrict=true is not onerous. (Not sure what the binary size impact is if new combinations of code get inlined, but the code is supposed to be designed to be defensive against excessive repetition caused by inlining.)
This is implied by beStrict=true by way of
As can be seen from my demo, these get rejected.
AFAICT, an underscore in the domain part currently results in
@hsivonen asked:
I think it's a mistake, since backtracking through various specs leads to localpart using something like
I now see that the BNF oddity comes from the pre-existing BNF in the spec. It seems prudent not to reject ASCII local parts that the ASCII-only formulation currently in the spec accepts, and the BNF in this PR and the algorithmic formulation in my previous comment here accomplish that.
Great!
@annevk said:
I'm OK with this.
Replaces #5799.
Changes the ABNF and surrounding text to include support for non-ASCII characters on both the left and right side of the email address. Discussion in #4562 includes the genesis of these changes.
(See WHATWG Working Mode: Changes for more details.)
/input.html ( diff )
/references.html ( diff )