[Feature Request] Rigorously specify e-mail address validation #236

Open
jchadwick-buf opened this issue Aug 5, 2024 · 0 comments
Labels
Feature New feature or request

jchadwick-buf commented Aug 5, 2024

Feature description:
E-mail address validation is underspecified and underdocumented, and protovalidate implementations in different languages use very different e-mail parsing code paths, leading to different validation results in edge cases. E-mail validation should be rigorously specified and implemented so that the results are consistent across programming languages.

Furthermore, the e-mail validation should be as minimally surprising as possible, so we should leverage existing industry standards as much as possible, particularly ones that reflect the real world and don't hinder e.g. internationalization.

Also, the conformance test suite should be expanded to ensure that the edge cases are consistent across implementations.

Proposed implementation or solution:
I suggest we use the e-mail validation specified in the WHATWG HTML standard, for the following reasons:

  • It is the validation format adopted by web browsers for <input type="email">
  • RFC 5322, the standard that authoritatively defines e-mail address formatting, is woefully out of touch with real-world implementations.
  • Standards that build on RFC 5322, like RFC 6531 which adds support for internationalized e-mail addresses, are often incomplete and ambiguous, and often themselves not standardized.
  • We can lean on regex engines to implement it if we want. Chrome uses it this way, and it is a simple enough regex that it should work fine in more restrictive engines like re2. Since the grammar is very simple and has few productions, hand-written parsers should also be very easy to implement.
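
As a rough illustration of the regex route, here is a minimal Python sketch. The pattern is the one the WHATWG HTML standard publishes for a valid e-mail address; the function name `is_whatwg_email` is only illustrative and is not part of protovalidate.

```python
import re

# Pattern as published in the WHATWG HTML standard ("valid e-mail address").
# The hyphen sits last in the first character class so it is a literal.
_WHATWG_EMAIL = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
)


def is_whatwg_email(value: str) -> bool:
    """Hypothetical helper: True if value matches the WHATWG pattern."""
    # fullmatch anchors both ends, so a trailing newline is rejected too.
    return _WHATWG_EMAIL.fullmatch(value) is not None
```

Note that this pattern uses only character classes, bounded repetition, and non-capturing groups, which is why it should be portable to more restrictive engines.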

I did some exploration into what it would look like to implement RFC 5322-based e-mail address validation, which I will provide here:

Exploring RFC 5322 for e-mail address validation

RFC 5322 rules

Here is a summary of the grammar productions relevant to the local-part of an e-mail address, according to RFC 5322. Per our current validation, productions beginning with 'obs-' should probably be disallowed, as well as productions allowing folding whitespace within e-mail addresses.

We'll ignore the domain part of the address, since protovalidate already has an approach to validating hostnames anyway.

; rfc5234 rules
ALPHA           =   %x41-5A / %x61-7A  ; A-Z / a-z
CR              =   %x0D               ; carriage return
LF              =   %x0A               ; linefeed
CRLF            =   CR LF              ; Internet standard newline
DIGIT           =   %x30-39            ; 0-9
DQUOTE          =   %x22               ; " (Double Quote)
HTAB            =   %x09               ; horizontal tab
SP              =   %x20
VCHAR           =   %x21-7E            ; visible (printing) characters
WSP             =   SP / HTAB          ; white space
; folding whitespace
obs-FWS         =   1*WSP *(CRLF 1*WSP)
FWS             =   ([*WSP CRLF] 1*WSP) /  obs-FWS
ctext           =   %d33-39 /          ; Printable US-ASCII
                    %d42-91 /          ;  characters not including
                    %d93-126 /         ;  "(", ")", or "\"
                    obs-ctext
ccontent        =   ctext / quoted-pair / comment
comment         =   "(" *([FWS] ccontent) [FWS] ")"
CFWS            =   (1*([FWS] comment) [FWS]) / FWS
; atom
atext           =   ALPHA / DIGIT /    ; Printable US-ASCII
                    "!" / "#" /        ;  characters not including
                    "$" / "%" /        ;  specials.  Used for atoms.
                    "&" / "'" /
                    "*" / "+" /
                    "-" / "/" /
                    "=" / "?" /
                    "^" / "_" /
                    "`" / "{" /
                    "|" / "}" /
                    "~"
atom            =   [CFWS] 1*atext [CFWS]
; quoted string
qtext           =   %d33 /             ; Printable US-ASCII
                    %d35-91 /          ;  characters not including
                    %d93-126 /         ;  "\" or the quote character
                    obs-qtext
quoted-pair     =   ("\" (VCHAR / WSP)) / obs-qp
qcontent        =   qtext / quoted-pair
quoted-string   =   [CFWS]
                    DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                    [CFWS]
word            =   atom / quoted-string
; obsolete productions
obs-NO-WS-CTL   =   %d1-8 /            ; US-ASCII control
                    %d11 /             ;  characters that do not
                    %d12 /             ;  include the carriage
                    %d14-31 /          ;  return, line feed, and
                    %d127              ;  white space characters
obs-ctext       =   obs-NO-WS-CTL
obs-qtext       =   obs-NO-WS-CTL
obs-qp          =   "\" (%d0 / obs-NO-WS-CTL / LF / CR)
obs-local-part  =   word *("." word)
; dot-atom
dot-atom-text   =   1*atext *("." 1*atext)
dot-atom        =   [CFWS] dot-atom-text [CFWS]
; local part
local-part      =   dot-atom / quoted-string / obs-local-part

Simplified RFC 5322 Rules

Here's a version of the above rules with whitespace disallowed outside of quotes and escapes and with obsolete productions removed.

; rfc5234 rules
ALPHA           =   %x41-5A / %x61-7A  ; A-Z / a-z
CR              =   %x0D               ; carriage return
LF              =   %x0A               ; linefeed
CRLF            =   CR LF              ; Internet standard newline
DIGIT           =   %x30-39            ; 0-9
DQUOTE          =   %x22               ; " (Double Quote)
HTAB            =   %x09               ; horizontal tab
SP              =   %x20
VCHAR           =   %x21-7E            ; visible (printing) characters
WSP             =   SP / HTAB          ; white space
; folding whitespace
FWS             =   ([*WSP CRLF] 1*WSP)
ctext           =   %d33-39 /          ; Printable US-ASCII
                    %d42-91 /          ;  characters not including
                    %d93-126           ;  "(", ")", or "\"
ccontent        =   ctext / quoted-pair / comment
comment         =   "(" *([FWS] ccontent) [FWS] ")"
CFWS            =   (1*([FWS] comment) [FWS]) / FWS
; atom
atext           =   ALPHA / DIGIT /    ; Printable US-ASCII
                    "!" / "#" /        ;  characters not including
                    "$" / "%" /        ;  specials.  Used for atoms.
                    "&" / "'" /
                    "*" / "+" /
                    "-" / "/" /
                    "=" / "?" /
                    "^" / "_" /
                    "`" / "{" /
                    "|" / "}" /
                    "~"
; quoted string
qtext           =   %d33 /             ; Printable US-ASCII
                    %d35-91 /          ;  characters not including
                    %d93-126           ;  "\" or the quote character
quoted-pair     =   ("\" (VCHAR / WSP))
qcontent        =   qtext / quoted-pair
quoted-string   =   DQUOTE *([FWS] qcontent) [FWS] DQUOTE
; dot-atom
dot-atom        =   1*atext *("." 1*atext)
; local part
local-part      =   dot-atom / quoted-string

Regular expression translation

It is possible to express this entire grammar using regular expressions, since it doesn't need backtracking or recursion.

; quoted string
qtext           =   /[\x21\x23-\x5b\x5d-\x7e]/
quoted-pair     =   /\\[ \t\x21-\x7E]/
qcontent        =   /[\x21\x23-\x5b\x5d-\x7e]|\\[ \t\x21-\x7E]/
quoted-string   =   /"((([ \t]*\r\n)?[ \t]+)?([\x21\x23-\x5b\x5d-\x7e]|\\[ \t\x21-\x7E]))*(([ \t]*\r\n)?[ \t]+)?"/
; dot-atom
atext           =   /[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]/
dot-atom        =   /[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*/
; local part
local-part      =   /[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"((([ \t]*\r\n)?[ \t]+)?([\x21\x23-\x5b\x5d-\x7e]|\\[ \t\x21-\x7E]))*(([ \t]*\r\n)?[ \t]+)?"/
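
One way to keep a translation like this checkable is to compose it from named pieces instead of writing the final pattern by hand. The following is a Python sketch under that assumption; the constant names are illustrative. It places the hyphen last in the atext class so it cannot form a range, treats CRLF as the two-character sequence, and groups qtext and quoted-pair together so that folding whitespace may precede either.

```python
import re

# atext: '-' is last so it is a literal rather than a range operator.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
DOT_ATOM = rf"{ATEXT}+(?:\.{ATEXT}+)*"

QTEXT = r"[\x21\x23-\x5b\x5d-\x7e]"
QUOTED_PAIR = r"\\[ \t\x21-\x7e]"
# FWS = ([*WSP CRLF] 1*WSP), with CRLF as the literal sequence \r\n.
FWS = r"(?:(?:[ \t]*\r\n)?[ \t]+)"
QUOTED_STRING = rf'"(?:{FWS}?(?:{QTEXT}|{QUOTED_PAIR}))*{FWS}?"'

LOCAL_PART = re.compile(rf"(?:{DOT_ATOM}|{QUOTED_STRING})")


def is_local_part(value: str) -> bool:
    """Hypothetical helper: True if value is a simplified RFC 5322 local-part."""
    return LOCAL_PART.fullmatch(value) is not None
```

A comma is a useful probe here: it is not in atext, so `is_local_part("a,b")` should be false, whereas a character class containing `+-/` as a range would accept it.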

Pseudo-code form

The above regular expression is unreadable and probably pretty slow. Here is the same grammar parsed with Go-like pseudo-code.

matchLocalPart returns the email address after the '@' if the local-part is valid, or an empty string if it is not.

Note that RFC 5322 does not allow the local-part to contain non-US-ASCII characters. RFC 6531 allows non-ASCII characters, but it is only a Proposed Standard. Either way, we can work on the byte level since we do not care about codepoints above 0x7F. (If we want to adopt the RFC 6531 behavior at any point, I believe we just want to allow >= 0x80 in qtext and atext.)

func matchLocalPart(email string) string {
	if len(email) == 0 {
		return ""
	}
	switch {
	case email[0] == '"':
		// quoted-string form
		email = matchQuotedString(email)
	case isAText(email[0]):
		// dot-atom form
		email = matchDotAtom(email)
	default:
		// Neither production can start here. This also rejects an empty
		// local-part such as "@example.com".
		return ""
	}
	// Both helpers return "" on failure.
	if len(email) == 0 || email[0] != '@' {
		return ""
	}
	return email[1:]
}

func matchQuotedString(email string) string {
	email = email[1:]
	for {
		if len(email) == 0 {
			return ""
		}
		switch email[0] {
		case '"':
			return email[1:]
		case '\\':
			if email = email[1:]; len(email) == 0 {
				return ""
			}
			if !isQuotedPair(email[0]) {
				return ""
			}
			email = email[1:]
		default:
			if !isQText(email[0]) && !isWSP(email[0]) {
				return ""
			}
			email = email[1:]
		}
	}
}

func matchDotAtom(email string) string {
	for {
		if len(email) == 0 {
			return ""
		}
		switch email[0] {
		case '@':
			return email
		case '.':
			if email = email[1:]; len(email) == 0 {
				return ""
			}
			fallthrough
		default:
			if !isAText(email[0]) {
				return ""
			}
			email = email[1:]
		}
	}
}

func isAText(b byte) bool {
	return (b >= 'a' && b <= 'z') ||
		(b >= 'A' && b <= 'Z') ||
		(b >= '0' && b <= '9') ||
		b == '!' || b == '#' || b == '$' || b == '%' ||
		b == '&' || b == '*' || b == '+' || b == '-' ||
		b == '/' || b == '=' || b == '?' || b == '^' ||
		b == '_' || b == '`' || b == '{' || b == '|' ||
		b == '}' || b == '~' || b == '\''
}

func isQText(b byte) bool {
	return b == '!' || (b >= '#' && b <= '[') || (b >= ']' && b <= '~')
}

func isQuotedPair(b byte) bool {
	return b == ' ' || b == '\t' || (b >= 0x21 && b <= 0x7e)
}

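// isWSP is looser than the ABNF WSP rule: it also accepts bare CR and LF,
// so matchQuotedString approximates FWS rather than enforcing the exact
// [*WSP CRLF] 1*WSP sequence.
func isWSP(b byte) bool {
	return b == ' ' || b == '\t' || b == '\r' || b == '\n'
}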

Here is a similar implementation in Python. This is written to work on a memoryview since it is more efficient to slice a memoryview than a str. Unlike the Go version, this version uses exception handling for errors.

from typing import Sequence

_AT = ord('@')
_DQUOTE = ord('"')
_BACKSLASH = ord('\\')
_PERIOD = ord('.')

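def _match_local_part(email: Sequence[int]) -> Sequence[int]:
    if len(email) == 0:
        raise Exception('Empty address')
    if email[0] == _DQUOTE:
        email = _match_quoted_string(email)
    elif _is_atext(email[0]):
        email = _match_dot_atom(email)
    else:
        # Neither production can start here. This also rejects an empty
        # local-part such as "@example.com".
        raise Exception('Invalid local part')
    # A quoted string may be the last thing in the input, so check the
    # length before looking for the '@'.
    if len(email) == 0 or email[0] != _AT:
        raise Exception('Invalid address')
    return email[1:]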

def _match_quoted_string(email: Sequence[int]) -> Sequence[int]:
    email = email[1:]
    while True:
        if len(email) == 0:
            raise Exception('Unexpected end of address')
        elif email[0] == _DQUOTE:
            return email[1:]
        elif email[0] == _BACKSLASH:
            email = email[1:]
            if len(email) == 0:
                raise Exception('Unexpected end of address')
            if not _is_quoted_pair(email[0]):
                raise Exception('Invalid quoted pair')
            email = email[1:]
        else:
            if not _is_qtext(email[0]) and not _is_wsp(email[0]):
                raise Exception('Invalid local part')
            email = email[1:]

def _match_dot_atom(email: Sequence[int]) -> Sequence[int]:
    while True:
        if len(email) == 0:
            raise Exception('Unexpected end of address')
        if email[0] == _AT:
            return email
        elif email[0] == _PERIOD:
            email = email[1:]
            if len(email) == 0:
                raise Exception('Unexpected end of address')
        if not _is_atext(email[0]):
            raise Exception('Invalid character')
        email = email[1:]

def _is_atext(b: int) -> bool:
    return (
        (b >= 0x61 and b <= 0x7a) or
        (b >= 0x41 and b <= 0x5a) or
        (b >= 0x30 and b <= 0x39) or
        b == 0x21 or b == 0x23 or b == 0x24 or b == 0x25 or
        b == 0x26 or b == 0x27 or b == 0x2a or b == 0x2b or
        b == 0x2d or b == 0x2f or b == 0x3d or b == 0x3f or
        b == 0x5e or b == 0x5f or b == 0x60 or b == 0x7b or
        b == 0x7c or b == 0x7d or b == 0x7e
    )

def _is_qtext(b: int) -> bool:
    return b == 0x21 or (b >= 0x23 and b <= 0x5b) or (b >= 0x5d and b <= 0x7e)

def _is_quoted_pair(b: int) -> bool:
    return b == 0x20 or b == 0x09 or (b >= 0x21 and b <= 0x7e)

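# Looser than the ABNF WSP rule: bare CR and LF are accepted too, so the
# quoted-string matcher approximates FWS rather than enforcing it exactly.
def _is_wsp(b: int) -> bool:
    return b == 0x20 or b == 0x09 or b == 0x0d or b == 0x0a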
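
For reference, the RFC 6531-style relaxation mentioned earlier (accepting bytes >= 0x80 in atext) could look roughly like the sketch below. It is self-contained and only covers the dot-atom form; the names `is_atext_utf8` and `is_dot_atom_utf8` are hypothetical, and well-formedness of the UTF-8 is assumed to come from encoding a `str`.

```python
# ASCII punctuation permitted in atext by RFC 5322.
_ATEXT_SPECIALS = frozenset(b"!#$%&'*+-/=?^_`{|}~")


def is_atext_ascii(b: int) -> bool:
    """True for the RFC 5322 atext bytes (letters, digits, listed specials)."""
    return (
        0x61 <= b <= 0x7A
        or 0x41 <= b <= 0x5A
        or 0x30 <= b <= 0x39
        or b in _ATEXT_SPECIALS
    )


def is_atext_utf8(b: int) -> bool:
    """Assumed relaxation: any byte of a multi-byte UTF-8 sequence is allowed."""
    return b >= 0x80 or is_atext_ascii(b)


def is_dot_atom_utf8(local_part: str) -> bool:
    """Hypothetical check: non-empty atoms of relaxed atext separated by dots."""
    try:
        data = local_part.encode("utf-8")
    except UnicodeEncodeError:
        # Lone surrogates cannot be encoded and are rejected.
        return False
    return all(
        len(atom) > 0 and all(is_atext_utf8(b) for b in atom)
        for atom in data.split(b".")
    )
```

Working on the encoded bytes keeps the classifier identical in shape to the ASCII-only version, which is the property the byte-level approach relies on.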

Summary

Implementing RFC 5322 rules in a readable fashion is doable in most target languages using a hand-written parser. It can be done in under 100 lines.

However, while this parser is strict enough to adhere to RFC 5322, it may be stricter than some real-world mail servers in some situations and more lenient in others, so it is far from ideal.

An implementation of the WHATWG HTML rules would be trivial. The local-part of the HTML version is close to a subset of the RFC 5322 version; specifically, it is almost identical to the dot-atom-text production (the main difference being that it does not restrict where dots may appear), and the matchDotAtom/_match_dot_atom pseudo-code examples should be a near match (after allowing codepoints above 0x7f in atext). Meanwhile, the hostname portion of the e-mail in the WHATWG HTML standard seems to be a near-exact match for the hostname validation that we already use for e-mail.
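
To give a sense of how small a hand-written version could be, here is a Python sketch of the WHATWG rules as I read them: a local-part of one or more atext or "." characters, then "@", then dot-separated labels of 1 to 63 letters, digits, or hyphens that neither start nor end with a hyphen. The function names are illustrative, and this sketch is ASCII-only.

```python
import string

_ALNUM = frozenset(string.ascii_letters + string.digits)
_ATEXT = _ALNUM | frozenset("!#$%&'*+-/=?^_`{|}~")


def _is_label(label: str) -> bool:
    """A hostname label: 1-63 chars, alphanumeric at both ends, hyphens inside."""
    if not 1 <= len(label) <= 63:
        return False
    if label[0] not in _ALNUM or label[-1] not in _ALNUM:
        return False
    return all(c in _ALNUM or c == "-" for c in label)


def is_whatwg_email_handwritten(value: str) -> bool:
    """Hypothetical hand-written equivalent of the WHATWG e-mail grammar."""
    local, sep, domain = value.partition("@")
    if not sep or not local:
        return False
    # Dots are unrestricted in the local-part under this grammar.
    if not all(c in _ATEXT or c == "." for c in local):
        return False
    # A second '@' would land in the domain and fail the label check.
    return all(_is_label(label) for label in domain.split("."))
```

Splitting on the first "@" is sufficient here because "@" is valid in neither the local-part nor a label, so any additional "@" causes a rejection.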

@jchadwick-buf jchadwick-buf added the Feature New feature or request label Aug 5, 2024
rodaine pushed a commit to bufbuild/protovalidate-cc that referenced this issue Aug 7, 2024
Updates Protobuf to v27 and protovalidate to v0.7.1, and fixes all of
the resulting compilation and conformance failures.

As one would expect, there was a tremendous amount of troubleshooting
involved in this thankfully-relatively-small PR. Here's my log of what
happened. I'll try to be succinct, but I want to capture all of the
details so my reasoning can be understood in the future.

- First, I tried to update protobuf. This led to pulling a newer version
of absl. The version of cel-cpp we use did not compile with this version
of absl.

- Next, I tried to update cel-cpp. However, the latest version of
cel-cpp is broken on macOS for two separate reasons
<sup>[1](google/cel-cpp#831),
[2](https://github.com/google/cel-cpp/issues/832)</sup>.

- After taking a break to work on other protovalidate implementations I
returned and tried another approach. This time, instead of updating
cel-cpp, I just patched it to work with newer absl. Thankfully, this
proved surprisingly viable. The `cel_cpp.patch` file now contains this
fix too.

- Unfortunately, compilation was broken in CI on a nonsensical compiler
error:
    ```
    error: could not convert template argument 'ptr' from 'const
    google::protobuf::Struct& (* const)()' to 'const
    google::protobuf::Struct& (* const)()'
    ```
    It seemed likely to be a compiler issue, thus I was stalled again.

- For some reason it finally occurred to me that I probably should just
simply update the compiler. In a stroke of accidental rubber-ducking
luck, I noticed that GitHub's `ubuntu-latest` had yet to actually move
to `ubuntu-24.04`, which has a vastly more up-to-date C++ toolchain than
the older `ubuntu-22.04`. This immediately fixed the problem.

- E-mail validation is hard. In other languages we fall back on standard
library functionality, but C++ puts us at a hard impasse; the C++
standard library hardly concerns itself with application-level
functionality like SMTP standards. Anyway, I channeled my frustration at
the lack of a consistent validation scheme for e-mail, which culminated
in bufbuild/protovalidate#236.

For the new failing test cases, we needed to improve the validation of
localpart in C++. Lacking any specific reference point, I decided it
would be acceptable if the C++ version started adopting ideas from
WHATWG HTML email validation. It doesn't move the `localpart` validation
to _entirely_ work like WHATWG HTML email validation, as our version
still has our specific checks, but now we are a strict subset in
protovalidate-cc, so we can remove our additional checks later if we can
greenlight adopting the WHATWG HTML standard.

- The remaining test failures are all related to ignoring validation
rules and presence. The following changes were made:
    - The algorithm for ignoring empty fields is improved to match the
    specified behavior more closely.
    - The `ignore` option is now taken into account in addition to the
    legacy `skipped` and `ignore_empty` options.
        - Support is added for `IGNORE_IF_DEFAULT_VALUE`.
    - An edge case is added to ignore field presence on synthetic `Map`
    types. I haven't traced down why, but `has_presence` seems to always be
    true for fields of synthetic `Map` types in the C++ implementation.
    (Except in proto3?)

And with that I think we will have working Editions support.