parse fails on non-ASCII Subject #82

Closed
joee opened this issue Oct 24, 2024 · 4 comments · Fixed by #83

joee commented Oct 24, 2024

Parsing the message below fails with the error "Failed reading: satisfy". I assume this is because the Subject contains non-ASCII characters.

While this might not be email-RFC-compliant, messages like this do make their way through my postfix -> sieve pipeline and arrive in my inbox. Ideally, I would like a way to process them using a lenient purebred-email parsing option.

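{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString.Char8 (ByteString)
import Data.Hex (unhex)
import Data.Maybe (fromMaybe)
import Data.MIME

main :: IO ()
main = do
  -- Hex of "Subject: ½ is a half\r\nTo: <example@example.com>\r\n\r\nbody\r\n";
  -- the "½" is the UTF-8 byte pair C2 BD, the only non-ASCII bytes in the message.
  let e = unhex "5375626A6563743A20C2BD20697320612068616C660D0A546F3A203C6578616D706C65406578616D706C652E636F6D3E0D0A0D0A626F64790D0A" :: Maybe ByteString
  let b = fromMaybe "" e
  let e' = parse (message mime) b
  print e'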
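
To confirm that non-ASCII bytes in the header block are what trips the parser, a small check along these lines may help. It is only a sketch using `bytestring`; the helper names `headerBlock`, `nonAsciiBytes`, and `sample` are made up for illustration and are not part of purebred-email.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Word (Word8)

-- | Everything before the first empty line (CRLF CRLF), i.e. the header block.
-- If there is no empty line, the whole input is returned.
headerBlock :: B.ByteString -> B.ByteString
headerBlock = fst . B.breakSubstring (BC.pack "\r\n\r\n")

-- | Offsets and values of all bytes outside the 7-bit ASCII range.
nonAsciiBytes :: B.ByteString -> [(Int, Word8)]
nonAsciiBytes bs = [ (i, w) | (i, w) <- zip [0 ..] (B.unpack bs), w >= 0x80 ]

-- | The same message as in the report, built from raw bytes (hypothetical name).
sample :: B.ByteString
sample =
  BC.pack "Subject: "
    <> B.pack [0xC2, 0xBD]  -- UTF-8 encoding of "½"
    <> BC.pack " is a half\r\nTo: <example@example.com>\r\n\r\nbody\r\n"

main :: IO ()
main = print (nonAsciiBytes (headerBlock sample))
-- prints [(9,194),(10,189)]
```

The two offsets point at the bytes C2 BD immediately after "Subject: ", which is consistent with the assumption above.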

frasertweedale commented Oct 25, 2024

There is an RFC for this (of course): RFC 6532, https://datatracker.ietf.org/doc/html/rfc6532. Implementing it properly would require significant API breakage.

However, for just Subject and other unstructured headers, we can get away without any API change. The header value will be a ByteString that is valid UTF-8. I can have a draft PR ready for you to review over the weekend.
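
For callers, a UTF-8 `ByteString` header value can be turned into text with the `text` package. The sketch below assumes the raw Subject value has already been extracted as a `ByteString` (how to do that with purebred-email is not shown here), and `decodeHeaderValue` is a hypothetical helper, not a library function. It decodes defensively, so a value that is not valid UTF-8 yields a `Left` rather than an exception.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Either (isLeft)
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')

-- | Decode a raw unstructured header value as UTF-8.
-- Invalid input is reported as a Left instead of throwing.
decodeHeaderValue :: B.ByteString -> Either String T.Text
decodeHeaderValue = either (Left . show) Right . decodeUtf8'

main :: IO ()
main = do
  -- Raw value of the Subject header from the report: "½ is a half".
  let value = B.pack [0xC2, 0xBD] <> BC.pack " is a half"
  print (T.length <$> decodeHeaderValue value)        -- prints Right 11
  -- 0xFF can never appear in valid UTF-8, so this is rejected.
  print (isLeft (decodeHeaderValue (B.pack [0xFF])))  -- prints True
```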

frasertweedale self-assigned this Oct 25, 2024

joee commented Oct 27, 2024

FWIW, I have also noticed this problem with the display name of an email address, such as:

From: non-ASCII here <[email protected]>

I understand that might be more difficult to fix.

frasertweedale added a commit that referenced this issue Oct 28, 2024
Partial support for RFC 6532.  For now, only unstructured headers
(e.g. Subject) allow UTF-8.

Fixes: #82
frasertweedale commented

@joee please try this PR: #83

Indeed, adding support for internationalised email addresses will be a more intrusive change.


joee commented Oct 28, 2024

This works great. On my small test corpus of 1,000 emails it parses 100% of them, including messages with UTF-8 in both the subject line and the email display-name.
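
A corpus check of this kind can be sketched roughly as follows. This is not the script used for the numbers above. `parseCorpus` and `summarise` are hypothetical names, the directory is assumed to be flat and contain one raw message per file, and the parser is passed in as a plain function so the block depends only on `bytestring`, `directory`, and `filepath`.

```haskell
import qualified Data.ByteString as B
import Data.Either (isRight)
import System.Directory (listDirectory)
import System.FilePath ((</>))

-- | Count successes: (number parsed OK, total number of messages).
summarise :: [Either String a] -> (Int, Int)
summarise results = (length (filter isRight results), length results)

-- | Run the given parser over every file in a flat directory.
-- Assumes every directory entry is a regular file holding one raw message.
parseCorpus :: (B.ByteString -> Either String a) -> FilePath -> IO (Int, Int)
parseCorpus parser dir = do
  names <- listDirectory dir
  results <- mapM (fmap parser . B.readFile . (dir </>)) names
  pure (summarise results)
```

With purebred-email, the parser argument would be `parse (message mime)` as in the reproduction at the top of this issue, for example `parseCorpus (parse (message mime)) "corpus"`, where `"corpus"` is a placeholder path.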
