Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC822 string exporter #16

Draft
wants to merge 37 commits into
base: master
Choose a base branch
from
Draft

Conversation

ronaldmannak
Copy link

@ronaldmannak ronaldmannak commented Jun 14, 2022

Description

With Apple MailKit extensions it is more common to both parse mime messages and export those back to a RFC822 formatted string. This draft PR adds a RFC822 compliant exporter to MimeParser.

In addition, this PR makes a number of initializers public, to accommodate raw -> Mime -> sign/encrypt -> raw workflow.

Testing

This is a draft PR has no unit tests yet

Known issues

  • The order of the headers is not preserved.
  • There are some unresolved edge cases, including mails not getting exported (or rather parsed) correctly by Apple Mail on macOS when mailed from and to the same Gmail account. (The mail will be correctly parsed by Gmail webmail and Apple Mail for iOS, as will mails between two different Gmail accounts).

Feedback

Happy to receive feedback on both the RFC822 formatting (it's a pain) and whether this is direction is in line with the creator's vision.

@miximka
Copy link
Owner

miximka commented Jun 14, 2022

Dear Ronald,

thank you very much for your PR. It would be a great addition not only be able to parse, but also to generate mimes directly in Swift.

I know, it's not easy to produce a valid-formatted mime. Sure, the simplest case is easily done - it's just a bunch of string concatenations. But in a general case, we need to consider what data we are getting from the user and enforce the mime/rfc822 rules accordingly. There are a couple of important requirements, for instance, header fields only allowed to contain ascii characters otherwise they need to be encoded. Or, some characters need to be quoted. Or, header fields must be folded if they exceed length limitation, and so on.

What I am trying to say is, the implementation of such a feature requires deeper study of the RFC specification even than it did for the parser. Because when parsing goes wrong, you usually end up causing troubles only for yourself, but with a wrongly generated mimes, you may end up unknowingly causing troubles somewhere at the receiver end, which is much worse. At the moment, user data validation is not so much present, and it would be great to see at least some of the most important validations/transformations covered in this PR.

Likewize, it would be great if you would add unit tests for mime generation, as it is already the case for the parsing part. This way, you can ensure the correctness of the generated mime and allow for code to be easily changed without fearing to introduce regression bugs in the future.

I hope I didn't sound too grumpy and would really love to see this PR improved and merged very soon!! 👍

Best regards,
Maksim

@ronaldmannak
Copy link
Author

Hi Maksim,

Thanks for your quick response, I appreciate it.

You don't sound grumpy at all and, in fact, I agree with you :) The reason this is a draft PR is exactly because it will take some effort and iterations to make it fully comply with RFC822/2044/2045, work with all edge cases and work with all mail clients (there are some discrepancies in Mime parsing across the clients).

However, the current implementation already works for the vast majority of cases. I'm already using it in an early beta Mail extension myself.

The RFCs are not that long and fairly readable. There are also a lot of existing libraries in C and JS available.

I also agree adding unit tests should be priority #1. A simple roundtrip of the existing mail templates should give us a good baseline. By roundtrip I mean parsing the existing mail messages into Mime, exporting them back to an RFC822 string and compare them to the original message. If the two strings are exactly the same, we know the export isn't adding or removing characters or tokens that will mess up the parsing of the message for mail lcients.

I am happy to add the tests. However, it would be helpful for the unit tests if we could keep the order of the header fields of the message. Right now, content transfer encoding, content type and content disposition are out of order since they're not stored in the MimeHeader.other array.

What are your thoughts on that? Would it make sense to preserve the header order? If so, what would be the easiest way? If not, how can we compare the exported messages to the original in the unit test? Let me know your thoughts!

@miximka
Copy link
Owner

miximka commented Jun 14, 2022

Yeah, I already saw the order issue coming, but in my use-case it did not matter, so I didn't pay much attention to it earlier. I guess we could just remove the three hard-coded fields and other array, replace them with something like fields array and provide an easy access to the particular fields using something like header.field(type: .contentType), header.field(type: "Content-Type"), or even something more sophisticated like header[.contentType] using subscripts or key paths etc. But really, I don't have much preference, if you have some other idea, you are welcome!!

@ronaldmannak
Copy link
Author

Just a quick initial draft to fix to preserve the header order.

public struct MimeHeader {
    public let contentTransferEncoding: ContentTransferEncoding?
    public let contentType: ContentType?
    public let contentDisposition: ContentDisposition?

    // Propose to add new EmailHeader struct to store parsed to, cc and bcc headers since there are scenarios where these have to be edited
    public var to: [Email]? 
    public var cc: [Email]?
    public var bcc: [Email]? 

    // There might be a reason to keep other next to all. Not sure.
    @available(*, deprecated, message: "Use all property instead")
    public let other: [RFC822HeaderField]

    public let all: [RFC822HeaderField] // Add all property that preserves order
}

@ronaldmannak
Copy link
Author

ronaldmannak commented Jun 14, 2022

Our messages crossed :) I totally understand order wasn't necessary in the original use case. I also thought about a header protocols but not sure if that would be over engineering but it would sure break existing code. I personally wouldn't mind breaking backwards compatibility, but other projects who use the library might have a different opinion.

@ronaldmannak
Copy link
Author

This might be closer to what you meant. I just wrapping all fields in an enum. There's room for optimization but it might be cleaner in the end.

enum HeaderType {
    case contentTransferEncoding(ContentTransferEncoding)
    case contentType(ContentType)
    case contentDisposition(ContentDisposition)
    case to([EmailField])
    case cc([EmailField])
    case bcc([EmailField])
    case other(RFC822HeaderField)
}

public struct MimeHeader {
    public var contentTransferEncoding: ContentTransferEncoding? {
        // code to fetch field from fields array
    }
    ...

    @available(*, deprecated, message: "Use fields property instead")
    public let other: [RFC822HeaderField]

    public let fields: [HeaderType]
}

@ronaldmannak
Copy link
Author

ronaldmannak commented Jun 15, 2022

@miximka I have implemented a version of the second example that both parses and exports the headers, while providing backwards compatibility to existing code. I've moved all header related code to a new file called MimeHeader.swift, you can can see the changes there and in HeaderParser.swift

I've also added unit tests in RFC822ExportTest.swift that tests a roundtrip for every mail file included. The unit tests fail at the moment. I can already see two issues:

  1. The header parser unfolds header values that are folded (span over multiple lines). This is visible in the Received: header field. I'm not sure if there's any technical reason to unfold the header values. I can't reconstruct the folds. And technically, both folded and unfolded headers are valid.
  2. I'm using four spaces (LWSP-chars in RFC822 parle) to fold headers like Content-Type, in line with Apple Mail's internal Mime parser. The examples messages use an arbitrary number of spaces, from 1 to 4.

I assume both are deliberately to test the parsers with different standards/messages. However, for the roundtrip tests it doesn't work.

I can think of two options:

  1. I create a copy of every resource that removes arbitrary folds in all fields except Content-Type and use four spaces consistently for folds in the Content-Type field.
  2. We find a better way to compare strings. Something like the Kaleidoscope app that compares two documents character-by-character. But I'm not sure how that would work.

What do you think?

Edit: just added a quick method to visualize tokens (\r\n, \n, \t only for now). I can see the example messages and a new file I just copy&pasted from Apple Mail only use LF (\n) instead of CRLF (\r\n). That's not conform RFC822. Could it be that the CRLF is somehow removed by the copy and paste command? Or was \r\n made redundant in one of the later RFCs?

@miximka
Copy link
Owner

miximka commented Jun 16, 2022

@ronaldmannak

I can think of two options:

  1. I create a copy of every resource that removes arbitrary folds in all fields except Content-Type and use four spaces consistently for folds in the Content-Type field.
  2. We find a better way to compare strings. Something like the Kaleidoscope app that compares two documents character-by-character. But I'm not sure how that would work.

Option 1 sounds reasonable to me 👍.

Edit: just added a quick method to visualize tokens (\r\n, \n, \t only for now). I can see the example messages and a new file I just copy&pasted from Apple Mail only use LF (\n) instead of CRLF (\r\n). That's not conform RFC822. Could it be that the CRLF is somehow removed by the copy and paste command? Or was \r\n made redundant in one of the later RFCs?

It must be one of the Apple's we-do-our-own-thing again 🤦‍♂️. Because even if you save an email to *.eml file and look inside it with some hex-editor, you will only find \n line endings inside it. I just cross-checked with Gmail (you can download an *.eml file from Gmail's web-ui directly) and the file only had \r\n line endings for the same message...

@ronaldmannak
Copy link
Author

@miximka I can confirm that as well. I used a Mail extension to get the raw message directly (just in case the exporter encodes the message or something) last night and found only \n line endings as well.

The core issue is that MIME is tricky standard and some apps (like Apple Mail) don't seem to be 100% compliant. No unit test can help here. I think we should aim to make it 100% work with Apple Mail (and Gmail) first, and then take it from there.

@jasonm23 jasonm23 mentioned this pull request Jun 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants