Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add implementation of MultipartFormData streaming encoder #200

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

andreiltd
Copy link
Contributor

@andreiltd andreiltd commented Jan 21, 2025

I did a quick comparison of encoded FormData between SM and Chrome using following JS code:

Simple FormData
async function readStream(stream) {
 const reader = stream.getReader();
 const chunks = [];

 while (true) {
   const { done, value } = await reader.read();
   if (done) {
     break;
   }
   chunks.push(value);
 }

 let totalLen = 0;
 for (const chunk of chunks) {
   totalLen += chunk.length;
 }

 const result = new Uint8Array(totalLen);

 let offset = 0;
 for (const chunk of chunks) {
   result.set(chunk, offset);
   offset += chunk.length;
 }

 return result;
}

async function main() {
 const form = new FormData();
 form.append('field1', 'value1');
 form.append('field2', 'value2');

 const file = new File(['Hello World!'], 'dummy.txt', { type: 'foo' });
 form.append('file1', file);

 const req = new Request('https:example.com', {
   method: 'POST',
   body: form,
 });

 const type = req.headers.get('Content-Type');
 const boundary = type.split('boundary=')[1];

 // assert(type.startsWith('multipart/form-data; boundary='), "Content-Type is multipart/form-data");
 const data = await readStream(req.body);
 const dec = new TextDecoder();
 const bodyStr = dec.decode(data);

 console.log(bodyStr);
}

main();

The results are:

  • Chrome:
------WebKitFormBoundaryhpShnP1JqrBTVTnC
Content-Disposition: form-data; name="field1"

value1
------WebKitFormBoundaryhpShnP1JqrBTVTnC
Content-Disposition: form-data; name="field2"

value2
------WebKitFormBoundaryhpShnP1JqrBTVTnC
Content-Disposition: form-data; name="file1"; filename="dummy.txt"
Content-Type: foo

Hello World!
------WebKitFormBoundaryhpShnP1JqrBTVTnC--

  • SM:
----BoundaryjXo5N4HEAXWcKrw7
Content-Disposition: form-data; name="field1"

value1
----BoundaryjXo5N4HEAXWcKrw7
Content-Disposition: form-data; name="field2"

value2
----BoundaryjXo5N4HEAXWcKrw7
Content-Disposition: form-data; name="file1"; filename="dummy.txt"
Content-Type: foo

Hello World!
----BoundaryjXo5N4HEAXWcKrw7--

I also did some manual fuzzing to test the streaming logic by varying the buffer sizes the encoder writes into. Specifically, I tested these buffer sizes in bytes: 1, 2, 4, 8, 32, 1024, and 8192 by changing the size in the implementation. Though, having some unit test framework would be nice.

Relevant RFC: https://www.rfc-editor.org/rfc/rfc2046#section-5.1.1

@andreiltd andreiltd marked this pull request as draft January 21, 2025 14:21
@andreiltd andreiltd marked this pull request as ready for review January 22, 2025 14:02
@andreiltd
Copy link
Contributor Author

andreiltd commented Jan 22, 2025

It looks like the CI failure is unrelated to this PR (I've observed the same failure on main branch). This is a problematic test:

    const response = await fetch("https://http-me.glitch.me/meow?header=cat:é");
    strictEqual(response.headers.get('cat'), "é");

@tschneidereit
Copy link
Member

@andreiltd, is this ready to review, or are you still expecting to make changes to it?

@andreiltd
Copy link
Contributor Author

Yes, this is ready to review :)

Copy link
Member

@tschneidereit tschneidereit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either I'm missing something in looking at the diff, or there is some code missing—see the inline comment on form-data-encoder.h.

One thing that I can't fully evaluate, because it'll presumably be part of the implementation of encode_stream: IIUC, field names, will always be first have their newlines normalized to CRLF, and then have those escaped. It'd be good to fold those into one operation, if possible.

Otherwise, I left a few comments and suggestions. I'm a bit concerned about allocation and general failure handling, so it'd be good to go over those aspects in some detail.

builtins/web/form-data/form-data-encoder.cpp Outdated Show resolved Hide resolved
builtins/web/form-data/form-data-encoder.cpp Outdated Show resolved Hide resolved
builtins/web/form-data/form-data-encoder.cpp Outdated Show resolved Hide resolved
Comment on lines +32 to +34
static size_t query_length(JSContext *cx, HandleObject self);
static JSObject *encode_stream(JSContext *cx, HandleObject self);
static JSObject *create(JSContext *cx, HandleObject form_data);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation of these seems to be missing? (And in the case of query_length I also can't find any uses.)

Copy link
Contributor Author

@andreiltd andreiltd Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... that's weird:

I totally forgot about query_length, I'm sorry about that! The reason for this is that the fetch spec doesn't say what to do with Content-Length in case of FormData.

Set length to unclear, see html/6424 for improving this.

See here: https://fetch.spec.whatwg.org/#bodyinit-unions. It technically should be possible to calculate an encoded size beforehand without doing an actual encoding, but it's unclear to me if this is needed or not.

bool MultipartFormDataImpl::handle_entry_header(JSContext *cx, StreamContext &stream) {
auto entry = stream.entries->begin()[chunk_idx_];
auto header = fmt::memory_buffer();
auto name = escape_newlines(entry.name).value();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs a null check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function is overloaded but the version that doesn't take JSContetext is infallible. I can change this to separate function like escape_name and escape_name_val if we prefer explicit return value. Anyway, I've added the assertion for now.

builtins/web/form-data/form-data-encoder.cpp Outdated Show resolved Hide resolved
// Hex encode bytes to string
auto bytes = std::move(res.unwrap());
auto bytes_str = std::string_view((char *)(bytes.ptr.get()), bytes.size());
auto base64_str = base64::forgivingBase64Encode(bytes_str, base64::base64EncodeTable);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need a null check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returns just a std::string.

builtins/web/form-data/form-data-encoder.cpp Show resolved Hide resolved
builtins/web/form-data/form-data-encoder.cpp Outdated Show resolved Hide resolved
auto leftover = datasz - written;
if (leftover > 0) {
MOZ_ASSERT(remainder_.empty());
remainder_.assign(first + written, last);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an infallible operation? It seems like it should need to allocate memory, so I'm not sure I understand how that works. And ideally, if it does allocate, we should make that fallible—or at least assert that it didn't fail and abort execution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it technically can throw bad_alloc because it is internally calling std::string::reserve if needed - I guess this will abort excution.

https://cplusplus.com/reference/string/string/assign/

A bad_alloc exception is thrown if the function needs to allocate storage and fails.

@andreiltd
Copy link
Contributor Author

andreiltd commented Jan 27, 2025

Hi @tschneidereit, as always, thank you so much for a through review!

One thing that I can't fully evaluate, because it'll presumably be part of the implementation of encode_stream: IIUC, field names, will always be first have their newlines normalized to CRLF, and then have those escaped. It'd be good to fold those into one operation, if possible.

So the spec says this:

For each entry of entry list:

  1. Replace every occurrence of U+000D (CR) not followed by U+000A (LF), and every occurrence of U+000A (LF) not preceded by U+000D (CR), in entry's name, by a string consisting of a U+000D (CR) and U+000A (LF).
  2. If entry's value is not a File object, then replace every occurrence of U+000D (CR) not followed by U+000A (LF), and every occurrence of U+000A (LF) not preceded by U+000D (CR), in entry's value, by a string consisting of a U+000D (CR) and U+000A (LF).

It's really hard to follow the spec closely because the implementation is state-machine-based, and a lot of actions that the spec calls for correspond to different states in the implementation and cannot be done together. For example, processing both names and values: names are part of the entry-header state, whereas values are handled in the entry-body state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants