Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: string.base64.parse #1158

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

Abion47
Copy link
Contributor

@Abion47 Abion47 commented Oct 2, 2024

This PR adds the string.base64.parse keyword submodule which parses a base64 string into a Uint8Array.

  • Code is up-to-date with the main branch
  • You've successfully run pnpm prChecks locally
  • There are new or updated unit tests validating the change

Notes:

The parsing method for converting a base64 string into a Uint8Array is an adapted solution from the base64-js package since Buffer.from isn't available in browsers and btoa is notoriously slow.

I ran a NodeJS benchmark comparing this solution with btoa and Buffer.from, and got the following results (the test base64 string was a 31KB image):

┌─────────┬───────────┬──────────┬────────────────────┬──────────┬─────────┐
│ (index) │ Task Name │ ops/sec  │ Average Time (ns)  │ Margin   │ Samples │
├─────────┼───────────┼──────────┼────────────────────┼──────────┼─────────┤
│ 0       │ 'btoa'    │ '852'    │ 1173658.2565185907 │ '±0.27%' │ 51123   │
│ 1       │ 'buffer'  │ '3,448'  │ 289958.9299943234  │ '±0.11%' │ 206926  │
│ 2       │ 'b64-js'  │ '48,179' │ 20755.868277051777 │ '±1.26%' │ 2890749 │
└─────────┴───────────┴──────────┴────────────────────┴──────────┴─────────┘

The btoa result isn't surprising, but if these results are to be believed (and that's a huge "if"), the base64-js solution is an order of magnitude faster than Buffer.from. I'm not sure I trust that conclusion, but seeing as Buffer.from isn't an option anyway, I'm willing to take the win.

TODO:

  • After discussion regarding the API, implement string.base64.url.parse
    • Alternatively, have string.base64.parse handle both base64 and base64url strings (which is sort of already supported by the base64-js source).

@Abion47
Copy link
Contributor Author

Abion47 commented Oct 2, 2024

Something I discovered while writing the tests is that I intended to base the tests on the TypedArray/Uint8 tests, but those tests don't appear to have been written yet. Furthermore, when I try to use the return value of b64parse(...) directly in attest(b64parse(...)).snap(...), it statically fails as snap appears to type the expected value as some kind of primitive array rather than correctly as a Uint8Array. I'm not sure if this is because I wrote either the module or the test wrong, but I've written the test to encode the value back into UTF8 to check the string value for now (which does pass).

@ssalbdivad ssalbdivad changed the title feat: Added string.base64.parse feat: string.base64.parse Oct 4, 2024
Copy link
Member

@ssalbdivad ssalbdivad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of now, my instinct is that the costs of adding the new implementation to the default scope outweigh the benefits, although I'm open to feedback there.

Despite btoa being painfully slow, it still seems like providing an option that uses builtins in the default scope would be valuable, and perhaps we can follow up with a PR adding this version to an @ark/extras (name pending) package so people can opt-in?

Also the CI tests should work once you rebase

@@ -43,6 +43,17 @@ contextualize(() => {
attest(b64url("fn5+").toString()).equals(
'must be base64url-encoded (was "fn5+")'
)

const b64parse = type("string.base64.parse")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this group of tests for base64 to its own file like ip.test.ts or similar.

@@ -43,6 +43,17 @@ contextualize(() => {
attest(b64url("fn5+").toString()).equals(
'must be base64url-encoded (was "fn5+")'
)

const b64parse = type("string.base64.parse")
attest(Buffer.from(b64parse("fn5+") as Uint8Array).toString("utf8")).snap(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally snap would recognize Uint8Array as a builtin here and not map over it, but seems like you'd still have to convert it to a string to easily snapshot it. This seems fine to me for now unless you want to make a change to snapshotting in attest to add builtin logic for this.

ark/type/keywords/string/base64.ts Show resolved Hide resolved
const byteLength = (validLen: number, placeHoldersLen: number) =>
((validLen + placeHoldersLen) * 3) / 4 - placeHoldersLen

const parseB64 = (b64: string) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I'm tempted to just say let's use the builtin (btoa) even if it's much slower because people complaining about the perf could just write their own version of the parse that uses one of these libraries, but this is a lot to maintain and add to the bundle size for a perf improvement (albeit significant) on a relatively niche feature.

Another option for this kind of type would be to create a new package like @ark/extras where they could be imported directly if users need them. If they're builtin to the default scope, there's no way to tree shake it so the cost is high.

I do appreciate the work that went into adapting an optimized version of this, so would rather see it available through a package like that than to go to waste.

Copy link
Contributor Author

@Abion47 Abion47 Oct 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue that the feature wouldn't be that niche as uploading binary data in the form of a base64 string is fairly common, and I'd hate for people to use string.base64.parse on the promise that it automatically validates and parses the string into a buffer-like object only to discover it's using an approach that's 10,000% slower than an available alternative (or even worse, not knowing it's an implementation detail and assuming it's arktype itself that is so slow). In my honest opinion, from a user experience perspective, I think it would be better for the feature to not exist at all than exist in such a disadvantageous state.

That said, the maintenance and bundle size arguments are valid, so maybe an @ark/extras package is warranted if this PR is to go further.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, especially until we can support precompilation as a build step for people who care about bundle size, we have to be pretty conservative about this sort of thing.

If you want to migrate this with the changes mentioned to @ark/extras, I would merge that (although the name might change before publishing).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

3 participants