Explorations: Stream API #851
Conversation
…them into PHP(). Also, explore using file iterators as a basic abstraction for passing the file trees around. Work in progress
Very nice to see the elimination of the intermediate unpacked folder. This should bring a nice improvement in the latency of installing ZIP files, with installs and boot finishing quite a bit faster.
```ts
entry['uncompressedSize'] = await stream.readUint32();
entry['fileNameLength'] = await stream.readUint16();
entry['extraLength'] = await stream.readUint16();
entry['fileName'] = new TextDecoder().decode(
```
It appears that Emscripten is the real culprit here, but filenames are not encoded; they are raw bytes separated by ASCII `/`. Decoding them as UTF-8 will end up breaking filenames and inserting � in places.
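One hedged way around the replacement characters, sketched here since the right fix depends on what Emscripten actually hands back: honor the ZIP format's UTF-8 flag (general-purpose bit 11) and otherwise use a byte-preserving decode instead of a lossy UTF-8 one. `decodeFileName` is a hypothetical helper, not code from this PR:

```ts
// Sketch only: ZIP filenames are CP437 unless general-purpose flag bit 11
// marks them as UTF-8. TextDecoder cannot decode CP437, but 'latin1'
// (an alias for windows-1252) maps every byte to some character, so it
// never inserts U+FFFD the way a lossy UTF-8 decode does.
function decodeFileName(bytes: Uint8Array, generalPurposeFlags: number): string {
	const isUtf8 = (generalPurposeFlags & 0x0800) !== 0;
	return new TextDecoder(isUtf8 ? 'utf-8' : 'latin1').decode(bytes);
}
```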
```ts
});
}

const data = await stream.read(header.compressedSize);
```
Curious: what happens if we attempt to read more bytes than are available?
I removed the custom `Stream` class implementation. Right now we read *n* bytes using the native `ReadableStreamBYOBReader` class, which sends a "read request" to the data source. Every data source is free to process it how it wants, but typically it will only fill the buffer up to `Math.min(Buffer.size, numberOfAvailableDataBytes)`.
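For illustration, reading exactly *n* bytes with a BYOB reader looks roughly like this (a minimal sketch, not the implementation in this PR):

```ts
// Read up to `n` bytes from a byte stream using a BYOB reader. Each read()
// may fill fewer bytes than requested, so we loop. The view's ArrayBuffer
// is transferred on every read, hence re-wrapping value.buffer.
async function readBytes(
	source: ReadableStream<Uint8Array>,
	n: number
): Promise<Uint8Array> {
	const reader = source.getReader({ mode: 'byob' });
	let buffer = new Uint8Array(n);
	let offset = 0;
	while (offset < n) {
		const { value, done } = await reader.read(
			new Uint8Array(buffer.buffer, offset, n - offset)
		);
		if (done || !value) break;
		buffer = new Uint8Array(value.buffer);
		offset += value.byteLength;
	}
	reader.releaseLock();
	return buffer.subarray(0, offset);
}
```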
…end of the central directory
…nfig, and the sqlite-database-integration plugin (#872)

## Description

Includes the `wp-config.php` file and the default WordPress theme (like `twentytwentythree`) in the zip file exported via the "Download as .zip" button. Without those files, the exported bundle isn't self-contained. It cannot be hosted, and any importer needs to provide the missing files on its own. The theme and the plugins are easy to backfill, but the data stored in `wp-config.php` is lost.

## How is the problem addressed?

This PR adds a temporary private `selfContained` option to `importWpContent`. It is `false` by default to ensure those files are not exported with GitHub PRs (the export flow relies on the same logic). The zip download button sets it to `true`. This is a temporary workaround, as we currently don't have any better tools to deal with this problem. Once the streaming/iterators API ships in #851, we'll be able to get rid of this hack and just filter the stream of files.

## Testing Instructions

Unfortunately, this PR ships no unit tests, as without #895 there isn't an easy way to test the `zipWpContent` function. Here are the manual testing steps:

1. Open Playground.
2. Make a change in the site content.
3. Export Playground into a zip file.
4. Confirm the zip file contains the `wp-config.php` file as well as the `twentytwentyfour` theme and the `sqlite-database-integration` plugin.
5. Refresh the Playground tab and import that zip.
6. Confirm it worked and the website is functional and has the content update from before.
7. Export it to GitHub, checking the "include zip file" checkbox.
8. Confirm the GitHub PR has no `twentytwentyfour` theme, `wp-config.php` file, or `sqlite-database-integration` plugin.
9. Do the same for the zip bundled with the GitHub PR.
10. Import that PR and confirm it imports cleanly.
Adds a new `@php-wasm/node-polyfills` package to polyfill the features missing in Node 18 and/or JSDOM environments. The goal is to make wp-now and other Playground-based Node.js packages work in Node 18, which is the current LTS release. The polyfilled JavaScript features are:

* `CustomEvent` class
* `File` class
* `Blob.text()`, `Blob.arrayBuffer()`, and `File.text()` methods
* `Blob.stream()` and `File.stream()` methods
* Ensures `File.stream().getReader({ mode: 'byob' })` is supported – this is relevant for #851

I adapted the Blob methods from https://github.com/bjornstar/blob-polyfill/blob/master/Blob.js as they seemed to provide just the logic needed here, and they also worked right away.

This PR is a part of #851, split out into a separate PR to make it easier to review and reason about. Supersedes #865.

## Testing instructions

Confirm the unit tests pass. This PR ships a set of vite tests to confirm the polyfills work both in vanilla Node.js and in jsdom runtime environments.
Stream Compression introduced in #851 has no dependencies on WordPress and can be used in any JavaScript project. It also makes sense as a dependency for some `@php-wasm` packages. This commit, therefore, moves it from the `wp-playground` to the `php-wasm` npm namespace, making it reusable across the entire project.

In addition, this adds a new `iterateFiles` function to the `@php-wasm/universal` package, which allows iterating over the files in the PHP filesystem. It uses the `stream-compression` package, which was some of the motivation for the move.

This PR also ships eslint rules to keep the `stream-compression` package independent from the heavy `@php-wasm/web` and `@php-wasm/node` packages. This should enable using it in other projects with a minimal dependency overhead of just `@php-wasm/util` and `@php-wasm/node-polyfills`.

## Testing instructions

Since the package isn't used anywhere yet, only confirm that the CI checks pass.
This small commit brings a part of #851 into trunk for easier review.

## Testing instructions

Confirm the CI tests pass.
## What is this PR doing?

#851 explores migrating file handling in Playground from buffering to JS-native streams, and this PR brings over small, administrative changes from the main branch to unclutter it. This is mostly about updating `package.json` files, configs, and imports.

## Testing Instructions

Confirm the CI checks pass.
Safari doesn't support BYOB streams, which is a blocker here. Perhaps there's a way to polyfill them? I experimented with that at 2b0acd0#diff-888d17d202646c67c560411a8e150946f30fec742ff592a7915d6c8370ab030e
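For what it's worth, a feature test (not a full polyfill) could at least keep Safari working at the cost of an extra copy; a sketch:

```ts
// Safari throws a TypeError for getReader({ mode: 'byob' }), so fall back
// to a default reader there and accept copying chunks into our own buffer.
function getByteReader(stream: ReadableStream<Uint8Array>) {
	try {
		return { byob: true, reader: stream.getReader({ mode: 'byob' }) } as const;
	} catch {
		return { byob: false, reader: stream.getReader() } as const;
	}
}
```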
Surfacing the performance concerns from #919 (comment). Buffering the entire .zip file and passing it to PHP.wasm in one go takes around 300ms on my machine. However, using browser streams and writing each file as it becomes available takes around 2s, where:
There's potentially a lot of room for improvement, but I'm not too keen on tweaking this. Tweaking the stream-based implementation would take a lot of time, and I'm not convinced about the benefits. The lower bound for the execution time seems to be set by the native versions of libzip and libz. My intuition is that:
All of this is hand-wavy and based on my intuition. I don't have any actual measurements, and perhaps I could spend a dozen or more hours here and either prove or disprove those assumptions, but I think there's a more promising avenue to explore instead. I wonder if we could stream all the downloaded bytes directly to WASM memory and stream-handle them there. JavaScript would become an API layer over WASM operations, much like:

```ts
php.acceptDataStream(fetch(pluginFileURL).body)
	.unzip()
	.writeTo('/wordpress/wp-content/plugins/gutenberg');
```
Blueprints v2 make integrating JavaScript ZIP streaming into Blueprint steps unnecessary, as the data is processed using PHP streams instead of JavaScript ones. Let's still keep the stream-processing package around, though, as it's independent and useful to have.
common.ts file into smaller files #898

Description
Implements a stream-based ZIP encoder and decoder using the CompressionStream and DecompressionStream classes.
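For context, the heavy lifting can stay native: a ZIP entry's data is raw deflate, which the built-in DecompressionStream can inflate directly. A minimal sketch ('deflate-raw' requires a reasonably recent browser):

```ts
// Inflate one ZIP entry's compressed bytes with the native
// DecompressionStream. ZIP entries use raw deflate, hence 'deflate-raw'.
async function inflate(compressedBytes: Uint8Array): Promise<Uint8Array> {
	const inflated = new Blob([compressedBytes])
		.stream()
		.pipeThrough(new DecompressionStream('deflate-raw'));
	return new Uint8Array(await new Response(inflated).arrayBuffer());
}
```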
Here's what we get:

* `ZipArchive`

To that last point:
ZIP as a remote, virtual filesystem
This change enables fast previewing of even 10GB-large zipped exports via partial downloads.
Imagine previewing a large site export with many photos and videos. The `decodeRemoteZip` function knows how to request just the list of files first, filter out the large ones, and then issue multiple fetch() requests to download the rest. Effectively, we would only download ~5MB to 10MB of data for the initial preview, and then only download the larger assets once they're needed.
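The trick that makes this possible is HTTP Range requests. Here's a hypothetical sketch of the first step, fetching just the tail of the archive where the ZIP central directory lives (it assumes the server supports the Range header; the 64 KB window is a guess that covers the End Of Central Directory record plus a typical directory):

```ts
async function fetchZipTail(url: string): Promise<Uint8Array> {
	// Learn the total size first, then request only the last 64 KB.
	const head = await fetch(url, { method: 'HEAD' });
	const total = Number(head.headers.get('Content-Length'));
	const start = Math.max(0, total - 64 * 1024);
	const tail = await fetch(url, {
		headers: { Range: `bytes=${start}-${total - 1}` },
	});
	return new Uint8Array(await tail.arrayBuffer());
}
```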
Technical details
Here are a few interesting functions shipped by this PR. Note that the links point to a specific commit and may get outdated:
Remaining work
There are a few more things to do here, but I still wanted to get some reviews in before spending time on them, just in case the API substantially changes:

* Use `ReadableStream.tee()` instead of the workaround we use now, once https://bugs.chromium.org/p/chromium/issues/detail?id=1512548 is fixed in Chromium.

API changes
Breaking changes
This PR isn't a breaking change yet. One of the follow-up PRs will very likely propose some breaking changes, but this one only extends the available API.
Without this PR
Without this PR, unzipping a file requires writing it to Playground, calling PHP's `unzip`, and removing the temporary `zip` file:

With this PR
With this PR, unzipping works like this:
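A rough sketch of the flow, where `decodeZip` stands in for the decoder shipped here and is assumed to yield `File` objects; the exact names and signatures may differ:

```ts
// Stream a remote ZIP straight into the PHP filesystem, one file at a
// time, with no intermediate archive written to disk.
const response = await fetch(pluginFileURL);
for await (const file of decodeZip(response.body!)) {
	await php.writeFile(
		`/wordpress/wp-content/plugins/gutenberg/${file.name}`,
		new Uint8Array(await file.arrayBuffer())
	);
}
```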
More examples
Here's what else the streaming API unlocks. Not all of these functions are shipped here, but they are quite easy to implement:
Open questions
How can streams be expressed in Blueprints?
Solving this is out of scope for this PR, but it's one of the next problems to explore so we might as well use this PR as a space to contemplate. I'll move this into a separate issue once we solidify the shape of the API here and ship this PR.
URI notation
Perhaps the Blueprint steps could accept a `protocol://URI` notation? This is the option I'm leaning towards, unless a better alternative comes up; see the sketch after the lists below.

Upsides:
* `git+https://git@...`
Downsides:
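For illustration, a Blueprint step using URI notation might look like this; the `zip+https` scheme and the exact step shape are hypothetical:

```ts
// The scheme prefix would select the processing pipeline: fetch, then unzip.
const step = {
	step: 'installPlugin',
	pluginZipFile:
		'zip+https://downloads.wordpress.org/plugin/gutenberg.latest.zip',
};
```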
Object-based DSL
We can have unlimited flexibility with a custom DSL, like the sketch below:
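A hypothetical shape, since the original example didn't pin one down (all field names are illustrative):

```ts
// Object-based resource declaration: each node names an operation and
// its input, composing fetch and unzip explicitly.
const pluginResource = {
	type: 'unzip',
	source: {
		type: 'fetch',
		url: 'https://downloads.wordpress.org/plugin/gutenberg.latest.zip',
	},
};
```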
Upsides:
Downsides:
Piping DSL
How about we mimic the JavaScript API using the JSON notation?
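A hypothetical sketch of what that could look like, expressed here as an array of pipeline stages (stage and field names are illustrative):

```ts
// Each entry names a stage and its arguments, mirroring the chained
// JavaScript API: fetch, unzip, then write to the PHP filesystem.
const pipeline = [
	{ pipe: 'fetch', url: 'https://downloads.wordpress.org/plugin/gutenberg.latest.zip' },
	{ pipe: 'unzip' },
	{ pipe: 'writeTo', path: '/wordpress/wp-content/plugins/gutenberg' },
];
```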
Upsides:
Downsides:
cc @dmsnell