Text chunk handlers are deceptively difficult to use correctly #255

Open
kornelski opened this issue Jan 1, 2025 · 2 comments
kornelski commented Jan 1, 2025

Text chunks can be subdivided into smaller pieces by the boundaries of the inputs passed to rewriter.write() and by the internal buffer of TextDecoder. Our own tests incorrectly assumed this never happens (#256).

This arbitrary splitting makes the text chunk handlers much more complicated to use than they seem, because the handlers don't get an equivalent of a single DOM text node. They may be invoked many times on arbitrarily small pieces of text, which could be as small as a single codepoint.

Mutations like .before() and .after() are performed on each arbitrary fragment the handler is invoked on, not before/after the full run of text between tags. Similarly, .replace() replaces each individual piece of text rather than the whole run, so simply calling chunk.replace("new text") is insufficient and incorrect. You need a stateful handler that replaces the first fragment and calls chunk.replace("") on every other piece.
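To illustrate, here is a minimal self-contained sketch of the stateful pattern described above. It deliberately does not use lol-html's actual API — `ReplaceHandler` and `on_chunk` are made-up names modeling a handler that is invoked once per fragment of a single text run:

```rust
// Hypothetical model (not the lol-html API): one text-node run arrives as
// many fragments, so a correct "replace" handler must track state.
struct ReplaceHandler {
    replaced: bool,
    replacement: &'static str,
}

impl ReplaceHandler {
    fn new(replacement: &'static str) -> Self {
        Self { replaced: false, replacement }
    }

    // Called once per fragment; returns the output emitted for that fragment.
    fn on_chunk(&mut self, _fragment: &str) -> String {
        if self.replaced {
            // Later fragments must be emptied (the chunk.replace("") case),
            // or leftover original text would follow the replacement.
            String::new()
        } else {
            self.replaced = true;
            self.replacement.to_string()
        }
    }
}

fn main() {
    let mut handler = ReplaceHandler::new("new text");
    // The same text node delivered as three arbitrary fragments:
    let out: String = ["n", "ee", "dle"]
        .iter()
        .map(|f| handler.on_chunk(f))
        .collect();
    // A naive per-fragment replace would instead emit the replacement
    // three times, once per fragment.
    assert_eq!(out, "new text");
}
```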

Splits make text search very tricky. You can't use chunk.as_str().contains("needle"), because the handler could be invoked on "n", "ee", "dle". The search can't be done efficiently with just a state machine either, because by the time you find the needle, you may have already "handled" (and emitted) the earlier chunks. So text search requires buffering the text and proactively removing all text chunks until a match is found.
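A sketch of that buffering strategy, again as plain self-contained Rust rather than the real API (the `last_in_run` flag stands in for an end-of-run signal such as lol-html's TextChunk::last_in_text_node(); `SearchingHandler` and the needle/replacement strings are made up):

```rust
// Buffering search: delete every fragment while accumulating its text,
// then emit (and optionally rewrite) the whole run at the end.
struct SearchingHandler {
    buffer: String,
}

impl SearchingHandler {
    fn on_chunk(&mut self, fragment: &str, last_in_run: bool) -> String {
        // "Remove" the fragment from the output, but keep its text.
        self.buffer.push_str(fragment);
        if last_in_run {
            // Only now is contains() reliable, on the reassembled run.
            let text = std::mem::take(&mut self.buffer);
            if text.contains("needle") {
                text.replace("needle", "thread")
            } else {
                text
            }
        } else {
            String::new()
        }
    }
}

fn main() {
    let mut h = SearchingHandler { buffer: String::new() };
    // "needle" straddles all three fragment boundaries:
    let fragments = ["a n", "ee", "dle here"];
    let mut out = String::new();
    for (i, f) in fragments.iter().enumerate() {
        out.push_str(&h.on_chunk(f, i == fragments.len() - 1));
    }
    assert_eq!(out, "a thread here");
}
```

The cost is exactly what the paragraph above describes: all text is withheld (buffered) until the run ends, trading latency and memory for correctness.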

This behavior makes text chunk handlers quite different from comment and element handlers.

kornelski added a commit that referenced this issue Jan 1, 2025
@kornelski
Contributor Author

Some ideas for how to mitigate this:

  1. Buffer entire text nodes when needed. This would add output latency and unbounded memory use, but the text handlers would be easy to write, and mutations would do the obvious thing.

  2. Intentionally fragment text chunks into small pieces. This would reliably expose invalid assumptions in the text chunk handlers, and force users to face the complexity. The current implementation fragments into 1KB pieces, which isn't small enough to notice the boundaries in most cases.

  3. Cleverly share the state of Mutations between text chunks to make the text handlers idempotent: .before() would execute on the first chunk only, .after() on the last chunk only, and .replace() would run once and then remove all following chunks automatically. This would make naive text handlers work as expected, except that chunk.as_str() would still be unreliable, and the overall behavior would be even more complex.

  4. Buffer text only up to $x KB to keep latency and memory usage bounded. This would make incorrect text handlers run into the problem less often, but would make the problem even more obscure.

  5. Introduce a mode or a new kind of text handler that doesn't pass any input text to the handler, so the handler can only prepend/append/replace the entire text between tags. That would be like option 3, but without the footgun of being invoked many times with fragmented inputs. Note that we can't offer any text-based selector, since that would require buffering until a match is found, so in the common case all text would be buffered.

  6. Each of the above as a config option or a new type of handler.
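To make the trade-off in idea 4 concrete, here is an illustrative sketch of bounded buffering. All names and the tiny LIMIT are invented for the example; a real implementation would flush at a few kilobytes:

```rust
// Idea 4 sketch: buffer at most LIMIT bytes per text run, flushing early
// when the cap is hit. Memory and latency stay bounded, but any search
// over individual flushes can still miss a match that straddles a flush
// boundary -- which is why this makes the bug rarer yet more obscure.
const LIMIT: usize = 4;

struct BoundedBuffer {
    buf: String,
}

impl BoundedBuffer {
    fn on_chunk(&mut self, fragment: &str, last: bool) -> String {
        self.buf.push_str(fragment);
        if last || self.buf.len() >= LIMIT {
            std::mem::take(&mut self.buf)
        } else {
            String::new()
        }
    }
}

fn main() {
    let mut b = BoundedBuffer { buf: String::new() };
    let frags = ["a nee", "dle!"];
    let mut out = String::new();
    for (i, f) in frags.iter().enumerate() {
        out.push_str(&b.on_chunk(f, i == frags.len() - 1));
    }
    // The text passes through intact, but "needle" was split across the
    // two flushes ("a nee" and "dle!"), so a per-flush search misses it.
    assert_eq!(out, "a needle!");
}
```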

This was referenced Jan 1, 2025
@kornelski
Contributor Author

@inikulin @orium What do you think about this issue?
