Text chunk handlers are deceptively difficult to use correctly #255

Open
kornelski opened this issue Jan 1, 2025 · 2 comments
kornelski commented Jan 1, 2025

Text chunks can be subdivided into smaller pieces by the boundaries of the inputs passed to rewriter.write() and by the internal buffer of TextDecoder. Our own tests incorrectly assumed this never happens (#256).

This arbitrary splitting makes the text chunk handlers much more complicated to use than they seem, because the handlers don't get an equivalent of a single DOM text node. They may be invoked many times on arbitrarily small pieces of text, which could be as small as a single codepoint.

Mutations like .before() and .after() are performed on each arbitrary fragment the handler is invoked on, not before/after the full run of text between tags. Similarly, .replace() replaces each individual piece of text rather than the whole run, so simply calling chunk.replace("new text") is insufficient and incorrect. You need a stateful handler that replaces the first fragment and calls chunk.replace("") on every other piece.
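To illustrate, here is a minimal self-contained sketch of the stateful pattern described above. It deliberately does not use lol-html's actual API — `ReplaceHandler` and `on_chunk` are made-up names modeling a handler that is invoked once per fragment of a single text run:

```rust
// Hypothetical model (not the lol-html API): one text-node run arrives as
// many fragments, so a correct "replace" handler must track state.
struct ReplaceHandler {
    replaced: bool,
    replacement: &'static str,
}

impl ReplaceHandler {
    fn new(replacement: &'static str) -> Self {
        Self { replaced: false, replacement }
    }

    // Called once per fragment; returns the output emitted for that fragment.
    fn on_chunk(&mut self, _fragment: &str) -> String {
        if self.replaced {
            // Later fragments must be emptied (the chunk.replace("") case),
            // or leftover original text would follow the replacement.
            String::new()
        } else {
            self.replaced = true;
            self.replacement.to_string()
        }
    }
}

fn main() {
    let mut handler = ReplaceHandler::new("new text");
    // The same text node delivered as three arbitrary fragments:
    let out: String = ["n", "ee", "dle"]
        .iter()
        .map(|f| handler.on_chunk(f))
        .collect();
    // A naive per-fragment replace would instead emit the replacement
    // three times, once per fragment.
    assert_eq!(out, "new text");
}
```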

Splits make text search very tricky. You can't use chunk.as_str().contains("needle"), because the handler could be invoked on "n", "ee", "dle". The search can't be done efficiently with just a state machine either, because by the time you find the needle, you may have already "handled" (and emitted) the earlier chunks. So text search requires buffering the text and proactively removing all text chunks until a match is found.
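A sketch of that buffering strategy, again as plain self-contained Rust rather than the real API (the `last_in_run` flag stands in for an end-of-run signal such as lol-html's TextChunk::last_in_text_node(); `SearchingHandler` and the needle/replacement strings are made up):

```rust
// Buffering search: delete every fragment while accumulating its text,
// then emit (and optionally rewrite) the whole run at the end.
struct SearchingHandler {
    buffer: String,
}

impl SearchingHandler {
    fn on_chunk(&mut self, fragment: &str, last_in_run: bool) -> String {
        // "Remove" the fragment from the output, but keep its text.
        self.buffer.push_str(fragment);
        if last_in_run {
            // Only now is contains() reliable, on the reassembled run.
            let text = std::mem::take(&mut self.buffer);
            if text.contains("needle") {
                text.replace("needle", "thread")
            } else {
                text
            }
        } else {
            String::new()
        }
    }
}

fn main() {
    let mut h = SearchingHandler { buffer: String::new() };
    // "needle" straddles all three fragment boundaries:
    let fragments = ["a n", "ee", "dle here"];
    let mut out = String::new();
    for (i, f) in fragments.iter().enumerate() {
        out.push_str(&h.on_chunk(f, i == fragments.len() - 1));
    }
    assert_eq!(out, "a thread here");
}
```

The cost is exactly what the paragraph above describes: all text is withheld (buffered) until the run ends, trading latency and memory for correctness.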

This behavior makes text chunk handlers quite different from comment and element handlers.

kornelski added a commit that referenced this issue Jan 1, 2025
@kornelski
Contributor Author

Some ideas for how to mitigate this:

  1. Buffer entire text nodes when needed. This would add output latency and unbounded memory use, but the text handlers would be easy to write, and mutations would do the obvious thing.

  2. Intentionally fragment text chunks into small pieces. This would reliably expose invalid assumptions in the text chunk handlers, and force users to face the complexity. The current implementation fragments into 1KB pieces, which isn't small enough to notice the boundaries in most cases.

  3. Cleverly share the state of Mutations between text chunks to make the text handlers idempotent: .before() would execute on the first chunk only, .after() on the last chunk only, and .replace() would run once and then remove all following chunks automatically. This would make naive text handlers work as expected, except that chunk.as_str() would still be unreliable, and the overall behavior would be even more complex.

  4. Buffer text only up to $x KB to keep latency and memory usage bounded. This would make incorrect text handlers run into the problem less often, but would make the problem even more obscure.

  5. Introduce a mode or a new kind of text handler that doesn't pass any input text to the handler, so the handler can only prepend/append/replace the entire text between tags. That would be like option 3, but without the footgun of being invoked many times with fragmented inputs. Note that we can't offer any text-based selector, since that would require buffering until a match is found, so in the common case all text would be buffered.

  6. Each of the above as a config option or a new type of handler.
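To make the trade-off in idea 4 concrete, here is an illustrative sketch of bounded buffering. All names and the tiny LIMIT are invented for the example; a real implementation would flush at a few kilobytes:

```rust
// Idea 4 sketch: buffer at most LIMIT bytes per text run, flushing early
// when the cap is hit. Memory and latency stay bounded, but any search
// over individual flushes can still miss a match that straddles a flush
// boundary -- which is why this makes the bug rarer yet more obscure.
const LIMIT: usize = 4;

struct BoundedBuffer {
    buf: String,
}

impl BoundedBuffer {
    fn on_chunk(&mut self, fragment: &str, last: bool) -> String {
        self.buf.push_str(fragment);
        if last || self.buf.len() >= LIMIT {
            std::mem::take(&mut self.buf)
        } else {
            String::new()
        }
    }
}

fn main() {
    let mut b = BoundedBuffer { buf: String::new() };
    let frags = ["a nee", "dle!"];
    let mut out = String::new();
    for (i, f) in frags.iter().enumerate() {
        out.push_str(&b.on_chunk(f, i == frags.len() - 1));
    }
    // The text passes through intact, but "needle" was split across the
    // two flushes ("a nee" and "dle!"), so a per-flush search misses it.
    assert_eq!(out, "a needle!");
}
```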

This was referenced Jan 1, 2025
@kornelski
Contributor Author

@inikulin @orium What do you think about this issue?
