-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Suggestion] Add tokens to tree construction tests #100
Comments
Browsers can benefit from this too, AFAIK most browsers have preparsers that are used for content pre-load. With this addition there will be a convenient way to test it. @RReverser The only concern I have is that it will be way to hard to add new tests written by hand. Can we have some tool that generates token list for given markup? |
You mean for literally new tests? I guess writing them by hand shouldn't be much worse than writing a tree itself or any other tokenizer tests. |
Really if we do this, we could probably eliminate the tokenizer tests (and really, they're not that great: there's no way to run them against some implementations, like browsers, easily, and hence AFAIK no browser vendor runs them). At the same time, we can't just programmatically generate trees using some implementation because for some the interesting part of the test is something the tree constructor will act like it was there anyway (e.g., |
Well we're not talking about trees, but about pure lists of tokens (unless you mean something else), which should still contain only actual tokens from the source, in the same order, with the only difference being that the tokenizer states were adjusted. For example, this is how one test generated by mentioned parse5 script currently looks like: {
"description": "<script><div></script></div><title><p></title><p><p>",
"input": "<script><div></script></div><title><p></title><p><p>",
"output": [
[
"StartTag",
"script",
{}
],
[
"Character",
"<div>"
],
[
"EndTag",
"script"
],
[
"EndTag",
"div"
],
[
"StartTag",
"title",
{}
],
[
"Character",
"<p>"
],
[
"EndTag",
"title"
],
[
"StartTag",
"p",
{}
],
[
"StartTag",
"p",
{}
]
]
}, |
I guess... but for the start, would you accept a PR that adds |
I meant for the
Yes. |
Motivation: Some HTML parsers (e.g. parse5 and our internal parser) provide a streaming mode in which tokenizer works as if it's executed together with tree construction algorithm, and so tokenizer states are correctly adjusted on certain tags.
Such adjustment is necessary for tokenizer to correctly tokenize contents of
<script>
,<textarea>
and other special tags while still allowing streaming pass-through without the overhead of tree construction which might also need to buffer content to move nodes around etc.Problem: For now, tests for such "tokenizer + parser feedback" combination have to be auto-generated at build time by running tree construction tests in a full parser and observing tokens produced (see https://github.com/inikulin/parse5/blob/master/scripts/generate_parser_feedback_test/index.js for an example). This is potentially error-prone and hardly reusable between different parsers (currently you need to copy the produced tests around).
Suggestion: add
tokens
property to the existing tree construction tests that would contain adjusted tokens as they should be seen after changing states as per HTML5 spec.I'm happy to prepare PR myself if this proposal is accepted as it would greatly benefit existing and future streaming HTML parsers, and also, I think, can be used by regular parsers for extra checks.
The text was updated successfully, but these errors were encountered: