[Suggestion] Add tokens to tree construction tests #100

RReverser · 2017-07-21T12:01:52Z

Motivation: Some HTML parsers (e.g. parse5 and our internal parser) provide a streaming mode in which the tokenizer works as if it were running together with the tree construction algorithm, so tokenizer states are correctly adjusted on certain tags.

Such adjustment is necessary for the tokenizer to correctly tokenize the contents of <script>, <textarea> and other special tags, while still allowing streaming pass-through without the overhead of tree construction, which might also need to buffer content in order to move nodes around.

Problem: For now, tests for this "tokenizer + parser feedback" combination have to be auto-generated at build time by running the tree construction tests in a full parser and observing the tokens produced (see https://github.com/inikulin/parse5/blob/master/scripts/generate_parser_feedback_test/index.js for an example). This is potentially error-prone and hardly reusable between different parsers (currently you need to copy the produced tests around).

Suggestion: add a tokens property to the existing tree construction tests that would contain the adjusted tokens as they should be seen after changing states as per the HTML5 spec.

I'm happy to prepare a PR myself if this proposal is accepted, as it would greatly benefit existing and future streaming HTML parsers and, I think, could also be used by regular parsers for extra checks.


inikulin · 2017-07-21T12:07:51Z

Browsers can benefit from this too; AFAIK most browsers have preparsers that are used for content pre-load. With this addition there will be a convenient way to test them.

@RReverser The only concern I have is that it will be way too hard to add new tests written by hand. Can we have some tool that generates a token list for given markup?

RReverser · 2017-07-21T12:13:22Z

You mean for literally new tests? I guess writing them by hand shouldn't be much worse than writing a tree itself or any other tokenizer tests.

gsnedders · 2017-07-21T18:16:01Z

Really if we do this, we could probably eliminate the tokenizer tests (and really, they're not that great: there's no way to run them against some implementations, like browsers, easily, and hence AFAIK no browser vendor runs them). At the same time, we can't just programmatically generate trees using some implementation, because for some tests the interesting part is something the tree constructor will act as if it were there anyway (e.g., <p>foo</p> vs. <p>foo)… That said, if we just replaced all known tags with <fake-tag> we should still test everything, I think?

RReverser · 2017-07-21T19:11:07Z

> At the same time, we can't just programmatically generate trees

Well we're not talking about trees, but about pure lists of tokens (unless you mean something else), which should still contain only actual tokens from the source, in the same order, with the only difference being that the tokenizer states were adjusted.

For example, this is what one test generated by the mentioned parse5 script currently looks like:

{
    "description": "<script><div></script></div><title><p></title><p><p>",
    "input": "<script><div></script></div><title><p></title><p><p>",
    "output": [
        [
            "StartTag",
            "script",
            {}
        ],
        [
            "Character",
            "<div>"
        ],
        [
            "EndTag",
            "script"
        ],
        [
            "EndTag",
            "div"
        ],
        [
            "StartTag",
            "title",
            {}
        ],
        [
            "Character",
            "<p>"
        ],
        [
            "EndTag",
            "title"
        ],
        [
            "StartTag",
            "p",
            {}
        ],
        [
            "StartTag",
            "p",
            {}
        ]
    ]
},
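
To illustrate what "adjusted tokenizer states" means for the test above, here is a minimal Python sketch of a tokenizer with feedback. It is not the parse5 implementation and not the spec algorithm: the function name `tokenize_with_feedback`, the `TEXT_ONLY` set and the regex-based tag matching are all assumptions made for illustration, attributes are ignored (every start tag gets `{}`), and character references, comments and the script escape states are not handled.

```python
import re

# Hypothetical, simplified set of elements whose start tag switches the
# tokenizer out of the normal data state (raw text or RCDATA in the spec).
TEXT_ONLY = {"script", "style", "title", "textarea"}

# Very rough tag matcher: "<name ...>" or "</name ...>". Attributes are skipped.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9-]*)[^>]*>")


def tokenize_with_feedback(html):
    """Return html5lib-style token lists for `html` (illustrative only)."""
    tokens = []
    pos = 0
    while pos < len(html):
        match = TAG_RE.search(html, pos)
        if match is None:
            # No more tags: the rest of the input is character data.
            tokens.append(["Character", html[pos:]])
            break
        if match.start() > pos:
            tokens.append(["Character", html[pos:match.start()]])
        is_end_tag = bool(match.group(1))
        name = match.group(2).lower()
        pos = match.end()
        if is_end_tag:
            tokens.append(["EndTag", name])
            continue
        tokens.append(["StartTag", name, {}])  # attributes ignored in this sketch
        if name in TEXT_ONLY:
            # Feedback step: everything up to the matching end tag is emitted
            # as character data instead of being tokenized as markup.
            end_re = re.compile(r"</" + re.escape(name) + r"\s*>", re.IGNORECASE)
            end = end_re.search(html, pos)
            stop = end.start() if end else len(html)
            if stop > pos:
                tokens.append(["Character", html[pos:stop]])
            pos = stop
    return tokens
```

Run on the input of the test above, this sketch yields the same nine tokens: `<div>` and `<p>` come out as Character tokens because the preceding `<script>` and `<title>` start tags switched the state, while the stray `</div>` is still reported as an EndTag.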

RReverser · 2017-07-21T19:13:00Z

> Really if we do this, we could probably eliminate the tokenizer tests (and really, they're not that great: there's no way to run them against some implementations, like browsers, easily, and hence AFAIK no browser vendor runs them).

I guess... but for the start, would you accept a PR that adds tokens to tree-construction tests, and then we can discuss tokenizer tests separately?

gsnedders · 2017-08-10T13:44:40Z

> > At the same time, we can't just programmatically generate trees
>
> Well we're not talking about trees, but about pure lists of tokens (unless you mean something else), which should still contain only actual tokens from the source, in the same order, with the only difference being that the tokenizer states were adjusted.

I meant for the tokenizer tests.

> I guess... but for the start, would you accept a PR that adds tokens to tree-construction tests, and then we can discuss tokenizer tests separately?

Yes.
