
Parser data #232

Merged
jaytaph merged 14 commits into main from parser-data on Nov 8, 2023

Conversation

jaytaph (Member) commented Nov 5, 2023

No description provided.

This ParserData structure is a gateway between the tokenizer and the
parser. In one specific case, the tokenizer needs to know the state of
the parser to generate a correct token. The current setup has the
tokenizer and parser arranged in such a way that they cannot easily
reference each other without borrow-checker issues.

Therefore we add this "hack", which determines the data beforehand and
calls the tokenizer with it. This means the data is computed on every
tokenizer call instead of only when needed, but it saves a big refactor
of the tokenizer/parser.

In the future, we should probably separate the tokenizer, parser, and
tree builder/sink structures so this is no longer an issue.
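
For illustration, a minimal sketch of what such a gateway could look like. The ParserData name comes from this PR, but the field and the other names here are assumptions: per the HTML5 spec, the one situation where the tokenizer needs parser state is deciding whether `<![CDATA[` opens a real CDATA section, which depends on whether the adjusted current node is in the HTML namespace.

```rust
/// A minimal sketch, assuming a copyable snapshot struct; the real
/// ParserData in this PR may carry different fields.
#[derive(Debug, Clone, Copy, Default)]
pub struct ParserData {
    /// Whether the adjusted current node is in the HTML namespace.
    /// This decides if "<![CDATA[" opens a CDATA section or is
    /// treated as a bogus comment.
    pub adjusted_node_is_html: bool,
}

struct Tokenizer;

impl Tokenizer {
    fn next_token(&mut self, data: ParserData) -> String {
        // The tokenizer reads the snapshot instead of borrowing the
        // parser, which is what sidesteps the borrow-checker cycle.
        if data.adjusted_node_is_html {
            "cdata-as-bogus-comment".into()
        } else {
            "cdata-section".into()
        }
    }
}

struct Parser {
    tokenizer: Tokenizer,
}

impl Parser {
    fn fetch_next_token(&mut self) -> String {
        // Computed up front on every call -- the "hack" described above.
        let data = ParserData {
            adjusted_node_is_html: self.adjusted_current_node_is_html(),
        };
        self.tokenizer.next_token(data)
    }

    fn adjusted_current_node_is_html(&self) -> bool {
        true // stand-in for a real open-elements-stack lookup
    }
}

fn main() {
    let mut parser = Parser { tokenizer: Tokenizer };
    println!("{}", parser.fetch_next_token());
}
```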
CharlesChen0823 (Contributor) commented Nov 5, 2023

IMO, after the changes from my review comments, everything is done except the scripted test cases. @jaytaph

jaytaph (Member, Author) commented Nov 5, 2023

@CharlesChen0823 I'll add your suggestions to this PR.

jaytaph marked this pull request as ready for review on November 5, 2023 19:23
jaytaph (Member, Author) commented Nov 5, 2023

@CharlesChen0823 All your suggestions have been added, and we now have a 99.89% pass rate. Four tests cannot pass because they require scripting, which we have not implemented yet.

Many thanks for the help on the parser and test fixes. Hope to have your input on other parts of the gosub browser.

benches/tree_construction.rs (resolved)
src/html5/tokenizer.rs (resolved)
src/html5/tokenizer/character_reference.rs (resolved)
src/testing/tree_construction/parser.rs (resolved)
tests/tree_construction.rs (resolved)
@@ -114,15 +114,6 @@ pub struct CharIterator {
pub has_read_eof: bool, // True when we just read an EOF
}

pub enum SeekMode {
jaytaph (Member, Author):

I've removed the ability to seek in a stream. For now that is good enough, and it saves a lot of time processing newlines and column-end calculations.

@@ -361,15 +306,26 @@ impl CharIterator {
// If we still can move forward in the stream, move forwards
if self.position.offset < self.length {
let c = self.buffer[self.position.offset];
self.seek(SeekMode::SeekCur, 1);
if c == Ch('\n') {
jaytaph (Member, Author):

Even though we cannot "seek" in a stream, we can unwind a single character. This would be easy, except that at the start of a line we must know the previous line's ending column. We store this in line_offsets so we don't need to recalculate it.
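
A minimal sketch of that unwind, assuming a position struct and a line_offsets vector; the crate's actual CharIterator fields may differ:

```rust
/// Hypothetical sketch; field names are assumptions.
struct Position {
    offset: usize,
    line: usize, // 0-based line index
    col: usize,  // 0-based column index
}

struct CharIterator {
    buffer: Vec<char>,
    position: Position,
    /// Ending column of each fully-read line, recorded while moving
    /// forward, so unwinding across a newline never rescans the buffer.
    line_offsets: Vec<usize>,
}

impl CharIterator {
    fn unread(&mut self) {
        if self.position.offset == 0 {
            return; // nothing to unwind
        }
        self.position.offset -= 1;
        if self.buffer[self.position.offset] == '\n' {
            // Stepping back over a newline: restore the previous line's
            // ending column from line_offsets instead of recounting.
            self.position.line -= 1;
            self.position.col = self.line_offsets[self.position.line];
        } else {
            self.position.col -= 1;
        }
    }
}

fn main() {
    let mut it = CharIterator {
        buffer: "ab\nc".chars().collect(),
        // Pretend we just read past the '\n' onto line 1, column 0.
        position: Position { offset: 3, line: 1, col: 0 },
        line_offsets: vec![2], // line 0 ("ab") ended at column 2
    };
    it.unread();
    assert_eq!((it.position.line, it.position.col), (0, 2));
}
```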

@@ -444,6 +444,11 @@ impl<'chars> Html5Parser<'chars> {
let mut handle_as_script_endtag = false;

match &self.current_token.clone() {
Token::Text(value) if self.current_token.is_mixed() => {
jaytaph (Member, Author) commented Nov 7, 2023:

This is now a common code block when dealing with text.

At these points, special cases are needed for whitespace tokens, null tokens, or both. This section checks whether there are mixed characters in the current token; if so, it splits them and inserts the pieces into the token queue. From that point the parser continues, and on the next loop we arrive back here. Now, however, is_mixed() returns false (since the tokens have been separated), and we continue with the following match arms.
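
A minimal sketch of this flow with stand-in types; the real match arms and queue plumbing are more involved, and split_mixed here is just a compact version of the splitter sketched further below:

```rust
use std::collections::VecDeque;

#[derive(Clone, Debug)]
enum Token {
    Text(String),
}

impl Token {
    /// "Mixed" = contains both ASCII whitespace and non-whitespace.
    fn is_mixed(&self) -> bool {
        let Token::Text(s) = self;
        s.chars().any(|c| c.is_ascii_whitespace())
            && s.chars().any(|c| !c.is_ascii_whitespace())
    }
}

struct Parser {
    current_token: Token,
    token_queue: VecDeque<Token>,
}

impl Parser {
    fn handle_text(&mut self) {
        let token = self.current_token.clone();
        match &token {
            Token::Text(value) if token.is_mixed() => {
                // First pass: split into uniform pieces and push them to
                // the front of the queue; the main loop re-enters this
                // match with pieces for which is_mixed() is now false.
                for piece in split_mixed(value).into_iter().rev() {
                    self.token_queue.push_front(piece);
                }
            }
            Token::Text(value) if value.chars().all(|c| c.is_ascii_whitespace()) => {
                // whitespace-only special case goes here
            }
            Token::Text(_) => {
                // regular text handling goes here
            }
        }
    }
}

/// Compact splitter: group consecutive chars by whitespace-ness.
fn split_mixed(text: &str) -> Vec<Token> {
    let mut runs: Vec<String> = Vec::new();
    let mut prev_ws = None;
    for c in text.chars() {
        let ws = c.is_ascii_whitespace();
        if prev_ws == Some(ws) {
            runs.last_mut().unwrap().push(c);
        } else {
            runs.push(c.to_string());
        }
        prev_ws = Some(ws);
    }
    runs.into_iter().map(Token::Text).collect()
}

fn main() {
    let mut parser = Parser {
        current_token: Token::Text("  hello ".into()),
        token_queue: VecDeque::new(),
    };
    parser.handle_text();
    println!("{:?}", parser.token_queue); // [Text("  "), Text("hello"), Text(" ")]
}
```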


let tokens = self.split_mixed_token(&pending_chars);
jaytaph (Member, Author):

Special case where we split tokens manually. After this, we call handle_in_body for each of those tokens before continuing with the rest of the flow.

Token::Text(..) => {
self.reconstruct_formatting();

self.insert_text_element(&self.current_token.clone());

self.frameset_ok = false;
// If this mixed token does not have whitespace chars, set frameset_ok to false
jaytaph (Member, Author):

We don't care about mixed values here. If there is any whitespace, frames are OK.

/// The idea is that large blobs of javascript for instance will not be split into separate
/// tokens, but still be seen and parsed as a single TextToken.
///
fn split_mixed_token(&self, text: &String) -> Vec<Token> {
jaytaph (Member, Author):

I think this can be optimized further, but for now it's OK enough.
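
One plausible implementation, using a stand-in Token type; the real method presumably also handles null characters, and splitting is only invoked where the parser needs it, so large script blobs stay whole as the doc comment above describes:

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Text(String),
}

/// Hypothetical sketch: split a mixed text token into runs that are
/// uniformly ASCII whitespace or uniformly non-whitespace, so the
/// whitespace/null special cases in the parser see clean tokens.
fn split_mixed_token(text: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    let mut prev_ws: Option<bool> = None;

    for c in text.chars() {
        let ws = c.is_ascii_whitespace();
        if prev_ws.is_some() && prev_ws != Some(ws) {
            // Whitespace-ness flipped: close the current run.
            tokens.push(Token::Text(std::mem::take(&mut current)));
        }
        current.push(c);
        prev_ws = Some(ws);
    }
    if !current.is_empty() {
        tokens.push(Token::Text(current));
    }
    tokens
}

fn main() {
    let parts = split_mixed_token("\n  foo bar");
    assert_eq!(
        parts,
        vec![
            Token::Text("\n  ".into()),
            Token::Text("foo".into()),
            Token::Text(" ".into()),
            Token::Text("bar".into()),
        ]
    );
}
```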

@@ -30,6 +30,31 @@ pub enum Token {
Eof,
}

impl Token {
pub(crate) fn is_mixed(&self) -> bool {
jaytaph (Member, Author):

Note that we only want to check for ASCII whitespace, not Unicode whitespace.
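
A minimal sketch of the distinction: Rust's char::is_ascii_whitespace matches exactly the HTML5 whitespace set (tab, LF, FF, CR, space), while char::is_whitespace would also match Unicode spaces such as U+00A0. The is_mixed shape below is an assumption; the real method may also consider null characters.

```rust
/// Hypothetical sketch; "mixed" = both whitespace and non-whitespace.
fn is_mixed(text: &str) -> bool {
    let has_ws = text.chars().any(|c| c.is_ascii_whitespace());
    let has_other = text.chars().any(|c| !c.is_ascii_whitespace());
    has_ws && has_other
}

fn main() {
    assert!(is_mixed(" a"));   // whitespace + text
    assert!(!is_mixed("abc")); // uniform text
    assert!(!is_mixed("   ")); // uniform whitespace
    // U+00A0 (NBSP) is Unicode whitespace but not ASCII whitespace,
    // so this string is treated as uniform non-whitespace here.
    assert!(!is_mixed("a\u{00A0}b"));
}
```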

@@ -630,23 +635,24 @@ impl<'chars> Html5Parser<'chars> {
fn process_html_content(&mut self) {
if self.ignore_lf {
if let Token::Text(value) = &self.current_token {
if value.eq(&"\n".to_string()) {
self.current_token = self.fetch_next_token();
if value.starts_with('\n') {
jaytaph (Member, Author):

Instead of fetching the next token, we check whether the current token starts with a \n. If so, we remove the \n from the current token and continue as usual.
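
A minimal sketch of the new behaviour with assumed names (per the HTML5 spec, ignore_lf is set after start tags like `<pre>`, whose first newline is ignored):

```rust
/// Hypothetical sketch: when ignore_lf is set, strip a leading '\n'
/// off the current text token instead of fetching a whole new token.
fn apply_ignore_lf(current: &mut String, ignore_lf: &mut bool) {
    if *ignore_lf {
        if current.starts_with('\n') {
            // Handles both "\n" alone and "\nmore text": the remainder
            // flows through the normal token handling unchanged.
            current.remove(0);
        }
        *ignore_lf = false;
    }
}

fn main() {
    let mut ignore = true;
    let mut token = String::from("\nhello");
    apply_ignore_lf(&mut token, &mut ignore);
    assert_eq!(token, "hello");
    assert!(!ignore);
}
```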

for c in value.chars() {
self.token_queue.push(Token::Text(c.to_string()));
}
self.token_queue.push(Token::Text(value));
jaytaph (Member, Author):

We can safely push the whole value directly as a single token instead of splitting each character into a separate token. This saves us a lot of parsing time.

let node = self.create_node(token, HTML_NAMESPACE);
let node_id = self.document.get_mut().add_new_node(node);
// Skip empty text nodes
if let Token::Text(text) = token {
jaytaph (Member, Author):

It seems possible for empty tokens to arrive here. We have to make sure we don't add them to the tree.
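
A minimal sketch of the guard, with stand-in types:

```rust
#[derive(Debug)]
enum Token {
    Text(String),
    Eof,
}

/// Hypothetical sketch: empty text tokens can apparently reach node
/// creation, so filter them out before touching the tree.
fn should_add_to_tree(token: &Token) -> bool {
    match token {
        Token::Text(text) => !text.is_empty(),
        _ => true,
    }
}

fn main() {
    assert!(!should_add_to_tree(&Token::Text(String::new())));
    assert!(should_add_to_tree(&Token::Text("hi".into())));
    assert!(should_add_to_tree(&Token::Eof));
}
```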

@@ -50,6 +50,30 @@ pub struct Tokenizer<'stream> {
pub error_logger: Rc<RefCell<ErrorLogger>>,
}

impl<'stream> Tokenizer<'stream> {
pub(crate) fn insert_tokens_at_queue_start(&mut self, first_tokens: Vec<Token>) {
jaytaph (Member, Author):

This inserts a set of tokens at the front of the queue.
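
A minimal sketch, assuming the token queue is a VecDeque; iterating the new tokens in reverse preserves their relative order at the front:

```rust
use std::collections::VecDeque;

/// Hypothetical sketch: push the given tokens to the front of the
/// queue so they are consumed before anything already buffered,
/// keeping their relative order.
fn insert_tokens_at_queue_start(queue: &mut VecDeque<String>, first_tokens: Vec<String>) {
    for token in first_tokens.into_iter().rev() {
        queue.push_front(token);
    }
}

fn main() {
    let mut queue: VecDeque<String> = VecDeque::from(vec!["later".to_string()]);
    insert_tokens_at_queue_start(&mut queue, vec!["a".into(), "b".into()]);
    assert_eq!(queue, vec!["a".to_string(), "b".into(), "later".into()]);
}
```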

emwalker (Collaborator) left a comment:

I'd pitch for merging this sooner rather than later.

match self.insertion_mode {
InsertionMode::Initial => {
let mut anything_else = false;

match &self.current_token.clone() {
Token::Text(value) if self.current_token.is_mixed() => {
CharlesChen0823 (Contributor):

IMO, the current_token.is_mixed handling should move into fetch_next_token, and then insert_tokens_at_queue_start would not be needed.

jaytaph (Member, Author):

@CharlesChen0823 I'm not sure exactly what you mean. Do you mean that we should do this "is_mixed" check inside fetch_next_token instead of directly everywhere in the code?

jaytaph (Member, Author):

@CharlesChen0823 For now, I want to merge this PR. If it's OK with you, can you describe your idea for changing this in a draft PR?

jaytaph merged commit 1af9891 into main on Nov 8, 2023
4 checks passed
jaytaph deleted the parser-data branch on November 8, 2023 08:57