
Parser data #232

Merged
jaytaph merged 14 commits into main from parser-data on Nov 8, 2023

Conversation

jaytaph (Member) commented Nov 5, 2023

No description provided.

This ParserData structure is a gateway between the tokenizer and the
parser. In one specific case, the tokenizer needs to know the state of
the parser to generate a correct token. The current setup has the
tokenizer and parser arranged in such a way that they cannot easily
reference each other without borrow-checker issues.

Therefore we add this "hack", which determines the data beforehand and
calls the tokenizer with it. This means the data is computed on every
tokenizer call instead of only when needed, but it saves a big refactor
of the tokenizer/parser.

In the future, we should probably separate the tokenizer, parser, and
tree builder/sink structures so this is no longer an issue.
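
For illustration, a minimal sketch of what such a gateway could look like. The ParserData name comes from this PR, but the field and the other names here are assumptions: per the HTML5 spec, the one situation where the tokenizer needs parser state is deciding whether `<![CDATA[` opens a real CDATA section, which depends on whether the adjusted current node is in the HTML namespace.

```rust
/// A minimal sketch, assuming a copyable snapshot struct; the real
/// ParserData in this PR may carry different fields.
#[derive(Debug, Clone, Copy, Default)]
pub struct ParserData {
    /// Whether the adjusted current node is in the HTML namespace.
    /// This decides if "<![CDATA[" opens a CDATA section or is
    /// treated as a bogus comment.
    pub adjusted_node_is_html: bool,
}

struct Tokenizer;

impl Tokenizer {
    fn next_token(&mut self, data: ParserData) -> String {
        // The tokenizer reads the snapshot instead of borrowing the
        // parser, which is what sidesteps the borrow-checker cycle.
        if data.adjusted_node_is_html {
            "cdata-as-bogus-comment".into()
        } else {
            "cdata-section".into()
        }
    }
}

struct Parser {
    tokenizer: Tokenizer,
}

impl Parser {
    fn fetch_next_token(&mut self) -> String {
        // Computed up front on every call -- the "hack" described above.
        let data = ParserData {
            adjusted_node_is_html: self.adjusted_current_node_is_html(),
        };
        self.tokenizer.next_token(data)
    }

    fn adjusted_current_node_is_html(&self) -> bool {
        true // stand-in for a real open-elements-stack lookup
    }
}

fn main() {
    let mut parser = Parser { tokenizer: Tokenizer };
    println!("{}", parser.fetch_next_token());
}
```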
CharlesChen0823 (Contributor) commented Nov 5, 2023

IMO, after the changes from my review comments, everything is done except the scripted test cases. @jaytaph

jaytaph (Member, Author) commented Nov 5, 2023

@CharlesChen0823 I'll add your suggestions to this PR.

jaytaph marked this pull request as ready for review on November 5, 2023 19:23
jaytaph (Member, Author) commented Nov 5, 2023

@CharlesChen0823 All your suggestions have been added, and we now have a 99.89% pass rate. Four tests cannot pass because they require scripting, which we have not implemented yet.

Many thanks for the help on the parser and test fixes. Hope to have your input on other parts of the gosub browser.

benches/tree_construction.rs (resolved)
src/html5/tokenizer.rs (resolved)
src/html5/tokenizer/character_reference.rs (resolved)
src/testing/tree_construction/parser.rs (resolved)
tests/tree_construction.rs (resolved)
@@ -114,15 +114,6 @@ pub struct CharIterator {
pub has_read_eof: bool, // True when we just read an EOF
}

pub enum SeekMode {
jaytaph (Member, Author):

I've removed the ability to seek in a stream. For now that is good enough, and it saves a lot of time processing newlines and column-end calculations.

@@ -361,15 +306,26 @@ impl CharIterator {
// If we still can move forward in the stream, move forwards
if self.position.offset < self.length {
let c = self.buffer[self.position.offset];
self.seek(SeekMode::SeekCur, 1);
if c == Ch('\n') {
jaytaph (Member, Author):

Even though we cannot "seek" in a stream, we can unwind a single character. This would be easy, except that at the start of a line we must know the previous line's ending column. We store this in line_offsets so we don't need to recalculate it.
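
A minimal sketch of that unwind, assuming a position struct and a line_offsets vector; the crate's actual CharIterator fields may differ:

```rust
/// Hypothetical sketch; field names are assumptions.
struct Position {
    offset: usize,
    line: usize, // 0-based line index
    col: usize,  // 0-based column index
}

struct CharIterator {
    buffer: Vec<char>,
    position: Position,
    /// Ending column of each fully-read line, recorded while moving
    /// forward, so unwinding across a newline never rescans the buffer.
    line_offsets: Vec<usize>,
}

impl CharIterator {
    fn unread(&mut self) {
        if self.position.offset == 0 {
            return; // nothing to unwind
        }
        self.position.offset -= 1;
        if self.buffer[self.position.offset] == '\n' {
            // Stepping back over a newline: restore the previous line's
            // ending column from line_offsets instead of recounting.
            self.position.line -= 1;
            self.position.col = self.line_offsets[self.position.line];
        } else {
            self.position.col -= 1;
        }
    }
}

fn main() {
    let mut it = CharIterator {
        buffer: "ab\nc".chars().collect(),
        // Pretend we just read past the '\n' onto line 1, column 0.
        position: Position { offset: 3, line: 1, col: 0 },
        line_offsets: vec![2], // line 0 ("ab") ended at column 2
    };
    it.unread();
    assert_eq!((it.position.line, it.position.col), (0, 2));
}
```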

@@ -444,6 +444,11 @@ impl<'chars> Html5Parser<'chars> {
let mut handle_as_script_endtag = false;

match &self.current_token.clone() {
Token::Text(value) if self.current_token.is_mixed() => {
jaytaph (Member, Author) commented Nov 7, 2023:

This is now a common code block when dealing with text.

At these points, special cases are needed for whitespace tokens, null tokens, or both. This section checks whether there are mixed characters in the current token; if so, it splits them and inserts the pieces into the token queue. From that point the parser continues, and on the next loop we arrive back here. Now, however, is_mixed() returns false (since the tokens have been separated), and we continue with the following match arms.
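
A minimal sketch of this flow with stand-in types; the real match arms and queue plumbing are more involved, and split_mixed here is just a compact version of the splitter sketched further below:

```rust
use std::collections::VecDeque;

#[derive(Clone, Debug)]
enum Token {
    Text(String),
}

impl Token {
    /// "Mixed" = contains both ASCII whitespace and non-whitespace.
    fn is_mixed(&self) -> bool {
        let Token::Text(s) = self;
        s.chars().any(|c| c.is_ascii_whitespace())
            && s.chars().any(|c| !c.is_ascii_whitespace())
    }
}

struct Parser {
    current_token: Token,
    token_queue: VecDeque<Token>,
}

impl Parser {
    fn handle_text(&mut self) {
        let token = self.current_token.clone();
        match &token {
            Token::Text(value) if token.is_mixed() => {
                // First pass: split into uniform pieces and push them to
                // the front of the queue; the main loop re-enters this
                // match with pieces for which is_mixed() is now false.
                for piece in split_mixed(value).into_iter().rev() {
                    self.token_queue.push_front(piece);
                }
            }
            Token::Text(value) if value.chars().all(|c| c.is_ascii_whitespace()) => {
                // whitespace-only special case goes here
            }
            Token::Text(_) => {
                // regular text handling goes here
            }
        }
    }
}

/// Compact splitter: group consecutive chars by whitespace-ness.
fn split_mixed(text: &str) -> Vec<Token> {
    let mut runs: Vec<String> = Vec::new();
    let mut prev_ws = None;
    for c in text.chars() {
        let ws = c.is_ascii_whitespace();
        if prev_ws == Some(ws) {
            runs.last_mut().unwrap().push(c);
        } else {
            runs.push(c.to_string());
        }
        prev_ws = Some(ws);
    }
    runs.into_iter().map(Token::Text).collect()
}

fn main() {
    let mut parser = Parser {
        current_token: Token::Text("  hello ".into()),
        token_queue: VecDeque::new(),
    };
    parser.handle_text();
    println!("{:?}", parser.token_queue); // [Text("  "), Text("hello"), Text(" ")]
}
```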


let tokens = self.split_mixed_token(&pending_chars);
jaytaph (Member, Author):

Special case where we split tokens manually. After this, we call handle_in_body for each of those tokens before continuing with the rest of the flow.

Token::Text(..) => {
self.reconstruct_formatting();

self.insert_text_element(&self.current_token.clone());

self.frameset_ok = false;
// If this mixed token does not have whitespace chars, set frameset_ok to false
jaytaph (Member, Author):

We don't care about mixed values here. If there is any whitespace, frames are OK.

/// The idea is that large blobs of javascript for instance will not be split into separate
/// tokens, but still be seen and parsed as a single TextToken.
///
fn split_mixed_token(&self, text: &String) -> Vec<Token> {
jaytaph (Member, Author):

I think this can be optimized further, but for now it's OK enough.
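
One plausible implementation, using a stand-in Token type; the real method presumably also handles null characters, and splitting is only invoked where the parser needs it, so large script blobs stay whole as the doc comment above describes:

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Text(String),
}

/// Hypothetical sketch: split a mixed text token into runs that are
/// uniformly ASCII whitespace or uniformly non-whitespace, so the
/// whitespace/null special cases in the parser see clean tokens.
fn split_mixed_token(text: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    let mut prev_ws: Option<bool> = None;

    for c in text.chars() {
        let ws = c.is_ascii_whitespace();
        if prev_ws.is_some() && prev_ws != Some(ws) {
            // Whitespace-ness flipped: close the current run.
            tokens.push(Token::Text(std::mem::take(&mut current)));
        }
        current.push(c);
        prev_ws = Some(ws);
    }
    if !current.is_empty() {
        tokens.push(Token::Text(current));
    }
    tokens
}

fn main() {
    let parts = split_mixed_token("\n  foo bar");
    assert_eq!(
        parts,
        vec![
            Token::Text("\n  ".into()),
            Token::Text("foo".into()),
            Token::Text(" ".into()),
            Token::Text("bar".into()),
        ]
    );
}
```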

@@ -30,6 +30,31 @@ pub enum Token {
Eof,
}

impl Token {
pub(crate) fn is_mixed(&self) -> bool {
jaytaph (Member, Author):

Note that we only want to check for ASCII whitespace, not Unicode whitespace.
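
A minimal sketch of the distinction: Rust's char::is_ascii_whitespace matches exactly the HTML5 whitespace set (tab, LF, FF, CR, space), while char::is_whitespace would also match Unicode spaces such as U+00A0. The is_mixed shape below is an assumption; the real method may also consider null characters.

```rust
/// Hypothetical sketch; "mixed" = both whitespace and non-whitespace.
fn is_mixed(text: &str) -> bool {
    let has_ws = text.chars().any(|c| c.is_ascii_whitespace());
    let has_other = text.chars().any(|c| !c.is_ascii_whitespace());
    has_ws && has_other
}

fn main() {
    assert!(is_mixed(" a"));   // whitespace + text
    assert!(!is_mixed("abc")); // uniform text
    assert!(!is_mixed("   ")); // uniform whitespace
    // U+00A0 (NBSP) is Unicode whitespace but not ASCII whitespace,
    // so this string is treated as uniform non-whitespace here.
    assert!(!is_mixed("a\u{00A0}b"));
}
```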

@@ -630,23 +635,24 @@ impl<'chars> Html5Parser<'chars> {
fn process_html_content(&mut self) {
if self.ignore_lf {
if let Token::Text(value) = &self.current_token {
if value.eq(&"\n".to_string()) {
self.current_token = self.fetch_next_token();
if value.starts_with('\n') {
jaytaph (Member, Author):

Instead of fetching the next token, we check whether the current token starts with a \n. If so, we remove the \n from the current token and continue as usual.
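
A minimal sketch of the new behaviour with assumed names (per the HTML5 spec, ignore_lf is set after start tags like `<pre>`, whose first newline is ignored):

```rust
/// Hypothetical sketch: when ignore_lf is set, strip a leading '\n'
/// off the current text token instead of fetching a whole new token.
fn apply_ignore_lf(current: &mut String, ignore_lf: &mut bool) {
    if *ignore_lf {
        if current.starts_with('\n') {
            // Handles both "\n" alone and "\nmore text": the remainder
            // flows through the normal token handling unchanged.
            current.remove(0);
        }
        *ignore_lf = false;
    }
}

fn main() {
    let mut ignore = true;
    let mut token = String::from("\nhello");
    apply_ignore_lf(&mut token, &mut ignore);
    assert_eq!(token, "hello");
    assert!(!ignore);
}
```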

for c in value.chars() {
self.token_queue.push(Token::Text(c.to_string()));
}
self.token_queue.push(Token::Text(value));
jaytaph (Member, Author):

We can safely push the whole value directly as a single token instead of splitting each character into a separate token. This saves us a lot of parsing time.

let node = self.create_node(token, HTML_NAMESPACE);
let node_id = self.document.get_mut().add_new_node(node);
// Skip empty text nodes
if let Token::Text(text) = token {
jaytaph (Member, Author):

It seems possible for empty tokens to arrive here. We have to make sure we don't add them to the tree.
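
A minimal sketch of the guard, with stand-in types:

```rust
#[derive(Debug)]
enum Token {
    Text(String),
    Eof,
}

/// Hypothetical sketch: empty text tokens can apparently reach node
/// creation, so filter them out before touching the tree.
fn should_add_to_tree(token: &Token) -> bool {
    match token {
        Token::Text(text) => !text.is_empty(),
        _ => true,
    }
}

fn main() {
    assert!(!should_add_to_tree(&Token::Text(String::new())));
    assert!(should_add_to_tree(&Token::Text("hi".into())));
    assert!(should_add_to_tree(&Token::Eof));
}
```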

@@ -50,6 +50,30 @@ pub struct Tokenizer<'stream> {
pub error_logger: Rc<RefCell<ErrorLogger>>,
}

impl<'stream> Tokenizer<'stream> {
pub(crate) fn insert_tokens_at_queue_start(&mut self, first_tokens: Vec<Token>) {
jaytaph (Member, Author):

This inserts a set of tokens at the front of the queue.
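
A minimal sketch, assuming the token queue is a VecDeque; iterating the new tokens in reverse preserves their relative order at the front:

```rust
use std::collections::VecDeque;

/// Hypothetical sketch: push the given tokens to the front of the
/// queue so they are consumed before anything already buffered,
/// keeping their relative order.
fn insert_tokens_at_queue_start(queue: &mut VecDeque<String>, first_tokens: Vec<String>) {
    for token in first_tokens.into_iter().rev() {
        queue.push_front(token);
    }
}

fn main() {
    let mut queue: VecDeque<String> = VecDeque::from(vec!["later".to_string()]);
    insert_tokens_at_queue_start(&mut queue, vec!["a".into(), "b".into()]);
    assert_eq!(queue, vec!["a".to_string(), "b".into(), "later".into()]);
}
```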

emwalker (Collaborator) left a comment:

I'd pitch for merging this sooner rather than later.

match self.insertion_mode {
InsertionMode::Initial => {
let mut anything_else = false;

match &self.current_token.clone() {
Token::Text(value) if self.current_token.is_mixed() => {
CharlesChen0823 (Contributor):

IMO, the current_token.is_mixed handling should move into fetch_next_token, and then insert_tokens_at_queue_start would not be needed.

jaytaph (Member, Author):

@CharlesChen0823 I'm not sure exactly what you mean. Do you mean that we should do this "is_mixed" check inside fetch_next_token instead of directly everywhere in the code?

jaytaph (Member, Author):

@CharlesChen0823 For now, I want to merge this PR. If it's OK with you, can you describe your idea for changing this in a draft PR?

jaytaph merged commit 1af9891 into main on Nov 8, 2023
4 checks passed
jaytaph deleted the parser-data branch on November 8, 2023 08:57