Parser data #232
Conversation
This ParserData structure is a gateway between the tokenizer and the parser. In one specific case, the tokenizer needs to know the state of the parser to generate a correct token. The current setup has the tokenizer and parser arranged such that they cannot easily reference each other without borrow-checker issues. Therefore we add this "hack", which gathers the data beforehand and passes it along with each tokenizer call. This means the data is computed for every tokenizer call instead of only when needed, but it saves a big refactor of the tokenizer/parser. In the future, we should probably separate the tokenizer, parser, and tree builder/sink structures so this is no longer an issue.
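The gateway idea described above can be sketched as follows. This is a minimal illustration, not the actual gosub API: the names `ParserData`, `parser_data()`, and `adjusted_node_in_html_namespace` are assumptions chosen for the example. The point is that the parser snapshots the state the tokenizer needs up front, so the tokenizer never holds a reference back into the parser.

```rust
// Hypothetical sketch (illustrative names, not the real gosub types): the
// parser computes a small ParserData snapshot before every tokenizer call,
// so the tokenizer never needs a back-reference to the parser.
#[derive(Clone, Copy, Debug)]
pub struct ParserData {
    // Example of parser state the tokenizer may need, e.g. whether the
    // adjusted current node is in the HTML namespace.
    pub adjusted_node_in_html_namespace: bool,
}

pub struct Parser {
    in_html_namespace: bool,
}

impl Parser {
    // Computed before each tokenizer call, even when the data is not needed.
    pub fn parser_data(&self) -> ParserData {
        ParserData {
            adjusted_node_in_html_namespace: self.in_html_namespace,
        }
    }
}

// The tokenizer only sees the snapshot, avoiding borrow-checker conflicts.
pub fn next_token_kind(data: ParserData) -> &'static str {
    if data.adjusted_node_in_html_namespace {
        "html"
    } else {
        "foreign"
    }
}

fn main() {
    let parser = Parser { in_html_namespace: true };
    assert_eq!(next_token_kind(parser.parser_data()), "html");
}
```

The cost is one cheap snapshot per call; the benefit is that the tokenizer and parser can live in separate borrows.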
IMO, after the changes from my review comments, everything is done except the scripted test cases. @jaytaph
@CharlesChen0823 I'll add your suggestions to this PR.
@CharlesChen0823 All your suggestions have been added and we now have a 99.89% pass rate. There are 4 tests that cannot pass because they require scripting, which we have not implemented yet. Many thanks for the help on the parser and test fixes. Hope to have your input on other parts of the gosub browser.
@@ -114,15 +114,6 @@ pub struct CharIterator {
pub has_read_eof: bool, // True when we just read an EOF
}

pub enum SeekMode {
I've removed the possibility to seek in a stream. For now, that is good enough, and it saves a lot of time processing newlines and column-end calculations.
@@ -361,15 +306,26 @@ impl CharIterator {
// If we still can move forward in the stream, move forwards
if self.position.offset < self.length {
let c = self.buffer[self.position.offset];
self.seek(SeekMode::SeekCur, 1);
if c == Ch('\n') {
Even though we cannot "seek" in a stream, we can unwind a single character. This would be easy, except that at the start of a line we must know the previous line's column ending. We store this in line_offsets so we don't need to recalculate it.
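The unwind logic above can be sketched like this. Field and function names (`Position`, `unread`, the `line_offsets` layout where entry `i` holds the byte offset at which 1-based line `i + 1` starts) are assumptions for illustration, not the actual CharIterator internals.

```rust
// Sketch under assumed names: line_offsets[i] caches the offset at which
// 1-based line i + 1 starts, so unwinding across a line boundary can
// recover the previous line's column ending without rescanning the buffer.
struct Position {
    offset: usize,
    line: usize, // 1-based
    col: usize,  // 1-based
}

fn unread(pos: &mut Position, line_offsets: &[usize]) {
    if pos.offset == 0 {
        return; // nothing to unwind
    }
    pos.offset -= 1;
    if pos.col > 1 {
        pos.col -= 1;
    } else {
        // We stepped back over a '\n': the previous line's column ending is
        // the distance from that line's cached start offset.
        pos.line -= 1;
        pos.col = pos.offset - line_offsets[pos.line - 1] + 1;
    }
}

fn main() {
    // For "ab\ncd": line 1 starts at offset 0, line 2 at offset 3.
    let line_offsets = vec![0, 3];
    let mut pos = Position { offset: 3, line: 2, col: 1 };
    unread(&mut pos, &line_offsets);
    // We are back on the '\n' ending line 1, at column 3.
    assert_eq!((pos.offset, pos.line, pos.col), (2, 1, 3));
}
```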
@@ -444,6 +444,11 @@ impl<'chars> Html5Parser<'chars> {
let mut handle_as_script_endtag = false;

match &self.current_token.clone() {
Token::Text(value) if self.current_token.is_mixed() => {
This is now a common code block when dealing with text.
At these points, special cases are needed for whitespace tokens, null tokens, or both. This section checks whether the current token contains mixed characters; if so, it splits them and inserts the pieces into the token queue. From that point on the parser continues, and on the next loop we reach this point again. This time, however, is_mixed() returns false (the tokens have been separated), and we continue with the following match arms.
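The splitting step can be illustrated as follows. This is a simplified sketch of the idea, not the real `split_mixed_token` (which returns `Vec<Token>`): a mixed text token is cut into runs that are either all ASCII whitespace or all non-whitespace, so each piece matches exactly one of the later match arms.

```rust
// Illustrative sketch: split a mixed token into alternating runs of ASCII
// whitespace and non-whitespace characters.
fn split_mixed(text: &str) -> Vec<String> {
    let mut parts: Vec<String> = Vec::new();
    for c in text.chars() {
        let ws = c.is_ascii_whitespace();
        match parts.last_mut() {
            // Extend the current run if the whitespace-ness matches.
            Some(last) if last.chars().next().map(|f| f.is_ascii_whitespace()) == Some(ws) => {
                last.push(c)
            }
            _ => parts.push(c.to_string()),
        }
    }
    parts
}

fn main() {
    assert_eq!(split_mixed("ab \ncd"), vec!["ab", " \n", "cd"]);
}
```

Each resulting piece is then re-queued, and on the next iteration `is_mixed()` is false for every one of them.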
let tokens = self.split_mixed_token(&pending_chars);
A special case where we split tokens manually. After this, we call handle_in_body for each of those tokens before continuing with the rest of the flow.
Token::Text(..) => {
self.reconstruct_formatting();

self.insert_text_element(&self.current_token.clone());

self.frameset_ok = false;
// If this mixed token does not have whitespace chars, set frameset_ok to false
We don't care about mixed values here. If there is any whitespace, frames are ok.
src/html5/parser.rs
/// The idea is that large blobs of javascript for instance will not be split into separate
/// tokens, but still be seen and parsed as a single TextToken.
///
fn split_mixed_token(&self, text: &String) -> Vec<Token> {
I think this can be optimized more, but for now it's good enough.
@@ -30,6 +30,31 @@ pub enum Token {
Eof,
}

impl Token {
pub(crate) fn is_mixed(&self) -> bool {
Note that we only want to check for ASCII whitespace, not Unicode whitespace.
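A minimal sketch of such a check, assuming the same semantics as the PR's `Token::is_mixed` (the free-function form here is illustrative): only ASCII whitespace (space, tab, LF, FF, CR) counts, which matches the HTML spec's "ASCII whitespace" definition, so Unicode whitespace such as NBSP is treated as ordinary text.

```rust
// Illustrative check: a token is "mixed" if it contains both ASCII
// whitespace and non-whitespace characters.
fn is_mixed(text: &str) -> bool {
    let has_ws = text.chars().any(|c| c.is_ascii_whitespace());
    let has_non_ws = text.chars().any(|c| !c.is_ascii_whitespace());
    has_ws && has_non_ws
}

fn main() {
    assert!(is_mixed("a b"));
    assert!(!is_mixed("abc"));
    assert!(!is_mixed("   "));
    // NBSP (U+00A0) is not ASCII whitespace, so this is not "mixed".
    assert!(!is_mixed("a\u{00A0}b"));
}
```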
@@ -630,23 +635,24 @@ impl<'chars> Html5Parser<'chars> {
fn process_html_content(&mut self) {
if self.ignore_lf {
if let Token::Text(value) = &self.current_token {
if value.eq(&"\n".to_string()) {
self.current_token = self.fetch_next_token();
if value.starts_with('\n') {
Instead of fetching the next token, we check whether the current token starts with a \n. If so, we strip the \n from the current token and continue as usual.
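The new behavior can be sketched like this (the function name `apply_ignore_lf` is made up for the example; in the parser the flag lives on `self.ignore_lf` and the token is modified in place):

```rust
// Sketch: when ignore_lf is set, strip a single leading '\n' from the
// current text token rather than consuming a whole extra token.
fn apply_ignore_lf(value: &str, ignore_lf: bool) -> String {
    if ignore_lf && value.starts_with('\n') {
        value[1..].to_string()
    } else {
        value.to_string()
    }
}

fn main() {
    assert_eq!(apply_ignore_lf("\nhello", true), "hello");
    assert_eq!(apply_ignore_lf("hello", true), "hello");
    assert_eq!(apply_ignore_lf("\nhello", false), "\nhello");
}
```

This matters now that text tokens carry whole strings instead of single characters: the old `value == "\n"` comparison would miss a token like `"\nhello"`.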
for c in value.chars() {
self.token_queue.push(Token::Text(c.to_string()));
}
self.token_queue.push(Token::Text(value));
We can safely push the whole value as a single token instead of splitting it into one token per character. This saves a lot of parsing time.
let node = self.create_node(token, HTML_NAMESPACE);
let node_id = self.document.get_mut().add_new_node(node);
// Skip empty text nodes
if let Token::Text(text) = token {
It seems possible for empty tokens to arrive here. We have to make sure we don't add them to the tree.
@@ -50,6 +50,30 @@ pub struct Tokenizer<'stream> {
pub error_logger: Rc<RefCell<ErrorLogger>>,
}

impl<'stream> Tokenizer<'stream> {
pub(crate) fn insert_tokens_at_queue_start(&mut self, first_tokens: Vec<Token>) {
This inserts a set of tokens at the front of the queue.
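One way to implement front insertion while preserving order is `Vec::splice` over an empty range at index 0. This is a sketch with `String` standing in for `Token`, not the actual tokenizer code:

```rust
// Illustrative sketch: insert a batch of tokens at the front of the queue,
// keeping their relative order.
fn insert_tokens_at_queue_start(queue: &mut Vec<String>, first_tokens: Vec<String>) {
    // Splicing into the empty range 0..0 prepends all elements at once,
    // shifting the existing queue contents only a single time.
    queue.splice(0..0, first_tokens);
}

fn main() {
    let mut queue = vec!["c".to_string()];
    insert_tokens_at_queue_start(&mut queue, vec!["a".to_string(), "b".to_string()]);
    assert_eq!(queue, vec!["a", "b", "c"]);
}
```

A single splice avoids the repeated shifting that a loop of `insert(0, …)` calls would cause.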
I'd pitch for merging this sooner rather than later.
match self.insertion_mode {
InsertionMode::Initial => {
let mut anything_else = false;

match &self.current_token.clone() {
Token::Text(value) if self.current_token.is_mixed() => {
IMO, the current_token.is_mixed check should move into fetch_next_token, and then insert_tokens_at_queue_start would not be needed.
@CharlesChen0823 I'm not sure what exactly you mean. Do you mean that we should do this "is_mixed" handling inside fetch_next_token instead of directly everywhere in the code?
@CharlesChen0823 For now, I want to merge this PR. If it's ok with you, can you describe your idea for changing this in a draft PR?