-
-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
html5 parser / tree-construction test refactor #223
Conversation
for result in results { | ||
for test in fixture.tests.iter() { | ||
for &scripting_enabled in test.script_modes() { | ||
let result = harness |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We run a single test (loaded from the fixtures). It returns a result which holds all information which can be evaluated manually
@@ -1203,7 +1203,7 @@ impl<'chars> Html5Parser<'chars> { | |||
Token::EndTag { name, .. } | |||
if name == "tbody" || name == "tfoot" || name == "thead" => | |||
{ | |||
if !self.is_in_scope(name, Scope::Table) { | |||
if !self.is_in_scope(name, HTML_NAMESPACE, Scope::Table) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was an issue in is_in_scope where we did check for a tag, but DIDN'T check the namespace. In these specific tests, the tag was "TR", but the namespace was NOT html, but mathml. This is changed so is_in_scope also received the namespace on where to check. This is always HTML though, but it's more flexible to have it this was.. a "tag" by itself says nothing, and now we are more flexible.
@@ -2145,8 +2147,25 @@ impl<'chars> Html5Parser<'chars> { | |||
|
|||
self.frameset_ok = false; | |||
|
|||
// Add attributes to body element | |||
// @TODO add body attributes | |||
let body_node_id = self.open_elements.iter().find(|node_id| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we have multiple tags, we need to copy the attributes from the body tags in the first / original body tag. The other body tags are further ignored
@@ -2966,6 +2986,22 @@ impl<'chars> Html5Parser<'chars> { | |||
Token::StartTag { name, .. } if name == "template" => { | |||
let node_id = self.insert_html_element(&self.current_token.clone()); | |||
|
|||
self.active_formatting_elements_push_marker(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More work needs to be done. Html5ever doesn't seem to do this, but I think we should
position: InsertionPositionMode<NodeId>, | ||
token: &Token, | ||
) { | ||
pub fn insert_text_helper(&mut self, position: InsertionPositionMode<NodeId>, token: &Token) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we merge texts, we don't need to create a node, so we are not passing one anymore and only create one when it's needed.
document: DocumentHandle, | ||
} | ||
|
||
impl TreeOutputGenerator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This outputs a given parsed tree (document) into the same format as found in the fixture tests.
@@ -296,7 +297,7 @@ fn test(i: Span) -> IResult<Span, TestSpec> { | |||
|
|||
TestSpec { | |||
position, | |||
data: data.to_string(), | |||
data: data.to_string().trim_matches(|c| c == '\n').to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some nom parse issues when there is a \n at the end of the data. I can't seem to get the parser fix this during parsing, so we do this as a post-step. It works, but I reckon this can be done better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Post-processing what the nom parser has a hard time with seems fine. As long as TestSpec
isn't losing any important information, the wrapper Test
struct can work it into a suitable format.
@@ -333,6 +334,9 @@ mod tests { | |||
|
|||
#[test] | |||
fn parse_data() { | |||
let (_, s) = data("#data\n Test \n#errors\n".into()).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes when #errors is the last line of a fixture file, this goes wrong. So i'm testing with and without a \n now.
// See tests/data/html5lib-tests/tree-construction/ for other test files. | ||
#[test_case("tests1.dat")] | ||
#[test_case("tests2.dat")] | ||
#[test_case("tests3.dat")] | ||
#[test_case("tests4.dat")] | ||
#[test_case("tests5.dat")] | ||
#[test_case("tests6.dat")] | ||
// #[test_case("tests6.dat")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some test cases are not working. I reckon they were false positives so this is a good thing.
continue; | ||
} | ||
|
||
println!("tree construction: {}", test.data()); | ||
test.assert_valid(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a better separation of concerns: the assert_valid was in a test, but it should not be there. Now it's moved to a regular assert!, where the test simply tells it if it passes or not (is_success).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know that I agree with the general principle in this case, but the specific change seems fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I think it separates this better, is that when doing an assert_valid(), the test case now knows about the specific test system you are using. By just having the test returning an "i'm ok", or "i'm not ok",.. you can leave the specific way of asserting to the caller. For instance, in the cargo tests, we use assert!
for this, but in the html5-parser-tests, we need just the is_success
to display an X
or .
.
use gosub_engine::testing::tree_construction::fixture::read_fixtures; | ||
use gosub_engine::testing::tree_construction::result::ResultStatus; | ||
use gosub_engine::testing::tree_construction::Harness; | ||
use gosub_engine::testing::tree_construction::Test; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure this is an improvement, but I've also seen some Rust code like it, so it's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you now talking about the fact that we have multiple structs / functions?
src/bin/parser-test.rs
Outdated
fn main() -> Result<()> { | ||
let mut results = TestResults { | ||
fn main() { | ||
let mut results = GlobalTestResults { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know that "Global" tells us anything here that "TestResults" didn't already. But "global" has the disadvantage of connotations of a singleton or global variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GlobalTestResults
are actually an aggregation of test results. It is used to keep track on how many of the tests fails or succeeded. But i agree that the naming is a bit wrong here.. I think a TotalResults would be better.
"❌ ({}:{}) {} (missing)", | ||
entry.expected.line, entry.expected.col, entry.expected.message | ||
); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switching from an enum that had the actual and expected to a bare enum status without fields that requires you to infer what was different feels like slight a step backwards. But it also seems harmless and easy to revisit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one of the things that I'm trying to get used to. So basically you are opting for each result status to be a struct by itself which contains all the information needed for that particular status, instead of having a status enum and accompaning variables?
I'm not 100% sure how this would look like, so maybe you can make a PR for this?
@@ -50,188 +32,80 @@ pub struct Test { | |||
pub line: usize, | |||
/// The specification of the test provided in the test file | |||
pub spec: TestSpec, | |||
/// The document tree that is expected to be parsed | |||
/// The document tree as found in the spec converted to an array | |||
pub document: Vec<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note (unrelated to this PR) — I feel like we're doing a lot of processing of the TestSpec
, e.g., by splitting the document into lines, and doing sorting and other processing of the lines. I suspect a more rigorous approach would be to just take the document as it is provided in the third-party test and attempt to reproduce it from our own parsing of the input.
(That observation is not meant for this PR.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think document here isn't something that would need to be in the test-spec, but is something that could be generated whenever we need to. The thing is that I'm testing the results better by generating a tree structure from our document, and compare that to the tree from the #document
found in the test spec. One of the benefits is that the tests itself are easier because it's now nothing more than a line-by-line compare. The previous way was too complex and resulted in some false positives that we didn't catch until i've actually refactored this.
pub fn errors(&self) -> &Vec<ErrorSpec> { | ||
&self.spec.errors | ||
pub fn get_document_as_str(&self) -> &str { | ||
return self.spec.document.as_str(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thought about the get_
prefix as above.
src/testing/tree_construction.rs
Outdated
line: self.line, | ||
spec: self.spec.clone(), | ||
document: self.document.clone(), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering whether we need an explicit Clone
implementation, instead of just doing
#[derive(Copy, Clone, Debug, ...)]
pub struct Test {
// ...
(no need to block on this question)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can now after a refactor that i did and didn't remove this clone
file_path: "".to_string(), | ||
line: 0, | ||
spec: TestSpec::default(), | ||
document: vec![], | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar question about default here:
#[derive(Default, Debug, ...)]
pub struct Test {
// ...
(Again, no need to block.)
src/testing/tree_construction.rs
Outdated
let options = Html5ParserOptions { scripting_enabled }; | ||
|
||
let mut chars = CharIterator::new(); | ||
chars.read_from_str(self.data(), None); | ||
chars.read_from_str(self.test.spec.data.as_str(), None); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably keep test.spec
private and expose self.test.data().as_str()
, which delegates to &self.spec.data
inside the Test
struct.
(no need to block on this detail).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do the same with document
from the spec_data, so this makes sense
src/testing/tree_construction.rs
Outdated
@@ -287,8 +160,171 @@ impl Test { | |||
Ok((document, parse_errors)) | |||
} | |||
|
|||
/// Retrieves the next line from the spec document | |||
fn get_next_line(&mut self) -> Option<String> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment about the get_
prefix.
src/testing/tree_construction.rs
Outdated
loop { | ||
let expected_line = self.get_next_line(); | ||
if expected_line.is_none() { | ||
break; | ||
} | ||
|
||
result.tree_results.push(TreeLineResult { | ||
index: line_idx, | ||
result: ResultStatus::Additional, | ||
expected: expected_line.expect("").to_string(), | ||
actual: "".into(), | ||
}); | ||
line_idx += 1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
loop { | |
let expected_line = self.get_next_line(); | |
if expected_line.is_none() { | |
break; | |
} | |
result.tree_results.push(TreeLineResult { | |
index: line_idx, | |
result: ResultStatus::Additional, | |
expected: expected_line.expect("").to_string(), | |
actual: "".into(), | |
}); | |
line_idx += 1; | |
} | |
while let Some(expected_line) = self.get_next_line() { | |
result.tree_results.push(TreeLineResult { | |
index: line_idx, | |
result: ResultStatus::Additional, | |
expected: expected_line.expect("").to_string(), | |
actual: "".into(), | |
}); | |
line_idx += 1; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
src/testing/tree_construction.rs
Outdated
assert!(result.success(), "invalid tree-construction result"); | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about deleting this dead code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with suggestions (feel free to ignore)
Will try and incorporate your suggestions. But that will be tormorrow at best |
For this PR, I'd only consider the smallest, easiest suggestions. Some of the suggestions were bigger, and some where more thoughts than suggestions, which came to mind but were not something I'd necessarily recommend. |
This is an initial draft of a refactorered setup for the tree-tokenizer as it was getting a bit too complex to implement fragmenting / templating correctly.
It uses the same parser as before from @emwalker, but the internals on the tester itself have been changed:
run_test
with a test structure.assert!(result->is_success()
, and for the parser_test we will actually output each single line from the test result for debugging purposes.I THINK this makes the whole process a lot clearer where most of the responsibilities are separated in separate files / structs.
In the end, this solves that we can call the harness with just a few commands, and have the results in whatever way we want to represent. It's easier to implement some of the more complex situations like document-fragments and template content.