Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

html5 parser / tree-construction test refactor #223

Merged
merged 7 commits into from
Nov 5, 2023
Merged

Conversation

jaytaph
Copy link
Member

@jaytaph jaytaph commented Nov 3, 2023

This is an initial draft of a refactorered setup for the tree-tokenizer as it was getting a bit too complex to implement fragmenting / templating correctly.

It uses the same parser as before from @emwalker, but the internals on the tester itself have been changed:

  • we have load_fixture functions to load fixtures from directories and files.
  • we have a test harness, on which you can call run_test with a test structure.
  • the test harness calls the html5 parser and outputs regular data
  • the test harness then creates an internal generated tree through the tree generator function and compares the tree agains the expected tree and errors
  • the result is a "TestResult" which consists of a result for every line in the tree. Each line can be success, fail, missing, additional, depending on the result.
  • the same with errors: error results can be correct, failed, missing, or can be correct, but with wrong line/column positions.
  • the end user can deal with the output of the results. In the case of the html5parser-test, it's a simple dot or X, in the test suite, it's a assert!(result->is_success(), and for the parser_test we will actually output each single line from the test result for debugging purposes.

I THINK this makes the whole process a lot clearer where most of the responsibilities are separated in separate files / structs.

In the end, this solves that we can call the harness with just a few commands, and have the results in whatever way we want to represent. It's easier to implement some of the more complex situations like document-fragments and template content.

@jaytaph jaytaph marked this pull request as ready for review November 4, 2023 13:31
@jaytaph jaytaph requested review from emwalker, Kiyoshika, neuodev and a team November 4, 2023 13:31
src/bin/html5-parser-test.rs Outdated Show resolved Hide resolved
for result in results {
for test in fixture.tests.iter() {
for &scripting_enabled in test.script_modes() {
let result = harness
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We run a single test (loaded from the fixtures). It returns a result which holds all information which can be evaluated manually

@@ -1203,7 +1203,7 @@ impl<'chars> Html5Parser<'chars> {
Token::EndTag { name, .. }
if name == "tbody" || name == "tfoot" || name == "thead" =>
{
if !self.is_in_scope(name, Scope::Table) {
if !self.is_in_scope(name, HTML_NAMESPACE, Scope::Table) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was an issue in is_in_scope where we did check for a tag, but DIDN'T check the namespace. In these specific tests, the tag was "TR", but the namespace was NOT html, but mathml. This is changed so is_in_scope also received the namespace on where to check. This is always HTML though, but it's more flexible to have it this was.. a "tag" by itself says nothing, and now we are more flexible.

@@ -2145,8 +2147,25 @@ impl<'chars> Html5Parser<'chars> {

self.frameset_ok = false;

// Add attributes to body element
// @TODO add body attributes
let body_node_id = self.open_elements.iter().find(|node_id| {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we have multiple tags, we need to copy the attributes from the body tags in the first / original body tag. The other body tags are further ignored

@@ -2966,6 +2986,22 @@ impl<'chars> Html5Parser<'chars> {
Token::StartTag { name, .. } if name == "template" => {
let node_id = self.insert_html_element(&self.current_token.clone());

self.active_formatting_elements_push_marker();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More work needs to be done. Html5ever doesn't seem to do this, but I think we should

position: InsertionPositionMode<NodeId>,
token: &Token,
) {
pub fn insert_text_helper(&mut self, position: InsertionPositionMode<NodeId>, token: &Token) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we merge texts, we don't need to create a node, so we are not passing one anymore and only create one when it's needed.

document: DocumentHandle,
}

impl TreeOutputGenerator {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This outputs a given parsed tree (document) into the same format as found in the fixture tests.

@@ -296,7 +297,7 @@ fn test(i: Span) -> IResult<Span, TestSpec> {

TestSpec {
position,
data: data.to_string(),
data: data.to_string().trim_matches(|c| c == '\n').to_string(),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some nom parse issues when there is a \n at the end of the data. I can't seem to get the parser fix this during parsing, so we do this as a post-step. It works, but I reckon this can be done better.

Copy link
Collaborator

@emwalker emwalker Nov 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Post-processing what the nom parser has a hard time with seems fine. As long as TestSpec isn't losing any important information, the wrapper Test struct can work it into a suitable format.

@@ -333,6 +334,9 @@ mod tests {

#[test]
fn parse_data() {
let (_, s) = data("#data\n Test \n#errors\n".into()).unwrap();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes when #errors is the last line of a fixture file, this goes wrong. So i'm testing with and without a \n now.

// See tests/data/html5lib-tests/tree-construction/ for other test files.
#[test_case("tests1.dat")]
#[test_case("tests2.dat")]
#[test_case("tests3.dat")]
#[test_case("tests4.dat")]
#[test_case("tests5.dat")]
#[test_case("tests6.dat")]
// #[test_case("tests6.dat")]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some test cases are not working. I reckon they were false positives so this is a good thing.

continue;
}

println!("tree construction: {}", test.data());
test.assert_valid();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a better separation of concerns: the assert_valid was in a test, but it should not be there. Now it's moved to a regular assert!, where the test simply tells it if it passes or not (is_success).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that I agree with the general principle in this case, but the specific change seems fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I think it separates this better, is that when doing an assert_valid(), the test case now knows about the specific test system you are using. By just having the test returning an "i'm ok", or "i'm not ok",.. you can leave the specific way of asserting to the caller. For instance, in the cargo tests, we use assert! for this, but in the html5-parser-tests, we need just the is_success to display an X or ..

use gosub_engine::testing::tree_construction::fixture::read_fixtures;
use gosub_engine::testing::tree_construction::result::ResultStatus;
use gosub_engine::testing::tree_construction::Harness;
use gosub_engine::testing::tree_construction::Test;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this is an improvement, but I've also seen some Rust code like it, so it's fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you now talking about the fact that we have multiple structs / functions?

fn main() -> Result<()> {
let mut results = TestResults {
fn main() {
let mut results = GlobalTestResults {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that "Global" tells us anything here that "TestResults" didn't already. But "global" has the disadvantage of connotations of a singleton or global variable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GlobalTestResults are actually an aggregation of test results. It is used to keep track on how many of the tests fails or succeeded. But i agree that the naming is a bit wrong here.. I think a TotalResults would be better.

"❌ ({}:{}) {} (missing)",
entry.expected.line, entry.expected.col, entry.expected.message
);
}
Copy link
Collaborator

@emwalker emwalker Nov 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching from an enum that had the actual and expected to a bare enum status without fields that requires you to infer what was different feels like slight a step backwards. But it also seems harmless and easy to revisit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the things that I'm trying to get used to. So basically you are opting for each result status to be a struct by itself which contains all the information needed for that particular status, instead of having a status enum and accompaning variables?

I'm not 100% sure how this would look like, so maybe you can make a PR for this?

@@ -50,188 +32,80 @@ pub struct Test {
pub line: usize,
/// The specification of the test provided in the test file
pub spec: TestSpec,
/// The document tree that is expected to be parsed
/// The document tree as found in the spec converted to an array
pub document: Vec<String>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note (unrelated to this PR) — I feel like we're doing a lot of processing of the TestSpec, e.g., by splitting the document into lines, and doing sorting and other processing of the lines. I suspect a more rigorous approach would be to just take the document as it is provided in the third-party test and attempt to reproduce it from our own parsing of the input.

(That observation is not meant for this PR.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think document here isn't something that would need to be in the test-spec, but is something that could be generated whenever we need to. The thing is that I'm testing the results better by generating a tree structure from our document, and compare that to the tree from the #document found in the test spec. One of the benefits is that the tests itself are easier because it's now nothing more than a line-by-line compare. The previous way was too complex and resulted in some false positives that we didn't catch until i've actually refactored this.

pub fn errors(&self) -> &Vec<ErrorSpec> {
&self.spec.errors
pub fn get_document_as_str(&self) -> &str {
return self.spec.document.as_str();
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thought about the get_ prefix as above.

line: self.line,
spec: self.spec.clone(),
document: self.document.clone(),
}
Copy link
Collaborator

@emwalker emwalker Nov 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering whether we need an explicit Clone implementation, instead of just doing

#[derive(Copy, Clone, Debug, ...)]
pub struct Test {
  // ...

(no need to block on this question)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can now after a refactor that i did and didn't remove this clone

file_path: "".to_string(),
line: 0,
spec: TestSpec::default(),
document: vec![],
}
Copy link
Collaborator

@emwalker emwalker Nov 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar question about default here:

#[derive(Default, Debug, ...)]
pub struct Test {
  // ...

(Again, no need to block.)

let options = Html5ParserOptions { scripting_enabled };

let mut chars = CharIterator::new();
chars.read_from_str(self.data(), None);
chars.read_from_str(self.test.spec.data.as_str(), None);
Copy link
Collaborator

@emwalker emwalker Nov 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably keep test.spec private and expose self.test.data().as_str(), which delegates to &self.spec.data inside the Test struct.

(no need to block on this detail).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do the same with document from the spec_data, so this makes sense

@@ -287,8 +160,171 @@ impl Test {
Ok((document, parse_errors))
}

/// Retrieves the next line from the spec document
fn get_next_line(&mut self) -> Option<String> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment about the get_ prefix.

Comment on lines 242 to 255
loop {
let expected_line = self.get_next_line();
if expected_line.is_none() {
break;
}

result.tree_results.push(TreeLineResult {
index: line_idx,
result: ResultStatus::Additional,
expected: expected_line.expect("").to_string(),
actual: "".into(),
});
line_idx += 1;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
loop {
let expected_line = self.get_next_line();
if expected_line.is_none() {
break;
}
result.tree_results.push(TreeLineResult {
index: line_idx,
result: ResultStatus::Additional,
expected: expected_line.expect("").to_string(),
actual: "".into(),
});
line_idx += 1;
}
while let Some(expected_line) = self.get_next_line() {
result.tree_results.push(TreeLineResult {
index: line_idx,
result: ResultStatus::Additional,
expected: expected_line.expect("").to_string(),
actual: "".into(),
});
line_idx += 1;
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

assert!(result.success(), "invalid tree-construction result");
}
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about deleting this dead code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Collaborator

@emwalker emwalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with suggestions (feel free to ignore)

@jaytaph
Copy link
Member Author

jaytaph commented Nov 4, 2023

Will try and incorporate your suggestions. But that will be tormorrow at best

@emwalker
Copy link
Collaborator

emwalker commented Nov 4, 2023

Will try and incorporate your suggestions. But that will be tormorrow at best

For this PR, I'd only consider the smallest, easiest suggestions. Some of the suggestions were bigger, and some where more thoughts than suggestions, which came to mind but were not something I'd necessarily recommend.

@jaytaph jaytaph requested a review from emwalker November 5, 2023 09:59
@jaytaph jaytaph merged commit d62ece2 into main Nov 5, 2023
4 checks passed
@jaytaph jaytaph deleted the template-fix branch November 5, 2023 21:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants