Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LibWeb: Fix two character reference tokenization bugs #3163

Merged
merged 3 commits into from
Jan 6, 2025

Conversation

squeek502
Copy link
Contributor

See commits for details. Split off from #3011.

Note: I left the Swift HTMLTokenizer implementation alone since I was unable to get the build working with -DENABLE_SWIFT. From what I can tell, the DONT_CONSUME_NEXT_INPUT_CHARACTER-related bug likely needs to be fixed in the Swift implementation, but it seems that the hex digit bug was already fixed in the Swift implementation (presumably during the initial port).

This already passes, so there's no reason to skip it anymore
Previously, if the NumericCharacterReferenceEnd state was reached when
current_input_character was None, then the
DONT_CONSUME_NEXT_INPUT_CHARACTER macro would restore back before the
EOF, and allow the next state (after the SWITCH_TO_RETURN_STATE) to
proceed with the last digit of the numeric character reference.

For example, with something like `&LadybirdBrowser#1111`, before this commit the
output would incorrectly be `<code point with the value 1111>1` instead
of just `<code point with the value 1111>`.

Instead of putting the `if (current_input_character.has_value())` check
inside NumericCharacterReferenceEnd directly, it was instead added to
DONT_CONSUME_NEXT_INPUT_CHARACTER, because all usages of the macro
benefit from this check, even if the other existing usage sites don't
exhibit any bugs without it:

- In MarkupDeclarationOpen, if the current_input_character is EOF, then
  the previous character is always `!`, so restoring and then checking
  forward for strings like `--`, `DOCTYPE`, etc won't match and the
  BogusComment state will run one extra time (once for `!` and once
  for EOF) with no practical consequences. With the `has_value()` check,
  BogusComment will only run once with EOF.

- In AfterDOCTYPEName, ConsumeNextResult::RanOutOfCharacters can only
  occur when stopping at the insertion point, and because of how
  the code is structured, it is guaranteed that current_input_character
  is either `P` or `S`, so the `has_value()` check is irrelevant.
Instead of just A-F/a-f, any char A-Z/a-z was being accepted as a valid
hexadecimal digit.
@gmta gmta enabled auto-merge (rebase) January 6, 2025 23:24
@gmta gmta merged commit 1ba15e1 into LadybirdBrowser:master Jan 6, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants