LibWeb: Fix two character reference tokenization bugs #3163

squeek502 · 2025-01-06T23:03:33Z

See commits for details. Split off from #3011.

Note: I left the Swift HTMLTokenizer implementation alone since I was unable to get the build working with -DENABLE_SWIFT. From what I can tell, the DONT_CONSUME_NEXT_INPUT_CHARACTER-related bug likely needs to be fixed in the Swift implementation, but it seems that the hex digit bug was already fixed in the Swift implementation (presumably during the initial port).

The named-character-references test already passes, so there's no reason to skip it anymore.

Previously, if the NumericCharacterReferenceEnd state was reached when current_input_character was None, the DONT_CONSUME_NEXT_INPUT_CHARACTER macro would restore back to before the EOF and allow the next state (after the SWITCH_TO_RETURN_STATE) to proceed with the last digit of the numeric character reference. For example, with something like `&#1111`, before this commit the output would incorrectly be `<code point with the value 1111>1` instead of just `<code point with the value 1111>`.

Rather than putting the `if (current_input_character.has_value())` check inside NumericCharacterReferenceEnd directly, it was added to DONT_CONSUME_NEXT_INPUT_CHARACTER, because all usages of the macro benefit from this check, even if the other existing usage sites don't exhibit any bugs without it:

- In MarkupDeclarationOpen, if the current_input_character is EOF, then the previous character is always `!`, so restoring and then checking forward for strings like `--`, `DOCTYPE`, etc. won't match and the BogusComment state will run one extra time (once for `!` and once for EOF) with no practical consequences. With the `has_value()` check, BogusComment will only run once with EOF.
- In AfterDOCTYPEName, ConsumeNextResult::RanOutOfCharacters can only occur when stopping at the insertion point, and because of how the code is structured, it is guaranteed that current_input_character is either `P` or `S`, so the `has_value()` check is irrelevant.
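The EOF behavior described above can be sketched with a toy model. This is not the LibWeb tokenizer: `ToyStream`, `unconsume_unchecked`, `unconsume_checked`, and `leftover_after_reference` are hypothetical names made up for illustration, and the model assumes that hitting EOF does not advance the cursor, so the saved "previous" position still points at the last real character.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Toy stand-in for a tokenizer input stream (hypothetical, not LibWeb code).
struct ToyStream {
    std::string input;
    std::size_t pos = 0;
    std::size_t prev_pos = 0;

    // Returns the next character, or nullopt at EOF. At EOF nothing is
    // consumed, so prev_pos keeps pointing at the last real character.
    std::optional<char> next()
    {
        if (pos >= input.size())
            return std::nullopt;
        prev_pos = pos;
        return input[pos++];
    }

    // Old behavior in this model: always restore to the previous position.
    void unconsume_unchecked() { pos = prev_pos; }

    // New behavior in this model: only restore if a character was consumed.
    void unconsume_checked(std::optional<char> current)
    {
        if (current.has_value())
            pos = prev_pos;
    }
};

// Consumes a run of decimal digits, "un-consumes" the terminating character,
// and returns whatever the following state would still see as plain text.
std::string leftover_after_reference(std::string const& digits_and_rest, bool checked)
{
    ToyStream stream;
    stream.input = digits_and_rest;

    std::optional<char> current;
    while ((current = stream.next()).has_value() && *current >= '0' && *current <= '9') { }

    if (checked)
        stream.unconsume_checked(current);
    else
        stream.unconsume_unchecked();

    std::string leftover;
    while ((current = stream.next()).has_value())
        leftover.push_back(*current);
    return leftover;
}
```

In this model, `leftover_after_reference("1111", false)` returns `"1"` (the last digit is seen again), while `leftover_after_reference("1111", true)` returns an empty string. When the digits are followed by a real character, as in `"1111x"`, both variants return `"x"`, which matches the observation that the check only matters at EOF.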

Instead of just A-F/a-f, any char A-Z/a-z was being accepted as a valid hexadecimal digit.
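As a minimal sketch of the difference, assuming the faulty check amounted to an "is ASCII alphabetic" test (the actual LibWeb code is not shown here, and both function names below are hypothetical):

```cpp
#include <cstdint>

// Too permissive: treats every ASCII letter as a hexadecimal digit.
constexpr bool is_hex_digit_too_permissive(std::uint32_t code_point)
{
    return (code_point >= '0' && code_point <= '9')
        || (code_point >= 'A' && code_point <= 'Z')
        || (code_point >= 'a' && code_point <= 'z');
}

// Correct: only 0-9, A-F, and a-f are hexadecimal digits.
constexpr bool is_hex_digit(std::uint32_t code_point)
{
    return (code_point >= '0' && code_point <= '9')
        || (code_point >= 'A' && code_point <= 'F')
        || (code_point >= 'a' && code_point <= 'f');
}

// 'G' is a letter but not a hex digit, so only the permissive check accepts it.
static_assert(is_hex_digit_too_permissive('G'));
static_assert(!is_hex_digit('G'));
static_assert(is_hex_digit('f'));
```

With the permissive predicate, input such as `&#xG;` would be tokenized as if `G` were part of a hexadecimal character reference; with the corrected one, `G` ends the digit run.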

squeek502 added 3 commits December 22, 2024 12:33

LibWeb: Don't skip named-character-references test

10d4af8

This already passes, so there's no reason to skip it anymore

LibWeb: Fix hex character references accepting all alphabetic ASCII

7630a21

Instead of just A-F/a-f, any char A-Z/a-z was being accepted as a valid hexadecimal digit.

squeek502 force-pushed the character-reference-fixes branch from 4da871c to 7630a21 on January 6, 2025 23:12

squeek502 mentioned this pull request Jan 6, 2025

LibWeb: Make named character reference tokenization more spec-compliant & efficient #3011

Open

gmta approved these changes Jan 6, 2025


gmta enabled auto-merge (rebase) January 6, 2025 23:24

gmta merged commit 1ba15e1 into LadybirdBrowser:master Jan 6, 2025
8 checks passed
