Skip to content

Commit

Permalink
[lazylex/html] Partially handle tags like </SCRIPT> and </STYLE>
Browse files Browse the repository at this point in the history
We don't have the full HTML5 rule of case-insensitive matching.  But we
will at least handle <scrIPT>foo</scrIPT> and <STyle>foo</STyle>, if
that's what you wrote.

It's no longer limited to lower case.

---

The more comprehensive solution is to introduce another lexer mode,
which only has an end tag regex only.  Then you keep finding end tags
until you hit the one that matches, insensitive to case.

We can do that later perhaps.

---

Also make some notes about conversion to XML.  I believe we can do a
sed-like transformation and it will work.  We just have to catch < & >
in a few places.
  • Loading branch information
Andy C committed Jan 13, 2025
1 parent fb42513 commit f2bf58f
Show file tree
Hide file tree
Showing 3 changed files with 71 additions and 25 deletions.
43 changes: 36 additions & 7 deletions doc/htm8.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,15 +3,32 @@ in_progress: yes
default_highlighter: oils-sh
---

HTM8 - Efficient HTML with Errors
HTM8 - An Easy Subset of HTML5, With Errors
=================================

- Syntax Errors: It's a Subset
- Efficient
- Easy
- Easy to Remember
- Easy to Implement
- Runs Efficiently - you don't have to materialize a big DOM tree, which
causes many allocations
- Convertable to XML?
- without allocations, with a `sed`-like transformation!
- low level lexing and matching

<!--
TODO: 99.9% of HTML documents from CommonCrawl should be convertible to XML,
and then validated by an XML parser
- lxml - this is supposed to be high quality
- Python stdlib uses expat - https://libexpat.github.io/
- Gah it's this huge thing, 8K lines: https://github.com/libexpat/libexpat/blob/master/expat/lib/xmlparse.c
- do they have the billion laughs bug?
-->

<div id="toc">
</div>
Expand Down Expand Up @@ -125,11 +142,18 @@ Conflicts between HTML5 and XML:

### Converting to XML?

- Always quote all attributes
- Always quote `>` - are we alloxing this in HX8?
- Do something with `<script>` and `<style>`
- I guess turn them into normal tags, with escaping?
- Or maybe just disallow them?
- Add quotes to unquoted attributes
- single and double quotes stay the same?
- Quote special chars
- & BadAmpersand -> `&amp;`
- < BadLessThan -> `&lt;`
- > BadGreaterTnan -> `&gt;`
- `<script>` and `<style>`
- either add `<![CDATA[`
- or simply escape their values with `&amp; &lt;`
- what to do about case-insensitive tags?
- maybe you can just normalize them
- because we do strict matching
- Maybe validate any other declarations, like `<!DOCTYPE foo>`
- Add XML header `<?xml version=>`, remove `<!DOCTYPE html>`

Expand Down Expand Up @@ -157,6 +181,11 @@ This makes lexing the top-level structure easier.

### What Doesn't This Cover?

- HTM8 tags must be balanced to convert them to XML

- NUL bytes aren't allowed - currently due to re2c sentinel
- Although I think we could have the preprocessing pass to convert it to the
Unicode replacement char? I think that HTML might mandate that
- Encodings other than UTF-8. HTM8 is always UTF-8.
- Unicode Tag names and attribute names.
- This is allowed in HTML5 and XML.
Expand Down
21 changes: 15 additions & 6 deletions lazylex/html.py
Original file line number Diff line number Diff line change
Expand Up @@ -283,6 +283,10 @@ def _Peek(self):

if self.search_state is not None and not self.no_special_tags:
# TODO: case-insensitive search for </SCRIPT> <SCRipt> ?
#
# Another strategy: enter a mode where we find ONLY the end tag
# regex, and any data that's not <, and then check the canonical
# tag name for 'script' or 'style'.
pos = self.s.find(self.search_state, self.pos)
if pos == -1:
# unterminated <script> or <style>
Expand Down Expand Up @@ -328,10 +332,11 @@ def _Peek(self):
return Tok.CData, pos + 3 # ]]>

if tok_id == Tok.StartTag:
if self.TagNameEquals('script'):
self.search_state = '</script>'
elif self.TagNameEquals('style'):
self.search_state = '</style>'
# TODO: reduce allocations
if (self.TagNameEquals('script') or
self.TagNameEquals('style')):
# <SCRipt a=b> -> </SCRipt>
self.search_state = '</' + self._LiteralTagName() + '>'

return tok_id, m.end()
else:
Expand All @@ -346,12 +351,16 @@ def TagNameEquals(self, expected):
# directly?
return expected == self.CanonicalTagName()

def CanonicalTagName(self):
def _LiteralTagName(self):
# type: () -> None
assert self.tag_pos_left != -1, self.tag_pos_left
assert self.tag_pos_right != -1, self.tag_pos_right

tag_name = self.s[self.tag_pos_left:self.tag_pos_right]
return self.s[self.tag_pos_left:self.tag_pos_right]

def CanonicalTagName(self):
# type: () -> None
tag_name = self._LiteralTagName()
# Most tags are already lower case, so avoid allocation with this conditional
# TODO: this could go in the mycpp runtime?
if tag_name.islower():
Expand Down
32 changes: 20 additions & 12 deletions lazylex/html_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -229,16 +229,19 @@ def testScriptStyle(self):
'''
tokens = Lex(h)

self.assertEqual(
[
(Tok.RawData, 12),
(Tok.StartTag, 27), # <script>
(Tok.HtmlCData, 78), # JavaScript code is HTML CData
(Tok.EndTag, 87), # </script>
(Tok.RawData, 96), # \n
(Tok.EndOfStream, 96), # \n
],
tokens)
expected = [
(Tok.RawData, 12),
(Tok.StartTag, 27), # <script>
(Tok.HtmlCData, 78), # JavaScript code is HTML CData
(Tok.EndTag, 87), # </script>
(Tok.RawData, 96), # \n
(Tok.EndOfStream, 96), # \n
]
self.assertEqual(expected, tokens)

# Test case matching
tokens = Lex(h.replace('script', 'scrIPT'))
self.assertEqual(expected, tokens)

def testScriptStyleXml(self):
Tok = html.Tok
Expand Down Expand Up @@ -372,6 +375,8 @@ def testValid(self):

# not allowed, but 3 > 4 is allowed
'<a> 3 < 4 </a>',
# Not a CDATA tag
'<STYLEz><</STYLEz>',
]

INVALID_PARSE = [
Expand Down Expand Up @@ -413,9 +418,12 @@ def testValid(self):

# capital VOID tag
'<META><a></a>',

'<script><</script>',
'<SCRipt><</script>',
# matching
'<SCRipt><</SCRipt>',
'<SCRIPT><</SCRIPT>',
'<STYLE><</STYLE>',
#'<SCRipt><</script>',

# Note: Python HTMLParser.py does DYNAMIC compilation of regex with re.I
# flag to handle this! Gah I want something faster.
Expand Down

0 comments on commit f2bf58f

Please sign in to comment.