[lazylex/html] Revert change related to <a empty= value>

The 'value' is a value, not another attribute! I tested it in the browser. So whitespace is allowed around =, regardless of whether the value is quoted or not.
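
A minimal sketch of the behavior described above, for illustration only. It borrows the shape of the attribute regex touched in this commit, but `_NAME`, `_UNQUOTED`, `ATTR_RE`, and `all_attrs()` are hypothetical stand-ins, not the actual definitions in lazylex/html.py (the real name and unquoted-value patterns are not shown in this diff).

```python
import re

# Hypothetical, simplified sub-patterns (assumptions, not the real lazylex ones).
_NAME = r'[a-zA-Z][a-zA-Z0-9:_\-]*'
_UNQUOTED = r'[^\s"\'=<>`\x00]+'

ATTR_RE = re.compile(r'''
\s+                       # Leading whitespace is required
(%s)                      # Attribute name
(?:                       # Optional attribute value
  \s* = \s*               # Spaces allowed around =
  (?:
    " ([^>"\x00]*) "      # double quoted value
  | ' ([^>'\x00]*) '      # single quoted value
  | (%s)                  # unquoted value (assumed pattern)
  )
)?
''' % (_NAME, _UNQUOTED), re.VERBOSE)


def all_attrs(s):
    """Return (name, value) pairs from the attribute part of a start tag.

    's' is the text after the tag name, e.g. ' a="b" c'.
    A name without a value yields the empty string.
    """
    pairs = []
    pos = 0
    while True:
        m = ATTR_RE.match(s, pos)
        if not m:
            break
        name, dq, sq, unq = m.groups()
        # At most one of the three value groups is set
        value = next((v for v in (dq, sq, unq) if v is not None), '')
        pairs.append((name, value))
        pos = m.end()
    return pairs


print(all_attrs(''' double="" single='' empty= value missing'''))
print(all_attrs(' empty = "missing"'))
```

With this sketch, `empty= value` is read as one attribute whose value is `value`, and `empty = "missing"` is also a single attribute, because `\s*` is accepted on both sides of `=` for quoted and unquoted values alike.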
oils-for-unix · Jan 12, 2025 · 2498981
1 parent d5fccf5
Showing 5 changed files with 59 additions and 46 deletions.
diff --git a/data_lang/htm8-test.sh b/data_lang/htm8-test.sh
@@ -4,9 +4,8 @@
 #   data_lang/htm8-test.sh
 #
 # TODO:
-# - Rename to data_lang/htm8.py
-#   - it has NO_SPECIAL_TAGS mode for XML
-#   - put iterators at a higher level in doctools/ ?
+# - Move code into data_lang/htm8.py
+#   - iterators stay in lazylex/html.py?
 #
 # - statically type it
 #   - revive pyannotate
@@ -15,8 +14,16 @@
 #   - for find(), do we need a C++ primitive for it?
 #   - no allocation for TagName()
 #   - ASDL file for Tok.Foo?
-# - refactor TagName() API - remove it from the TagLexer?
-#   - that is really the AttrLexer()
+# - remove TagName() from TagLexer(), it is on the Htm8Lexer
+#
+# re2c considerations:
+# - We need to use CAPTURES, so we can't use frontend/match directly
+# - Could we STREAM the lexer?
+#   - Instead of sentinel model, use something else!
+#   - default is sentinel with padding, and there is YYFILL with padding
+#   - there is also the separate --storable-state option
+#   - because this can be used for queries that don't allocate
+# - I may also want to do this with JSON
 #
 # Not working yet:
 # - understanding all entities &zz;
@@ -35,29 +42,6 @@
 # - Are there special rules for <svg> and <math>?
 # - Do we need to know about <textarea> <pre>?  Those don't have the same
 #   whitespace rules
-#
-# YSH API
-# - Generating HTML/HTM8 is much more common than parsing it
-#   - although maybe we can do RemoveComments as a demo?
-#   - that is the lowest level "sed" model
-# - For parsing, a minimum idea is:
-#   - lexer-based algorithms for query by tag, class name, and id
-#   - and then toTree() - this is a DOM
-#     - .tag and .attrs?
-#     - .innerHTML() and .outerHTML() perhaps
-#    - rewrite ul-table in that?
-#      - does that mean you mutate it, or construct text?
-#      - I think you can set the innerHTML probably
-#
-# - Testing of html.ysh aka htm8.ysh in the stdlib
-#
-# Cases:
-#   html 'hello <b>world</b>'
-#   html "hello <b>$name</b>"html
-#   html ["hello <b>$name</b>"]  # hm this isn't bad, it's an unevaluated expression?
-#   commonmark 'hello **world**'
-#   md 'hello **world**'
-#   md ['hello **$escape**'] ?   We don't have a good escaping algorithm
 
 
 REPO_ROOT=$(cd "$(dirname $0)/.."; pwd)

diff --git a/doc/htm8.md b/doc/htm8.md
@@ -108,6 +108,21 @@ Just emit it!  This always works, by design.
 
 - Set `NO_SPECIAL_TAGS`
 
+
+Conflicts between HTML5 and XML:
+
+- In XML, `<source>` is like any tag, and must be closed,
+- In HTML, `<source>` is a VOID tag, and must NOT be closed
+
+- In XML, `<script>` and `<style>` don't have special treatment
+- In HTML, they do
+
+- The header is different - `<!DOCTYPE html>` vs.  `<?xml version= ... ?>`
+
+- HTML: `<a empty= missing>` is two attributes
+- right now we don't handle `<a empty = "missing">` as a single attribute
+  - that is valid XML, so should we handle it?
+
 ### Converting to XML?
 
 - Always quote all attributes

diff --git a/doc/ysh-doc-processing.md b/doc/ysh-doc-processing.md
@@ -134,6 +134,34 @@ Safe HTML subset
 
 If you want to take user HTML, then you first use an HTML5 -> HT8 converter.
 
+## More Notes
+
+YSH API
+
+- Generating HTML/HTM8 is much more common than parsing it
+  - although maybe we can do RemoveComments as a demo?
+  - that is the lowest level "sed" model
+
+- For parsing, a minimum idea is:
+  - lexer-based algorithms for query by tag, class name, and id
+  - and then toTree() - this is a DOM
+    - .tag and .attrs?
+    - .innerHTML() and .outerHTML() perhaps
+   - rewrite ul-table in that?
+     - does that mean you mutate it, or construct text?
+     - I think you can set the innerHTML probably
+
+- Testing of html.ysh aka htm8.ysh in the stdlib
+
+Cases:
+
+    html 'hello <b>world</b>'
+    html "hello <b>$name</b>"html
+    html ["hello <b>$name</b>"]  # hm this isn't bad, it's an unevaluated expression?
+    commonmark 'hello **world**'
+    md 'hello **world**'
+    md ['hello **$escape**'] ?   We don't have a good escaping algorithm
+
 ## Related
 
 - [table-object-doc.html](table-object-doc.html)

diff --git a/lazylex/html.py b/lazylex/html.py
@@ -3,18 +3,6 @@
 lazylex/html.py - Low-Level HTML Processing.
 
 See lazylex/README.md for details.
-
-Conflicts between HTML5 and XML:
-
-- In XML, <source> is like any tag, and must be closed,
-- In HTML, <source> is a VOID tag, and must NOT be closedlike any tag, and must be closed,
-
-- In XML, <script> and <style> don't have special treatment
-- In HTML, they do
-
-- The header is different - <!DOCTYPE html> vs.  <?xml version= ... ?>
-
-So do have a mode for <script> <style> and void tags?  Upgrade HX8 into HTM8?
 """
 from __future__ import print_function
 
@@ -470,7 +458,7 @@ def ValidTokenList(s, no_special_tags=False):
 \s+                     # Leading whitespace is required
 (%s)                    # Attribute name
 (?:                     # Optional attribute value
-  =
+  \s* = \s*             # Spaces allowed around =
   (?:
     " ([^>"\x00]*) "    # double quoted value
   | ' ([^>'\x00]*) '    # single quoted value

diff --git a/lazylex/html_test.py b/lazylex/html_test.py
@@ -34,7 +34,7 @@ def testDotAll(self):
 
     def testAttrRe(self):
         _ATTR_RE = html._ATTR_RE
-        m = _ATTR_RE.match(' empty= missing')
+        m = _ATTR_RE.match(' empty= val')
         print(m.groups())
 
 
@@ -118,15 +118,13 @@ def testEmptyMissingValues(self):
         slices = lex.AllAttrsRawSlice()
         log('slices %s', slices)
 
-        lex = _MakeTagLexer(
-            '''<p double="" single='' empty= missing missing2>''')
+        lex = _MakeTagLexer('''<p double="" single='' empty= value missing>''')
         all_attrs = lex.AllAttrsRaw()
         self.assertEqual([
             ('double', ''),
             ('single', ''),
-            ('empty', ''),
+            ('empty', 'value'),
             ('missing', ''),
-            ('missing2', ''),
         ], all_attrs)
         # TODO: should have
         log('all %s', all_attrs)