[lazylex/html] Revert change related to <a empty= value>
The 'value' is a value, not another attribute!  I tested it in the
browser.

So whitespace is allowed around =, regardless of whether the value is
quoted or not.
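
The behavior described above can be sketched with a simplified attribute regex. This is not the real `_ATTR_RE` from `lazylex/html.py`: the `_NAME` and `_UNQUOTED` patterns and the `all_attrs` helper are assumptions made for illustration. Only the `\s* = \s*` part and the two quoted-value alternatives follow the diff in this commit.

```python
import re

# Hypothetical patterns for illustration; the real ones in lazylex/html.py
# may differ.
_NAME = r'[a-zA-Z][a-zA-Z0-9_\-]*'      # assumed attribute-name pattern
_UNQUOTED = r'''[^\s"'=<>`\x00]+'''     # assumed unquoted-value pattern

ATTR_RE = re.compile(r'''
\s+                       # Leading whitespace is required
(%s)                      # Attribute name
(?:                       # Optional attribute value
  \s* = \s*               # Spaces allowed around =
  (?:
    " ([^>"\x00]*) "      # double quoted value
  | ' ([^>'\x00]*) '      # single quoted value
  | (%s)                  # unquoted value (assumed pattern)
  )
)?
''' % (_NAME, _UNQUOTED), re.VERBOSE)


def all_attrs(tag):
    """Return (name, value) pairs; an attribute with no value gets ''."""
    pairs = []
    for m in ATTR_RE.finditer(tag):
        name, double_q, single_q, unquoted = m.groups()
        # At most one of the three value groups is set
        value = next(
            (v for v in (double_q, single_q, unquoted) if v is not None), '')
        pairs.append((name, value))
    return pairs


print(all_attrs('''<p double="" single='' empty= value missing>'''))
```

With this sketch, `empty= value` yields a single `('empty', 'value')` pair, and `missing` is a separate attribute with an empty value, which is what the updated test in `lazylex/html_test.py` expects.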
Andy C committed Jan 12, 2025
1 parent d5fccf5 commit 2498981
Showing 5 changed files with 59 additions and 46 deletions.
40 changes: 12 additions & 28 deletions data_lang/htm8-test.sh
@@ -4,9 +4,8 @@
# data_lang/htm8-test.sh
#
# TODO:
# - Rename to data_lang/htm8.py
# - it has NO_SPECIAL_TAGS mode for XML
# - put iterators at a higher level in doctools/ ?
# - Move code into data_lang/htm8.py
# - iterators stay in lazylex/html.py?
#
# - statically type it
# - revive pyannotate
@@ -15,8 +14,16 @@
# - for find(), do we need a C++ primitive for it?
# - no allocation for TagName()
# - ASDL file for Tok.Foo?
# - refactor TagName() API - remove it from the TagLexer?
# - that is really the AttrLexer()
# - remove TagName() from TagLexer(), it is on the Htm8Lexer
#
# re2c considerations:
# - We need to use CAPTURES, so we can't use frontend/match directly
# - Could we STREAM the lexer?
# - Instead of sentinel model, use something else!
# - default is sentinel with padding, and there is YYFILL with padding
# - there is also the separate --storable-state option
# - because this can be used for queries that don't allocate
# - I may also want to do this with JSON
#
# Not working yet:
# - understanding all entities &zz;
@@ -35,29 +42,6 @@
# - Are there special rules for <svg> and <math>?
# - Do we need to know about <textarea> <pre>? Those don't have the same
# whitespace rules
#
# YSH API
# - Generating HTML/HTM8 is much more common than parsing it
# - although maybe we can do RemoveComments as a demo?
# - that is the lowest level "sed" model
# - For parsing, a minimum idea is:
# - lexer-based algorithms for query by tag, class name, and id
# - and then toTree() - this is a DOM
# - .tag and .attrs?
# - .innerHTML() and .outerHTML() perhaps
# - rewrite ul-table in that?
# - does that mean you mutate it, or construct text?
# - I think you can set the innerHTML probably
#
# - Testing of html.ysh aka htm8.ysh in the stdlib
#
# Cases:
# html 'hello <b>world</b>'
# html "hello <b>$name</b>"html
# html ["hello <b>$name</b>"] # hm this isn't bad, it's an unevaluated expression?
# commonmark 'hello **world**'
# md 'hello **world**'
# md ['hello **$escape**'] ? We don't have a good escaping algorithm


REPO_ROOT=$(cd "$(dirname $0)/.."; pwd)
15 changes: 15 additions & 0 deletions doc/htm8.md
@@ -108,6 +108,21 @@ Just emit it! This always works, by design.

- Set `NO_SPECIAL_TAGS`


Conflicts between HTML5 and XML:

- In XML, `<source>` is like any tag, and must be closed
- In HTML, `<source>` is a VOID tag, and must NOT be closed

- In XML, `<script>` and `<style>` don't have special treatment
- In HTML, they do

- The header is different - `<!DOCTYPE html>` vs. `<?xml version= ... ?>`

- HTML: `<a empty= missing>` is two attributes
- right now we don't handle `<a empty = "missing">` as a single attribute
- that is valid XML, so should we handle it?

### Converting to XML?

- Always quote all attributes
28 changes: 28 additions & 0 deletions doc/ysh-doc-processing.md
@@ -134,6 +134,34 @@ Safe HTML subset

If you want to take user HTML, then you first use an HTML5 -> HT8 converter.

## More Notes

YSH API

- Generating HTML/HTM8 is much more common than parsing it
- although maybe we can do RemoveComments as a demo?
- that is the lowest level "sed" model

- For parsing, a minimum idea is:
- lexer-based algorithms for query by tag, class name, and id
- and then toTree() - this is a DOM
- .tag and .attrs?
- .innerHTML() and .outerHTML() perhaps
- rewrite ul-table in that?
- does that mean you mutate it, or construct text?
- I think you can set the innerHTML probably

- Testing of html.ysh aka htm8.ysh in the stdlib

Cases:

html 'hello <b>world</b>'
html "hello <b>$name</b>"html
html ["hello <b>$name</b>"] # hm this isn't bad, it's an unevaluated expression?
commonmark 'hello **world**'
md 'hello **world**'
md ['hello **$escape**'] ? We don't have a good escaping algorithm

## Related

- [table-object-doc.html](table-object-doc.html)
14 changes: 1 addition & 13 deletions lazylex/html.py
@@ -3,18 +3,6 @@
lazylex/html.py - Low-Level HTML Processing.
See lazylex/README.md for details.
Conflicts between HTML5 and XML:
- In XML, <source> is like any tag, and must be closed
- In HTML, <source> is a VOID tag, and must NOT be closed
- In XML, <script> and <style> don't have special treatment
- In HTML, they do
- The header is different - <!DOCTYPE html> vs. <?xml version= ... ?>
So do have a mode for <script> <style> and void tags? Upgrade HX8 into HTM8?
"""
from __future__ import print_function

@@ -470,7 +458,7 @@ def ValidTokenList(s, no_special_tags=False):
\s+ # Leading whitespace is required
(%s) # Attribute name
(?: # Optional attribute value
=
\s* = \s* # Spaces allowed around =
(?:
" ([^>"\x00]*) " # double quoted value
| ' ([^>'\x00]*) ' # single quoted value
8 changes: 3 additions & 5 deletions lazylex/html_test.py
@@ -34,7 +34,7 @@ def testDotAll(self):

def testAttrRe(self):
_ATTR_RE = html._ATTR_RE
m = _ATTR_RE.match(' empty= missing')
m = _ATTR_RE.match(' empty= val')
print(m.groups())


@@ -118,15 +118,13 @@ def testEmptyMissingValues(self):
slices = lex.AllAttrsRawSlice()
log('slices %s', slices)

lex = _MakeTagLexer(
'''<p double="" single='' empty= missing missing2>''')
lex = _MakeTagLexer('''<p double="" single='' empty= value missing>''')
all_attrs = lex.AllAttrsRaw()
self.assertEqual([
('double', ''),
('single', ''),
('empty', ''),
('empty', 'value'),
('missing', ''),
('missing2', ''),
], all_attrs)
# TODO: should have
log('all %s', all_attrs)