-
-
Notifications
You must be signed in to change notification settings - Fork 163
Parsing Models Cheatsheet
(Back to Shell Autocompletion)
From Zulip:
NOTE: Speed isn't the most important thing when considering a model; I think the more important issues are expressiveness (what can it parse), readability, and debuggability.
- Basic regexes (e.g. BRE grep/sed or ERE with awk, egrep): linear time matching, constant space
- Perl style regexes (now in Python, Ruby, JS, etc.): exponential in the worst case. (Russ Cox's articles are all about this.)
- Arbitrary CFG:
O(n^3)
. There are like 10 different algorithms to recognize CFGs, with complex sets of advantages and disadvantages.-
LR
family-
LALR(1)
grammar (yacc):O(n)
, accepting all the limitations with shift-reduce conflicts and such
-
-
LL
family- Python's pgen:
LL(1)
which can be matched inO(n)
time. - ANTLR: started out as
LL(k)
which I believe is alsoO(n)
, but ANTLR v4 introduced a more powerful modelALL(*)
(“all-star”).
- Python's pgen:
-
- PEG: exponential backtracking or linear time memoized packrat parsing.
- Turing complete code: you can write arbitrarily slow code, but people generally don't, because it's obvious when you have an
O(n^2)
loop or are doing exponential backtracking.
-
sed
: uses arbitrary code. -
awk
: LALR(1) parser with yacc [1]. - Python: an LL(1) parser written with a bespoke grammar DSL
pgen
[2].
One of the major motivations for OSH it to test this theory that Chet Ramey wrote about [3], after 20+ years maintaining bash:
One thing I've considered multiple times, but never done, is rewriting the bash parser using straight recursive-descent rather than using bison. [...] Were I starting bash from scratch, I probably would have written a parser by hand. It certainly would have made some things easier.
Andy: In my opinion, this experiment was a big success. The OSH parser is more maintainable and less buggy than bash's parser (although it's admittedly slower, being in Python). bash is 20+ years old and they are still fixing corner cases involving matching { }
and ( )
.
It's because their code s a messy mix of yacc and C code, and it would be better off as well-structured code (in C, or a higher level language). The interface between the two models messy and ill-defined (and filled with global variables).
Looking at parse.y
in bash, there's much more C code there than there is yacc grammar. The grammar solves maybe 25% of the problems you have. And subst.c
has a ton of ad hoc parsing code outside the grammar.
$ wc -l parse.y
>6513 parse.y
[1] I forked Kernighan's original Awk and found a couple minor bugs in it. https://github.com/andychu/bwk
[2] The "update" here is due to a private e-mail discussion I had with Guido on pgen's design. http://python-history.blogspot.com/2018/05/the-origins-of-pgen.html