Implementing the Oil Expression Language
Turn Oil's expression grammar into an AST #387
Related: Tips on Using pgen2
Demo:

    bin/osh -n -c 'var x = 1 + 2 * 3;'

This already works. (Right now a semicolon or newline is accepted; we should also accept EOF.)
- https://github.com/oilshell/oil/tree/master/oil_lang
  - `grammar.pgen2` is literally Python 3's grammar!!!
  - `expr_parse.py` contains the public interface that the rest of the code uses. It turns a stream of tokens into an AST, which is two steps under the hood (tokens -> parse tree, then parse tree -> AST).
    - Important: it also handles lexer modes! There's an important difference between OSH and Oil here. In the Oil expression language, lexer modes are decoupled from the parser: they can be determined just by looking at 1 token -- i.e. `"` starts a double quoted string, and `"` ends one. In OSH, the lexer modes depend on the control flow of the recursive descent parser. (See the sketch after this list.)
  - `expr_to_ast.py` -- the "transformer", i.e. the parse tree -> AST step
  - `frontend/syntax.asdl` is the unified OSH and Oil code representation
    - Scroll down to OIL LANGUAGE; everything we care about is under the `expr` type.
    - `command.OilAssign` is where Oil and OSH are integrated. That is, `ls -l` and `var x = [1,2,3]` are both commands in OSH; the latter is an Oil expression.
  - `frontend/lex.py` -- the huge unified OSH and Oil lexer. Lexer modes for Oil are toward the bottom.
    - `lex_mode_e.Expr` is the main one for Oil expressions, but we also have different ones for:
      - double quoted strings: `"string $interp"`
      - regex literals: `$/ d+ /`
      - array literals: `@[myprog --foo --bar=1]`
  - `osh/word_parse.py` has the integration point between OSH and Oil:
    - `enode, last_token = self.parse_ctx.ParseOilAssign(self.lexer, grammar_nt.oil_var)` -- that indicates that we're using the `oil_var` production in `grammar.pgen2`
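
To make the lexer-mode point above concrete, here is a minimal sketch of a lexer whose mode is decided by looking at a single token: a `"` switches into a double-quoted-string mode and the closing `"` switches back, with no feedback from the parser. This is not Oil's real code; `Token`, `lex`, and the mode names are invented for illustration.

```python
# Hypothetical sketch, not Oil's real lexer: modes are decided by one token,
# with no feedback from the parser (unlike OSH's lexer modes).
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str    # e.g. 'Number', 'Op', 'DQ_Part'
    value: str


def lex(s: str) -> List[Token]:
    tokens = []  # type: List[Token]
    mode = 'Expr'                        # start in the expression mode
    i = 0
    while i < len(s):
        c = s[i]
        if mode == 'Expr':
            if c == '"':                 # this single token switches the mode
                mode = 'DQ'
                i += 1
            elif c.isdigit():
                j = i
                while j < len(s) and s[j].isdigit():
                    j += 1
                tokens.append(Token('Number', s[i:j]))
                i = j
            elif c in '+-*/':
                tokens.append(Token('Op', c))
                i += 1
            else:
                i += 1                   # skip spaces, etc.
        else:                            # mode == 'DQ': inside "..."
            j = s.index('"', i)          # scan to the closing quote
            tokens.append(Token('DQ_Part', s[i:j]))
            mode = 'Expr'                # the closing " switches back
            i = j + 1
    return tokens


if __name__ == '__main__':
    print(lex('1 + "foo" * 3'))
    # [Token(kind='Number', value='1'), Token(kind='Op', value='+'),
    #  Token(kind='DQ_Part', value='foo'), Token(kind='Op', value='*'),
    #  Token(kind='Number', value='3')]
```

In the real code, `expr_parse.py` then drives the two-step pipeline over tokens like these: tokens -> parse tree (pgen2), then parse tree -> AST (`expr_to_ast.py`).
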
- https://github.com/oilshell/oil/blob/master/opy/compiler2/transformer.py is a version of this for Python (forked from the Python 2 standard library)
- drwilly is working on `find` in https://github.com/oilshell/oil/pull/386, which also has a "transformer"
- LHS and RHS of assignments
  - Python distinguishes LHS and RHS after parsing and before AST construction, i.e. in this "transformer", and we'll follow the same strategy. That is, certain `expr` nodes can appear on both the LHS and RHS, and others can only appear on the RHS. (A sketch of this check is below.)
  - Correction: no, we want to restrict LHS expressions (and optional type expressions) in the grammar.
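
A hedged sketch of the first approach described above (distinguishing LHS from RHS in the "transformer", the way Python does it); per the correction, the real restriction may end up in the grammar instead. The node classes and the `check_lhs` helper are invented for illustration, not Oil's ASDL types.

```python
# Hypothetical sketch: reject invalid assignment targets in the transformer.
from typing import Union


class Name:
    def __init__(self, name: str) -> None:
        self.name = name


class Subscript:
    def __init__(self, obj: 'Expr', index: 'Expr') -> None:
        self.obj = obj
        self.index = index


class BinOp:
    def __init__(self, op: str, left: 'Expr', right: 'Expr') -> None:
        self.op = op
        self.left = left
        self.right = right


Expr = Union[Name, Subscript, BinOp]

# Only some expr nodes may appear on the left-hand side of an assignment.
LHS_OK = (Name, Subscript)


def check_lhs(node: Expr) -> None:
    """Raise if an expression isn't a valid assignment target."""
    if not isinstance(node, LHS_OK):
        raise SyntaxError('%s is not a valid assignment target'
                          % type(node).__name__)


if __name__ == '__main__':
    check_lhs(Name('x'))                           # OK:  x = ...
    check_lhs(Subscript(Name('a'), Name('i')))     # OK:  a[i] = ...
    try:
        check_lhs(BinOp('+', Name('x'), Name('y')))  # x + y = ...  -> rejected
    except SyntaxError as e:
        print('rejected:', e)
```
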
- All the operators
  - unary, binary
  - ternary operator: `a if cond else b`
  - including `in`, `not in`, `is`, `is not`
  - subscripting, slicing
  - Small changes (summarized in the sketch after this list):
    - `//` is `div`
    - `**` is `^` (following R and other mathematical languages)
    - `^` is `xor`
  - Lower priority, but we'll probably end up having:
    - starred expressions on LHS and RHS for "splatting" (might use the `@` operator instead?)
    - chained comparisons like `3 < x <= 5`
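
To summarize the operator spellings proposed above in one place, here is a small table written as a Python dict; this is purely illustrative, not a real Oil data structure.

```python
# Illustrative summary of the spellings above -- not a real Oil table.
# Keys are Oil spellings, values are the Python equivalents.
OIL_TO_PYTHON_OPS = {
    'div': '//',   # integer division is spelled out
    '^':   '**',   # exponentiation, following R and other math languages
    'xor': '^',    # bitwise xor gets a word, freeing ^ for exponentiation
}

# Chained comparisons (lower priority) would work as in Python:
#   3 < x <= 5   behaves like   3 < x and x <= 5
```
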
- function calls: `f(x, y=3)`. Includes method calls with the `.` operator, e.g. `mydict.clear()`
  - To start, all the functions will be builtins. User function definitions come later!
- Literals: the "JSON" subset
  - dict -- except keys are "bare words", like JS
  - list
  - tuples, although I want to disallow 1-tuples like `x,`
  - bool -- `true` and `false`, following C, Java, JS, etc.
    - not `True` and `False`, because types are generally capitalized: `Str`, `Dict`, `List`
  - integer
  - float
  - probably sets, although the syntax might be different to allow for dict punning, like `{key1, key2}` taking their values from the surrounding scope
  - string: single quoted strings are like Python strings, but double quoted strings allow interpolation. This involves lexer modes. (Already implemented to a large extent.)
  - later: homogeneous arrays
    - `@[ mycommand --flag1 --flag2 ]` -- uses the "command" lexer mode for "bare words", e.g. `@[1 2 3]`
- Comprehensions (lower priority)
  - list, dict, set
- Function literals (lower priority)
- To save space, the parse tree doesn't follow the derivation of the string from the grammar. If a node in the parse tree would have only 1 child (a singleton), then it's omitted. Python doesn't omit it (which seems wasteful.)
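
A minimal sketch of that space-saving idea (the `PNode` class and `add_child` helper are invented for illustration, not pgen2's real parse tree types): when a node would have exactly one child, the child is used directly instead of wrapping it.

```python
# Hypothetical sketch of collapsing "singleton" parse tree nodes.
from typing import List, Optional


class PNode:
    def __init__(self, typ: str, children: Optional[List['PNode']] = None) -> None:
        self.typ = typ
        self.children = children or []

    def __repr__(self) -> str:
        if not self.children:
            return self.typ
        return '%s(%s)' % (self.typ, ', '.join(repr(c) for c in self.children))


def add_child(parent: PNode, child: PNode) -> None:
    """Like a reduction step, but collapse chains of single children:
    instead of appending expr(term(factor(NUMBER))), append just NUMBER."""
    while len(child.children) == 1:
        child = child.children[0]
    parent.children.append(child)


if __name__ == '__main__':
    # Without collapsing, '2' derives a chain like expr -> term -> factor -> NUMBER.
    number = PNode('NUMBER')
    factor = PNode('factor', [number])
    term = PNode('term', [factor])

    root = PNode('arith_expr')
    add_child(root, term)       # the whole chain collapses to NUMBER
    print(root)                 # arith_expr(NUMBER)
```
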
- Our syntax tree is more like a "lossless syntax tree". The leaves are generally of type `token` -- I try not to preprocess these too much, to allow more options for downstream tools. Tokens have location information, which makes it easy to generate precise error messages.
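
A small sketch of why token leaves with location information are convenient (the `Token` class and `error_at` helper here are hypothetical): keeping the original span around makes it trivial to point at the offending spot.

```python
# Hypothetical sketch: tokens that remember where they came from make precise
# error messages easy.
from typing import NamedTuple


class Token(NamedTuple):
    kind: str
    value: str
    line: int     # 1-based line number
    col: int      # 0-based column of the first character


def error_at(src_line: str, tok: Token, msg: str) -> str:
    """Return a caret-style error message pointing at the token."""
    caret = ' ' * tok.col + '^' + '~' * (len(tok.value) - 1)
    return 'line %d: %s\n  %s\n  %s' % (tok.line, msg, src_line, caret)


if __name__ == '__main__':
    line = 'var x = 1 + * 3'
    bad = Token('Op', '*', 1, 12)
    print(error_at(line, bad, "unexpected operator '*'"))
    # line 1: unexpected operator '*'
    #   var x = 1 + * 3
    #               ^
```
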
Generally I test things very quickly with `osh -n -c` or an interactive shell, but we should somehow record those tests. The simplest thing to do is to write some Python unit tests that take strings and print out the AST. Maybe they don't even need to make assertions?

Update: I added a test driver, which you can run like this:

    test/unit.sh unit oil_lang/expr_parse_test.py

It takes lines of code and prints out an AST.
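
A sketch of the kind of test meant above: take lines of code, parse them, and print the tree, without necessarily asserting anything. To keep this snippet self-contained and runnable it uses Python's own `ast` module as a stand-in; the real test (`oil_lang/expr_parse_test.py`) calls Oil's expression parser instead.

```python
#!/usr/bin/env python3
"""Sketch of a 'print the AST' style test.  Python's own ast module stands in
for Oil's expression parser, just to keep the example runnable."""
import ast
import unittest

LINES = [
    '1 + 2 * 3',
    'x[i] + f(y)',
]


class ExprParseTest(unittest.TestCase):

    def testParsesAndPrints(self):
        for line in LINES:
            tree = ast.parse(line, mode='eval')   # stand-in for the Oil parser
            print(line)
            print('  %s' % ast.dump(tree))
            # No assertions needed; a parse error raises and fails the test.


if __name__ == '__main__':
    unittest.main()
```
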
If you want to print out the parse tree, turn on `print_parse_tree` in `frontend/parse_lib.py` (`ParseOilAssign`).
NOTE: The way I hacked everything together was with `pgen2/pgen2-test.sh all`. (You can run less of it with a particular function in that file, like `parse-exprs` or `oil-productions`.) This worked pretty nicely, but I won't be surprised if others don't like this style or get confused by it :-/
- Idea: Can we compare against Python somehow? That might come into play more in execution than in parsing.
The whole front end is statically typed with MyPy now. The `types/osh-parse.sh` script checks it in Travis.

I usually get the code working, and then add types. However, filling in types first is conceivable. ASDL types map to MyPy types in a straightforward way; see the rough sketch below.
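
A rough illustration of that mapping, simplified for this sketch: an ASDL sum type becomes a small class hierarchy with typed fields, which MyPy can then check. The ASDL fragment and class names here are invented; the real generated code in `_devbuild/gen/syntax_asdl.py` differs in details.

```python
# Rough illustration of how an ASDL definition maps to MyPy-typed Python.
#
#   ASDL (hypothetical fragment):
#     expr = Const(int i)
#          | BinOp(expr left, expr right)


class expr_t:
    """Base class for the expr sum type."""
    pass


class expr__Const(expr_t):
    def __init__(self, i: int) -> None:
        self.i = i


class expr__BinOp(expr_t):
    def __init__(self, left: expr_t, right: expr_t) -> None:
        self.left = left
        self.right = right


def depth(e: expr_t) -> int:
    """A MyPy-checked function over the sum type."""
    if isinstance(e, expr__BinOp):
        return 1 + max(depth(e.left), depth(e.right))
    return 1
```
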
See Contributing, but `build/dev.sh minimal` should be enough (on an Ubuntu/Debian machine).

Important: make sure to re-run this when changing `frontend/syntax.asdl`. The file `_devbuild/gen/syntax_asdl.py` needs to be regenerated.