-
Notifications
You must be signed in to change notification settings - Fork 17
/
Copy pathasm.lhs
250 lines (194 loc) · 8.28 KB
/
asm.lhs
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
= ION Assembly =
By "computer program", we often mean a string intended for human
comprehension. In contrast, language specifications usually involve more
abstract objects such as syntax trees, where we can ignore details such as
which characters are treated as whitespace.
A parser maps a string to a syntax tree, or an error if the input string is
not a valid program.
------------------------------------------------------------------------
parse :: [Char] -> Either Error Language
------------------------------------------------------------------------
Parsing is often viewed as part of the compiler's job:
------------------------------------------------------------------------
parseCompile :: [Char] -> Either Error TargetLanguage
parseCompile s = case parse s of
Left err -> Left err
Right prog -> Right (compile prog)
------------------------------------------------------------------------
Our `TargetLanguage` is a sort of assembly language for the ION machine, where
we use a string representation of a list of `CL` terms. (Why a list instead
of a single `CL`? See below.)
------------------------------------------------------------------------
type TargetLanguage = [Char]
assemble :: TargetLanguage -> Either Error [CL]
------------------------------------------------------------------------
We also need some way to load an ION machine with a `[CL]`:
------------------------------------------------------------------------
load :: [CL] -> VM
------------------------------------------------------------------------
We build `assemble` and `load` below:
++++++++++
<script>
function hideshow(s) {
var x = document.getElementById(s);
if (x.style.display === "none") {
x.style.display = "block";
} else {
x.style.display = "none";
}
}
</script>
<p><a onclick='hideshow("ugh");'>▶ Toggle extensions and imports</a></p>
<div id='ugh' style='display:none'>
++++++++++
\begin{code}
{-# LANGUAGE NamedFieldPuns #-}
import Control.Monad
import Data.Char (ord)
import System.Environment
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified ION
import ION (VM)
\end{code}
++++++++++
</div>
++++++++++
== Grammar ==
We define a language that is easy to generate and parse, and that humans can
tolerate in small doses:
------------------------------------------------------------------------
digit = '0' | ... | '9'
num = digit [num]
con = '#' CHAR | '(' num ')'
idx = '@' CHAR | '[' num ']'
comb = 'S' | 'K' | 'I' | ...
term = comb | '`' term term | con | idx
prog = term ';' [prog]
------------------------------------------------------------------------
A program is a sequence of terms terminated by semicolons.
The entry point is the last term of the sequence.
The motivation for a list instead of a single term is that rather than:
------------------------------------------------------------------------
norm = (\sq x y -> sq x + sq y) (\x -> x * x)
------------------------------------------------------------------------
we'd like to write the sequence of definitions:
------------------------------------------------------------------------
square x = x * x
norm x y = square x + square y
------------------------------------------------------------------------
The semicolon terminators in our grammar are unnecessary, but aid debugging.
The backquote is a prefix binary operator denoting application.
For example, we represent the program `BS(BB)` with:
------------------------------------------------------------------------
``BS`BB;
------------------------------------------------------------------------
A term is represented by the last element of a sequence of expressions
terminated by semicolons. The sequence is 0-indexed, and we may refer to an
earlier term by enclosing its index in square brackets.
Thus another way to represent `BS(BB)` is:
------------------------------------------------------------------------
B;S;``[0][1]`[0][0];
------------------------------------------------------------------------
An term may refer to an earlier term using the `(@)` prefix unary
operator. Let `n` be the ASCII code of the character following `(@)`. Then
it refers to the term of index `n - 32`. This offset is chosen so that short
programs can be written in printable ASCII without worrying about decimal
conversion.
For example, instead of "[0]" we may write "@ "
and instead of "[10]" we may write "@*".
If an earlier term is needed multiple times, then we share rather than
duplicate. Accordingly, we define combinatory logic terms to be combinators,
applications, and indexes of previously defined terms:
\begin{code}
data CL = Com Int | CL :@ CL | Nat Int | Idx Int
\end{code}
An 32-bit word constant is represented by a number in parentheses. For example,
42 is represented by `(42)`.
An alternative representation is to use the unary prefix operator `(#)`, in
which case the constant is the ASCII code of the following character. Again,
this is so we can generate code without worrying about converting to decimal.
For example, instead of `(42)` we may write `#*`.
== Parser ==
We use Megaparsec to build a recursive descent parser for ION assembly.
\begin{code}
digit = digitChar
num = read <$> some digit
nat = ord <$> (char '#' *> anySingle)
<|> char '(' *> num <* char ')'
idx = (+(-32)) . ord <$> (char '@' *> anySingle)
<|> char '[' *> num <* char ']'
comb = Com . ord <$> oneOf "SKIBCTRY:L=+-/*%?"
term = comb
<|> ((:@) <$> (char '`' *> term) <*> term)
<|> Nat <$> nat
<|> Idx <$> idx
prog :: Parsec () [Char] [CL]
prog = some (term <* char ';')
\end{code}
Thus our `assemble` function is simply:
\begin{code}
assemble = parse prog ""
\end{code}
We load terms onto a VM as follows.
\begin{code}
fromExpr :: [(Int, Int)] -> CL -> VM -> (Int, VM)
fromExpr defs t vm = case t of
Com n -> (n, vm)
x :@ y -> let
(a, vm1) = fromExpr defs x vm
(b, vm2) = fromExpr defs y vm1
in ION.app a b vm2
Nat n -> ION.app 35 n vm
Idx n | Just addr <- lookup n defs -> (addr, vm)
load :: [CL] -> VM
load ts = ION.push root vm1 where
((_, root):_, vm1) = foldl addDef ([], ION.new) ts
addDef (defs, m) t = let (addr, m') = fromExpr defs t m
in ((length defs, addr):defs, m')
\end{code}
A helper turns a given pure CL program into one that runs on standard input and
output by mapping a term `x` to `x(0?)(.)(T1)`:
\begin{code}
wrapIO :: CL -> CL
wrapIO x = x :@ (Com 48 :@ Com 63) :@ Com 46 :@ (Com 84 :@ Com 49)
\end{code}
Lastly, a `main` function which runs the program in the file whose name is
supplied as the first argument:
\begin{code}
main :: IO ()
main = getArgs >>= \as -> case as of
[] -> putStrLn "Must provide program."
[f] -> do
s <- readFile f
case assemble s of
Left err -> putStrLn $ "Parse error: " <> show err
Right cls -> do
void $ ION.evalIO $ load $ cls ++ [wrapIO $ Idx $ length cls - 1]
_ -> putStrLn "One program only."
\end{code}
For example:
------------------------------------------------------------------------
$ echo -n 'I;' > id
$ echo "Hello" | ion id
Hello
$ echo -n '``C`T?`KI;' > tail
$ echo "tail" | ion tail
ail
------------------------------------------------------------------------
== Parsing Pitfalls ==
Sadly, we largely ignore the topic of parsing because we wish to focus on
lambda calculus, combinatory logic, and types. We get by with parser
combinators, which we hardly explain.
This is a pity, because my textbooks omitted several fascinating facts, partly
due to their vintage:
* Many recommend Thompson's construction algorithm for building DFAs from
regular expressions. In fact, link:../haskell/re.html[regular expression
derivatives] are superior. (In practice, NFAs may be better to
because conversion to DFAs may blow up exponentially.)
See also https://dl.acm.org/doi/10.1145/2500365.2500613[_Fun with Semirings_]
for lesser-known connections between regular languages and other parts of
computer science.
* link:../haskell/parse.html[Parsing combinators] take the tedium out of recursive descent parsers, at least in languages like Haskell.
* link:../haskell/pwd.html[Parsing with derivatives] makes it easy to write an efficient context-free parser generator.
* https://en.wikipedia.org/wiki/Parsing_expression_grammar[Parsing expression grammars] can be parsed in linear time and can even handle some context-sensitive grammars, though it is unknown whether they can handle all context-free grammars.