-
Notifications
You must be signed in to change notification settings - Fork 19
/
Copy pathocaml-yacc.tex
347 lines (271 loc) · 10.1 KB
/
ocaml-yacc.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
\section{Ocamlyacc}
We mainly cover menhir here.
A grammar is mainly composed of four elements(terminals,
non-terminals, production rulls, start symbol)
\textbf{Syntax}
\begin{bluetext}
% {header
% }
%%
Grammar rules
%%
trailer
\end{bluetext}
A tiny example as follows (It has a subtle bug, readers should find it)
\begin{ocamlcode}
% {
open Printf
let parse_error s =
print_endline "error\n";
print_endline s ;
flush stdout
%}
%token <float> NUM
%token PLUS MINUS MULTIPLY DIVIDE CARET UMINUS
%token NEWLINE
%start input
%type <unit> input
%type <float> exp
%% /* rules and actions */
input: /* empty */ {}
| input line {}
;
line: NEWLINE {}
|exp NEWLINE {printf "\t%.10g\n" $1 ; flush stdout}
;
exp: NUM { $1 }
|exp exp PLUS {$1 +. $2 }
|exp exp MINUS {$1 -. $2 }
|exp exp MULTIPLY {$1 *. $2 }
|exp exp DIVIDE {$1 /. $2 }
|exp exp CARET {$1 ** $2 }
|exp UMINUS {-. $1 }
;
%%
\end{ocamlcode}
Notice that start non-terminal can be given \textit{several}, then you will
have a different .mli file, notice that it's different from ocamllex,
ocamlyacc will generate a .mli file, so here we get the output
interface as follows:
\begin{bluetext}
%type <type> nonterminal ... nonterminal
%start symbol ... symbol
\end{bluetext}
\begin{ocamlcode}
type token =
| NUM of (float)
| PLUS
| MINUS
| MULTIPLY
| DIVIDE
| CARET
| UMINUS
| NEWLINE
val input :
(Lexing.lexbuf -> token) -> Lexing.lexbuf -> unit
val exp :
(Lexing.lexbuf -> token) -> Lexing.lexbuf -> float
\end{ocamlcode}
Notice that we may use character strings as implicit terminals as in
\begin{ocamlcode}
expr : expr "+" expr {}
| expr "*" expr {}
| ... ;
\end{ocamlcode}
They are directly processed by the parser without passing through the
lexer. But it breaks the uniformity
\textbf{Contextual Grammar}
\inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml}{code/context.ml}
First gammar
\begin{ocamlcode}
/* empty corresponds Ctrl-d.*/
input : /*empty*/ {} | input line {};
\end{ocamlcode}
Notice here we \textbf{preferred left-recursive} in yacc.
The underlying theory for LALR prefers LR. because all the elements
must be shifted onto the stack \textit{before} the rule can be applied even once.
\begin{ocamlcode}
exp : NUM | exp exp PLUS | exp exp MINUS ... ;
\end{ocamlcode}
Here is our lexer
\begin{ocamlcode}
{
open Rpcalc
open Printf
let first = ref true
}
let digit = ['0'-'9']
rule token = parse
|[' ' '\t' ] {token lexbuf}
|'\n' {NEWLINE}
| (digit+ | "." digit+ | digit+ "." digit*) as num
{NUM (float_of_string num)}
|'+' {PLUS}
|'-' {MINUS}
|'*' {MULTIPLY}
|'/' {DIVIDE}
|'^' {CARET}
|'n' {UMINUS}
|_ as c {printf "unrecognized char %c" c ; token lexbuf}
|eof {
if !first then begin first := false; NEWLINE end
else raise End_of_file }
{
let main () =
let file = Sys.argv.(1) in
let chan = open_in file in
try
let lexbuf = Lexing.from_channel chan in
while true do
Rpcalc.input token lexbuf
done
with End_of_file -> close_in chan
let _ = Printexc.print main ()
}
\end{ocamlcode}
We write driver function in lexer for convenience, since lexer depends
on yacc. \textit{Printex.print}
\textbf{precedence associatitvity }
Operator precedence is determined by the line ordering of the
declarations;
\textit{\%prec} in the grammar section, the \textit{\%prec} simply
instructs ocamlyacc that the rule \textit{|Minus exp } has the same
precedence as NEG
\textit{\%left,\%right,\%nonassoc}
\begin{enumerate}
\item The associatitvity of an operator op determines how repeated
uses of the operator nest: whether \textit{x op y op z} is parsed
by grouping \textit{x} with \textit{y} or. nonassoc will consider
it as an error
\item All the tokens declared in a single precedence declaration
have equal precedence and nest together according to their
associatitvity
\end{enumerate}
\inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml} {code/calc/simple.mly}
Notice here the \textit{NEG} is a place a holder, it takes the
place, but it's not a token. since here we need \textit{MINUS} has
different levels.
%% The interface file is as follows
%% \inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml}{code/calc/simple.mli}
\textbf{Error Recovery}
By default, the parser function raises exception after calling \textit{parse\_error}
The ocamlyacc reserved word \textit{error}
\begin{ocamlcode}
line: NEWLINE |exp NEWLINE | error NEWLINE {}
\end{ocamlcode}
If an expression that cannot be evaluated is read, the error will be
recognized by the third rule for line, and parsing will continue
(parse\_error is still called). This form of error recovery deals
with syntax errors. There are also other kinds of errors.
\textbf{Location Tracking}
It's very easy. First, remember to use \textit{Lexing.new\_line} to
track your line number, then use \textit{rhs\_start\_pos, rhs\_end\_pos} to
track the symbolposition. 1 is for the leftmost component.
\begin{ocamlcode}
Parsing.(
let start_pos = rhs_start_pos 3 in
let end_pos = rhs_end_pos 3 in
printf "%d.%d --- %d.%d: dbz"
start_pos.pos_lnum (start_pos.pos_cnum -start_pos.pos_bol)
end_pos.pos_lnum (end_pos.pos_cnum - end_pos.pos_bol);
1.0
)
\end{ocamlcode}
For groupings, use the following function \textit{symbol\_start\_pos,
symbol\_end\_pos}
\textit{symbol\_start\_pos} is set to the beginning of the leftmost
component, and \textit{symbol\_end\_pos} to the end of the rightmost component.
\textbf{A complex Example}
\inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml}{code/calc/calc_compl.mly}
In my opinion, the best practice is first modify .mly file, then
change .mll file later
\textbf{SHIFT REDUCE }
A very nice tutorial
\href{http://www.cs.uiuc.edu/class/sp10/cs421/lectures/lecture%2010%20supp.pdf}{shift-reduce}
\inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml}{code/shift_reduce/s1.mly}
\inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml}{code/shift_reduce/s2.mly}
\inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml}{code/shift_reduce/s3.mly}
\inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml}{code/shift_reduce/s4.mly}
\inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml}{code/shift_reduce/s5.mly}
The prec trick is covered not correctly in this tutorial.
The symbols are declared to associate to the left, right,
nonassoc. The symbols are \textit{usually} tokens, they can
also be \textit{dummy} nonterminals, for use with the \%prec
directive in the rule.
\begin{enumerate}
\item Tokens and rules have precedences. The precedence of a
\textit{rule} is the precedence of its \textit{rightmost}
terminal. you can override this default by using the \textit{\%prec}
directive in the rule
\item A reduce/reduce conflict is resolved in favor of the first
ruel(in the order given by the source file)
\item A shift/reduce conflict is resolved by comparing the
\textit{predecence of the rule to be reduced} with the
\textit{precedence of the token to be shifted}. If the predecence of
the rule is higher, then the rule will be reducecd; if the predecence
of the token is higher then token will be shifted.
\item A shift/reduce conflict between a rule and a token with
the same precedence will be resolved using the associativity.
\item when a shift/reduce can not be resolved, a warning, and
in favor of \textit{shift}
\end{enumerate}
\section{MENHIR Related}
\begin{enumerate}
\item Syntax
\begin{bluetext}
specification ::= declaration . . . declaration %% rule . . . rule [ %% Objective Caml code ]
declaration ::= %{ Objective Caml code %}
% parameter < uid : Objective Caml module type >
%token [ < Objective Caml type > ] uid . . . uid
%nonassoc uid . . . uid
%left uid . . . uid
%right uid . . . uid
%type < Objective Caml type > lid . . . lid
%start [ < Objective Caml type > ] lid . . . lid
rule ::= [%public] [%inline] lid [( id, ..., id)] : [|] group | ... | group
group ::= production | . . . | production { Objective Caml code } [ %prec id ]
production ::= producer . . . producer [ %prec id ]
producer ::= [ lid = ] actual
actual ::= id [( actual, ..., actual)] [? | + | *]
\item parameter
\begin{bluetext}
%parameter <uid: Objective module types>
\end{bluetext}
This causes the entire parser to be parameterized over the module.
\item multiple files (private and public,tokens aside)
\item parameterized rules
\item inline
\item standard library
\begin{tabular}{|c|c|c|c|}
\hline
Name & Recognizes & Produces & Comment \\
\hline
option(X) & $\epsilon$ | X & $\alpha$ option, if X : $\alpha$ & \\
ioption(X) & & & inlined \\
boption(X) & & & bool \\
loption(X) & $\epsilon$ |X & $\alpha$ list, if X : $\alpha$ list & \\
pair(X,Y) & X Y & $\alpha \times \beta$ & \\
separated\_pair(X,sep,Y) & X sep Y & $\alpha \times \beta$ & \\
preceded(opening,X) & opening X & $\alpha$, if X : $\alpha$ & \\
terminated(X,closing) & X closing & $\alpha$, if X : $\alpha$ & \\
delimited(opening, X closing) & opening X closing & $\alpha$, if X : $\alpha$ & \\
list(X) & & & \\
nonempty\_list(X) & & & \\
separated\_list(sep,X) & & & \\
sepearted\_nonempty\_list(sep,X) & & & \\
\hline
\end{tabular}
\item combined with ulex
A typical tags file is as follows
\begin{bluetext}
true:use_menhir, pkg_ulex, pkg_pcre, pkg_menhirLib, pkg_batteries
<scanner.ml>: pkg_ulex, syntax_camlp4o
\end{bluetext}
You have to use \mint{ocaml}|Menhirlib.Convert| API, here
\inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml}{code/menhir/convert.mli}
\item example csss project
\end{enumerate}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "master"
%%% End: