ocaml-yacc.tex

\section{Ocamlyacc}
We mainly cover menhir here.

A grammar is mainly composed of four elements(terminals,
non-terminals, production rulls, start symbol)

\textbf{Syntax}


\begin{bluetext}
    % {header
    % }
    %%
    Grammar rules 
    %%
    trailer 
\end{bluetext}


A tiny example as follows (It has a subtle bug, readers should find it)  
  \begin{ocamlcode}
% {
  open Printf
  let parse_error s = 
    print_endline "error\n";
    print_endline s ; 
    flush stdout 
%}

%token <float> NUM 
%token PLUS MINUS MULTIPLY DIVIDE CARET UMINUS
%token NEWLINE 

%start input 
%type <unit> input 
%type <float> exp 
%% /* rules and actions */
    
input: /* empty */ {}
    | input line {}
; 

line: NEWLINE {}
    |exp NEWLINE  {printf "\t%.10g\n" $1 ; flush stdout}
;

exp: NUM { $1 }
    |exp exp PLUS {$1 +. $2 }
    |exp exp MINUS {$1 -. $2 }
    |exp exp MULTIPLY {$1 *. $2 }
    |exp exp DIVIDE {$1 /. $2 }
    |exp exp CARET {$1 ** $2 }
    |exp UMINUS {-. $1 }
; 

%%
\end{ocamlcode}

Notice that start non-terminal can be given \textit{several}, then you will
have a different .mli file, notice that it's different from ocamllex,
ocamlyacc will generate a .mli file, so here we get the output
interface as follows:

\begin{bluetext}
  %type <type> nonterminal ... nonterminal
  %start symbol ... symbol
\end{bluetext}

\begin{ocamlcode}
type token =
  | NUM of (float)
  | PLUS
  | MINUS
  | MULTIPLY
  | DIVIDE
  | CARET
  | UMINUS
  | NEWLINE
val input :
  (Lexing.lexbuf  -> token) -> Lexing.lexbuf -> unit
val exp :
  (Lexing.lexbuf  -> token) -> Lexing.lexbuf -> float
\end{ocamlcode}

Notice that we may use character strings as implicit terminals as in 
\begin{ocamlcode}
expr : expr "+" expr {}
     | expr "*" expr {}
     | ... ;
\end{ocamlcode}
They are directly processed by the parser without passing through the
lexer. But it breaks the uniformity

\textbf{Contextual Grammar}

\inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml}{code/context.ml}


First gammar
\begin{ocamlcode}
  /* empty corresponds Ctrl-d.*/
  input : /*empty*/ {} | input line {}; 
\end{ocamlcode}

Notice here we \textbf{preferred left-recursive} in yacc.
The underlying theory for LALR prefers LR. because all the elements
must be shifted onto the stack \textit{before} the rule can be applied even once.

\begin{ocamlcode}
  exp : NUM | exp exp PLUS | exp exp MINUS  ... ; 
\end{ocamlcode}

Here is our lexer
\begin{ocamlcode}
{
  open Rpcalc
  open Printf
  let first = ref true
}
let digit = ['0'-'9']
rule token = parse 
  |[' ' '\t' ] {token lexbuf}
  |'\n' {NEWLINE}
  | (digit+ | "." digit+ | digit+ "." digit*) as num 
      {NUM (float_of_string num)}
  |'+' {PLUS}
  |'-' {MINUS}
  |'*' {MULTIPLY}
  |'/' {DIVIDE}
  |'^' {CARET}
  |'n' {UMINUS}
  |_ as c  {printf "unrecognized char %c" c ; token lexbuf}
  |eof {
    if !first then begin first := false; NEWLINE end 
    else raise End_of_file }

{
  let main ()  = 
    let file = Sys.argv.(1) in 
    let chan = open_in file in 
    try 
      let lexbuf = Lexing.from_channel chan in 
      while true do 
        Rpcalc.input token lexbuf 
      done 
    with End_of_file -> close_in chan 

 let _ = Printexc.print main ()

}
\end{ocamlcode}

We write driver function in lexer for convenience, since lexer depends
on yacc. \textit{Printex.print}

\textbf{precedence associatitvity }

Operator precedence is determined by the line ordering of the
declarations; 

\textit{\%prec} in the grammar section, the \textit{\%prec} simply
instructs ocamlyacc that the rule \textit{|Minus exp } has the same
precedence as NEG
\textit{\%left,\%right,\%nonassoc}

\begin{enumerate}
  \item The associatitvity of an operator op determines how repeated
    uses of the operator nest: whether \textit{x op y op z} is parsed
    by grouping \textit{x} with \textit{y} or. nonassoc will consider
    it as an error
  \item All the tokens declared in a single precedence declaration
    have equal precedence and nest together according to their
    associatitvity
  \end{enumerate}

\inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml} {code/calc/simple.mly}

Notice here the \textit{NEG} is a place a holder, it takes the
place, but it's not a token. since here we need \textit{MINUS} has
different levels. 

%% The interface file is as follows
%% \inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml}{code/calc/simple.mli}
  
\textbf{Error Recovery}

By default, the parser function raises exception after calling \textit{parse\_error}
The ocamlyacc reserved word \textit{error}

  \begin{ocamlcode}
    line: NEWLINE |exp NEWLINE | error NEWLINE {}
  \end{ocamlcode}
  
If an expression that cannot be evaluated is read, the error will be
recognized by the third rule for line, and parsing will continue
(parse\_error is still called). This form of error recovery deals
with syntax errors. There are also other kinds of errors.

\textbf{Location Tracking}

It's very easy. First, remember to use \textit{Lexing.new\_line} to
track your line number, then use  \textit{rhs\_start\_pos, rhs\_end\_pos} to
 track the symbolposition.  1 is for the leftmost component.

\begin{ocamlcode}
            Parsing.(
              let start_pos = rhs_start_pos 3 in 
              let end_pos = rhs_end_pos 3 in 
              printf "%d.%d --- %d.%d: dbz"
                start_pos.pos_lnum (start_pos.pos_cnum -start_pos.pos_bol)
                end_pos.pos_lnum (end_pos.pos_cnum - end_pos.pos_bol); 
              1.0
            )    
\end{ocamlcode}

For groupings, use the following function \textit{symbol\_start\_pos,
  symbol\_end\_pos}

\textit{symbol\_start\_pos} is set to the beginning of the leftmost
component, and \textit{symbol\_end\_pos} to the end of the rightmost component.

\textbf{A complex Example}

\inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml}{code/calc/calc_compl.mly}


In my opinion, the best practice is first modify .mly file, then
change .mll file later


\textbf{SHIFT REDUCE }
  
A very nice tutorial
\href{http://www.cs.uiuc.edu/class/sp10/cs421/lectures/lecture%2010%20supp.pdf}{shift-reduce}

\inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml}{code/shift_reduce/s1.mly}
\inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml}{code/shift_reduce/s2.mly}
\inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml}{code/shift_reduce/s3.mly}
\inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml}{code/shift_reduce/s4.mly}
\inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml}{code/shift_reduce/s5.mly}

The prec trick is covered not correctly in this tutorial.
        
The symbols are declared to associate to the left, right,
nonassoc. The symbols are \textit{usually} tokens, they can
also be \textit{dummy} nonterminals, for use with the \%prec
directive in the rule.

\begin{enumerate}
\item Tokens and rules have precedences. The precedence of a
\textit{rule} is the precedence of its \textit{rightmost}
terminal. you can override this default by using the \textit{\%prec}
directive in the rule
\item A reduce/reduce conflict is resolved in favor of the first
ruel(in the order given by the source file)
\item A shift/reduce conflict is resolved by comparing the
\textit{predecence of the rule to be reduced} with the
\textit{precedence of the token to be shifted}. If the predecence of
the rule is higher, then the rule will be reducecd; if the predecence
of the token is higher then token will be shifted.
\item A shift/reduce conflict between a rule and a token with
the same precedence will be resolved using the associativity.
\item when a shift/reduce can not be resolved, a warning, and
in favor of \textit{shift}
\end{enumerate}


\section{MENHIR Related}

\begin{enumerate}
\item Syntax
  \begin{bluetext}
specification ::= declaration . . . declaration %% rule . . . rule [ %% Objective Caml code ]
 declaration ::= %{ Objective Caml code %}
                         % parameter < uid : Objective Caml module type >
                         %token [ < Objective Caml type > ] uid . . . uid
                         %nonassoc uid . . . uid
                         %left uid . . . uid
                         %right uid . . . uid
                         %type < Objective Caml type > lid . . . lid 
                         %start [ < Objective Caml type > ] lid . . . lid
rule ::= [%public] [%inline] lid [( id, ..., id)] : [|] group | ... | group 
group ::= production | . . . | production { Objective Caml code } [ %prec id ]
production ::= producer . . . producer [ %prec id ]
 producer ::= [ lid = ] actual
actual ::= id [( actual, ..., actual)] [? | + | *]    

\item parameter

  \begin{bluetext}
%parameter <uid: Objective module types>    
  \end{bluetext}

This causes the entire parser to be parameterized over the module. 

\item multiple files (private and public,tokens aside)

\item parameterized rules

\item inline

\item standard library

  \begin{tabular}{|c|c|c|c|}
\hline
Name & Recognizes & Produces & Comment \\
\hline
option(X) & $\epsilon$ | X    & $\alpha$ option, if X : $\alpha$  &  \\
ioption(X) &                          &                            & inlined \\
boption(X)  &                        &                            & bool \\
loption(X)  &         $\epsilon$ |X & $\alpha$ list, if X : $\alpha$ list  & \\
pair(X,Y) &  X Y & $\alpha \times \beta$ & \\
separated\_pair(X,sep,Y) & X sep Y  & $\alpha \times \beta$     & \\
preceded(opening,X) & opening X & $\alpha$, if X : $\alpha$ & \\
terminated(X,closing) & X closing & $\alpha$, if X : $\alpha$ & \\
delimited(opening, X closing) & opening X closing & $\alpha$, if X : $\alpha$ & \\
list(X) &  & & \\
nonempty\_list(X) & & &  \\
separated\_list(sep,X) & & & \\
sepearted\_nonempty\_list(sep,X) & & & \\
\hline
  \end{tabular}


\item combined with ulex 

A typical tags file is as follows
  \begin{bluetext}
true:use_menhir, pkg_ulex, pkg_pcre, pkg_menhirLib, pkg_batteries
<scanner.ml>: pkg_ulex, syntax_camlp4o
  \end{bluetext}

You have to use \mint{ocaml}|Menhirlib.Convert| API, here 
\inputminted[fontsize=\scriptsize, fontsize=\scriptsize, ]{ocaml}{code/menhir/convert.mli}


\item example csss project 

\end{enumerate}


%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "master"
%%% End: