Stanford Compilers CS143

Stanford Compilers (CS143) course notes

These notes were taken while I was studying Stanford's CS143 course and refreshing my rusty knowledge on compilers. The reason I decided to write the following is that, every now and then, I need to recall some specific detail of a parsing algorithm and, in this case, having a reference comes in handy. Hopefully, Stanford students and compiler geeks might find the following helpful when solving their assignments.

Feel free to mail me for any corrections or suggestions.

Top-down parsing

Graph-search with backtracking

BFS (Breadth First Search) parsing

Maintain a worklist of sentential forms, initially just the start symbol
While the worklist isn't empty:
1. Remove an element from the worklist
2. If it matches the target string, we're done. Otherwise, for each possible string that can be derived in one step (i.e. ), add that string to the worklist. Notice that may consist of both terminals and non-terminals.

Parse tree is tracked in the worklist.

BFS is very slow:

Wasted effort in expanding sentential forms that will never be matched
High branching factor

Leftmost BFS parsing

Same as above but only considers lefmost derivations that match the input string
Less wasted effort and branching factor
Will always find a valid parse of a program if one exists
Can be modified to find if a program can't be parseed
Left-recursion might result in exponential parsing time and memory

Leftmost DFS (Depth First Search) parsing

Lower memory usage
High performance
Easy to implement
Fails on left-recursive grammars (but left-recursion can be automatically eliminated)

Comparison of graph search methods

BFS	DFS
Works on all grammars	Works on grammars without left-recursion
Worst-case runtime is exponential	Worst-case runtime is exponential
Worst-case memory is exponential	Worst-case memory is linear
Rarely used in practice	Used in limited form as recursive descent

Predictive top-down parsing

LL(1)
- L: Left-to-right scan of tokens
- L: Leftmost derivation
- 1 token of lookahead
Predictive top-down parsing is based on FIRST sets

Computing FIRST sets

epsilon-free FIRST The following fixed-point iteration or transitive closure algorithm can be used:

For all non-terminals A, set FIRST(A) = { t | A -> tw for some w }
Repeat until no changes occur
1. For each non-terminal A, for each production A -> Bw, set FIRST(A) = FIRST(A) U FIRST(B)
For all non-terminals A, set FIRST(A) = { t | A -> tw for some w }
For all non-terminals A where A -> e is a production, add e to FIRST(A)
Repeat until no changes occur
For each production A -> a where a is a string of non-terminals whose FIRST sets contain e, set FIRST(A) = FIRST(A) U { e }
For each production A -> atw where a is a string of non-terminals whose FIRST sets contain e, set FIRST(A) U { t }
For each production A -> aBw, where a is a string of non-terminals whose FIRST sets contain e, set FIRST(A) = FIRST(A) U (FIRST(B) - {e})

LL(1) table construction agorithm

LL(1) parsing algorithm

Initialize stack with S$
Repeat until stack is empty
Let the next character of w be t
If the top of the stack is a terminal r
1. If r and t don't match, report an error
2. Otherwise consume t and pop r from the stack
If the top of the stack is a non-terminal A
1. If T[A, t] is undefined, report an error
2. Replace top of the stack with T[A, t]

Comparison of graph-search-based vs. predictive top-down parsing

Graph-search-based	Predictive top down
Can parse all grammars	Can parse limited number of grammars

Bottom-up parsing

Bottom-up parsing is the process of building the input's syntax tree in reverse
Bottom-up parsing is based on handle recognition. Handles are defined as either:
1. The the leftmost complete cluster of leaf nodes
2. An expansion of the rightmost non-terminal of the input
Handles are always found at the top of the parser's stack
The series of presentations cover four bottom-up shift/reduce parsing algorithms, namely LR(0), SLR(1), LALR(1), LR(1), sorted in order of parsing power, from least to most powerful. However, the aforementioned algorithms can be better understood when studied in the following order: LR(0), LR(1), SLR(1), LALR(1).
The aforementioned algorithms are:
- Directional - Scan the input from left-to-right (LR)
- Predictive - Guess which production should be inverted

LR(0) automata construction algorithm

The algorithm for constructing LR(0) NFA can be found in [02] p.134:

Algorithm #1

Create a state for each non-terminal
For each production
Construct states for each possible way of splitting
Add transitions on between and
For each state for non-terminal , add an -transition from to

Using subject construction, the NFA above may be converted to a DFA. However, a different, yet equivalent approach, can be followed for directly constructing an LR(0) DFA. See [02] p.162.

Algorithm #2

Begin with a state containing
Compute the closure of the state

If is in the state, add to the state for each production

Repeat until no new states are added:

If a state contains a production for symbol , add a transition on from that state to the state containing the closure of

LR(0) parsing algorithm

Set the top of the stack to (?, 1)
While the stack is not empty:
- If action[state] is shift: then shift the input and set
  - Let the next symbol of input be t
  - Push (t, goto[state, t])
- If action[state, t] is reduce
  - Pop symbols off the stack
  - Let the sate atop the stack be top-state
  - Push (A, goto[top-state, A])
- Otherwise report an error

LR(1) automata construction algorithm

LR(1) is sometimes mentioned in the literature as CLR(1) (C stands for Canonical).

The algorithm for constructing LR(1) NFA can be found in [02] p.325:

Begin with a state
For each state , for each production
Construct states for all possible ways of splitting
Add -transition from to each of these states
Add transitions on between and
For each state , add an -transition from to for each terminal

To directly construct the LR(1) DFA:

Begin in a state containing , where is the start symbol
Compute the closure of the state

If is in the state, add to the state for each production and for each terminal

Repeat until no new states are added

If a state contains a production , add a transition on from that state to the state containing the closure of

LR(1) parsing algorithm

Begin with an empty stack and the input set to , where is the string to parse. Set state to the initial state.
Repeat the following:
- Let the next symbol of input be t
- If action[state, t] is shift, then shift the input and set state = goto[state, t]
- If action[state, t] is reduce
  - Pop symbols off the stack; replace them with
  - Let the sate atop the stack be top-state
  - Set state = goto[top-state, A]
- If action[state, t] is accept, the the parse is done
- If action[state, t] is error, report an error

SLR(1)

SLR(1) stands for Simple LR(1)
Modified LR(0) that uses lookahead to avoid shift/reduce conflicts
Only reduce if the next token is in
In SLR(1), a lookahead of , practically means "What could follow somewhere in the grammar?", even if in a particular state couldn't possibly have that symbol after it.

LALR(1)

LALR(1) is the topic discussed in [03]. Almost all of the slide deck is dedicated to explaining in detail the problems that LALR(1) tries to solve. Understanding LALR(1) is a piece of cake if you follow the author's train of thought.
LALR(1) stands for Lookahead(1) LR(0)
Practically the LALR(1) can be constructed by merging LR(1) states with equal cores. However, this is not very practical, as LR(1) automata tend to be large.
The resulting LALR(1) automaton cannot have more shift/reduce conflicts than the original LR(1) automaton, since both automata share the same cores. However, extra reduce/reduce conflicts might arise.
Two algorithms for more efficient construction of LALR(1) automata:
- Lazy merging - Start constructing LR(1) states and maintain them in a worklist. As new states are created, their cores are looked up in the worklist and merging takes place.
- The elegant way [04] - Discussed in "Simple computation of LALR(1) lookahead sets" by Manuel E. Bermudez and George Logothetis:
  1. Construct the LR(0) automaton
  2. Construct an annotated grammar this way; For each in some state :
    1. Trace out the path takes through the LR(0) automaton starting at
    2. Replace each non-terminal in with a non-terminal annotated with the state transitioned to and from by the edge labeled with that non-terminal
    3. Replace with a non-terminal annotated with the start and end state of the transition on out of
  3. Compute FOLLOW sets in the resulting automaton. The resulting sets are more precise than the original grammar's FOLLOW sets as they implicitly hold context information.
  4. Propagate changes - For each item in a state
    1. Let be the non-terminal corresponding to following the transition out of into some state .
    2. Trace through the automaton along the path labeled by . This will lead to a state containing an item .
    3. Add to the lookahead of the contents of