-
Notifications
You must be signed in to change notification settings - Fork 32
Stanford Compilers CS143
These notes were taken while I was studying Stanford's CS143 course and refreshing my rusty knowledge on compilers. The reason I decided to write the following is that, every now and then, I need to recall some specific detail of a parsing algorithm and, in this case, having a reference comes in handy. Hopefully, Stanford students and compiler geeks might find the following helpful when solving their assignments.
Feel free to mail me for any corrections or suggestions.
- Maintain a worklist of sentential forms, initially just the start symbol
- While the worklist isn't empty:
- Remove an element from the worklist
- If it matches the target string, we're done. Otherwise, for each possible string that can be derived in one step (i.e. ), add that string to the worklist. Notice that may consist of both terminals and non-terminals.
Parse tree is tracked in the worklist.
BFS is very slow:
- Wasted effort in expanding sentential forms that will never be matched
- High branching factor
- Same as above but only considers lefmost derivations that match the input string
- Less wasted effort and branching factor
- Will always find a valid parse of a program if one exists
- Can be modified to find if a program can't be parseed
- Left-recursion might result in exponential parsing time and memory
- Lower memory usage
- High performance
- Easy to implement
- Fails on left-recursive grammars (but left-recursion can be automatically eliminated)
BFS | DFS |
---|---|
Works on all grammars | Works on grammars without left-recursion |
Worst-case runtime is exponential | Worst-case runtime is exponential |
Worst-case memory is exponential | Worst-case memory is linear |
Rarely used in practice | Used in limited form as recursive descent |
-
LL(1)
- L: Left-to-right scan of tokens
- L: Leftmost derivation
- 1 token of lookahead
-
Predictive top-down parsing is based on FIRST sets
epsilon-free FIRST The following fixed-point iteration or transitive closure algorithm can be used:
-
For all non-terminals A, set FIRST(A) = { t | A -> tw for some w }
-
Repeat until no changes occur
- For each non-terminal A, for each production A -> Bw, set FIRST(A) = FIRST(A) U FIRST(B)
-
For all non-terminals A, set FIRST(A) = { t | A -> tw for some w }
-
For all non-terminals A where A -> e is a production, add e to FIRST(A)
-
Repeat until no changes occur
-
For each production A -> a where a is a string of non-terminals whose FIRST sets contain e, set FIRST(A) = FIRST(A) U { e }
-
For each production A -> atw where a is a string of non-terminals whose FIRST sets contain e, set FIRST(A) U { t }
-
For each production A -> aBw, where a is a string of non-terminals whose FIRST sets contain e, set FIRST(A) = FIRST(A) U (FIRST(B) - {e})
- Initialize stack with S$
- Repeat until stack is empty
- Let the next character of w be t
- If the top of the stack is a terminal r
- If r and t don't match, report an error
- Otherwise consume t and pop r from the stack
- If the top of the stack is a non-terminal A
- If T[A, t] is undefined, report an error
- Replace top of the stack with T[A, t]
Graph-search-based | Predictive top down |
---|---|
Can parse all grammars | Can parse limited number of grammars |
-
Bottom-up parsing is the process of building the input's syntax tree in reverse
-
Bottom-up parsing is based on handle recognition. Handles are defined as either:
- The the leftmost complete cluster of leaf nodes
- An expansion of the rightmost non-terminal of the input
-
Handles are always found at the top of the parser's stack
-
The series of presentations cover four bottom-up shift/reduce parsing algorithms, namely LR(0), SLR(1), LALR(1), LR(1), sorted in order of parsing power, from least to most powerful. However, the aforementioned algorithms can be better understood when studied in the following order: LR(0), LR(1), SLR(1), LALR(1).
-
The aforementioned algorithms are:
- Directional - Scan the input from left-to-right (LR)
- Predictive - Guess which production should be inverted
The algorithm for constructing LR(0) NFA can be found in [02] p.134:
Algorithm #1
- Create a state for each non-terminal
- For each production
- Construct states for each possible way of splitting
- Add transitions on between and
- For each state for non-terminal , add an -transition from to
Using subject construction, the NFA above may be converted to a DFA. However, a different, yet equivalent approach, can be followed for directly constructing an LR(0) DFA. See [02] p.162.
Algorithm #2
- Begin with a state containing
- Compute the closure of the state
- If is in the state, add to the state for each production
- Repeat until no new states are added:
- If a state contains a production for symbol , add a transition on from that state to the state containing the closure of
- Set the top of the stack to (?, 1)
- While the stack is not empty:
- If action[state] is shift:
then shift the input and set
- Let the next symbol of input be t
- Push (t, goto[state, t])
- If action[state, t] is reduce
- Pop symbols off the stack
- Let the sate atop the stack be top-state
- Push (A, goto[top-state, A])
- Otherwise report an error
- If action[state] is shift:
then shift the input and set
LR(1) is sometimes mentioned in the literature as CLR(1) (C stands for Canonical).
The algorithm for constructing LR(1) NFA can be found in [02] p.325:
- Begin with a state
- For each state , for each production
- Construct states for all possible ways of splitting
- Add -transition from to each of these states
- Add transitions on between and
- For each state , add an -transition from to for each terminal
To directly construct the LR(1) DFA:
- Begin in a state containing , where is the start symbol
- Compute the closure of the state
- If is in the state, add to the state for each production and for each terminal
- Repeat until no new states are added
- If a state contains a production , add a transition on from that state to the state containing the closure of
- Begin with an empty stack and the input set to , where is the string to parse. Set state to the initial state.
- Repeat the following:
- Let the next symbol of input be t
- If action[state, t] is shift, then shift the input and set state = goto[state, t]
- If action[state, t] is reduce
- Pop symbols off the stack; replace them with
- Let the sate atop the stack be top-state
- Set state = goto[top-state, A]
- If action[state, t] is accept, the the parse is done
- If action[state, t] is error, report an error
-
SLR(1) stands for Simple LR(1)
-
Modified LR(0) that uses lookahead to avoid shift/reduce conflicts
-
Only reduce if the next token is in
-
In SLR(1), a lookahead of , practically means "What could follow somewhere in the grammar?", even if in a particular state couldn't possibly have that symbol after it.
-
LALR(1) is the topic discussed in [03]. Almost all of the slide deck is dedicated to explaining in detail the problems that LALR(1) tries to solve. Understanding LALR(1) is a piece of cake if you follow the author's train of thought.
-
LALR(1) stands for Lookahead(1) LR(0)
-
Practically the LALR(1) can be constructed by merging LR(1) states with equal cores. However, this is not very practical, as LR(1) automata tend to be large.
-
The resulting LALR(1) automaton cannot have more shift/reduce conflicts than the original LR(1) automaton, since both automata share the same cores. However, extra reduce/reduce conflicts might arise.
-
Two algorithms for more efficient construction of LALR(1) automata:
-
Lazy merging - Start constructing LR(1) states and maintain them in a worklist. As new states are created, their cores are looked up in the worklist and merging takes place.
-
The elegant way [04] - Discussed in "Simple computation of LALR(1) lookahead sets" by Manuel E. Bermudez and George Logothetis:
-
Construct the LR(0) automaton
-
Construct an annotated grammar this way; For each in some state :
-
Trace out the path takes through the LR(0) automaton starting at
-
Replace each non-terminal in with a non-terminal annotated with the state transitioned to and from by the edge labeled with that non-terminal
-
Replace with a non-terminal annotated with the start and end state of the transition on out of
-
-
Compute FOLLOW sets in the resulting automaton. The resulting sets are more precise than the original grammar's FOLLOW sets as they implicitly hold context information.
-
Propagate changes - For each item in a state
-
Let be the non-terminal corresponding to following the transition out of into some state .
-
Trace through the automaton along the path labeled by . This will lead to a state containing an item .
-
Add to the lookahead of the contents of
-
-
-
[01] https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/04/Slides04B.pdf
[02] https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/05/Slides05.pdf
[03] https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/06/Slides06.pdf
[04] https://www.sciencedirect.com/science/article/pii/0020019089900793