
Stanford Compilers CS143

huku edited this page Oct 15, 2022 · 4 revisions

Stanford Compilers (CS143) course notes

These notes were taken while I was studying Stanford's CS143 course and refreshing my rusty knowledge of compilers. The reason I decided to write them is that, every now and then, I need to recall some specific detail of a parsing algorithm, and in such cases having a reference comes in handy. Hopefully, Stanford students and compiler geeks will find these notes helpful when solving their assignments.

Feel free to mail me for any corrections or suggestions.

Top-down parsing

Graph-search with backtracking

BFS (Breadth First Search) parsing

  1. Maintain a worklist of sentential forms, initially just the start symbol S
  2. While the worklist isn't empty:
    1. Remove a sentential form x from the worklist
    2. If it matches the target string, we're done. Otherwise, for each string y that can be derived from x in one step (i.e. x => y), add y to the worklist. Notice that y may consist of both terminals and non-terminals.

The partial parse tree built so far is tracked alongside each sentential form in the worklist.
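The steps above can be sketched in Python. The grammar here is a hypothetical toy grammar (S -> aSb | ab), and the length-based pruning is only safe because it has no epsilon productions (sentential forms never shrink):

```python
from collections import deque

# Hypothetical toy grammar: uppercase letters are non-terminals.
GRAMMAR = {"S": ["aSb", "ab"]}

def bfs_parse(start, target):
    """Return True if `target` can be derived from `start`."""
    worklist = deque([start])       # worklist of sentential forms
    seen = {start}
    while worklist:
        form = worklist.popleft()
        if form == target:
            return True
        if len(form) > len(target):
            continue                # prune: without epsilon productions, forms never shrink
        # Expand every non-terminal by every production (one derivation step).
        for i, sym in enumerate(form):
            if sym in GRAMMAR:
                for rhs in GRAMMAR[sym]:
                    derived = form[:i] + rhs + form[i + 1:]
                    if derived not in seen:
                        seen.add(derived)
                        worklist.append(derived)
    return False
```

The `seen` set keeps the search finite; without it, the same sentential form would be re-expanded forever on some grammars.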

BFS is very slow:

  • Wasted effort in expanding sentential forms that will never be matched
  • High branching factor

Leftmost BFS parsing

  • Same as above but only considers leftmost derivations that match the input string
  • Less wasted effort and a lower branching factor
  • Will always find a valid parse of a program if one exists
  • Can be modified to detect that a program can't be parsed
  • Left-recursion might result in exponential parsing time and memory

Leftmost DFS (Depth First Search) parsing

  • Lower memory usage
  • High performance
  • Easy to implement
  • Fails on left-recursive grammars (but left-recursion can be automatically eliminated)
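A minimal sketch of leftmost DFS, again over a hypothetical toy grammar. Recursion handles the backtracking; the length check is a safe prune only because the grammar has no epsilon productions:

```python
# Hypothetical toy grammar without left-recursion or epsilon productions.
GRAMMAR = {"S": ["aSb", "ab"]}

def dfs_parse(form, target):
    """Expand only the leftmost non-terminal, backtracking on failure."""
    if form == target:
        return True
    if len(form) > len(target):
        return False              # safe prune: forms never shrink
    for i, sym in enumerate(form):
        if sym in GRAMMAR:        # leftmost non-terminal
            if form[:i] != target[:i]:
                return False      # terminals before it must match the target prefix
            return any(dfs_parse(form[:i] + rhs + form[i + 1:], target)
                       for rhs in GRAMMAR[sym])
    return False                  # only terminals left, but no match
```

On a left-recursive grammar such as S -> Sa | a the same function would recurse forever, which is the failure mode noted above.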

Comparison of graph search methods

| BFS | DFS |
| --- | --- |
| Works on all grammars | Works on grammars without left-recursion |
| Worst-case runtime is exponential | Worst-case runtime is exponential |
| Worst-case memory is exponential | Worst-case memory is linear |
| Rarely used in practice | Used in limited form as recursive descent |

Predictive top-down parsing

  • LL(1)

    • L: Left-to-right scan of tokens
    • L: Leftmost derivation
    • 1 token of lookahead
  • Predictive top-down parsing is based on FIRST sets

Computing FIRST sets

For epsilon-free FIRST sets, the following fixed-point iteration (transitive closure) algorithm can be used:

  1. For all non-terminals A, set FIRST(A) = { t | A -> tw for some w }
  2. Repeat until no changes occur:
    1. For each production A -> Bw, set FIRST(A) = FIRST(A) U FIRST(B)

In the presence of epsilon productions:

  1. For all non-terminals A, set FIRST(A) = { t | A -> tw for some w }
  2. For all non-terminals A where A -> e is a production, add e to FIRST(A)
  3. Repeat until no changes occur:
    1. For each production A -> a, where a is a string of non-terminals whose FIRST sets all contain e, set FIRST(A) = FIRST(A) U { e }
    2. For each production A -> atw, where a is a string of non-terminals whose FIRST sets all contain e and t is a terminal, set FIRST(A) = FIRST(A) U { t }
    3. For each production A -> aBw, where a is a string of non-terminals whose FIRST sets all contain e, set FIRST(A) = FIRST(A) U (FIRST(B) - { e })
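The full algorithm can be sketched as a single fixed-point loop that scans each right-hand side left to right. The grammar below is a made-up example, and the empty string `""` stands in for epsilon:

```python
# Made-up grammar; productions are tuples of symbols and "" denotes epsilon.
# S -> A B,  A -> a | epsilon,  B -> b
GRAMMAR = {
    "S": [("A", "B")],
    "A": [("a",), ()],
    "B": [("b",)],
}

def first_sets(grammar):
    """Fixed-point computation of FIRST for every non-terminal."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                before = len(first[nt])
                nullable = True          # does the prefix scanned so far derive epsilon?
                for sym in rhs:
                    if sym not in grammar:        # terminal: it starts this RHS
                        first[nt].add(sym)
                        nullable = False
                        break
                    first[nt] |= first[sym] - {""}
                    if "" not in first[sym]:      # non-nullable symbol blocks the rest
                        nullable = False
                        break
                if nullable:
                    first[nt].add("")             # the whole RHS can derive epsilon
                changed |= len(first[nt]) != before
    return first
```

Scanning each RHS left to right until a non-nullable symbol is hit covers all three cases of step 3 at once.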

LL(1) table construction algorithm

For each production A -> w and for each terminal t in FIRST(w), set T[A, t] = w. If e is in FIRST(w), additionally set T[A, t] = w for each terminal t in FOLLOW(A).

LL(1) parsing algorithm

  1. Initialize the stack with S$
  2. Repeat until the stack is empty:
    1. Let the next token of w be t
    2. If the top of the stack is a terminal r:
      1. If r and t don't match, report an error
      2. Otherwise consume t and pop r from the stack
    3. If the top of the stack is a non-terminal A:
      1. If T[A, t] is undefined, report an error
      2. Otherwise replace A, the top of the stack, with T[A, t]
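The loop above can be sketched as a table-driven parser. The grammar (S -> aSb | c) and its table are hand-built assumptions for illustration, with `"$"` as the end-of-input marker:

```python
# Hypothetical LL(1) grammar S -> a S b | c with a hand-built parse table.
TABLE = {
    ("S", "a"): "aSb",
    ("S", "c"): "c",
}

def ll1_parse(tokens):
    stack = ["$", "S"]              # start symbol atop the end marker
    tokens = list(tokens) + ["$"]
    pos = 0
    while stack:
        top = stack.pop()
        t = tokens[pos]
        if top == t:                # matching terminal (or $): consume it
            pos += 1
        elif (top, t) in TABLE:     # non-terminal: expand, pushing the RHS reversed
            stack.extend(reversed(TABLE[(top, t)]))
        else:
            return False            # terminal mismatch or undefined table entry
    return pos == len(tokens)
```

Pushing the right-hand side reversed keeps its leftmost symbol on top of the stack, which is what makes the derivation leftmost.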

Comparison of graph-search-based vs. predictive top-down parsing

| Graph-search-based | Predictive top-down |
| --- | --- |
| Can parse all grammars | Can parse a limited class of grammars |

Bottom-up parsing

  • Bottom-up parsing is the process of building the input's syntax tree in reverse

  • Bottom-up parsing is based on handle recognition. Handles are defined as either:

    1. The leftmost complete cluster of leaf nodes
    2. An expansion of the rightmost non-terminal of the input
  • Handles are always found at the top of the parser's stack

  • The series of presentations covers four bottom-up shift/reduce parsing algorithms, namely LR(0), SLR(1), LALR(1), and LR(1), sorted in order of parsing power, from least to most powerful. However, these algorithms are easier to understand when studied in the following order: LR(0), LR(1), SLR(1), LALR(1).

  • The aforementioned algorithms are:

    • Directional - Scan the input from left-to-right (LR)
    • Predictive - Guess which production should be inverted

LR(0) automata construction algorithm

The algorithm for constructing the LR(0) NFA can be found in [02] p.134:

Algorithm #1

  1. Create a state for each non-terminal
  2. For each production A -> w:
    1. Construct states A -> x·y for each possible way of splitting w = xy (the · marks how much of the production has been seen)
    2. Add transitions on s between A -> x·sy and A -> xs·y for each symbol s
  3. For each state A -> x·By, where B is a non-terminal, add an e-transition from A -> x·By to the state for B

Using the subset construction, the NFA above may be converted to a DFA. However, a different yet equivalent approach can be followed to construct the LR(0) DFA directly. See [02] p.162.

Algorithm #2

  1. Begin with a state containing the items S -> ·w for each production S -> w of the start symbol
  2. Compute the closure of the state:
    • If A -> x·By is in the state, add B -> ·w to the state for each production B -> w
  3. Repeat until no new states are added:
    • If a state contains an item A -> x·sy for symbol s, add a transition on s from that state to the state containing the closure of A -> xs·y
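The closure step can be sketched as follows; items are represented as `(lhs, rhs, dot)` triples, e.g. `("E", ("E", "+", "T"), 0)` stands for E -> ·E + T, and the grammar is a made-up example:

```python
# Made-up grammar for the closure sketch.
GRAMMAR = {
    "S": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("id",)],
}

def closure(items):
    """If the dot precedes a non-terminal B, add B -> ·w for every production B -> w."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for prod in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], prod, 0)    # B -> ·w
                    if item not in items:
                        items.add(item)
                        changed = True
    return items
```

Starting from S -> ·E, the closure pulls in both productions of E and, through E -> ·T, the production of T as well.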

LR(0) parsing algorithm

  1. Set the top of the stack to (?, 1)
  2. While the stack is not empty:
    • If action[state] is shift:
      • Let the next symbol of input be t
      • Push (t, goto[state, t])
    • If action[state] is reduce A -> w:
      • Pop |w| symbols off the stack
      • Let the state atop the stack be top-state
      • Push (A, goto[top-state, A])
    • Otherwise report an error
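The shift/reduce loop above can be sketched with hand-built tables. Everything below (the grammar S -> ( S ) | x, the state numbering, and every table entry) is an assumption obtained by constructing that grammar's automaton by hand; the stack keeps states only, since the symbols paired with them do not affect the parse:

```python
# Hand-built tables for the hypothetical grammar S -> ( S ) | x.
ACTION = {
    (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
    (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
    (3, "("): ("reduce", 0), (3, ")"): ("reduce", 0), (3, "$"): ("reduce", 0),
    (4, ")"): ("shift", 5),
    (5, "("): ("reduce", 1), (5, ")"): ("reduce", 1), (5, "$"): ("reduce", 1),
    (1, "$"): ("accept", None),
}
GOTO = {(0, "S"): 1, (2, "S"): 4}
PRODUCTIONS = [("S", 1), ("S", 3)]   # (lhs, |rhs|): S -> x and S -> ( S )

def lr_parse(tokens):
    stack = [0]                      # stack of states only
    tokens = list(tokens) + ["$"]
    pos = 0
    while True:
        state, t = stack[-1], tokens[pos]
        if (state, t) not in ACTION:
            return False             # error entry
        kind, arg = ACTION[(state, t)]
        if kind == "shift":
            stack.append(arg)
            pos += 1
        elif kind == "reduce":
            lhs, n = PRODUCTIONS[arg]
            del stack[-n:]           # pop one state per RHS symbol
            stack.append(GOTO[(stack[-1], lhs)])
        else:
            return True              # accept
```

The same driver loop works unchanged for the SLR(1), LALR(1), and LR(1) variants discussed below; only the contents of the ACTION and GOTO tables differ.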

LR(1) automata construction algorithm

LR(1) is sometimes mentioned in the literature as CLR(1) (C stands for Canonical).

The algorithm for constructing the LR(1) NFA can be found in [02] p.325:

  1. Begin with a state for each pair (A, t) of a non-terminal A and a lookahead terminal t
  2. For each state (A, t), for each production A -> w:
    1. Construct states A -> x·y [t] for all possible ways of splitting w = xy
    2. Add an e-transition from (A, t) to the state A -> ·w [t]
  3. Add transitions on s between A -> x·sy [t] and A -> xs·y [t]
  4. For each state A -> x·By [t], add an e-transition from A -> x·By [t] to (B, u) for each terminal u in FIRST(yt)

To directly construct the LR(1) DFA:

  1. Begin in a state containing S -> ·w [$] for each production S -> w, where S is the start symbol
  2. Compute the closure of the state:
    • If A -> x·By [t] is in the state, add B -> ·w [u] to the state for each production B -> w and for each terminal u in FIRST(yt)
  3. Repeat until no new states are added:
    • If a state contains an item A -> x·sy [t], add a transition on s from that state to the state containing the closure of A -> xs·y [t]

LR(1) parsing algorithm

  1. Begin with an empty stack and the input set to w$, where w is the string to parse. Set state to the initial state.
  2. Repeat the following:
    • Let the next symbol of input be t
    • If action[state, t] is shift, then shift the input and set state = goto[state, t]
    • If action[state, t] is reduce A -> y:
      • Pop |y| symbols off the stack; replace them with A
      • Let the state atop the stack be top-state
      • Set state = goto[top-state, A]
    • If action[state, t] is accept, the parse is done
    • If action[state, t] is error, report an error

SLR(1)

  • SLR(1) stands for Simple LR(1)

  • Modified LR(0) that uses lookahead to avoid shift/reduce conflicts

  • Only reduce A -> w if the next token is in FOLLOW(A)

  • In SLR(1), a lookahead of FOLLOW(A) practically means "What could follow A somewhere in the grammar?", even if in a particular state A couldn't possibly have that symbol after it.

LALR(1)

  • LALR(1) is the topic discussed in [03]. Almost all of the slide deck is dedicated to explaining in detail the problems that LALR(1) tries to solve. Understanding LALR(1) is a piece of cake if you follow the author's train of thought.

  • LALR(1) stands for Lookahead(1) LR(0)

  • In practice, the LALR(1) automaton can be constructed by merging LR(1) states with equal cores. However, doing so directly is inefficient, as LR(1) automata tend to be large.

  • The resulting LALR(1) automaton cannot have more shift/reduce conflicts than the original LR(1) automaton, since both automata share the same cores. However, extra reduce/reduce conflicts might arise.

  • Two algorithms for more efficient construction of LALR(1) automata:

    • Lazy merging - Start constructing LR(1) states and maintain them in a worklist. As new states are created, their cores are looked up in the worklist and merging takes place.

    • The elegant way [04] - Discussed in "Simple computation of LALR(1) lookahead sets" by Manuel E. Bermudez and George Logothetis:

      1. Construct the LR(0) automaton

      2. Construct an annotated grammar this way; for each item A -> ·w in some state s:

        1. Trace out the path that w takes through the LR(0) automaton starting at s

        2. Replace each non-terminal in A -> ·w with a non-terminal annotated with the states transitioned from and to by the edge labeled with that non-terminal

        3. Replace A with a non-terminal annotated with the start and end state of the transition on A out of s

      3. Compute FOLLOW sets in the resulting grammar. The resulting sets are more precise than the original grammar's FOLLOW sets, as they implicitly hold context information.

      4. Propagate changes - for each item A -> ·w in a state s:

        1. Let A' be the annotated non-terminal corresponding to A, following the transition on A out of s into some state s'.

        2. Trace through the automaton along the path labeled by w. This will lead to a state containing the item A -> w·.

        3. Add to the lookahead of A -> w· the contents of FOLLOW(A').

References

[01] https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/04/Slides04B.pdf

[02] https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/05/Slides05.pdf

[03] https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/06/Slides06.pdf

[04] https://www.sciencedirect.com/science/article/pii/0020019089900793