Complete Indent Block Parsing

Shane Brinkman-Davis Delamore edited this page May 7, 2017 · 12 revisions

Related: Blocks-Instead-of-Brackets

CaffeineScript is founded on the idea that indent-block parsing can be done consistently and universally throughout the language. Other indent-based languages (Python, CoffeeScript) resort to a hack: indent-blocks are inherently context-sensitive, but their parsers (LALR, in CoffeeScript's case) can only handle context-free syntax. So the lexer pass, in effect, inserts "{" and "}" brackets around each block it detects. This approach is fundamentally incompatible with indent-based comments, indent-based strings and other constructs that change how the contents of a given block are parsed. It wouldn't work to insert "{" and "}" around a string-block, whose contents are raw text rather than code.
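To make the problem concrete, here is a minimal sketch of a bracket-inserting pre-pass. It is not taken from Python's or CoffeeScript's source; `insertBrackets` is a hypothetical name, and the sketch assumes space-only indentation, no blank lines, and an unindented first line.

```javascript
// Hypothetical pre-pass: wrap every indented block in "{" and "}".
function insertBrackets(source) {
  const out = [];
  const stack = [0]; // indentation widths of the blocks currently open
  for (const line of source.split("\n")) {
    const indent = line.length - line.trimStart().length;
    if (indent > stack[stack.length - 1]) {
      stack.push(indent);
      out[out.length - 1] += " {"; // the previous line opens the block
    }
    while (indent < stack[stack.length - 1]) {
      stack.pop();
      out.push(" ".repeat(stack[stack.length - 1]) + "}");
    }
    out.push(line);
  }
  // Close whatever is still open at end of input.
  while (stack.length > 1) {
    stack.pop();
    out.push(" ".repeat(stack[stack.length - 1]) + "}");
  }
  return out.join("\n");
}

const bracketed = insertBrackets('if foo\n  bar()\n  baz ""\n    boom()\n  bam()');
console.log(bracketed);
```

The pre-pass has no way to know that the block under `baz ""` is meant to be string content, so it brackets `boom()` exactly like the statements around it.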

It took me several months to figure out how to achieve "complete indent-block parsing" efficiently. My answer was to combine parsing-expression-grammars (PEG) with 'sub-parsing.' Basically, while parsing, when a block-start is expected and detected, a new parser is instantiated and run over the deindented source-text of the block. While subparsing is relatively straightforward, it only works with PEGs, which combine lexing and parsing into one step.

Subparsing Example:

# input:
if foo
  bar()
  baz ""
    boom()
  bam()

# deindented, subparsed block #1, parsing rule: statements
bar()
baz ""
  boom()
bam()

# deindented, subparsed block #2, parsing rule: string
boom()

Output:

if (foo) {
  bar();
  baz("boom()");
  bam();
}

Because a new subparser is started for each block, that block can be parsed with any rule, independent of the grammar around it. CaffeineScript uses this for string-blocks, comment-blocks and regexp-blocks.
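The subparsing example above can be reproduced with a toy sketch. This is not BabelBridgeJS's API: `deindent`, `rules`, `subparse` and `statement` are hypothetical names, the two rules cover only what the example needs, and the input is assumed to be well-formed, space-indented, and free of blank lines.

```javascript
// Remove the common leading indentation from a block's lines.
function deindent(lines) {
  const width = Math.min(...lines.map((l) => l.length - l.trimStart().length));
  return lines.map((l) => l.slice(width)).join("\n");
}

// Toy rules: each one parses the entire text of a block.
const rules = {
  // String block: the raw text becomes a string literal.
  string: (source) => JSON.stringify(source),

  // Statement block: each unindented line is a statement; the indented
  // lines that follow it are deindented and handed to a fresh subparse.
  statements(source) {
    const lines = source.split("\n");
    const out = [];
    for (let i = 0; i < lines.length; ) {
      const head = lines[i++];
      const block = [];
      while (i < lines.length && lines[i].startsWith(" ")) block.push(lines[i++]);
      out.push(statement(head, block.length ? deindent(block) : ""));
    }
    return out.join("\n");
  },
};

// Start a new parse over a block's text with whichever rule the caller expects.
function subparse(source, rule) {
  return rules[rule](source);
}

// The head line decides which rule its block is subparsed with.
function statement(head, block) {
  if (head.startsWith("if ")) {
    const body = subparse(block, "statements").replace(/^/gm, "  ");
    return `if (${head.slice(3)}) {\n${body}\n}`;
  }
  if (head.endsWith(' ""')) {
    return `${head.slice(0, -3)}(${subparse(block, "string")});`;
  }
  return `${head};`;
}

const js = subparse('if foo\n  bar()\n  baz ""\n    boom()\n  bam()', "statements");
console.log(js);
```

The point of the design is in `statement`: the rule for a block is chosen by the code that expected the block, so the same indented text can be read as statements in one place and as raw string content in another.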

The result is my BabelBridgeJS parser library. This library stands on its own. You can use it to write your own parsers, optionally with complete-indent-block-parsing support.
