Skip to content

Lossless Syntax Tree Pattern

andychu edited this page Jun 28, 2024 · 37 revisions

Feel free to add more resources to this page or clean it up!

Hacker News Comments on "From AST to Lossless Syntax Tree"

Other Names

IDE Architecture

  • Roslyn -- Microsoft's C# compiler platform

  • JetBrains Implementing Parser and PSI -- The AST nodes have a direct mapping to text ranges in the underlying document. The bottom-most nodes of the AST match individual tokens returned by the lexer, and higher level nodes match multiple-token fragments. Operations performed on nodes of the AST tree, such as inserting, removing, reordering nodes and so on, are immediately reflected as changes to the text of the underlying document.

    • There is currently no ready way to reuse existing language grammars, for example, from ANTLR, for creating custom language parsers. The parsers need to be coded manually. -- this should go in my meta-language rant
    • Custom language parser and PSI classes can be generated from grammars using Grammar-Kit plugin. Besides code generation it provides various features for editing grammar files: syntax highlighting, quick navigation, refactorings and more. The Grammar-Kit plugin is built using its own engine and its source code can be found on GitHub.
    • Comma, an IntelliJ-Based IDE for Raku (YouTube). Nice video with concrete experiences. He talks about PSI nodes, how they are memory intensive and have stubs, etc.
  • Making Tools Kythe-Compatible -- I would have thought that Kythe had the LST concept, because it's basically an IDE, but I don't see it. I guess that is because it's read-only?

Lossless Representations

  • cst for JavaScript -- CST means Concrete Syntax Tree. Unlike an AST (Abstract Syntax Tree), a CST contains all the information from the JavaScript source file: whitespace, punctuators, comments. This information is extremely useful for code style checkers and other code linters. CST is also useful for cases when you need to apply modifications to existing JavaScript files while preserving the initial file formatting.

  • JRuby Parser -- JRuby once had a parser which kept track of all sorts of extra information when it built it's Abstract Syntax Tree (AST). Stuff like character offsets where a particular element started or ended. The impact of this extra information was a more than noticeable amount of memory and a bit of a perf impact. At the time we decided to discontinue having this sort of parser in JRuby we created JRubyParser.* -- related to Parsing Is Difficult

  • Swift libSyntax (thanks to CAD1997 )

Currently, lib/AST structures don't make a very clear distinction between syntactic and semantic information. Long term, we hope to achieve the following based on work here: In no particular order, here is a summary of the design and implementation points for this library:

  • Represent Swift source with "full fidelity" - parsing a source file and printing the syntax tree should result in the same file.
  • Provide good structured editing APIs at all granularities.
  • Accommodate "bad syntax" - humans are imperfect and source code is constantly in a state of flux in an editor.

Style-Preserving Source Translators

  • 2to3 -- for Python 2 to Python 3

  • RedBaron -- RedBaron, by relying on Baron, uses a Full Syntax Tree (FST). It’s like an AST except it keeps every information, included formatting, and is then a lossless representation of the source code.

  • Facebook's jscodeshift -- uses the recast library

    • recast - JavaScript syntax tree transformer, nondestructive pretty-printer, and automatic source map generator
  • Facebook's pfff -- Tools for code analysis, visualizations, or style-preserving source transformation

  • External Clang Examples -- has clang-mutate, but it's dormant

Source Reformatters

TODO: What data structures do these use?

Other Docs / Designs / Discussions

  • Haskell GHC AST Annotations -- Simplify the roundtripping and modification of source by explicitly capturing the missing location information for the syntactic markers

  • What to do about comments? on Lambda the Ultimate

  • https://news.ycombinator.com/item?id=13914218 -- From C# compiler dev -- Resilient parsing. This is the big one! If you give our parser a string that is illegal according to the grammar, our parser will still give you a syntax tree! (We'll also spit errors out). But getting a syntax tree regardless of the actual validity of the program being passed in means that the IDE can give autocomplete and report type-checking error messages. As an example, the code "var x = velocity." is invalid C#. However, in order to give autocomplete on "velocity", that code needs to be parsed into an AST, and then typechecked, and then we can extract the members on the type in order to provide a good user experience.

Transpilers

These don't need to use the Lossless Syntax Tree because the resulting code won't be edited by a human.

  • CoffeeScript, etc.

Related

Clone this wiki locally