Can I pass whitespace and newlines to the parser without having to explicitly consume them? #1934
Replies: 4 comments 2 replies
-
This general approach may work:
Notes:
-
In addition to what @bd82 said, I offer a different approach, which doesn't explicitly consume the tokens, but instead lets you calculate whitespace information from the resulting CST. We use this approach in Langium, where we need this information in LSP services such as formatting: initialize `nodeLocationTracking` with
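As a sketch of what location tracking enables: once start/end offsets are recorded on tokens, the whitespace between two neighbors can be computed without ever parsing it. The token shape below is a stand-in, assuming Chevrotain-style inclusive `endOffset` (the function name is just for illustration):

```ts
// Stand-in for a located token; Chevrotain records startOffset/endOffset
// on ITokens, where endOffset points at the token's LAST character.
interface Located { image: string; startOffset: number; endOffset: number }

// Number of skipped characters (whitespace/newlines) between two
// adjacent tokens, derived purely from location information.
function gapBetween(prev: Located, next: Located): number {
  return next.startOffset - (prev.endOffset + 1)
}

const letTok: Located = { image: 'let', startOffset: 0, endOffset: 2 }
const xTok: Located = { image: 'x', startOffset: 5, endOffset: 5 }
// "let" ends at offset 2 and "x" starts at offset 5,
// so two characters of whitespace were skipped between them
```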
-
Thanks to both of you! This simplified my grammars so much that I wanted to post my implementation in case anybody else runs into the same need. I did have a few situations where whitespace was significant, but most of the time it was not. So I extended Chevrotain's declarations:

```ts
declare module 'chevrotain' {
  interface CstParser {
    consumeInternal(tokType: TokenType, idx: number, options?: ConsumeMethodOpts): IToken;
    consumeToken(): void;
    cstPostTerminal(key: string, consumedToken: IToken): void;
  }
  interface ConsumeMethodOpts {
    noImplicit?: boolean;
  }
}
```

and overrode `consumeInternal` in my parser so that whitespace and newlines are consumed implicitly around every other token:

```ts
private consumeImplicits(options?: ConsumeMethodOpts) {
  let checkForMore = false;
  do {
    const nextToken = this.LA(1);
    switch (nextToken.tokenType) {
      case Whitespace:
      case EOL:
        this.consumeToken();
        this.cstPostTerminal(options?.LABEL ?? nextToken.tokenType.name, nextToken);
        checkForMore = true;
        break;
      default:
        checkForMore = false;
        break;
    }
  } while (checkForMore);
}

consumeInternal(tokType: TokenType, idx: number, options?: ConsumeMethodOpts): IToken {
  if (!options?.noImplicit) this.consumeImplicits(options);
  const retVal = super.consumeInternal(tokType, idx, options);
  if (!options?.noImplicit) this.consumeImplicits(options);
  return retVal;
}
```

There could be better ways, but this allowed me to remove dozens of explicit whitespace consumptions from my grammar rules. Thanks again.
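A standalone sketch of the same loop, stripped of Chevrotain types so the control flow is visible on its own (token shapes and names here are illustrative, not the library's):

```ts
interface TokType { name: string }
interface Tok { tokenType: TokType; image: string }

const Whitespace: TokType = { name: 'Whitespace' }
const EOL: TokType = { name: 'EOL' }
const Ident: TokType = { name: 'Ident' }

// Consume any run of whitespace/EOL tokens starting at idx, mirroring
// the do/while above: keep going until a non-implicit token is seen.
function consumeImplicits(tokens: Tok[], idx: number): { consumed: Tok[]; next: number } {
  const consumed: Tok[] = []
  while (idx < tokens.length) {
    const t = tokens[idx]
    if (t.tokenType === Whitespace || t.tokenType === EOL) {
      consumed.push(t)
      idx++
    } else {
      break
    }
  }
  return { consumed, next: idx }
}

const stream: Tok[] = [
  { tokenType: Whitespace, image: ' ' },
  { tokenType: EOL, image: '\n' },
  { tokenType: Ident, image: 'foo' }
]
// consumeImplicits(stream, 0) eats the space and newline and stops at 'foo'
```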
-
Oh wow, I wish I had found this earlier, but I'm glad I found it! This is very similar to my question here, and I'm surprised this discussion wasn't referenced, as it's nearly the exact same use case.

I too found working with "skipped" tokens difficult in Chevrotain: there are many scenarios where you don't want them considered on most parsing paths, but you still want to use them for some parsing decisions and/or output. Many languages, such as TypeScript or Less, preserve comments but don't want them considered when parsing. I tried a variation of @pcafstockf's solution, but I just couldn't get it to work, perhaps because I'm using the chevrotain-allstar package. There seemed to be way too many cases where, for example, look-aheads would break, even if I customized the lookahead strategy.

So I instead took a different approach, which preserves "skipped" tokens but removes them entirely from consideration by Chevrotain's internals (which I think should make things faster than the other proposed solutions?). First, I used a Chevrotain augmentation like this:

```ts
import type {
  TokenType,
  IToken,
  CstElement,
  ConsumeMethodOpts,
  CstChildrenDictionary,
  CstNodeLocation
} from 'chevrotain'

declare module 'chevrotain' {
  interface CstParser {
    consumeInternal(tokType: TokenType, idx: number, options?: ConsumeMethodOpts): IToken
    cstPostTerminal(key: string, consumedToken: IToken): void
    setInitialNodeLocation(node: CstNode): void
    setNodeLocationFromToken: (
      nodeLocation: CstNodeLocation,
      locationInformation: CstNodeLocation,
    ) => void
    setNodeLocationFromNode: (
      nodeLocation: CstNodeLocation,
      locationInformation: CstNodeLocation,
    ) => void
    // @ts-ignore - this is incorrectly defined as a data property in Chevrotain's public API
    set input(value: IToken[])
    // @ts-ignore
    get input(): IToken[]
    currIdx: number
    CST_STACK: CstNode[]
  }

  interface CstNode {
    readonly name: string
    readonly children: CstChildrenDictionary
    readonly recoveredNode?: boolean
    readonly location?: CstNodeLocation
    /** Extension */
    childrenStream: CstElement[]
  }
}
```

Then I overrode the `input` setter to separate skipped tokens into their own map:
```ts
/** Separate skipped tokens into a new map */
// eslint-disable-next-line accessor-pairs
set input(value: IToken[]) {
  const skippedTokens = new Map<number, IToken[]>()
  const inputTokens: IToken[] = []
  let foundTokens: number = 0
  for (let i = 0; i < value.length; i++) {
    const token = value[i]
    if (token.tokenType.LABEL === SKIPPED_LABEL) {
      const tokens = skippedTokens.get(foundTokens) ?? []
      skippedTokens.set(foundTokens, [...tokens, token])
    } else {
      inputTokens.push(token)
      foundTokens++
    }
  }
  this.skippedTokens = skippedTokens
  super.input = inputTokens
}
```
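The effect of this setter can be sketched with plain objects: skipped tokens are bucketed under the index of the next real token, so the stream handed to the parser never contains them (the shapes below are illustrative, not Chevrotain's):

```ts
interface RawTok { image: string; skipped: boolean }

// Partition a lexed stream: real tokens go to `input`, skipped tokens go
// into a map keyed by the index the NEXT real token will occupy.
function partitionSkipped(all: RawTok[]): { input: RawTok[]; skippedAt: Map<number, RawTok[]> } {
  const skippedAt = new Map<number, RawTok[]>()
  const input: RawTok[] = []
  for (const tok of all) {
    if (tok.skipped) {
      const bucket = skippedAt.get(input.length) ?? []
      bucket.push(tok)
      skippedAt.set(input.length, bucket)
    } else {
      input.push(tok)
    }
  }
  return { input, skippedAt }
}

const lexed: RawTok[] = [
  { image: 'a', skipped: false },
  { image: ' ', skipped: true },
  { image: 'b', skipped: false }
]
// partitionSkipped(lexed) yields input [a, b] with the space bucketed at index 1
```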
The consumption overrides then pull the skipped tokens for the current position out of that map and attach them to the CST:

```ts
private _consumeImplicits(key: 'pre' | 'post') {
  const skipped = this.skippedTokens.get(this.currIdx + 1)
  if (skipped) {
    if (key === 'pre' || this.LA(1).tokenType === EOF) {
      skipped.forEach(token => this.cstPostTerminal(key, token))
    }
  }
}

consumeInternal(tokType: TokenType, idx: number, options?: ConsumeMethodOpts): IToken {
  this._consumeImplicits('pre')
  const retVal = super.consumeInternal(tokType, idx, options)
  this._consumeImplicits('post')
  return retVal
}
```

However, as a last touch, I re-worked an earlier solution I had for CstNode serialization. One of my gripes with the current structure of a Chevrotain CstNode is that it's not generically serializable. (Don't get me wrong, @bd82, everything in general about this library is amazing.) You essentially have to know the exact parsing structure to know which of the children nodes were captured in what order. So if you want to re-serialize a single node, you would essentially have to replicate your parsing rule as a serialization rule. (Alternatively, you could recursively traverse the CstNodes and collect all the Tokens, and then sort them by their offset, but I wanted something simpler and faster.) That's the purpose of the `childrenStream` extension above. Doing that requires overriding, and in some cases completely replicating, some Chevrotain methods and functions. So, those changes look like this:
```ts
cstPostTerminal(
  key: string,
  consumedToken: IToken
): void {
  const rootCst = this.CST_STACK[this.CST_STACK.length - 1]
  this.addTerminalToCst(rootCst, consumedToken, key)
  this.setNodeLocationFromToken(rootCst.location!, <any>consumedToken)
}

cstPostNonTerminal(
  ruleCstResult: CstNode,
  ruleName: string
): void {
  const preCstNode = this.CST_STACK[this.CST_STACK.length - 1]
  this.addNoneTerminalToCst(preCstNode, ruleName, ruleCstResult)
  this.setNodeLocationFromNode(preCstNode.location!, ruleCstResult.location!)
}

cstInvocationStateUpdate(this: CstParser, fullRuleName: string): void {
  const cstNode: Partial<CstNode> = {
    name: fullRuleName,
    children: Object.create(null)
  }
  /**
   * Sets a linear stream of children CstNodes and ITokens
   * which can easily be re-serialized.
   */
  Object.defineProperty(cstNode, 'childrenStream', {
    value: []
  })
  this.setInitialNodeLocation(cstNode as CstNode)
  this.CST_STACK.push(cstNode as CstNode)
}

addTerminalToCst(node: CstNode, token: IToken, tokenTypeName: string) {
  node.childrenStream.push(token)
  if (node.children[tokenTypeName] === undefined) {
    node.children[tokenTypeName] = [token]
  } else {
    node.children[tokenTypeName].push(token)
  }
}

addNoneTerminalToCst(node: CstNode, ruleName: string, ruleResult: any) {
  this.addTerminalToCst(node, ruleResult, ruleName)
}
```
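The key invariant of `addTerminalToCst` above is that every child lands in two places: grouped by name in `children` (Chevrotain's normal CST shape) and in source order in `childrenStream`. A minimal standalone version of that bookkeeping (shapes are illustrative):

```ts
interface StreamCst {
  children: Record<string, unknown[]>
  childrenStream: unknown[]
}

// Record a child both grouped by key and in linear source order.
function addChild(node: StreamCst, key: string, child: unknown): void {
  node.childrenStream.push(child)
  const bucket = node.children[key]
  if (bucket === undefined) {
    node.children[key] = [child]
  } else {
    bucket.push(child)
  }
}

const node: StreamCst = { children: {}, childrenStream: [] }
addChild(node, 'Ident', 'a')
addChild(node, 'Comma', ',')
addChild(node, 'Ident', 'b')
// children groups by key; childrenStream preserves the original order
```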
To re-serialize a CstNode, I use this function:

```ts
export const stringify = (cst: CstNode): string => {
  let output = ''
  const recurseCst = (node: CstNode | IToken): void => {
    if (!node) {
      return
    }
    if ('name' in node) {
      node.childrenStream.forEach(child => { recurseCst(child) })
      return
    }
    output += node.image
  }
  recurseCst(cst)
  return output
}
```
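With `childrenStream` in place, round-tripping is just an in-order walk. A self-contained version of the same traversal, using minimal node/token shapes (illustrative, not Chevrotain's actual interfaces):

```ts
type Leaf = { image: string }
type StreamNode = { name: string; childrenStream: Array<StreamNode | Leaf> }

// In-order walk: concatenate token images exactly as they appeared in the input.
function stringifyStream(node: StreamNode | Leaf): string {
  if ('name' in node) {
    return node.childrenStream.map(stringifyStream).join('')
  }
  return node.image
}

const tree: StreamNode = {
  name: 'decl',
  childrenStream: [
    { image: 'let' },
    { image: ' ' },
    { name: 'id', childrenStream: [{ image: 'x' }] }
  ]
}
// stringifyStream(tree) reproduces "let x", whitespace included
```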
Now, to @pcafstockf's question, there's still the issue of how to use skipped tokens during parsing. In my parser, I have two convenience methods for that:

```ts
/**
 * Used in a GATE.
 * Determines if there is whitespace before the next token.
 */
hasWS() {
  const skipped = this.skippedTokens.get(this.currIdx + 1)
  if (!skipped) {
    return false
  }
  return !!skipped.find(token => token.tokenType === this.T.WS)
}

/**
 * Used in a GATE.
 * Affirms that there is NOT whitespace or a comment before the next token.
 */
noSep() {
  return !this.skippedTokens.get(this.currIdx + 1)
}
```

I hope that helps! And thanks so much to everyone who posted their own solutions!
-
I would like CST nodes from the parser to be a full and accurate representation of the file that was parsed.
On the other hand, I do not want to have to explicitly consume every whitespace and newline token in every possible grammar rule where they could occur.
Lexer.SKIPPED tokens do not seem to get passed to CSTParser at all.
CSTParser.canTokenTypeBeDeletedInRecovery is the general idea, but it could have problems with whitespace and newlines back to back, and more importantly it removes them from the parser's CST output.
Maybe something like a Lexer.IMPLICITLY_CONSUMED group on a token?
This whole concept is important to me because I want to be able to re-create the input file exactly as it was using only a final parsed tree of CST nodes.
Any ideas/suggestions would be much appreciated.
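For context, this is roughly how tokens end up skipped in the first place. `createToken` and the `Lexer.SKIPPED` group are Chevrotain's actual API; the token names and patterns below are just example choices:

```ts
import { createToken, Lexer } from 'chevrotain'

// Tokens in the SKIPPED group are matched by the lexer but never
// forwarded to the parser -- which is exactly the problem described above.
const Whitespace = createToken({
  name: 'Whitespace',
  pattern: /[ \t]+/,
  group: Lexer.SKIPPED
})

const Newline = createToken({
  name: 'Newline',
  pattern: /\r?\n/,
  group: Lexer.SKIPPED
})
```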