Can I pass whitespace and newlines to the parser without having to explicitly consume them? #1934
Replies: 4 comments 2 replies
-
This general approach may work:
Notes:
-
In addition to what @bd82 said, I offer a different approach, which doesn't explicitly consume the tokens, but instead lets you calculate whitespace information from the resulting CST. We use this approach in Langium, where we need this information in LSP services such as formatting: initialize `nodeLocationTracking` with
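As a sketch of what location tracking enables: once start/end offsets are recorded on tokens, the whitespace between two neighbors can be computed without ever parsing it. The token shape below is a stand-in, assuming Chevrotain-style inclusive `endOffset` (the function name is just for illustration):

```ts
// Stand-in for a located token; Chevrotain records startOffset/endOffset
// on ITokens, where endOffset points at the token's LAST character.
interface Located { image: string; startOffset: number; endOffset: number }

// Number of skipped characters (whitespace/newlines) between two
// adjacent tokens, derived purely from location information.
function gapBetween(prev: Located, next: Located): number {
  return next.startOffset - (prev.endOffset + 1)
}

const letTok: Located = { image: 'let', startOffset: 0, endOffset: 2 }
const xTok: Located = { image: 'x', startOffset: 5, endOffset: 5 }
// "let" ends at offset 2 and "x" starts at offset 5,
// so two characters of whitespace were skipped between them
```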
-
Thanks to both of you! This simplified my grammars so much that I wanted to post my implementation in case anybody else runs into the same need. I did have a few situations where whitespace was significant, but most of the time it was not. So I extended Chevrotain's declarations:

```ts
declare module 'chevrotain' {
  interface CstParser {
    consumeInternal(tokType: TokenType, idx: number, options?: ConsumeMethodOpts): IToken;
    consumeToken(): void;
    cstPostTerminal(key: string, consumedToken: IToken): void;
  }
  interface ConsumeMethodOpts {
    noImplicit?: boolean;
  }
}
```

and overrode `consumeInternal` in my parser so that whitespace and newlines are consumed implicitly around every other token:

```ts
private consumeImplicits(options?: ConsumeMethodOpts) {
  let checkForMore = false;
  do {
    const nextToken = this.LA(1);
    switch (nextToken.tokenType) {
      case Whitespace:
      case EOL:
        this.consumeToken();
        this.cstPostTerminal(options?.LABEL ?? nextToken.tokenType.name, nextToken);
        checkForMore = true;
        break;
      default:
        checkForMore = false;
        break;
    }
  } while (checkForMore);
}

consumeInternal(tokType: TokenType, idx: number, options?: ConsumeMethodOpts): IToken {
  if (!options?.noImplicit) this.consumeImplicits(options);
  const retVal = super.consumeInternal(tokType, idx, options);
  if (!options?.noImplicit) this.consumeImplicits(options);
  return retVal;
}
```

There could be better ways, but this allowed me to remove dozens of explicit whitespace consumptions from my grammar rules. Thanks again.
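A standalone sketch of the same loop, stripped of Chevrotain types so the control flow is visible on its own (token shapes and names here are illustrative, not the library's):

```ts
interface TokType { name: string }
interface Tok { tokenType: TokType; image: string }

const Whitespace: TokType = { name: 'Whitespace' }
const EOL: TokType = { name: 'EOL' }
const Ident: TokType = { name: 'Ident' }

// Consume any run of whitespace/EOL tokens starting at idx, mirroring
// the do/while above: keep going until a non-implicit token is seen.
function consumeImplicits(tokens: Tok[], idx: number): { consumed: Tok[]; next: number } {
  const consumed: Tok[] = []
  while (idx < tokens.length) {
    const t = tokens[idx]
    if (t.tokenType === Whitespace || t.tokenType === EOL) {
      consumed.push(t)
      idx++
    } else {
      break
    }
  }
  return { consumed, next: idx }
}

const stream: Tok[] = [
  { tokenType: Whitespace, image: ' ' },
  { tokenType: EOL, image: '\n' },
  { tokenType: Ident, image: 'foo' }
]
// consumeImplicits(stream, 0) eats the space and newline and stops at 'foo'
```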
-
Oh wow, I wish I had found this earlier, but I'm glad I found it! This is very similar to my question here, and I'm surprised this discussion wasn't referenced, as it's nearly the exact same use case.

I too found working with "skipped" tokens difficult in Chevrotain: there are many scenarios where you don't want them considered on most parsing paths, but you still want to use them for some parsing decisions and/or output. Many languages, such as TypeScript or Less, preserve comments but don't want them considered when parsing. I tried a variation of @pcafstockf's solution, but I just couldn't get it to work, perhaps because I'm using the chevrotain-allstar package. There seemed to be way too many cases where, for example, look-aheads would break, even if I customized the lookahead strategy.

So I instead took a different approach, which preserves "skipped" tokens but removes them entirely from consideration by Chevrotain's internals (which I think should make things faster than the other proposed solutions?). First, I used a Chevrotain augmentation like this:

```ts
import type {
  TokenType,
  IToken,
  CstElement,
  ConsumeMethodOpts,
  CstChildrenDictionary,
  CstNodeLocation
} from 'chevrotain'

declare module 'chevrotain' {
  interface CstParser {
    consumeInternal(tokType: TokenType, idx: number, options?: ConsumeMethodOpts): IToken
    cstPostTerminal(key: string, consumedToken: IToken): void
    setInitialNodeLocation(node: CstNode): void
    setNodeLocationFromToken: (
      nodeLocation: CstNodeLocation,
      locationInformation: CstNodeLocation,
    ) => void
    setNodeLocationFromNode: (
      nodeLocation: CstNodeLocation,
      locationInformation: CstNodeLocation,
    ) => void
    // @ts-ignore - this is incorrectly defined as a data property in Chevrotain's public API
    set input(value: IToken[])
    // @ts-ignore
    get input(): IToken[]
    currIdx: number
    CST_STACK: CstNode[]
  }

  interface CstNode {
    readonly name: string
    readonly children: CstChildrenDictionary
    readonly recoveredNode?: boolean
    readonly location?: CstNodeLocation
    /** Extension */
    childrenStream: CstElement[]
  }
}
```

Then I overrode the `input` setter to separate skipped tokens into their own map:
```ts
/** Separate skipped tokens into a new map */
// eslint-disable-next-line accessor-pairs
set input(value: IToken[]) {
  const skippedTokens = new Map<number, IToken[]>()
  const inputTokens: IToken[] = []
  let foundTokens: number = 0
  for (let i = 0; i < value.length; i++) {
    const token = value[i]
    if (token.tokenType.LABEL === SKIPPED_LABEL) {
      const tokens = skippedTokens.get(foundTokens) ?? []
      skippedTokens.set(foundTokens, [...tokens, token])
    } else {
      inputTokens.push(token)
      foundTokens++
    }
  }
  this.skippedTokens = skippedTokens
  super.input = inputTokens
}
```
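The effect of this setter can be sketched with plain objects: skipped tokens are bucketed under the index of the next real token, so the stream handed to the parser never contains them (the shapes below are illustrative, not Chevrotain's):

```ts
interface RawTok { image: string; skipped: boolean }

// Partition a lexed stream: real tokens go to `input`, skipped tokens go
// into a map keyed by the index the NEXT real token will occupy.
function partitionSkipped(all: RawTok[]): { input: RawTok[]; skippedAt: Map<number, RawTok[]> } {
  const skippedAt = new Map<number, RawTok[]>()
  const input: RawTok[] = []
  for (const tok of all) {
    if (tok.skipped) {
      const bucket = skippedAt.get(input.length) ?? []
      bucket.push(tok)
      skippedAt.set(input.length, bucket)
    } else {
      input.push(tok)
    }
  }
  return { input, skippedAt }
}

const lexed: RawTok[] = [
  { image: 'a', skipped: false },
  { image: ' ', skipped: true },
  { image: 'b', skipped: false }
]
// partitionSkipped(lexed) yields input [a, b] with the space bucketed at index 1
```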
The consumption overrides then pull the skipped tokens for the current position out of that map and attach them to the CST:

```ts
private _consumeImplicits(key: 'pre' | 'post') {
  const skipped = this.skippedTokens.get(this.currIdx + 1)
  if (skipped) {
    if (key === 'pre' || this.LA(1).tokenType === EOF) {
      skipped.forEach(token => this.cstPostTerminal(key, token))
    }
  }
}

consumeInternal(tokType: TokenType, idx: number, options?: ConsumeMethodOpts): IToken {
  this._consumeImplicits('pre')
  const retVal = super.consumeInternal(tokType, idx, options)
  this._consumeImplicits('post')
  return retVal
}
```

However, as a last touch, I re-worked an earlier solution I had for CstNode serialization. One of my gripes with the current structure of a Chevrotain CstNode is that it's not generically serializable. (Don't get me wrong, @bd82, everything in general about this library is amazing.) You essentially have to know the exact parsing structure to know which of the children nodes were captured in what order. So if you want to re-serialize a single node, you would essentially have to replicate your parsing rule as a serialization rule. (Alternatively, you could recursively traverse the CstNodes and collect all the Tokens, and then sort them by their offset, but I wanted something simpler and faster.) That's the purpose of the `childrenStream` extension above. Doing that requires overriding, and in some cases completely replicating, some Chevrotain methods and functions. So, those changes look like this:
```ts
cstPostTerminal(
  key: string,
  consumedToken: IToken
): void {
  const rootCst = this.CST_STACK[this.CST_STACK.length - 1]
  this.addTerminalToCst(rootCst, consumedToken, key)
  this.setNodeLocationFromToken(rootCst.location!, <any>consumedToken)
}

cstPostNonTerminal(
  ruleCstResult: CstNode,
  ruleName: string
): void {
  const preCstNode = this.CST_STACK[this.CST_STACK.length - 1]
  this.addNoneTerminalToCst(preCstNode, ruleName, ruleCstResult)
  this.setNodeLocationFromNode(preCstNode.location!, ruleCstResult.location!)
}

cstInvocationStateUpdate(this: CstParser, fullRuleName: string): void {
  const cstNode: Partial<CstNode> = {
    name: fullRuleName,
    children: Object.create(null)
  }
  /**
   * Sets a linear stream of children CstNodes and ITokens
   * which can easily be re-serialized.
   */
  Object.defineProperty(cstNode, 'childrenStream', {
    value: []
  })
  this.setInitialNodeLocation(cstNode as CstNode)
  this.CST_STACK.push(cstNode as CstNode)
}

addTerminalToCst(node: CstNode, token: IToken, tokenTypeName: string) {
  node.childrenStream.push(token)
  if (node.children[tokenTypeName] === undefined) {
    node.children[tokenTypeName] = [token]
  } else {
    node.children[tokenTypeName].push(token)
  }
}

addNoneTerminalToCst(node: CstNode, ruleName: string, ruleResult: any) {
  this.addTerminalToCst(node, ruleResult, ruleName)
}
```
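The key invariant of `addTerminalToCst` above is that every child lands in two places: grouped by name in `children` (Chevrotain's normal CST shape) and in source order in `childrenStream`. A minimal standalone version of that bookkeeping (shapes are illustrative):

```ts
interface StreamCst {
  children: Record<string, unknown[]>
  childrenStream: unknown[]
}

// Record a child both grouped by key and in linear source order.
function addChild(node: StreamCst, key: string, child: unknown): void {
  node.childrenStream.push(child)
  const bucket = node.children[key]
  if (bucket === undefined) {
    node.children[key] = [child]
  } else {
    bucket.push(child)
  }
}

const node: StreamCst = { children: {}, childrenStream: [] }
addChild(node, 'Ident', 'a')
addChild(node, 'Comma', ',')
addChild(node, 'Ident', 'b')
// children groups by key; childrenStream preserves the original order
```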
To re-serialize a CstNode, I use this function:

```ts
export const stringify = (cst: CstNode): string => {
  let output = ''
  const recurseCst = (node: CstNode | IToken): void => {
    if (!node) {
      return
    }
    if ('name' in node) {
      node.childrenStream.forEach(child => { recurseCst(child) })
      return
    }
    output += node.image
  }
  recurseCst(cst)
  return output
}
```
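With `childrenStream` in place, round-tripping is just an in-order walk. A self-contained version of the same traversal, using minimal node/token shapes (illustrative, not Chevrotain's actual interfaces):

```ts
type Leaf = { image: string }
type StreamNode = { name: string; childrenStream: Array<StreamNode | Leaf> }

// In-order walk: concatenate token images exactly as they appeared in the input.
function stringifyStream(node: StreamNode | Leaf): string {
  if ('name' in node) {
    return node.childrenStream.map(stringifyStream).join('')
  }
  return node.image
}

const tree: StreamNode = {
  name: 'decl',
  childrenStream: [
    { image: 'let' },
    { image: ' ' },
    { name: 'id', childrenStream: [{ image: 'x' }] }
  ]
}
// stringifyStream(tree) reproduces "let x", whitespace included
```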
Now, to @pcafstockf's question, there's still the issue of how to use skipped tokens during parsing. In my parser, I have two convenience methods for that:

```ts
/**
 * Used in a GATE.
 * Determines if there is whitespace before the next token.
 */
hasWS() {
  const skipped = this.skippedTokens.get(this.currIdx + 1)
  if (!skipped) {
    return false
  }
  return !!skipped.find(token => token.tokenType === this.T.WS)
}

/**
 * Used in a GATE.
 * Affirms that there is NOT whitespace or a comment before the next token.
 */
noSep() {
  return !this.skippedTokens.get(this.currIdx + 1)
}
```

I hope that helps! And thanks so much to everyone who posted their own solutions!
-
I would like CST nodes from the parser to be a full and accurate representation of the file that was parsed.
On the other hand, I do not want to have to explicitly consume every whitespace and newline token in every possible grammar rule where they could occur.
Lexer.SKIPPED tokens do not seem to get passed to CSTParser at all.
CSTParser.canTokenTypeBeDeletedInRecovery is the general idea, but it could have problems with whitespace and newlines back to back, and more importantly it removes them from the parser's CST output.
Maybe something like a Lexer.IMPLICITLY_CONSUMED group on a token?
This whole concept is important to me because I want to be able to re-create the input file exactly as it was using only a final parsed tree of CST nodes.
Any ideas/suggestions would be much appreciated.
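For context, this is roughly how tokens end up skipped in the first place. `createToken` and the `Lexer.SKIPPED` group are Chevrotain's actual API; the token names and patterns below are just example choices:

```ts
import { createToken, Lexer } from 'chevrotain'

// Tokens in the SKIPPED group are matched by the lexer but never
// forwarded to the parser -- which is exactly the problem described above.
const Whitespace = createToken({
  name: 'Whitespace',
  pattern: /[ \t]+/,
  group: Lexer.SKIPPED
})

const Newline = createToken({
  name: 'Newline',
  pattern: /\r?\n/,
  group: Lexer.SKIPPED
})
```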