Improve tokenizer performance #1630
Conversation
There is possibly another performance bottleneck in there still, so I am very curious what the final share of the parser will be after fixing that.
Can you recreate this PR as a new branch in the Dolos repository? This will enable the CI to run on your code as well.
  for (const child of node.namedChildren) {
-   yield* this.tokenizeNode(child);
+   allNodes.push(...childNodes);
This spread operator is probably applied to large lists of childNodes, which could have a large performance impact. In addition, this also copies the items previously added to that list into another list, so passing the tokens up the tree actually becomes a quadratic operation.
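To make the cost concrete, here is an illustrative sketch (not code from this PR; TreeNode and the string tokens are stand-ins for tree-sitter's SyntaxNode and Dolos' Token):

// Stand-in shapes for illustration only.
interface TreeNode { children: TreeNode[]; }
type Token = string;

// Pattern from the diff above: every child's tokens are spread into the parent's list,
// so each token is copied once per ancestor node. For deep trees this approaches quadratic work.
function collectByCopying(node: TreeNode): Token[] {
  const tokens: Token[] = ["("];
  for (const child of node.children) {
    tokens.push(...collectByCopying(child)); // copies everything the child already collected
  }
  tokens.push(")");
  return tokens;
}

// Suggested pattern: one shared array is passed down and each token is pushed exactly once.
function collectInPlace(node: TreeNode, tokens: Token[]): void {
  tokens.push("(");
  for (const child of node.children) {
    collectInPlace(child, tokens);
  }
  tokens.push(")");
}

With the second pattern the total work stays proportional to the number of tokens, regardless of how deep the tree is.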
I suggest passing the array to add to as an argument instead of returning it. This also has the added benefit of allowing a cleaner return type:
public generateTokens(text: string): Token[] {
  const tree = this.parser.parse(text, undefined, { bufferSize: Math.max(32 * 1024, text.length * 2) });
  const tokens: Token[] = [];
  this.tokenizeNode(tree.rootNode, tokens);
  return tokens;
}

private tokenizeNode(node: SyntaxNode, tokens: Token[]): [number, number] {
  const location = new Region(node.startPosition.row, node.startPosition.column, node.endPosition.row, node.endPosition.column);
  tokens.push(this.newToken("(", location));
  tokens.push(this.newToken(node.type, location));
  for (const child of node.namedChildren) {
    const [childStartRow, childStartCol] = this.tokenizeNode(child, tokens);
    // If the code is already captured in one of the children, the region of the current node can be shortened.
    if (childStartRow < location.endRow || (childStartRow === location.endRow && childStartCol < location.endCol)) {
      location.endRow = childStartRow;
      location.endCol = childStartCol;
    }
  }
  tokens.push(this.newToken(")", location));
  return [location.startRow, location.startCol];
}
This was indeed something I already tested last week, but (surprisingly) it didn't have any impact on the execution time.
It does make the return type clearer, though.
Never mind my suggestion to recreate the branch, it somehow started running the CI on this PR.
I think this happened because I just created the new branch in this repo. But I don't know why that triggered the CI in this PR?
This PR improves the tokenizer performance by using a bottom-up algorithm to walk the syntax tree and calculate the corresponding region for each syntax node.
On our large benchmark dataset (1100 submissions), this change brings the time spent tokenizing the files down from 25 seconds to 2.5 seconds, making this step 10 times faster. As more than half of the total analysis time was spent tokenizing, Dolos is now more than twice as fast.
API change: the generateTokens method of Tokenizer and its inheritors CodeTokenizer and CharTokenizer now returns an array (Token[]) instead of yielding an IterableIterator<Token>. In most cases this will not break code using this method.
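For example (illustrative caller code, not taken from the PR; the constructor argument and input string are assumed), a typical for...of loop keeps working because arrays are iterable, while generator-specific calls would need updating:

const tokenizer = new CodeTokenizer("javascript"); // hypothetical construction; the real constructor arguments may differ
const tokens = tokenizer.generateTokens("function f() { return 1; }");

// Still works: a Token[] satisfies the same for...of contract as the old IterableIterator<Token>.
for (const token of tokens) {
  console.log(token);
}

// Newly possible without Array.from(...): array-only conveniences such as length and indexing.
console.log(`parsed ${tokens.length} tokens, first: ${tokens[0]}`);

// Would break: generator-only calls such as tokens.next(), which no longer exist on the returned value.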