
Improve tokenizer performance #1630

Merged: 9 commits into dodona-edu:main on Oct 24, 2024
Conversation

@milachae (Collaborator) commented on Oct 22, 2024

This PR improves tokenizer performance by using a bottom-up algorithm that walks the syntax tree and calculates the corresponding region for each syntax node.

On our large benchmark dataset (1100 submissions), this change brings the time spent tokenizing the files down from 25 seconds to 2.5 seconds, making this step 10 times faster. As more than half of the total analysis time was spent tokenizing, Dolos as a whole is now more than twice as fast.
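To make that last claim concrete with an illustrative (not measured) total: if a full analysis took about 40 seconds, of which 25 seconds was tokenization, it would now take roughly

$$40 - (25 - 2.5) = 17.5 \text{ seconds}, \quad \text{a speed-up of } \tfrac{40}{17.5} \approx 2.3\times.$$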

API change: the generateTokens method of Tokenizer and its subclasses CodeTokenizer and CharTokenizer now returns a Token[] array instead of yielding an IterableIterator<Token>. In most cases this will not break code that uses this method.
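As a minimal sketch of what this means for callers (the tokenizer, text and process names below are placeholders, not taken from the Dolos codebase):

// Placeholder declarations standing in for real values in calling code.
declare const tokenizer: CodeTokenizer;
declare const text: string;
declare function process(token: Token): void;

// Before this PR, generateTokens was a generator that yielded tokens lazily:
//   for (const token of tokenizer.generateTokens(text)) { process(token); }
// After this PR it returns a plain Token[], which is still iterable, so the
// same for...of loop keeps working:
const tokens: Token[] = tokenizer.generateTokens(text);
for (const token of tokens) {
  process(token);
}
// Only callers that relied on lazy evaluation (for example, breaking out of
// the loop early to avoid tokenizing the rest of the file) will notice a
// difference, since the whole array is now built up front.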

@milachae requested a review from rien on October 22, 2024 09:33
@milachae self-assigned this on Oct 22, 2024
@rien (Member) left a comment


There is possibly still another performance bottleneck in here, so I am very curious what the parser's final share of the total time will be after fixing that.

Can you recreate this PR as a new branch in the Dolos repository? This will enable the CI to run on your code as well.


for (const child of node.namedChildren) {
  yield* this.tokenizeNode(child);
  allNodes.push(...childNodes);
@rien (Member) commented on the diff above:

This spread operator is probably applied to large lists of childNodes, which could have a large performance impact.

In addition, it copies the items previously added to that list into yet another list. Passing the tokens up the tree then becomes a $\Theta(n^2)$ operation, since some tokens are copied multiple times.

I suggest passing the array to append to as an argument instead of returning it. This also has the added benefit of allowing a cleaner return type:

public generateTokens(text: string): Token[] {
  const tree = this.parser.parse(text, undefined, { bufferSize: Math.max(32 * 1024, text.length * 2) });
  const tokens: Token[] = [];
  this.tokenizeNode(tree.rootNode, tokens);
  return tokens;
}

private tokenizeNode(node: SyntaxNode, tokens: Token[]): [number, number] {
  const location = new Region(node.startPosition.row, node.startPosition.column, node.endPosition.row, node.endPosition.column);
  tokens.push(this.newToken("(", location));
  tokens.push(this.newToken(node.type, location));
  for (const child of node.namedChildren) {
    const [childStartRow, childStartCol] = this.tokenizeNode(child, tokens);
    // If the code is already captured in one of the children, the region of the current node can be shortened.
    if ((childStartRow < location.endRow) || (childStartRow === location.endRow && childStartCol < location.endCol)) {
      location.endRow = childStartRow;
      location.endCol = childStartCol;
    }
  }
  tokens.push(this.newToken(")", location));
  return [location.startRow, location.startCol];
}
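As an aside, the quadratic-copying argument can be seen in a tiny self-contained sketch; plain numbers and a linear chain stand in for Token objects and the syntax tree, so this is illustrative rather than Dolos code:

// Return-and-spread: every level copies the entire result of its child,
// so a value produced near the leaves is copied once per ancestor.
// For deep trees the total copying is quadratic in the number of values.
function collectBySpread(depth: number): number[] {
  const own = [depth];
  if (depth > 0) {
    own.push(...collectBySpread(depth - 1)); // copies the child's whole array
  }
  return own;
}

// Accumulator-passing (as suggested above): each value is pushed exactly once,
// so the total work is linear.
function collectIntoAccumulator(depth: number, out: number[]): void {
  out.push(depth);
  if (depth > 0) {
    collectIntoAccumulator(depth - 1, out);
  }
}

const acc: number[] = [];
collectIntoAccumulator(5, acc);
console.log(collectBySpread(5)); // [5, 4, 3, 2, 1, 0]
console.log(acc);                // [5, 4, 3, 2, 1, 0]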

@milachae (Collaborator, Author) replied:

This was indeed something I already tested last week, but (surprisingly) it didn't have any impact on the execution time.

It does, however, make the return type clearer.

@rien (Member) commented on Oct 23, 2024

Never mind my suggestion to recreate the branch; the CI somehow started running on this PR anyway.

@milachae (Collaborator, Author) commented:

> Never mind my suggestion to recreate the branch; the CI somehow started running on this PR anyway.

I think this happened because I just created the new branch in this repo, but I don't know why that triggered the CI on this PR.

@rien changed the title from "Improve the tokenizer" to "Improve tokenizer performance" on Oct 24, 2024
@rien merged commit bb6841e into dodona-edu:main on Oct 24, 2024
26 checks passed