
Improve tokenizer performance #1630

Merged: 9 commits into dodona-edu:main on Oct 24, 2024
Conversation

@milachae (Collaborator) commented on Oct 22, 2024

This PR improves tokenizer performance by using a bottom-up algorithm that walks the syntax tree and calculates the corresponding region for each syntax node.

On our large benchmark dataset (1100 submissions), this change brings the time spent tokenizing the files down from 25 seconds to 2.5 seconds, making this step 10 times faster. As more than half of the total analysis time was spent tokenizing, Dolos as a whole is now more than twice as fast.
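To make that last claim concrete with an illustrative (not measured) total: if a full analysis took about 40 seconds, of which 25 seconds was tokenization, it would now take roughly

$$40 - (25 - 2.5) = 17.5 \text{ seconds}, \quad \text{a speed-up of } \tfrac{40}{17.5} \approx 2.3\times.$$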

API change: the generateTokens method of Tokenizer and its subclasses CodeTokenizer and CharTokenizer now returns a Token[] array instead of yielding an IterableIterator<Token>. In most cases this will not break code that uses this method.
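As a minimal sketch of what this means for callers (the tokenizer, text and process names below are placeholders, not taken from the Dolos codebase):

// Placeholder declarations standing in for real values in calling code.
declare const tokenizer: CodeTokenizer;
declare const text: string;
declare function process(token: Token): void;

// Before this PR, generateTokens was a generator that yielded tokens lazily:
//   for (const token of tokenizer.generateTokens(text)) { process(token); }
// After this PR it returns a plain Token[], which is still iterable, so the
// same for...of loop keeps working:
const tokens: Token[] = tokenizer.generateTokens(text);
for (const token of tokens) {
  process(token);
}
// Only callers that relied on lazy evaluation (for example, breaking out of
// the loop early to avoid tokenizing the rest of the file) will notice a
// difference, since the whole array is now built up front.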

@milachae requested a review from rien on October 22, 2024 09:33
@milachae self-assigned this on Oct 22, 2024
@rien (Member) left a comment


There is possibly still another performance bottleneck in here, so I am very curious what the parser's final share of the total time will be after fixing that.

Can you recreate this PR as a new branch in the Dolos repository? This will enable the CI to run on your code as well.


for (const child of node.namedChildren) {
  yield* this.tokenizeNode(child);
  allNodes.push(...childNodes);
@rien (Member) commented on the diff above:

This spread operator is probably applied to large lists of childNodes, which could have a large performance impact.

In addition, it copies the items previously added to that list into yet another list. Passing the tokens up the tree then becomes a $\Theta(n^2)$ operation, since some tokens are copied multiple times.

I suggest passing the array to append to as an argument instead of returning it. This also has the added benefit of allowing a cleaner return type:

public generateTokens(text: string): Token[] {
  const tree = this.parser.parse(text, undefined, { bufferSize: Math.max(32 * 1024, text.length * 2) });
  const tokens: Token[] = [];
  this.tokenizeNode(tree.rootNode, tokens);
  return tokens;
}

private tokenizeNode(node: SyntaxNode, tokens: Token[]): [number, number] {
  const location = new Region(node.startPosition.row, node.startPosition.column, node.endPosition.row, node.endPosition.column);
  tokens.push(this.newToken("(", location));
  tokens.push(this.newToken(node.type, location));
  for (const child of node.namedChildren) {
    const [childStartRow, childStartCol] = this.tokenizeNode(child, tokens);
    // If the code is already captured in one of the children, the region of the current node can be shortened.
    if ((childStartRow < location.endRow) || (childStartRow === location.endRow && childStartCol < location.endCol)) {
      location.endRow = childStartRow;
      location.endCol = childStartCol;
    }
  }
  tokens.push(this.newToken(")", location));
  return [location.startRow, location.startCol];
}
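As an aside, the quadratic-copying argument can be seen in a tiny self-contained sketch; plain numbers and a linear chain stand in for Token objects and the syntax tree, so this is illustrative rather than Dolos code:

// Return-and-spread: every level copies the entire result of its child,
// so a value produced near the leaves is copied once per ancestor.
// For deep trees the total copying is quadratic in the number of values.
function collectBySpread(depth: number): number[] {
  const own = [depth];
  if (depth > 0) {
    own.push(...collectBySpread(depth - 1)); // copies the child's whole array
  }
  return own;
}

// Accumulator-passing (as suggested above): each value is pushed exactly once,
// so the total work is linear.
function collectIntoAccumulator(depth: number, out: number[]): void {
  out.push(depth);
  if (depth > 0) {
    collectIntoAccumulator(depth - 1, out);
  }
}

const acc: number[] = [];
collectIntoAccumulator(5, acc);
console.log(collectBySpread(5)); // [5, 4, 3, 2, 1, 0]
console.log(acc);                // [5, 4, 3, 2, 1, 0]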

@milachae (Collaborator, Author) replied:

This was indeed something I already tested last week, but (surprisingly) it didn't have any impact on the execution time.

It does, however, make the return type clearer.

@rien (Member) commented on Oct 23, 2024

Never mind my suggestion to recreate the branch; the CI somehow started running on this PR anyway.

@milachae (Collaborator, Author) commented:

> Never mind my suggestion to recreate the branch; the CI somehow started running on this PR anyway.

I think this happened because I just created the new branch in this repo, but I don't know why that triggered the CI on this PR.

@rien changed the title from "Improve the tokenizer" to "Improve tokenizer performance" on Oct 24, 2024
@rien merged commit bb6841e into dodona-edu:main on Oct 24, 2024
26 checks passed