Speed up tagger loading: remove IndexMap, new -> with_capacity #66

bminixhofer · 2021-04-15T17:43:49Z

Hey @drahnr, I've had a go at speeding up loading the Tokenizer today.

I did two things:

1. Replace the IndexMap with Vec<(WordIdInt, PosIdInt)>, as discussed in "Improve loading speed (of regex?) - cli usecase" #56. This makes the most difference.
2. Switch from new to with_capacity by storing the lengths. This makes a very small but measurable difference (a couple of percent).
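For readers who want to see the shape of these two changes, here is a minimal sketch. It is not the actual nlprule code: the definitions of WordIdInt and PosIdInt, the flat (word, lemma, pos) input, and the function names build_tags and pos_tags_for are all assumptions made for illustration. It shows a map pre-sized with with_capacity from a stored length, and a plain Vec of pairs scanned linearly in place of a keyed inner map.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the ID types named in the PR; the real
// definitions in the crate may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct WordIdInt(u32);
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct PosIdInt(u16);

/// Builds a word -> [(lemma, pos)] table from flat (word, lemma, pos) triples.
/// `n_words` is a stored count of distinct words (assumed to be serialized
/// alongside the data) and is used to pre-size the map.
fn build_tags(
    triples: &[(u32, u32, u16)],
    n_words: usize,
) -> HashMap<WordIdInt, Vec<(WordIdInt, PosIdInt)>> {
    // with_capacity avoids repeated rehashing while the map grows.
    let mut tags = HashMap::with_capacity(n_words);
    for &(word, lemma, pos) in triples {
        tags.entry(WordIdInt(word))
            .or_insert_with(Vec::new)
            .push((WordIdInt(lemma), PosIdInt(pos)));
    }
    tags
}

/// Collects the POS ids for one lemma by scanning the pairs linearly.
/// This replaces the keyed lookup an inner map would provide.
fn pos_tags_for(entries: &[(WordIdInt, PosIdInt)], lemma: WordIdInt) -> Vec<PosIdInt> {
    entries
        .iter()
        .filter(|(l, _)| *l == lemma)
        .map(|(_, p)| *p)
        .collect()
}
```

The trade-off in this sketch is that a Vec of pairs is cheap to construct (no hashing per entry) but lookups within one word become a linear scan, which is usually fine when each word has only a handful of entries.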

Overall I get a 25% speedup, which is something at least. I experimented a bit with parallelization, in particular setting some "anchor" points in the FST and splitting the work into chunks, where each chunk iterates from one anchor point to the next, but the speedup from that seems to be nullified by the merge we have to do afterwards.
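The chunk-then-merge pattern described above can be sketched roughly as follows. This is an illustration under assumptions, not the code that was tried: a slice of (key, value) pairs stands in for the FST stream, fixed-size chunk boundaries stand in for the anchor points, and build_parallel is a hypothetical name. It needs Rust 1.63 or later for std::thread::scope.

```rust
use std::collections::HashMap;
use std::thread;

/// Splits `entries` into roughly `n_chunks` ranges, builds a partial map per
/// range on its own thread, then merges the partial maps sequentially.
fn build_parallel(entries: &[(u32, u32)], n_chunks: usize) -> HashMap<u32, Vec<u32>> {
    let n_chunks = n_chunks.max(1);
    // Ceiling division; at least 1 so that `chunks` never receives 0.
    let chunk_len = ((entries.len() + n_chunks - 1) / n_chunks).max(1);

    // Parallel phase: each thread only touches its own chunk and its own map.
    let partials: Vec<HashMap<u32, Vec<u32>>> = thread::scope(|s| {
        let handles: Vec<_> = entries
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    let mut part: HashMap<u32, Vec<u32>> = HashMap::new();
                    for &(key, value) in chunk {
                        part.entry(key).or_default().push(value);
                    }
                    part
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Sequential phase: every key is inserted into a hash map a second time
    // here, which is the merge cost mentioned above.
    let mut merged: HashMap<u32, Vec<u32>> = HashMap::new();
    for part in partials {
        for (key, mut values) in part {
            merged.entry(key).or_default().append(&mut values);
        }
    }
    merged
}
```

In this sketch the merge re-inserts every key into a second hash map on a single thread, so much of the work saved in the parallel phase is spent again, which is consistent with the observation that the speedup disappears.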

Maybe there are smarter ways to speed this up further, but I couldn't think of any.

drahnr · 2021-04-16T05:50:58Z

This is very good news! 25% is already a noticeable improvement, sorry for dropping the ball on this :>

bminixhofer · 2021-04-16T10:48:36Z

No worries. As of release 0.6.2 you should see the speedup :)

remove IndexMap, new -> with_capacity

022498a

bminixhofer added 3 commits April 16, 2021 10:16

try CI fix

0744e9b

ci

cc3d5fe

comment out problematic test

aec968d

bminixhofer merged commit 2a243aa into main Apr 16, 2021

bminixhofer deleted the faster-tagger-load branch April 16, 2021 09:06

drahnr mentioned this pull request Apr 16, 2021

Reduce seemingly inactive / dead time drahnr/cargo-spellcheck#104

Open

bminixhofer mentioned this pull request Apr 28, 2021

Improve loading speed (of regex?) - cli usecase #56

Open
