
Speed up tagger loading: remove IndexMap, new -> with_capacity #66

Merged
4 commits merged into main from faster-tagger-load on Apr 16, 2021

Conversation

bminixhofer
Owner

Hey @drahnr, I've had a go at speeding up Tokenizer loading today.

I did two things:

  • Replaced the IndexMap with a Vec<(WordIdInt, PosIdInt)>, as discussed in Improve loading speed (of regex?) - cli usecase #56. This makes the most difference.
  • Changed new to with_capacity by storing the lengths; this makes a small but measurable difference (a couple of percent).
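The two changes above can be sketched roughly as follows. This is a hypothetical stand-in, not nlprule's actual code: the integer type aliases mimic WordIdInt/PosIdInt, and the TagStore struct is invented for illustration. The idea is that a Vec of pairs sorted by word id deserializes as one contiguous read (unlike a hash-based IndexMap), lookups stay cheap via binary search, and pre-allocating with a stored length avoids repeated reallocation:

```rust
// Hypothetical integer id types standing in for nlprule's WordIdInt/PosIdInt.
type WordIdInt = u32;
type PosIdInt = u16;

/// Sketch: instead of a map, keep a Vec of pairs sorted by word id.
struct TagStore {
    entries: Vec<(WordIdInt, PosIdInt)>, // kept sorted by word id
}

impl TagStore {
    /// Pre-allocating with a stored length avoids repeated reallocation
    /// during loading (the new -> with_capacity change).
    fn with_capacity(len: usize) -> Self {
        TagStore { entries: Vec::with_capacity(len) }
    }

    fn insert_sorted(&mut self, word: WordIdInt, pos: PosIdInt) {
        // Assumes entries arrive in ascending word-id order, as when
        // streaming an FST; otherwise sort once after loading.
        self.entries.push((word, pos));
    }

    fn get(&self, word: WordIdInt) -> Option<PosIdInt> {
        self.entries
            .binary_search_by_key(&word, |&(w, _)| w)
            .ok()
            .map(|i| self.entries[i].1)
    }
}

fn main() {
    let mut store = TagStore::with_capacity(3);
    store.insert_sorted(1, 10);
    store.insert_sorted(5, 20);
    store.insert_sorted(9, 30);
    assert_eq!(store.get(5), Some(20));
    assert_eq!(store.get(7), None);
    println!("ok");
}
```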

Overall I get a 25% speedup, which is something at least. I experimented a bit with parallelization, particularly setting some "anchor" points in the FST and splitting the work into chunks, where each chunk iterates from one anchor point to the next, but it seems the speedup from that is nullified by the merge we have to do afterwards.
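The anchor-point experiment might look something like this sketch. It is an assumption-heavy illustration: a plain sorted slice stands in for the FST stream (the real code would iterate fst ranges), the per-entry work is a dummy computation, and load_parallel is an invented name. The point it demonstrates is structural: the chunks process in parallel, but their outputs still have to be concatenated back into one Vec sequentially, which is the merge step that can eat the parallel gains:

```rust
use std::thread;

// Sketch of the "anchor point" idea: split a sorted key space at anchor
// indices, process each chunk on its own thread, then merge the results.
fn load_parallel(keys: &[u32], anchors: &[usize]) -> Vec<(u32, u16)> {
    let chunks: Vec<Vec<(u32, u16)>> = thread::scope(|s| {
        let handles: Vec<_> = anchors
            .windows(2)
            .map(|w| {
                let slice = &keys[w[0]..w[1]];
                s.spawn(move || {
                    // Stand-in for per-entry work done while streaming the FST.
                    slice.iter().map(|&k| (k, (k % 100) as u16)).collect()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // The sequential merge: concatenating (or re-sorting) the chunk
    // outputs can cost as much as the parallel phase saved.
    let mut out = Vec::with_capacity(keys.len());
    for chunk in chunks {
        out.extend(chunk);
    }
    out
}

fn main() {
    let keys: Vec<u32> = (0..1000).collect();
    let anchors = [0, 250, 500, 750, 1000];
    let loaded = load_parallel(&keys, &anchors);
    assert_eq!(loaded.len(), 1000);
    assert_eq!(loaded[999], (999, 99));
    println!("merged {} entries", loaded.len());
}
```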

Maybe there are smarter ways to speed this up further, but I couldn't think of anything.

@drahnr
Contributor

drahnr commented Apr 16, 2021

This is very good news! 25% is already a noticeable improvement, sorry for dropping the ball on this :>

@bminixhofer bminixhofer merged commit 2a243aa into main Apr 16, 2021
@bminixhofer bminixhofer deleted the faster-tagger-load branch April 16, 2021 09:06
@bminixhofer
Owner Author

No worries. As of release 0.6.2 you should see the speedup :)
