-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve tagger: Return iterators over WordData
, remove groups, parallelize deserialization
#70
Conversation
Could you provide some instructions how you run your benches? Just copied the
|
Oh, I forgot that the binaries have changed. You can rebuild them with:
|
Alright, it was a bit more than that, but it's running now.
A few observations (again,
Thanks for working on the perf ❤️ |
Benches were all done with
(for perspective). |
I should document that, I guess you had to download the build dirs too. Thanks for checking the perf! And thanks for the PR. I agree on the point with the warmup, I'll merge it. |
Turns out the Unfortunately I think that's it for the reasonably low-hanging speed improvements 😉 |
I had another look at the tagger today. This PR:
get_tags_*
methods to return iterators instead ofVec
.groups
. These were only used by nlprule in thePosReplacer
which wasn't used anywhere as it is not fully implemented. Some of the currently unimplemented rules might need thegroups
in some form though, but we can probably get away with search + caching since the groups are only needed if a rule actually matches there.tags
without synchronization.HashMap
with aVec
since the IDs go from zero ton_words
anyway, so we don't need to hash anything.I see another ~ 30% speedup in loading the Tokenizer. This could also have positive impact on rule checking speed, but there's some weird behavior in the local benchmark on my PC so I have to double check.
@drahnr you might be interested in this PR. It would also be great if you could double check the speedup.