
Improve tagger: Return iterators over WordData, remove groups, parallelize deserialization #70

Merged: 7 commits into main, Apr 24, 2021

Conversation

bminixhofer (Owner)

I had another look at the tagger today. This PR:

  • Changes all the get_tags_* methods to return iterators instead of Vec.
  • Removes the groups. These were only used by nlprule in the PosReplacer, which wasn't used anywhere since it is not fully implemented. Some of the currently unimplemented rules might still need the groups in some form, but we can probably get away with search + caching since the groups are only needed if a rule actually matches there.
  • Iterates over the FST in parallel in chunks with disjoint words, which allows populating the tags without synchronization.
  • Replaces the word HashMap with a Vec: since the IDs go from zero to n_words anyway, we don't need to hash anything. (A rough sketch of this and of the iterator-returning lookup follows after this list.)
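
A minimal sketch of the first and last points. WordIdInt, WordData and Tagger here are simplified stand-ins, not the actual nlprule definitions:

// Stand-in types for illustration only.
#[derive(Clone, Copy)]
struct WordIdInt(u32);

struct WordData {
    lemma: String,
    pos: String,
}

struct Tagger {
    // One slot per word id; ids run from 0..n_words, so no hashing is needed.
    tags: Vec<Vec<WordData>>,
}

impl Tagger {
    // Returns an iterator over the tags of a word instead of a freshly
    // allocated Vec, so callers that only inspect a few entries allocate nothing.
    fn get_tags(&self, id: WordIdInt) -> impl Iterator<Item = &WordData> + '_ {
        self.tags
            .get(id.0 as usize)
            .into_iter()
            .flat_map(|entries| entries.iter())
    }
}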

I see another ~30% speedup in loading the Tokenizer. This could also have a positive impact on rule checking speed, but there's some weird behavior in the local benchmark on my PC, so I have to double-check.

@drahnr you might be interested in this PR. It would also be great if you could double check the speedup.

@drahnr (Contributor) commented Apr 22, 2021

Could you provide some instructions on how you run your benches? I just copied the .bin files from the last release, but to no avail - I assume the structure of the .bins changed as well.

Benchmarking load tokenizer: Warming up for 3.0000 sthread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Serialization(Io(Custom { kind: UnexpectedEof, error: "failed to fill whole buffer" }))', nlprule/benches/load.rs:6:76
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@bminixhofer (Owner, Author)

Oh, I forgot that the binaries have changed. You can rebuild them with:

./scripts/build_and_test.sh en xxx

@drahnr (Contributor) commented Apr 22, 2021

Alright, it was a bit more than that, but it's running now.

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 36.1s, or reduce sample count to 10.
load tokenizer          time:   [350.09 ms 350.94 ms 351.92 ms]                           

load rules              time:   [32.718 ms 32.819 ms 32.929 ms]

A few observations (again, cargo-spellcheck PoV):

  • Numbers look really good, getting closer to the point where I would no longer consider launching another process just to avoid re-parsing - so this is a huge 👍

  • Since the data is constant, a 3.0 second warmup does not represent what a common user of cargo-spellcheck will face (talking mostly about caches here; black_box only prevents the compiler from prematurely optimizing, the actual CPU doesn't care) - see more details in chore: improve benches a bit #71 - EDIT: even for the case of a long-running service, you are mostly interested in the one-off execution; both rules and tokenizer loading are not common hot paths, but rather one-off operations.

  • criterion has options for this, so I would recommend not using much of a warmup, so that the worst case represents what CLI programs will see (a sketch of such a config follows below).
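
For illustration, a hypothetical version of the load benchmark with a short warm-up and a small sample count; the bench name, .bin path, and Tokenizer::new call are assumptions based on the public nlprule API, not the exact contents of nlprule/benches/load.rs:

use std::time::Duration;

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use nlprule::Tokenizer;

fn load_tokenizer(c: &mut Criterion) {
    c.bench_function("load tokenizer", |b| {
        // Loading from disk on every iteration is the interesting (one-off) case.
        b.iter(|| Tokenizer::new(black_box("storage/en_tokenizer.bin")).unwrap())
    });
}

criterion_group! {
    name = benches;
    // Short warm-up and few samples, so the numbers stay close to what a
    // one-off CLI invocation sees instead of a fully cache-warmed steady state.
    config = Criterion::default()
        .warm_up_time(Duration::from_millis(100))
        .sample_size(10);
    targets = load_tokenizer
}
criterion_main!(benches);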

Thanks for working on the perf ❤️

@drahnr (Contributor) commented Apr 22, 2021

Benches were all done with

vendor_id	: AuthenticAMD
cpu family	: 23
model		: 113
model name	: AMD Ryzen 7 3700X 8-Core Processor
stepping	: 0
microcode	: 0x8701013
cpu MHz		: 2200.000
cache size	: 512 KB
...

(for perspective).

@bminixhofer (Owner, Author)

> Alright, it was a bit more than that

I should document that, I guess you had to download the build dirs too.

Thanks for checking the perf! And thanks for the PR. I agree on the point with the warmup, I'll merge it.

@bminixhofer (Owner, Author)

Turns out the Chunker deserialization also contributed a significant amount now that the Tokenizer is faster. There was a fairly easy fix: storing the parameters in one large vector and accessing them with (offset, length) tuples instead of having one vector for each feature. This shaves off another ~12% of Tokenizer loading time (on top of the already improved speed above).
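
A minimal sketch of that layout, with made-up names (FlatFeatures is not the actual Chunker type, and the f32 value type is an assumption):

// Instead of one Vec per feature, all values live in a single backing Vec and
// each feature only stores an (offset, length) pair into it, so deserialization
// reconstructs one large allocation rather than many small ones.
struct FlatFeatures {
    values: Vec<f32>,
    spans: Vec<(usize, usize)>, // (offset, length) into `values`, one per feature
}

impl FlatFeatures {
    fn from_nested(features: Vec<Vec<f32>>) -> Self {
        let mut values = Vec::new();
        let mut spans = Vec::with_capacity(features.len());

        for feature in features {
            spans.push((values.len(), feature.len()));
            values.extend(feature);
        }

        FlatFeatures { values, spans }
    }

    // Returns the values of one feature as a slice into the shared buffer.
    fn feature(&self, index: usize) -> &[f32] {
        let (offset, len) = self.spans[index];
        &self.values[offset..offset + len]
    }
}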

Unfortunately I think that's it for the reasonably low-hanging speed improvements 😉

3 review comments on nlprule/src/compile/impls.rs (resolved)
@bminixhofer merged commit 602aaf1 into main on Apr 24, 2021
@bminixhofer deleted the faster-tagger-load branch on Apr 27, 2021