-
-
Notifications
You must be signed in to change notification settings - Fork 709
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor tokenization pipeline to use GATs #1924
Conversation
Codecov Report
📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more @@ Coverage Diff @@
## main #1924 +/- ##
==========================================
- Coverage 94.46% 94.44% -0.02%
==========================================
Files 309 309
Lines 56485 56537 +52
==========================================
+ Hits 53360 53399 +39
- Misses 3125 3138 +13
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
EmptyTokenStream::default().into() | ||
type TokenStream<'a> = EmptyTokenStream; | ||
fn token_stream<'a>(&self, _text: &'a str) -> EmptyTokenStream { | ||
EmptyTokenStream::default() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EmptyTokenStream::default() | |
EmptyTokenStream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EmptyTokenStream
isn't an empty struct, it contains a default Token
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job.
See inline comments.
while working on #1654, I found a substantial part of the difference between "normal" strings and pre-split strings is the amount of allocation/deallocation done. It's one per string and per filter, as filter take a
BoxTokenStream
and returns aBoxTokenStream
. This leverage GATs so only the the finalTokenStream
get boxed.I've also found that pre-allocating a String for the LowerCaser is harmful to performance if some strings may be ascii only, which happens a lot more when strings are pre-split (and making matters worse, a lot more
LowerCaserTokenStream
are also instantiated in that case).Measuring on
fmassot--bench-hdfs-with-array
, I get a 3% uplift in the general case, and a 19% with pre-split strings.