refactor tokenization pipeline to use GATs #1924

trinity-1686a · 2023-03-03T11:08:40Z

While working on #1654, I found that a substantial part of the difference between "normal" strings and pre-split strings is the amount of allocation/deallocation done: one per string and per filter, since each filter takes a BoxTokenStream and returns a BoxTokenStream. This PR leverages GATs so that only the final TokenStream gets boxed.
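To illustrate the idea, here is a minimal sketch of a tokenizer trait built on a generic associated type, where filters wrap the inner stream statically and only the end of the pipeline is boxed. It is not the code from this PR: `WhitespaceTokenizer`, `RemoveLongStream`, `boxed_pipeline` and `collect_tokens` are hypothetical names, and the traits are simplified to string tokens.

```rust
/// Simplified token stream: `advance` moves to the next token, `token` reads it.
pub trait TokenStream {
    fn advance(&mut self) -> bool;
    fn token(&self) -> &str;
}

/// The stream type is a generic associated type (GAT), so it can borrow
/// from the input text without being boxed.
pub trait Tokenizer {
    type TokenStream<'a>: TokenStream;
    fn token_stream<'a>(&self, text: &'a str) -> Self::TokenStream<'a>;
}

/// Hypothetical tokenizer that splits on whitespace.
pub struct WhitespaceTokenizer;

pub struct WhitespaceTokenStream<'a> {
    words: std::str::SplitWhitespace<'a>,
    current: String,
}

impl<'a> TokenStream for WhitespaceTokenStream<'a> {
    fn advance(&mut self) -> bool {
        match self.words.next() {
            Some(word) => {
                self.current.clear();
                self.current.push_str(word);
                true
            }
            None => false,
        }
    }

    fn token(&self) -> &str {
        &self.current
    }
}

impl Tokenizer for WhitespaceTokenizer {
    type TokenStream<'a> = WhitespaceTokenStream<'a>;

    fn token_stream<'a>(&self, text: &'a str) -> WhitespaceTokenStream<'a> {
        WhitespaceTokenStream {
            words: text.split_whitespace(),
            current: String::new(),
        }
    }
}

/// Hypothetical filter stream: generic over the inner stream, so stacking
/// filters nests concrete types instead of allocating a box per filter.
pub struct RemoveLongStream<S> {
    inner: S,
    limit: usize,
}

impl<S: TokenStream> TokenStream for RemoveLongStream<S> {
    fn advance(&mut self) -> bool {
        // Skip tokens whose byte length is not below the limit.
        while self.inner.advance() {
            if self.inner.token().len() < self.limit {
                return true;
            }
        }
        false
    }

    fn token(&self) -> &str {
        self.inner.token()
    }
}

pub type BoxTokenStream<'a> = Box<dyn TokenStream + 'a>;

/// The single allocation for the stream itself happens here, at the end.
pub fn boxed_pipeline<'a>(text: &'a str) -> BoxTokenStream<'a> {
    let stream = WhitespaceTokenizer.token_stream(text);
    Box::new(RemoveLongStream { inner: stream, limit: 6 })
}

/// Drains a boxed stream into owned strings.
pub fn collect_tokens(mut stream: BoxTokenStream<'_>) -> Vec<String> {
    let mut tokens = Vec::new();
    while stream.advance() {
        tokens.push(stream.token().to_string());
    }
    tokens
}
```

In this sketch, adding another filter only changes the concrete type passed to `Box::new`; the number of boxes per input string stays at one.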

I've also found that pre-allocating a String for the LowerCaser is harmful to performance if some strings may be ASCII-only, which happens a lot more when strings are pre-split (and, making matters worse, a lot more LowerCaserTokenStream instances are created in that case).
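A rough sketch of the lazy-buffer idea, assuming a free function rather than the actual LowerCaserTokenStream: `lowercase_token` and `scratch` are hypothetical names. The scratch buffer starts as `String::new()`, which does not allocate, and is only grown when a non-ASCII token shows up.

```rust
/// Lowercases `token` in place. ASCII-only tokens never touch `scratch`,
/// so a scratch buffer created with `String::new()` stays unallocated
/// as long as every token is ASCII.
pub fn lowercase_token(token: &mut String, scratch: &mut String) {
    if token.is_ascii() {
        // Fast path: in-place byte-wise lowercasing, no allocation.
        token.make_ascii_lowercase();
    } else {
        // Slow path: Unicode lowercasing can change the byte length,
        // so build the result in the scratch buffer and swap it in.
        scratch.clear();
        for c in token.chars() {
            scratch.extend(c.to_lowercase());
        }
        std::mem::swap(token, scratch);
    }
}
```

With many short-lived streams over pre-split strings, skipping the up-front `String::with_capacity` call avoids one allocation and one deallocation per stream whenever the input is ASCII.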

Measuring on fmassot--bench-hdfs-with-array, I get a 3% uplift in the general case, and 19% with pre-split strings.

codecov-commenter · 2023-03-03T13:30:39Z

Codecov Report

Merging #1924 (59f7417) into main (ca20bfa) will decrease coverage by 0.02%.
The diff coverage is 97.54%.

@@            Coverage Diff             @@
##             main    #1924      +/-   ##
==========================================
- Coverage   94.46%   94.44%   -0.02%     
==========================================
  Files         309      309              
  Lines       56485    56537      +52     
==========================================
+ Hits        53360    53399      +39     
- Misses       3125     3138      +13

Impacted Files	Coverage Δ
src/query/more_like_this/more_like_this.rs	`67.54% <ø> (ø)`
src/tokenizer/ascii_folding_filter.rs	`99.89% <91.66%> (-0.04%)`	⬇️
src/tokenizer/lower_caser.rs	`98.38% <91.66%> (-1.62%)`	⬇️
src/tokenizer/alphanum_only.rs	`92.85% <94.11%> (-0.48%)`	⬇️
src/tokenizer/split_compound_words.rs	`95.62% <94.44%> (-0.38%)`	⬇️
src/indexer/segment_writer.rs	`97.65% <100.00%> (ø)`
src/query/query_parser/query_parser.rs	`95.14% <100.00%> (+<0.01%)`	⬆️
src/tokenizer/empty_tokenizer.rs	`65.00% <100.00%> (ø)`
src/tokenizer/facet_tokenizer.rs	`93.67% <100.00%> (-0.08%)`	⬇️
src/tokenizer/mod.rs	`97.32% <100.00%> (+0.04%)`	⬆️
... and 17 more

src/tokenizer/empty_tokenizer.rs

fulmicoton · 2023-03-08T03:08:54Z

-        EmptyTokenStream::default().into()
+    type TokenStream<'a> = EmptyTokenStream;
+    fn token_stream<'a>(&self, _text: &'a str) -> EmptyTokenStream {
+        EmptyTokenStream::default()


Suggested change

-        EmptyTokenStream::default()
+        EmptyTokenStream

EmptyTokenStream isn't an empty struct; it contains a default Token.
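For context, a small sketch of why the bare name does not work as a value here. The field layout below is an assumption for illustration, not the actual definition: a unit struct can be written as just its name, but a struct with a field needs a constructor such as `Default::default()`.

```rust
/// Hypothetical, simplified token type.
#[derive(Debug, Default, PartialEq)]
pub struct Token {
    pub text: String,
    pub position: usize,
}

/// Because of the `token` field this is not a unit struct, so the
/// expression `EmptyTokenStream` on its own would not compile as a value.
#[derive(Default)]
pub struct EmptyTokenStream {
    token: Token,
}

impl EmptyTokenStream {
    /// An empty stream never yields a token.
    pub fn advance(&mut self) -> bool {
        false
    }

    /// Still returns a reference, which is why a default Token is stored.
    pub fn token(&self) -> &Token {
        &self.token
    }
}
```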

fulmicoton

Good job.
See inline comments.

refactor tokenization pipeline to use GATs

9fd9266

trinity-1686a requested a review from fulmicoton March 3, 2023 11:08

fix doctests

59f7417

fulmicoton reviewed Mar 8, 2023

src/tokenizer/empty_tokenizer.rs (resolved)

fulmicoton reviewed Mar 8, 2023

src/tokenizer/raw_tokenizer.rs (outdated, resolved)

fulmicoton reviewed Mar 8, 2023

tokenizer-api/src/lib.rs (outdated, resolved)

fulmicoton approved these changes Mar 8, 2023

trinity-1686a added 2 commits March 8, 2023 10:32

fix clippy lints

d532c6c

remove commented code

1655b42

trinity-1686a merged commit 0645181 into main Mar 9, 2023

trinity-1686a deleted the trinity--gat-tokenizer branch March 9, 2023 08:39
