-
-
Notifications
You must be signed in to change notification settings - Fork 698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
encode some part of posting list as -1 instead of direct values #2185
Conversation
Codecov ReportPatch coverage:
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #2185 +/- ##
==========================================
+ Coverage 94.40% 94.42% +0.01%
==========================================
Files 322 322
Lines 63225 63609 +384
==========================================
+ Hits 59688 60061 +373
- Misses 3537 3548 +11
☔ View full report in Codecov by Sentry. |
@@ -24,7 +24,7 @@ fn max_score<I: Iterator<Item = Score>>(mut it: I) -> Option<Score> { | |||
#[derive(Clone)] | |||
pub struct BlockSegmentPostings { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the skip_reader
member in BlockSegmentPostings
is pub(crate)
, but calling next on it can break the invariant you rephrased ("if block_loaded
is set to true the block loaded in doc_decoder
and freq_decoder
is matching the block of skip_reader).
Can you see if it is possible to make it private, and if needed expose it with an inline non mut accessor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the skip_reader member in BlockSegmentPostings is pub(crate), but calling next on it can break the invariant you rephrased ("if block_loaded is set to true the block loaded in doc_decoder and freq_decoder is matching the block of skip_reader).
Can you see if it is possible to make it private, and if needed expose it with an inline non mut accessor.
b3e831e
to
770a012
Compare
fix quickwit-oss/quickwit#3740 on next tantivy pull
fix #2180
tested with quickwit on 100k documents from github-archive, using a json term covering everything, indexed+stored+record-position+fast, all put in a single split.
Here are the size of the different sections of the split, at each step of the changes:
the code should be compatible with old indexes, however it generate posting lists that can't be read by old version.
Not emitting TF cause the decoding to become somewhat of a guesswork, we know this field has frequency, but does it really has frequency for this term? We now decide that based on whether the section is lengthily enough to contain those. As far as I can tell it's correct, but not beautiful, and rather fragile. I think I prefer encoding dummy TF/position than actually not emitting those