Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encode some part of posting list as -1 instead of direct values #2185

Merged
merged 9 commits into from
Oct 20, 2023

Conversation

trinity-1686a
Copy link
Contributor

@trinity-1686a trinity-1686a commented Sep 21, 2023

fix quickwit-oss/quickwit#3740 on next tantivy pull
fix #2180

tested with quickwit on 100k documents from github-archive, using a json term covering everything, indexed+stored+record-position+fast, all put in a single split.
Here are the size of the different sections of the split, at each step of the changes:

image

the code should be compatible with old indexes, however it generate posting lists that can't be read by old version.
Not emitting TF cause the decoding to become somewhat of a guesswork, we know this field has frequency, but does it really has frequency for this term? We now decide that based on whether the section is lengthily enough to contain those. As far as I can tell it's correct, but not beautiful, and rather fragile. I think I prefer encoding dummy TF/position than actually not emitting those

@codecov-commenter
Copy link

codecov-commenter commented Sep 21, 2023

Codecov Report

Patch coverage: 98.31% and project coverage change: +0.01% 🎉

Comparison is base (2d73903) 94.40% compared to head (244de65) 94.42%.
Report is 5 commits behind head on main.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2185      +/-   ##
==========================================
+ Coverage   94.40%   94.42%   +0.01%     
==========================================
  Files         322      322              
  Lines       63225    63609     +384     
==========================================
+ Hits        59688    60061     +373     
- Misses       3537     3548      +11     
Files Changed Coverage Δ
src/postings/compression/mod.rs 98.84% <95.65%> (-1.16%) ⬇️
src/indexer/mod.rs 100.00% <100.00%> (ø)
src/positions/reader.rs 100.00% <100.00%> (ø)
src/positions/serializer.rs 100.00% <100.00%> (ø)
src/postings/block_segment_postings.rs 98.65% <100.00%> (+0.09%) ⬆️
src/postings/serializer.rs 99.30% <100.00%> (+0.35%) ⬆️
src/postings/skip.rs 98.45% <100.00%> (+0.14%) ⬆️

... and 12 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

src/postings/skip.rs Outdated Show resolved Hide resolved
src/postings/skip.rs Outdated Show resolved Hide resolved
src/postings/skip.rs Outdated Show resolved Hide resolved
@@ -24,7 +24,7 @@ fn max_score<I: Iterator<Item = Score>>(mut it: I) -> Option<Score> {
#[derive(Clone)]
pub struct BlockSegmentPostings {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the skip_reader member in BlockSegmentPostings is pub(crate), but calling next on it can break the invariant you rephrased ("if block_loaded is set to true the block loaded in doc_decoder and freq_decoder is matching the block of skip_reader).

Can you see if it is possible to make it private, and if needed expose it with an inline non mut accessor.

Copy link
Collaborator

@fulmicoton fulmicoton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the skip_reader member in BlockSegmentPostings is pub(crate), but calling next on it can break the invariant you rephrased ("if block_loaded is set to true the block loaded in doc_decoder and freq_decoder is matching the block of skip_reader).

Can you see if it is possible to make it private, and if needed expose it with an inline non mut accessor.

@trinity-1686a trinity-1686a merged commit 0d45892 into main Oct 20, 2023
@trinity-1686a trinity-1686a deleted the trinity--bitpack-delta-1 branch October 20, 2023 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants