-
Notifications
You must be signed in to change notification settings - Fork 529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Parquet] Add traceid index to vparquet2 and vparquet3 and use it when finding trace by id #2697
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Before merging should we include this code in vparquet3? |
Yep, added as mandatory feature in vParquet3. Please take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing I'm concerned about is that, since our block min/max ids are currently incorrect the same bug will exist in this logic unwittingly.
Other than that I'm good.
…ck ended on exactly the configured rowgroup size while flushing
9c37545
to
cee39a8
Compare
Since vParquet3 is already being used in some installs, we decided it needs to not be a breaking change after all. Now it is optional like in vParquet2. |
That's a good thought and I don't have a great answer. Since we never determined the root cause or prevalence of that, it is hard to know this feature will be affected or not. I added an environment variable |
Yeah, I'm good. I suppose we could write a script that validates the index (and the min/max block size) after this is in a testing cluster to see if we find any inconsistencies. Let's keep an eye on the vult(ch)! |
What this PR does:
This re-adds an index file to vparquet2 and vparquet3 blocks to help find traces by ID. It contains the max trace ID of each row group and allows us to jump straight to the correct one, instead of the current method which performs i/o to test each group. Added as optional to avoid breaking changes. New blocks will create and use the index, and old blocks will fallback to the prior method.
Benchmarks don't show any improvement unless we simulate some latency for the i/o operations. Numbers below are for a level-0 block with 11 rowgroups. It saves around log2(rgs) i/o operations, so would be even better on larger blocks. Additionally the index file will be cached in practice but not included here.
0 ms:
50ms
100ms
Which issue(s) this PR fixes:
Fixes n/a
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]