Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Parquet] Add traceid index to vparquet2 and vparquet3 and use it when finding trace by id #2697

Merged
merged 15 commits into from
Aug 24, 2023

Conversation

mdisibio
Copy link
Contributor

@mdisibio mdisibio commented Jul 25, 2023

What this PR does:
This re-adds an index file to vparquet2 and vparquet3 blocks to help find traces by ID. It contains the max trace ID of each row group and allows us to jump straight to the correct one, instead of the current method which performs i/o to test each group. Added as optional to avoid breaking changes. New blocks will create and use the index, and old blocks will fallback to the prior method.

Benchmarks don't show any improvement unless we simulate some latency for the i/o operations. Numbers below are for a level-0 block with 11 rowgroups. It saves around log2(rgs) i/o operations, so would be even better on larger blocks. Additionally the index file will be cached in practice but not included here.

0 ms:

name              old time/op    new time/op    delta
FindTraceByID-12    24.2ms ±48%    24.2ms ±48%   ~     (p=0.739 n=10+10)

50ms

name              old time/op    new time/op    delta
FindTraceByID-12     321ms ± 1%     167ms ± 1%  -47.85%  (p=0.000 n=9+8)

100ms

name              old time/op    new time/op    delta
FindTraceByID-12     622ms ± 1%     318ms ± 2%  -48.82%  (p=0.000 n=9+10)

Which issue(s) this PR fixes:
Fixes n/a

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Copy link
Member

@mapno mapno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@joe-elliott
Copy link
Member

Before merging should we include this code in vparquet3?

@mdisibio
Copy link
Contributor Author

Before merging should we include this code in vparquet3?

Yep, added as mandatory feature in vParquet3. Please take a look.

Copy link
Member

@joe-elliott joe-elliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing I'm concerned about is that, since our block min/max ids are currently incorrect the same bug will exist in this logic unwittingly.

Other than that I'm good.

.gitignore Show resolved Hide resolved
@mdisibio
Copy link
Contributor Author

mdisibio commented Aug 4, 2023

Since vParquet3 is already being used in some installs, we decided it needs to not be a breaking change after all. Now it is optional like in vParquet2.

@mdisibio
Copy link
Contributor Author

since our block min/max ids are currently incorrect the same bug will exist in this logic

That's a good thought and I don't have a great answer. Since we never determined the root cause or prevalence of that, it is hard to know this feature will be affected or not. I added an environment variable VPARQUET_INDEX=0 to turn off the index lookup in case we are seeing issues. Sound good?

@joe-elliott
Copy link
Member

joe-elliott commented Aug 23, 2023

That's a good thought and I don't have a great answer. Since we never determined the root cause or prevalence of that, it is hard to know this feature will be affected or not. I added an environment variable VPARQUET_INDEX=0 to turn off the index lookup in case we are seeing issues. Sound good?

Yeah, I'm good. I suppose we could write a script that validates the index (and the min/max block size) after this is in a testing cluster to see if we find any inconsistencies. Let's keep an eye on the vult(ch)!

@mdisibio mdisibio merged commit 14da7b8 into grafana:main Aug 24, 2023
@mdisibio mdisibio changed the title [Parquet] Add traceid index to vparquet2 and use it when finding trace by id [Parquet] Add traceid index to vparquet2 and vparquet3 and use it when finding trace by id Aug 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants