[Parquet] Add traceid index to vparquet2 and vparquet3 and use it when finding trace by id #2697

mdisibio · 2023-07-25T19:29:04Z

What this PR does:
This re-adds an index file to vparquet2 and vparquet3 blocks to help find traces by ID. The index contains the max trace ID of each row group, which lets us jump straight to the correct row group instead of using the current method, which performs I/O to test each group. The index is optional to avoid a breaking change: new blocks create and use it, and old blocks fall back to the prior method.
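As an illustration of the lookup described above, here is a minimal sketch, not the PR's actual code. It assumes the index is held in memory as one max trace ID per row group, sorted ascending; the `rowGroupIndex` type and `find` method are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// rowGroupIndex is a hypothetical in-memory form of the index: one entry per
// row group, holding the largest trace ID stored in that row group.
// Assumption: entries are sorted ascending.
type rowGroupIndex struct {
	maxIDs [][]byte
}

// find returns the number of the only row group that can contain id, or -1
// when id is greater than the max ID of the last row group.
func (idx rowGroupIndex) find(id []byte) int {
	// Binary search for the first row group whose max ID is >= id.
	n := sort.Search(len(idx.maxIDs), func(i int) bool {
		return bytes.Compare(idx.maxIDs[i], id) >= 0
	})
	if n == len(idx.maxIDs) {
		return -1 // beyond the last row group: the trace cannot be in this block
	}
	return n
}

func main() {
	idx := rowGroupIndex{maxIDs: [][]byte{{0x10}, {0x20}, {0x30}}}
	for _, id := range [][]byte{{0x05}, {0x20}, {0x21}, {0x31}} {
		fmt.Printf("trace ID %x -> row group %d\n", id, idx.find(id))
	}
}
```

The search runs entirely on the in-memory index, so only the single candidate row group needs to be read afterwards.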

Benchmarks don't show any improvement unless we simulate some latency for the I/O operations. The numbers below are for a level-0 block with 11 row groups. The index saves around log2(rgs) I/O operations, so the gain would be even larger on larger blocks. Additionally, the index file will be cached in practice, but caching is not reflected in these numbers.

Simulated latency 0 ms:

name              old time/op    new time/op    delta
FindTraceByID-12    24.2ms ±48%    24.2ms ±48%   ~     (p=0.739 n=10+10)

Simulated latency 50 ms:

name              old time/op    new time/op    delta
FindTraceByID-12     321ms ± 1%     167ms ± 1%  -47.85%  (p=0.000 n=9+8)

Simulated latency 100 ms:

name              old time/op    new time/op    delta
FindTraceByID-12     622ms ± 1%     318ms ± 2%  -48.82%  (p=0.000 n=9+10)

Which issue(s) this PR fixes:
Fixes n/a

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

mapno

LGTM

tempodb/encoding/vparquet2/block_findtracebyid_test.go

joe-elliott · 2023-07-27T12:13:46Z

Before merging should we include this code in vparquet3?

mdisibio · 2023-07-27T13:24:06Z

> Before merging should we include this code in vparquet3?

Yep, added as a mandatory feature in vParquet3. Please take a look.

joe-elliott

The only thing I'm concerned about is that, since our block min/max IDs are currently incorrect, the same bug may unwittingly carry over into this logic.

Other than that I'm good.

.gitignore


mdisibio · 2023-08-04T14:44:55Z

Since vParquet3 is already being used in some installs, we decided it must not be a breaking change after all. The index is now optional, as in vParquet2.

mdisibio · 2023-08-23T10:56:47Z

> since our block min/max ids are currently incorrect the same bug will exist in this logic

That's a good thought and I don't have a great answer. Since we never determined the root cause or prevalence of that, it is hard to know whether this feature will be affected. I added an environment variable, VPARQUET_INDEX=0, to turn off the index lookup in case we see issues. Sound good?

joe-elliott · 2023-08-23T12:19:16Z

> That's a good thought and I don't have a great answer. Since we never determined the root cause or prevalence of that, it is hard to know whether this feature will be affected. I added an environment variable, VPARQUET_INDEX=0, to turn off the index lookup in case we see issues. Sound good?

Yeah, I'm good. I suppose we could write a script that validates the index (and the min/max block size) after this is in a testing cluster to see if we find any inconsistencies. Let's keep an eye on the vult(ch)!
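A validation pass along those lines might look like the sketch below. It is hypothetical: it assumes the index entries and the max trace IDs actually observed in each row group have already been loaded into memory, and `validateIndex` is not a function from the PR.

```go
package main

import (
	"bytes"
	"fmt"
)

// validateIndex checks two properties of a hypothetical index: that its
// entries are strictly ascending, and that each entry equals the max trace ID
// actually observed in the corresponding row group.
func validateIndex(indexMax, observedMax [][]byte) error {
	if len(indexMax) != len(observedMax) {
		return fmt.Errorf("index has %d entries, block has %d row groups",
			len(indexMax), len(observedMax))
	}
	for i := range indexMax {
		// A repeated or out-of-order entry would make the binary search unreliable.
		if i > 0 && bytes.Compare(indexMax[i-1], indexMax[i]) >= 0 {
			return fmt.Errorf("index entry %d is not greater than entry %d", i, i-1)
		}
		if !bytes.Equal(indexMax[i], observedMax[i]) {
			return fmt.Errorf("row group %d: index max %x, observed max %x",
				i, indexMax[i], observedMax[i])
		}
	}
	return nil
}

func main() {
	index := [][]byte{{0x10}, {0x20}}
	fmt.Println(validateIndex(index, [][]byte{{0x10}, {0x20}}))
	fmt.Println(validateIndex(index, [][]byte{{0x10}, {0x25}}))
}
```

The strict ordering check would also flag a duplicated index entry, on the assumption that trace IDs are unique within a block.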

Add traceid index to vparquet2 and use it when finding trace by id

f2d491d

mdisibio requested review from joe-elliott, annanay25, mapno, yvrhdn, zalegrala, electron0zero, ie-pham and stoewer as code owners July 25, 2023 19:29

mdisibio added 3 commits July 26, 2023 13:38

Merge branch 'main' into vparquet2-index

485c62f

Update to detect when trace ID is beyond the last row group and bail

eb66f8f

simplify

b9a4b1d

mapno approved these changes Jul 27, 2023


tempodb/encoding/vparquet2/block_findtracebyid_test.go

mdisibio added 5 commits July 27, 2023 08:23

Add index to vparquet3

d51509f

go away

766aabe

lint

128c9bc

changelog

f796e54

Merge branch 'main' into vparquet2-index

6d61d97

joe-elliott reviewed Jul 27, 2023


.gitignore

mdisibio added 3 commits July 27, 2023 11:17

Fix index to not flush duplicate entry, which could happen if the block ended on exactly the configured rowgroup size while flushing

b4c9716

the opposite of move fast and break things

7dea128

more optionalality

cee39a8

mdisibio force-pushed the vparquet2-index branch from 9c37545 to cee39a8 Compare August 3, 2023 20:47

mdisibio added 2 commits August 22, 2023 07:58

Add new env var VPARQUET_INDEX=0 to disable the index lookup

a2afeed

Merge branch 'main' into vparquet2-index

a040fa6

joe-elliott approved these changes Aug 23, 2023


Merge branch 'main' into vparquet2-index

0242027

mdisibio merged commit 14da7b8 into grafana:main Aug 24, 2023

mdisibio changed the title ~~[Parquet] Add traceid index to vparquet2 and use it when finding trace by id~~ [Parquet] Add traceid index to vparquet2 and vparquet3 and use it when finding trace by id Aug 24, 2023

This was referenced Aug 29, 2023

Fix correctness of block min/max IDs #2867

Merged

Fix flush index error handling #2890

Merged
