-
Notifications
You must be signed in to change notification settings - Fork 569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Count the number of blocks loaded by a store-gateway by duration #6074
Conversation
319e77f
to
ab74a1e
Compare
Example of the metrics produced:
|
ab74a1e
to
94d8f13
Compare
Emit metrics about the number of blocks a store-gateway has loaded based on the duration of the blocks (bucketed into the block durations the compactor is configured to produce). A common symptom of the compactor failing is store-gateways loading more 2h blocks that usual and degrading performance of the read path. Counting the number of blocks by duration allows us to detect when this happens. Signed-off-by: Nick Pillitteri <[email protected]>
94d8f13
to
67de993
Compare
|
||
return stats | ||
} | ||
|
||
func bucketBlockDuration(buckets tsdb.DurationList, duration time.Duration) time.Duration { | ||
for _, d := range buckets { | ||
if duration <= d { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocks are sometimes over by a few milliseconds which would put 2:00:00.1 blocks in the 12hr bucket, do we want to add a little jitter to this duration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After talking to @56quarters offline and inspecting some systems blocks that are MS over the time limit are very rare, so I don't think we need jitter here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline. Based on this code I don't think it should be possible that blocks have more than 2h of samples. I've double checked a few production cells and couldn't find instances of blocks exceeding their supposed duration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocks are sometimes over by a few milliseconds which would put 2:00:00.1 blocks in the 12hr bucket, do we want to add a little jitter to this duration?
Do we have blocks with time range over the expected one? I'm aware a block could be smaller (e.g. 1:59:59
) but not bigger. Can you follow up on this, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocks are sometimes over by a few milliseconds which would put 2:00:00.1 blocks in the 12hr bucket, do we want to add a little jitter to this duration?
Do we have blocks with time range over the expected one?
We don't, at least I've never seen such block myself.
Signed-off-by: Nick Pillitteri <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome for finding compactor issues! Thank you!
pkg/storegateway/bucket_stores.go
Outdated
return float64(count) | ||
metrics <- prometheus.MustNewConstMetric(u.blocksLoaded, prometheus.GaugeValue, float64(total)) | ||
for d, n := range loaded { | ||
metrics <- prometheus.MustNewConstMetric(u.blocksLoadedByDuration, prometheus.GaugeValue, float64(n), d.String()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Can we convert time.Duration
to model.Duration
and use its String
method? It will remove unnecessary 0m0s
suffixes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively: we can change duration
to hours
, to avoid formatting 24h
as 1d
when using model.Duration
. (Although I don't think that's a big deal)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll convert to model.Duration
. I was trying to think of a good way to remove the trailing 0s
without assuming that block durations were always multiples of hours.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to model.Duration
for formatting of the label value in 8fa952c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, this can prove useful. My comments are just non-blocking nits (although I'd like to see duration formatting improved).
Note that store-gateways don't load all 2h blocks, and ignore blocks that are too new (up to 10h by default, iirc), but I don't think that diminishes usefulness this metric. If store-gateways have too many 2h blocks, then something is definitely wrong.
pkg/storegateway/bucket_stores.go
Outdated
u.storesMu.RLock() | ||
|
||
for _, store := range u.stores { | ||
count += store.Stats().BlocksLoaded | ||
stats := store.Stats(u.cfg.TSDB.BlockRanges) | ||
for d, n := range stats.BlocksLoaded { | ||
loaded[d] += n | ||
total += n | ||
} | ||
} | ||
|
||
u.storesMu.RUnlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super-nit: Can we extract this into a function and use defer
for unlock? It makes it easier to see and verify that locking is correct here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll see if I can split this out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the summing of counts and locking to a separate method in 8fa952c
Signed-off-by: Nick Pillitteri <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you.
for _, store := range u.stores { | ||
count += store.Stats().BlocksLoaded | ||
stats := store.Stats(u.cfg.TSDB.BlockRanges) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
u.cfg.TSDB
was a config supposed to be used only by the ingester. This means that it's supposed to only have 1 range (the 2h one). I think what you want here are the compactor ranges, not the ingester ones.
However, reading a compactor config looks a bad smell to me, so I'm wondering if we could just keep it easy and have hardcoded buckets instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make that change. Is your preference for using the compactor config or hard-coding the expected buckets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a hard-coded list of durations instead of requiring the store-gateway to be aware of compactor-specific configuration. Related #6074 Signed-off-by: Nick Pillitteri <[email protected]>
Use a hard-coded list of durations instead of requiring the store-gateway to be aware of compactor-specific configuration. Related #6074 Signed-off-by: Nick Pillitteri <[email protected]>
What this PR does
Emit metrics about the number of blocks a store-gateway has loaded based on the duration of the blocks (bucketed into the block durations the compactor is configured to produce).
A common symptom of the compactor failing is store-gateways loading more 2h blocks that usual and degrading performance of the read path. Counting the number of blocks by duration allows us to detect when this happens.
Which issue(s) this PR fixes or relates to
N/A
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]