Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Count the number of blocks loaded by a store-gateway by duration #6074

Merged
merged 3 commits into from
Sep 21, 2023

Conversation

56quarters
Copy link
Contributor

@56quarters 56quarters commented Sep 20, 2023

What this PR does

Emit metrics about the number of blocks a store-gateway has loaded based on the duration of the blocks (bucketed into the block durations the compactor is configured to produce).

A common symptom of the compactor failing is store-gateways loading more 2h blocks that usual and degrading performance of the read path. Counting the number of blocks by duration allows us to detect when this happens.

Which issue(s) this PR fixes or relates to

N/A

Checklist

  • Tests updated
  • [NA] Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@56quarters 56quarters force-pushed the 56quarters/sg-block-count branch from 319e77f to ab74a1e Compare September 20, 2023 15:00
@56quarters
Copy link
Contributor Author

Example of the metrics produced:

# HELP cortex_bucket_store_blocks_loaded Number of currently loaded blocks.
# TYPE cortex_bucket_store_blocks_loaded gauge
cortex_bucket_store_blocks_loaded{component="store-gateway"} 34
# HELP cortex_bucket_store_blocks_loaded_by_duration Number of currently loaded blocks, bucketed by block duration.
# TYPE cortex_bucket_store_blocks_loaded_by_duration gauge
cortex_bucket_store_blocks_loaded_by_duration{component="store-gateway",duration="12h0m0s"} 16
cortex_bucket_store_blocks_loaded_by_duration{component="store-gateway",duration="24h0m0s"} 1
cortex_bucket_store_blocks_loaded_by_duration{component="store-gateway",duration="2h0m0s"} 17

@56quarters 56quarters force-pushed the 56quarters/sg-block-count branch from ab74a1e to 94d8f13 Compare September 20, 2023 19:42
@56quarters 56quarters marked this pull request as ready for review September 20, 2023 19:47
@56quarters 56quarters requested a review from a team as a code owner September 20, 2023 19:47
Emit metrics about the number of blocks a store-gateway has loaded based on
the duration of the blocks (bucketed into the block durations the compactor
is configured to produce).

A common symptom of the compactor failing is store-gateways loading more 2h
blocks that usual and degrading performance of the read path. Counting the
number of blocks by duration allows us to detect when this happens.

Signed-off-by: Nick Pillitteri <[email protected]>
@56quarters 56quarters force-pushed the 56quarters/sg-block-count branch from 94d8f13 to 67de993 Compare September 20, 2023 19:49

return stats
}

func bucketBlockDuration(buckets tsdb.DurationList, duration time.Duration) time.Duration {
for _, d := range buckets {
if duration <= d {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocks are sometimes over by a few milliseconds which would put 2:00:00.1 blocks in the 12hr bucket, do we want to add a little jitter to this duration?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After talking to @56quarters offline and inspecting some systems blocks that are MS over the time limit are very rare, so I don't think we need jitter here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. Based on this code I don't think it should be possible that blocks have more than 2h of samples. I've double checked a few production cells and couldn't find instances of blocks exceeding their supposed duration.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocks are sometimes over by a few milliseconds which would put 2:00:00.1 blocks in the 12hr bucket, do we want to add a little jitter to this duration?

Do we have blocks with time range over the expected one? I'm aware a block could be smaller (e.g. 1:59:59) but not bigger. Can you follow up on this, please?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocks are sometimes over by a few milliseconds which would put 2:00:00.1 blocks in the 12hr bucket, do we want to add a little jitter to this duration?

Do we have blocks with time range over the expected one?

We don't, at least I've never seen such block myself.

Signed-off-by: Nick Pillitteri <[email protected]>
Copy link
Contributor

@treid314 treid314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome for finding compactor issues! Thank you!

return float64(count)
metrics <- prometheus.MustNewConstMetric(u.blocksLoaded, prometheus.GaugeValue, float64(total))
for d, n := range loaded {
metrics <- prometheus.MustNewConstMetric(u.blocksLoadedByDuration, prometheus.GaugeValue, float64(n), d.String())
Copy link
Member

@pstibrany pstibrany Sep 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we convert time.Duration to model.Duration and use its String method? It will remove unnecessary 0m0s suffixes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively: we can change duration to hours, to avoid formatting 24h as 1d when using model.Duration. (Although I don't think that's a big deal)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll convert to model.Duration. I was trying to think of a good way to remove the trailing 0s without assuming that block durations were always multiples of hours.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to model.Duration for formatting of the label value in 8fa952c

Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this can prove useful. My comments are just non-blocking nits (although I'd like to see duration formatting improved).

Note that store-gateways don't load all 2h blocks, and ignore blocks that are too new (up to 10h by default, iirc), but I don't think that diminishes usefulness this metric. If store-gateways have too many 2h blocks, then something is definitely wrong.

Comment on lines 571 to 581
u.storesMu.RLock()

for _, store := range u.stores {
count += store.Stats().BlocksLoaded
stats := store.Stats(u.cfg.TSDB.BlockRanges)
for d, n := range stats.BlocksLoaded {
loaded[d] += n
total += n
}
}

u.storesMu.RUnlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super-nit: Can we extract this into a function and use defer for unlock? It makes it easier to see and verify that locking is correct here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll see if I can split this out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the summing of counts and locking to a separate method in 8fa952c

Signed-off-by: Nick Pillitteri <[email protected]>
Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.

@56quarters 56quarters merged commit a231def into main Sep 21, 2023
@56quarters 56quarters deleted the 56quarters/sg-block-count branch September 21, 2023 15:13
for _, store := range u.stores {
count += store.Stats().BlocksLoaded
stats := store.Stats(u.cfg.TSDB.BlockRanges)
Copy link
Collaborator

@pracucci pracucci Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

u.cfg.TSDB was a config supposed to be used only by the ingester. This means that it's supposed to only have 1 range (the 2h one). I think what you want here are the compactor ranges, not the ingester ones.

However, reading a compactor config looks a bad smell to me, so I'm wondering if we could just keep it easy and have hardcoded buckets instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make that change. Is your preference for using the compactor config or hard-coding the expected buckets?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

56quarters added a commit that referenced this pull request Sep 25, 2023
Use a hard-coded list of durations instead of requiring the store-gateway
to be aware of compactor-specific configuration.

Related #6074

Signed-off-by: Nick Pillitteri <[email protected]>
56quarters added a commit that referenced this pull request Sep 25, 2023
Use a hard-coded list of durations instead of requiring the store-gateway
to be aware of compactor-specific configuration.

Related #6074

Signed-off-by: Nick Pillitteri <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants