
Add new metric in ruler to track rules that fetched no series #5925

Merged · 4 commits into main · Sep 29, 2023

Conversation

@zenador (Contributor) commented Sep 5, 2023

What this PR does

Adds a new per-tenant metric cortex_ruler_queries_zero_fetched_series_total in the ruler to track rules that fetched no series, so that users can add an alert on this metric to see if there is any late data affecting the rules, which could potentially result in alerts not firing when they should (we have had cases of that in the past). When they see this counter increase, they can search the logs for fetched_series_count=0 to find the affected query. Since this metric relies on query stats, query stats need to be enabled before the metric is available, similar to cortex_ruler_query_seconds_total.
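For illustration, an alert on the new counter might look like the sketch below; the group name, alert name, threshold, duration, and the per-tenant user label are assumptions for this example, not part of this PR:

```yaml
groups:
  - name: ruler-health
    rules:
      # Hypothetical alert: fires when any rule evaluation for a tenant
      # fetched zero series over the last 5 minutes, which may indicate
      # late or missing data affecting rule evaluation.
      - alert: RulerRulesFetchedNoSeries
        expr: sum by (user) (rate(cortex_ruler_queries_zero_fetched_series_total[5m])) > 0
        for: 10m
        annotations:
          summary: "Ruler rules for tenant {{ $labels.user }} fetched no series; search the ruler logs for fetched_series_count=0."
```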

Haven't added tests yet as I'm not sure if we want to take this approach (is enabling query stats too heavy? I'm not sure how else to track this), but if we decide to go with it then I'll add them. It's been manually tested: I added some rules locally that fetch zero series on purpose, and they incremented the counter accordingly; I also added some rules that return no data but still fetch some series, and they correctly do not increment the counter.

Which issue(s) this PR fixes or relates to

Fixes https://github.com/grafana/mimir-squad/issues/2237

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@zenador force-pushed the zenador/new-ruler-metric-for-zero-series branch from 0a6c356 to ec76e02 on September 5, 2023 at 10:26
@zenador marked this pull request as ready for review on September 6, 2023 at 17:40
@zenador requested review from a team as code owners on September 6, 2023 at 17:40
@56quarters (Contributor) left a review

LGTM, thanks! Please make sure to update GEM as well after this change.

@56quarters (Contributor) commented

> Haven't added tests yet as I'm not sure if we want to take this approach (is enabling query stats too heavy? I'm not sure how else to track this), but if we decide to go with it then I'll add them. It's been manually tested: I added some rules locally that fetch zero series on purpose, and they incremented the counter accordingly; I also added some rules that return no data but still fetch some series, and they correctly do not increment the counter.

Query stats are the default on the query-frontend and I'd expect them to be the default on the ruler (we should change that if they aren't). We definitely run with them enabled in Grafana Cloud. I think it's fine to write tests that assume query stats are enabled as long as they're explicit about that.
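For reference, a minimal sketch of what enabling query stats on the ruler might look like in the Mimir configuration; the exact YAML key here is an assumption, so check the configuration reference for your Mimir version:

```yaml
# Hypothetical configuration snippet: enable per-query statistics on
# the ruler so that metrics derived from query stats (such as
# cortex_ruler_query_seconds_total and the new
# cortex_ruler_queries_zero_fetched_series_total) are reported.
ruler:
  query_stats_enabled: true
```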

@zenador force-pushed the zenador/new-ruler-metric-for-zero-series branch from ec76e02 to f1ffcdd on September 8, 2023 at 01:04
@zenador (Contributor, Author) commented Sep 8, 2023

Thanks for the review! I updated a basic test, but I couldn't find any example test in the ruler that pushes metrics and queries them to test the rules or ruler metrics (I only saw a mock pusher).

@gotjosh (Contributor) left a review

The tests that do actual querying, evaluation and assertion on metrics are integration tests. You can see an example here.

Can you elaborate a bit more on what kind of things you're hoping to catch with this? I'd like to understand why not (for example) put this information on the rules API directly so that we can surface it back to the user.

In Prometheus, we have rule_group_last_evaluation_samples, which in my eyes is effectively the same thing (a query that returned no series would produce no samples). Could it be that we don't have it in the ruler, and that's what we need to do?

@gotjosh (Contributor) commented Sep 8, 2023

I verified this, and it seems we have cortex_prometheus_last_evaluation_samples (it even includes the rule group, which is a suggestion I wanted to make). Now, why the names don't align is probably something we need to fix (and look out for any breaking changes).

@zenador (Contributor, Author) commented Sep 8, 2023

Thanks for the link to those tests! Will look into those if we agree on this approach.

The following is based on my understanding of how the ruler and samples work, but I've never worked with the ruler before so please correct any misconceptions:

There are cases where rules that should have fired didn't, because the data was late: at the time the rule was evaluated, it didn't actually fetch any series as there were no samples in the requested time range. If you query later, after the data has been added, it's difficult to see why the rule didn't fire. With this new metric, you'd see an increase in the counter whenever this happens, which helps explain why the alert didn't fire. And if you add an alert on this metric, you'll know whenever the data for rules is late, so you can investigate by looking at the logs and consider expanding the time range. For this use case, a metric helps more than a new API would.

Yes, I've seen that metric; the names not aligning isn't an issue, and it seems to work much the same way the Prometheus one does. However, it tracks something different, which does not help with the problem we're trying to address. It's true that a query that returns no series produces no samples, but a query that produces no samples may or may not have fetched any series. For example, imagine we have a gauge called query_in_progress across multiple components, and we want an alert if we have too many queries in progress at once, so we alert on sum(query_in_progress) > 100. If the total sum is less than 100, the query returns no samples and the alert does not fire, which is what we want, but it still fetches the various query_in_progress series. But if the total sum is more than 100 and the data is late, the query also returns no samples and the alert does not fire, which is the problem, and in this case it fetches no series. So tracking zero series is different from tracking zero samples.
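To make that concrete, here is a minimal sketch of the hypothetical rule from that example, with comments marking which case fetches series and which does not:

```yaml
groups:
  - name: example
    rules:
      # On-time data: the query fetches the query_in_progress series; if
      # the sum is at most 100 the comparison filters everything out, so
      # the rule returns no samples and the alert correctly stays quiet.
      # Late data: no samples exist in the evaluated range, so the query
      # fetches zero series, returns no samples, and the alert stays
      # quiet even when it should have fired; only a zero-fetched-series
      # counter distinguishes this case from the healthy one.
      - alert: TooManyQueriesInProgress
        expr: sum(query_in_progress) > 100
```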

@gotjosh (Contributor) commented Sep 14, 2023

Apologies for my late response, and thank you very much for the detailed explanation. I think it makes perfect sense!

This is something that we typically monitor at the remote-write level (we would check whether there was indeed a remote-write delay in the original sender), but in scenarios where we don't control the client I think this can be very cumbersome.

As such, my counter-proposal is that we do two things (you can pick and choose):

  1. Upstream can benefit from this, as I can see this is a valid use case for Prometheus as well, and I can help with reviews and getting it merged. Specifically, we can measure the same thing here and here.

  2. For parity, I think we should have cortex_prometheus_last_evaluation_samples alongside a cortex_prometheus_last_evaluation_query_samples, since for the same rule evaluation they'd measure both the query and the output, making them easy to discover together. My main reason is that I don't think a counter is useful here, as I'd care more about the exact number of series I'm touching than about the rate.

@zenador force-pushed the zenador/new-ruler-metric-for-zero-series branch from f1ffcdd to c016d61 on September 14, 2023 at 11:45
@zenador (Contributor, Author) commented Sep 14, 2023

No worries, funny that I just saw your message right after I made some changes:

  • Don't increment the counter for queries with errors
  • Update the integration test (distinguishing between no fetched series and no samples)

On your two suggestions:

  1. I would have done this in Prometheus if I knew how to, but the current method relies on query stats, which is a Mimir-only feature, so Prometheus probably doesn't have something similar. Do you know if there is one, or another way we can track whether no series were fetched? Upstreaming the whole query stats feature seems like it'd be a big change.
  2. Sorry, I don't get what you mean by this. Also, are you talking about samples or fetched series?

@gotjosh (Contributor) commented Sep 25, 2023

> Sorry, I don't get what you mean by this. Also, are you talking about samples or fetched series?

Please ignore me! I finally had some time to take a deeper look at both Prometheus and this PR; I get it now and love the idea of what you're doing.

Prometheus takes a completely different approach, and the wrapping of the query func we do here means the two don't overlap in a clean way. Carry on with what you're doing!

@56quarters (Contributor) commented

Shall I merge this, @zenador?

@zenador (Contributor, Author) commented Sep 29, 2023

Sure, we can merge if no one has any objections, thanks!

@56quarters merged commit aee3889 into main on Sep 29, 2023
@56quarters deleted the zenador/new-ruler-metric-for-zero-series branch on September 29, 2023 at 21:06