
Add new metric in ruler to track rules that fetched no series #5925

Merged · 4 commits into main · Sep 29, 2023

Conversation

@zenador (Contributor) commented Sep 5, 2023

What this PR does

Adds a new per-tenant metric cortex_ruler_queries_zero_fetched_series_total in the ruler to track rules that fetched no series, so that users can add an alert on this metric to see if there is any late data affecting the rules, which could potentially result in alerts not firing when they should (we have had cases of that in the past). When they see this counter increase, they can search the logs for fetched_series_count=0 to find the affected query. Since this metric relies on query stats, query stats need to be enabled before the metric is available, similar to cortex_ruler_query_seconds_total.
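For illustration, an alert on the new counter might look like the sketch below; the group name, alert name, threshold, duration, and the per-tenant user label are assumptions for this example, not part of this PR:

```yaml
groups:
  - name: ruler-health
    rules:
      # Hypothetical alert: fires when any rule evaluation for a tenant
      # fetched zero series over the last 5 minutes, which may indicate
      # late or missing data affecting rule evaluation.
      - alert: RulerRulesFetchedNoSeries
        expr: sum by (user) (rate(cortex_ruler_queries_zero_fetched_series_total[5m])) > 0
        for: 10m
        annotations:
          summary: "Ruler rules for tenant {{ $labels.user }} fetched no series; search the ruler logs for fetched_series_count=0."
```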

Haven't added tests yet as I'm not sure if we want to take this approach (is enabling query stats too heavy? I'm not sure how else to track this), but if we decide to go with it then I'll add them. It's been manually tested: I added some rules locally that fetch zero series on purpose, and they incremented the counter accordingly; I also added some rules that return no data but still fetch some series, and they correctly do not increment the counter.

Which issue(s) this PR fixes or relates to

Fixes https://github.com/grafana/mimir-squad/issues/2237

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@zenador force-pushed the zenador/new-ruler-metric-for-zero-series branch from 0a6c356 to ec76e02 on September 5, 2023 at 10:26
@zenador marked this pull request as ready for review on September 6, 2023 at 17:40
@zenador requested review from a team as code owners on September 6, 2023 at 17:40
@56quarters (Contributor) left a review

LGTM, thanks! Please make sure to update GEM as well after this change.

@56quarters (Contributor) commented

> Haven't added tests yet as I'm not sure if we want to take this approach (is enabling query stats too heavy? I'm not sure how else to track this), but if we decide to go with it then I'll add them. It's been manually tested: I added some rules locally that fetch zero series on purpose, and they incremented the counter accordingly; I also added some rules that return no data but still fetch some series, and they correctly do not increment the counter.

Query stats are the default on the query-frontend and I'd expect them to be the default on the ruler (we should change that if they aren't). We definitely run with them enabled in Grafana Cloud. I think it's fine to write tests that assume query stats are enabled as long as they're explicit about that.
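For reference, a minimal sketch of what enabling query stats on the ruler might look like in the Mimir configuration; the exact YAML key here is an assumption, so check the configuration reference for your Mimir version:

```yaml
# Hypothetical configuration snippet: enable per-query statistics on
# the ruler so that metrics derived from query stats (such as
# cortex_ruler_query_seconds_total and the new
# cortex_ruler_queries_zero_fetched_series_total) are reported.
ruler:
  query_stats_enabled: true
```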

@zenador force-pushed the zenador/new-ruler-metric-for-zero-series branch from ec76e02 to f1ffcdd on September 8, 2023 at 01:04
@zenador (Contributor, Author) commented Sep 8, 2023

Thanks for the review! I updated a basic test, but I couldn't find any example test in the ruler that pushes metrics and queries them to test the rules or ruler metrics (I only saw a mock pusher).

@gotjosh (Contributor) left a review

The tests that do actual querying, evaluation and assertion on metrics are integration tests. You can see an example here.

Can you elaborate a bit more on what kind of things you're hoping to catch with this? I'd like to understand why not (for example) put this information on the rules API directly so that we can surface it back to the user.

In Prometheus, we have rule_group_last_evaluation_samples, which in my eyes is effectively the same thing (a query that returned no series would produce no samples). Could it be that we don't have it in the ruler, and that's what we need to do?

@gotjosh (Contributor) commented Sep 8, 2023

I verified this, and it seems we have cortex_prometheus_last_evaluation_samples (it even includes the rule group, which is a suggestion I wanted to make). Now, why the names don't align is probably something we need to fix (and look out for any breaking changes).

@zenador (Contributor, Author) commented Sep 8, 2023

Thanks for the link to those tests! Will look into those if we agree on this approach.

The following is based on my understanding of how the ruler and samples work, but I've never worked with the ruler before so please correct any misconceptions:

There are cases where rules that should have fired didn't, because the data was late: at the time the rule was evaluated, it didn't actually fetch any series as there were no samples in the requested time range. If you query later, after the data has been added, it's difficult to see why the rule didn't fire. With this new metric, you'd see an increase in the counter whenever this happens, which helps explain why the alert didn't fire. And if you add an alert on this metric, you'll know whenever the data for rules is late, so you can investigate by looking at the logs and consider expanding the time range. For this use case, a metric helps more than a new API would.

Yes, I've seen that metric; the names not aligning isn't an issue, and it seems to work much the same way the Prometheus one does. However, it tracks something different, which does not help with the problem we're trying to address. It's true that a query that returns no series produces no samples, but a query that produces no samples may or may not have fetched any series. For example, imagine we have a gauge called query_in_progress across multiple components, and we want an alert if we have too many queries in progress at once, so we alert on sum(query_in_progress) > 100. If the total sum is less than 100, the query returns no samples and the alert does not fire, which is what we want, but it still fetches the various query_in_progress series. But if the total sum is more than 100 and the data is late, the query also returns no samples and the alert does not fire, which is the problem, and in this case it fetches no series. So tracking zero series is different from tracking zero samples.
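To make that concrete, here is a minimal sketch of the hypothetical rule from that example, with comments marking which case fetches series and which does not:

```yaml
groups:
  - name: example
    rules:
      # On-time data: the query fetches the query_in_progress series; if
      # the sum is at most 100 the comparison filters everything out, so
      # the rule returns no samples and the alert correctly stays quiet.
      # Late data: no samples exist in the evaluated range, so the query
      # fetches zero series, returns no samples, and the alert stays
      # quiet even when it should have fired; only a zero-fetched-series
      # counter distinguishes this case from the healthy one.
      - alert: TooManyQueriesInProgress
        expr: sum(query_in_progress) > 100
```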

@gotjosh (Contributor) commented Sep 14, 2023

Apologies for my late response, and thank you very much for the detailed explanation. I think it makes perfect sense!

This is something that we typically monitor at the remote-write level (we would check whether there was indeed a remote-write delay in the original sender), but in scenarios where we don't control the client I think this can be very cumbersome.

As such, my counter-proposal is that we do two things (you can pick and choose):

  1. Upstream can benefit from this, as I can see this is a valid use case for Prometheus as well, and I can help with reviews and getting it merged. Specifically, we can measure the same thing here and here.

  2. For parity, I think we should have cortex_prometheus_last_evaluation_samples alongside a cortex_prometheus_last_evaluation_query_samples, since for the same rule evaluation they'd measure both the query and the output, making them easy to discover together. My main reason is that I don't think a counter is useful here, as I'd care more about the exact number of series I'm touching than about the rate.

@zenador force-pushed the zenador/new-ruler-metric-for-zero-series branch from f1ffcdd to c016d61 on September 14, 2023 at 11:45
@zenador (Contributor, Author) commented Sep 14, 2023

No worries, funny that I just saw your message right after I made some changes:

  • Don't increment the counter for queries with errors
  • Update the integration test (distinguishing between no fetched series and no samples)

On your two suggestions:

  1. I would have done this in Prometheus if I knew how to, but the current method relies on query stats, which is a Mimir-only feature, so Prometheus probably doesn't have something similar. Do you know if there is one, or another way we can track whether no series were fetched? Upstreaming the whole query stats feature seems like it'd be a big change.
  2. Sorry, I don't get what you mean by this. Also, are you talking about samples or fetched series?

@gotjosh (Contributor) commented Sep 25, 2023

> Sorry, I don't get what you mean by this. Also, are you talking about samples or fetched series?

Please ignore me! I finally had some time to take a deeper look at both Prometheus and this PR; I get it now and love the idea of what you're doing.

Prometheus takes a completely different approach, and the wrapping of the query func we do here means the two don't overlap in a clean way. Carry on with what you're doing!

@56quarters (Contributor) commented

Shall I merge this, @zenador?

@zenador (Contributor, Author) commented Sep 29, 2023

Sure, we can merge if no one has any objections, thanks!

@56quarters merged commit aee3889 into main on Sep 29, 2023
@56quarters deleted the zenador/new-ruler-metric-for-zero-series branch on September 29, 2023 at 21:06