Add new metric in ruler to track rules that fetched no series #5925
Conversation
(force-pushed from 0a6c356 to ec76e02)
LGTM, thanks! Please make sure to update GEM as well after this change.
Query stats are the default on the query-frontend and I'd expect them to be the default on the ruler (we should change that if they aren't). We definitely run with them enabled in Grafana Cloud. I think it's fine to write tests that assume query stats are enabled as long as they're explicit about that.
(force-pushed from ec76e02 to f1ffcdd)
Thanks for the review! I updated a basic test, but I couldn't find any example test in the ruler that pushes metrics and queries them to test the rules or ruler metrics (I only saw a mock pusher).
The tests that do actual querying, evaluation and assertion on metrics are integration tests. You can see an example here.
Can you elaborate a bit more on what kind of things you're hoping to catch with this? I'd like to understand why we wouldn't (for example) put this information on the rules API directly, so that we can surface it back to the user.
In Prometheus, we have `rule_group_last_evaluation_samples`, which to my eyes looks like effectively the same thing (a query that returned no series would produce no samples). Could it be that we don't have it in the ruler, and that's what we need to do?
I verified this, and it seems like we have:
Thanks for the link to those tests! I will look into those if we agree on this approach.

The following is based on my understanding of how the ruler and samples work, but I've never worked with the ruler before, so please correct any misconceptions. There are cases where rules that should have fired didn't because the data was late: at the time the rule was evaluated it fetched no series, since there were no samples in the requested time range. If you query later, after the data has been added, it's difficult to see why the rule didn't fire. With this new metric, you'd see the counter increase whenever this happens, which helps explain why the alert didn't fire. And if you add an alert on this metric, you'll know whenever the data for rules is late, so you can investigate by looking at the logs and consider expanding the time range. So a metric would help more than a new API for this use case.

Yes, I've seen that metric; the names not aligning isn't an issue, and it seems to work in much the same way the Prometheus one does. However, it tracks something different, which doesn't help with the problem we're trying to address. It's true that a query that returns no series would produce no samples, but a query that results in no samples may or may not have fetched any series. For example, imagine we have a gauge called ...
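To make that distinction concrete, here is a purely hypothetical sketch of a rule group; the metric names, label values, and thresholds below are made up for illustration and are not taken from this PR:

```yaml
groups:
  - name: example
    rules:
      # No series match this selector in the queried time range (e.g. the data
      # arrived late), so the query fetches zero series and the new counter
      # would increase.
      - record: example:requests:no_series_fetched
        expr: sum(rate(app_requests_total{cluster="late-cluster"}[5m]))
      # app_queue_length exists and its series are fetched, but the comparison
      # drops every sample, so the result is empty even though series were
      # fetched; the new counter would not increase.
      - record: example:queue:no_samples_returned
        expr: app_queue_length > 1e9
```

Both rules produce no samples, but only the first one would show up in the new zero-fetched-series counter.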
Apologies for my late response, and thank you very much for the detailed explanation - I think it makes perfect sense! This is something that we typically monitor at the remote-write level (we would check if there was indeed a remote-write delay in the original sender), but in scenarios where we don't control the client I think this can be very cumbersome. As such, my counter-proposal is that we do two things (you can pick and choose):
(force-pushed from f1ffcdd to c016d61)
No worries, funny that I just saw your message right after I made some changes:
Please ignore me! I finally had some time to take a deeper look at both Prometheus and this PR, and I get it now and love the idea of what you're doing. Prometheus takes a completely different approach, and the wrapping of the query func we do here is different enough that the two don't overlap in a clean way. Carry on with what you're doing!
Shall I merge this, @zenador?
Sure, we can merge if no one has any objections, thanks!
What this PR does
Adds a new per-tenant metric `cortex_ruler_queries_zero_fetched_series_total` in the ruler to track rules that fetched no series, so that users can add an alert on this metric to see if there is any late data affecting the rules, which could potentially result in alerts not firing when they should (we have had cases of that in the past). When they see this counter increase, they can search the logs for `fetched_series_count=0` to find the affected query. Since this metric relies on query stats, those need to be enabled before the metric is available, similar to `cortex_ruler_query_seconds_total`.
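As an illustration, a minimal sketch of the kind of alerting rule a user could add on this new metric; the alert name, time window, threshold, and the per-tenant `user` label are assumptions for the example and are not defined by this PR:

```yaml
groups:
  - name: ruler-data-freshness
    rules:
      - alert: RulerQueriesFetchedNoSeries
        # The counter increases whenever a rule evaluation fetched zero
        # series, which can indicate the data it depends on arrived late.
        expr: sum by (user) (increase(cortex_ruler_queries_zero_fetched_series_total[10m])) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Ruler queries for tenant {{ $labels.user }} fetched no series; check logs for fetched_series_count=0"
```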
Haven't added tests yet as I'm not sure if we want to take this approach (is enabling query stats too heavy? I'm not sure how else to track this), but if we decide to go with it then I'll add them.

It's been manually tested: I added some rules locally that fetch zero series on purpose and they incremented the counter accordingly, and also some rules that return no data but still fetch some series, and those correctly do not increment the counter.

Which issue(s) this PR fixes or relates to
Fixes https://github.com/grafana/mimir-squad/issues/2237
Checklist
`CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`