-
Notifications
You must be signed in to change notification settings - Fork 532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tempo-mixin] Increase notification delay for TempoTenantIndexTooOld #3115
Conversation
@@ -143,7 +143,7 @@ | |||
expr: ||| | |||
max by (%s) (tempodb_blocklist_tenant_index_age_seconds{}) > %s | |||
||| % [$._config.group_by_tenant, $._config.alerts.max_tenant_index_age_seconds], | |||
'for': '5m', | |||
'for': '20m', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe 700s is more reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: why 700s? also why 20m?
I was assuming that tempo operators can use max_tenant_index_age_seconds
to tune this alert? and don't have to bump for
?.
I am not aginst the chage, just want to know the reason for the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good questions. The PR is because I've been noticing that this flaps quite a bit, causing alert noise when no action is required to being it back into desired state. Perhaps there is another approach we can take here, or tweak to be slightly more targeted. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh 20m because I was just rusing up a PR after getting paged. The 700s is 2x the polling cycle (600s) plus 100s for good measure in case we're on a boundary. But this is based on default values.
The tenant index deletion was originally put in as TCO win, but did not have the desired effect and surfaced other issues in the system. Related grafana#2678 Related grafana#2754 Related grafana#2781 Related grafana#2878 Related grafana#3115 Related grafana#3223 Due to the number of issues here, and causing considerable noise on the pager, perhaps the right thing to do is back out the tenant deletion. Raising here for discussion.
What this PR does:
Be a little more forgiving about the sensitivity of
TempoTenantIndexTooOld
.Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]