Add alerts for Rules and Alerting SLOs #300
slo.errorburn({
  alertName: 'APIAlertmanagerAvailabilityErrorBudgetBurning',
  alertMessage: 'API Thanos Rule failing to send alerts to Alertmanager and is burning too much error budget to guarantee availability SLOs',
  metric: 'thanos_alert_sender_alerts_dropped_total',
I wonder if we need this alert: we already seem to use the Thanos mixin, which contains the ThanosRuleSenderIsFailingAlerts alert, though I think it is not written specifically for our SLOs.
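For context on the distinction raised here, an error-budget burn alert fires on how fast the SLO's budget is being consumed rather than on any failure at all. Below is a minimal sketch of that arithmetic, not the actual `slo.errorburn` implementation: the function names, the window pairs, and the burn-rate thresholds (14.4 and 6, taken from the common multi-window approach) are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Window:
    """A long/short window pair with the burn rate that should trigger an alert."""
    long: str
    short: str
    burn_rate: float


# Hypothetical window pairs and thresholds; the real library may use others.
WINDOWS = (
    Window(long="1h", short="5m", burn_rate=14.4),
    Window(long="6h", short="30m", burn_rate=6.0),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than allowed the error budget is being spent.

    error_ratio: fraction of failed events in a window (e.g. dropped alerts / sent alerts).
    slo_target:  availability objective, e.g. 0.999.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def is_burning(error_ratios: Mapping[str, float], slo_target: float) -> bool:
    """Fire only if both the long and the short window exceed the threshold.

    error_ratios maps a window name ("1h", "5m", ...) to its observed error ratio.
    Requiring the short window as well lets the alert stop firing soon after recovery.
    """
    for w in WINDOWS:
        long_rate = burn_rate(error_ratios[w.long], slo_target)
        short_rate = burn_rate(error_ratios[w.short], slo_target)
        if long_rate > w.burn_rate and short_rate > w.burn_rate:
            return True
    return False
```

With a 99.9% target, a sustained 2% error ratio burns the budget 20 times faster than allowed and would fire, while a brief spike that has already recovered in the short window would not. A plain "sender is failing" alert has no notion of the target, which is why it is not a direct substitute for an SLO-specific one.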
Thanks! LGTM! 🌟
One minor nit
Signed-off-by: Jéssica Lins <[email protected]>
…alert Signed-off-by: Jéssica Lins <[email protected]>
Signed-off-by: Jéssica Lins <[email protected]>
Force-pushed from 1e28507 to 15078dd
Thanks for the awesome work!
Signed-off-by: Jéssica Lins <[email protected]>
* Fix memcached dashboard + Updated Telemeter allow-list (#321)
* Fix memcached overview dashboard
  Signed-off-by: Matej Gera <[email protected]>
* Sync whitelist
  Signed-off-by: Matej Gera <[email protected]>
  Signed-off-by: Matej Gera <[email protected]>
* Rename thanos-rule-syncer metrics port (#322)
  Rename thanos-rule-syncer metrics port
  Signed-off-by: Jéssica Lins <[email protected]>
* Add Rules and Alertmanager SLOs (#298)
* Add Rules & alerting SLOs panels
* Add Telemeter SLOs
* Remove unnecessary comment
* Use rate for SLO query about alerts being delivered to upstream targets
* Update rules sync SLO query
* Add alerts for Rules and Alerting SLOs (#300)
* Add alerts for Rules and Alerting SLOs
* Add container selector to APIRulesSyncAvailabilityErrorBudgetBurning alert
* Fix telemeter namespace, update sync SLO query in alerts
* Refactor instanceNamespace function
  Signed-off-by: Jéssica Lins <[email protected]>
* SLOs: Prune unsupported labels (#325)
  SLOs: Prune unsupported labels (#325)
  Signed-off-by: Jéssica Lins <[email protected]>
* [observatorium-logs] Increase the querier timeout from 3m to 6m (#330)
* Fix SLO alerting for metrics (#328)
* Fix SLO alerting for metrics
  Signed-off-by: Saswata Mukherjee <[email protected]>
* Add back code labels to exclude 4xx
  Signed-off-by: Saswata Mukherjee <[email protected]>
* Add comment about fork
  Signed-off-by: Saswata Mukherjee <[email protected]>
  Signed-off-by: Saswata Mukherjee <[email protected]>
* Add OSD to rules-obsctl-reloader (#329)
  Signed-off-by: Saswata Mukherjee <[email protected]>
  Signed-off-by: Saswata Mukherjee <[email protected]>
* Add Loki ruler and static rules for tenant OCM (#331)
* Add runbooks for Rules & Alerting SLOs alerts (#327)
  Signed-off-by: Jéssica Lins <[email protected]>
* Fix loki ruler memory requests (#332)
* Fix ocm panic logs-based alert (#333)
* Remove recycle annotations for loki rules (#334)
* Add staging test alerts for rhobs logs (#335)
* Fix alertmanager discovery for logs ruler (#338)
* Use alertmanager v1 api for Loki ruler (#339)
* Update Telemeter rules (#340)
  Signed-off-by: Douglas Camata <[email protected]>
* Enable hashing algorithm for receive to be set via parameter
  Note, this is a breaking change which requires the use of Thanos >= v0.28.0
  Signed-off-by: Matej Gera <[email protected]>
  Signed-off-by: Jéssica Lins <[email protected]>
  Signed-off-by: Saswata Mukherjee <[email protected]>
  Signed-off-by: Douglas Camata <[email protected]>
  Co-authored-by: Matej Gera <[email protected]>
  Co-authored-by: Jéssica Lins <[email protected]>
  Co-authored-by: Periklis Tsirakidis <[email protected]>
  Co-authored-by: Saswata Mukherjee <[email protected]>
  Co-authored-by: Douglas Camata <[email protected]>
* Add alerts for Rules and Alerting SLOs
* Add container selector to APIRulesSyncAvailabilityErrorBudgetBurning alert
* Fix telemeter namespace, update sync SLO query in alerts
* Refactor instanceNamespace function
  Signed-off-by: Jéssica Lins <[email protected]>
* Add Rules and Alertmanager SLOs (#298)
* Add Rules & alerting SLOs panels
* Add Telemeter SLOs
* Remove unnecessary comment
* Use rate for SLO query about alerts being delivered to upstream targets
* Update rules sync SLO query
* Add alerts for Rules and Alerting SLOs (#300)
* Add alerts for Rules and Alerting SLOs
* Add container selector to APIRulesSyncAvailabilityErrorBudgetBurning alert
* Fix telemeter namespace, update sync SLO query in alerts
* Refactor instanceNamespace function
  Signed-off-by: Jéssica Lins <[email protected]>
* SLOs: Prune unsupported labels (#325)
  SLOs: Prune unsupported labels (#325)
  Signed-off-by: Jéssica Lins <[email protected]>
* [observatorium-logs] Increase the querier timeout from 3m to 6m (#330)
* Fix SLO alerting for metrics (#328)
* Fix SLO alerting for metrics
  Signed-off-by: Saswata Mukherjee <[email protected]>
* Add back code labels to exclude 4xx
  Signed-off-by: Saswata Mukherjee <[email protected]>
* Add comment about fork
  Signed-off-by: Saswata Mukherjee <[email protected]>
  Signed-off-by: Saswata Mukherjee <[email protected]>
* Add OSD to rules-obsctl-reloader (#329)
  Signed-off-by: Saswata Mukherjee <[email protected]>
  Signed-off-by: Saswata Mukherjee <[email protected]>
* Add Loki ruler and static rules for tenant OCM (#331)
* Fix loki ruler memory requests (#332)
* Fix ocm panic logs-based alert (#333)
* Remove recycle annotations for loki rules (#334)
* Add staging test alerts for rhobs logs (#335)
* Fix alertmanager discovery for logs ruler (#338)
* Use alertmanager v1 api for Loki ruler (#339)
* Update Telemeter rules (#340)
  Signed-off-by: Douglas Camata <[email protected]>
* Fix ARM64 (M1 Pro) support (#344)
* Update Bingo
  Signed-off-by: Douglas Camata <[email protected]>
* Update Bingo deps for aarm64 support
  Signed-off-by: Douglas Camata <[email protected]>
  Signed-off-by: Douglas Camata <[email protected]>
* Add support for Loki ruler to manage rules on object storage (#345)
* Add template for Loki ruler CRDs (#349)
* Add /rules/raw row to Observatorium API dashboard + refactor (#337)
* Refactor titleRow
* Refactor query row
* Refactor query_range row
* Start refactor RED for all row
* Refactor all query err
* Refactor query duration
* Refactor titleRow
* Finish refactor
* Start to add /rules/raw row
* Fix positioning, finish rules row
* Remove unused var
* sum by code for errors panel
* Unify aliasColors
* Fix errors query, use scalar
  Signed-off-by: Jéssica Lins <[email protected]>
* Fix Updating dashboards README section (#351)
  Signed-off-by: Jéssica Lins <[email protected]>
* Add obsctl-reloader support for Loki alerting- and recordingrules (#352)
* Add cluster-role observatorium-logs-edit for dedicated-admins (#353)
* Add dedicated-admin label for obs-logs-edit clusterrole (#354)
* Enable hashing algorithm for receive to be set via parameter
  Note, this is a breaking change which requires the use of Thanos >= v0.28.0
* Ketama sync to main (#347)
* Update README (#343)
* Update README
  - Added instructions for macOS
  - Fixed jsonnet deps command
  Signed-off-by: Douglas Camata <[email protected]>
* Update OpenShift templates doc link
  Signed-off-by: Douglas Camata <[email protected]>
  Signed-off-by: Douglas Camata <[email protected]>
* Disable compression on receive (#346)
  Signed-off-by: Matej Gera <[email protected]>
  Signed-off-by: Matej Gera <[email protected]>
  Signed-off-by: Douglas Camata <[email protected]>
  Signed-off-by: Matej Gera <[email protected]>
  Co-authored-by: Douglas Camata <[email protected]>
  Signed-off-by: Jéssica Lins <[email protected]>
  Signed-off-by: Saswata Mukherjee <[email protected]>
  Signed-off-by: Douglas Camata <[email protected]>
  Signed-off-by: Matej Gera <[email protected]>
  Co-authored-by: Jéssica Lins <[email protected]>
  Co-authored-by: Periklis Tsirakidis <[email protected]>
  Co-authored-by: Saswata Mukherjee <[email protected]>
  Co-authored-by: Douglas Camata <[email protected]>
  Co-authored-by: Matej Gera <[email protected]>
Adding alerts for the SLOs defined in #298.
I think the selectors may change slightly as we update #298, but I wanted to open this for review already.
I'll add runbooks in a separate PR, to make review easier.
I think we can merge this to stage after #298, test it there, and deploy it to production only once all runbooks are in place and the alerts are working as expected.
Signed-off-by: Jéssica Lins [email protected]