Ketama sync (#342)
* Fix memcached dashboard + Updated Telemeter allow-list (#321)

* Fix memcached overview dashboard

Signed-off-by: Matej Gera <[email protected]>

* Sync whitelist

Signed-off-by: Matej Gera <[email protected]>

Signed-off-by: Matej Gera <[email protected]>

* Rename thanos-rule-syncer metrics port (#322)

Rename thanos-rule-syncer metrics port 

Signed-off-by: Jéssica Lins <[email protected]>

* Add Rules and Alertmanager SLOs (#298)

* Add Rules & alerting SLOs panels
* Add Telemeter SLOs
* Remove unnecessary comment
* Use rate for SLO query about alerts being delivered to upstream targets
* Update rules sync SLO query

* Add alerts for Rules and Alerting SLOs (#300)

* Add alerts for Rules and Alerting SLOs
* Add container selector to APIRulesSyncAvailabilityErrorBudgetBurning alert
* Fix telemeter namespace, update sync SLO query in alerts
* Refactor instanceNamespace function

Signed-off-by: Jéssica Lins <[email protected]>

* SLOs: Prune unsupported labels (#325)

SLOs: Prune unsupported labels (#325)

Signed-off-by: Jéssica Lins <[email protected]>

* [observatorium-logs] Increase the querier timeout from 3m to 6m (#330)

* Fix SLO alerting for metrics (#328)

* Fix SLO alerting for metrics

Signed-off-by: Saswata Mukherjee <[email protected]>

* Add back code labels to exclude 4xx

Signed-off-by: Saswata Mukherjee <[email protected]>

* Add comment about fork

Signed-off-by: Saswata Mukherjee <[email protected]>

Signed-off-by: Saswata Mukherjee <[email protected]>

* Add OSD to rules-obsctl-reloader (#329)

Signed-off-by: Saswata Mukherjee <[email protected]>

Signed-off-by: Saswata Mukherjee <[email protected]>

* Add Loki ruler and static rules for tenant OCM (#331)

* Add runbooks for Rules & Alerting SLOs alerts (#327)

Signed-off-by: Jéssica Lins <[email protected]>

* Fix loki ruler memory requests (#332)

* Fix ocm panic logs-based alert (#333)

* Remove recycle annotations for loki rules (#334)

* Add staging test alerts for rhobs logs (#335)

* Fix alertmanager discovery for logs ruler (#338)

* Use alertmanager v1 api for Loki ruler (#339)

* Update Telemeter rules (#340)

Signed-off-by: Douglas Camata <[email protected]>

* Enable hashing algorithm for receive to be set via parameter

Note: this is a breaking change that requires
Thanos >= v0.28.0.

Signed-off-by: Matej Gera <[email protected]>
Signed-off-by: Jéssica Lins <[email protected]>
Signed-off-by: Saswata Mukherjee <[email protected]>
Signed-off-by: Douglas Camata <[email protected]>
Co-authored-by: Matej Gera <[email protected]>
Co-authored-by: Jéssica Lins <[email protected]>
Co-authored-by: Periklis Tsirakidis <[email protected]>
Co-authored-by: Saswata Mukherjee <[email protected]>
Co-authored-by: Douglas Camata <[email protected]>
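
The commit title refers to ketama, a consistent-hashing scheme. As a rough illustration of why the choice of hashing algorithm matters when nodes are added to a ring, here is a toy sketch. It is not Thanos's implementation: the class name `HashRing`, the MD5-based point function, the 160 virtual nodes per node, and the `receive-N` node names are all assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, replicas=160):
        # Each node is placed on the ring `replicas` times to even out load.
        self._ring = sorted(
            (self._point(f"{node}-{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _point(key):
        # Use the first 8 bytes of the MD5 digest as a position on the ring.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def node_for(self, key):
        # Pick the first ring point clockwise from the key, wrapping around.
        idx = bisect.bisect_right(self._points, self._point(key)) % len(self._points)
        return self._ring[idx][1]


if __name__ == "__main__":
    old_nodes = ["receive-0", "receive-1", "receive-2"]  # hypothetical names
    before = HashRing(old_nodes)
    after = HashRing(old_nodes + ["receive-3"])
    keys = [f"tenant-{i}" for i in range(1000)]
    moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
    print(f"{len(moved)} of {len(keys)} keys moved after adding one node")
```

With a ring like this, adding a node only reassigns keys to the new node; keys never shuffle between the existing nodes, which is the property that makes scaling less disruptive than modulo-based hashing.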
6 people authored Sep 30, 2022
1 parent fac1ad9 commit 94cc99a
Showing 26 changed files with 9,001 additions and 939 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -22,7 +22,7 @@ $(JSONNET_VENDOR_DIR): $(JB) jsonnetfile.json jsonnetfile.lock.json

.PHONY: update
update: $(JB) jsonnetfile.json jsonnetfile.lock.json
- @$(JB) update --jsonnetpkg-home="$(JSONNET_VENDOR_DIR)" https://github.com/thanos-io/kube-thanos/jsonnet/kube-thanos@main
+ @$(JB) update --jsonnetpkg-home="$(JSONNET_VENDOR_DIR)" https://github.com/rhobs/obsctl-reloader/jsonnet/lib@main

.PHONY: format
format: $(JSONNET_SRC) $(JSONNETFMT)
172 changes: 172 additions & 0 deletions docs/sop/observatorium.md
@@ -11,6 +11,12 @@
- [APIMetricsWriteLatencyErrorBudgetBurning](#apimetricswritelatencyerrorbudgetburning)
- [APIMetricsReadAvailabilityErrorBudgetBurning](#apimetricsreadavailabilityerrorbudgetburning)
- [APIMetricsReadLatencyErrorBudgetBurning](#apimetricsreadlatencyerrorbudgetburning)
- [APIRulesRawWriteAvailabilityErrorBudgetBurning](#apirulesrawwriteavailabilityerrorbudgetburning)
- [APIRulesSyncAvailabilityErrorBudgetBurning](#apirulessyncavailabilityerrorbudgetburning)
- [APIRulesReadAvailabilityErrorBudgetBurning](#apirulesreadavailabilityerrorbudgetburning)
- [APIRulesRawReadAvailabilityErrorBudgetBurning](#apirulesrawreadavailabilityerrorbudgetburning)
- [APIAlertmanagerAvailabilityErrorBudgetBurning](#apialertmanageravailabilityerrorbudgetburning)
- [APIAlertmanagerNotificationsAvailabilityErrorBudgetBurning](#apialertmanagernotificationsavailabilityerrorbudgetburning)
- Observatorium Proactive Monitoring
- [ObservatoriumHttpTrafficErrorRateHigh](#observatoriumhttptrafficerrorratehigh)
- [ObservatoriumProActiveMetricsQueryErrorRateHigh](#observatoriumproactivemetricsqueryerrorratehigh)
@@ -288,6 +294,172 @@ API is returning a high-enough level of slow responses to read requests that we
* Find and inspect a slow query in [Jaeger](https://observatorium-jaeger.api.openshift.com/search)
* Reach out to @observatorium-oncall or @observatorium-support in #forum-observatorium for help.

---
## APIRulesRawWriteAvailabilityErrorBudgetBurning

### Impact

API /rules/raw is currently failing to ingest rules.

### Summary

API /rules/raw is returning a high-enough level of 5XX responses to write requests that we are depleting our SLO error budget.

### Severity
`high`

### Access Required

- Console access to the cluster that runs Observatorium (Currently [telemeter-prod-01 OSD](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/project-details/all-namespaces) and [app-sre-stage-0 OSD](https://console-openshift-console.apps.app-sre-stage-0.k3s7.p1.openshiftapps.com/project-details/all-namespaces))

### Steps

* Check on the health of the API.
* Check the API [dashboard](https://grafana.app-sre.devshift.net/d/Tg-mH0rizaSJDKSADX/api)
* Check the logs on the API pods.
* Reach out to @observatorium-oncall or @observatorium-support in #forum-observatorium for help.
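
The "error budget burning" wording in these runbooks can be made concrete with a little arithmetic. The sketch below is only an illustration: the function names and the 99% availability target are assumptions, not values taken from the actual SLO definitions or alert rules.

```python
def error_ratio(errors_5xx, total_requests):
    """Fraction of requests that failed with a 5XX response."""
    return errors_5xx / total_requests if total_requests else 0.0


def burn_rate(errors_5xx, total_requests, slo_target):
    """How many times faster than allowed the error budget is being spent.

    A value of 1.0 means the budget would last exactly the SLO window;
    anything above 1.0 means the budget runs out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target  # allowed failure fraction
    return error_ratio(errors_5xx, total_requests) / error_budget


if __name__ == "__main__":
    # Hypothetical numbers: 30 failed writes out of 1000 against a 99% target.
    print(round(burn_rate(30, 1000, 0.99), 2))  # 3.0
```

A burn rate of 3 would mean the budget is being consumed three times faster than the target allows, which is the kind of condition an `...ErrorBudgetBurning` alert is meant to surface.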

---
## APIRulesSyncAvailabilityErrorBudgetBurning

### Impact

API is currently failing to sync rules that were ingested via the /rules/raw endpoint to Thanos Rule.

### Summary

API is returning a high-enough level of 5XX responses from the /reload endpoint that we are depleting our SLO error budget.

### Severity
`high`

### Access Required

- Console access to the cluster that runs Observatorium (Currently [telemeter-prod-01 OSD](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/project-details/all-namespaces) and [app-sre-stage-0 OSD](https://console-openshift-console.apps.app-sre-stage-0.k3s7.p1.openshiftapps.com/project-details/all-namespaces))

### Steps

* Check on the health of Thanos Rule and the Thanos Rule Syncer sidecar container.
* Check the Thanos Rule [dashboard](https://grafana.app-sre.devshift.net/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule)
* Check the logs on the Thanos Rule pods.
* Reach out to @observatorium-oncall or @observatorium-support in #forum-observatorium for help.

---
## APIRulesReadAvailabilityErrorBudgetBurning

### Impact

API is currently failing to respond to rules read requests.

### Summary

API is returning a high-enough level of 5XX responses that we are depleting our SLO error budget.

### Severity
`critical`

### Access Required

- Console access to the cluster that runs Observatorium (Currently [telemeter-prod-01 OSD](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/project-details/all-namespaces) and [app-sre-stage-0 OSD](https://console-openshift-console.apps.app-sre-stage-0.k3s7.p1.openshiftapps.com/project-details/all-namespaces))

### Steps

* Check on the health of the API.
* Check the API [dashboard](https://grafana.app-sre.devshift.net/d/Tg-mH0rizaSJDKSADX/api)
* Check the logs on the API pods.
* Check on the health of Thanos Rule.
* Check the Thanos Rule [dashboard](https://grafana.app-sre.devshift.net/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule)
* Check the logs of the Thanos Rule pods.
* Check on the health of Thanos Query Frontend.
* Check the Thanos Query Frontend [dashboard](https://grafana.app-sre.devshift.net/d/303c4e660a475c4c8cf6aee97da3a24a/thanos-query-frontend)
* Check the logs of the Thanos Query Frontend pods.
* Check on the health of Thanos Query.
* Check the Thanos Query [dashboard](https://grafana.app-sre.devshift.net/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query)
* Check the logs of the Thanos Query pods.
* Reach out to @observatorium-oncall or @observatorium-support in #forum-observatorium for help.

---
## APIRulesRawReadAvailabilityErrorBudgetBurning

### Impact

API is currently failing to respond to rules read requests to the /rules/raw endpoint.

### Summary

API is returning a high-enough level of 5XX responses that we are depleting our SLO error budget.

### Severity
`critical`

### Access Required

- Console access to the cluster that runs Observatorium (Currently [telemeter-prod-01 OSD](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/project-details/all-namespaces) and [app-sre-stage-0 OSD](https://console-openshift-console.apps.app-sre-stage-0.k3s7.p1.openshiftapps.com/project-details/all-namespaces))

### Steps

* Check on the health of the API.
* Check the API [dashboard](https://grafana.app-sre.devshift.net/d/Tg-mH0rizaSJDKSADX/api)
* Check the logs on the API pods.
* Reach out to @observatorium-oncall or @observatorium-support in #forum-observatorium for help.

---
## APIAlertmanagerAvailabilityErrorBudgetBurning

### Impact

Alerts triggered by Thanos Rule are not being sent to Observatorium Alertmanager.

### Summary

Thanos Rule is dropping a high-enough level of alerts that we are depleting our SLO error budget.

### Severity
`critical`

### Access Required

- Console access to the cluster that runs Observatorium (Currently [telemeter-prod-01 OSD](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/project-details/all-namespaces) and [app-sre-stage-0 OSD](https://console-openshift-console.apps.app-sre-stage-0.k3s7.p1.openshiftapps.com/project-details/all-namespaces))

### Steps

* Check on the health of Thanos Rule.
* Check the Thanos Rule [dashboard](https://grafana.app-sre.devshift.net/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule), especially the `Alert Sent` and `Alert Queue` panels.
* Check the logs on the Thanos Rule pods.
* Check on the health of Observatorium Alertmanager.
* Check the Observatorium Alertmanager configuration and status:
  * [MST](https://observatorium-alertmanager-mst.api.openshift.com/#/status)
  * [Telemeter](https://observatorium-alertmanager.api.openshift.com/#/status)
* Reach out to @observatorium-oncall or @observatorium-support in #forum-observatorium for help.

---
## APIAlertmanagerNotificationsAvailabilityErrorBudgetBurning

### Impact

Notifications to specified receivers are failing to be sent by Observatorium Alertmanager.

### Summary

Observatorium Alertmanager is failing to send a high-enough level of notifications that we are depleting our SLO error budget.

### Severity
`critical`

### Access Required

- Console access to the cluster that runs Observatorium (Currently [telemeter-prod-01 OSD](https://console-openshift-console.apps.telemeter-prod.a5j2.p1.openshiftapps.com/project-details/all-namespaces) and [app-sre-stage-0 OSD](https://console-openshift-console.apps.app-sre-stage-0.k3s7.p1.openshiftapps.com/project-details/all-namespaces))

### Steps

* Check on the health of Observatorium Alertmanager.
* Check the logs on the Observatorium Alertmanager pods.
* Check the Observatorium Alertmanager configuration and status:
  * [MST](https://observatorium-alertmanager-mst.api.openshift.com/#/status)
  * [Telemeter](https://observatorium-alertmanager.api.openshift.com/#/status)
* Reach out to @observatorium-oncall or @observatorium-support in #forum-observatorium for help.

---
## ObservatoriumHttpTrafficErrorRateHigh

18 changes: 9 additions & 9 deletions jsonnetfile.json
@@ -37,15 +37,6 @@
},
"version": "master"
},
- {
- "source": {
- "git": {
- "remote": "https://github.com/metalmatze/slo-libsonnet.git",
- "subdir": "slo-libsonnet"
- }
- },
- "version": "master"
- },
{
"source": {
"git": {
@@ -130,6 +121,15 @@
},
"version": "main"
},
+ {
+ "source": {
+ "git": {
+ "remote": "https://github.com/saswatamcode/slo-libsonnet.git",
+ "subdir": "slo-libsonnet"
+ }
+ },
+ "version": "special-selector"
+ },
{
"source": {
"git": {
36 changes: 18 additions & 18 deletions jsonnetfile.lock.json
@@ -142,25 +142,15 @@
"version": "7e28ed2722a8edfade41238ee3f0e0117ff041b6",
"sum": "u8gaydJoxEjzizQ8jY8xSjYgWooPmxw+wIWdDxifMAk="
},
- {
- "source": {
- "git": {
- "remote": "https://github.com/metalmatze/slo-libsonnet.git",
- "subdir": "slo-libsonnet"
- }
- },
- "version": "e238df4fac957357d78a405966c51523ef151cbc",
- "sum": "5XrxbGgMFMHnKGeE++fiduDlDEiDunM2ZMTYCmPbWgA="
- },
{
"source": {
"git": {
"remote": "https://github.com/observatorium/api.git",
"subdir": "jsonnet/lib"
}
},
- "version": "4d5fe09aaab876288355a1fbe5e3eed43734a0d0",
- "sum": "o0dOrsTJx4Lt/F0TA/4Pg/wipWPWOwhtxCJ6lHxzvjY=",
+ "version": "a5834effa711ab261558f489407223e4b3e166cb",
+ "sum": "ZJlzUf7NGTHUr2RtytxG3i8z7DfQJ3gPtV89hnDNjGE=",
"name": "observatorium-api"
},
{
@@ -170,8 +160,8 @@
"subdir": "configuration"
}
},
- "version": "39bd3b6e85614d09607c41c575a477b9cfd78569",
- "sum": "uiddp3srcX6NncnnYYNY6lYEy7oyRBz6Izf8qGfUKnY="
+ "version": "36071db2f584b7cfb7932946dc9336873c363f09",
+ "sum": "xKdQFAsNSYo1lK+2HVOSVcyG9sjEduHVbWhxIe5eJ6o="
},
{
"source": {
@@ -233,8 +223,8 @@
"subdir": "jsonnet/telemeter"
}
},
- "version": "dbfc19afe778f1311860dfe9ebf0e94b403de9de",
- "sum": "jPX3JQZndZSVPDmkW2HZEib7/oeuVpxGOB/rXSgyOcI="
+ "version": "8e8125ea5d66d71b4fdd1b75a45268128b4d36d3",
+ "sum": "VlXNOw0i2l5GD8UDfNX647jaiVxRlA1leACAxFw15WQ="
},
{
"source": {
@@ -326,8 +316,18 @@
"subdir": "jsonnet/lib"
}
},
- "version": "a6a0ff74be63dabac0a1562a364a3a1ccc8f985d",
- "sum": "vdcovQefbNENJcAlNTojsUldmCF0QIBUB5zSwB7wF4s="
+ "version": "17c3d4f4da79c22a4caf752601010ab2ffcf6f35",
+ "sum": "oQRPgrGDsD1m3JDjrcUFuceXhDY8zmQ9rsvc7ECiHm8="
+ },
+ {
+ "source": {
+ "git": {
+ "remote": "https://github.com/saswatamcode/slo-libsonnet.git",
+ "subdir": "slo-libsonnet"
+ }
+ },
+ "version": "1c00866cc4a5d26139f7a7fcccd1752e9d5ea87e",
+ "sum": "vF3rkKtymiGbpH2r+vinAdp1RsLCUdgBNhsacZjFgQY="
},
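
Because this change swaps one dependency in `jsonnetfile.json` and its pin in `jsonnetfile.lock.json`, a quick consistency check between the two files can be useful. The following is a minimal sketch, assuming both files keep their entries under a top-level `dependencies` list with the `source.git.remote` and `source.git.subdir` fields seen in the hunks above; the function name `unlocked_dependencies` is hypothetical and not part of any tool.

```python
import json


def _dep_key(dep):
    # Identify a dependency by its git remote and subdirectory.
    git = dep["source"]["git"]
    return (git["remote"], git.get("subdir", ""))


def unlocked_dependencies(jsonnetfile_text, lockfile_text):
    """Return (remote, subdir) pairs declared in the manifest but missing from the lock file."""
    declared = json.loads(jsonnetfile_text).get("dependencies", [])
    locked = {_dep_key(d) for d in json.loads(lockfile_text).get("dependencies", [])}
    return [_dep_key(d) for d in declared if _dep_key(d) not in locked]


if __name__ == "__main__":
    with open("jsonnetfile.json") as manifest, open("jsonnetfile.lock.json") as lock:
        for remote, subdir in unlocked_dependencies(manifest.read(), lock.read()):
            print(f"not locked: {remote} ({subdir})")
```

An empty result suggests every declared dependency has a matching lock entry; it does not verify that the locked commit or checksum is correct.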
{
"source": {