apps sc: made alert runbooks configurable #2438

Open
davidumea wants to merge 12 commits into main from david/configurable-alert-runbook-urls

Conversation

davidumea
Contributor

@davidumea davidumea commented Feb 14, 2025

Warning

This is a public repository; make sure not to disclose:

  • personal data beyond what is necessary for interacting with this pull request, nor
  • business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

  • kind/feature
  • kind/improvement
  • kind/deprecation
  • kind/documentation
  • kind/clean-up
  • kind/bug
  • kind/other

Optional: Mark one or more of the following that are applicable:

Important

Breaking changes should be marked kind/admin-change or kind/dev-change depending on type
Critical security fixes should be marked with kind/security

  • kind/admin-change
  • kind/dev-change
  • kind/security
  • [kind/adr](set-me)

What does this PR do / why do we need this PR?

Makes alert runbooks configurable, either at the alert group level or per individual alert.

Current override priority:

  1. Use the individual alert's runbook override, if set.
  2. Otherwise, use the alert group's runbook override, if set.
  3. Otherwise, use the upstream alert runbook, if one exists.
  4. Otherwise, set the runbook to "Missing runbook" (no upstream runbook exists and nothing is configured).
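
The priority order above can be illustrated with a small Python sketch. This is not the template code from this PR, only a model of the lookup. The function name `resolve_runbook`, its parameters, and the choice to treat an empty string as "not set" are assumptions made for the example; the reserved `group` key and the "Missing runbook" fallback come from the description.

```python
from typing import Mapping, Optional

MISSING_RUNBOOK = "Missing runbook"


def resolve_runbook(
    alert_name: str,
    overrides: Mapping[str, str],
    upstream_runbook: Optional[str] = None,
) -> str:
    """Pick the runbook for one alert, following the priority order above.

    `overrides` models the per-group mapping: the key "group" holds the
    group-level override, every other key is an alert name.
    Assumption: empty strings count as "not set".
    """
    # 1. Individual alert override
    if overrides.get(alert_name):
        return overrides[alert_name]
    # 2. Group-level override
    if overrides.get("group"):
        return overrides["group"]
    # 3. Upstream runbook, if the upstream alert has one
    if upstream_runbook:
        return upstream_runbook
    # 4. Nothing configured and nothing upstream
    return MISSING_RUNBOOK


overrides = {
    "group": "https://runbooks.prometheus-operator.dev/runbooks/alertmanager/",
    "AlertmanagerFailedReload": (
        "https://runbooks.prometheus-operator.dev/runbooks/"
        "alertmanager/alertmanagerfailedreload/"
    ),
}
```

With these overrides, `AlertmanagerFailedReload` resolves to its own URL, any other alert in the group resolves to the `group` URL, and an alert with no overrides and no upstream runbook ends up with "Missing runbook".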

Information to reviewers

Checklist

  • Proper commit message prefix on all commits
  • Change checks:
    • The change is transparent
    • The change is disruptive
    • The change requires no migration steps
    • The change requires migration steps
    • The change updates CRDs
    • The change updates the config and the schema
  • Documentation checks:
  • Metrics checks:
    • The metrics are still exposed and present in Grafana after the change
    • The metrics names didn't change (Grafana dashboards and Prometheus alerts required no updates)
    • The metrics names did change (Grafana dashboards and Prometheus alerts required an update)
  • Logs checks:
    • The logs do not show any errors after the change
  • PodSecurityPolicy checks:
    • Any changed Pod is covered by Kubernetes Pod Security Standards
    • Any changed Pod is covered by Gatekeeper Pod Security Policies
    • The change does not cause any Pods to be blocked by Pod Security Standards or Policies
  • NetworkPolicy checks:
    • Any changed Pod is covered by Network Policies
    • The change does not cause any dropped packets in the NetworkPolicy Dashboard
  • Audit checks:
    • The change does not cause any unnecessary Kubernetes audit events
    • The change requires changes to Kubernetes audit policy
  • Falco checks:
    • The change does not cause any alerts to be generated by Falco
  • Bug checks:
    • The bug fix is covered by regression tests

@davidumea davidumea requested a review from HaoruiPeng February 14, 2025 13:36
@davidumea davidumea requested review from a team as code owners February 14, 2025 13:36
@davidumea davidumea added the kind/improvement Improvement of existing features, e.g. code cleanup or optimizations. label Feb 14, 2025
@davidumea davidumea force-pushed the david/configurable-alert-runbook-urls branch from 402e518 to f64f988 Compare February 14, 2025 13:38
@davidumea davidumea changed the title apps sc: made alertmanager alert runbooks configurable apps sc: made alert runbooks configurable Feb 14, 2025
@davidumea davidumea force-pushed the david/configurable-alert-runbook-urls branch from f64f988 to b581f30 Compare February 14, 2025 13:44
Contributor

@anders-elastisys anders-elastisys left a comment

Nice work, I really like this change. Added some comments, otherwise it looks good

config/sc-config.yaml (outdated, resolved)
helmfile.d/values/prometheus-alerts-sc.yaml.gotmpl (outdated, resolved)
Contributor

@anders-elastisys anders-elastisys left a comment

Nice 👍

@davidumea davidumea requested a review from aarnq February 17, 2025 16:59
Contributor

@viktor-f viktor-f left a comment

Nice work! I think that this solution ended up looking very clean.

Comment on lines 4075 to 4086
description: |-
  Runbooks for alertmanager alerts

  Example:

  group: https://runbooks.prometheus-operator.dev/runbooks/alertmanager/
  AlertmanagerFailedReload: https://runbooks.prometheus-operator.dev/runbooks/alertmanager/alertmanagerfailedreload/
  AlertName: link-to-specific-alert-runbook

  Uses upstream runbooks by default

  https://runbooks.prometheus-operator.dev/runbooks/
Contributor

I think that this description is nice, but I see that this one is the only one with the examples. It would be nice if we had the examples for all groups, not just alertmanager, but I also don't want us to duplicate this text for each.

Suggestion: Could you add this to a template instead and add a reference here? Then we can get it to all without having to duplicate the text. You could have two templates, one with the text about using upstream runbook by default and one without. What do you think about this? Did you already have some similar plan?

Contributor

Are all these objects different? I wonder if there's any way to have reusable common bits?

Contributor

The names are a bit different, but I think you could just create an example that is generic and use that for all alerts.

Contributor

Hm, but for validation? I guess "additionalProperties":{"type":"string"} might be good enough

Contributor

Please add a $def and reuse it for all groups, and give group its own definition since it has special handling. Other than that, @Zash's suggestion is the way to go: map[string]string for the alerts themselves.
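
For illustration only, here is a rough Python sketch of the schema shape discussed in this thread: one shared definition with an explicit `group` string and `additionalProperties` of type string for the alert names, referenced from each alert group. The `$defs` keyword, the definition name `runbookUrls`, the group key `exampleGroup`, and the tiny `validate` helper are all assumptions made for the example, not the schema committed in this PR. The helper handles only the keywords used here; a real JSON Schema validator would normally do this job.

```python
from typing import Any, Dict, List

# Hypothetical schema: one reusable definition, referenced by every group.
SCHEMA: Dict[str, Any] = {
    "$defs": {
        "runbookUrls": {
            "type": "object",
            "properties": {
                # "group" has special handling, so it is defined explicitly.
                "group": {"type": "string"},
            },
            # Every other key is an alert name mapped to a URL string.
            "additionalProperties": {"type": "string"},
        }
    },
    "type": "object",
    "properties": {
        "alertmanager": {"$ref": "#/$defs/runbookUrls"},
        "exampleGroup": {"$ref": "#/$defs/runbookUrls"},
    },
}


def validate(value: Any, schema: Dict[str, Any], root: Dict[str, Any], path: str = "") -> List[str]:
    """Minimal checker for the few keywords used in SCHEMA; returns error strings."""
    if "$ref" in schema:
        # Only local references of the form "#/$defs/<name>" are supported.
        name = schema["$ref"].rsplit("/", 1)[-1]
        return validate(value, root["$defs"][name], root, path)
    expected = schema.get("type")
    where = path or "<root>"
    if expected == "string":
        return [] if isinstance(value, str) else [f"{where}: expected a string"]
    if expected == "object":
        if not isinstance(value, dict):
            return [f"{where}: expected an object"]
        errors: List[str] = []
        properties = schema.get("properties", {})
        extra = schema.get("additionalProperties")
        for key, item in value.items():
            child = f"{path}.{key}" if path else key
            if key in properties:
                errors += validate(item, properties[key], root, child)
            elif isinstance(extra, dict):
                errors += validate(item, extra, root, child)
        return errors
    return []


good = {
    "alertmanager": {
        "group": "https://runbooks.prometheus-operator.dev/runbooks/alertmanager/",
        "AlertName": "link-to-specific-alert-runbook",
    }
}
bad = {"alertmanager": {"AlertName": 42}}
```

Here `validate(good, SCHEMA, SCHEMA)` returns an empty list, while `validate(bad, SCHEMA, SCHEMA)` reports that `alertmanager.AlertName` is not a string. No individual alert has to be listed in the schema or in the default config.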

Contributor Author

👍 I think this works, at least it validates

c835a3c

Contributor

Is this still used?

runbookUrl: "https://runbooks.prometheus-operator.dev/runbooks/"

If not, then I think it can be removed.

Contributor Author

I can remove it

@davidumea davidumea requested review from viktor-f and Zash February 18, 2025 14:32
Comment on lines 929 to 930
group: link-to-alert-group-runbook
AlertName: link-to-specific-alert-runbook
Contributor

Define this object. Also, you should move the $def under runbookUrls, because it is not going to be used anywhere else.

Contributor Author

> Define this object

I don't think we should define the object (each alert), to avoid default config clutter.

Or am I missing the point?

Contributor

You should define group explicitly as a string property, then cover the individual alerts under additionalProperties, also as strings.

Contributor Author

Do you want me to define every single alert?

Contributor Author

Is this what you meant? 🙂 6eb927c

@davidumea davidumea requested a review from aarnq February 18, 2025 16:09
Labels
kind/improvement Improvement of existing features, e.g. code cleanup or optimizations.
5 participants