Skip to content

Latest commit

 

History

History
141 lines (110 loc) · 3.97 KB

alertmanager.md

File metadata and controls

141 lines (110 loc) · 3.97 KB

Global configuration

If you notice a delay between an event and the first notification, read the following blog post => https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html.

Prometheus configuration

{% highlight yaml %}

prometheus.yml

global: scrape_interval: 20s

A short evaluation_interval will check alerting rules very often.

It can be costly if you run Prometheus with 100+ alerts.

evaluation_interval: 20s ...

rule_files:

  • 'alerts/*.yml'

scrape_configs: ...

{% endhighlight %}

{% highlight yaml %}

alerts/example-redis.yml

groups:

  • name: ExampleRedisGroup rules:
    • alert: ExampleRedisDown expr: redis_up{} == 0 for: 2m labels: severity: critical annotations: summary: "Redis instance down" description: "Whatever"

{% endhighlight %}

AlertManager configuration

{% highlight yaml %} {% raw %}

alertmanager.yml

route:

When a new group of alerts is created by an incoming alert, wait at

least 'group_wait' to send the initial notification.

This way ensures that you get multiple alerts for the same group that start

firing shortly after another are batched together on the first

notification.

group_wait: 10s

When the first notification was sent, wait 'group_interval' to send a batch

of new alerts that started firing for that group.

group_interval: 30s

If an alert has successfully been sent, wait 'repeat_interval' to

resend them.

repeat_interval: 30m

A default receiver

receiver: "slack"

All the above attributes are inherited by all child routes and can

overwritten on each.

routes: - receiver: "slack" group_wait: 10s match_re: severity: critical|warning continue: true

- receiver: "pager"
  group_wait: 10s
  match_re:
    severity: critical
  continue: true

receivers:

{% endraw %} {% endhighlight %}

Reduce Prometheus server load

For expansive or frequent PromQL queries, Prometheus allows to precompute rules.

{% highlight yaml %} {% raw %} groups:

first define the recorded rule

  • name: ExampleRecordedGroup rules:
    • record: job:rabbitmq_queue_messages_delivered_total:rate:5m expr: rate(rabbitmq_queue_messages_delivered_total[5m])

then use it in alerts

  • name: ExampleAlertingGroup rules:
    • alert: ExampleRabbitmqLowMessageDelivery expr: sum(job:rabbitmq_queue_messages_delivered_total:rate:5m) < 10 for: 2m labels: severity: critical annotations: summary: "Low delivery rate in Rabbitmq queues" {% endraw %} {% endhighlight %}

Troubleshooting

If the notification takes too much time to be triggered, check the following delays:

  • scrape_interval = 20s (prometheus.yml)
  • evaluation_interval = 20s (prometheus.yml)
  • increase(mysql_global_status_slow_queries[1m]) > 0 (alerts/example-mysql.yml)
  • for: 5m (alerts/example-mysql.yml)
  • group_wait = 10s (alertmanager.yml)

Also read: