Make the severity of "critical" alerts configurable
This addresses the blissful scenario where single-node failures are
unproblematic. No reason to wake somebody up if a node is about to
screw itself up by filling the disk.

Signed-off-by: beorn7 <[email protected]>
beorn7 authored and oblitorum committed Apr 9, 2024
1 parent a94bd9f commit 6767617
Showing 2 changed files with 17 additions and 4 deletions.
8 changes: 4 additions & 4 deletions docs/node-mixin/alerts/alerts.libsonnet
@@ -37,7 +37,7 @@
||| % $._config,
'for': '1h',
labels: {
- severity: 'critical',
+ severity: '%(nodeCriticalSeverity)s' % $._config,
},
annotations: {
summary: 'Filesystem is predicted to run out of space within the next 4 hours.',
@@ -73,7 +73,7 @@
||| % $._config,
'for': '1h',
labels: {
- severity: 'critical',
+ severity: '%(nodeCriticalSeverity)s' % $._config,
},
annotations: {
summary: 'Filesystem has less than 3% space left.',
@@ -113,7 +113,7 @@
||| % $._config,
'for': '1h',
labels: {
- severity: 'critical',
+ severity: '%(nodeCriticalSeverity)s' % $._config,
},
annotations: {
summary: 'Filesystem is predicted to run out of inodes within the next 4 hours.',
@@ -149,7 +149,7 @@
||| % $._config,
'for': '1h',
labels: {
- severity: 'critical',
+ severity: '%(nodeCriticalSeverity)s' % $._config,
},
annotations: {
summary: 'Filesystem has less than 3% inodes left.',
13 changes: 13 additions & 0 deletions docs/node-mixin/config.libsonnet
@@ -17,6 +17,19 @@
// them here, e.g. 'device!="tmpfs"'.
diskDeviceSelector: '',

+ // Some of the alerts are meant to fire if a critical failure of a
+ // node is imminent (e.g. the disk is about to run full). In a
+ // true “cloud native” setup, failures of a single node should be
+ // tolerated. Hence, even imminent failure of a single node is no
+ // reason to create a paging alert. However, in practice there are
+ // still many situations where operators like to get paged in time
+ // before a node runs out of disk space. nodeCriticalSeverity can
+ // be set to the desired severity for this kind of alerts. This
+ // can even be templated to depend on labels of the node, e.g. you
+ // could make this critical for traditional database masters but
+ // just a warning for K8s nodes.
+ nodeCriticalSeverity: 'critical',

grafana_prefix: '',
},
}
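
A minimal, self-contained sketch of how the new option could be overridden by a consumer of the mixin. The `mixin` object below is a stand-in written for illustration, not the real node-mixin files; the rule name, the `role` label, and its `database` value are assumptions. It relies only on Jsonnet's `%` string formatting and late-bound `$`, plus the fact that Prometheus allows templated label values in alerting rules.

```jsonnet
// Stand-in for the mixin: one rule whose severity is read from _config,
// mirroring the pattern introduced by this commit.
local mixin = {
  _config:: {
    nodeCriticalSeverity: 'critical',
  },
  rule: {
    alert: 'ExampleFilesystemAlert',  // hypothetical rule name
    labels: {
      severity: '%(nodeCriticalSeverity)s' % $._config,
    },
  },
};

// Override 1: never page for single-node trouble.
local relaxed = mixin {
  _config+:: { nodeCriticalSeverity: 'warning' },
};

// Override 2: let Prometheus pick the severity per node at alert time.
// The `role` label is hypothetical; use whatever label your nodes carry.
local templated = mixin {
  _config+:: {
    nodeCriticalSeverity: '{{ if eq $labels.role "database" }}critical{{ else }}warning{{ end }}',
  },
};

{
  default: mixin.rule.labels.severity,      // 'critical'
  relaxed: relaxed.rule.labels.severity,    // 'warning'
  templated: templated.rule.labels.severity,  // the Go template string, passed through unchanged
}
```

Because `$._config` is resolved late, the override takes effect without touching the rule definitions, and the Go template in the third variant is passed through as plain text for Prometheus to evaluate.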
