This tool lets you monitor a typical home server running applications in containers and receive alerts on your smartphone. It is designed to be light and simple (no database, no GUI, a single configuration file).
- run in a container (tested with both docker and podman)
- send notifications to any service supported by shoutrrr
- alert when a container is restarting forever
- alert when a container isn't started
- alert when a target is unreachable (ping)
- alert when available disk space is low
- alert when a systemd service has failed
- notify when a container image is updated (provides an alternative to watchtower if you are running podman with podman-auto-update)
This tool follows semantic versioning.
Pre-built images are available on GitHub Packages:
- ghcr.io/mcarbonne/minimal-server-monitoring:main (main branch)
- ghcr.io/mcarbonne/minimal-server-monitoring:latest (latest tagged version)
- ghcr.io/mcarbonne/minimal-server-monitoring:x.x.x
- ghcr.io/mcarbonne/minimal-server-monitoring:x.x
- ghcr.io/mcarbonne/minimal-server-monitoring:x
For automatic updates (watchtower, podman-auto-update...), using the latest major tag available (ghcr.io/mcarbonne/minimal-server-monitoring:2) is recommended to avoid breaking changes.
With the default configuration, set through environment variables:

```sh
docker run -e MACHINENAME=$(hostname) -e SHOUTRRR=XXXXXXX \
  -v .../cache.json:/app/cache.json \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /run/systemd:/run/systemd:ro \
  -v /:/host:ro \
  --name minimal-server-monitoring -d ghcr.io/mcarbonne/minimal-server-monitoring:2
```

With a custom configuration file:

```sh
docker run \
  -v .../config.yml:/app/config.yml:ro \
  -v .../cache.json:/app/cache.json \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /run/systemd:/run/systemd:ro \
  -v /:/host:ro \
  --name minimal-server-monitoring -d ghcr.io/mcarbonne/minimal-server-monitoring:2
```
- `-v .../config.yml:/app/config.yml:ro`: override the default configuration file with your settings. The default configuration file is available here. Have a look at example_config.yml for an exhaustive list of available parameters.
- `-v .../cache.json:/app/cache.json`: persist the cache.
- `-v /var/run/docker.sock:/var/run/docker.sock:ro`: give access to the host Docker daemon (required for the container provider). Use `-v /run/podman/podman.sock:/var/run/docker.sock:ro` if you are using podman.
- `-v /run/systemd:/run/systemd:ro`: give access to the host systemd (required for the systemd provider).
- `-v /:/host:ro`: required for filesystemusage to discover and monitor all mountpoints. The mount target in the container (/host here) must match the mountprefix parameter (see here).
key | type | required | default value |
---|---|---|---|
notifiers | map of notifiers | yes | - |
cache | string (path) | yes | - |
alert.unhealthy_threshold | uint | no | 1 |
alert.healthy_threshold | uint | no | 1 |
alert.failure_reminder | duration * | no | 2h |
alert.grouping.window | duration * | no | 15s |
scrapers | map of scrapers | yes | - |
\* Duration: a string with a unit (for example 15s or 2h). See here for details.
key | type | required | default value |
---|---|---|---|
type | enum (shoutrrr, console) | yes | - |
params | map, see below for details | no | {} |
shoutrrr notifier parameters:

key | description | required | default value |
---|---|---|---|
url | shoutrrr url | yes | - |
console notifier:
- no parameters
- all notifications are logged on the standard output
key | type | required | default value |
---|---|---|---|
type | enum (systemd, container, filesystemusage, ping) | yes | - |
scrape_interval | duration * | no | 120s |
params | map, see below | no | {} |
systemd scraper:
- no parameters
- only one instance allowed
- states (for every service):
  - service active state (ActiveState != failed)
container scraper:
- no parameters
- only one instance allowed
- messages (for every running container):
  - when a container image is updated
- states (for every running container):
  - container status (check if started)
  - container restart (check if restarting forever)
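The "image updated" message requires remembering which image each container ran at the previous scrape. A minimal sketch, assuming the scraper compares image IDs between scrapes; `imageTracker` and `check` are hypothetical names, and a plain map stands in for the tool's Storage.

```go
package main

import "fmt"

// imageTracker remembers the last image ID seen for each container.
// Hypothetical: the tool persists this kind of data through its Storage;
// an in-memory map is used here instead.
type imageTracker struct {
	lastImage map[string]string
}

// check records imageID for container and returns an "updated" message when
// it differs from the image recorded at the previous scrape. The first
// sighting of a container is not reported as an update.
func (t *imageTracker) check(container, imageID string) (string, bool) {
	prev, seen := t.lastImage[container]
	t.lastImage[container] = imageID
	if !seen || prev == imageID {
		return "", false
	}
	return fmt.Sprintf("container %s was updated (%s -> %s)", container, prev, imageID), true
}
```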
filesystemusage scraper:
- provides two states for each mountpoint: whether enough free disk space is available, and whether free space is changing rapidly
- multiple instances allowed

parameter | description | required | default value |
---|---|---|---|
mountprefix | mountpoint prefix, when running inside a container | no | "" (empty string) |
fstypes | list of file system types to consider | no | [ext4, btrfs] |
mountpoint_blacklist | list of mountpoints to ignore | no | [] |
mountpoint_whitelist | list of mountpoints to monitor. When set, fstypes and mountpoint_blacklist are ignored and autodiscovery is skipped | no | [] |
threshold | minimum available disk space (1) | no | 20% |
rate_threshold | rate threshold over the rate_threshold_window period (1)(2) | no | 1g |
rate_threshold_window | window duration (2) | no | 5m * |

- (1) Thresholds can be either relative (20%) or absolute (50m, 20gb...). Absolute values are parsed using ParseBytes from go-humanize; the list of supported prefixes is available here.
- (2) rate_threshold triggers an alert when the remaining disk space changes by more than rate_threshold over rate_threshold_window (either increase or decrease). Note: rate_threshold_window must be greater than or equal to scrape_interval.
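The two filesystemusage checks can be sketched as follows. This assumes absolute thresholds have already been converted to bytes (the tool uses go-humanize's ParseBytes for that step, left out here to stay dependency-free). All names are hypothetical, and treating a value exactly at the threshold as healthy is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// threshold is either relative (a percentage of the filesystem size) or
// absolute (a number of bytes).
type threshold struct {
	relative bool
	percent  float64 // used when relative is true
	bytes    uint64  // used when relative is false
}

// parseRelative parses a relative threshold such as "20%".
// Absolute values ("50m", "20gb") are not handled here.
func parseRelative(s string) (float64, error) {
	if !strings.HasSuffix(s, "%") {
		return 0, fmt.Errorf("%q is not a relative threshold", s)
	}
	return strconv.ParseFloat(strings.TrimSuffix(s, "%"), 64)
}

// enoughSpace reports whether free space satisfies the threshold.
// Assumption: a value exactly at the threshold counts as healthy.
func enoughSpace(free, total uint64, t threshold) bool {
	if t.relative {
		return float64(free) >= float64(total)*t.percent/100
	}
	return free >= t.bytes
}

// rapidChange reports whether free space moved, in either direction, by more
// than rate bytes between the start of the window and now.
func rapidChange(freeAtWindowStart, freeNow, rate uint64) bool {
	var delta uint64
	if freeNow > freeAtWindowStart {
		delta = freeNow - freeAtWindowStart
	} else {
		delta = freeAtWindowStart - freeNow
	}
	return delta > rate
}
```

Computing the absolute difference with an explicit branch avoids unsigned underflow when free space decreases.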
ping scraper:
- provides one state per target (is the target reachable?)
- multiple instances allowed

parameter | description | required | default value |
---|---|---|---|
targets | list of IP addresses/hostnames to ping | yes | - |
retry_count | how many times to retry if the ping fails | no | 3 |
Example configuration:

```yaml
notifiers:
  notifier_1:
    type: console
  notifier_2:
    type: shoutrrr
    params:
      url: YOUR_SHOUTRRR_URL_HERE
cache: /tmp/cache.json
alert:
  unhealthy_threshold: 1
  healthy_threshold: 1
scrapers:
  docker:
    type: container
  systemd:
    type: systemd
  gateway:
    type: ping
    scrape_interval: 5s
    params:
      targets:
        - 192.168.0.1
  ethernet_available:
    type: ping
    scrape_interval: 2m
    params:
      targets:
        - 8.8.8.8
      retry_count: 3
  filesystemusage:
    type: filesystemusage
    params:
      mountpoint_whitelist:
        - "/"
      threshold: 15%
```
```mermaid
flowchart TD
subgraph Scraping
Storage
Sc(Schedule scrapers)
Sc-..->S1 & S2 & S3
S1("`**Scraper n°1**
- provider: container
- scrape_interval: 15s`")
S2("`**Scraper n°2**
- provider: ping
- scrape_interval: 30s`")
S3(...)
S1 & S2 & S3 -->SC
SC{{Collect ScrapeResult}}
Storage[(Storage)]
S1 & S2 & S3<-.->Storage
end
SC--"- states\n- messages"-->AlertCenter
subgraph AlertCenter
AC{{"Generate notifications"}}
AC--notifications-->F
F{{Filtering}}
F--filtered notifications-->G
G{{Grouping}}
end
G--filtered and grouped notifications-->Notifier
subgraph Notifier
C{{Send notifications}}
N1(Shoutrrr)
N2(...)
C-->N1
C-->N2
end
```
Scraping: configured scrapers are scheduled, each at its own scrape_interval. Each scraper may emit multiple states and multiple messages. Unlike some other monitoring tools, decisions (i.e. whether a metric is healthy) are made in the scrapers.
Multiple instances of a given provider may be allowed (depending on the provider).
A State metric is the combination of a metricId, a state (boolean) and a message.
Example: metricId: "container_XXXX_state", isHealthy: false, message: "XXXX isn't running"
A Message metric is the combination of a metricId and a message.
Example: metricId: "container_XXXX_updated", message: "container XXXX was updated ...."
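The two metric kinds map naturally onto plain structs. Field and type names below are hypothetical, chosen to mirror the examples above; only ScrapeResult is taken from the diagram.

```go
package main

import "fmt"

// StateMetric combines a metricId, a health flag and a message.
type StateMetric struct {
	MetricID  string
	IsHealthy bool
	Message   string
}

// MessageMetric combines a metricId and a message.
type MessageMetric struct {
	MetricID string
	Message  string
}

// ScrapeResult groups everything a single scraper run emits.
type ScrapeResult struct {
	States   []StateMetric
	Messages []MessageMetric
}

// containerStateMetric builds the state from the example above
// (hypothetical helper; the metricId format follows the example).
func containerStateMetric(name string, running bool) StateMetric {
	msg := fmt.Sprintf("%s is running", name)
	if !running {
		msg = fmt.Sprintf("%s isn't running", name)
	}
	return StateMetric{
		MetricID:  fmt.Sprintf("container_%s_state", name),
		IsHealthy: running,
		Message:   msg,
	}
}
```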
Providers can persist data using Storage, a simple key-value database.
AlertCenter is responsible for:
- emitting notifications from scrape results
- avoiding being flooded with notifications (filtering + grouping)
If a state is marked as failed unhealthy_threshold times in a row, a notification is sent (metric XX failed).
If a state is marked as OK healthy_threshold times in a row, a notification is sent (metric XX OK).
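These thresholds act as hysteresis: a single flap does not produce a notification. A minimal sketch for one metricId, assuming a metric starts out healthy and that notifications are only sent on a transition; `stateTracker` is a hypothetical name.

```go
package main

// stateTracker follows one metricId. It notifies "failed" after
// unhealthyThreshold consecutive unhealthy results and "OK" after
// healthyThreshold consecutive healthy results.
type stateTracker struct {
	unhealthyThreshold uint
	healthyThreshold   uint
	reportedHealthy    bool // last state that was notified (assumed healthy at start)
	streak             uint // consecutive results disagreeing with reportedHealthy
}

func newStateTracker(unhealthy, healthy uint) *stateTracker {
	return &stateTracker{unhealthyThreshold: unhealthy, healthyThreshold: healthy, reportedHealthy: true}
}

// observe records one scrape result and returns "failed", "OK", or ""
// when no notification should be sent.
func (t *stateTracker) observe(isHealthy bool) string {
	if isHealthy == t.reportedHealthy {
		t.streak = 0 // a result agreeing with the reported state resets the streak
		return ""
	}
	t.streak++
	need := t.unhealthyThreshold
	if isHealthy {
		need = t.healthyThreshold
	}
	if t.streak < need {
		return ""
	}
	t.reportedHealthy = isHealthy
	t.streak = 0
	if isHealthy {
		return "OK"
	}
	return "failed"
}
```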
Messages are forwarded as notifications (no processing at this step).
Filtering avoids sending too many notifications for a given metricId: each metricId is allowed to send at most 5 messages every 30 minutes.
Grouping: when a notification arrives, wait up to 15 seconds (the default alert.grouping.window) so that up to 10 notifications can be sent together.
Send all notifications to all configured notifiers. Multiple instances of each type are allowed.