Bernard
Suppose that you have a number of checks deployed on your hosts; checks that when executed, give you a status and a few metrics. These checks come from an existing monitoring tool (e.g. Nagios). How do you turn these checks into Datadog checks?
We have added to our agent the ability to turn arbitrary executable checks into Datadog events and metrics. Think of it as a generic integration that lets you build on what you already have deployed.
Unlike most application-specific integrations, it puts no emphasis on specific metrics and requires no custom code, only adherence to a common protocol for communication between the check and the agent.
In this document we use Nagios checks as an example and a reference implementation.
Please note, Bernard is still experimental and is not part of the default Agent package.
At the core, the agent assumes 3 functions:
- Scheduling: checks are scheduled for execution on a periodic basis, with special provisions for slow and overdue checks.
- Runtime: checks are executed in an environment where their exceptions and signals are trapped properly, so that a bad check never causes the agent to terminate. Checks are expected to return a status (ok, warning, fail, unknown).
- Notification: based on the status returned by each check, the agent creates a Datadog event with the appropriate content. Usually this is done when the status changes, with provisions to avoid notifying on transient errors or on flapping states.
To provide more isolation and still take advantage of shared code, this new process, called bernard, is meant to be standalone. bernard shares the same configuration as the collector but only communicates events and metrics to dogstatsd. dogstatsd will in turn relay these events and metrics to Datadog on a distinct schedule.
By cleanly separating bernard from the rest and only using UDP to localhost as coupling, we can minimize the impact of checks misbehaving on the basic metrics collection that the rest of the agent is responsible for.
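To make the UDP coupling concrete, here is a minimal sketch of how a process could hand a metric and an event to a local dogstatsd. It is not Bernard's actual code: the function names are hypothetical, and the address assumes dogstatsd is listening on its default port (8125) on localhost. The datagram layouts follow the documented dogstatsd wire format.

```python
import socket

# Assumption: dogstatsd listens on its default UDP port on localhost.
DOGSTATSD_ADDR = ("127.0.0.1", 8125)


def gauge_datagram(name, value):
    """Format a gauge as <name>:<value>|g."""
    return ("%s:%s|g" % (name, value)).encode("utf-8")


def event_datagram(title, text, alert_type="info"):
    """Format an event as _e{<title_len>,<text_len>}:<title>|<text>|t:<type>.

    Lengths are counted in UTF-8 bytes and newlines in the text are escaped.
    """
    title_bytes = title.encode("utf-8")
    text_bytes = text.replace("\n", "\\n").encode("utf-8")
    header = "_e{%d,%d}:" % (len(title_bytes), len(text_bytes))
    return (header.encode("ascii") + title_bytes + b"|" + text_bytes
            + b"|t:" + alert_type.encode("ascii"))


def send_to_dogstatsd(datagrams, addr=DOGSTATSD_ADDR):
    """Fire-and-forget: UDP gives no delivery guarantee and never blocks on the receiver."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for datagram in datagrams:
            sock.sendto(datagram, addr)
    finally:
        sock.close()
```

Because the sender never waits for a reply, a slow or absent dogstatsd cannot stall the sender, and a misbehaving sender cannot stall the collector.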
Of course, bernard uses the same logging infrastructure as the rest of the agent, i.e. it uses /var/log/datadog/bernard.log by default and can forward messages to syslog if desired.
Each check has its own schedule configuration.
- run every $period, less often if the scheduler is running behind
- timeout after $timeout
- reschedule for a longer period if the execution failed, or a shorter period if it needs to confirm a state change ($attempts runs are needed to confirm)
All values can be configured via bernard.yaml.
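The rescheduling rules above can be sketched as a small function. This is only an illustration: the text does not say by how much the period is lengthened or shortened, so the two factors below, like the function and parameter names, are hypothetical.

```python
def next_delay(period, execution_failed=False, confirming_change=False,
               backoff_factor=2.0, confirm_factor=0.5):
    """Return the number of seconds to wait before the next run of a check.

    backoff_factor and confirm_factor are made-up values for illustration.
    """
    if confirming_change:
        # A state change is pending confirmation: run again sooner.
        return period * confirm_factor
    if execution_failed:
        # The check could not be executed properly: back off.
        return period * backoff_factor
    return period
```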
Each check is expected to be self-contained, i.e. checks should not have external dependencies that they cannot resolve themselves. The environment is set up anew and torn down after execution.
- No environment variables are set prior to execution.
- Signals sent by the check are trapped by bernard.
Files, locks, etc. created by the check should be managed by the check itself. bernard does not track any resource used by the check (files, network connections, etc.)
Once the check is forked, it runs with a timer set to terminate it if more than $timeout seconds of wall-clock time have elapsed.
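As a rough sketch of this runtime behaviour (not Bernard's actual implementation), a check can be run with an empty environment and a wall-clock timeout using the standard library. The function name is hypothetical, and the exit-code mapping follows the Nagios plugin convention (0 to 3), which is an assumption about how statuses are derived.

```python
import subprocess

# Nagios plugin convention for exit codes (assumed here).
EXIT_STATUS = {0: "ok", 1: "warning", 2: "critical", 3: "unknown"}


def run_check(command, timeout):
    """Run a check and return (status, output) without ever raising."""
    try:
        proc = subprocess.run(
            command,
            env={},                    # no environment variables are passed
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout,           # the child is killed once this elapses
        )
    except subprocess.TimeoutExpired:
        return "unknown", "timed out after %s seconds" % timeout
    except OSError as exc:
        # For example, the check file does not exist or is not executable.
        return "unknown", "failed to execute: %s" % exc
    status = EXIT_STATUS.get(proc.returncode, "unknown")
    return status, proc.stdout.decode("utf-8", "replace").strip()
```

Catching both the timeout and the execution error is what keeps a bad check from taking the supervising process down with it.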
The 2 basic questions that the notification part answers are:
- Whether to notify
- Who to notify (and what to say)
Notifications are created as Datadog events.
The decision to notify or not is based on the past check history. The last $attempts+1 results are maintained. If the last $attempts results confirm the state change, an event is triggered. This logic is self-contained and relies only on access to the check history.
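One way to read this rule is sketched below. The class and method names are hypothetical, and the interpretation (notify once, when the oldest kept result differs from the $attempts most recent ones and those all agree) is an assumption drawn from the paragraph above.

```python
from collections import deque


class CheckHistory:
    """Keep the last attempts+1 statuses of a check."""

    def __init__(self, attempts):
        self.attempts = attempts
        self.results = deque(maxlen=attempts + 1)

    def add(self, status):
        """Record a status and return True if it confirms a state change."""
        self.results.append(status)
        if len(self.results) <= self.attempts:
            return False  # not enough history yet
        previous, *recent = self.results
        # The change is confirmed when the last `attempts` results all agree
        # and differ from the result that preceded them.
        return previous != status and all(s == status for s in recent)
```

With attempts set to 3, a check going from ok to warning triggers an event on the third consecutive warning and stays silent afterwards. A check alternating between ok and warning never triggers one.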
For the initial implementation, the mapping between a check and a recipient is statically defined in the configuration of bernard.
Since notifications are Datadog events and to have as little coupling as possible between bernard and the rest of the agent, events are sent to dogstatsd.
To effect any change, simply restart bernard.
In future versions, this mapping will be dynamically controlled from our servers so that notification routes can be updated without changing the configuration.
Currently, Bernard only supports Nagios checks.
Nagios checks return results like this:
OK - load average: 0.44, 0.17, 0.13|load1=0.440;2.000;3.000;0; load5=0.170;2.000;3.000;0; load15=0.130;2.000;3.000;0;
WARNING - load average: 0.85, 0.27, 0.16|load1=0.850;0.100;1.000;0; load5=0.270;0.100;1.000;0; load15=0.160;0.100;1.000;0;
bernard destructures these results into 2 distinct parts:
- the event part (before the |)
- the metrics part (after the |)
The event part is turned into a check result, complete with status and message. Notifications will be based on events.
The metrics part is turned into individual metrics, prefixed with bernard.$check_name.
In the examples above we would have:
- event: status=ok, message='load average: 0.44, 0.17, 0.13'
- metrics: bernard.load.load1=0.44, bernard.load.load5=0.17, bernard.load.load15=0.13
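The split described above can be sketched as follows. This is a simplified parser written for the sample lines in this document, not Bernard's own: the function name is hypothetical, it reads the status from the text (Nagios plugins also report it through their exit code), and it ignores quoted labels and multi-line output.

```python
import re

# Matches the leading status word of the sample outputs above.
STATUS_RE = re.compile(r"^(OK|WARNING|CRITICAL|UNKNOWN)\s*[-:]\s*(.*)$")
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def parse_nagios_output(check_name, output):
    """Return (event, metrics) from one line of Nagios plugin output."""
    event_part, _, metrics_part = output.strip().partition("|")
    event_part = event_part.strip()
    match = STATUS_RE.match(event_part)
    if match:
        event = {"status": match.group(1).lower(), "message": match.group(2)}
    else:
        event = {"status": "unknown", "message": event_part}

    metrics = {}
    for token in metrics_part.split():
        label, sep, values = token.partition("=")
        if not sep:
            continue
        # Format is value;warn;crit;min;max and only the value is kept.
        number = NUMBER_RE.match(values.split(";")[0])
        if number:
            key = "bernard.%s.%s" % (check_name, label)
            metrics[key] = float(number.group(0))
    return event, metrics
```

Called with the check name load and the first sample line, it yields the event and the three bernard.load.* metrics listed above.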
Bernard is a standalone process which will run if the bernard.yaml file exists and contains the definition of at least one check.
Here is a description of the bernard.yaml structure and the meaning of each field. It can also be found in bernard.yaml.example:
## Default core configuration. All fields are optional and can be overridden at the check level.
core:
  schedule:
    timeout: 5            # The check will time out and exit after {timeout} seconds
    period: 60            # Scheduled once every {period} seconds
    attempts: 3           # The state change is confirmed only after {attempts} attempts (1 for instant change).
  notification: ""        # String added in the event body
  notify_startup: none    # Which state to notify at startup, can be all, warning, critical or none

## Checks configuration example. To run, Bernard needs at least one valid check definition.
## To be defined, a check needs at least the `filename` or the `path` option. Other fields are optional.
checks:
  - filename: /usr/lib/nagios/plugins/check_http   # Check to execute
    args: [-H, www.google.com]                     # Arguments passed to the check
    period: 20                                     # Override core `period` option for this check
    timeout: 1                                     # Override core `timeout` option for this check
    name: http_google                              # Set check name (displayed in info page and event content).
                                                   # By default, it is based on the filename
  - path: /Users/me/dd-agent/nagios/               # Create one check per file in the `path` directory
    attempts: 2                                    # Override core `attempts` option for these checks
    notify_startup: all                            # Override core `notify_startup` option for these checks
    notification: "That is not good..."            # Override core `notification` option for these checks
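The override rules in this file can be illustrated with a short sketch that works on the already-parsed configuration (plain dictionaries). It is not Bernard's loader: the function name is hypothetical, the default name is assumed to be the file's base name, and values under core.schedule are assumed to be flattened together with the other core options.

```python
import os


def expand_checks(core, checks):
    """Return one flat settings dict per check, core defaults applied first."""
    # Flatten core defaults: schedule values plus the other core options.
    defaults = dict(core.get("schedule", {}))
    defaults.update((k, v) for k, v in core.items() if k != "schedule")

    expanded = []
    for definition in checks:
        overrides = {k: v for k, v in definition.items()
                     if k not in ("filename", "path")}
        if "filename" in definition:
            filenames = [definition["filename"]]
        elif "path" in definition:
            # One check per regular file found in the directory.
            directory = definition["path"]
            filenames = sorted(
                os.path.join(directory, entry)
                for entry in os.listdir(directory)
                if os.path.isfile(os.path.join(directory, entry))
            )
        else:
            raise ValueError("a check needs a `filename` or a `path` option")
        for filename in filenames:
            check = dict(defaults, **overrides)  # check-level values win
            check["filename"] = filename
            check.setdefault("name", os.path.basename(filename))
            expanded.append(check)
    return expanded
```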
To see the state of Bernard, look at the info page. It will display the execution status and the result of each check.
Example:
==================
Bernard (v 3.99.0)
==================
Status date: 2013-07-29 13:10:23 (2s ago)
Pid: 38726
Platform: Darwin-12.4.0-x86_64-i386-64bit
Python Version: 2.6.7
Logs: <stderr>, syslog:/var/run/syslog
Schedule count: 15
Check count: 15
Checks
======
http_staging: [ok] #1 run is ok
HTTP OK: HTTP/1.1 302 Found - 121 bytes in 0.942 second response time
load: [ok] #1 run is warning
WARNING - load average: 3.06, 3.01, 2.00
swap: [invalid_output] #1 run is unknown
Failed to parse the output of the check: swap, returncode: 127, output:
tcp_google: [ok] #1 run is ok
TCP OK - 0.009 second response time on port 80
error: [invalid_output] #1 run is unknown
Failed to parse the output of the check: error, returncode: 12, output: fake_wrong_check
donotexist: [exception] #1 run is unknown
Failed to execute the check: donotexist, exception: [Errno 2] No such file or directory
Notifications appear as events in the event stream. The aggregation is based on the check name.
Example:
load is warning on server-42
WARNING - load average: 3.09, 3.04, 2.77
Bernard has something to tell us @all