
Bernard

Remi Hakim edited this page Jul 27, 2015 · 4 revisions

Purpose

Suppose you have a number of checks deployed on your hosts: checks that, when executed, give you a status and a few metrics. These checks come from an existing monitoring tool (e.g. Nagios). How do you turn them into Datadog checks?

We have added to our agent the ability to turn arbitrary executable checks into Datadog events and metrics. Think of it as a generic integration that lets you build on what you already have deployed.

Unlike most application-specific integrations, there is no emphasis on specific metrics and no custom code is required, only adherence to a common protocol for communication between the check and the agent.

In this document we use Nagios checks as an example and a reference implementation.

Disclaimer

Please note, Bernard is still experimental and is not part of the default Agent package.

Architecture

At the core, the agent assumes 3 functions:

  1. Scheduling: checks are scheduled for execution on a periodic basis, with special provisions for slow and overdue checks.
  2. Runtime: checks are executed in an environment where their exceptions and signals are trapped, so that a bad check never causes the agent to terminate. Checks are expected to return a status (ok, warning, fail, unknown).
  3. Notification: based on the status returned by each check, the agent creates a Datadog event with the appropriate content. Usually this is done when the status changes, with provisions to avoid notifying on transient errors or flapping states.

Relationship to the other agent processes

To provide more isolation while still taking advantage of shared code, this new process, called bernard, is meant to be standalone.

bernard shares the same configuration as the collector but only communicates events and metrics to dogstatsd. dogstatsd will in turn relay these events and metrics to Datadog on a distinct schedule.

By cleanly separating bernard from the rest of the agent and using only UDP to localhost as the coupling, we minimize the impact of misbehaving checks on the basic metrics collection that the rest of the agent is responsible for.
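To illustrate how thin this coupling is, here is a minimal Python sketch of sending a gauge and an event to a local dogstatsd over UDP. It assumes the publicly documented DogStatsD datagram format and the usual default port 8125. The function names are hypothetical and this is not Bernard's actual code.

```python
import socket


def format_gauge(name, value):
    # DogStatsD gauge datagram: <metric name>:<value>|g
    return "%s:%s|g" % (name, value)


def format_event(title, text, alert_type="info"):
    # DogStatsD event datagram: _e{<title length>,<text length>}:<title>|<text>|t:<alert type>
    # Lengths are computed on the UTF-8 encoded strings.
    return "_e{%d,%d}:%s|%s|t:%s" % (
        len(title.encode("utf-8")),
        len(text.encode("utf-8")),
        title,
        text,
        alert_type,
    )


def send_to_dogstatsd(payload, host="127.0.0.1", port=8125):
    # Fire-and-forget: UDP gives no delivery guarantee, which is what keeps
    # the sender decoupled from the receiver (port 8125 is an assumption).
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("utf-8"), (host, port))
    finally:
        sock.close()
```

For example, `send_to_dogstatsd(format_gauge("bernard.load.load1", 0.44))` would emit one gauge sample. Because the sender never waits for a reply, a stalled dogstatsd cannot block the process running the checks.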

Of course bernard uses the same logging infrastructure as the rest of the agent, i.e. it uses /var/log/datadog/bernard.log by default and can forward messages to syslog if desired.

Scheduling

Each check has its own schedule configuration.

  • run every $period, less often if the scheduler is running behind
  • timeout after $timeout
  • reschedule for a longer period if the execution failed, or a shorter period if it needs to confirm a state change ($attempts runs are needed to confirm)

All values can be configured via bernard.yaml.
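The rules above can be sketched as a small function that computes the delay before the next run. This is only an illustration: the `backoff` and `speedup` factors and the `lag` parameter are assumptions, since this page does not say by how much Bernard lengthens or shortens the period.

```python
def next_delay(period, failed=False, confirming=False, lag=0.0,
               backoff=2.0, speedup=0.5):
    """Return the number of seconds to wait before the next run.

    period     -- the configured $period, in seconds
    failed     -- True if the last execution failed (run less often)
    confirming -- True if a state change still needs confirmation (run sooner)
    lag        -- how far behind the scheduler is running, in seconds
    backoff, speedup -- hypothetical factors, not taken from Bernard
    """
    delay = float(period)
    if confirming:
        delay *= speedup
    elif failed:
        delay *= backoff
    # A scheduler that is running behind pushes the next run further out.
    return delay + max(lag, 0.0)
```

With a 60-second period, a check that needs to confirm a state change would be rescheduled after 30 seconds under these assumed factors, and a failed check after 120 seconds.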

Runtime

Each check is expected to be self-contained, i.e. checks should not have external dependencies that they cannot resolve themselves. The environment is set up anew and torn down after execution.

  1. No environment variables are set prior to execution.
  2. Signals sent by the check are trapped by bernard.

Files, locks, etc. created by the check should be managed by the check itself. bernard does not track any resource used by the check (files, network connections, etc.).

Once the check is forked, it runs with a timer set to terminate it if more than $timeout seconds have elapsed (wall-clock time).
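A minimal sketch of this runtime behavior, using Python's standard `subprocess` module: the check runs with an empty environment and is killed if it exceeds the wall-clock timeout. The `run_check` name is hypothetical, and the mapping from exit code to status follows the usual Nagios plugin convention (0, 1, 2, 3), which is an assumption about how a status could be derived rather than a description of Bernard's code.

```python
import subprocess

# Conventional Nagios plugin exit codes.
NAGIOS_STATUSES = {0: "ok", 1: "warning", 2: "critical", 3: "unknown"}


def run_check(command, timeout):
    """Run a check and return a (status, output) tuple without ever raising."""
    try:
        completed = subprocess.run(
            command,
            env={},                     # no environment variables are passed on
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout,            # wall-clock limit; the child is killed on expiry
            universal_newlines=True,
        )
    except subprocess.TimeoutExpired:
        return "unknown", "check timed out after %s seconds" % timeout
    except OSError as exc:
        # For example, the executable does not exist or is not executable.
        return "unknown", "failed to execute the check: %s" % exc
    status = NAGIOS_STATUSES.get(completed.returncode, "unknown")
    return status, completed.stdout.strip()
```

Catching every failure mode and turning it into an `unknown` result is what keeps a bad check from taking down the process that runs it.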

Notifications

The 2 basic questions that this part answers are:

  1. Whether to notify
  2. Who to notify (and what to say)

Notifications are created as Datadog events.

Whether to notify

The decision to notify or not is based on the past check history. The last $attempts+1 results are maintained. If the last $attempts results confirm the state change, an event is triggered. This logic is self-contained and relies only on access to the check history.
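One way to read this rule: keep the last $attempts+1 results, and trigger an event when the oldest one differs from the $attempts most recent ones, which must all agree. The sketch below implements that interpretation. The `CheckHistory` class and its method names are hypothetical, and startup behavior (`notify_startup`) is not modeled.

```python
from collections import deque


class CheckHistory:
    """Keeps the last attempts+1 results of a single check."""

    def __init__(self, attempts):
        self.attempts = attempts
        self.results = deque(maxlen=attempts + 1)

    def record(self, status):
        """Store a new result and return True if it confirms a state change."""
        self.results.append(status)
        return self.state_change_confirmed()

    def state_change_confirmed(self):
        if len(self.results) < self.attempts + 1:
            return False  # not enough history yet
        previous, *recent = self.results
        # The last `attempts` results must agree with each other
        # and differ from the state that preceded them.
        return all(s == recent[0] for s in recent) and recent[0] != previous
```

With `attempts` set to 3, the sequence ok, warning, warning, warning triggers an event on the fourth result only, and a flapping sequence such as ok, warning, ok, warning never does. With `attempts` set to 1, any change is reported immediately.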

Who to notify

In the initial implementation, the mapping between a check and a recipient is statically defined in the configuration of bernard.

How notifications are sent

Since notifications are Datadog events and to have as little coupling as possible between bernard and the rest of the agent, events are sent to dogstatsd.

To apply a change to the configuration, simply restart bernard.

In future versions, the mapping between checks and recipients will be dynamically controlled from our servers so that notification routes can be updated without changing the configuration.

Nagios Checks

Currently, Bernard supports only Nagios checks.

Nagios checks return results like this:

OK - load average: 0.44, 0.17, 0.13|load1=0.440;2.000;3.000;0; load5=0.170;2.000;3.000;0; load15=0.130;2.000;3.000;0; 
WARNING - load average: 0.85, 0.27, 0.16|load1=0.850;0.100;1.000;0; load5=0.270;0.100;1.000;0; load15=0.160;0.100;1.000;0; 

bernard destructures these results into 2 distinct parts:

  1. the event part (before the |)
  2. the metrics part (after the |)

The event part is turned into a check result, complete with status and message. Notifications will be based on events.

The metrics part is turned into individual metrics, prefixed with bernard.$check_name.

For the first example above, we would have:

  1. event: status=ok, message='load average: 0.44, 0.17, 0.13'
  2. metrics: bernard.load.load1=0.44, bernard.load.load5=0.17, bernard.load.load15=0.13
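The split described above can be sketched in a few lines of Python. This is an illustrative parser, not Bernard's own: the `parse_nagios_output` name is hypothetical, and it only handles output of the form `STATUS - message|label=value;...` shown in the examples. Output without a ` - ` separator is reported as unknown.

```python
import re

# Leading numeric part of a performance value, e.g. "0.942" in "0.942s".
_NUMBER = re.compile(r"[-+]?\d*\.?\d+")


def parse_nagios_output(check_name, output):
    """Return (status, message, metrics) for one line of check output."""
    event_part, _, metrics_part = output.partition("|")
    status, sep, message = event_part.strip().partition(" - ")
    if not sep:
        # The line does not look like "STATUS - message".
        status, message = "unknown", event_part.strip()
    metrics = {}
    for token in metrics_part.split():
        label, eq, data = token.partition("=")
        # Only the first field is the value; the rest are thresholds and bounds.
        match = _NUMBER.match(data.split(";")[0])
        if eq and match:
            name = "bernard.%s.%s" % (check_name, label)
            metrics[name] = float(match.group())
    return status.lower(), message, metrics
```

Calling it with the check name `load` and the first example line yields the status `ok`, the message `load average: 0.44, 0.17, 0.13`, and the three `bernard.load.*` metrics listed above.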

Set up Bernard

Configuration

Bernard is a standalone process that runs if the bernard.yaml file exists and contains the definition of at least one check.

Here is a description of the bernard.yaml structure and the meaning of each field. It can also be found in bernard.yaml.example:

## Default core configuration. All fields are optional and can be overridden at the check level.
core:
  schedule:
    timeout:  5           # The check will time out and exit after {timeout} seconds
    period:  60           # Scheduled once every {period} seconds
    attempts: 3           # The state change is confirmed only after {attempts} attempts (1 for instant change).
  notification: ""        # String added in the event body
  notify_startup: none    # Which state to notify at startup, can be all, warning, critical or none

## Checks configuration example. To run, Bernard needs at least one valid check definition. 
## To be defined, a check needs at least the `filename` or the `path` option. Other fields are optional. 
checks:
- filename: /usr/lib/nagios/plugins/check_http    # Check to execute
  args: [-H, www.google.com]                      # Arguments passed to the check
  period: 20                                      # Override core `period` option for this check
  timeout: 1                                      # Override core `timeout` option for this check
  name: http_google                               # Set check name (displayed in info page and event content). 
                                                  # By default, it is based on the filename
- path: /Users/me/dd-agent/nagios/                # Create one check per file in the `path` directory
  attempts: 2                                     # Override core `attempts` option for these checks
  notify_startup: all                             # Override core `notify_startup` option for these checks
  notification: "That is not good..."             # Override core `notification` option for these checks
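To make the override rule concrete, here is a small Python sketch of how check-level fields could be layered over the core section once the YAML has been loaded into dictionaries. The `effective_settings` function is hypothetical, and it assumes that schedule options appear under `core.schedule` but directly on each check, as in the example above.

```python
SCHEDULE_KEYS = ("timeout", "period", "attempts")
OTHER_KEYS = ("notification", "notify_startup")


def effective_settings(core, check):
    """Merge the core section with one check definition (the check wins)."""
    settings = {}
    schedule = core.get("schedule", {})
    # Start from the core values...
    for key in SCHEDULE_KEYS:
        if key in schedule:
            settings[key] = schedule[key]
    for key in OTHER_KEYS:
        if key in core:
            settings[key] = core[key]
    # ...then apply any check-level override.
    for key in SCHEDULE_KEYS + OTHER_KEYS:
        if key in check:
            settings[key] = check[key]
    return settings
```

For the `http_google` check above, this would give a period of 20 and a timeout of 1 from the check itself, while `attempts` stays at the core value of 3.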

Usage

To see the state of Bernard, look at the info page. It will display the execution status and the result of each check.

Example:

==================
Bernard (v 3.99.0)
==================

  Status date: 2013-07-29 13:10:23 (2s ago)
  Pid: 38726
  Platform: Darwin-12.4.0-x86_64-i386-64bit
  Python Version: 2.6.7
  Logs: <stderr>, syslog:/var/run/syslog

  Schedule count: 15
  Check count: 15

  Checks
  ======
  
    http_staging: [ok] #1 run is ok
      HTTP OK: HTTP/1.1 302 Found - 121 bytes in 0.942 second response time
    load: [ok] #1 run is warning
      WARNING - load average: 3.06, 3.01, 2.00
    swap: [invalid_output] #1 run is unknown
      Failed to parse the output of the check: swap, returncode: 127, output: 
    tcp_google: [ok] #1 run is ok
      TCP OK - 0.009 second response time on port 80
    error: [invalid_output] #1 run is unknown
      Failed to parse the output of the check: error, returncode: 12, output: fake_wrong_check
    donotexist: [exception] #1 run is unknown
      Failed to execute the check: donotexist, exception: [Errno 2] No such file or directory

Notifications

Notifications appear as events in the event stream. The aggregation is based on the check name.

Example:

load is warning on server-42


WARNING - load average: 3.09, 3.04, 2.77

Bernard has something to tell us @all
