stats: physically separate Prometheus and StatsD; build and lint
* new build tag: `statsd`
* update make and lint scripts and associated yaml
  - add build and lint permutations
* extract common constants and helpers, reduce code duplication
* update docs: document `statsd` and other build tags
  - remove the `AIS_PROMETHEUS` environment variable

Signed-off-by: Alex Aizman <[email protected]>
alex-aizman committed Jul 9, 2024
1 parent 0058646 commit 860c136
Showing 11 changed files with 1,020 additions and 243 deletions.
22 changes: 16 additions & 6 deletions .github/workflows/build.yml
@@ -30,12 +30,22 @@ jobs:
- name: Build AIStore on ${{ matrix.os }}
run: |
export GOPATH="$(go env GOPATH)"
MODE="" make node # Build node without backends in production mode.
MODE="debug" make node # Build node without backends in debug mode.
AIS_BACKEND_PROVIDERS="aws azure gcp" MODE="" make node # Build with all backends (production mode).
AIS_BACKEND_PROVIDERS="aws azure gcp" MODE="debug" make node # Build with all backends (debug mode).
MEM_PROFILE="/tmp/mem" CPU_PROFILE="/tmp/cpu" make node # Build with profile.
TAGS="nethttp" make node # Build with net/http transport support (fasthttp is used by default).
# 1) no build tags, no debug
MODE="" make node
# 2) no build tags, debug
MODE="debug" make node
# 3) cloud backends, no debug
AIS_BACKEND_PROVIDERS="aws azure gcp" MODE="" make node
# 4) cloud backends, debug
AIS_BACKEND_PROVIDERS="aws azure gcp" MODE="debug" make node
# 5) cloud backends, debug, statsd
# (build with StatsD, and note that Prometheus is the default when `statsd` tag is not defined)
TAGS="aws azure gcp statsd debug" make node
# 6) statsd, debug, nethttp (note that fasthttp is used by default)
TAGS="nethttp statsd debug" make node
# 7) w/ mem profile (see cmd/aisnodeprofile)
MEM_PROFILE="/tmp/mem" CPU_PROFILE="/tmp/cpu" make node
# 8) authn, cli, aisloader
make authn
make cli
make aisloader
3 changes: 3 additions & 0 deletions .github/workflows/lint.yml
@@ -34,5 +34,8 @@ jobs:
run: |
export GOPATH="$(go env GOPATH)"
make lint
TAGS=statsd make lint
TAGS="statsd nethttp debug" make lint
TAGS="aws gcp azure" make lint
make fmt-check
make spell-check
6 changes: 5 additions & 1 deletion .golangci.yml
@@ -176,10 +176,14 @@ run:
concurrency: 4
timeout: 6m

# Build hrw and backend providers so that staticcheck doesn't complain about unused export functions.
# NOTE: these are the default build tags for the linter;
# use `TAGS=... make lint` to check that the corresponding alternative sources do lint correctly;
# e.g. `TAGS=statsd make lint`
build-tags:
- hrw
- aws
- azure
- gcp
# - nethttp
# - statsd

3 changes: 2 additions & 1 deletion deploy/dev/local/deploy.sh
@@ -168,7 +168,8 @@ parse_backend_providers

create_loopbacks

if ! AIS_BACKEND_PROVIDERS=${AIS_BACKEND_PROVIDERS} make --no-print-directory -C ${AISTORE_PATH} node; then
## NOTE: statsd is Local Playground's default
if ! AIS_BACKEND_PROVIDERS=${AIS_BACKEND_PROVIDERS} TAGS=statsd make --no-print-directory -C ${AISTORE_PATH} node; then
exit_error "failed to compile 'aisnode' binary"
fi

28 changes: 25 additions & 3 deletions docs/environment-vars.md
@@ -207,13 +207,35 @@ AIStore is a fully compliant [Prometheus exporter](https://prometheus.io/docs/in

In addition and separately, AIStore supports [StatsD](https://github.com/etsy/statsd) and, via StatsD, Graphite (collection) and Grafana (graphics).

The corresponding binary choice between StatsD and Prometheus is a **deployment-time** switch controlled by a single environment variable: **AIS_PROMETHEUS**.
The corresponding binary choice between StatsD and Prometheus is a **build-time** switch controlled by a single build tag: **statsd**.

Namely:
> Generally, the entire assortment of supported build tags is demonstrated by the following `aisnode` build examples:
```console
# 1) no build tags, no debug
MODE="" make node

# 2) no build tags, debug
MODE="debug" make node

# 3) cloud backends, no debug
AIS_BACKEND_PROVIDERS="aws azure gcp" MODE="" make node

# 4) cloud backends, debug
AIS_BACKEND_PROVIDERS="aws azure gcp" MODE="debug" make node

# 5) cloud backends, debug, statsd
# (build with StatsD, and note that Prometheus is the default when `statsd` tag is not defined)
TAGS="aws azure gcp statsd debug" make node

# 6) statsd, debug, nethttp (note that fasthttp is used by default)
TAGS="nethttp statsd debug" make node
```
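
The "physical separation" between the two backends relies on Go build constraints. A minimal sketch of the pattern, with hypothetical file and function names (not the actual AIStore sources):

```go
// statsd_impl.go (hypothetical) -- compiled only when the `statsd` tag is set

//go:build statsd

package stats

func backendName() string { return "StatsD" }
```

```go
// prometheus_impl.go (hypothetical) -- the default, compiled when `statsd` is NOT set

//go:build !statsd

package stats

func backendName() string { return "Prometheus" }
```

Because exactly one of the two files is compiled into `aisnode`, the unused backend is absent from the binary altogether. With Go 1.18+, `go version -m aisnode` prints the embedded build settings (including `-tags`), which is a convenient way to confirm how a given binary was built.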

As for the StatsD alternative specifically, the additional environment variables include:

| name | comment |
| ---- | ------- |
| `AIS_PROMETHEUS` | e.g. usage: `export AIS_PROMETHEUS=true` |
| `AIS_STATSD_PORT` | use it to override the default `8125` (see https://github.com/etsy/stats) |
| `AIS_STATSD_PROBE` | a startup option that, when true, tells an ais node to _probe_ whether a StatsD server exists (and responds); if the probe fails, the node disables its StatsD functionality completely, i.e., it will not send any metrics to the StatsD port (above)
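
For background, StatsD speaks a simple plain-text protocol over UDP. The sketch below is a generic client (not AIStore's implementation) that emits a single counter to the default port listed above; the bucket name is illustrative:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// StatsD listens on UDP; 8125 is the default (cf. AIS_STATSD_PORT)
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()

	// counter line format: <bucket>:<value>|c
	fmt.Fprint(conn, "ais.get.n:1|c")
}
```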

32 changes: 29 additions & 3 deletions docs/prometheus.md
@@ -28,14 +28,40 @@ This document mostly talks about the "Prometheus" option. Other related document

AIStore is a fully compliant [Prometheus exporter](https://prometheus.io/docs/instrumenting/writing_exporters/) that natively supports [Prometheus](https://prometheus.io/) stats collection. There's no special configuration - the only thing required to enable the corresponding integration is letting AIStore know whether to publish its stats via StatsD **or** Prometheus.

The corresponding binary choice between StatsD and Prometheus is a **deployment-time** switch that is a single environment variable: **AIS_PROMETHEUS**. When a starting-up AIS node (gateway or storage target) sees `AIS_PROMETHEUS` in the environment it registers all its metric descriptions (names, labels, and helps) with Prometheus and provides HTTP endpoint `/metrics` for subsequent collection (aka "scraping") by Prometheus.
The corresponding binary choice between StatsD and Prometheus is a **build-time** switch controlled by a single build tag: `statsd`.

> With no `AIS_PROMETHEUS` in the environment, AIS nodes default to StatsD.
> Generally, the entire assortment of supported build tags is demonstrated by the following `aisnode` build examples:
```console
# 1) no build tags, no debug
MODE="" make node

# 2) no build tags, debug
MODE="debug" make node

# 3) cloud backends, no debug
AIS_BACKEND_PROVIDERS="aws azure gcp" MODE="" make node

# 4) cloud backends, debug
AIS_BACKEND_PROVIDERS="aws azure gcp" MODE="debug" make node

# 5) cloud backends, debug, statsd
# (build with StatsD, and note that Prometheus is the default when `statsd` tag is not defined)
TAGS="aws azure gcp statsd debug" make node

# 6) statsd, debug, nethttp (note that fasthttp is used by default)
TAGS="nethttp statsd debug" make node
```

When a starting-up AIS node (gateway or storage target) is built with Prometheus support (i.e., **without** the `statsd` build tag), it will:

* register all its metric descriptions (names, labels, and helps) with Prometheus, and
* provide HTTP endpoint `/metrics` for subsequent collection (aka "scraping") by Prometheus.
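
For reference only, the general register-and-scrape pattern looks roughly as follows when using the standard `prometheus/client_golang` library; this is a generic sketch, and the namespace, metric name, and port are illustrative rather than AIStore's actual code:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// register a metric description (name, help) with the default registry
	getCount := prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "ais",
		Name:      "get_count",
		Help:      "total number of GET(object) requests",
	})
	prometheus.MustRegister(getCount)
	getCount.Inc()

	// expose /metrics for subsequent scraping by Prometheus
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":8081", nil)
}
```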

Here's a simplified example:

```console
$ AIS_PROMETHEUS=true aisnode -config=/etc/ais/ais.json -local_config=/etc/ais/ais_local.json -role=target
$ aisnode -config=/etc/ais/ais.json -local_config=/etc/ais/ais_local.json -role=target

# Assuming the target with hostname "hostname" listens on port 8081:
$ curl http://hostname:8081/metrics | grep ais
8 changes: 7 additions & 1 deletion scripts/bootstrap.sh
@@ -49,7 +49,13 @@ source ${SCRIPTS_DIR}/utils.sh
case $1 in
lint)
echo "Running lint..." >&2
golangci-lint --timeout=15m run $(list_all_go_dirs)
if [[ -z ${TAGS} ]]; then
# using build tags from .golangci.yml
golangci-lint --timeout=15m run $(list_all_go_dirs)
else
# using build tags from env
golangci-lint --timeout=15m --build-tags="${TAGS}" run $(list_all_go_dirs)
fi
exit $?
;;

121 changes: 121 additions & 0 deletions stats/common.go
@@ -0,0 +1,121 @@
// Package stats provides methods and functionality to register, track, log,
// and StatsD-notify statistics that, for the most part, include "counter" and "latency" kinds.
/*
* Copyright (c) 2018-2024, NVIDIA CORPORATION. All rights reserved.
*/
package stats

import (
"strings"
"time"

"github.com/NVIDIA/aistore/cmn"
)

const (
dfltPeriodicFlushTime = time.Minute // when `config.Log.FlushTime` is 0 (zero)
dfltPeriodicTimeStamp = time.Hour // extended date/time complementary to log timestamps (e.g., "11:29:11.644596")
maxStatsLogInterval = int64(3 * time.Minute) // when idle; also an upper limit on `config.Log.StatsTime`
maxCapLogInterval = int64(4 * time.Hour) // to see capacity at least a few times a day (when idle)
)

// more periodic
const (
maxLogSizeCheckTime = 48 * time.Minute // periodically check the logs for max accumulated size
startupSleep = 300 * time.Millisecond // periodically poll ClusterStarted()
numGorHighCheckTime = 2 * time.Minute // periodically log a warning if the number of goroutines remains high
)

// number-of-goroutines watermarks expressed as multipliers over the number of available logical CPUs (GOMAXPROCS)
const (
numGorHigh = 100
numGorExtreme = 1000
)

// metrics
const (
// KindCounter:
// all basic counters are accompanied by the corresponding (errPrefix + kind) error count:
// e.g.: "get.n" => "err.get.n", "put.n" => "err.put.n", etc.
// See also: `IncErr`, `regCommon`
GetCount = "get.n" // GET(object) count = (cold + warm)
PutCount = "put.n" // ditto PUT
AppendCount = "append.n" // ditto etc.
DeleteCount = "del.n" // ditto
RenameCount = "ren.n" // ditto
ListCount = "lst.n" // list-objects

// statically defined err counts (NOTE: update regCommon when adding/updating)
ErrHTTPWriteCount = errPrefix + "http.write.n"
ErrDownloadCount = errPrefix + "dl.n"
ErrPutMirrorCount = errPrefix + "put.mirror.n"

// KindLatency
GetLatency = "get.ns"
GetLatencyTotal = "get.ns.total"
ListLatency = "lst.ns"
KeepAliveLatency = "kalive.ns"

// KindSpecial
Uptime = "up.ns.time"

// KindGauge, cos.NodeStateFlags enum
NodeStateFlags = "state.flags"
)

// interfaces
type (
// implemented by the stats runners
statsLogger interface {
log(now int64, uptime time.Duration, config *cmn.Config)
statsTime(newval time.Duration)
standingBy() bool
}
)

// primitives: values and maps
type (
// Stats are tracked via a map of stats names (key) to statsValue (values).
// There are two main types of stats, counter and latency, declared
// using the kind field. Only latency stats have numSamples, used to compute latency.
statsValue struct {
kind string // enum { KindCounter, ..., KindSpecial }
label struct {
comm string // common part of the metric label (as in: <prefix> . comm . <suffix>)
stsd string // StatsD label
prom string // Prometheus label
}
Value int64 `json:"v,string"`
numSamples int64 // (log + StatsD) only
cumulative int64
}
copyValue struct {
Value int64 `json:"v,string"`
}
copyTracker map[string]copyValue // aggregated every statsTime interval
)

// sample name ais.ip-10-0-2-19.root.log.INFO.20180404-031540.2249
var logtypes = []string{".INFO.", ".WARNING.", ".ERROR."}

var ignoreIdle = []string{"kalive", Uptime, "disk."}

func ignore(s string) bool {
for _, p := range ignoreIdle {
if strings.HasPrefix(s, p) {
return true
}
}
return false
}

// convert bytes to meGabytes with a fixed rounding precision = 2 digits
// - KindThroughput and KindComputedThroughput only
// - MB, not MiB
// - math.Ceil wouldn't produce two decimals
func roundMBs(val int64) (mbs float64) {
mbs = float64(val) / 1000 / 10
num := int(mbs + 0.5)
mbs = float64(num) / 100
return
}
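
A worked example of the rounding above (a hypothetical snippet, not part of this commit; it assumes access to the unexported `roundMBs`):

```go
package stats

import "fmt"

// roundMBsDemo is a hypothetical illustration of the fixed two-decimal rounding:
//	123_456_789 bytes / 1000 / 10 = 12345.6789
//	int(12345.6789 + 0.5)         = 12346
//	12346 / 100                   = 123.46 (MB)
func roundMBsDemo() {
	fmt.Println(roundMBs(123_456_789)) // prints: 123.46
	fmt.Println(roundMBs(1_500_000))   // prints: 1.5
}
```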