Statsd output is not good #327

drawks · 2017-08-01T15:50:03Z

Aside from the actual broken protocol packets in #207 the implementation of instrumentation with statsd in this project is just really bad. Exporting gauges of pre-aggregated metrics is a serious anti-pattern to typical/normal statsd instrumentation best practices. I'm curious if there is some specific reason why this model was adopted. Is there any appetite for re-instrumenting in a more standard manner?

magiconair · 2017-08-01T16:21:46Z

You could ask the original committer of the approach or provide a better way of doing this. However, before you do this I'd like to understand what "really bad" means in this context. So far nobody has complained about it and aside from the one bug it seems to have been working for people. I usually prefer a fact based approach on discussions and I'm curious which problems the current implementation creates and how you would solve them.

drawks · 2017-08-01T16:38:49Z

statsd is meant primarily as a stat aggregator. a typical pattern for using statsd is to emit events on instrumented code paths and allow the statsd implementation handle the aggregation. For example emit a counter event when a code path is hit and then allow for statsd to flush that counter at a configurable interval so you get roll up aggregates in hits per time unit where time unit is a property of the statsd server. Or to emit a timer event when an instrumented code path is hit and allow for statsd to bucket that timer into a configurable histograms/calculate quantiles etc.

The pattern in use in fabio is to handle aggregation internally and expose pre-aggregated metrics only as gauges. This limits the ability of consumers of the metrics to shape the data as desired. For instance a consumer is limited to only the quantiles which you've chosen to expose via the code.

AFAICT it looks like this is a side effect of using github.com/rcrowley/go-metrics which handles aggregation with intent of periodic flushing as gauges to backends that depend on that model, i.e. graphite

discobean · 2017-08-02T05:02:22Z

Using raw metrics when sending to statsd would certainly make life simpler

magiconair · 2017-08-02T14:56:05Z

Can you provide a specific example? Which metric, what is the current format, what is the desired format? Please keep in mind that I'm not a statsd expert and that you may have to explain more to me than to people already familiar with the matter.

drawks · 2017-08-02T17:45:24Z

I'm not sure I can give a better explanation without just writing an example PR. but....

Within the fabio code you've got the notion of a Registry interface that has a few functions to service counters and timers. In order to implement a more normally functioning statsd client each call to Registry.Counter.Inc(n) would emit a single statsd udp packet destined for a statsd server and likewise for Registry.Timer.Update and Registry.Timer.UpdateSince.

The current implementation of statsd sending relies on github.com/magiconair/go-metrics-statsd which is pasted on top of github.com/rcrowley/go-metrics that has a notion of storing the aggregates internally to the process.

discobean · 2017-08-03T04:19:51Z

I have found a few key points that are frustrating us:

The metrics for count are absolute count received since the process started
I actually have no idea what min/max/mean/std-dev/50-percentile etc actually show
The metrics for one-minute, five-minute are not correct

I'll elaborate on 2 and 3:

min/max/mean/std-dev/50-percentile:

As a test I have pushed through some data to fabio and recorded the metrics. The initial values for min/max/mean are: 0/0/0

I push through one request, the values are now each: 44462314.0/44462314.0/44462314.0.

What exactly is 44462314.0? My guess is it is latency, but is it in ms? microseconds? Probably it is just me, but I have no idea. The actual request time from my client was ~353ms

Now each flush of metrics the value of min/max/mean remains the same, even though I am not putting any traffic through. So the mean is always 44462314.0, forever.

So the value is the min/mean/mean value for all the requests since the beginning of the process? That data is not really usable.

one-minute, five-minute issue

I found after putting through requests over time, the one-minute value never goes to 0. Even though there is no traffic going out for the past minute :S

If I wait sometime, and then push traffic through for a minute, the 1 minute value != the values of traffic I actually sent in that last minute.

What would be a solution

I think the metrics required would be:

the number of requests
the min/max/mean/average latency

Importantly they would be relative to that metrics period. So if metrics.interval is set to 60, the data would be the total request count for 60 seconds, etc..

For our purposes only, we just require metrics per service, 2 values: count/latency, we do not require a full suite of metrics per URL/host/fabioserver combination as that is just too much (and we push metrics to cloudwatch which makes it more expensive).

ie. service-xyz.count and service-xyz.latency_min/max/avg/mean

Or something along those lines...

* adds new metric sink `raw_statsd` uses same config as `statsd` * new sink sends singl raw events directly to statsd server without any preaggreation or rollup. * TODO - use more featureful/performant statsd client implementation such as https://github.com/alexcesaro/statsd or https://github.com/quipo/statsd which implement features such as optional buffering.

drawks · 2017-08-03T16:19:42Z

I've issued a quick PR which demonstrates how a simple raw statsd sink can be implemented which doesn't do any internal aggregation and fits more directly into the ~~intended~~ typical statsd instrumentation model. This PR probably isn't appropriate to merge as written, but builds and works as intended in some limited local testing.

magiconair · 2017-08-06T06:24:36Z

Sorry for the delay but I'm on vacation with the family.

The key point here is that you want to send an event to statsd when it happens and want to do the aggregation yourself. So far I've relied on go-metrics to do The Right Thing (™) but I recall some discussions around integrating the circonus support that the histograms in go-metrics aren't that useful. Mostly because they're a port of the codahale model from Java. I didn't dig deeper there. Maybe @maier can help here.

@drawks thx for the PR. I'll try to look at it soon. Kids waking up now :)

This patch adds a statds reporter which does not aggregate values. Fixes #327

magiconair · 2017-08-06T10:20:09Z

@drawks I think your PR would be good enough for an experimental inclusion. Don't you need to add newlines between multiple metrics?

The downside of metrics that are not pre-aggregated is that the API will no longer provide these values. If you are using this provider you probably don't care about that.

However, integrating one of the libraries you suggested was simple enough that I've tried that in #329. To avoid the alloc per metrics I need to refactor the registry interface a bit but I'd appreciate if you could test the integration.

You should set:

./fabio -metrics.target statsd_raw -metrics.interval 100ms

I'll sleep a bit on the naming of statsd_raw vs raw_stats and the corresponding struct names. Also need to update the documentation.

maier · 2017-08-06T13:16:51Z

Similar situation to the one we faced with the Circonus integration. The receiving system (Circonus, StatsD, etc.) need the individual samples for metrics not a single value from the registry.

magiconair · 2017-08-06T21:07:34Z

I think a better solution is to remove the Counter and Timer structs from my metrics package and add Count and Time methods to the registry directly. Then every driver can choose whether or not to aggregate. That might also be a good path to supporting tags.

This patch adds a statds reporter which does not aggregate values. Fixes #327

This patch refactors the metrics support within fabio to achieve several goals: * support labels for DogstatD, Prometheus and others * support raw events for statsd and others without aggregation * support multiple backends simultaneously to support migration * make integration of new metrics backends and/or libraries simple * being fully backwards compatible to not break existing setups One of the main challenges is that fabio generates names for the metrics targets through templates and that these templates are configurable. A metrics backend which supports labels needs to generate a different name for the same metric than one that does not support labels. Combining this with the need to support multiple different metrics backends at runtime means that none of the existing metrics libraries like go-kit/metrics and hashicorp/go-metrics is a good fit. However, they can be used as drivers which reduces the need for testing and makes integration of new backends simpler since fabio does not need to rely on a single metrics library. This patch is work in progress with the new metrics code in the metrics4 package and the old metrics code in the 'metrics_old' package where it is kept for reference until it has been migrated or removed. The basic architecture should be good enough to add more providers and functionality so that the community can help. Right now there are two providers "flat" and "label" to demonstrate the concepts. They provide counter and timers in statsd/dogstatd format and can be configured with -metrics.target=flat[,label]. Other providers should be added in the same way. The metrics_old code should be converted to the new provider scheme. The go-kit/metrics library currently supports most of the necessary drivers and is the preferred way of integrating new drivers. The list below is a probably incomplete list of things that need to be done: * The configuration is not used except for metrics.target * The drivers from the metrics_old package need to be migrated * Additional drivers from go-kit need to be added * The old rcrowley/go-metrics code relies on 'registries' for namespacing. After the routing table has changed the registry needs to be pruned of services which no longer exist so that go-metrics stops reporting them. * The RX/TX counters in the TCP and TCP+SNI proxy are probably sub-optimal or broken. * Some counters may not have been initialized properly, e.g. WSConn * WSConn should probably be a gauge which also needs to be set to 0 on startup. * The approach of injecting a noop metrics provider if none has been set has already bitten me once since some metrics are reported while others are not. I need to think about this a bit more. Fixes #126 Fixes #165 Fixes #211 Fixes #253 Fixes #326 Fixes #327 Fixes #371 Closes #329 Closes #331 Closes #334 Closes #335

magiconair · 2018-03-24T12:06:55Z

Please see #211 (comment)

This patch refactors the metrics support within fabio to achieve several goals: * support labels for DogstatD, Prometheus and others * support raw events for statsd and others without aggregation * support multiple backends simultaneously to support migration * make integration of new metrics backends and/or libraries simple * being fully backwards compatible to not break existing setups One of the main challenges is that fabio generates names for the metrics targets through templates and that these templates are configurable. A metrics backend which supports labels needs to generate a different name for the same metric than one that does not support labels. Combining this with the need to support multiple different metrics backends at runtime means that none of the existing metrics libraries like go-kit/metrics and hashicorp/go-metrics is a good fit. However, they can be used as drivers which reduces the need for testing and makes integration of new backends simpler since fabio does not need to rely on a single metrics library. This patch is work in progress with the new metrics code in the metrics4 package and the old metrics code in the 'metrics_old' package where it is kept for reference until it has been migrated or removed. The basic architecture should be good enough to add more providers and functionality so that the community can help. Right now there are two providers "flat" and "label" to demonstrate the concepts. They provide counter and timers in statsd/dogstatd format and can be configured with -metrics.target=flat[,label]. Other providers should be added in the same way. The metrics_old code should be converted to the new provider scheme. The go-kit/metrics library currently supports most of the necessary drivers and is the preferred way of integrating new drivers. The list below is a probably incomplete list of things that need to be done: * The configuration is not used except for metrics.target * The drivers from the metrics_old package need to be migrated * Additional drivers from go-kit need to be added * The old rcrowley/go-metrics code relies on 'registries' for namespacing. After the routing table has changed the registry needs to be pruned of services which no longer exist so that go-metrics stops reporting them. * The RX/TX counters in the TCP and TCP+SNI proxy are probably sub-optimal or broken. * Some counters may not have been initialized properly, e.g. WSConn * WSConn should probably be a gauge which also needs to be set to 0 on startup. * The approach of injecting a noop metrics provider if none has been set has already bitten me once since some metrics are reported while others are not. I need to think about this a bit more. Fixes #126 Fixes #165 Fixes #211 Fixes #253 Fixes #326 Fixes #327 Fixes #371 Closes #329 Closes #331 Closes #334 Closes #335 Remove 'Register' methods and create objects directly. Add Gauge and use it for websocket connections Vendoring in github.com/alexcesaro/statsd First stab at integrating raw statsd from #335 Rebase off master 10/23/2020 - NJ

drawks mentioned this issue Aug 3, 2017

Resolves #327 PoC raw_statsd #328

Closed

magiconair added a commit that referenced this issue Aug 6, 2017

metrics: add raw statsd reporter

906433f

This patch adds a statds reporter which does not aggregate values. Fixes #327

magiconair added a commit that referenced this issue Aug 6, 2017

metrics: add raw statsd reporter

1f0e9fa

This patch adds a statds reporter which does not aggregate values. Fixes #327

magiconair added a commit that referenced this issue Aug 6, 2017

metrics: add raw statsd reporter

d6c62e7

This patch adds a statds reporter which does not aggregate values. Fixes #327

magiconair mentioned this issue Aug 6, 2017

Add raw statsd reporter #329

Closed

magiconair added a commit that referenced this issue Aug 9, 2017

metrics: add raw statsd reporter

1b454db

This patch adds a statds reporter which does not aggregate values. Fixes #327

magiconair added a commit that referenced this issue Aug 11, 2017

metrics: add raw statsd reporter

3d3f615

This patch adds a statds reporter which does not aggregate values. Fixes #327

magiconair added a commit that referenced this issue Aug 11, 2017

metrics: add raw statsd reporter

565c8f6

This patch adds a statds reporter which does not aggregate values. Fixes #327

magiconair added the theme-metrics label Oct 1, 2017

magiconair added this to the 1.6.0 milestone Oct 10, 2017

magiconair mentioned this issue Mar 24, 2018

Refactor metrics #476

Merged

magiconair added the pinned label Oct 29, 2018

nathanejohnson closed this as completed in #476 Apr 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Statsd output is not good #327

Statsd output is not good #327

drawks commented Aug 1, 2017 •

edited

Loading

magiconair commented Aug 1, 2017

drawks commented Aug 1, 2017 •

edited

Loading

discobean commented Aug 2, 2017

magiconair commented Aug 2, 2017

drawks commented Aug 2, 2017

discobean commented Aug 3, 2017 •

edited

Loading

drawks commented Aug 3, 2017

magiconair commented Aug 6, 2017

magiconair commented Aug 6, 2017

maier commented Aug 6, 2017

magiconair commented Aug 6, 2017

magiconair commented Mar 24, 2018

Statsd output is not good #327

Statsd output is not good #327

Comments

drawks commented Aug 1, 2017 • edited Loading

magiconair commented Aug 1, 2017

drawks commented Aug 1, 2017 • edited Loading

discobean commented Aug 2, 2017

magiconair commented Aug 2, 2017

drawks commented Aug 2, 2017

discobean commented Aug 3, 2017 • edited Loading

min/max/mean/std-dev/50-percentile:

one-minute, five-minute issue

What would be a solution

drawks commented Aug 3, 2017

magiconair commented Aug 6, 2017

magiconair commented Aug 6, 2017

maier commented Aug 6, 2017

magiconair commented Aug 6, 2017

magiconair commented Mar 24, 2018

drawks commented Aug 1, 2017 •

edited

Loading

drawks commented Aug 1, 2017 •

edited

Loading

discobean commented Aug 3, 2017 •

edited

Loading