Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Statsd output is not good #327

Closed
drawks opened this issue Aug 1, 2017 · 12 comments · Fixed by #476
Closed

Statsd output is not good #327

drawks opened this issue Aug 1, 2017 · 12 comments · Fixed by #476

Comments

@drawks
Copy link

drawks commented Aug 1, 2017

Aside from the actual broken protocol packets in #207 the implementation of instrumentation with statsd in this project is just really bad. Exporting gauges of pre-aggregated metrics is a serious anti-pattern to typical/normal statsd instrumentation best practices. I'm curious if there is some specific reason why this model was adopted. Is there any appetite for re-instrumenting in a more standard manner?

@magiconair
Copy link
Contributor

You could ask the original committer of the approach or provide a better way of doing this. However, before you do this I'd like to understand what "really bad" means in this context. So far nobody has complained about it and aside from the one bug it seems to have been working for people. I usually prefer a fact based approach on discussions and I'm curious which problems the current implementation creates and how you would solve them.

@drawks
Copy link
Author

drawks commented Aug 1, 2017

statsd is meant primarily as a stat aggregator. a typical pattern for using statsd is to emit events on instrumented code paths and allow the statsd implementation handle the aggregation. For example emit a counter event when a code path is hit and then allow for statsd to flush that counter at a configurable interval so you get roll up aggregates in hits per time unit where time unit is a property of the statsd server. Or to emit a timer event when an instrumented code path is hit and allow for statsd to bucket that timer into a configurable histograms/calculate quantiles etc.

The pattern in use in fabio is to handle aggregation internally and expose pre-aggregated metrics only as gauges. This limits the ability of consumers of the metrics to shape the data as desired. For instance a consumer is limited to only the quantiles which you've chosen to expose via the code.

AFAICT it looks like this is a side effect of using github.com/rcrowley/go-metrics which handles aggregation with intent of periodic flushing as gauges to backends that depend on that model, i.e. graphite

@discobean
Copy link

Using raw metrics when sending to statsd would certainly make life simpler

@magiconair
Copy link
Contributor

Can you provide a specific example? Which metric, what is the current format, what is the desired format? Please keep in mind that I'm not a statsd expert and that you may have to explain more to me than to people already familiar with the matter.

@drawks
Copy link
Author

drawks commented Aug 2, 2017

I'm not sure I can give a better explanation without just writing an example PR. but....

Within the fabio code you've got the notion of a Registry interface that has a few functions to service counters and timers. In order to implement a more normally functioning statsd client each call to Registry.Counter.Inc(n) would emit a single statsd udp packet destined for a statsd server and likewise for Registry.Timer.Update and Registry.Timer.UpdateSince.

The current implementation of statsd sending relies on github.com/magiconair/go-metrics-statsd which is pasted on top of github.com/rcrowley/go-metrics that has a notion of storing the aggregates internally to the process.

@discobean
Copy link

discobean commented Aug 3, 2017

I have found a few key points that are frustrating us:

  1. The metrics for count are absolute count received since the process started
  2. I actually have no idea what min/max/mean/std-dev/50-percentile etc actually show
  3. The metrics for one-minute, five-minute are not correct

I'll elaborate on 2 and 3:

min/max/mean/std-dev/50-percentile:

As a test I have pushed through some data to fabio and recorded the metrics. The initial values for min/max/mean are: 0/0/0

I push through one request, the values are now each: 44462314.0/44462314.0/44462314.0.

What exactly is 44462314.0? My guess is it is latency, but is it in ms? microseconds? Probably it is just me, but I have no idea. The actual request time from my client was ~353ms

Now each flush of metrics the value of min/max/mean remains the same, even though I am not putting any traffic through. So the mean is always 44462314.0, forever.

So the value is the min/mean/mean value for all the requests since the beginning of the process? That data is not really usable.

one-minute, five-minute issue

I found after putting through requests over time, the one-minute value never goes to 0. Even though there is no traffic going out for the past minute :S

If I wait sometime, and then push traffic through for a minute, the 1 minute value != the values of traffic I actually sent in that last minute.

What would be a solution

I think the metrics required would be:

  1. the number of requests
  2. the min/max/mean/average latency

Importantly they would be relative to that metrics period. So if metrics.interval is set to 60, the data would be the total request count for 60 seconds, etc..

For our purposes only, we just require metrics per service, 2 values: count/latency, we do not require a full suite of metrics per URL/host/fabioserver combination as that is just too much (and we push metrics to cloudwatch which makes it more expensive).

ie. service-xyz.count and service-xyz.latency_min/max/avg/mean

Or something along those lines...

drawks pushed a commit to drawks/fabio that referenced this issue Aug 3, 2017
* adds new metric sink `raw_statsd` uses same config as `statsd`
* new sink sends singl raw events directly to statsd server without any
  preaggreation or rollup.
* TODO - use more featureful/performant statsd client implementation such as
  https://github.com/alexcesaro/statsd or https://github.com/quipo/statsd
  which implement features such as optional buffering.
@drawks
Copy link
Author

drawks commented Aug 3, 2017

I've issued a quick PR which demonstrates how a simple raw statsd sink can be implemented which doesn't do any internal aggregation and fits more directly into the intended typical statsd instrumentation model. This PR probably isn't appropriate to merge as written, but builds and works as intended in some limited local testing.

@magiconair
Copy link
Contributor

Sorry for the delay but I'm on vacation with the family.

The key point here is that you want to send an event to statsd when it happens and want to do the aggregation yourself. So far I've relied on go-metrics to do The Right Thing (™) but I recall some discussions around integrating the circonus support that the histograms in go-metrics aren't that useful. Mostly because they're a port of the codahale model from Java. I didn't dig deeper there. Maybe @maier can help here.

@drawks thx for the PR. I'll try to look at it soon. Kids waking up now :)

magiconair added a commit that referenced this issue Aug 6, 2017
This patch adds a statds reporter which does not aggregate values.

Fixes #327
magiconair added a commit that referenced this issue Aug 6, 2017
This patch adds a statds reporter which does not aggregate values.

Fixes #327
magiconair added a commit that referenced this issue Aug 6, 2017
This patch adds a statds reporter which does not aggregate values.

Fixes #327
@magiconair
Copy link
Contributor

@drawks I think your PR would be good enough for an experimental inclusion. Don't you need to add newlines between multiple metrics?

The downside of metrics that are not pre-aggregated is that the API will no longer provide these values. If you are using this provider you probably don't care about that.

However, integrating one of the libraries you suggested was simple enough that I've tried that in #329. To avoid the alloc per metrics I need to refactor the registry interface a bit but I'd appreciate if you could test the integration.

You should set:

./fabio -metrics.target statsd_raw -metrics.interval 100ms

I'll sleep a bit on the naming of statsd_raw vs raw_stats and the corresponding struct names. Also need to update the documentation.

@maier
Copy link
Contributor

maier commented Aug 6, 2017

Similar situation to the one we faced with the Circonus integration. The receiving system (Circonus, StatsD, etc.) need the individual samples for metrics not a single value from the registry.

@magiconair
Copy link
Contributor

I think a better solution is to remove the Counter and Timer structs from my metrics package and add Count and Time methods to the registry directly. Then every driver can choose whether or not to aggregate. That might also be a good path to supporting tags.

magiconair added a commit that referenced this issue Aug 9, 2017
This patch adds a statds reporter which does not aggregate values.

Fixes #327
magiconair added a commit that referenced this issue Aug 11, 2017
This patch adds a statds reporter which does not aggregate values.

Fixes #327
magiconair added a commit that referenced this issue Aug 11, 2017
This patch adds a statds reporter which does not aggregate values.

Fixes #327
@magiconair magiconair added this to the 1.6.0 milestone Oct 10, 2017
magiconair added a commit that referenced this issue Mar 24, 2018
This patch refactors the metrics support within fabio to
achieve several goals:

 * support labels for DogstatD, Prometheus and others
 * support raw events for statsd and others without aggregation
 * support multiple backends simultaneously to support migration
 * make integration of new metrics backends and/or libraries simple
 * being fully backwards compatible to not break existing setups

One of the main challenges is that fabio generates names for the metrics
targets through templates and that these templates are configurable. A
metrics backend which supports labels needs to generate a different name
for the same metric than one that does not support labels.

Combining this with the need to support multiple different metrics
backends at runtime means that none of the existing metrics libraries
like go-kit/metrics and hashicorp/go-metrics is a good fit. However,
they can be used as drivers which reduces the need for testing and makes
integration of new backends simpler since fabio does not need to rely on
a single metrics library.

This patch is work in progress with the new metrics code in the metrics4
package and the old metrics code in the 'metrics_old' package where it
is kept for reference until it has been migrated or removed. The basic
architecture should be good enough to add more providers and
functionality so that the community can help.

Right now there are two providers "flat" and "label" to demonstrate the
concepts. They provide counter and timers in statsd/dogstatd format and
can be configured with -metrics.target=flat[,label]. Other providers
should be added in the same way.

The metrics_old code should be converted to the new provider scheme.

The go-kit/metrics library currently supports most of the necessary
drivers and is the preferred way of integrating new drivers.

The list below is a probably incomplete list of things that need to be
done:

 * The configuration is not used except for metrics.target
 * The drivers from the metrics_old package need to be migrated
 * Additional drivers from go-kit need to be added
 * The old rcrowley/go-metrics code relies on 'registries' for
   namespacing. After the routing table has changed the registry needs
   to be pruned of services which no longer exist so that go-metrics
   stops reporting them.
 * The RX/TX counters in the TCP and TCP+SNI proxy are probably
   sub-optimal or broken.
 * Some counters may not have been initialized properly, e.g. WSConn
 * WSConn should probably be a gauge which also needs to be set to 0 on
   startup.
 * The approach of injecting a noop metrics provider if none has been
   set has already bitten me once since some metrics are reported while
   others are not. I need to think about this a bit more.

Fixes #126
Fixes #165
Fixes #211
Fixes #253
Fixes #326
Fixes #327
Fixes #371
Closes #329
Closes #331
Closes #334
Closes #335
@magiconair
Copy link
Contributor

Please see #211 (comment)

nathanejohnson pushed a commit that referenced this issue Oct 23, 2020
This patch refactors the metrics support within fabio to
achieve several goals:

 * support labels for DogstatD, Prometheus and others
 * support raw events for statsd and others without aggregation
 * support multiple backends simultaneously to support migration
 * make integration of new metrics backends and/or libraries simple
 * being fully backwards compatible to not break existing setups

One of the main challenges is that fabio generates names for the metrics
targets through templates and that these templates are configurable. A
metrics backend which supports labels needs to generate a different name
for the same metric than one that does not support labels.

Combining this with the need to support multiple different metrics
backends at runtime means that none of the existing metrics libraries
like go-kit/metrics and hashicorp/go-metrics is a good fit. However,
they can be used as drivers which reduces the need for testing and makes
integration of new backends simpler since fabio does not need to rely on
a single metrics library.

This patch is work in progress with the new metrics code in the metrics4
package and the old metrics code in the 'metrics_old' package where it
is kept for reference until it has been migrated or removed. The basic
architecture should be good enough to add more providers and
functionality so that the community can help.

Right now there are two providers "flat" and "label" to demonstrate the
concepts. They provide counter and timers in statsd/dogstatd format and
can be configured with -metrics.target=flat[,label]. Other providers
should be added in the same way.

The metrics_old code should be converted to the new provider scheme.

The go-kit/metrics library currently supports most of the necessary
drivers and is the preferred way of integrating new drivers.

The list below is a probably incomplete list of things that need to be
done:

 * The configuration is not used except for metrics.target
 * The drivers from the metrics_old package need to be migrated
 * Additional drivers from go-kit need to be added
 * The old rcrowley/go-metrics code relies on 'registries' for
   namespacing. After the routing table has changed the registry needs
   to be pruned of services which no longer exist so that go-metrics
   stops reporting them.
 * The RX/TX counters in the TCP and TCP+SNI proxy are probably
   sub-optimal or broken.
 * Some counters may not have been initialized properly, e.g. WSConn
 * WSConn should probably be a gauge which also needs to be set to 0 on
   startup.
 * The approach of injecting a noop metrics provider if none has been
   set has already bitten me once since some metrics are reported while
   others are not. I need to think about this a bit more.

Fixes #126
Fixes #165
Fixes #211
Fixes #253
Fixes #326
Fixes #327
Fixes #371
Closes #329
Closes #331
Closes #334
Closes #335

Remove 'Register' methods and create objects directly.

Add Gauge and use it for websocket connections

Vendoring in github.com/alexcesaro/statsd

First stab at integrating raw statsd from #335

Rebase off master 10/23/2020 - NJ
nathanejohnson pushed a commit that referenced this issue Oct 23, 2020
This patch refactors the metrics support within fabio to
achieve several goals:

 * support labels for DogstatD, Prometheus and others
 * support raw events for statsd and others without aggregation
 * support multiple backends simultaneously to support migration
 * make integration of new metrics backends and/or libraries simple
 * being fully backwards compatible to not break existing setups

One of the main challenges is that fabio generates names for the metrics
targets through templates and that these templates are configurable. A
metrics backend which supports labels needs to generate a different name
for the same metric than one that does not support labels.

Combining this with the need to support multiple different metrics
backends at runtime means that none of the existing metrics libraries
like go-kit/metrics and hashicorp/go-metrics is a good fit. However,
they can be used as drivers which reduces the need for testing and makes
integration of new backends simpler since fabio does not need to rely on
a single metrics library.

This patch is work in progress with the new metrics code in the metrics4
package and the old metrics code in the 'metrics_old' package where it
is kept for reference until it has been migrated or removed. The basic
architecture should be good enough to add more providers and
functionality so that the community can help.

Right now there are two providers "flat" and "label" to demonstrate the
concepts. They provide counter and timers in statsd/dogstatd format and
can be configured with -metrics.target=flat[,label]. Other providers
should be added in the same way.

The metrics_old code should be converted to the new provider scheme.

The go-kit/metrics library currently supports most of the necessary
drivers and is the preferred way of integrating new drivers.

The list below is a probably incomplete list of things that need to be
done:

 * The configuration is not used except for metrics.target
 * The drivers from the metrics_old package need to be migrated
 * Additional drivers from go-kit need to be added
 * The old rcrowley/go-metrics code relies on 'registries' for
   namespacing. After the routing table has changed the registry needs
   to be pruned of services which no longer exist so that go-metrics
   stops reporting them.
 * The RX/TX counters in the TCP and TCP+SNI proxy are probably
   sub-optimal or broken.
 * Some counters may not have been initialized properly, e.g. WSConn
 * WSConn should probably be a gauge which also needs to be set to 0 on
   startup.
 * The approach of injecting a noop metrics provider if none has been
   set has already bitten me once since some metrics are reported while
   others are not. I need to think about this a bit more.

Fixes #126
Fixes #165
Fixes #211
Fixes #253
Fixes #326
Fixes #327
Fixes #371
Closes #329
Closes #331
Closes #334
Closes #335

Remove 'Register' methods and create objects directly.

Add Gauge and use it for websocket connections

Vendoring in github.com/alexcesaro/statsd

First stab at integrating raw statsd from #335

Rebase off master 10/23/2020 - NJ
nathanejohnson pushed a commit that referenced this issue Oct 23, 2020
This patch refactors the metrics support within fabio to
achieve several goals:

 * support labels for DogstatD, Prometheus and others
 * support raw events for statsd and others without aggregation
 * support multiple backends simultaneously to support migration
 * make integration of new metrics backends and/or libraries simple
 * being fully backwards compatible to not break existing setups

One of the main challenges is that fabio generates names for the metrics
targets through templates and that these templates are configurable. A
metrics backend which supports labels needs to generate a different name
for the same metric than one that does not support labels.

Combining this with the need to support multiple different metrics
backends at runtime means that none of the existing metrics libraries
like go-kit/metrics and hashicorp/go-metrics is a good fit. However,
they can be used as drivers which reduces the need for testing and makes
integration of new backends simpler since fabio does not need to rely on
a single metrics library.

This patch is work in progress with the new metrics code in the metrics4
package and the old metrics code in the 'metrics_old' package where it
is kept for reference until it has been migrated or removed. The basic
architecture should be good enough to add more providers and
functionality so that the community can help.

Right now there are two providers "flat" and "label" to demonstrate the
concepts. They provide counter and timers in statsd/dogstatd format and
can be configured with -metrics.target=flat[,label]. Other providers
should be added in the same way.

The metrics_old code should be converted to the new provider scheme.

The go-kit/metrics library currently supports most of the necessary
drivers and is the preferred way of integrating new drivers.

The list below is a probably incomplete list of things that need to be
done:

 * The configuration is not used except for metrics.target
 * The drivers from the metrics_old package need to be migrated
 * Additional drivers from go-kit need to be added
 * The old rcrowley/go-metrics code relies on 'registries' for
   namespacing. After the routing table has changed the registry needs
   to be pruned of services which no longer exist so that go-metrics
   stops reporting them.
 * The RX/TX counters in the TCP and TCP+SNI proxy are probably
   sub-optimal or broken.
 * Some counters may not have been initialized properly, e.g. WSConn
 * WSConn should probably be a gauge which also needs to be set to 0 on
   startup.
 * The approach of injecting a noop metrics provider if none has been
   set has already bitten me once since some metrics are reported while
   others are not. I need to think about this a bit more.

Fixes #126
Fixes #165
Fixes #211
Fixes #253
Fixes #326
Fixes #327
Fixes #371
Closes #329
Closes #331
Closes #334
Closes #335

Remove 'Register' methods and create objects directly.

Add Gauge and use it for websocket connections

Vendoring in github.com/alexcesaro/statsd

First stab at integrating raw statsd from #335

Rebase off master 10/23/2020 - NJ
nathanejohnson pushed a commit that referenced this issue Dec 1, 2020
This patch refactors the metrics support within fabio to
achieve several goals:

 * support labels for DogstatD, Prometheus and others
 * support raw events for statsd and others without aggregation
 * support multiple backends simultaneously to support migration
 * make integration of new metrics backends and/or libraries simple
 * being fully backwards compatible to not break existing setups

One of the main challenges is that fabio generates names for the metrics
targets through templates and that these templates are configurable. A
metrics backend which supports labels needs to generate a different name
for the same metric than one that does not support labels.

Combining this with the need to support multiple different metrics
backends at runtime means that none of the existing metrics libraries
like go-kit/metrics and hashicorp/go-metrics is a good fit. However,
they can be used as drivers which reduces the need for testing and makes
integration of new backends simpler since fabio does not need to rely on
a single metrics library.

This patch is work in progress with the new metrics code in the metrics4
package and the old metrics code in the 'metrics_old' package where it
is kept for reference until it has been migrated or removed. The basic
architecture should be good enough to add more providers and
functionality so that the community can help.

Right now there are two providers "flat" and "label" to demonstrate the
concepts. They provide counter and timers in statsd/dogstatd format and
can be configured with -metrics.target=flat[,label]. Other providers
should be added in the same way.

The metrics_old code should be converted to the new provider scheme.

The go-kit/metrics library currently supports most of the necessary
drivers and is the preferred way of integrating new drivers.

The list below is a probably incomplete list of things that need to be
done:

 * The configuration is not used except for metrics.target
 * The drivers from the metrics_old package need to be migrated
 * Additional drivers from go-kit need to be added
 * The old rcrowley/go-metrics code relies on 'registries' for
   namespacing. After the routing table has changed the registry needs
   to be pruned of services which no longer exist so that go-metrics
   stops reporting them.
 * The RX/TX counters in the TCP and TCP+SNI proxy are probably
   sub-optimal or broken.
 * Some counters may not have been initialized properly, e.g. WSConn
 * WSConn should probably be a gauge which also needs to be set to 0 on
   startup.
 * The approach of injecting a noop metrics provider if none has been
   set has already bitten me once since some metrics are reported while
   others are not. I need to think about this a bit more.

Fixes #126
Fixes #165
Fixes #211
Fixes #253
Fixes #326
Fixes #327
Fixes #371
Closes #329
Closes #331
Closes #334
Closes #335

Remove 'Register' methods and create objects directly.

Add Gauge and use it for websocket connections

Vendoring in github.com/alexcesaro/statsd

First stab at integrating raw statsd from #335

Rebase off master 10/23/2020 - NJ
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants