Reject invalid metrics/mappings early instead of breaking /metrics #186

Closed
amithad opened this issue Feb 28, 2019 · 15 comments

Comments

amithad commented Feb 28, 2019

When a user submits an invalid metric (such as mapping a counter and a gauge to the same metric name), we do not log anything, but from that point on (until the exporter is restarted) /metrics returns a 500. We ought not to accept the second (now invalid) sample at all, and log a descriptive error message instead. This way users do not lose visibility into their other metrics.
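
A rough Go sketch of the kind of early check this implies; the type and helper names here are hypothetical, not the exporter's actual internals:

package main

import (
	"fmt"
	"log"
)

// metricType is a minimal stand-in for the exporter's notion of a metric type.
type metricType string

// seenTypes remembers which type each metric name was first used with.
var seenTypes = map[string]metricType{}

// acceptSample rejects a sample whose type conflicts with an earlier sample of
// the same name, so only a log line is produced and /metrics keeps working.
func acceptSample(name string, t metricType) error {
	if prev, ok := seenTypes[name]; ok && prev != t {
		return fmt.Errorf("dropping sample for %q: already used as %s, now sent as %s", name, prev, t)
	}
	seenTypes[name] = t
	return nil
}

func main() {
	if err := acceptSample("MYAPP_Integrator_adapter_status", "counter"); err != nil {
		log.Println(err)
	}
	// The second sample maps the same name to a gauge: reject it and log,
	// instead of registering it and breaking the scrape endpoint.
	if err := acceptSample("MYAPP_Integrator_adapter_status", "gauge"); err != nil {
		log.Println(err)
	}
}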

Original issue:

The statsd exporter returns an internal error (shown in Prometheus) when malformed statsd lines are sent. However, it does not log any error either.

Steps to recreate:

  1. Use the statsd package 'hot-shots' (nodejs)
  2. Send an 'increment' without specifying the increment number.

import {StatsD} from 'hot-shots';

let client = new StatsD(this.initConfig);
client.increment('adapter.' + adapter + '.errors'); // note the buggy invocation
On the Prometheus targets page, the exporter goes into the down state. No logs are shown.

@amithad amithad changed the title Malformed statsd lines results an internal error Malformed statsd lines result an internal error Feb 28, 2019
@matthiasr matthiasr added the bug label Feb 28, 2019
@matthiasr (Contributor)

Is there anything logged with --log.level=debug?
Is this a regression in the latest release, or did it happen before?
What is the line that hot-shots sends to the exporter?

@matthiasr (Contributor)

Thank you for reporting! This is definitely an issue, we should reject this line immediately and not poison the Prometheus client.

amithad commented Feb 28, 2019

I tried recreating it with a single stat line, and it didn't happen. However, with all stats being published together, it fails randomly. Bottom line, the stats caused the crash. (Isolating the offending line from the tcpdump was difficult.)

@matthiasr (Contributor)

What is the error message returned with the 500?

amithad commented Feb 28, 2019

Error message: server returned HTTP status 500 Internal Server Error

amithad commented Feb 28, 2019

However, the statsd-exporter Docker container seems to keep running without exiting.

amithad commented Feb 28, 2019

@matthiasr I shall clone the repo, run the exporter in debug mode, and collect logs for you if you need further information about the bug. Let me know.

matthiasr commented Feb 28, 2019 via email

amithad commented Mar 1, 2019

Here's the exact scenario:

I publish two stats from my Node.js application using hot-shots.
1. Stat 1 - increment
2. Stat 2 - gauge

The tcpdump of the stat lines sent to statsd-exporter is as follows:

MYAPP.Integrator.HEADLESS.adapter.Adapter.errors:1|c

14:57:20.496690 IP localhost.38727 > localhost.9125: UDP, length 55
MYAPP.Integrator.HEADLESS.adapter.Adapter.health:69.61|g

Log extract of statsd-exporter:

DEBU[0029] A change of configuration created inconsistent metrics for "MYAPP_Integrator_adapter_status". You have to restart the statsd_exporter, and you should consider the effects on your monitoring setup. Error: duplicate metrics collector registration attempted  source="exporter.go:385"

Mapping entry for the above two stats:

- match: MYAPP.Integrator.*.adapter.*.*
  name: "MYAPP_Integrator_adapter_status"
  labels:
    integrator: "$1"
    adapter: "$2"
    measure: "$3"

The issue occurs when two types of stats, a gauge and an increment, are both mapped to one metric name.
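
For illustration, the conflict behind that debug line can be reproduced directly against the Go client library the exporter registers metrics with. This is a minimal sketch, not exporter code, and the exact error text may differ between client_golang versions:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	// Both collectors describe the same fully-qualified metric name.
	counter := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "MYAPP_Integrator_adapter_status",
		Help: "adapter status",
	})
	gauge := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "MYAPP_Integrator_adapter_status",
		Help: "adapter status",
	})

	fmt.Println(reg.Register(counter)) // first registration succeeds (<nil>)
	fmt.Println(reg.Register(gauge))   // second registration fails with a duplicate-registration error
}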

@matthiasr (Contributor)

That's not valid in the Prometheus metrics model – for a given metric name (MYAPP_Integrator_adapter_status), the metric can only be either a gauge or a counter. Additionally, you should only use labels when you can sensibly aggregate across the label values. Your measure label violates that – in Prometheus, "errors" and "health" must be different metrics (with different names).

I would recommend a mapping like

- match: MYAPP.Integrator.*.adapter.*.errors
  name: "MYAPP_Integrator_adapter_errors_total"
  labels:
    integrator: "$1"
    adapter: "$2"
- match: MYAPP.Integrator.*.adapter.*.health
  name: "MYAPP_Integrator_adapter_health"
  labels:
    integrator: "$1"
    adapter: "$2"

It is unfortunate that we only detect this at scrape time. I am not sure how easy it would be to detect this early though. In any case, we will not be able to support this mapping for these inputs.

@matthiasr (Contributor)

I'm going to rename the issue, and add a high-level description at the top, in case anyone wants to pick this up.

@matthiasr matthiasr changed the title Malformed statsd lines result an internal error Reject invalid metrics/mappings early instead of breaking /metrics Mar 1, 2019
@matthiasr matthiasr added enhancement and removed bug labels Mar 1, 2019
amithad commented Mar 1, 2019

Thanks. And yes, it's clearly a violation; I didn't realize until I debugged the code. (What I have done is not a sensible aggregation. I agree with your point.) Also, thanks for the suggestion.

@claytono (Contributor)

We've run into this issue also. In our case it's caused by the metrics generated by the ruby-kafka library:

https://github.com/zendesk/ruby-kafka/blob/02f7e2816e1130c5202764c275e36837f57ca4af/lib/kafka/datadog.rb#L286-L290

The repro case I've come up with is:

$ echo -e "stat_count:1|c\nstat:2|ms\n" |nc -w0 -u  localhost 9125
$ curl http://localhost:9102/metrics
An error has occurred while serving metrics:

collected histogram or summary named "stat" collides with previously collected metric named "stat_count"

The prom client library already checks to see if there is metric name overlap when the metric is registered:

https://github.com/prometheus/client_golang/blob/fa4aa9000d2863904891d193dea354d23f3d712a/prometheus/registry.go#L293-L295

But when generating the output for the metrics endpoint, it calls the checkSuffixCollisions function, which also checks for the *_count, *_sum, and *_bucket cases. I think the fix for this probably should be to update the Register function to do the same checks as checkSuffixCollisions, ideally with the same code. Unfortunately, the type signature of checkSuffixCollisions appears to make it hard to call from Register.

We're deploying statsd_exporter to a large environment, which includes a variety of mixed-tenancy Kubernetes clusters. I plan to fix this specific scenario with the Kafka metrics with a mapping in the short term, but I think we will need a more comprehensive fix in the long term. I suspect that this is going to come up often enough that having statsd_exporter go offline until it is restarted will be a significant operational burden.

We would be glad to do the work, but some direction on the preferred way to fix this would be appreciated. An easy fix would be to just duplicate the logic in checkSuffixCollisions in the Register function, but I admit that seems a bit ugly.
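
For anyone picking this up, here is a rough standalone sketch of the suffix check that such a Register-time guard would need. The function and the registered-names map are hypothetical stand-ins, not client_golang's internal API:

package main

import "fmt"

// suffixes are the implicit series that a histogram or summary named "x"
// exposes alongside "x" itself.
var suffixes = []string{"_count", "_sum", "_bucket"}

// checkNewName reports a collision between a metric about to be registered
// and the names already registered, taking the implicit suffixes of
// histograms and summaries into account.
func checkNewName(registered map[string]string, newName, newKind string) error {
	// A plain metric must not shadow the suffixed series of an existing
	// histogram or summary (e.g. "stat_count" after summary "stat").
	for existing, kind := range registered {
		if kind != "histogram" && kind != "summary" {
			continue
		}
		for _, s := range suffixes {
			if newName == existing+s {
				return fmt.Errorf("metric %q collides with %s %q", newName, kind, existing)
			}
		}
	}
	// A new histogram or summary must not produce suffixed series that
	// collide with an existing plain metric (e.g. summary "stat" after
	// counter "stat_count").
	if newKind == "histogram" || newKind == "summary" {
		for _, s := range suffixes {
			if _, ok := registered[newName+s]; ok {
				return fmt.Errorf("%s %q collides with existing metric %q", newKind, newName, newName+s)
			}
		}
	}
	return nil
}

func main() {
	registered := map[string]string{"stat_count": "counter"}
	// Mirrors the repro above: registering summary "stat" after counter "stat_count".
	fmt.Println(checkNewName(registered, "stat", "summary"))
}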

claytono commented May 2, 2019

I recently tried deploying binaries built from the master branch and ran into a new issue. If a metric has been emitted with conflicting types (counter vs. histogram), it's now a runtime error, whereas it used to be detected at collection time with only a debug message emitted. I'm guessing this might be related to the switch to unchecked collectors (#194).

@claytono (Contributor)

@matthiasr I think this is safe to close out at this point. We haven't seen this since 0.10.x was released.
