Reject invalid metrics/mappings early instead of breaking /metrics #186

Closed
amithad opened this issue Feb 28, 2019 · 15 comments

Comments

amithad commented Feb 28, 2019

When a user submits an invalid metric (such as mapping a counter and a gauge to the same metric name), we do not log anything, but from that point on (until the exporter is restarted) /metrics returns a 500. We ought not to accept the second (now invalid) sample at all, and log a descriptive error message instead. This way users do not lose visibility into their other metrics.
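
A rough Go sketch of the kind of early check this implies; the type and helper names here are hypothetical, not the exporter's actual internals:

package main

import (
	"fmt"
	"log"
)

// metricType is a minimal stand-in for the exporter's notion of a metric type.
type metricType string

// seenTypes remembers which type each metric name was first used with.
var seenTypes = map[string]metricType{}

// acceptSample rejects a sample whose type conflicts with an earlier sample of
// the same name, so only a log line is produced and /metrics keeps working.
func acceptSample(name string, t metricType) error {
	if prev, ok := seenTypes[name]; ok && prev != t {
		return fmt.Errorf("dropping sample for %q: already used as %s, now sent as %s", name, prev, t)
	}
	seenTypes[name] = t
	return nil
}

func main() {
	if err := acceptSample("MYAPP_Integrator_adapter_status", "counter"); err != nil {
		log.Println(err)
	}
	// The second sample maps the same name to a gauge: reject it and log,
	// instead of registering it and breaking the scrape endpoint.
	if err := acceptSample("MYAPP_Integrator_adapter_status", "gauge"); err != nil {
		log.Println(err)
	}
}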

Original issue:

The statsd exporter returns an internal error (shown in Prometheus) when malformed statsd lines are sent. However, it does not log any error either.

Steps to recreate:

  1. Use the statsd package 'hot-shots' (nodejs)
  2. Send an 'increment' without specifying the increment number.

import {StatsD} from 'hot-shots';

let client = new StatsD(this.initConfig);
client.increment('adapter.' + adapter + '.errors'); // note the buggy invocation
On the Prometheus targets page, the exporter goes into the down state. No logs are shown.

@amithad amithad changed the title Malformed statsd lines results an internal error Malformed statsd lines result an internal error Feb 28, 2019
@matthiasr matthiasr added the bug label Feb 28, 2019
@matthiasr (Contributor)

Is there anything logged with --log.level=debug?
Is this a regression in the latest release, or did it happen before?
What is the line that hot-shots sends to the exporter?

@matthiasr (Contributor)

Thank you for reporting! This is definitely an issue, we should reject this line immediately and not poison the Prometheus client.

amithad commented Feb 28, 2019

I tried recreating it with a single stat line, and it didn't happen. However, with all stats being published together, it fails randomly. Bottom line, the stats caused the crash. (Isolating the offending line from the tcpdump was difficult.)

@matthiasr (Contributor)

What is the error message returned with the 500?

amithad commented Feb 28, 2019

Error message: server returned HTTP status 500 Internal Server Error

amithad commented Feb 28, 2019

However, the statsd-exporter Docker container seems to keep running without exiting.

amithad commented Feb 28, 2019

@matthiasr I shall clone the repo, run the exporter in debug mode, and collect logs for you if you need further information about the bug. Let me know.

matthiasr commented Feb 28, 2019 via email

amithad commented Mar 1, 2019

Here's the exact scenario:

I publish two stats from my Node.js application using hot-shots.
1. Stat 1 - increment
2. Stat 2 - gauge

The tcpdump of the stat lines sent to statsd-exporter is as follows:

MYAPP.Integrator.HEADLESS.adapter.Adapter.errors:1|c

14:57:20.496690 IP localhost.38727 > localhost.9125: UDP, length 55
MYAPP.Integrator.HEADLESS.adapter.Adapter.health:69.61|g

Log extract of statsd-exporter:

DEBU[0029] A change of configuration created inconsistent metrics for "MYAPP_Integrator_adapter_status". You have to restart the statsd_exporter, and you should consider the effects on your monitoring setup. Error: duplicate metrics collector registration attempted  source="exporter.go:385"

Mapping entry for the above two stats:

- match: MYAPP.Integrator.*.adapter.*.*
  name: "MYAPP_Integrator_adapter_status"
  labels:
    integrator: "$1"
    adapter: "$2"
    measure: "$3"

The issue occurs when two types of stats, a gauge and an increment, are both mapped to one metric name.
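
For illustration, the conflict behind that debug line can be reproduced directly against the Go client library the exporter registers metrics with. This is a minimal sketch, not exporter code, and the exact error text may differ between client_golang versions:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	// Both collectors describe the same fully-qualified metric name.
	counter := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "MYAPP_Integrator_adapter_status",
		Help: "adapter status",
	})
	gauge := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "MYAPP_Integrator_adapter_status",
		Help: "adapter status",
	})

	fmt.Println(reg.Register(counter)) // first registration succeeds (<nil>)
	fmt.Println(reg.Register(gauge))   // second registration fails with a duplicate-registration error
}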

@matthiasr (Contributor)

That's not valid in the Prometheus metrics model – for a given metric name (MYAPP_Integrator_adapter_status), the metric can only be either a gauge or a counter. Additionally, you should only use labels when you can sensibly aggregate across the label values. Your measure label violates that – in Prometheus, "errors" and "health" must be different metrics (with different names).

I would recommend a mapping like

- match: MYAPP.Integrator.*.adapter.*.errors
  name: "MYAPP_Integrator_adapter_errors_total"
  labels:
    integrator: "$1"
    adapter: "$2"
- match: MYAPP.Integrator.*.adapter.*.health
  name: "MYAPP_Integrator_adapter_health"
  labels:
    integrator: "$1"
    adapter: "$2"

It is unfortunate that we only detect this at scrape time. I am not sure how easy it would be to detect this early though. In any case, we will not be able to support this mapping for these inputs.

@matthiasr (Contributor)

I'm going to rename the issue, and add a high-level description at the top, in case anyone wants to pick this up.

@matthiasr matthiasr changed the title Malformed statsd lines result an internal error Reject invalid metrics/mappings early instead of breaking /metrics Mar 1, 2019
@matthiasr matthiasr added enhancement and removed bug labels Mar 1, 2019
amithad commented Mar 1, 2019

Thanks. And yes, it's clearly a violation; I didn't realize until I debugged the code. (What I have done is not a sensible aggregation. I agree with your point.) Also, thanks for the suggestion.

@claytono (Contributor)

We've run into this issue also. In our case it's caused by the metrics generated by the ruby-kafka library:

https://github.com/zendesk/ruby-kafka/blob/02f7e2816e1130c5202764c275e36837f57ca4af/lib/kafka/datadog.rb#L286-L290

The repro case I've come up with is:

$ echo -e "stat_count:1|c\nstat:2|ms\n" |nc -w0 -u  localhost 9125
$ curl http://localhost:9102/metrics
An error has occurred while serving metrics:

collected histogram or summary named "stat" collides with previously collected metric named "stat_count"

The prom client library already checks to see if there is metric name overlap when the metric is registered:

https://github.com/prometheus/client_golang/blob/fa4aa9000d2863904891d193dea354d23f3d712a/prometheus/registry.go#L293-L295

But when generating the output for the metrics endpoint, it calls the checkSuffixCollisions function, which also checks for the *_count, *_sum, and *_bucket cases. I think the fix for this probably should be to update the Register function to do the same checks as checkSuffixCollisions, ideally with the same code. Unfortunately, the type signature of checkSuffixCollisions appears to make it hard to call from Register.

We're deploying statsd_exporter to a large environment, which includes a variety of mixed-tenancy Kubernetes clusters. I plan to fix this specific scenario with the Kafka metrics with a mapping in the short term, but I think we will need a more comprehensive fix in the long term. I suspect that this is going to come up often enough that having statsd_exporter go offline until it is restarted will be a significant operational burden.

We would be glad to do the work, but some direction on the preferred way to fix this would be appreciated. An easy fix would be to just duplicate the logic in checkSuffixCollisions in the Register function, but I admit that seems a bit ugly.
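
For anyone picking this up, here is a rough standalone sketch of the suffix check that such a Register-time guard would need. The function and the registered-names map are hypothetical stand-ins, not client_golang's internal API:

package main

import "fmt"

// suffixes are the implicit series that a histogram or summary named "x"
// exposes alongside "x" itself.
var suffixes = []string{"_count", "_sum", "_bucket"}

// checkNewName reports a collision between a metric about to be registered
// and the names already registered, taking the implicit suffixes of
// histograms and summaries into account.
func checkNewName(registered map[string]string, newName, newKind string) error {
	// A plain metric must not shadow the suffixed series of an existing
	// histogram or summary (e.g. "stat_count" after summary "stat").
	for existing, kind := range registered {
		if kind != "histogram" && kind != "summary" {
			continue
		}
		for _, s := range suffixes {
			if newName == existing+s {
				return fmt.Errorf("metric %q collides with %s %q", newName, kind, existing)
			}
		}
	}
	// A new histogram or summary must not produce suffixed series that
	// collide with an existing plain metric (e.g. summary "stat" after
	// counter "stat_count").
	if newKind == "histogram" || newKind == "summary" {
		for _, s := range suffixes {
			if _, ok := registered[newName+s]; ok {
				return fmt.Errorf("%s %q collides with existing metric %q", newKind, newName, newName+s)
			}
		}
	}
	return nil
}

func main() {
	registered := map[string]string{"stat_count": "counter"}
	// Mirrors the repro above: registering summary "stat" after counter "stat_count".
	fmt.Println(checkNewName(registered, "stat", "summary"))
}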

claytono commented May 2, 2019

I recently tried deploying binaries built from the master branch and ran into a new issue. If a metric has been emitted with conflicting types (counter vs. histogram), it's now a runtime error, whereas it used to be detected at collection time with only a debug message emitted. I'm guessing this might be related to the switch to unchecked collectors (#194).

@claytono (Contributor)

@matthiasr I think this is safe to close out at this point. We haven't seen this since 0.10.x was released.
