Alert for Clashing StatsD Metrics that Cause The StatsD Exporter to Crash #13265

Closed
2 tasks
omgitsbillryan opened this issue Sep 4, 2020 · 4 comments
Assignees
Labels
devops practice area categorization -- NOT a team assignment operations

Comments

@omgitsbillryan
Contributor

The Problem

It's possible for StatsD metrics in the vets-api codebase to "clash", such as when the label names on a metric are inconsistent. When this happens, the statsd-exporter process that runs on the vets-api instances crashes and restarts, which prevents new stats from being scraped by Prometheus.

In summary, we lose reporting of many of the vets-api metrics.
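To make the notion of a "clash" concrete, here is a minimal sketch that flags metric names emitted with more than one distinct set of label names. It is not part of vets-api or statsd-exporter; the `find_label_clashes` helper and the sample emissions are hypothetical and only illustrate the kind of inconsistency described above.

```python
from collections import defaultdict


def find_label_clashes(emissions):
    """Return {metric_name: [sorted label-name lists]} for every metric
    that was emitted with more than one distinct set of label names."""
    seen = defaultdict(set)
    for name, tags in emissions:
        # Only the label *names* matter for a clash, not their values.
        seen[name].add(frozenset(tags))
    return {
        name: sorted(sorted(label_set) for label_set in label_sets)
        for name, label_sets in seen.items()
        if len(label_sets) > 1
    }


# Hypothetical emissions: the second one drops the "version" label.
emissions = [
    ("api_auth_saml_request", {"type": "mhv", "version": "v0"}),
    ("api_auth_saml_request", {"type": "mhv"}),
    ("api_other_metric", {"status": "200"}),
]
clashes = find_label_clashes(emissions)
print(clashes)
```

A check along these lines could run in a test suite, so that an inconsistent label set is caught before it reaches the exporter.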

More Context

This occurred on Sept 3, 2020 - link

Metrics were introduced that clashed with existing metrics. This resulted in fatal error messages in the statsd-exporter log file:

time="2020-09-03T19:03:30Z" level=fatal msg="A change of configuration created inconsistent metrics for \"api_auth_saml_request\". You have to restart the statsd_exporter, and you should consider the effects on your monitoring setup. Error: a previously registered descriptor with the same fully-qualified name as Desc{fqName: \"api_auth_saml_request\", help: \"Metric autogenerated by statsd_exporter.\", constLabels: {type=\"mhv\",version=\"v0\"}, variableLabels: []} has different label names or a different help string" source="exporter.go:82" 

(Cloudwatch logs source)
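As a rough illustration of what matching on this log line could look like, the sketch below uses a regular expression to recognize the fatal message and pull out the metric name. This is a plain Python approximation, not CloudWatch filter syntax; the `FATAL_CLASH` pattern and the `clashing_metric` helper are assumptions made for the example, and the sample line is an abbreviated copy of the one above.

```python
import re

# Matches the fatal message from the log above and captures the metric name.
# The log escapes the inner quotes as \" so the pattern matches a literal
# backslash followed by a double quote.
FATAL_CLASH = re.compile(
    r'level=fatal\s+msg="A change of configuration created inconsistent '
    r'metrics for \\"(?P<metric>[^"\\]+)\\"'
)


def clashing_metric(line):
    """Return the clashing metric name if the line is the fatal error, else None."""
    match = FATAL_CLASH.search(line)
    return match.group("metric") if match else None


# Abbreviated copy of the log line quoted in this issue.
sample = (
    r'time="2020-09-03T19:03:30Z" level=fatal msg="A change of configuration '
    r'created inconsistent metrics for \"api_auth_saml_request\". You have to '
    r'restart the statsd_exporter" source="exporter.go:82"'
)
print(clashing_metric(sample))
```

An alert could then fire whenever this function returns a non-`None` value for a new log line.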

Work to be Done

  • Create a metric filter in Cloudwatch that looks for
  • Create an alert that notifies the ops/backend tools team when this occurs
@omgitsbillryan omgitsbillryan added operations devops practice area categorization -- NOT a team assignment labels Sep 4, 2020
@ericbuckley
Contributor

Another possibility worth researching: can we configure statsd-exporter not to restart when an invalid set of tags is received?

@ericbuckley
Contributor

> Another possibility worth researching: can we configure statsd-exporter not to restart when an invalid set of tags is received?

I believe we're still running version 0.3.0 of statsd_exporter. It's possible that if we upgrade to a more recent version (post 0.10.2), inconsistent labels won't cause the exporter to restart 🤷‍♂️

prometheus/statsd_exporter#194

@omgitsbillryan
Contributor Author

Good idea. Even in the newest version, it still seems like having inconsistent labels is a bad thing; it just won't cause a restart. Probably best to upgrade and then create a metric filter on the new error message.

@jhouse-solvd jhouse-solvd self-assigned this Sep 18, 2020
@omgitsbillryan
Contributor Author

Nearly a year has passed, and no progress has been made on such a metric filter. Because the new version of statsd-exporter no longer drops metrics with inconsistent labels, we really don't need to know about this anymore. I'm gonna close this. Anyone who disagrees, feel free to re-open.
