Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix metrics registering #1258

Merged
merged 1 commit into from
Mar 11, 2017
Merged

Conversation

matevzmihalic
Copy link
Contributor

Functions prometheus.NewCounterFrom and prometheus.NewHistogramFrom are using stdprometheus.MustRegister which will panic if you try to register the same metric multiple times.

Because of this you would get Error in Go routine: duplicate metrics collector registration attempted when traefik tries to update configuration and new services will not get added.

With this fix traefik will use already registered metrics if that happens.

@emilevauge
Copy link
Member

@matevzmihalic I think it would be better to rebase this PR on branch v1.2 in order to get this in the next release :)

@matevzmihalic matevzmihalic changed the base branch from master to v1.2 March 8, 2017 17:16
@matevzmihalic
Copy link
Contributor Author

Rebased to v1.2

Copy link
Contributor

@timoreimann timoreimann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! I also ran into this problem but couldn't identify the cause immediately.

Could we maybe add some testing for the new functionality?

@matevzmihalic
Copy link
Contributor Author

Added tests

Copy link
Member

@emilevauge emilevauge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job @matevzmihalic, thanks a lot :)
LGTM!

n.ServeHTTP(recorder, req2)

body = recorder.Body.String()
re := regexp.MustCompile(`(?m)traefik_requests_total.*?(\d)$`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test seems to be testing the implementation of the metrics handler. While I'm all in favor of doing that, I believe it should go into (the yet missing) metrics_test.go.

More importantly though, I'd probably test it differently by means of comparing the counter label name and value of the underlying LabelPair. Doing a regular expression to extract the number of total requests is somewhat brittle since we rely on an implementation detail of Prometheus that could change in a future version.

My proposal would be to remove this part of the test and leave it for another PR (which I'd be happy to do as well). I think the next test paragraph covers what we want to achieve with the PR at hand.

}

metrics, _ := prometheus.DefaultGatherer.Gather()
if metrics[len(metrics)-1].Metric[0].GetCounter().GetValue() != 3 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I don't understand: Why do we expect a total request count of 2 in line 61 and 3 here? Can you explain? I suppose I'm missing something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First we make 2 requests, then we make another request (line 57) which outputs metrics and after that the counter gets updated to 3.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the request happens on line 57, I'd have expected the check on line 61 to also compare against 3; however, it expects 2.

What am I missing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the time when request 3 happens the counter still has value 2 and this value is outputted to body (which is used in comparison on line 61). Only after that the counter increases. (But this doesn't really matter because I will remove the regex check.)

metrics, _ := prometheus.DefaultGatherer.Gather()
if metrics[len(metrics)-1].Metric[0].GetCounter().GetValue() != 3 {
t.Error("gathered metrics does not contain correct value for total requests")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we extend the test to also cover the Histogram tracking the request duration? I think we can use the Histogram SampleCount to assert on.

t.Error("body does not contain correct value for total requests")
}

metrics, _ := prometheus.DefaultGatherer.Gather()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check the error and panic if it's non-nil? It's basically an invariant we want to ensure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename metrics to metricsFamily? This will make understanding the next line easier I think.

@matevzmihalic
Copy link
Contributor Author

I've added labels tests and histogram test.


metricsFamily, err := prometheus.DefaultGatherer.Gather()
if err != nil {
t.Error(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change to

t.Fatalf("could not gather metrics family: %s", err)

Fatal because we shouldn't try to continue the test if we cannot even gather the metrics family.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum, I really do not agree on this. Why would Traefik stop because of this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear: this isn't about Traefik stopping; it's about the test bailing out because the remainder of it relies upon the metric family being available.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops... Again, read too quickly sorry... I didn't see it was in a test...

}

if reqsMetricFamily == nil {
t.Errorf("gathered metrics doesn not contain request total entry '%s'", reqsName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: doesn -> doesn't.

}

if latencyMetricFamily == nil {
t.Errorf("gathered metrics doesn not contain request duration entry '%s'", latencyName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repeated typo.

}

if reqsMetricFamily == nil {
t.Errorf("gathered metrics does not contain request total entry '%s'", reqsName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar: do not

}

if latencyMetricFamily == nil {
t.Errorf("gathered metrics does not contain request duration entry '%s'", latencyName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar: do not

}

if reqsMetricFamily.Metric[0].Counter.GetValue() != 3 {
t.Error("gathered metrics does not contain correct value for total requests")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do t.Errorf here and include the actual counter value.

}

if latencyMetricFamily.Metric[0].Histogram.GetSampleCount() != 3 {
t.Error("gathered metrics does not contain correct sample count for request duration")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do t.Errorf here and include the actual sample count value.

labels map[string]string
}{
{
reqsName,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add the struct field names on each line, i.e., name: reqsName here? This improves readability and pleases some linters.

}
}

if reqsMetricFamily == nil {
Copy link
Contributor

@timoreimann timoreimann Mar 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test in its current implementation has two shortcomings:

  • If reqsMetricFamily here (or latencyMetricFamily later) is nil, we still continue with the test execution. At best, this leads to follow-up test errors that we should not have to care about but will aggravate interpreting the test result. At worst, we run into nil pointer dereferences in lines 106, 117, and/or 121.
  • We cannot easily ignore/shortcut an individual nil case given the chosen test design.

There are a few options to fix this. Here's the one I prefer:

  • Rename the expectedLabelsForMetricFamily struct into tests and have it include a custom assert field of type func(*dto.MetricFamily). This function is supposed to implement the check in line 117 for the reqsMetricFamily case and the one in line 121 for latencyMetricFamily. (The idea is somewhat similar to the parserFunc in this test.)
  • Move the lookup of the metric families by metric name into the test loop (i.e., between line 105 and 106). This way, we can t.Errorf and continue the loop if we cannot find a metric family. Also, we can move the metricFamily variable from the test struct to the stack inside the loop.
  • Execute assert in the loop, passing it the metric family we just looked up.

I'd also suggest to move the lookup functionality into a small helper function, i.e., findMetricFamily(name string, families []*dto.MetricFamily) *dto.MetricFamily.

"service": "test",
},
},
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right after the definition of the test cases, I think we should do a check between the number of metric families we found and the number of test cases we have. This way, we can catch the situation where have added a new metric without updating the test.

@matevzmihalic
Copy link
Contributor Author

Updated tests with your suggestions.

Copy link
Contributor

@timoreimann timoreimann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superb job, @matevzmihalic -- thanks for bearing with me.

LGTM! 🎉

@timoreimann
Copy link
Contributor

Syncing with master via Github UI and merging afterwards if CI goes green.

@timoreimann
Copy link
Contributor

Rebased like a good Traefik citizen (and squashed those fixup commits). :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants