Handle DELTA metrics better #116
Comments
That sounds like a reasonable idea. We would need to generate a cache pool for deltas. We would also need to have some way to figure out if a metric stops so we can remove items from the cache. It's tricky to do, since I don't think Stackdriver has any concept of staleness.
We could reuse the expiring registry from the statsd_exporter? If we get a delta, pull a gauge from the registry and add the value to it. I'm not familiar enough with the current collector operation to know how to integrate the two, though.
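For illustration, here's a minimal sketch of that expiring-registry idea using a plain map and a TTL sweep; the names are made up, not the statsd_exporter's actual registry types:

```go
package deltacache

import (
	"sync"
	"time"
)

// entry holds an accumulated delta value and when it was last updated.
type entry struct {
	value    float64
	lastSeen time.Time
}

// Cache accumulates incoming DELTA samples per series key and expires idle
// series. Illustrative only; this is not the statsd_exporter registry API.
type Cache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]*entry
}

func New(ttl time.Duration) *Cache {
	return &Cache{ttl: ttl, entries: map[string]*entry{}}
}

// Add folds a new delta sample into the cached value for the series key.
func (c *Cache) Add(key string, delta float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok {
		e = &entry{}
		c.entries[key] = e
	}
	e.value += delta
	e.lastSeen = time.Now()
}

// Sweep drops series that have not received a delta within the TTL,
// approximating the "staleness" handling discussed above.
func (c *Cache) Sweep() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k, e := range c.entries {
		if time.Since(e.lastSeen) > c.ttl {
			delete(c.entries, k)
		}
	}
}
```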
After thinking about this some more, this won't be that straightforward. Counters represented by deltas are the easy case, since we don't really care about the absolute value of the metric; but for gauges represented by deltas, the accumulated value would be meaningless without some way of reconciling it on the exporter end with the true value in Stackdriver. I don't know how many of those cases there are (metrics from Stackdriver that behave as gauges but are represented by deltas), but it's a case where it wouldn't make sense to return data... Is there a Stackdriver API method to get the current value behind a delta? It might be worth the extra API call to get accurate data.
What's the latest thinking on this? Does it seem feasible? I'm happy to help with an implementation if so. It seems to me this is quite important: you can't calculate rate() of GAUGE metrics in Prometheus properly, which nullifies a lot of the usefulness of importing many Stackdriver DELTA metrics (for which rate() works fine on the Stackdriver side).
I think it should just be switched back to a counter in Prometheus, and stackdriver-exporter needs to accumulate counts in memory to produce monotonically increasing Prometheus counters. The same issue applies to histograms.
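As a sketch of that accumulation approach (the metric name, label, and onDeltaPoint hook are placeholders, not the exporter's real plumbing), each fetched DELTA point would simply be Add-ed to a client_golang counter:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Each DELTA point fetched from Stackdriver is folded into a Prometheus
// counter via Add, so the exported series is monotonically increasing.
var requestCount = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "stackdriver_example_request_count_total",
		Help: "DELTA samples accumulated into a monotonic counter.",
	},
	[]string{"resource"},
)

func main() {
	prometheus.MustRegister(requestCount)

	// In a real collector this would run once per fetched DELTA point.
	onDeltaPoint := func(resource string, delta float64) {
		requestCount.WithLabelValues(resource).Add(delta)
	}
	onDeltaPoint("gateway-1", 3) // e.g. three requests since the previous sample

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9255", nil))
}
```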
We're encountering a similar issue here with one of GCP Composer's metrics. The datapoint in Cloud Monitoring is 1; however, when exported to Prometheus, it was showing as 4.
@nilebox That seems like a reasonable option to me. One thing I've been looking at is how similar things happen in the fastly-exporter. It receives a series of deltas and adds them directly as "normal" metrics in a registry. The downside to this is that I'm seeing jitter in the time between the returns of the delta values, which leads to some deltas arriving a fraction of a second too late for the scrape. It's probably better to accumulate in a custom structure that includes the timestamp of the "settled time" of that delta value, and expose it at scrape time. This way any delays in API calls don't leak into Prometheus.
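Something like the following, as a rough sketch assuming client_golang const metrics (the type and field names are invented for illustration): deltas are folded into a running total along with the settled time, and only emitted when Prometheus scrapes.

```go
package collector

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// settledCounter accumulates DELTA values together with the end time of the
// last interval folded in, and only emits them at scrape time, so jitter in
// the Stackdriver API calls does not leak into the exposed series.
type settledCounter struct {
	mu      sync.Mutex
	desc    *prometheus.Desc
	total   float64
	settled time.Time // "settled time": end of the last delta interval seen
}

// AddDelta folds a fetched delta into the running total.
func (s *settledCounter) AddDelta(v float64, intervalEnd time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.total += v
	if intervalEnd.After(s.settled) {
		s.settled = intervalEnd
	}
}

func (s *settledCounter) Describe(ch chan<- *prometheus.Desc) { ch <- s.desc }

// Collect exposes the accumulated total stamped with the settled time.
func (s *settledCounter) Collect(ch chan<- prometheus.Metric) {
	s.mu.Lock()
	defer s.mu.Unlock()
	m := prometheus.MustNewConstMetric(s.desc, prometheus.CounterValue, s.total)
	ch <- prometheus.NewMetricWithTimestamp(s.settled, m)
}
```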
Any updates on this? This makes the exporter nearly unusable in many cases.
My recommendation is the exporter should integrate the delta values to produce a counter. In other words, hold an internal counter, add the delta to it each time you fetch it, and export the counter. (EDIT: just noticed this was said a year ago #116 (comment))
I had started a naive implementation here. I believe that in order to produce an accurate value, I will need to update the collector to consider the time since the last scrape as part of calculating the start/end time, and to increment the counter for all data points found when querying for metric descriptors with kind = DELTA. Does that sound correct?
I didn't understand the bit about time.
Let me clarify: you are correct that the value will never truly be accurate; I'm trying to get the value incremented from scrape to scrape to be as accurate as reasonably possible. This might be overcomplicating the implementation, though. Example: scraping the exporter every 5 minutes for the apigateway proxy/request_count metric, which is sampled by GCP every 1 minute. I would expect there to be 5 data points from GCP which need to increment the counter to produce an accurate increase between scrapes. The current logic uses the configuration values
I should note I've never looked at this code before. (See stackdriver_exporter/collectors/monitoring_collector.go, lines 218 to 219 at b06e49c.)
What I would expect is that we remember the last time we called the API (that's the start), the end is "now", and then we get however many points are new and add them all up.
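As a minimal sketch of that windowing, with fetchDeltaPoints standing in for the Monitoring API call (it is not a real function in the exporter):

```go
package collector

import "time"

// deltaWindow remembers the end of the previous query so the next query can
// cover exactly [lastEnd, now) and every new DELTA point gets counted once.
type deltaWindow struct {
	lastEnd time.Time
	total   float64
}

// Update queries the window since the last call and folds every returned
// delta into the running total. fetchDeltaPoints is a hypothetical stand-in
// for the Monitoring API request.
func (w *deltaWindow) Update(fetchDeltaPoints func(start, end time.Time) []float64) {
	now := time.Now()
	if w.lastEnd.IsZero() {
		w.lastEnd = now.Add(-5 * time.Minute) // first run: bounded lookback
	}
	for _, v := range fetchDeltaPoints(w.lastEnd, now) {
		w.total += v
	}
	w.lastEnd = now
}
```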
I also have a working implementation, although I still need to write unit tests and it needs some minor cleanup from our debugging. This handles the scrape timing stuff correctly, but in a pretty simple way... I think. I'm at Mailchimp, where the issue was first opened by Greg (and @SpencerMalone re-queried about it). This seems to meet our needs and we'd love to get it (or a version of this) upstreamed! Of note -- this does not currently explicitly handle DISTRIBUTION DELTAs -- I missed the reference to this above when I put this together. I'll take another gander at the code, but I feel like this solution can be amended to handle that also. Mentioned here: #25
This is awesome, especially if you have feedback from running it live! I do wonder about a few things based on the slightly different approaches we each took:
By no means am I an expert on GCP metrics; I'm just comparing our design decisions.
I swear I replied to the above a few weeks ago... but apparently I did not. I had a laptop switch and then spent 2 weeks doing trail work in Maine, so maybe the reply died in an input box somewhere. So lemme try again:
I am also not an expert (or, well, even a journeyman) on GCP metrics or Prometheus, so I'm happy to have input and interest! Since the last update we are now handling DISTRIBUTION DELTAs similarly (and have fixed a few bugs). I went ahead and opened a PR with the changes we are currently running in our prod environment. We're still in week 1 with this code, but will update if we notice issues.
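For what it's worth, a DISTRIBUTION DELTA can be accumulated in the same additive way, since count, sum, and per-bucket counts all add up; here is a rough sketch with invented field names (not the PR's actual types):

```go
package collector

// histogramAccumulator folds DISTRIBUTION DELTA samples into running totals.
// Count, sum, and per-bucket counts are all additive, so each delta can simply
// be merged in; converting to Prometheus' cumulative buckets can happen at
// expose time.
type histogramAccumulator struct {
	count   uint64
	sum     float64
	buckets map[float64]uint64 // bucket upper bound -> count in that bucket
}

// AddDelta merges one DISTRIBUTION DELTA sample into the accumulator.
func (h *histogramAccumulator) AddDelta(count uint64, sum float64, bucketCounts map[float64]uint64) {
	h.count += count
	h.sum += sum
	if h.buckets == nil {
		h.buckets = map[float64]uint64{}
	}
	for upperBound, c := range bucketCounts {
		h.buckets[upperBound] += c
	}
}
```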
Awesome work! Related to the points I originally brought up:
This PR introduces support for aggregating DELTA metrics in-memory. The implementation is somewhat based on #167, but with extra functionality intended to solve the problem identified in #167 (comment). I chose to inject interfaces for the delta and distribution stores at the Collector level to try to keep library friendliness intact (#157). I did as much as I could in the README.md to explain the solution and the tradeoffs of aggregating deltas in this manner. The bulk of the changes are in `monitoring_metrics.go`, where I tried to encapsulate as much of the delta logic as possible. Should hopefully resolve #116 and #25. Signed-off-by: Kyle Eckhart <[email protected]>
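To illustrate the injection idea only (the actual interface names and signatures in the PR may well differ), the collector could accept a store dependency so library users can supply their own aggregation:

```go
package collector

import "time"

// DeltaSample and DeltaStore are hypothetical shapes for the injected stores
// described above; the PR's real types and methods may differ.
type DeltaSample struct {
	Labels map[string]string
	Value  float64
	End    time.Time
}

type DeltaStore interface {
	// Increment folds a newly fetched DELTA point into the stored series.
	Increment(metricName string, s DeltaSample)
	// List returns the accumulated series to emit at scrape time.
	List(metricName string) []DeltaSample
}

// The collector would then take the store as a dependency, keeping the
// aggregation strategy pluggable for library users.
type MonitoringCollector struct {
	deltaStore DeltaStore
}
```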
Right now, there are several issues stemming from Stackdriver's DELTA metrics, which record only the change in a metric. This leads to inconsistent data in Prometheus, since the value reported by the exporter is a gauge that is only set to the value of the change. Ideally, we'd update a gauge with the value of the delta change to reduce the impedance mismatch between Stackdriver and Prometheus. This is most noticeable with infrequently updated metrics, which produce a single gauge data point in Prometheus once in a while, with huge gaps between them.
It looks like DELTA metrics used to be handled as counters, but that was changed with #15 since DELTAs can go down. There probably just needs to be different handling of how we set the gauge: if it's a GAUGE type from Stackdriver, we set the value of the Prometheus gauge to the value from Stackdriver; if it's a DELTA type from Stackdriver, we add the value of the Stackdriver delta to the value of the Prometheus gauge.
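A sketch of that per-kind handling, using a client_golang gauge as a stand-in for the exporter's actual metric plumbing (the helper is illustrative, not real code from the exporter):

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// applyPoint is an illustrative helper, not the exporter's real code: GAUGE
// points overwrite the Prometheus gauge, DELTA points are added to it.
func applyPoint(g prometheus.Gauge, metricKind string, value float64) {
	switch metricKind {
	case "GAUGE":
		g.Set(value) // report the absolute value from Stackdriver as-is
	case "DELTA":
		g.Add(value) // fold the change into the running value
	}
}
```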