-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(outputs/stackdriver): cumulative interval start times #10097
fix(outputs/stackdriver): cumulative interval start times #10097
Conversation
Note that as mentioned in the readme update, even with this change there is the potential for undefined behavior if one of the underlying counters resets without telegraf restarting in concert with it. Doing this "right" would probably involve caching a separate start time and the last observed value for each counter metric and resetting the start time any time the metric resets, which seems to somewhat cut against the grain of telegraf's normal near-statelessness on the output side, but if there's interest I could attempt to implement that. |
Actually, having thought on it a bit, the per-metric cache really seems to be the only way to do this: setting the start time to the process start time means that if the input source starts exposing a new counter >25 hours after we start, that counter will never make it to stackdriver. So, an update, complete with unit tests. |
df3a332
to
eca8a69
Compare
0014ce5
to
742a671
Compare
742a671
to
38129d2
Compare
I guess this would benefit from plugin state-persistence (PR #9476)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for approaching this @n-oden! While the overall approach looks good, I have some comments in the code. Please try to reduce the non-related changes (renaming of aliases, reordering of imports, etc) to an absolute minimum and submit them as separate PRs to ease review. Furthermore, do you think it's possible to leave getStackdriverTimeInterval()
as is and instead determine the start
and end
times in a separate function?
hey @srebhan glad to finally have a review for this! Your comments seem reasonable; I'll try to get them all addressed today or tomorrow. |
@n-oden sorry for taking so long to come to this... :-( |
38129d2
to
4f0b09b
Compare
@srebhan should be good to go, I think! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@n-oden thanks for the update. I have some more comments, but we are almost there I think.
@srebhan all done I think? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The linter found some more... :-)
📦 Looks like new artifacts were built from this PR. Expand this list to get them here ! 🐯Artifact URLs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thanks for fixing this long-standing issue @n-oden!
- Cherry-pick pgpool/pgpool2_exporter#14 into our build of pgpool_exporter - Build telegraf from source; circle-ci is now 404ing the artifact link from influxdata/telegraf#10097 :( - Bump version to 1.0.5
(cherry picked from commit 697855c)
Required for all PRs:
resolves #7758
Cumulative metrics, when posted to google cloud monitoring (nee
Stackdriver), require a TimeInterval with a start time that corresponds
to the last time at which the counter/cumulative metric was reset to
zero, cannot overlap previous intervals, and cannot be more than 25
hours in the past:
https://cloud.google.com/monitoring/api/ref_v3/rest/v3/TimeInterval
The stackdriver output plugin had been creating StartTimes of
Thu Jan 1 00:00:01 UTC 1970
leading to bizarre behavior: stackdriver graphswould reflect flatlined fractional values no matter what data telegraf
was sending, and direct queries to the projects.timeSeries.list API
would initially return the last handful of posted values but then
eventually return only
{"unit": "not_a_unit"}
for the same query,presumably due to stackdriver having GC'ed the invalid intervals on the
backend.
So, herein: