Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(outputs/stackdriver): cumulative interval start times #10097

Merged

Conversation

n-oden
Copy link
Contributor

@n-oden n-oden commented Nov 11, 2021

Required for all PRs:

resolves #7758

Cumulative metrics, when posted to google cloud monitoring (nee
Stackdriver), require a TimeInterval with a start time that corresponds
to the last time at which the counter/cumulative metric was reset to
zero, cannot overlap previous intervals, and cannot be more than 25
hours in the past:

https://cloud.google.com/monitoring/api/ref_v3/rest/v3/TimeInterval

The stackdriver output plugin had been creating StartTimes of
Thu Jan 1 00:00:01 UTC 1970 leading to bizarre behavior: stackdriver graphs
would reflect flatlined fractional values no matter what data telegraf
was sending, and direct queries to the projects.timeSeries.list API
would initially return the last handful of posted values but then
eventually return only {"unit": "not_a_unit"} for the same query,
presumably due to stackdriver having GC'ed the invalid intervals on the
backend.

So, herein:

  • create a cache of start times and observed last values for all counter metrics (keyed by name, tags and field)
  • reset the start time of a cache entry if the newest value is less than the previous observed value (counter reset)
  • reset the start time of a cache entry if it is greater than 24 hours old
  • while we're at it, use the telegraf logger rather than the base golang logger

@telegraf-tiger telegraf-tiger bot added the fix pr to fix corresponding bug label Nov 11, 2021
@n-oden
Copy link
Contributor Author

n-oden commented Nov 11, 2021

Note that as mentioned in the readme update, even with this change there is the potential for undefined behavior if one of the underlying counters resets without telegraf restarting in concert with it. Doing this "right" would probably involve caching a separate start time and the last observed value for each counter metric and resetting the start time any time the metric resets, which seems to somewhat cut against the grain of telegraf's normal near-statelessness on the output side, but if there's interest I could attempt to implement that.

@n-oden
Copy link
Contributor Author

n-oden commented Nov 12, 2021

Actually, having thought on it a bit, the per-metric cache really seems to be the only way to do this: setting the start time to the process start time means that if the input source starts exposing a new counter >25 hours after we start, that counter will never make it to stackdriver.

So, an update, complete with unit tests.

@n-oden n-oden force-pushed the fix-stackdriver-cumulative-intervals branch 2 times, most recently from df3a332 to eca8a69 Compare November 12, 2021 17:58
@n-oden n-oden force-pushed the fix-stackdriver-cumulative-intervals branch 2 times, most recently from 0014ce5 to 742a671 Compare November 28, 2021 22:35
@n-oden
Copy link
Contributor Author

n-oden commented Nov 28, 2021

@sspaink @Hipska @powersj Could I get a look taken at this? (Apologies in advance if I've missed a step here.)

@n-oden n-oden force-pushed the fix-stackdriver-cumulative-intervals branch from 742a671 to 38129d2 Compare November 30, 2021 23:34
@srebhan
Copy link
Member

srebhan commented Dec 21, 2021

I guess this would benefit from plugin state-persistence (PR #9476)?

Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for approaching this @n-oden! While the overall approach looks good, I have some comments in the code. Please try to reduce the non-related changes (renaming of aliases, reordering of imports, etc) to an absolute minimum and submit them as separate PRs to ease review. Furthermore, do you think it's possible to leave getStackdriverTimeInterval() as is and instead determine the start and end times in a separate function?

@srebhan srebhan self-assigned this Dec 21, 2021
@srebhan srebhan added the plugin/output 1. Request for new output plugins 2. Issues/PRs that are related to out plugins label Dec 21, 2021
@n-oden
Copy link
Contributor Author

n-oden commented Dec 21, 2021

hey @srebhan glad to finally have a review for this! Your comments seem reasonable; I'll try to get them all addressed today or tomorrow.

@srebhan
Copy link
Member

srebhan commented Dec 21, 2021

@n-oden sorry for taking so long to come to this... :-(

@n-oden n-oden force-pushed the fix-stackdriver-cumulative-intervals branch from 38129d2 to 4f0b09b Compare December 21, 2021 17:33
@n-oden n-oden requested a review from srebhan December 21, 2021 18:06
@n-oden
Copy link
Contributor Author

n-oden commented Dec 21, 2021

@srebhan should be good to go, I think!

Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@n-oden thanks for the update. I have some more comments, but we are almost there I think.

@n-oden n-oden requested a review from srebhan December 22, 2021 15:21
@n-oden
Copy link
Contributor Author

n-oden commented Dec 22, 2021

@srebhan all done I think?

Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The linter found some more... :-)

@telegraf-tiger
Copy link
Contributor

@n-oden n-oden requested a review from srebhan December 22, 2021 18:32
Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for fixing this long-standing issue @n-oden!

@srebhan srebhan added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Dec 22, 2021
@powersj powersj merged commit 697855c into influxdata:master Dec 22, 2021
@n-oden n-oden deleted the fix-stackdriver-cumulative-intervals branch December 22, 2021 22:24
powersj pushed a commit to powersj/telegraf that referenced this pull request Jan 21, 2022
n-oden added a commit to odenio/pgpool-cloudsql that referenced this pull request Jan 24, 2022
- Cherry-pick pgpool/pgpool2_exporter#14
  into our build of pgpool_exporter

- Build telegraf from source; circle-ci is now 404ing the artifact
  link from influxdata/telegraf#10097 :(

- Bump version to 1.0.5
reimda pushed a commit that referenced this pull request Jan 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix pr to fix corresponding bug plugin/output 1. Request for new output plugins 2. Issues/PRs that are related to out plugins ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

stackdriver output plugin sends CUMULATIVE metrics incorrect
3 participants