fix: windows service - graceful shutdown of telegraf #9616
Any maintainers able to review/comment on this draft PR?
Bump - can we get this reviewed?
I've been trying this out on Windows 10 and it does seem to do a better job of cleaning up Telegraf and stopping more gracefully.
(cherry picked from commit b954e56)
* origin/master: (133 commits) chore: restart service if it is already running and upgraded via RPM (influxdata#9970) feat: update etc/telegraf.conf and etc/telegraf_windows.conf (influxdata#10237) fix: Handle duplicate registration of protocol-buffer files gracefully. (influxdata#10188) fix(http_listener_v2): fix panic on close (influxdata#10132) feat: add Vault input plugin (influxdata#10198) feat: support aws managed service for prometheus (influxdata#10202) fix: Make telegraf compile on Windows with golang 1.16.2 (influxdata#10246) Update changelog feat: Modbus add per-request tags (influxdata#10231) fix: Implement NaN and inf handling for elasticsearch output (influxdata#10196) feat: add nomad input plugin (influxdata#10106) fix: Print loaded plugins and deprecations for once and test (influxdata#10205) fix: eliminate MIB dependency for ifname processor (influxdata#10214) feat: Optimize locking for SNMP MIBs loading. (influxdata#10206) feat: Add SMART plugin concurrency configuration option, nvme-cli v1.14+ support and lint fixes. 
(influxdata#10150) feat: update configs (influxdata#10236) fix(inputs/kube_inventory): set TLS server name config properly (influxdata#9975) fix: Sudden close of Telegraf caused by OPC UA input plugin (influxdata#10230) fix: bump github.com/eclipse/paho.mqtt.golang from 1.3.0 to 1.3.5 (influxdata#9913) fix: json_v2 parser timestamp setting (influxdata#10221) fix: ensure graylog spec fields not prefixed with '_' (influxdata#10209) docs: remove duplicate links in CONTRIBUTING.md (influxdata#10218) fix: pool detection and metrics gathering for ZFS >= 2.1.x (influxdata#10099) fix: parallelism fix for ifname processor (influxdata#10007) chore: Forbids "log" package only for aggregators, inputs, outputs, parsers and processors (influxdata#10191) docs: address documentation gap when running telegraf in k8s (influxdata#10215) feat: update etc/telegraf.conf and etc/telegraf_windows.conf (influxdata#10211) fix: mqtt topic extracting no longer requires all three fields (influxdata#10208) fix: windows service - graceful shutdown of telegraf (influxdata#9616) feat: update etc/telegraf.conf and etc/telegraf_windows.conf (influxdata#10201) feat: Modbus support multiple slaves (gateway feature) (influxdata#9279) fix: Revert unintented corruption of the Makefile from influxdata#10200. (influxdata#10203) chore: remove triggering update-config bot in CI (influxdata#10195) Update changelog feat: Implement deprecation infrastructure (influxdata#10200) fix: extra lock on init for safety (influxdata#10199) fix: resolve influxdata#10027 (influxdata#10112) fix: register bigquery to output plugins influxdata#10177 (influxdata#10178) fix: sysstat use unique temp file vs hard-coded (influxdata#10165) refactor: snmp to use gosmi (influxdata#9518) ...
Required for all PRs:
Resolves #7876. However, the issue affects telegraf agent behaviour across all plugins when telegraf runs as a service, not just the execd plugin.
The current telegraf service implementation does not appear to wait for the main telegraf agent loop to finish gracefully before the main thread exits. This causes an abrupt exit, and plugins do not get a chance to shut down or clean up properly (i.e., outputs are not flushed and inputs with external dependencies are not closed properly).
I'm submitting this PR initially for review, as I am not sure this is the right solution to cover all scenarios. However, in testing it works as intended under ideal scenarios: the main goroutine is blocked until the agent runs to completion.
The code changes are fairly simple. When the service stop control signal is triggered in Windows, the service interface `Stop` is called (see https://github.com/kardianos/service/blob/v1.0.0/service_windows.go#L282). The `Stop` interface sends a struct to the `stop` channel so that the `reloadLoop` runs the context cancel function to commence agent shutdown. While shutdown is in progress, the `Stop` interface is blocked waiting for the `stop` channel to be closed. Once `reloadLoop` completes, we close the `stop` channel to allow the main thread to complete.

My hesitancy with this PR being a full fix is that I do not fully understand the Windows-service-specific concurrency implementation, so I'm unsure if this PR opens up telegraf to other potential race conditions or deadlocks.
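
For reviewers, the stop handshake described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the actual telegraf code: the `program` type, the `runAgent` method and the `flushed` field are hypothetical stand-ins, and the sleep only simulates plugin cleanup. Only the names `Stop`, `stop` and `reloadLoop` come from the description.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// program is a hypothetical stand-in for the value handed to the
// Windows service manager.
type program struct {
	stop    chan struct{} // carries the stop request, then is closed once shutdown is done
	flushed bool          // set by the simulated agent after it has cleaned up
}

// runAgent simulates the agent loop: it runs until ctx is cancelled,
// then spends some time "flushing outputs" before returning.
func (p *program) runAgent(ctx context.Context) {
	<-ctx.Done()
	time.Sleep(50 * time.Millisecond) // pretend plugin cleanup takes a while
	p.flushed = true
}

// reloadLoop waits for a stop request, cancels the agent context, and
// closes the stop channel only after the agent has fully returned.
func (p *program) reloadLoop() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	go func() {
		<-p.stop // stop request sent by Stop (or the close below)
		cancel() // begin agent shutdown
	}()

	p.runAgent(ctx)
	close(p.stop) // release the blocked Stop call
}

// Stop is what the service manager would call on a stop control signal.
// It requests shutdown, then blocks until reloadLoop closes the channel.
func (p *program) Stop() {
	p.stop <- struct{}{} // ask reloadLoop to cancel the agent
	<-p.stop             // returns only once the channel is closed
}

func main() {
	p := &program{stop: make(chan struct{})}
	go p.reloadLoop()

	p.Stop()
	fmt.Println("agent cleaned up before Stop returned:", p.flushed)
}
```

In this sketch the same unbuffered channel serves as both the request and the completion signal, which is what makes `Stop` block for the duration of the shutdown. One thing the sketch does not cover is `Stop` being called more than once or after `reloadLoop` has already exited; a send on the closed channel would panic, which is the kind of edge case I am unsure about.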
@ssoroka