vSphere input for Telegraf stops working #5057
Comments
We are working on this particular issue as we speak. It seems to be related to intermittent network issues not being handled correctly.
@danielnelson This goes back to the discussion we had a while ago about exposing the interval, or passing a context with a timeout into Gather(). Every time I make an API call, I have to wrap it like this:

```go
ctx1, cancel1 := context.WithTimeout(ctx, timeout)
defer cancel1()
APICall(ctx1, params)
```

As you might have guessed, I forgot that wrapping around one call, causing it to hang indefinitely when the network was dropped. If I had known the interval, I could have created a root context with a deadline matching the interval and passed it to all calls. Alternatively, if Gather() took a context, the Telegraf core could cancel it when the interval was exceeded. Of course I shouldn't have forgotten the wrapping, but having to handle timeouts manually for every call is very error prone.
It is expected that the input will continue to work towards completing the collection, even across multiple intervals; this way we don't end up continually restarting the plugin, and it degrades more gracefully. Better to have a reduced sampling rate than no data at all. I do want to add a context to the Gather function, but it should be used only for canceling on shutdown/restart.
https://github.com/influxdata/telegraf/blob/release-1.10/plugins/inputs/vsphere/README.md documents the cause of this issue (realtime vs. historical metrics in vSphere) quite nicely and gives a workaround. TL;DR: just use two instances of the plugin, one for VM and host (realtime) metrics, the second for the other (historical) metrics.
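The two-instance workaround described in the README can be sketched roughly like this. This is a minimal illustration, not a complete configuration: the vCenter URL and credentials are placeholders, and the intervals are examples chosen to match the realtime (20s granularity) vs. historical (5min granularity) split.

```toml
## Instance 1: realtime metrics (VMs and hosts), polled frequently.
[[inputs.vsphere]]
  interval = "60s"
  vcenters = [ "https://vcenter.local/sdk" ]  # placeholder
  username = "user"                            # placeholder
  password = "secret"                          # placeholder
  datastore_metric_exclude = [ "*" ]
  cluster_metric_exclude = [ "*" ]
  datacenter_metric_exclude = [ "*" ]

## Instance 2: historical metrics (datastores, clusters, datacenters),
## which vCenter only refreshes every 5 minutes, so poll less often.
[[inputs.vsphere]]
  interval = "300s"
  vcenters = [ "https://vcenter.local/sdk" ]
  username = "user"
  password = "secret"
  vm_metric_exclude = [ "*" ]
  host_metric_exclude = [ "*" ]
```

Splitting the plugin this way keeps the slow historical queries from blocking the fast realtime collection within a single interval.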
Relevant telegraf.conf:
System info:
CentOS Linux release 7.5.1804
Telegraf 1.9.0
InfluxDB 1.6.3
Telegraf startup log shows the following:
```
2018-11-28T21:56:12Z I! Loaded inputs: inputs.influxdb inputs.jolokia2_agent inputs.vsphere inputs.cpu inputs.disk
2018-11-28T21:56:12Z I! Loaded aggregators:
2018-11-28T21:56:12Z I! Loaded processors:
2018-11-28T21:56:12Z I! Loaded outputs: influxdb
2018-11-28T21:56:12Z I! Tags enabled: host=xxxx
2018-11-28T21:56:12Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"rxxxx", Flush Interval:10s
2018-11-28T21:56:12Z D! [agent] Connecting outputs
2018-11-28T21:56:12Z D! [agent] Attempting connection to output: influxdb
2018-11-28T21:56:12Z D! [agent] Successfully connected to output: influxdb
2018-11-28T21:56:12Z D! [agent] Starting service inputs
2018-11-28T21:56:12Z D! [input.vsphere]: Starting plugin
2018-11-28T21:56:12Z D! [input.vsphere]: Running initial discovery and waiting for it to finish
2018-11-28T21:56:12Z D! [input.vsphere]: Creating client: xxxxx
2018-11-28T21:56:12Z I! [input.vsphere] Option query for maxQueryMetrics failed. Using default
2018-11-28T21:56:12Z D! [input.vsphere] vCenter version is: 6.5.0
```
Steps to reproduce:
```toml
[[inputs.vsphere]]
  vcenters = [ "rzzz" ]
  username = "zzzz"
  password = "zzz"
  interval = "30s"
  vm_metric_include = [
    "sys.uptime.latest",
    "cpu.usage.average",
    "cpu.ready.summation",
    "cpu.readiness.average",
    "cpu.usagemhz.average",
    "cpu.wait.summation",
    "cpu.system.summation",
    "cpu.used.summation",
    "mem.usage.average",
    "mem.consumed.average",
    "mem.active.average",
    "mem.vmmemctl.average",
    "mem.swapused.average",
    "mem.swapIn.average",
    "mem.swapOut.average",
    "disk.maxTotalLatency.latest",
    "net.usage.average",
    "net.bytesRx.average",
    "net.bytesTx.average",
    "net.packetsRx.summation",
    "net.packetsTx.summation",
    "net.received.average",
    "net.transmitted.average",
    "virtualDisk.read.average",
    "virtualDisk.write.average",
    "virtualDisk.totalWriteLatency.average",
    "virtualDisk.totalReadLatency.average",
    "virtualDisk.numberReadAveraged.average",
    "virtualDisk.numberWriteAveraged.average",
    "virtualDisk.readOIO.latest",
    "virtualDisk.writeOIO.latest"
  ]
  vm_metric_exclude = []
  vm_instances = true ## true by default
  host_metric_include = [
    "cpu.usagemhz.average",
    "cpu.usage.average",
    "cpu.corecount.provisioned.average",
    "mem.capacity.provisioned.average",
    "mem.active.average",
    "net.throughput.usage.average",
    "net.throughput.contention.summation",
    "vmop.numSVMotion.latest",
    "vmop.numVMotion.latest",
    "vmop.numXVMotion.latest",
    "storageAdapter.numberReadAveraged.average",
    "storageAdapter.numberWriteAveraged.average",
    "storageAdapter.read.average",
    "storageAdapter.write.average",
    "storageAdapter.totalReadLatency.average",
    "storageAdapter.totalWriteLatency.average",
    "cpu.utilization.average",
    "cpu.readiness.average",
    "cpu.ready.summation",
    "net.bytesRx.average",
    "net.bytesTx.average",
    "virtualDisk.totalWriteLatency.average",
    "virtualDisk.totalReadLatency.average",
    "net.received.average",
    "net.transmitted.average",
    "net.packetsRx.summation",
    "net.packetsTx.summation",
    "mem.consumed.average",
    "mem.totalmb.average"
  ]
  host_metric_exclude = []
  host_instances = true ## true by default
  datastore_metric_include = [
    "datastore.numberReadAveraged.average",
    "datastore.numberWriteAveraged.average",
    "datastore.read.average",
    "datastore.write.average",
    "datastore.totalReadLatency.average",
    "datastore.totalWriteLatency.average",
    "datastore.datastoreVMObservedLatency.latest",
    "disk.capacity.latest",
    "disk.used.latest",
    "disk.numberReadAveraged.average",
    "disk.numberWriteAveraged.average"
  ] ## if omitted or empty, all metrics are collected
  datastore_metric_exclude = []
  datastore_instances = true ## false by default for Datastores only
  datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
  datacenter_metric_exclude = [] ## Datacenters are not collected by default.
  datacenter_instances = false
  cluster_metric_include = [] ## if omitted or empty, all metrics are collected
  cluster_metric_exclude = [] ## Nothing excluded by default
  cluster_instances = false ## true by default
  separator = "_"
  max_query_objects = 70
  max_query_metrics = 70
  collect_concurrency = 4
  discover_concurrency = 1
  force_discover_on_init = true
  object_discovery_interval = "30s"
  timeout = "20s"
  insecure_skip_verify = true
```
Expected behavior:
Data is collected
Actual behavior:
Data stops being collected after the following message appears in the log:
```
2018-11-28T21:13:00Z W! [agent] input "inputs.vsphere" did not complete within its interval
```
Additional info:
The same configuration works with Telegraf 1.8.3. The problem appeared after I upgraded Telegraf to 1.9.0.
Related forum link: https://community.influxdata.com/t/telegraf-vsphere/7457/11