
vSphere input for Telegraf stops working #5057

Closed
v4vamsee opened this issue Nov 29, 2018 · 5 comments
Labels
area/vsphere bug unexpected problem or unintended behavior

Comments

@v4vamsee

v4vamsee commented Nov 29, 2018

Relevant telegraf.conf:

System info:

CentOS Linux release 7.5.1804
Telegraf 1.9.0
InfluxDB 1.6.3

Telegraf startup log contains the following:

2018-11-28T21:56:12Z I! Loaded inputs: inputs.influxdb inputs.jolokia2_agent inputs.vsphere inputs.cpu inputs.disk
2018-11-28T21:56:12Z I! Loaded aggregators:
2018-11-28T21:56:12Z I! Loaded processors:
2018-11-28T21:56:12Z I! Loaded outputs: influxdb
2018-11-28T21:56:12Z I! Tags enabled: host=xxxx
2018-11-28T21:56:12Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"rxxxx", Flush Interval:10s
2018-11-28T21:56:12Z D! [agent] Connecting outputs
2018-11-28T21:56:12Z D! [agent] Attempting connection to output: influxdb
2018-11-28T21:56:12Z D! [agent] Successfully connected to output: influxdb
2018-11-28T21:56:12Z D! [agent] Starting service inputs
2018-11-28T21:56:12Z D! [input.vsphere]: Starting plugin
2018-11-28T21:56:12Z D! [input.vsphere]: Running initial discovery and waiting for it to finish
2018-11-28T21:56:12Z D! [input.vsphere]: Creating client: xxxxx
2018-11-28T21:56:12Z I! [input.vsphere] Option query for maxQueryMetrics failed. Using default
2018-11-28T21:56:12Z D! [input.vsphere] vCenter version is: 6.5.0

Steps to reproduce:

  1. Install Telegraf 1.9.0 and InfluxDB 1.6.3
  2. Configure the vSphere input with the following configuration:

[[inputs.vsphere]]
vcenters = [ "rzzz" ]
username = "zzzz"
password = "zzz"
interval = "30s"

vm_metric_include = [
"sys.uptime.latest",
"cpu.usage.average",
"cpu.ready.summation",
"cpu.readiness.average",
"cpu.usagemhz.average",
"cpu.wait.summation",
"cpu.system.summation",
"cpu.used.summation",
"mem.usage.average",
"mem.consumed.average",
"mem.active.average",
"mem.vmmemctl.average",
"mem.swapused.average",
"mem.swapIn.average",
"mem.swapOut.average",
"disk.maxTotalLatency.latest",
"net.usage.average",
"net.bytesRx.average",
"net.bytesTx.average",
"net.packetsRx.summation",
"net.packetsTx.summation",
"net.received.average",
"net.transmitted.average",
"virtualDisk.read.average",
"virtualDisk.write.average",
"virtualDisk.totalWriteLatency.average",
"virtualDisk.totalReadLatency.average",
"virtualDisk.numberReadAveraged.average",
"virtualDisk.numberWriteAveraged.average",
"virtualDisk.readOIO.latest",
"virtualDisk.writeOIO.latest"
]
vm_metric_exclude = []
vm_instances = true ## true by default

host_metric_include = [
"cpu.usagemhz.average",
"cpu.usage.average",
"cpu.corecount.provisioned.average",
"mem.capacity.provisioned.average",
"mem.active.average",
"net.throughput.usage.average",
"net.throughput.contention.summation",
"vmop.numSVMotion.latest",
"vmop.numVMotion.latest",
"vmop.numXVMotion.latest",
"storageAdapter.numberReadAveraged.average",
"storageAdapter.numberWriteAveraged.average",
"storageAdapter.read.average",
"storageAdapter.write.average",
"storageAdapter.totalReadLatency.average",
"storageAdapter.totalWriteLatency.average",
"cpu.utilization.average",
"cpu.readiness.average",
"cpu.ready.summation",
"net.bytesRx.average",
"net.bytesTx.average",
"virtualDisk.totalWriteLatency.average",
"virtualDisk.totalReadLatency.average",
"net.received.average",
"net.transmitted.average",
"net.packetsRx.summation",
"net.packetsTx.summation",
"mem.consumed.average",
"mem.totalmb.average"
]
host_metric_exclude = []
host_instances = true ## true by default

datastore_metric_include = [
"datastore.numberReadAveraged.average",
"datastore.numberWriteAveraged.average",
"datastore.read.average",
"datastore.write.average",
"datastore.totalReadLatency.average",
"datastore.totalWriteLatency.average",
"datastore.datastoreVMObservedLatency.latest",
"disk.capacity.latest",
"disk.used.latest",
"disk.numberReadAveraged.average",
"disk.numberWriteAveraged.average"
] ## if omitted or empty, all metrics are collected
datastore_metric_exclude = []
datastore_instances = true ## false by default for Datastores only

datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
datacenter_metric_exclude = [] ## Datacenters are not collected by default.
datacenter_instances = false

cluster_metric_include = [] ## if omitted or empty, all metrics are collected
cluster_metric_exclude = [] ## Nothing excluded by default
cluster_instances = false ## true by default

separator = "_"
max_query_objects = 70
max_query_metrics = 70
collect_concurrency = 4
discover_concurrency = 1
force_discover_on_init = true
object_discovery_interval = "30s"
timeout = "20s"
insecure_skip_verify = true

Expected behavior:

Data is collected

Actual behavior:

Data stops being collected after the following message appears in the log:

2018-11-28T21:13:00Z W! [agent] input “inputs.vsphere” did not complete within its interval

Additional info:

The same configuration works with Telegraf 1.8.3. The problem appeared after I upgraded Telegraf from 1.8.3 to 1.9.0.

Related forum link: https://community.influxdata.com/t/telegraf-vsphere/7457/11

@danielnelson danielnelson added bug unexpected problem or unintended behavior area/vsphere labels Nov 29, 2018
@prydin
Contributor

prydin commented Nov 29, 2018

We are working on this particular issue as we speak. It seems to be related to intermittent network issues not being handled correctly.

@prydin
Contributor

prydin commented Nov 29, 2018

@danielnelson This goes back to the discussion we had a while ago about exposing the interval or sending in a context with a timeout to Gather().

Every time I make an API call, I have to wrap it like this:

ctx1, cancel1 := context.WithTimeout(ctx, timeout)
defer cancel1()
APICall(ctx1, params)

As you might have guessed, I forgot that wrapping around one call, which caused it to hang indefinitely when the network was dropped.

If I knew the interval, I could create a root context with a deadline matching the interval and pass it to all calls. Alternatively, if Gather() took a context, the Telegraf core could cancel that context when the interval was exceeded.

Of course I shouldn't have forgotten the wrapping, but having to handle timeouts manually for every call is very error-prone.

@danielnelson
Contributor

It is expected that the input will continue working towards completing the collection even across multiple intervals; this way we don't end up continually restarting the plugin, and collection degrades more gracefully. Better to have a reduced sampling rate than no data at all.

I do want to add a context to the Gather function, but it should be used only for canceling on shutdown/restart.

@einhirn

einhirn commented Mar 7, 2019

https://github.com/influxdata/telegraf/blob/release-1.10/plugins/inputs/vsphere/README.md documents the cause of this issue (realtime vs. historical metrics in vsphere) quite nicely and gives a workaround. TL;DR: Just use two instances of the plugin, one for vm and host (realtime) metrics, the second for the other (historical) metrics.
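The two-instance workaround can be sketched as the following TOML fragment (vCenter URL and credentials are placeholders; the exclude patterns follow the split between realtime and historical metrics described in the README):

```toml
## Instance 1: realtime metrics (VMs and hosts), short interval
[[inputs.vsphere]]
  interval = "60s"
  vcenters = [ "https://vcenter.example.com/sdk" ]
  username = "user"
  password = "secret"
  datastore_metric_exclude = [ "*" ]
  cluster_metric_exclude = [ "*" ]
  datacenter_metric_exclude = [ "*" ]

## Instance 2: historical metrics (datastores, clusters, datacenters);
## vCenter only refreshes these every few minutes, so use a longer interval
[[inputs.vsphere]]
  interval = "300s"
  vcenters = [ "https://vcenter.example.com/sdk" ]
  username = "user"
  password = "secret"
  vm_metric_exclude = [ "*" ]
  host_metric_exclude = [ "*" ]
```

This keeps the slow historical queries from blocking the fast realtime collection within a single plugin instance.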

@danielnelson
Contributor

Closing; this should be fixed by #5113, but also try the workaround that @einhirn pointed out. If this is still a problem with 1.10, check whether a similar issue is already open and, if not, open a new one.
