
vSphere input for Telegraf stops working #5057

Closed
v4vamsee opened this issue Nov 29, 2018 · 5 comments
Labels
area/vsphere bug unexpected problem or unintended behavior

Comments

@v4vamsee

v4vamsee commented Nov 29, 2018

Relevant telegraf.conf:

System info:

CentOS Linux release 7.5.1804
Telegraf 1.9.0
InfluxDB 1.6.3

Telegraf startup log contains the following:

2018-11-28T21:56:12Z I! Loaded inputs: inputs.influxdb inputs.jolokia2_agent inputs.vsphere inputs.cpu inputs.disk
2018-11-28T21:56:12Z I! Loaded aggregators:
2018-11-28T21:56:12Z I! Loaded processors:
2018-11-28T21:56:12Z I! Loaded outputs: influxdb
2018-11-28T21:56:12Z I! Tags enabled: host=xxxx
2018-11-28T21:56:12Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"rxxxx", Flush Interval:10s
2018-11-28T21:56:12Z D! [agent] Connecting outputs
2018-11-28T21:56:12Z D! [agent] Attempting connection to output: influxdb
2018-11-28T21:56:12Z D! [agent] Successfully connected to output: influxdb
2018-11-28T21:56:12Z D! [agent] Starting service inputs
2018-11-28T21:56:12Z D! [input.vsphere]: Starting plugin
2018-11-28T21:56:12Z D! [input.vsphere]: Running initial discovery and waiting for it to finish
2018-11-28T21:56:12Z D! [input.vsphere]: Creating client: xxxxx
2018-11-28T21:56:12Z I! [input.vsphere] Option query for maxQueryMetrics failed. Using default
2018-11-28T21:56:12Z D! [input.vsphere] vCenter version is: 6.5.0

Steps to reproduce:

  1. Install Telegraf 1.9.0 and InfluxDB 1.6.3
  2. Configure the vSphere input with the following configuration:

[[inputs.vsphere]]
vcenters = [ "rzzz" ]
username = "zzzz"
password = "zzz"
interval = "30s"

vm_metric_include = [
"sys.uptime.latest",
"cpu.usage.average",
"cpu.ready.summation",
"cpu.readiness.average",
"cpu.usagemhz.average",
"cpu.wait.summation",
"cpu.system.summation",
"cpu.used.summation",
"mem.usage.average",
"mem.consumed.average",
"mem.active.average",
"mem.vmmemctl.average",
"mem.swapused.average",
"mem.swapIn.average",
"mem.swapOut.average",
"disk.maxTotalLatency.latest",
"net.usage.average",
"net.bytesRx.average",
"net.bytesTx.average",
"net.packetsRx.summation",
"net.packetsTx.summation",
"net.received.average",
"net.transmitted.average",
"virtualDisk.read.average",
"virtualDisk.write.average",
"virtualDisk.totalWriteLatency.average",
"virtualDisk.totalReadLatency.average",
"virtualDisk.numberReadAveraged.average",
"virtualDisk.numberWriteAveraged.average",
"virtualDisk.readOIO.latest",
"virtualDisk.writeOIO.latest"
]
vm_metric_exclude = []
vm_instances = true ## true by default

host_metric_include = [
"cpu.usagemhz.average",
"cpu.usage.average",
"cpu.corecount.provisioned.average",
"mem.capacity.provisioned.average",
"mem.active.average",
"net.throughput.usage.average",
"net.throughput.contention.summation",
"vmop.numSVMotion.latest",
"vmop.numVMotion.latest",
"vmop.numXVMotion.latest",
"storageAdapter.numberReadAveraged.average",
"storageAdapter.numberWriteAveraged.average",
"storageAdapter.read.average",
"storageAdapter.write.average",
"storageAdapter.totalReadLatency.average",
"storageAdapter.totalWriteLatency.average",
"cpu.utilization.average",
"cpu.readiness.average",
"cpu.ready.summation",
"net.bytesRx.average",
"net.bytesTx.average",
"virtualDisk.totalWriteLatency.average",
"virtualDisk.totalReadLatency.average",
"net.received.average",
"net.transmitted.average",
"net.packetsRx.summation",
"net.packetsTx.summation",
"mem.consumed.average",
"mem.totalmb.average"
]
host_metric_exclude = []
host_instances = true ## true by default

datastore_metric_include = [
"datastore.numberReadAveraged.average",
"datastore.numberWriteAveraged.average",
"datastore.read.average",
"datastore.write.average",
"datastore.totalReadLatency.average",
"datastore.totalWriteLatency.average",
"datastore.datastoreVMObservedLatency.latest",
"disk.capacity.latest",
"disk.used.latest",
"disk.numberReadAveraged.average",
"disk.numberWriteAveraged.average"
] ## if omitted or empty, all metrics are collected
datastore_metric_exclude = []
datastore_instances = true ## false by default for Datastores only

datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
datacenter_metric_exclude = [] ## Datacenters are not collected by default.
datacenter_instances = false

cluster_metric_include = [] ## if omitted or empty, all metrics are collected
cluster_metric_exclude = [] ## Nothing excluded by default
cluster_instances = false ## true by default

separator = "_"
max_query_objects = 70
max_query_metrics = 70
collect_concurrency = 4
discover_concurrency = 1
force_discover_on_init = true
object_discovery_interval = "30s"
timeout = "20s"
insecure_skip_verify = true

Expected behavior:

Data is collected

Actual behavior:

Data stops being collected after the following message appears in the log:

2018-11-28T21:13:00Z W! [agent] input “inputs.vsphere” did not complete within its interval

Additional info:

The same configuration works with Telegraf 1.8.3. The problem appeared after I upgraded Telegraf from 1.8.3 to 1.9.0.

Related forum link: https://community.influxdata.com/t/telegraf-vsphere/7457/11

@danielnelson danielnelson added bug unexpected problem or unintended behavior area/vsphere labels Nov 29, 2018
@prydin
Contributor

prydin commented Nov 29, 2018

We are working on this particular issue as we speak. It seems to be related to intermittent network issues not being handled correctly.

@prydin
Contributor

prydin commented Nov 29, 2018

@danielnelson This goes back to the discussion we had a while ago about exposing the interval or sending in a context with a timeout to Gather().

Every time I make an API call, I have to wrap it like this:

ctx1, cancel1 := context.WithTimeout(ctx, timeout)
defer cancel1()
APICall(ctx1, params)

As you might have guessed, I forgot that wrapping around one call, which caused it to hang indefinitely when the network was dropped.

If I knew the interval, I could create a root context with a deadline matching the interval and pass it to all calls. Alternatively, if Gather() took a context, the Telegraf core could cancel that context when the interval was exceeded.

Of course I shouldn't have forgotten the wrapping, but having to handle timeouts manually for every call is very error-prone.

@danielnelson
Contributor

It is expected that the input will continue working towards completing the collection even across multiple intervals; this way we don't end up continually restarting the plugin, and collection degrades more gracefully. Better to have a reduced sampling rate than no data at all.

I do want to add a context to the Gather function, but it should be used only for canceling on shutdown/restart.

@einhirn

einhirn commented Mar 7, 2019

https://github.com/influxdata/telegraf/blob/release-1.10/plugins/inputs/vsphere/README.md documents the cause of this issue (realtime vs. historical metrics in vsphere) quite nicely and gives a workaround. TL;DR: Just use two instances of the plugin, one for vm and host (realtime) metrics, the second for the other (historical) metrics.
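The two-instance workaround can be sketched as the following TOML fragment (vCenter URL and credentials are placeholders; the exclude patterns follow the split between realtime and historical metrics described in the README):

```toml
## Instance 1: realtime metrics (VMs and hosts), short interval
[[inputs.vsphere]]
  interval = "60s"
  vcenters = [ "https://vcenter.example.com/sdk" ]
  username = "user"
  password = "secret"
  datastore_metric_exclude = [ "*" ]
  cluster_metric_exclude = [ "*" ]
  datacenter_metric_exclude = [ "*" ]

## Instance 2: historical metrics (datastores, clusters, datacenters);
## vCenter only refreshes these every few minutes, so use a longer interval
[[inputs.vsphere]]
  interval = "300s"
  vcenters = [ "https://vcenter.example.com/sdk" ]
  username = "user"
  password = "secret"
  vm_metric_exclude = [ "*" ]
  host_metric_exclude = [ "*" ]
```

This keeps the slow historical queries from blocking the fast realtime collection within a single plugin instance.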

@danielnelson
Contributor

Closing; this should be fixed by #5113, but also try the workaround that @einhirn pointed out. If this is still a problem with 1.10, check whether a similar issue is already open and, if not, open a new one.
