[Inputs.vsphere] Error in plugin: ServerFaultCode: XML document element count exceeds configured maximum 500000 #5041
It looks like the plugin is trying to send a huge query to the server. Limit max_query_objects and/or max_query_metrics. For example: max_query_objects = 100
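For context, a minimal sketch of where those settings live in the plugin config (the vCenter URL and credentials below are placeholders):

```toml
[[inputs.vsphere]]
  ## Placeholder connection details
  vcenters = ["https://vcenter.example.com/sdk"]
  username = "telegraf@vsphere.local"
  password = "secret"

  ## Cap objects and metrics per query so the SOAP payload stays
  ## under vCenter's XML element limit.
  max_query_objects = 100
  max_query_metrics = 100
```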
The next release will also limit queries to 100,000 metrics at a time, regardless of settings. This should prevent this from happening again. As a side note, @phreak2599, I'd be interested in knowing a bit more about your configuration. Assuming you have the default 256 objects per query, 500,000 metrics sounds incredibly high. How many VMs/hosts are in that vCenter, if you don't mind sharing?
It was actually set to 64, since we are currently running 5.5, although we are in the process of upgrading to 6.5. I have since set it to 32 to see if that helps. This plugin is currently running in one of our data centers against 4 vCenter hosts. Not sure which vCenter was causing the issue though. Update: Seems to be working well with both the query settings set to 32. Thanks for the help @prydin !
Spoke too soon. Looks like now I am getting: 2018-11-28T15:40:54Z W! [agent] input "inputs.vsphere" did not complete within its interval
What's your collect_concurrency setting? Try to increase it to, say, 5. Also, if you don't need instance-level (per-CPU, etc.) metrics, you can turn that off per resource type, which should save you a lot of collection time. Another thing you can try is to reduce the number of metrics collected to only those you need. We're just at the tail end of a huge scale testing and performance tuning effort and should be providing an update soon that has some performance tweaks. In our lab, we're collecting metrics for 7000 VMs, including instance data, in about 6 seconds.
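A sketch of those tweaks, using option names from the plugin's sample config (the metric list is illustrative only):

```toml
[[inputs.vsphere]]
  ## Run metric collection for several chunks in parallel.
  collect_concurrency = 5

  ## Skip instance-level (per-CPU, per-NIC, etc.) metrics.
  host_instances = false
  vm_instances = false

  ## Collect only the counters you actually need (example names).
  vm_metric_include = [
    "cpu.usage.average",
    "mem.usage.average",
    "net.usage.average",
  ]
```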
Not sure if I understand exactly what is happening, but it seems the initial discovery runs, then the plugin runs fine until the next discovery. When the next discovery runs, it doesn't complete, then the plugin doesn't seem to be sending any metrics, most likely due to the discovery failing. Does that sound plausible? If I raise the concurrency settings I think I will have to give more CPU to my vCenter DB servers. They have pegged out when I was playing with those in the past.
Try increasing the discovery interval to 30 minutes. The discovery logic is greatly improved in the version we're about to release. Should run 50-100 times faster! I can post a binary if you feel like testing it out.
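The discovery cadence is a separate setting from the collection interval; a 30-minute cadence would look roughly like this:

```toml
[[inputs.vsphere]]
  ## Rediscover inventory objects every 30 minutes (the default is 300s).
  object_discovery_interval = "1800s"
```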
BTW, the concurrency settings for metric collection shouldn't have a huge impact on database servers, at least not for VM and host metrics, since they are scraped from ESXi memory.
I can try the latest repo. Let me see how difficult it is to compile.
I had the same issue with the new version. The discovery finishes, the initial metric collection seems to finish (but I don't think it does). I think the plugin is hanging, and not ever finishing the initial collection. I am just getting: 2018-11-29T15:46:36Z D! [input.vsphere] Query for cluster returned metrics for 2 objects and the last bit keeps repeating. Never starts collecting metrics again.
Are you collecting datastore metrics? Try disabling that.
If that solves the problem, move the datastore collection to a separate instance of [inputs.vsphere] with an interval >= 300s. Collection of datastore metrics can take a VERY long time due to the way vCenter manages that data. If it doesn't complete within the interval, you'll see these kinds of problems. Also, let me point you to the very latest version that has some pretty radical performance improvements. Stand by!
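In config terms, the split looks roughly like this (a sketch; connection details are placeholders, and the exclude lists route each resource type to exactly one instance):

```toml
## Fast-moving resources (VMs, hosts, clusters) on the normal agent interval.
[[inputs.vsphere]]
  vcenters = ["https://vcenter.example.com/sdk"]
  username = "telegraf@vsphere.local"
  password = "secret"
  datastore_metric_exclude = ["*"]   # datastores handled by the instance below

## Datastores only, on a slower cadence.
[[inputs.vsphere]]
  interval = "300s"
  vcenters = ["https://vcenter.example.com/sdk"]
  username = "telegraf@vsphere.local"
  password = "secret"
  vm_metric_exclude = ["*"]
  host_metric_exclude = ["*"]
  cluster_metric_exclude = ["*"]
```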
Cool, that seemed to get things going on this current release. Do you know the timeframe for the new release?
The actual release timing is up to the influx team, but I can get you a snapshot from my branch today. Use at your own risk and all that, of course...
Here's a snapshot that's been tested in our lab for a few days without any issues. You're welcome to try it (at your own risk). I attached a compiled binary for Linux. Let me know if you need any other flavors. https://github.com/prydin/telegraf/releases/tag/PR-SCALE-IMPROVEMENT-BETA1
I still have the same issue with the binary you provided: 2018-12-03T13:50:00Z W! [agent] input "inputs.vsphere" did not complete within its interval. I get a 204 HTTP status on the InfluxDB API side: Dec 03 14:42:00 XXXXXXXXX influxd[32499]: [httpd] 10.12.168.11 - - [03/Dec/2018:14:42:00 +0100] "POST /write?db=iaaspriv HTTP/1.1" 204 0 "-" "Telegraf/unknown" 35891433-f701-11e8-846a-005056bc0ddf 5270. I'm only trying to collect a little information from a vCenter that contains 7669 VMs; here is my conf:
The "exclude" statements should read: datastore_metric_exclude = [ "*" ]
Also, do you get any debug statements starting with [input.vsphere]? You should at least see some statements saying that it's attempting to collect.
Sorry, I forgot to format my conf in markdown. And yes, I get debug entries with [input.vsphere]:
After that, none of these entries appear in my telegraf.log.
I'd need to see all the [input.vsphere] log lines to troubleshoot this. It looks like discovering the datastores takes a really long time. How many datastores do you have?
Also, what is the output of:
Telegraf version: For security reasons I can't give you the complete log, but the last interval didn't show any errors with the key [input.vsphere]. The output only says:
After this last line the process is still running and sending requests to InfluxDB, but without data (HTTP code 204).
@bashrc666 If possible, could you run telegraf with pprof enabled and then run: curl http://localhost:6060/debug/pprof/goroutine?debug=1 Copy and paste the output to this thread. The output doesn't contain any application data, so it should be safe to share. This will tell me exactly where the code locks up.
CONTEXT: I'm trying to collect simple VM metrics on a vCenter that manages 259 hosts. GO DUMP:
Latest log:
THANK YOU!!!! This gives me a pretty good idea what's wrong!
@bashrc666 Thanks again for the detailed information. It was extremely helpful. Here's a pre-release of what's on PR #5113: https://github.com/prydin/telegraf/releases/tag/PR-SCALE-IMPROVEMENT-RC1 Try it if you like. As always with a pre-release, you use it at your own risk.
Hello, I still have the same issue with the same vCenter. Version:
GO DUMP
STDOUT
@bashrc666 that output doesn't match the thread dump. The WorkerPool class doesn't exist anymore. Are you sure that's the right output? As for the dump, it looks like it's stuck on a slow call to vCenter. What's your concurrency setting? Is the vCenter slow in general?
My conf
It only happens on this particular, very big vCenter, which contains 29 clusters, 259 hosts, 8129 VMs, and a great many datastores. Maybe I have something to improve in this config? @prydin, thanks so much for the help.
@bashrc666 It's probably the datastore collection that takes a long time. Break it out into a separate declaration of [[inputs.vsphere]] and set the interval for that instance to 300s. Also, you're collecting every metric on the datastores. You can save some collection time by specifying a smaller set.
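For the smaller set, an include list replaces the collect-everything default; the counter names below are examples only, not a recommendation:

```toml
[[inputs.vsphere]]
  interval = "300s"
  ## ...placeholder connection details as above...
  datastore_metric_include = [
    "disk.used.latest",
    "disk.capacity.latest",
  ]
```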
@prydin I've decided to get rid of the datastore metrics for the moment, and get back to them when I'm sure that VM and host collection works on that vCenter. But telegraf stops working 10 to 20 minutes after it starts. CONFIG
GO DUMP
Can you grab the full goroutine stack dump from here: http://localhost:6060/debug/pprof/goroutine?debug=2
TELEGRAF VERSION
CONTEXT: I'm trying to collect simple VM metrics on a vCenter that manages 259 hosts. Telegraf stops working 20 to 30 minutes after it starts.
CONFIG
GO DUMP LEVEL 2
NOTE: I just figured out that when I run telegraf as a systemd unit, it fails like this, but when I run it as a Linux job with the same parameters as the systemd unit, it works properly for more than 2 hours. I really don't get it. Right now I'm setting up a proper InfluxDB Enterprise cluster to check whether this collection failure is caused by a standalone InfluxDB.
Update: My bad, the plugin works fine in release Telegraf unknown (git: prydin-scale-improvement 646c596). I just forgot to tell Grafana to connect metric points across intervals greater than 1 minute. I apologize for my huge mistake. I have increased my interval to 120s and it's working like a charm with all my vCenters.
I believe this is working, and now available, in 1.10.0 |
I am receiving this when the plugin collects metrics from the vCenter servers.
Any idea on what is wrong / how to fix?
```
2018-11-26T22:24:47Z E! [inputs.vsphere]: Error in plugin: ServerFaultCode: XML document element count exceeds configured maximum 500000
while parsing serialized DataObject of type vim.PerformanceManager.MetricId
at line 2, column 19637665
while parsing property "metricId" of static type ArrayOfPerfMetricId
while parsing serialized DataObject of type vim.PerformanceManager.QuerySpec
at line 2, column 19598059
while parsing call information for method QueryPerf
at line 2, column 66
while parsing SOAP body
at line 2, column 60
while parsing SOAP envelope
at line 2, column 0
while parsing HTTP request for method queryStats
on object of type vim.PerformanceManager
at line 1, column 0
```