vSphere plugin issue 4789 (datastore metrics missing) #4968

prydin · 2018-11-06T13:05:32Z

Required for all PRs:

Signed CLA.
Associated README.md updated.
Has appropriate unit tests.

Addresses issue #4789 (datastore metrics missing). Solved by widening the time window when collecting metrics to allow for metrics that are posted late by vCenter.

* Added "hack" for limited chunk size for clusters.
* Added lookback to include late metrics.

glinton · 2018-11-06T18:24:37Z

plugins/inputs/vsphere/client.go

@@ -191,3 +211,43 @@ func (c *Client) GetServerTime(ctx context.Context) (time.Time, error) {
 	}
 	return *t, nil
 }
+
+// GetMaxQueryMetrics returns the max_query_metrics setting as configured in vCenter
+func (c *Client) GetMaxQueryMetrics(ctx context.Context) (int, error) {


ctx passed in isn't used, is this intended?

Look a couple of lines down at om.Query(...). The context is used there.

Isn't that the new context created on line 217 from context.Background(), though?

Oh! Yeah, I'm supposed to pass ctx into that call. The idea was to replace ctx with one that had timeout set.

glinton · 2018-11-06T18:27:23Z

plugins/inputs/vsphere/endpoint.go

@@ -588,6 +591,17 @@ func (e *Endpoint) Collect(ctx context.Context, acc telegraf.Accumulator) error
 }

 func (e *Endpoint) chunker(ctx context.Context, f PushFunc, res *resourceKind, now time.Time, latest time.Time) {
+	maxMetrics := e.Parent.MaxQueryMetrics
+	if maxMetrics == 0 {


Maybe check if maxMetrics < 1

Good point.

danielnelson · 2018-11-06T19:16:17Z

plugins/inputs/vsphere/client.go

-	}, nil
+	}
+	// Adjust max query size if needed
+	ctx3, cancel3 := context.WithTimeout(ctx, vs.Timeout.Duration)


This doesn't have to change, but you could probably use a single context in this function, since they are all canceled at the same time and have the same timeout duration. When you cancel a context, all of its child contexts are canceled as well. I see this in a few spots throughout the code and it's not new, but I thought I would mention it.

I could set it in the beginning, but it wouldn't be the same timeouts for every call. context.WithTimeout() creates a context with an absolute deadline, so each call would have a shorter timeout than the previous one. One could argue how to interpret "timeout", but I assumed it meant the timeout for each call, hence the multiple contexts.

danielnelson · 2018-11-06T19:19:13Z

plugins/inputs/vsphere/client.go

+	if err != nil {
+		return nil, err
+	}
+	log.Printf("D! [input.vsphere] vCenter says max_query_metrics should be %d", n)


Overall, this plugin has a very high number of log messages. It's also one of the more complicated plugins, so to some degree that's to be expected, but I would like there to be fewer. I think this one could probably go.

Pruned the debug logs a bit.

danielnelson · 2018-11-06T19:27:26Z

plugins/inputs/vsphere/client.go

+					log.Printf("D! [input.vsphere] vCenter maxQueryMetrics is defined: %d", v)
+					if v == -1 {
+						// Whatever the server says, we never ask for more metrics than this.
+						return absoluteMaxMetrics, nil


This would be a much larger max metrics than the current default of 256, is this intended?

The only situation where this would kick in is if the setting in the config file is higher than absoluteMaxMetrics. However, 100000 was way too high, so I lowered it to 10000.
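A rough sketch of the capping behavior described above. The `-1` branch and the value 10000 come from the diff and the reply; treating any larger server value the same way, and combining the result with the configured value by taking the smaller of the two, are assumptions made for illustration. `serverLimit` and `effectiveLimit` are hypothetical names.

```go
package main

import "fmt"

// absoluteMaxMetrics is the hard upper bound, using the lowered value
// of 10000 mentioned in the reply.
const absoluteMaxMetrics = 10000

// serverLimit interprets the raw maxQueryMetrics value reported by the
// server. A value of -1 is mapped to the absolute cap, as in the diff;
// capping larger values as well is an assumption.
func serverLimit(raw int) int {
	if raw == -1 || raw > absoluteMaxMetrics {
		return absoluteMaxMetrics
	}
	return raw
}

// effectiveLimit returns the smaller of the configured value and the
// server-derived limit (an assumed way of combining the two).
func effectiveLimit(configured, raw int) int {
	limit := serverLimit(raw)
	if configured < limit {
		return configured
	}
	return limit
}

func main() {
	fmt.Println(effectiveLimit(256, -1))   // configured value below the cap
	fmt.Println(effectiveLimit(20000, -1)) // configured value above the cap
	fmt.Println(effectiveLimit(256, 64))   // server reports a lower limit
}
```

Under these assumptions the cap only changes the outcome when the configured value exceeds it, which matches the reply above.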

danielnelson · 2018-11-06T19:31:10Z

plugins/inputs/vsphere/client.go

+	}
+
+	// No usable maxQueryMetrics setting. Infer based on version
+	parts := strings.Split(c.Client.Client.ServiceContent.About.Version, ".")


Check that this splits to the expected number of parts and if not return the default.
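One way the requested check could look, as a sketch only. `inferMaxQueryMetrics` is a hypothetical helper, the fallback of 256 is assumed from the default mentioned elsewhere in this review, and the version thresholds are illustrative rather than taken from the PR.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// fallbackMaxMetrics is an assumed default for unparseable versions.
const fallbackMaxMetrics = 256

// inferMaxQueryMetrics guesses a query size from a version string such as
// "6.5.0". Any string without at least a numeric major and minor part
// yields the fallback instead of an index-out-of-range panic.
func inferMaxQueryMetrics(version string) int {
	parts := strings.Split(version, ".")
	if len(parts) < 2 {
		return fallbackMaxMetrics
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return fallbackMaxMetrics
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return fallbackMaxMetrics
	}
	// Illustrative thresholds only; the real mapping lives in the plugin.
	if major < 6 || (major == 6 && minor == 0) {
		return 64
	}
	return 256
}

func main() {
	for _, v := range []string{"6.7.0", "6.0.0", "6", "", "six.seven"} {
		fmt.Printf("%q -> %d\n", v, inferMaxQueryMetrics(v))
	}
}
```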

danielnelson · 2018-11-06T19:36:23Z

plugins/inputs/vsphere/client.go

+			if s, ok := res[0].GetOptionValue().Value.(string); ok {
+				v, err := strconv.Atoi(s)
+				if err == nil {
+					log.Printf("D! [input.vsphere] vCenter maxQueryMetrics is defined: %d", v)


I think any path through this function should only emit at most a single log message at debug level, I would just include what the max_query_metrics is set to and what method is used:

D! [input.vsphere] Set max_query_metrics to %d using vCenter settings
D! [input.vsphere] Set max_query_metrics to %d using vCenter version

Removed some logs.

danielnelson · 2018-11-06T19:38:07Z

plugins/inputs/vsphere/endpoint.go

+	// when checking query size, so keep it at a low value.
+	// Revisit this when we better understand the reason why vCenter counts it this way!
+	if res.name == "cluster" && maxMetrics > 10 {
+		maxMetrics = 10


I think ideally this maxMetrics adjustment should be done when the Endpoint is created.

I'm not changing the setting, just setting a local variable for this turn of the loop that goes through resource types. This is a bit of an ugly hack, since clusters seem to need special treatment for some reason. We're investigating this, but this hack solves the immediate problem.
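The per-iteration adjustment can be sketched like this. `chunkSizeFor` is a hypothetical helper, and the fallback of 256 is an assumption based on the default mentioned elsewhere in this review; the cluster limit of 10 and the `< 1` guard come from the diff and the earlier review comment. The configured value itself is never modified; only a local copy is.

```go
package main

import "fmt"

// defaultMaxMetrics is an assumed fallback for unset or invalid settings.
const defaultMaxMetrics = 256

// clusterMaxMetrics mirrors the hard-coded cluster limit in the diff.
const clusterMaxMetrics = 10

// chunkSizeFor returns the chunk size to use for one resource kind.
// It works on a local copy, so the configured setting stays untouched.
func chunkSizeFor(resName string, configured int) int {
	maxMetrics := configured
	if maxMetrics < 1 {
		maxMetrics = defaultMaxMetrics
	}
	if resName == "cluster" && maxMetrics > clusterMaxMetrics {
		maxMetrics = clusterMaxMetrics
	}
	return maxMetrics
}

func main() {
	configured := 256
	for _, kind := range []string{"vm", "host", "cluster", "datastore"} {
		fmt.Printf("%s: %d\n", kind, chunkSizeFor(kind, configured))
	}
	fmt.Println("configured value is still", configured)
}
```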

danielnelson · 2018-11-06T19:41:19Z

plugins/inputs/vsphere/endpoint.go

+				// Since non-realtime metrics are queries with a lookback, we need to check the high-water mark
+				// to determine if this should be included. Only samples not seen before should be included.
+				if !(res.realTime || e.hwMarks.IsNew(tsKey, ts)) {
+					//log.Printf("D! [input.vsphere] Skipped %s for %s because we've already seen it", name, ts)


Remove commented out code.

danielnelson · 2018-11-06T19:47:36Z

plugins/inputs/vsphere/tscache.go

+		table: make(map[string]time.Time),
+		done:  make(chan struct{}),
+	}
+	go func(t *TSCache) {


I don't think we need this done as a goroutine, what if we just run purge once after all chunks have been processed? We should aim for as little concurrency as we can get away with.

Good point. Changing.

danielnelson · 2018-11-06T20:10:21Z

plugins/inputs/vsphere/vsphere.go

 			}
 		}(ep)
 	}

 	wg.Wait()
+	if len(merr) > 0 {
+		log.Printf("E! [input.vsphere] Error during Gather: %s", merr)


Don't log here since you are returning the error.

danielnelson · 2018-11-06T20:11:03Z

plugins/inputs/vsphere/vsphere.go

@@ -306,7 +312,7 @@ func init() {
 			DiscoverConcurrency:     1,
 			ForceDiscoverOnInit:     false,
 			ObjectDiscoveryInterval: internal.Duration{Duration: time.Second * 300},
-			Timeout:                 internal.Duration{Duration: time.Second * 20},
+			Timeout:                 internal.Duration{Duration: time.Second * 60},


Update the sample config and README with the new default.

Lowered absoluteMaxMetrics to 10,000. Updated the sample config and README.

prydin added 9 commits October 15, 2018 12:57

Implemented LUN to datasource translation

41691f6

Cleaned up logging

5b4e56f

Merge branch 'prydin-issue-4855' into prydin-issue-4789

2bb834f

* Fixed error reporting for failed metric collections.

0591b1d

* Added "hack" for limited chunk size for clusters.
* Added lookback to include late metrics.

Increased lookback to 3

f2579e1

Removed some debug statements

7f0aa4a

Merged from upstream

c4babb1

Get maxQueryMetrics from server

bfeaaf4

Added test cases

e867953

prydin changed the title ~~Prydin issue 4789~~ vSphere plugin issue 4789 (datastore metrics missing) Nov 6, 2018

glinton reviewed Nov 6, 2018

prydin added 2 commits November 6, 2018 11:47

Made check on maxMetrics a bit safer

b881b06

Fixed context handling in GetMaxQueryMetrics

2462336

danielnelson reviewed Nov 6, 2018

Removed some debug statements.

c3730a6

Lowered absoluteMaxMetrics to 10,000. Updated the sample config and README.

danielnelson added this to the 1.9.0 milestone Nov 6, 2018

danielnelson added the fix (pr to fix corresponding bug) and area/vsphere labels Nov 6, 2018

danielnelson merged commit 2d782fb into influxdata:master Nov 6, 2018

danielnelson mentioned this pull request Nov 6, 2018

vSphere Input does not collect datastore metrics #4789

Closed

danielnelson pushed a commit that referenced this pull request Nov 6, 2018

Fix potential missing datastore metrics in vSphere plugin (#4968)

fc531c6

(cherry picked from commit 2d782fb)

prydin mentioned this pull request Nov 7, 2018

vSphere: Automatically detect maximum query size #4949

Closed

otherpirate pushed a commit to otherpirate/telegraf that referenced this pull request Mar 15, 2019

Fix potential missing datastore metrics in vSphere plugin (influxdata…

0f82cd7

…#4968)

otherpirate pushed a commit to otherpirate/telegraf that referenced this pull request Mar 15, 2019

Fix potential missing datastore metrics in vSphere plugin (influxdata…

fc19756

…#4968)

dupondje pushed a commit to dupondje/telegraf that referenced this pull request Apr 22, 2019

Fix potential missing datastore metrics in vSphere plugin (influxdata…

33c0665

…#4968)

athoune pushed a commit to bearstech/telegraf that referenced this pull request Apr 17, 2020

Fix potential missing datastore metrics in vSphere plugin (influxdata…

7d73503

…#4968)
