Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vSphere plugin issue 4789 (datastore metrics missing) #4968

Merged
merged 12 commits into from
Nov 6, 2018
Merged

vSphere plugin issue 4789 (datastore metrics missing) #4968

merged 12 commits into from
Nov 6, 2018

Conversation

prydin
Copy link
Contributor

@prydin prydin commented Nov 6, 2018

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

Addresses issue #4789 (datastore metrics missing). Solved by widening the time window when collecting metrics to allow for metrics that are posted late by vCenter.

@prydin prydin changed the title Prydin issue 4789 vSphere plugin issue 4789 (datastore metrics missing) Nov 6, 2018
@@ -191,3 +211,43 @@ func (c *Client) GetServerTime(ctx context.Context) (time.Time, error) {
}
return *t, nil
}

// GetMaxQueryMetrics returns the max_query_metrics setting as configured in vCenter
func (c *Client) GetMaxQueryMetrics(ctx context.Context) (int, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ctx passed in isn't used, is this intended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look a couple of lines down at om.Query(...). The context is used there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't that the new context created on line 217 from a context.Background() though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Yeah, I'm supposed to pass ctx into that call. The idea was to replace ctx with one that had timeout set.

@@ -588,6 +591,17 @@ func (e *Endpoint) Collect(ctx context.Context, acc telegraf.Accumulator) error
}

func (e *Endpoint) chunker(ctx context.Context, f PushFunc, res *resourceKind, now time.Time, latest time.Time) {
maxMetrics := e.Parent.MaxQueryMetrics
if maxMetrics == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe check if maxMetrics < 1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}, nil
}
// Adjust max query size if needed
ctx3, cancel3 := context.WithTimeout(ctx, vs.Timeout.Duration)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't have to change but you could probably just use a single context in this function since they all are canceled at the same time and with the same timeout duration. When you cancel a context all the child contexts are canceled as well. I see this in a few spots throughout the code and its not new but I thought I would mention it.

Copy link
Contributor Author

@prydin prydin Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could set it in the beginning, but it wouldn't be the same timeouts for every call. context.WithTimeout() creates a context with an absolute deadline, so each call would have a shorter timeout than the previous one. One could argue how to interpret "timeout", but I assumed it meant the timeout for each call, hence the multiple contexts.

if err != nil {
return nil, err
}
log.Printf("D! [input.vsphere] vCenter says max_query_metrics should be %d", n)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this plugin has a very high number of log messages, obviously it's also one of the more complicated plugins so to some degree that's to be expected but I would like it to be a bit less. I think this one could probably go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pruned the debug logs a bit.

log.Printf("D! [input.vsphere] vCenter maxQueryMetrics is defined: %d", v)
if v == -1 {
// Whatever the server says, we never ask for more metrics than this.
return absoluteMaxMetrics, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a much larger max metrics than the current default of 256, is this intended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only situation where this would kick in is if the setting in the config file is higher than absoluteMaxMetrics. However, 100000 was way too high, so I lowered it to 10000.

}

// No usable maxQueryMetrics setting. Infer based on version
parts := strings.Split(c.Client.Client.ServiceContent.About.Version, ".")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check that this splits to the expected number of parts and if not return the default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if s, ok := res[0].GetOptionValue().Value.(string); ok {
v, err := strconv.Atoi(s)
if err == nil {
log.Printf("D! [input.vsphere] vCenter maxQueryMetrics is defined: %d", v)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think any path through this function should only emit at most a single log message at debug level, I would just include what the max_query_metrics is set to and what method is used:

D! [input.vsphere] Set max_query_metrics to %d using vCenter settings
D! [input.vsphere] Set max_query_metrics to %d using vCenter version

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed some logs.

// when checking query size, so keep it at a low value.
// Revisit this when we better understand the reason why vCenter counts it this way!
if res.name == "cluster" && maxMetrics > 10 {
maxMetrics = 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ideally this maxMetrics adjustment should be done when the Endpoint is created.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not changing the setting, just setting a local variable for this turn of the loop that goes through resource types. This is a bit of an ugly hack, since clusters seem to need special treatment for some reason. We're investigating this, but this hack solves the immediate problem.

// Since non-realtime metrics are queries with a lookback, we need to check the high-water mark
// to determine if this should be included. Only samples not seen before should be included.
if !(res.realTime || e.hwMarks.IsNew(tsKey, ts)) {
//log.Printf("D! [input.vsphere] Skipped %s for %s because we've already seen it", name, ts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove commented out code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

table: make(map[string]time.Time),
done: make(chan struct{}),
}
go func(t *TSCache) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this done as a goroutine, what if we just run purge once after all chunks have been processed? We should aim for as little concurrency as we can get away with.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Changing.

}
}(ep)
}

wg.Wait()
if len(merr) > 0 {
log.Printf("E! [input.vsphere] Error during Gather: %s", merr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't log here since you are returning the error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@@ -306,7 +312,7 @@ func init() {
DiscoverConcurrency: 1,
ForceDiscoverOnInit: false,
ObjectDiscoveryInterval: internal.Duration{Duration: time.Second * 300},
Timeout: internal.Duration{Duration: time.Second * 20},
Timeout: internal.Duration{Duration: time.Second * 60},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the sampleconfig and readme with the new default

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Lowered absoluteMaxMetrics to 10,000

Update sample config and README.
@danielnelson danielnelson added this to the 1.9.0 milestone Nov 6, 2018
@danielnelson danielnelson added fix pr to fix corresponding bug area/vsphere labels Nov 6, 2018
@danielnelson danielnelson merged commit 2d782fb into influxdata:master Nov 6, 2018
danielnelson pushed a commit that referenced this pull request Nov 6, 2018
otherpirate pushed a commit to otherpirate/telegraf that referenced this pull request Mar 15, 2019
otherpirate pushed a commit to otherpirate/telegraf that referenced this pull request Mar 15, 2019
dupondje pushed a commit to dupondje/telegraf that referenced this pull request Apr 22, 2019
athoune pushed a commit to bearstech/telegraf that referenced this pull request Apr 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/vsphere fix pr to fix corresponding bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants