-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vSphere plugin issue 4789 (datastore metrics missing) #4968
Conversation
* Added "hack" for limited chunk size for clusters. * Added lookback to include late metrics
@@ -191,3 +211,43 @@ func (c *Client) GetServerTime(ctx context.Context) (time.Time, error) { | |||
} | |||
return *t, nil | |||
} | |||
|
|||
// GetMaxQueryMetrics returns the max_query_metrics setting as configured in vCenter | |||
func (c *Client) GetMaxQueryMetrics(ctx context.Context) (int, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ctx passed in isn't used, is this intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look a couple of lines down at om.Query(...). The context is used there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't that the new context created on line 217 from a context.Background()
though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! Yeah, I'm supposed to pass ctx into that call. The idea was to replace ctx with one that had timeout set.
plugins/inputs/vsphere/endpoint.go
Outdated
@@ -588,6 +591,17 @@ func (e *Endpoint) Collect(ctx context.Context, acc telegraf.Accumulator) error | |||
} | |||
|
|||
func (e *Endpoint) chunker(ctx context.Context, f PushFunc, res *resourceKind, now time.Time, latest time.Time) { | |||
maxMetrics := e.Parent.MaxQueryMetrics | |||
if maxMetrics == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe check if maxMetrics < 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
}, nil | ||
} | ||
// Adjust max query size if needed | ||
ctx3, cancel3 := context.WithTimeout(ctx, vs.Timeout.Duration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't have to change but you could probably just use a single context in this function since they all are canceled at the same time and with the same timeout duration. When you cancel a context all the child contexts are canceled as well. I see this in a few spots throughout the code and its not new but I thought I would mention it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could set it in the beginning, but it wouldn't be the same timeouts for every call. context.WithTimeout()
creates a context with an absolute deadline, so each call would have a shorter timeout than the previous one. One could argue how to interpret "timeout", but I assumed it meant the timeout for each call, hence the multiple contexts.
if err != nil { | ||
return nil, err | ||
} | ||
log.Printf("D! [input.vsphere] vCenter says max_query_metrics should be %d", n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this plugin has a very high number of log messages, obviously it's also one of the more complicated plugins so to some degree that's to be expected but I would like it to be a bit less. I think this one could probably go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pruned the debug logs a bit.
log.Printf("D! [input.vsphere] vCenter maxQueryMetrics is defined: %d", v) | ||
if v == -1 { | ||
// Whatever the server says, we never ask for more metrics than this. | ||
return absoluteMaxMetrics, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be a much larger max metrics than the current default of 256, is this intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only situation where this would kick in is if the setting in the config file is higher than absoluteMaxMetrics
. However, 100000 was way too high, so I lowered it to 10000.
plugins/inputs/vsphere/client.go
Outdated
} | ||
|
||
// No usable maxQueryMetrics setting. Infer based on version | ||
parts := strings.Split(c.Client.Client.ServiceContent.About.Version, ".") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check that this splits to the expected number of parts and if not return the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
if s, ok := res[0].GetOptionValue().Value.(string); ok { | ||
v, err := strconv.Atoi(s) | ||
if err == nil { | ||
log.Printf("D! [input.vsphere] vCenter maxQueryMetrics is defined: %d", v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think any path through this function should only emit at most a single log message at debug level, I would just include what the max_query_metrics
is set to and what method is used:
D! [input.vsphere] Set max_query_metrics to %d using vCenter settings
D! [input.vsphere] Set max_query_metrics to %d using vCenter version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed some logs.
// when checking query size, so keep it at a low value. | ||
// Revisit this when we better understand the reason why vCenter counts it this way! | ||
if res.name == "cluster" && maxMetrics > 10 { | ||
maxMetrics = 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ideally this maxMetrics adjustment should be done when the Endpoint is created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not changing the setting, just setting a local variable for this turn of the loop that goes through resource types. This is a bit of an ugly hack, since clusters seem to need special treatment for some reason. We're investigating this, but this hack solves the immediate problem.
plugins/inputs/vsphere/endpoint.go
Outdated
// Since non-realtime metrics are queries with a lookback, we need to check the high-water mark | ||
// to determine if this should be included. Only samples not seen before should be included. | ||
if !(res.realTime || e.hwMarks.IsNew(tsKey, ts)) { | ||
//log.Printf("D! [input.vsphere] Skipped %s for %s because we've already seen it", name, ts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove commented out code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
plugins/inputs/vsphere/tscache.go
Outdated
table: make(map[string]time.Time), | ||
done: make(chan struct{}), | ||
} | ||
go func(t *TSCache) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this done as a goroutine, what if we just run purge once after all chunks have been processed? We should aim for as little concurrency as we can get away with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Changing.
plugins/inputs/vsphere/vsphere.go
Outdated
} | ||
}(ep) | ||
} | ||
|
||
wg.Wait() | ||
if len(merr) > 0 { | ||
log.Printf("E! [input.vsphere] Error during Gather: %s", merr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't log here since you are returning the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
@@ -306,7 +312,7 @@ func init() { | |||
DiscoverConcurrency: 1, | |||
ForceDiscoverOnInit: false, | |||
ObjectDiscoveryInterval: internal.Duration{Duration: time.Second * 300}, | |||
Timeout: internal.Duration{Duration: time.Second * 20}, | |||
Timeout: internal.Duration{Duration: time.Second * 60}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update the sampleconfig and readme with the new default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Lowered absoluteMaxMetrics to 10,000 Update sample config and README.
(cherry picked from commit 2d782fb)
Required for all PRs:
Addresses issue #4789 (datastore metrics missing). Solved by widening the time window when collecting metrics to allow for metrics that are posted late by vCenter.