Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup disk metrics #5

Merged

Conversation

carlpett
Copy link
Collaborator

Addresses #4 by:

  • Removing the pre-averaged metrics
  • Removing metrics that aggregate read and write into a single metric (since those are exposed separately)
  • Converting free_megabytes to free_bytes
  • Removing the *_base metrics, and fixing the percent-metrics

Because of how the percent metrics are calculated (percent_metric_name / percent_metric_name_base), they get "fake precision" when getting close to zero, eg wmi_perf_percent_disk_write_time{volume="HarddiskVolume3"} 2.931438995229291e-11.
Possibly we should truncate to zero at some point? @brian-brazil - any recommendations here?

@carlpett
Copy link
Collaborator Author

Also, just re-read the recommendations for writing exporters and about not exposing percentages. Fix coming.

@martinlindhe
Copy link
Collaborator

Looks great. Waiting for @brian-brazil to comment before merging

@carlpett
Copy link
Collaborator Author

I did some more cleanup, primarily cosmetic stuff on the metric names, as well as updating the help texts from perfmon.
I also changed from percentages to ratios. I'm not fully satisfied with the names, but not sure how to make them better. idle_time doesn't sound like a ratio, I'd expect seconds. Any ideas?

@carlpett
Copy link
Collaborator Author

It also seems that some of the data we get might actually be counters rather than gauges - the ones with CounterType 272696320 (eg reads_per_sec) in the documentation seem to be strictly increasing, despite the help text indicating that it is a rate...

@martinlindhe
Copy link
Collaborator

martinlindhe commented Aug 27, 2016

Regarding help texts, I think it's a good idea to keep the WMI names in there too (maybe in parenthesis, don't know if there are specific conventions for the help texts in prometheus), to make it meaningful for users to understand what the measure is derived from.

From the prometheus docs:

When you’re transforming metrics it’s useful for users to be able to
track back to what the original was, and what rules were in play that
caused that transform. Putting in the name of the collector/exporter,
the id of any rule that was applied and the name/details of the original
metric into the help string will greatly aid users.

https://prometheus.io/docs/instrumenting/writing_exporters/#help-strings

@carlpett
Copy link
Collaborator Author

Yes, agreed. What do you think about this format?

The number of requests outstanding on the disk (derived from LogicalDisk.CurrentDiskQueueLength)

@martinlindhe
Copy link
Collaborator

martinlindhe commented Aug 27, 2016

That would be great, also The number of requests outstanding on the disk (LogicalDisk.CurrentDiskQueueLength) works :)

@carlpett carlpett force-pushed the f-disk-metrics-cleanup branch from bea56fe to 63c7640 Compare August 27, 2016 12:00
@carlpett
Copy link
Collaborator Author

Done!

@martinlindhe
Copy link
Collaborator

By the way, feel free to rename perf.go to logical_disk.go in this PR

"AvgDiskBytesPerTransfer",
ReadsPerSec: prometheus.NewDesc(
prometheus.BuildFQName(wmiNamespace, "perf", "reads_per_sec"),
"The rate of read operations on the disk (LogicalDisk.DiskReadsPerSec)",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Over what time period?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not really specified in the documentation, sadly :( Would that make it useless to expose?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes it a lot less useful, but if it's all we have it's all we have.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs we need are https://msdn.microsoft.com/en-us/library/ms803973.aspx and then to https://msdn.microsoft.com/en-us/library/ms804018.aspx

This is actually a Counter, as we're pulling from PerfDataRaw.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just found the same thing, but at a different part of msdn (it's a maze of a site...): https://msdn.microsoft.com/en-us/windows/hardware/aa389383(v=vs.71). I'm changing the metrics to counters, now. What would be good new names? reads_total etc?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, wmi_perf_logicaldisk_reads_total.

Is the perf redundant?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the perf part has been dropped (not yet pushed, though). All the WMI classes we'll use are prefixed with perf.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add https://msdn.microsoft.com/en-us/library/ms803973.aspx to the PR too for future reference?

@brian-brazil
Copy link

Now that we've mostly figured out the type system, we should start a wiki page or something documenting the mapping from the various "Counter" types to Prometheus Metrics.

AvgDiskBytesPerTransfer_Base uint32
AvgDiskBytesPerWrite uint64
AvgDiskBytesPerWrite_Base uint32
AvgDiskQueueLength uint64

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out that this is a Counter - which makes no sense for a queue length,

What it actually is is disk requests queued.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to requests_queued_total

@carlpett
Copy link
Collaborator Author

Hm, the rename wasn't very good, now all the comments are hidden away in other diffs :/

@brian-brazil
Copy link

I'm still perusing the docs, but I have a crazy idea.

It looks like it's possible to get to CounterType with WMI (though we'll need to use a different WMI library). Given that and presuming that metric naming is relatively consistent we may be able to fully automate the transformation of metric names and the user would only have to provide a WMI class name. We'll likely still need some hardcoded rules to fix e.g. bytes->megabytes.

I also note we cannot completely rely on documentation alone for CounterTypes - several of the ones in this PR changed in Windows 2000 (which means this won't quite work on XP).

@carlpett
Copy link
Collaborator Author

@brian-brazil Interesting idea! We could just have a mapping list that depends on the OS version, for compatibility. Plus some utility functions for standard unit conversions in case they are not fully standardized.

Along with the discussion re IIS/SQL server metrics, if those should become separate exporters maybe such automatic transformation and conversion could form the base for a wmi exporter library that can be used to easily build new exporters?

@carlpett
Copy link
Collaborator Author

@brian-brazil I think I've addressed all the comments now, and the numbers look reasonable from what I can see

@brian-brazil
Copy link

The counter types is what changed between versions.

What I'd see happening for IIS/SQL is that you'd run the WMI exporter but with different flags just pulling in IIS/SQL metrics.


ch <- prometheus.MustNewConstMetric(
c.RequestsQueued,
prometheus.CounterValue,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gauge

@martinlindhe
Copy link
Collaborator

Hm, maybe need to rebase this branch. Appveyor is complaining it is unmergable without conflicts now.

@carlpett carlpett force-pushed the f-disk-metrics-cleanup branch from cb3027f to fcf5125 Compare August 28, 2016 10:06
@carlpett
Copy link
Collaborator Author

Yep, just pushed :) (Very nice work adding it, btw!)

@carlpett
Copy link
Collaborator Author

Need to squash before merging as well

@martinlindhe
Copy link
Collaborator

@brian-brazil interesting idea, indeed. there is hundreds of WMI sources, but I'm not sure most of them is very usable. Let's discuss that in a separate issue #12

@carlpett carlpett mentioned this pull request Aug 28, 2016
@brian-brazil
Copy link

👍

I'd suggest squashing commits.

@carlpett carlpett force-pushed the f-disk-metrics-cleanup branch from fcf5125 to 1caa658 Compare August 28, 2016 18:21
@carlpett
Copy link
Collaborator Author

Done!

@martinlindhe martinlindhe merged commit 36ed911 into prometheus-community:master Aug 29, 2016
@martinlindhe
Copy link
Collaborator

Great work!

@martinlindhe
Copy link
Collaborator

martinlindhe commented Aug 29, 2016

collectors/logical_disk.go | 265 ++++++++++++++++++++
collectors/perf.go         | 593 ---------------------------------------------

i love this diff :) thanks for all the work

@carlpett carlpett deleted the f-disk-metrics-cleanup branch September 5, 2016 20:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants