Documentation mismatches in regards to what metrics XPUM supports #20

Closed
eero-t opened this issue Oct 12, 2022 · 4 comments


@eero-t

eero-t commented Oct 12, 2022

I compared the following documents:

and the metrics each of them lists XPU Manager as providing. The CSV file info in particular seems very out of date, but the install guide also lists e.g. the frequency throttle ratio (as not supported by the current L0 backend), while the user guide does not. IMHO it would be better to have the list of supported metrics in a single place, and to refer to it from the other documents.

@taotod
Contributor

taotod commented Jan 6, 2023

Hi @eero-t, the telemetry metrics in the installation guide and the user guide are defined separately. For example, the throttle ratio is listed in the installation guide so that end users can enable throttle ratio collection in the XPU Manager daemon. However, we do not think it is useful for CLI end users, so the CLI does not provide it. As a result, the throttle ratio is not documented in the CLI user guide.

@eero-t
Author

eero-t commented Jan 9, 2023

Ok, fair enough. What about the CSV list?

Which of the documents should list all the supported metrics? And could that be linked from the other places mentioning metrics (see also #24)?

taotod closed this as completed Jan 31, 2024
@eero-t
Author

eero-t commented Jan 31, 2024

Those docs do not list error counters as supported for Flex, but RAS works fine for me on Flex cards (with the i915 backport kernel, as long as Sysman is run as root with the PERFMON capability):

# zello_sysman --ras
setting environment variable ZES_ENABLE_SYSMAN=1
Device Name = Intel(R) Data Center GPU Flex 170
Device Name = Intel(R) Data Center GPU Flex 170

 ----  Ras tests ---- 
rasProperties.type = 0
Number of correctable accelerator engine resets attempted by the driver = 0
Number of correctable errors that have occurred in caches = 0
Number of correctable programming errors that have occurred  = 0
Number of correctable driver errors that have occurred  = 0
Number of correctable compute errors that have occurred  = 0
Number of correctable non compute errors that have occurred  = 0
Number of correctable display errors that have occurred  = 0
Setting Total threshold = 14
...

?
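
For anyone who wants to check such counters from a script rather than by eye, here is a minimal sketch that pulls the "Number of ... = N" lines out of output shaped like the above. The function name `parse_ras_counters` is made up for illustration, and the line format is assumed only from this one sample; it is not a documented, stable output format of `zello_sysman`.

```python
import re

# A few lines copied from the sample output above; the format is assumed
# from this sample only and may differ between zello_sysman versions.
SAMPLE = """\
rasProperties.type = 0
Number of correctable accelerator engine resets attempted by the driver = 0
Number of correctable errors that have occurred in caches = 0
Number of correctable programming errors that have occurred  = 0
Setting Total threshold = 14
"""

# Matches "Number of <description> = <integer>", tolerating extra spaces
# around the "=" sign.
_COUNTER_LINE = re.compile(r"^Number of (.+?)\s*=\s*(\d+)\s*$")


def parse_ras_counters(text):
    """Return {description: count} for every counter line in `text`.

    Hypothetical helper: lines that do not look like counters
    (properties, thresholds, banners) are ignored.
    """
    counters = {}
    for line in text.splitlines():
        match = _COUNTER_LINE.match(line)
        if match:
            description, value = match.groups()
            counters[description] = int(value)
    return counters


if __name__ == "__main__":
    for description, value in parse_ras_counters(SAMPLE).items():
        print(f"{value:6d}  {description}")
```

Any non-zero value in the resulting dictionary would then be an error counter worth looking at.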

I would also suggest listing the metrics common to both platforms before the platform-specific ones, and/or grouping related metrics together in these lists, e.g.:

  • GPU frequency
  • GPU throttle reason

And:

  • GPU and GPU engine utilizations
  • GPU EU utilization
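
To make the suggested ordering concrete, here is a rough sketch of how such a list could be generated from a single table of metrics. The metric names are the ones from the examples above; the group labels, the second platform name ("Max") and the per-platform support sets are placeholders assumed for illustration, not taken from the XPUM documents.

```python
# Hypothetical support table: (metric name, group, platforms reporting it).
# The platform sets are illustrative assumptions, NOT the real XPUM support matrix.
ALL_PLATFORMS = {"Flex", "Max"}

METRICS = [
    ("GPU EU utilization", "utilization", {"Max"}),
    ("GPU frequency", "frequency", {"Flex", "Max"}),
    ("GPU and GPU engine utilizations", "utilization", {"Flex", "Max"}),
    ("GPU throttle reason", "frequency", {"Flex", "Max"}),
]


def ordered_metrics(metrics, all_platforms):
    """Sort metrics so that ones common to all platforms come first,
    then keep related metrics (same group) next to each other."""

    def sort_key(entry):
        name, group, platforms = entry
        is_common = platforms >= all_platforms  # superset test on sets
        # False sorts before True, so common metrics are listed first.
        return (not is_common, group, name)

    return sorted(metrics, key=sort_key)


if __name__ == "__main__":
    for name, group, platforms in ordered_metrics(METRICS, ALL_PLATFORMS):
        print(f"{name}  [{group}; {', '.join(sorted(platforms))}]")
```

Keeping one such table as the source and generating each document's list from it would also cover the "single place" idea from the start of this issue.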
