xpumd generates the same error messages ad nauseum #8

Closed
jmechalas opened this issue Jul 14, 2022 · 5 comments

@jmechalas

My instances of xpumd generate this error message for every card, at every poll interval:

[2022-07-14 13:37:47.002] [W] [144540-144567] partial monitoring failure: [toGetMemoryWrite:1394] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:47.002] [W] [144540-144580] partial monitoring failure: [toGetMemoryRead:1343] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:48.502] [W] [144540-144567] partial monitoring failure: [toGetMemoryRead:1343] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:49.002] [W] [144540-144568] partial monitoring failure: [toGetMemoryWrite:1394] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:49.003] [W] [144540-144567] partial monitoring failure: [toGetMemoryReadThroughput:1446] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:49.192] [W] [144540-144578] partial monitoring failure: [toGetMemoryBandwidth:1292] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:50.000] [W] [144540-144574] partial monitoring failure: [toGetMemoryReadThroughput:1446] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:50.076] [W] [144540-144578] partial monitoring failure: [toGetMemoryBandwidth:1292] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:50.502] [W] [144540-144580] partial monitoring failure: [toGetMemoryWriteThroughput:1498] zesMemoryGetBandwidth:0x7ffffffe

And it repeats forever. This floods the console/tty unless you redirect stderr. If you turn on the logfile option, the log file grows and grows and grows...

There needs to be a mechanism that keeps xpumd from repeating the same information basically forever, especially rapid-fire like this. It makes the logging feature unusable because the logs are filled with noise.
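To get a sense of how little distinct information the flood contains, the warnings can be collapsed by their source. The sketch below is not part of xpumd. It is a hypothetical helper (`summarize` and `LINE_RE` are names invented here) that assumes the log line layout shown in the excerpt above and counts warnings per function, API call, and result code, ignoring timestamps and thread ids.

```python
import re
from collections import Counter

# Matches the warning lines quoted in this issue, for example:
# [2022-07-14 13:37:47.002] [W] [144540-144567] partial monitoring failure: [toGetMemoryWrite:1394] zesMemoryGetBandwidth:0x7ffffffe
# The layout is inferred from that excerpt only; other xpumd messages may differ.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<level>\w)\] \[(?P<ids>\d+-\d+)\] "
    r"partial monitoring failure: \[(?P<func>\w+):(?P<line>\d+)\] "
    r"(?P<api>\w+):(?P<code>0x[0-9a-fA-F]+)$"
)


def summarize(lines):
    """Count warnings per (function, API, result code); skip lines that do not match."""
    counts = Counter()
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if match:
            counts[(match["func"], match["api"], match["code"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[2022-07-14 13:37:47.002] [W] [144540-144567] partial monitoring failure: [toGetMemoryWrite:1394] zesMemoryGetBandwidth:0x7ffffffe",
        "[2022-07-14 13:37:47.002] [W] [144540-144580] partial monitoring failure: [toGetMemoryRead:1343] zesMemoryGetBandwidth:0x7ffffffe",
        "[2022-07-14 13:37:48.502] [W] [144540-144567] partial monitoring failure: [toGetMemoryRead:1343] zesMemoryGetBandwidth:0x7ffffffe",
    ]
    for (func, api, code), n in summarize(sample).most_common():
        print(f"{n:>3}x {func} -> {api}:{code}")
```

Run against a real log file, this reduces the output to a handful of rows, one per failing query, which is roughly what a de-duplicating logger would have to emit.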

@eero-t

eero-t commented Dec 28, 2022

Rate-limiting is unlikely to help with that, so it should either disable that particular query, or (like collectd-6.0 Sysman plugin does) disable the whole metric on its query errors.
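The "disable the query on error" behaviour suggested above could look roughly like the following. This is a minimal sketch under stated assumptions, not xpumd's or collectd's actual code: `MetricPoller`, `poll_once`, and the callables passed to it are all hypothetical, and a Python exception stands in for a failing Level Zero call.

```python
import logging

log = logging.getLogger("poller-sketch")


class MetricPoller:
    """Polls named metric queries and stops polling any query that fails."""

    def __init__(self, queries):
        # queries: dict mapping a metric name to a zero-argument callable
        self.queries = dict(queries)
        self.disabled = set()

    def poll_once(self):
        """Run every enabled query once and return the successful results."""
        results = {}
        for name, query in self.queries.items():
            if name in self.disabled:
                continue
            try:
                results[name] = query()
            except Exception as exc:
                # Warn a single time, then never query this metric again.
                log.warning("disabling metric %s after query error: %s", name, exc)
                self.disabled.add(name)
        return results


if __name__ == "__main__":
    def broken_bandwidth():
        # Stand-in for a query that fails on this platform.
        raise RuntimeError("zesMemoryGetBandwidth:0x7ffffffe")

    poller = MetricPoller({"MEMORY_BANDWIDTH": broken_bandwidth, "POWER": lambda: 42.0})
    print(poller.poll_once())  # logs one warning for MEMORY_BANDWIDTH
    print(poller.poll_once())  # MEMORY_BANDWIDTH is skipped silently
```

The trade-off of this design is that a transient failure permanently disables the metric until restart; a real implementation might re-enable it after a cool-down period.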

> It makes the logging feature unusable because the logs are filled with noise.

As a workaround, you can disable querying the given metric. The API for that is a bit awkward, though: one needs to give XPUM a bitmask of which metrics to enable, instead of being able to specify which named metrics should be disabled (as is the case with the collectd-6.0 Sysman plugin).
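Going from "named metrics to disable" to "bitmask of metrics to enable" is simple bit arithmetic. The sketch below rests on an assumption that this thread does not confirm: that bit i of the mask corresponds to metric index i as listed by `xpumd -h` later in this thread. The helper names (`mask_from_indexes`, `disable_named`) are invented for illustration, and only the memory bandwidth metrics are named.

```python
# Indexes taken from the `xpumd -h` table quoted later in this thread.
# ASSUMPTION: bit i of the enable mask corresponds to metric index i.
METRIC_INDEX = {
    "MEMORY_BANDWIDTH": 10,
    "MEMORY_READ": 11,
    "MEMORY_WRITE": 12,
    "MEMORY_READ_THROUGHPUT": 13,
    "MEMORY_WRITE_THROUGHPUT": 14,
}


def mask_from_indexes(indexes):
    """Set one bit per metric index."""
    mask = 0
    for i in indexes:
        mask |= 1 << i
    return mask


def disable_named(enabled_mask, names):
    """Clear the bits of the named metrics in an enable mask."""
    for name in names:
        enabled_mask &= ~(1 << METRIC_INDEX[name])
    return enabled_mask


if __name__ == "__main__":
    # Metrics marked "on" in the help table: 0, 4-31, 36-38.
    default_on = [0, *range(4, 32), *range(36, 39)]
    mask = disable_named(mask_from_indexes(default_on), METRIC_INDEX)
    print(hex(mask))
```

How the resulting mask is handed to XPUM (which environment variable or API call) is exactly the undocumented part, so it is left out here.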

@jmechalas
Author

The defaults for xpumd should result in sane behavior.

@eero-t

eero-t commented Jan 5, 2023

Sure. I filed a bug for the missing documentation of the env vars needed to work around this issue (#24).

@yupengzh-intel

The log shows that the API "zesMemoryGetBandwidth" is not working on your platform. By default, xpumd collects a predefined set of metrics. You can check which ones by running "xpumd -h", which prints help like the following:

xpumd -h

Usage: xpumd [OPTIONS]

Options:
  -h, --help                          print this help
  -p, --pid_file=filename             PID file used by daemonized app
  -s, --socket_folder=foldername      folder for socket files used by daemonized app
  -d, --dump_folder=foldername        dump folder used by daemonized app
      --log_level=LEVEL               log level (trace, debug, info, warn, error)
  -l, --log_file=filename             logfile to write
      --log_max_size=number           max size of log file in MB
      --log_max_files=number          max number of log files
  -m, --enable_metrics=METRICS        list enabled metric indexes, seperated by comma,
                                      use hyphen to indicate a range (e.g., 0,4-7,27-29)

Index  Metric                                              Default
-----  --------------------------------------------------  -------
0      GPU_UTILIZATION                                     on
1      EU_ACTIVE                                           off
2      EU_STALL                                            off
3      EU_IDLE                                             off
4      POWER                                               on
5      ENERGY                                              on
6      GPU_FREQUENCY                                       on
7      GPU_CORE_TEMPERATURE                                on
8      MEMORY_USED                                         on
9      MEMORY_UTILIZATION                                  on
10     MEMORY_BANDWIDTH                                    on
11     MEMORY_READ                                         on
12     MEMORY_WRITE                                        on
13     MEMORY_READ_THROUGHPUT                              on
14     MEMORY_WRITE_THROUGHPUT                             on
15     ENGINE_GROUP_COMPUTE_ALL_UTILIZATION                on
16     ENGINE_GROUP_MEDIA_ALL_UTILIZATION                  on
17     ENGINE_GROUP_COPY_ALL_UTILIZATION                   on
18     ENGINE_GROUP_RENDER_ALL_UTILIZATION                 on
19     ENGINE_GROUP_3D_ALL_UTILIZATION                     on
20     RAS_ERROR_CAT_RESET                                 on
21     RAS_ERROR_CAT_PROGRAMMING_ERRORS                    on
22     RAS_ERROR_CAT_DRIVER_ERRORS                         on
23     RAS_ERROR_CAT_CACHE_ERRORS_CORRECTABLE              on
24     RAS_ERROR_CAT_CACHE_ERRORS_UNCORRECTABLE            on
25     RAS_ERROR_CAT_DISPLAY_ERRORS_CORRECTABLE            on
26     RAS_ERROR_CAT_DISPLAY_ERRORS_UNCORRECTABLE          on
27     RAS_ERROR_CAT_NON_COMPUTE_ERRORS_CORRECTABLE        on
28     RAS_ERROR_CAT_NON_COMPUTE_ERRORS_UNCORRECTABLE      on
29     GPU_REQUEST_FREQUENCY                               on
30     MEMORY_TEMPERATURE                                  on
31     FREQUENCY_THROTTLE                                  on
32     PCIE_READ_THROUGHPUT                                off
33     PCIE_WRITE_THROUGHPUT                               off
34     PCIE_READ                                           off
35     PCIE_WRITE                                          off
36     ENGINE_UTILIZATION                                  on
37     FABRIC_THROUGHPUT                                   on
38     FREQUENCY_THROTTLE_REASON_GPU                       on

As you can see, the memory-related metrics are enabled by default. If you don't want to collect these metrics, you can change the enabled set with the "-m" option; xpumd should then stop printing these error logs.
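Working out the "-m" argument by hand is easy to get wrong, so here is a small sketch that derives it from the table above. It assumes that "-m" replaces the default set rather than adding to it, which the help text suggests but does not state outright. The names `to_ranges`, `DEFAULT_ON`, and `MEMORY_BANDWIDTH_METRICS` are invented for this example.

```python
def to_ranges(indexes):
    """Compress integer indexes into the comma/hyphen form used by -m (e.g. 0,4-7,27-29)."""
    parts = []
    run_start = prev = None
    for i in sorted(set(indexes)):
        if prev is not None and i == prev + 1:
            prev = i  # extend the current run
            continue
        if run_start is not None:
            parts.append(str(run_start) if run_start == prev else f"{run_start}-{prev}")
        run_start = prev = i  # start a new run
    if run_start is not None:
        parts.append(str(run_start) if run_start == prev else f"{run_start}-{prev}")
    return ",".join(parts)


# Metrics marked "on" in the help table: 0, 4-31, 36-38.
DEFAULT_ON = {0, *range(4, 32), *range(36, 39)}
# Indexes 10-14: the metrics whose warnings all name zesMemoryGetBandwidth.
MEMORY_BANDWIDTH_METRICS = set(range(10, 15))

if __name__ == "__main__":
    print("xpumd -m " + to_ranges(DEFAULT_ON - MEMORY_BANDWIDTH_METRICS))
    # prints: xpumd -m 0,4-9,15-31,36-38
```

Indexes 8 and 9 (MEMORY_USED, MEMORY_UTILIZATION) are kept because none of the quoted warnings mention them.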

@huiqiwa huiqiwa closed this as completed Mar 24, 2023
@eero-t

eero-t commented May 17, 2023

The latest V1.2.9 release is constantly (at approx. 2s intervals) logging "getFabricPorts result false".
