xpumd generates the same error messages ad nauseum #8

jmechalas · 2022-07-14T20:41:59Z

My instances of xpumd generate this error message for every card, at every poll interval:

[2022-07-14 13:37:47.002] [W] [144540-144567] partial monitoring failure: [toGetMemoryWrite:1394] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:47.002] [W] [144540-144580] partial monitoring failure: [toGetMemoryRead:1343] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:48.502] [W] [144540-144567] partial monitoring failure: [toGetMemoryRead:1343] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:49.002] [W] [144540-144568] partial monitoring failure: [toGetMemoryWrite:1394] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:49.003] [W] [144540-144567] partial monitoring failure: [toGetMemoryReadThroughput:1446] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:49.192] [W] [144540-144578] partial monitoring failure: [toGetMemoryBandwidth:1292] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:50.000] [W] [144540-144574] partial monitoring failure: [toGetMemoryReadThroughput:1446] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:50.076] [W] [144540-144578] partial monitoring failure: [toGetMemoryBandwidth:1292] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:50.502] [W] [144540-144580] partial monitoring failure: [toGetMemoryWriteThroughput:1498] zesMemoryGetBandwidth:0x7ffffffe

And it repeats forever. This floods the console/tty unless you redirect stderr. If you turn on the logfile option, the log file grows and grows and grows...

There needs to be a mechanism that keeps xpumd from repeating the same information basically forever, especially rapid-fire like this. It makes the logging feature unusable because the logs are filled with noise.
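To make the request concrete, here is a minimal sketch of one possible suppression mechanism: let each distinct message through once and drop later repeats. It is written with Python's standard `logging` module purely for illustration; xpumd's own logger is not shown in this thread, and the names `SuppressRepeats`, `ListHandler`, and `demo` are hypothetical.

```python
import logging


class SuppressRepeats(logging.Filter):
    """Pass each distinct message once, then drop identical repeats."""

    def __init__(self):
        super().__init__()
        self.seen = {}  # message text -> number of times it was logged

    def filter(self, record):
        key = record.getMessage()
        count = self.seen.get(key, 0)
        self.seen[key] = count + 1
        return count == 0  # True only for the first occurrence


class ListHandler(logging.Handler):
    """Hypothetical handler that collects messages in a list for inspection."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def demo():
    logger = logging.getLogger("xpumd_demo")  # illustrative logger name
    logger.propagate = False
    logger.setLevel(logging.WARNING)
    handler = ListHandler()
    handler.addFilter(SuppressRepeats())
    logger.addHandler(handler)
    # Simulate three poll intervals that each hit the same two failures.
    for _ in range(3):
        logger.warning("partial monitoring failure: [toGetMemoryRead:1343] zesMemoryGetBandwidth:0x7ffffffe")
        logger.warning("partial monitoring failure: [toGetMemoryWrite:1394] zesMemoryGetBandwidth:0x7ffffffe")
    return handler.messages
```

With this filter attached, the six simulated warnings reach the handler as two lines, one per distinct failure. The repeat counts kept in `seen` could be reported periodically so that the information is not lost entirely.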


eero-t · 2022-12-28T12:07:15Z

Rate-limiting is unlikely to help with that, so xpumd should either disable that particular query or (as the collectd-6.0 Sysman plugin does) disable the whole metric when its query returns an error.
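The "disable on error" idea can be sketched roughly as follows. This is not xpumd or collectd code; `MetricPoller`, `unsupported`, and the placeholder power reading are hypothetical, and the sketch assumes a failing query will keep failing on that platform.

```python
class MetricPoller:
    """Poll a set of metric queries, disabling any metric whose query fails."""

    def __init__(self, queries):
        # queries: metric name -> zero-argument callable that returns a
        # value or raises an exception when the query fails
        self.queries = dict(queries)
        self.disabled = {}  # metric name -> error text that disabled it
        self.log = []       # messages that would go to the log file

    def poll(self):
        results = {}
        for name, query in self.queries.items():
            if name in self.disabled:
                continue  # already known to fail: skip silently
            try:
                results[name] = query()
            except Exception as exc:
                # Log once, then stop querying this metric.
                self.disabled[name] = str(exc)
                self.log.append(f"disabling {name}: {exc}")
        return results


def unsupported():
    # Stand-in for a query that fails the way the log above shows.
    raise RuntimeError("zesMemoryGetBandwidth:0x7ffffffe")


poller = MetricPoller({
    "POWER": lambda: 42.0,  # placeholder reading, not real data
    "MEMORY_BANDWIDTH": unsupported,
})
for _ in range(3):
    last = poller.poll()
```

After three polls, `poller.log` holds a single entry for `MEMORY_BANDWIDTH`, while `POWER` is still collected on every interval.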

> It makes the logging feature unusable because the logs are filled with noise.

As a workaround, you can disable querying the given metric. The API for that is a bit awkward, though: one has to give XPUM a bitmask of the metrics to enable, instead of being able to name the metrics that should be disabled (as is the case with the collectd-6.0 Sysman plugin).

jmechalas · 2023-01-03T16:31:08Z

The defaults for xpumd should result in sane behavior.

eero-t · 2023-01-05T11:46:17Z

Sure. I filed a bug (#24) about the missing documentation for the env vars needed to work around this issue.

yupengzh-intel · 2023-02-21T04:05:00Z

The log shows that the API "zesMemoryGetBandwidth" is not working on your platform. By default, xpumd collects a predefined set of metrics. You can check which ones by running "xpumd -h", which prints help info like the following:

xpumd -h

Usage: xpumd [OPTIONS]

Options:
-h, --help                      print this help
-p, --pid_file=filename         PID file used by daemonized app
-s, --socket_folder=foldername  folder for socket files used by daemonized app
-d, --dump_folder=foldername    dump folder used by daemonized app
    --log_level=LEVEL           log level (trace, debug, info, warn, error)
-l, --log_file=filename         logfile to write
    --log_max_size=number       max size of log file in MB
    --log_max_files=number      max number of log files
-m, --enable_metrics=METRICS    list enabled metric indexes, seperated by comma,
                                use hyphen to indicate a range (e.g., 0,4-7,27-29)

Index  Metric                                              Default
-----  --------------------------------------------------  -------
0      GPU_UTILIZATION                                     on
1      EU_ACTIVE                                           off
2      EU_STALL                                            off
3      EU_IDLE                                             off
4      POWER                                               on
5      ENERGY                                              on
6      GPU_FREQUENCY                                       on
7      GPU_CORE_TEMPERATURE                                on
8      MEMORY_USED                                         on
9      MEMORY_UTILIZATION                                  on
10     MEMORY_BANDWIDTH                                    on
11     MEMORY_READ                                         on
12     MEMORY_WRITE                                        on
13     MEMORY_READ_THROUGHPUT                              on
14     MEMORY_WRITE_THROUGHPUT                             on
15     ENGINE_GROUP_COMPUTE_ALL_UTILIZATION                on
16     ENGINE_GROUP_MEDIA_ALL_UTILIZATION                  on
17     ENGINE_GROUP_COPY_ALL_UTILIZATION                   on
18     ENGINE_GROUP_RENDER_ALL_UTILIZATION                 on
19     ENGINE_GROUP_3D_ALL_UTILIZATION                     on
20     RAS_ERROR_CAT_RESET                                 on
21     RAS_ERROR_CAT_PROGRAMMING_ERRORS                    on
22     RAS_ERROR_CAT_DRIVER_ERRORS                         on
23     RAS_ERROR_CAT_CACHE_ERRORS_CORRECTABLE              on
24     RAS_ERROR_CAT_CACHE_ERRORS_UNCORRECTABLE            on
25     RAS_ERROR_CAT_DISPLAY_ERRORS_CORRECTABLE            on
26     RAS_ERROR_CAT_DISPLAY_ERRORS_UNCORRECTABLE          on
27     RAS_ERROR_CAT_NON_COMPUTE_ERRORS_CORRECTABLE        on
28     RAS_ERROR_CAT_NON_COMPUTE_ERRORS_UNCORRECTABLE      on
29     GPU_REQUEST_FREQUENCY                               on
30     MEMORY_TEMPERATURE                                  on
31     FREQUENCY_THROTTLE                                  on
32     PCIE_READ_THROUGHPUT                                off
33     PCIE_WRITE_THROUGHPUT                               off
34     PCIE_READ                                           off
35     PCIE_WRITE                                          off
36     ENGINE_UTILIZATION                                  on
37     FABRIC_THROUGHPUT                                   on
38     FREQUENCY_THROTTLE_REASON_GPU                       on

As you can see, the memory-related metrics are enabled by default. If you don't want to collect them, use the "-m" option to list only the metrics you do want enabled; xpumd should then stop printing these warnings.
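Since "-m" takes the metrics to enable rather than the ones to skip, the argument has to be derived from the default set. The helper below is a small sketch of that calculation, using the indexes and defaults from the help output above. It assumes the "-m" list replaces the default set instead of adding to it; `DEFAULT_ON`, `to_ranges`, and `enable_metrics_arg` are hypothetical names, not part of xpumd.

```python
# Indexes marked "on" in the help table above: 0, 4-31, and 36-38.
DEFAULT_ON = {0} | set(range(4, 32)) | set(range(36, 39))


def to_ranges(indexes):
    """Format indexes in the style the help text shows, e.g. 0,4-7,27-29."""
    ordered = sorted(indexes)
    parts = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the current range while the next index is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


def enable_metrics_arg(disable):
    """Return the -m value that keeps the defaults minus the given indexes."""
    return to_ranges(DEFAULT_ON - set(disable))


# Indexes 10-14 are the five memory bandwidth/read/write metrics whose
# queries fail in the log at the top of this issue.
arg = enable_metrics_arg(range(10, 15))
print(f"xpumd -m {arg}")
```

For the failing memory metrics this prints `xpumd -m 0,4-9,15-31,36-38`.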

eero-t · 2023-05-17T16:11:48Z

The latest release, V1.2.9, constantly logs (roughly every 2 s) that the getFabricPorts result is false.

huiqiwa closed this as completed Mar 24, 2023
