# Intel(R) XPU System Management Interface User Guide
This guide describes how to use Intel(R) XPU System Management Interface to manage Intel GPU devices.
## Intel(R) XPU System Management Interface main features
* Show the device info.
* Update the device firmware.
## Help info
Show the XPU System Management Interface help info.
```
xpu-smi
Intel(R) XPU System Management Interface -- v1.0
Intel(R) XPU System Management Interface provides the Intel data center GPU device info. It can also be used to update the firmware.
Intel(R) XPU System Management Interface is based on Intel(R) oneAPI Level Zero. Before using XPU System Management Interface, the GPU driver and Intel(R) oneAPI Level Zero should be installed rightly.
Supported devices:
- Intel Data Center GPU
Usage: xpu-smi [Options]
xpu-smi -v
xpu-smi -h
xpu-smi discovery
Options:
-h, --help Print this help message and exit.
-v, --version Display version information and exit.
Subcommands:
discovery Discover the GPU devices installed on this machine and provide the device info.
diag Run some test suites to diagnose GPU.
health Get the GPU device component health status.
updatefw Update GPU firmware.
topology get the system topology
config Get and change the GPU settings.
ps List status of processes.
vgpu Create and remove virtual GPUs in SR-IOV configuration.
stats List the GPU statistics.
dump Dump device statistics data.
log Collect GPU debug logs.
```
Show the XPU System Management Interface version and the Level Zero version.
```
xpu-smi -v
CLI:
Version: 1.0.0.20211217
Build ID: f847c0fa
Service:
Version: 1.0.0.20211217
Build ID: f847c0fa
Level Zero Version: 1.6.2
```
## Discover the devices on this machine
Help info of the "discovery" subcommand
```
xpu-smi discovery -h
Discover the GPU devices installed on this machine and provide the device info.
Usage: xpu-smi discovery [Options]
xpu-smi discovery
xpu-smi discovery -d [deviceId]
xpu-smi discovery -d [pciBdfAddress]
xpu-smi discovery -d [deviceId] -j
xpu-smi discovery --dump
Options:
-h,--help Print this help message and exit.
-j,--json Print result in JSON format
--debug Print debug info
-d,--device Device ID or PCI BDF address to query. It will show more detailed info.
--pf,--physicalfunction Display the physical functions only.
--vf,--virtualfunction Display the virtual functions only.
--dump Property ID to dump device properties in CSV format. Separated by the comma.
1. Device ID
2. Device Name
3. Vendor Name
4. UUID
5. Serial Number
6. Core Clock Rate
7. Stepping
8. Driver Version
9. GFX Firmware Version
10. GFX Data Firmware Version
11. PCI BDF Address
12. PCI Slot
13. PCIe Generation
14. PCIe Max Link Width
15. OAM Socket ID
16. Memory Physical Size
17. Number of Memory Channels
18. Memory Bus Width
19. Number of EUs
20. Number of Media Engines
21. Number of Media Enhancement Engines
22. GFX Firmware Status
23. PCI Vendor ID
24. PCI Device ID
--listamcversions Show all AMC firmware versions.
-u,--username Username used to authenticate for host redfish access
-p,--password Password used to authenticate for host redfish access
```
Discover the devices on this machine
```
xpu-smi discovery
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information |
+-----------+--------------------------------------------------------------------------------------+
| 0 | Device Name: Intel(R) Graphics [0x020a] |
| | Vendor Name: Intel(R) Corporation |
| | UUID: 00000000-0000-0000-0000-020a00008086 |
| | PCI BDF Address: 0000:4d:00.0 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
```
Discover the devices on this machine and get the output in JSON format
```
xpu-smi discovery -j
{
"device_list": [
{
"device_id": 0,
"device_name": "Intel(R) Graphics [0x020a]",
"device_type": "GPU",
"pci_bdf_address": "0000:4d:00.0",
"pci_device_id": "0x20a",
"uuid": "00000000-0000-0000-0000-020a00008086",
"vendor_name": "Intel(R) Corporation"
}
]
}
```
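For scripting, the JSON output shown above can be parsed with standard tools. The sketch below is only an illustration and assumes the `jq` utility is installed; the field names come from the JSON example above.
```
# Sketch: list each device ID with its PCI BDF address (assumes jq is installed)
xpu-smi discovery -j | jq -r '.device_list[] | "\(.device_id) \(.pci_bdf_address)"'
```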
Show the detailed info of one device. The device info includes the model, frequency, driver/firmware info, PCI info, memory info and tile/execution unit info.
```
xpu-smi discovery -d 0000:4d:00.0
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information |
+-----------+--------------------------------------------------------------------------------------+
| 0 | Device Type: GPU |
| | Device Name: Intel(R) Graphics [0x020a] |
| | Vendor Name: Intel(R) Corporation |
| | UUID: 00000000-0000-0000-0000-020a00008086 |
| | Serial Number: unknown |
| | Core Clock Rate: 1400 MHz |
| | Stepping: B1 |
| | |
| | Driver Version: 16929133 |
| | Firmware Name: GFX |
| | Firmware Version: ATS0_1.1 |
| | |
| | PCI BDF Address: 0000:4d:00.0 |
| | PCI Slot: Riser 1, slot 1 |
| | PCIe Generation: 4 |
| | PCIe Max Link Width: 16 |
| | |
| | Memory Physical Size: 32768.00 MiB |
| | Max Mem Alloc Size: 4095.99 MiB |
| | ECC State: enabled |
| | Number of Memory Channels: 2 |
| | Memory Bus Width: 128 |
| | Max Hardware Contexts: 65536 |
| | Max Command Queue Priority: 0 |
| | |
| | Number of EUs: 512 |
| | Number of Tiles: 2 |
| | Number of Slices: 2 |
| | Number of Sub Slices per Slice: 30 |
| | Number of Threads per EU: 8 |
| | Physical EU SIMD Width: 8 |
| | Number of Media Engines: 2 |
| | Number of Media Enhancement Engines: 2 |
| | |
| | Number of Xe Link ports: 16 |
| | Max Tx/Rx Speed per Xe Link port: 51879.88 MiB/s |
| | Number of Lanes per Xe Link port: 4 |
+-----------+--------------------------------------------------------------------------------------+
```
Dump the specified GPU device info in CSV format.
```
xpu-smi discovery --dump 1,2,3,6
Device ID,Device Name,Vendor Name,Core Clock Rate
0,"Intel(R) Graphics [0x56c0]","Intel(R) Corporation","2050 MHz"
1,"Intel(R) Graphics [0x56c0]","Intel(R) Corporation","2050 MHz"
```
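The CSV dump is handy for inventory collection. Below is a minimal sketch, assuming a writable working directory; the output file name is just an example, and the property IDs (1, 2, 3, 5, 8) are taken from the help text above.
```
# Sketch: save Device ID, Device Name, Vendor Name, Serial Number and Driver Version to a dated CSV file
xpu-smi discovery --dump 1,2,3,5,8 > "gpu_inventory_$(date +%Y%m%d).csv"
```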
Show all AMC firmware versions. For the Redfish host interface, the BMC credentials need to be specified. For the IPMI interface, the BMC credentials are not required.
```
xpu-smi discovery --listamcversions
2 AMC are found
AMC 0 firmware version: 6.8.0.0
AMC 1 firmware version: 6.8.0.0
```
## Get the system topology
Help info of getting the system topology
```
xpu-smi topology
Get the system topology info.
Usage: xpu-smi topology [Options]
xpu-smi topology -d [deviceId]
xpu-smi topology -d [pciBdfAddress]
xpu-smi topology -d [deviceId] -j
xpu-smi topology -f [filename]
xpu-smi topology -m
optional arguments:
-h,--help Print this help message and exit
-j,--json Print result in JSON format
-d,--device The device ID or PCI BDF address to query
-f,--file Generate the system topology with the GPU info to a XML file.
-m,--matrix Print the CPU/GPU topology matrix.
S: Self
XL[laneCount]: Two tiles on the different cards are directly connected by Xe Link. Xe Link lane count is also provided.
XL*: Two tiles on the different cards are connected by Xe Link + MDF. They are not directly connected by Xe Link.
SYS: Connected with PCIe between NUMA nodes
NODE: Connected with PCIe within a NUMA node
MDF: Connected with Multi-Die Fabric Interface
```
Get the hardware topology info related to the GPU
```
xpu-smi topology -d 0
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Topology Information |
+-----------+--------------------------------------------------------------------------------------+
| 0 | Local CPU List: 0-23,48-71 |
| | Local CPUs: 000000ff,ffff0000,00ffffff |
| | PCIe Switch Count: 1 |
| | PCIe Switch: /sys/devices/pci0000:4a/0000:4a:02.0/0000:4b:00.0/0000:4c:00.0 |
| | /0000:4d:00.0/0000:4e:18.0/0000:54:00.0/0000:55:01 |
+-----------+--------------------------------------------------------------------------------------+
```
Generate the system hardware topology to an XML file.
```
xpu-smi topology -f topo.xml
The system topology is generated to the file topo.xml.
```
Generate the CPU/GPU topology matrix.
```
xpu-smi topology -m
GPU 0/0|GPU 0/1|GPU 1/0|GPU 1/1|CPU Affinity
GPU 0/0 |S |MDF |XL* |XL16 |0-23,48-71
GPU 0/1 |MDF |S |XL16 |XL* |0-23,48-71
GPU 1/0 |XL* |XL16 |S |MDF |24-47,72-95
GPU 1/1 |XL16 |XL* |MDF |S |24-47,72-95
```
## Update the GPU firmware
Help info of updating GPU firmware
```
xpu-smi updatefw
Update GPU firmware.
Usage: xpu-smi updatefw [Options]
xpu-smi updatefw -d [deviceId] -t GFX -f [imageFilePath]
xpu-smi updatefw -d [pciBdfAddress] -t GFX -f [imageFilePath]
Options:
-h,--help Print this help message and exit
-j,--json Print result in JSON format
-d,--device The device ID or PCI BDF address. If it is not specified, all devices will be updated.
-t,--type The firmware name. Valid options: GFX, GFX_DATA.
-f,--file The firmware image file path on this server.
-u,--username Username used to authenticate for host redfish access
-p,--password Password used to authenticate for host redfish access
-y,--assumeyes Assume that the answer to any question which would be asked is yes
--force Force GFX firmware update. This parameter only works for GFX firmware.
```
Update GPU GFX firmware
```
xpu-smi updatefw -t GFX -d 0 -f ATS_M75_128_B0_PVT_ES_033_dg2_gfx_fwupdate_SOC2.bin -y
This GPU card has multiple cores. This operation will update all firmware. Do you want to continue? (y/n)
Device 0 FW version: DG02_2.2271
Device 1 FW version: DG02_2.2271
Image FW version: DG02_2.2277
Do you want to continue? (y/n)
Start to update firmware
Firmware Name: GFX
Image path: /home/test/ATS_M75_128_B0_PVT_ES_033_dg2_gfx_fwupdate_SOC2.bin
[============================================================] 100 %
Update firmware successfully.
```
## Diagnose GPU with different test suites
When tests run on a GPU, the GPU is used exclusively, so there is an obvious performance impact on it. Some CPU performance may also be impacted.
Help info for GPU diagnostics
```
xpu-smi diag
Run some test suites to diagnose GPU.
Usage: xpu-smi diag [Options]
xpu-smi diag -d [deviceId] -l [level]
xpu-smi diag -d [pciBdfAddress] -l [level]
xpu-smi diag -d [deviceId] -l [level] -j
xpu-smi diag -d [deviceId] --singletest [testIds]
xpu-smi diag -d [pciBdfAddress] --singletest [testIds]
xpu-smi diag -d [deviceId] --singletest [testIds] -j
xpu-smi diag -d [pciBdfAddress] --singletest [testIds] -j
xpu-smi diag -d [deviceIds] --stress
xpu-smi diag -d [deviceIds] --stress --stresstime [stress time]
xpu-smi diag --precheck
xpu-smi diag --precheck -j
xpu-smi diag --precheck --listtypes
xpu-smi diag --precheck --listtypes -j
xpu-smi diag --precheck --gpu
xpu-smi diag --precheck --gpu -j
xpu-smi diag --stress
xpu-smi diag --stress --stresstime [time]
Options:
-h,--help Print this help message and exit.
-j,--json Print result in JSON format.
-d,--device The device IDs or PCI BDF addresses. -1 means all devices
-l,--level The diagnostic levels to run. The valid options include
1. quick test
2. medium test - this diagnostic level will have the significant performance impact on the specified GPUs
3. long test - this diagnostic level will have the significant performance impact on the specified GPUs
-s --stress Stress the GPU(s) for the specified time
--stresstime Stress time (in minutes). It is an optional parameter. If not specified, CLI does the stress until manually stopped.
--precheck Do the precheck on the GPU and GPU driver
--listtypes List all supported GPU error types
--gpu Show the GPU status only
--since Start time for log scanning. It only works with the journalctl option. The generic format is "YYYY-MM-DD HH:MM:SS".
Alternatively the strings "yesterday", "today" are also understood.
Relative times also may be specified, prefixed with "-" referring to times before the current time.
Scanning would start from the latest boot if it was not specified.
--singletest Selectively run some particular tests. Separated by the comma.
1. Computation
2. Memory Error
3. Memory Bandwidth
4. Media Codec
5. PCIe Bandwidth
6. Power
7. Computation functional test
8. Media Codec functional test
9. Xe Link Throughput
Note that in a multi NUMA node server, it may need to use numactl to specify which node the PCIe bandwidth test runs on.
Usage: numactl [ --membind nodes ] [ --cpunodebind nodes ] xpu-smi diag -d [deviceId] --singletest 5
It also applies to diag level tests.
```
Run the level 1 (quick) test to diagnose the GPU
```
xpu-smi diag -d 0 -l 1
Device Type: GPU
+------------------------+-------------------------------------------------------------------------+
| Device ID | 0 |
+------------------------+-------------------------------------------------------------------------+
| Level | 1 |
| Result | Pass |
| Items | 4 |
+------------------------+-------------------------------------------------------------------------+
| Software Env Variables | Result: Pass |
| | Message: Pass to check environment variables. |
+------------------------+-------------------------------------------------------------------------+
| Software Library | Result: Pass |
| | Message: Pass to check libraries |
+------------------------+-------------------------------------------------------------------------+
| Software Permission | Result: Pass |
| | Message: Pass to check permission |
+------------------------+-------------------------------------------------------------------------+
| Software Exclusive | Result: Pass |
| | Message: warning: 2 process(es) are using the device. |
| | PID: 633972, Command: ./ze_gemm |
| | PID: 633973, Command: ./ze_gemm |
+------------------------+-------------------------------------------------------------------------+
```
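To run the same quick test on every GPU on the machine, the device IDs reported by the discovery subcommand can be fed into a shell loop. This is only a sketch that assumes `jq` is installed; passing `-d -1`, as described in the help text above, is the built-in way to cover all devices in one invocation.
```
# Sketch: run the level 1 (quick) diagnostics on every GPU reported by discovery (assumes jq)
for id in $(xpu-smi discovery -j | jq -r '.device_list[].device_id'); do
    xpu-smi diag -d "$id" -l 1
done
```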
Check the GPU status
```
xpu-smi diag --precheck
+------------------+-------------------------------------------------------------------------------+
| Component | Details |
+------------------+-------------------------------------------------------------------------------+
| Driver | Status: Pass |
+------------------+-------------------------------------------------------------------------------+
| CPU | CPU ID: 0 |
| | Status: Pass |
+------------------+-------------------------------------------------------------------------------+
| CPU | CPU ID: 1 |
| | Status: Pass |
+------------------+-------------------------------------------------------------------------------+
| GPU | BDF: 0000:3a:00.0 |
| | Status: Pass |
+------------------+-------------------------------------------------------------------------------+
```
Stress the GPU
```
xpu-smi diag --stress
Stress on all GPU
Device: 0 Finished:0 Time: 0 seconds
Device: 1 Finished:0 Time: 0 seconds
Device: 0 Finished:0 Time: 5 seconds
Device: 1 Finished:0 Time: 5 seconds
^C
```
Run particular tests on the specified GPU
```
xpu-smi diag -d 0 --singletest 1,4
+-------------------------+------------------------------------------------------------------------+
| Device ID | 0 |
+-------------------------+------------------------------------------------------------------------+
| Performance Computation | Result: Pass |
| | Message: Pass to check computation performance. Its single-precision |
| | GFLOPS is 11120.225. |
+-------------------------+------------------------------------------------------------------------+
| Media Codec | Result: Pass |
| | Message: Pass to check Media transcode performance. |
| | 1080p H.265 : 474 FPS |
| | 1080p H.264 : 430 FPS |
| | 4K H.265 : 139 FPS |
| | 4K H.264 : 119 FPS |
+-------------------------+------------------------------------------------------------------------+
```
## Get and change the GPU settings
Help info of getting/changing the GPU settings
```
xpu-smi config
Get and change the GPU settings.
Usage: xpu-smi config [Options]
xpu-smi config -d [deviceId]
xpu-smi config -d [deviceId] -t [tileId] --frequencyrange [minFrequency,maxFrequency]
xpu-smi config -d [deviceId] --powerlimit [powerValue]
xpu-smi config -d [deviceId] --memoryecc [0|1] 0:disable; 1:enable
xpu-smi config -d [deviceId] -t [tileId] --standby [standbyMode]
xpu-smi config -d [deviceId] -t [tileId] --scheduler [schedulerMode]
xpu-smi config -d [deviceId] -t [tileId] --performancefactor [engineType,factorValue]
xpu-smi config -d [deviceId] -t [tileId] --xelinkport [portId,value]
xpu-smi config -d [deviceId] -t [tileId] --xelinkportbeaconing [portId,value]
xpu-smi config -d [deviceId] --reset
xpu-smi config -d [deviceId] --ppr
Options:
-h,--help Print this help message and exit
-j,--json Print result in JSON format
-d,--device The device ID or PCI BDF address to query
-t,--tile The tile ID
--frequencyrange GPU tile-level core frequency range.
--powerlimit GPU-level power limit.
--standby Tile-level standby mode. Valid options: "default"; "never".
--scheduler Tile-level scheduler mode. Value options: "timeout",timeoutValue (us);
"timeslice",interval (us),yieldtimeout (us); "exclusive". The valid range of all time values (us) is from 5000 to 100,000,000.
--performancefactor Set the tile-level performance factor. Valid options: "compute/media";factorValue. The factor value should be
between 0 to 100. 100 means that the workload is completely compute bounded and requires very little support from the memory support. 0 means that the workload is completely memory bounded and the performance of the memory controller needs to be increased.
--xelinkport Change the Xe Link port status. The value 0 means down and 1 means up.
--xelinkportbeaconing Change the Xe Link port beaconing status. The value 0 means off and 1 means on.
--memoryecc Enable/disable memory ECC setting. 0:disable; 1:enable
--reset Reset device by SBR (Secondary Bus Reset). For Max series GPU, add "pci=realloc=off" into the Linux kernel boot parameter when SR-IOV is enabled in BIOS. If SR-IOV is disabled, add "pci=realloc=on" into the Linux kernel boot parameter.
--ppr Apply PPR to the device.
--force Force PPR to run.
```
Show the GPU settings
```
xpu-smi config -d 0
+-------------+-------------------+----------------------------------------------------------------+
| Device Type | Device ID/Tile ID | Configuration |
+-------------+-------------------+----------------------------------------------------------------+
| GPU | 0 | Power Limit (w): 300.0 |
| | | Valid Range: 1 to 500 |
| | | |
| | | Memory ECC: |
| | | Current: enabled |
| | | Pending: enabled |
+-------------+-------------------+----------------------------------------------------------------+
| GPU | 0/0 | GPU Min Frequency(MHz): 300.0 |
| | | GPU Max Frequency(MHz): 1300.0 |
| | | Valid Options: 300, 350, 400, 450, 500,550, 600, 650, 700 |
| | | 750, 800, 850, 900, 950, 1000, 1050, 1100, 1150, 1200 |
| | | 1250, 1300 |
| | | |
| | | Standby Mode: default |
| | | Valid Options: default, never |
| | | |
| | | Scheduler Mode: timeslice |
| | | Interval(us): 5000 |
| | | Yield Timeout (us): 640000 |
| | | |
| | | Engine Type: compute |
| | | Performance Factor: 70 |
| | | Engine Type: media |
| | | Performance Factor: 50 |
| | | |
| | | Xe Link ports: |
| | | Up: 0,1,2,3 |
| | | Down: 4,5,6,7 |
| | | Beaconing On: 0,1,2,3 |
| | | Beaconing Off: 4,5,6,7 |
+-------------+-------------------+----------------------------------------------------------------+
| GPU | 0/1 | GPU Min Frequency(MHz): 300.0 |
| | | GPU Max Frequency(MHz): 1300.0 |
| | | Valid Options: 300, 350, 400, 450, 500,550, 600, 650, 700 |
| | | 750, 800, 850, 900, 950, 1000, 1050, 1100, 1150, 1200 |
| | | 1250, 1300 |
| | | |
| | | Standby Mode: default |
| | | Valid Options: default, never |
| | | |
| | | Scheduler Mode: timeslice |
| | | Interval(us): 5000 |
| | | Yield Timeout (us): 640000 |
| | | |
| | | Engine Type: compute |
| | | Performance Factor: 70 |
| | | Engine Type: media |
| | | Performance Factor: 50 |
| | | |
| | | Xe Link ports: |
| | | Up: 0,1,2,3 |
| | | Down: 4,5,6,7 |
| | | Beaconing On: 0,1,2,3 |
| | | Beaconing Off: 4,5,6,7 |
+-------------+-------------------+----------------------------------------------------------------+
```
Change the GPU tile core frequency range.
```
xpu-smi config -d 0 -t 0 --frequencyrange 1200,1300
Succeed to change the core frequency range on GPU 0 tile 0.
```
Change the GPU power limit.
```
xpu-smi config -d 0 --powerlimit 299
Succeed to set the power limit on GPU 0.
```
Change the GPU memory ECC mode.
```
xpu-smi config -d 0 --memoryecc 0
Return: Succeed to change the ECC mode to be disabled on GPU 0. Please reset GPU or reboot OS to take effect.
```
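Because the ECC change stays pending until the GPU is reset or the OS is rebooted, one possible follow-up is the `--reset` option described in the help text above. The sketch below is only an illustration; make sure the GPU is idle before resetting it, and note the kernel boot parameter requirements for Max series GPUs mentioned in the help text.
```
# Sketch: disable memory ECC, then reset the idle GPU so the pending change takes effect
xpu-smi config -d 0 --memoryecc 0
xpu-smi config -d 0 --reset
```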
Change the GPU tile standby mode.
```
xpu-smi config -d 0 -t 0 --standby never
Succeed to change the standby mode on GPU 0.
```
Change the GPU tile scheduler mode.
```
xpu-smi config -d 0 -t 0 --scheduler timeout,640000
Succeed to change the scheduler mode on GPU 0 tile 0.
```
Set the performance factor
```
xpu-smi config -d 0 -t 0 --performancefactor compute,70
Succeed to change the compute performance factor to 70 on GPU 0 tile 0.
```
Change the Xe Link port status
```
xpu-smi config -d 0 -t 0 --xelinkport 0,1
Succeed to change Xe Link port 0 to up.
```
Change the Xe Link port beaconing status
```
xpu-smi config -d 0 -t 0 --xelinkportbeaconing 0,1
Succeed to change Xe Link port 0 beaconing to on.
```
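Tile-level settings such as the frequency range apply to one tile at a time, so a small loop helps on multi-tile cards. A minimal sketch, assuming a two-tile device like the one shown in the configuration table above:
```
# Sketch: apply the same core frequency range to both tiles of device 0 (assumes 2 tiles)
for tile in 0 1; do
    xpu-smi config -d 0 -t "$tile" --frequencyrange 1200,1300
done
```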
## Get the device real-time statistics
Help info for getting the GPU device real-time statistics
```
List the GPU statistics.
Usage: xpu-smi stats [Options]
xpu-smi stats
xpu-smi stats -d [deviceId]
xpu-smi stats -d [pciBdfAddress]
xpu-smi stats -d [deviceId] -j
xpu-smi stats -d [pciBdfAddress] -j
xpu-smi stats -d [deviceId] -e
xpu-smi stats -d [pciBdfAddress] -e
xpu-smi stats -d [deviceId] -e -j
xpu-smi stats -d [pciBdfAddress] -e -j
Options:
-h,--help Print this help message and exit
-j,--json Print result in JSON format
-d,--device The device ID or PCI BDF address to query
-e,--eu Show the EU metrics
-r,--ras Show RAS error metrics
--xelink Show the all the Xe Link throughput (GB/s) matrix
--utils Show the Xe Link throughput utilization
```
List the GPU device real-time statistics that are collected by xpu-smi
```
xpu-smi stats -d 0
+-----------------------------+--------------------------------------------------------------------+
| Device ID | 0 |
+-----------------------------+--------------------------------------------------------------------+
| GPU Utilization (%) | 0 |
| EU Array Active (%) | |
| EU Array Stall (%) | |
| EU Array Idle (%) | |
| | |
| Compute Engine Util (%) | 0; Engine 0: 0 |
| Render Engine Util (%) | 0; Engine 0: 0 |
| Media Engine Util (%) | 0 |
| Decoder Engine Util (%) | Engine 0: 0, Engine 1: 0 |
| Encoder Engine Util (%) | Engine 0: 0, Engine 1: 0 |
| Copy Engine Util (%) | 0; Engine 0: 0 |
| Media EM Engine Util (%) | Engine 0: 0, Engine 1: 0 |
| 3D Engine Util (%) | |
+-----------------------------+--------------------------------------------------------------------+
| Reset | 0 |
| Programming Errors | 0 |
| Driver Errors | 0 |
| Cache Errors Correctable | 0 |
| Cache Errors Uncorrectable | 0 |
| Mem Errors Correctable | 0 |
| Mem Errors Uncorrectable | 0 |
+-----------------------------+--------------------------------------------------------------------+
| GPU Power (W) | 15 |
| GPU Frequency (MHz) | 2050 |
| Media Engine Freq (MHz) | 1025 |
| GPU Core Temperature (C) | 41 |
| GPU Memory Temperature (C) | |
| GPU Memory Read (kB/s) | |
| GPU Memory Write (kB/s) | |
| GPU Memory Bandwidth (%) | |
| GPU Memory Used (MiB) | 24 |
| GPU Memory Util (%) | 0 |
| Xe Link Throughput (kB/s) | |
+-----------------------------+--------------------------------------------------------------------+
```
The media engine frequency is inferred from the GPU frequency. Some GPU telemetry, such as GPU temperature, is not available to non-root users by default. To retrieve the GPU temperature as a non-root user, add read permission to the sysfs entry /sys/class/intel_pmt/telem?/telem, as shown in the example below. After that, the GPU temperature can be read by a non-root user.
```
cd /sys/class/intel_pmt/telem1
sudo chmod o+r telem
```
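If the system exposes several telemetry entries, the same permission change can be applied to all of them with a loop. This is a sketch only; the chmod does not persist across reboots, so it has to be reapplied (or automated) after each boot.
```
# Sketch: grant read permission on all PMT telemetry entries (not persistent across reboots)
for t in /sys/class/intel_pmt/telem*/telem; do
    sudo chmod o+r "$t"
done
```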
Sometimes GPU memory throughput is temporarily unavailable due to a slow response from the device.
## Get the device health status
Help info of getting the GPU device component health status
```
xpu-smi health
Get the GPU device component health status
Usage: xpu-smi health [Options]
xpu-smi health -l
xpu-smi health -l -j
xpu-smi health -d [deviceId]
xpu-smi health -d [deviceId] -j
xpu-smi health -d [pciBdfAddress]
xpu-smi health -d [deviceId] -j
xpu-smi health -d [pciBdfAddress] -j
xpu-smi health -d [deviceId] -c [componentTypeId]
xpu-smi health -d [pciBdfAddress] -c [componentTypeId] -j
optional arguments:
-h,--help Print this help message and exit
-j,--json Print result in JSON format
-l,--list Display health info for all devices
-d,--device The device ID or PCI BDF address
-c,--component Component types
1. GPU Core Temperature
2. GPU Memory Temperature
3. GPU Power
4. GPU Memory
5. Xe Link Port
6. GPU Frequency
```
Get the GPU device component health status.
```
xpu-smi health -l
+------------------------------+-------------------------------------------------------------------+
| Device ID | 0 |
+------------------------------+-------------------------------------------------------------------+
| 1. GPU Core Temperature | Status: Ok |
| | Description: All temperature sensors are healthy. |
| | Throttle Threshold: 105 Celsius Degree |
| | Shutdown Threshold: 130 Celsius Degree |
+------------------------------+-------------------------------------------------------------------+
| 2. GPU Memory Temperature | Status: Ok |
| | Description: All temperature sensors are healthy. |
| | Throttle Threshold: 85 Celsius Degree |
| | Shutdown Threshold: 100 Celsius Degree |
+------------------------------+-------------------------------------------------------------------+
| 3. GPU Power | Status: Ok |
| | Description: All power domains are healthy. |
| | Throttle Threshold: 150 watts |
+------------------------------+-------------------------------------------------------------------+
| 4. GPU Memory | Status: Ok |
| | Description: All memory channels are healthy. |
+------------------------------+-------------------------------------------------------------------+
| 5. Xe Link Port | Status: Ok |
| | Description: All ports are healthy. |
+------------------------------+-------------------------------------------------------------------+
| 6. GPU Frequency | Status: OK |
| | Description: The device frequency not throttled |
+------------------------------+-------------------------------------------------------------------+
```
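For simple periodic monitoring, the health listing can be refreshed with the standard `watch` utility, or the `-j` output can be captured by a log collector. A minimal sketch:
```
# Sketch: refresh the health status of all devices every 60 seconds
watch -n 60 xpu-smi health -l
```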
## Dump the device statistics in CSV format
### The statistics supported by Intel data center GPUs
* Intel(R) Data Center Flex Series GPU
  * GPU and GPU engine utilizations
  * GPU energy and power
  * GPU core temperature
  * GPU frequency
  * GPU memory usage
  * GPU memory throughput
  * GPU EU utilization
  * GPU PCIe throughput
  * GPU throttle reason
* Intel(R) Data Center Max Series GPU
  * GPU and GPU engine utilizations
  * GPU energy and power
  * GPU core and GPU memory temperature
  * GPU frequency
  * GPU memory usage
  * GPU memory throughput
  * GPU PCIe throughput
  * Xe Link throughput
  * GPU throttle reason
  * GPU reset count and GPU memory error number
Help info of the device statistics dump. Please note that the metrics 'Programming Errors', 'Driver Errors', 'Cache Errors Correctable' and 'Cache Errors Uncorrectable' are not yet implemented in the dump sub-command, so please do not dump these metrics. 'Media Engine Frequency' is inferred from 'GPU Frequency'.
```
xpu-smi dump -h
Dump device statistics data.
Usage: xpu-smi dump [Options]
xpu-smi dump -d [deviceIds] -m [metricsIds] -i [timeInterval] -n [dumpTimes]
xpu-smi dump -d [deviceIds] -t [deviceTileIds] -m [metricsIds] -i [timeInterval] -n [dumpTimes]
xpu-smi dump -d [pciBdfAddress] -t [deviceTileIds] -m [metricsIds] -i [timeInterval] -n [dumpTimes]
Options:
-h,--help Print this help message and exit
-j,--json Print result in JSON format
--debug Print debug info
-d,--device The device IDs or PCI BDF addresses to query. The value of "-1" means all devices.
-t,--tile The device tile ID to query. If the device has only one tile, this parameter should not be specified.
-m,--metrics Metrics type to collect raw data, options. Separated by the comma.
0. GPU Utilization (%), GPU active time of the elapsed time, per tile
1. GPU Power (W), per tile
2. GPU Frequency (MHz), per tile
3. GPU Core Temperature (Celsius Degree), per tile
4. GPU Memory Temperature (Celsius Degree), per tile
5. GPU Memory Utilization (%), per tile
6. GPU Memory Read (kB/s), per tile
7. GPU Memory Write (kB/s), per tile
8. GPU Energy Consumed (J), per tile
9. GPU EU Array Active (%), the normalized sum of all cycles on all EUs that were spent actively executing instructions. Per tile.
10. GPU EU Array Stall (%), the normalized sum of all cycles on all EUs during which the EUs were stalled. Per tile.
At least one thread is loaded, but the EU is stalled. Per tile.
11. GPU EU Array Idle (%), the normalized sum of all cycles on all cores when no threads were scheduled on a core. Per tile.
12. Reset Counter, per tile.
13. Programming Errors, per tile.
14. Driver Errors, per tile.
15. Cache Errors Correctable, per tile.
16. Cache Errors Uncorrectable, per tile.
17. GPU Memory Bandwidth Utilization (%)
18. GPU Memory Used (MiB)
19. PCIe Read (kB/s), per GPU
20. PCIe Write (kB/s), per GPU
21. Xe Link Throughput (kB/s), a list of tile-to-tile Xe Link throughput.
22. Compute engine utilizations (%), per tile.
23. Render engine utilizations (%), per tile.
24. Media decoder engine utilizations (%), per tile.
25. Media encoder engine utilizations (%), per tile.
26. Copy engine utilizations (%), per tile.
27. Media enhancement engine utilizations (%), per tile.
28. 3D engine utilizations (%), per tile.
29. GPU Memory Errors Correctable, per tile. Other non-compute correctable errors are also included.
30. GPU Memory Errors Uncorrectable, per tile. Other non-compute uncorrectable errors are also included.
31. Compute engine group utilization (%), per tile
32. Render engine group utilization (%), per tile
33. Media engine group utilization (%), per tile
34. Copy engine group utilization (%), per tile
35. Throttle reason, per tile
36. Media Engine Frequency (MHz), per tile
-i The interval (in seconds) to dump the device statistics to screen. Default value: 1 second.
-n Number of the device statistics dump to screen. The dump will never be ended if this parameter is not specified.
```
Dump the device statistics to screen in CSV format.
```
xpu-smi dump -d 0 -m 0,1,2 -i 1 -n 5
Timestamp, DeviceId, GPU Utilization (%), GPU Power (W), GPU Frequency (MHz)
06:14:46.000, 0, 0.00, 14.61, 0
06:14:47.000, 0, 0.00, 14.59, 0
06:14:48.000, 0, 0.00, 14.61, 0
06:14:49.000, 0, 0.00, 14.61, 0
06:14:50.000, 0, 0.00, 14.61, 0
```
Sometimes GPU memory throughput is temporarily unavailable due to a slow response from the device.
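For longer-term logging, the dump can be redirected to a file and left running in the background; without `-n` it keeps dumping until it is stopped. A hedged sketch using metric IDs 0, 1, 2 and 18 from the help text above (the output file names are just examples):
```
# Sketch: continuously log GPU utilization, power, frequency and memory used every 5 seconds
nohup xpu-smi dump -d 0 -m 0,1,2,18 -i 5 > gpu0_metrics.csv 2> dump_errors.log &
```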
## Collect the GPU debug log files
Help info of collecting GPU log files.
```
xpu-smi log -h
Collect GPU debug logs.
Usage: xpu-smi log [Options]
xpu-smi log -f [tarGzipFileName]
Options:
-h,--help Print this help message and exit
-j,--json Print result in JSON format
-f,--file The file (a tar.gz) to archive all the debug logs
```
Collect the GPU log files.
```
xpu-smi log -f 1217.tar.gz
Done
```
## Set up GPU SR-IOV configuration
Only Ubuntu and CentOS 7 are validated.
### BIOS settings for GPU SR-IOV
#### Intel M50CYP
Advanced -> PCI Configuration -> Memory Mapped I/O above 4GB : Enabled
Advanced -> PCI Configuration -> MMIO High Base : 56T
Advanced -> PCI Configuration -> Memory Mapped I/O Size : 1024G
Advanced -> System Acoustic and Performance Configuration -> Set Fan Profile : Performance
Advanced -> Integrated IO Configuration -> PCIe Misc. Configuration -> PCIe ASPM Support (Global) : Disabled
Advanced -> Power & Performance -> Workload Configuration : I/O Sensitive
Advanced -> Power & Performance -> CPU P State Control -> Intel(R) Turbo Boost Technology : Enabled
Advanced -> PCI Configuration -> SR-IOV Support : Enabled
Advanced -> Processor Configuration -> Intel(R) Virtualization Technology: Enabled
Advanced -> Integrated IO Configuration -> Intel(R) VT for Directed I/O : Enabled
#### Inspur NF5280M6
MMCFG Base = Auto
MMCFG Size = Auto
MMIO High Base = 32T
MMIO High Granularity Size = 1024G
VMX = Enabled
SR-IOV Support = Enabled
Intel VT for Directed I/O (VT-d) = Enabled
Power/Performance Profile = Virtualization
#### H3C R5300 G5
MMCFG Base=Auto
MMIO High Base=32T
MMIO High Granularity Size=1024G
VMX=Enabled
SR-IOV Support=Enabled
Intel VT for Directed I/O=Enabled
ENERGY_PERF_BIAS_CFG mode=Performance
Workload Configuration=Balanced
Workload Profile Configuration=Graphic Processing
#### Dell PowerEdge XR12
Integrated Devices -> Memory Mapped I/O above 4GB = Enabled
Integrated Devices -> Memory Mapped Base = 56TB
Integrated Devices -> SR-IOV Global Enable = Enabled
Processor Settings -> Virtualization Technology = Enabled
System Profile Settings -> System Profile = Performance
#### Supermicro SYS-620C-TN12R
Advanced -> Chipset Configuration -> NorthBridge -> IIO Configuration -> Intel VT for Directed I/O (VT-d) -> Enable
Advanced -> Chipset Configuration -> NorthBridge -> IIO Configuration -> PCI-E ASPM Support (Global) -> No
Advanced -> Chipset Configuration -> NorthBridge -> IIO Configuration -> IIO eDPC Support -> Disable
Advanced -> PCIe/PCI/PnP Configuration -> Above 4G Decoding -> Enabled
Advanced -> PCIe/PCI/PnP Configuration -> SR-IOV Support -> Enable
Advanced -> PCIe/PCI/PnP Configuration -> ARI Support -> Enable
Advanced -> PCIe/PCI/PnP Configuration -> MMCFG Base -> Auto
Advanced -> PCIe/PCI/PnP Configuration -> MMIO High Base -> 32T
Advanced -> PCIE/PCI/PnP Configuration -> MMIO High Granularity Size -> 1024G or 2048G
#### HPE ProLiant DL380 Gen10
Virtualization Options -> Intel(R) Virtualization Technology: Enabled
Virtualization Options -> Intel(R) VT-d: Enabled
Virtualization Options -> SR-IOV: Enabled
### Add Linux kernel command line parameters
After the BIOS settings are correctly configured and the GPU driver is installed, some kernel command line parameters need to be added to the "GRUB_CMDLINE_LINUX" setting in the file /etc/default/grub. They are "intel_iommu=on i915.max_vfs=31": "intel_iommu" enables IOMMU and "i915.max_vfs" enables SR-IOV. After you update the grub file, run a command such as "update-grub" to regenerate the kernel boot configuration, then reboot the OS for the changes to take effect. You may check the content of /proc/cmdline to confirm that the parameters are set correctly.
The command "sudo xpu-smi vgpu --addkernelparam" can automatically add these kernel parameters.
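If you prefer to add the parameters by hand instead of using `--addkernelparam`, the steps below are a rough sketch. The grub regeneration command depends on the distribution: `update-grub` is typical on Ubuntu, while CentOS 7 typically uses `grub2-mkconfig`, and the grub.cfg path may differ on your system.
```
# Sketch: the expected GRUB_CMDLINE_LINUX line in /etc/default/grub (append to any existing parameters)
GRUB_CMDLINE_LINUX="intel_iommu=on i915.max_vfs=31"

# Regenerate the boot configuration and reboot
sudo update-grub                                  # Ubuntu
# sudo grub2-mkconfig -o /boot/grub2/grub.cfg     # CentOS 7 (path may vary)
sudo reboot

# After reboot, confirm the parameters were applied
grep -o 'intel_iommu=on i915.max_vfs=31' /proc/cmdline
```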
### Help info of the GPU SR-IOV configuration feature
```
xpu-smi vgpu
Create and remove virtual GPUs in SR-IOV configuration.
Usage: xpu-smi vgpu [Options]
xpu-smi vgpu --precheck
xpu-smi vgpu --addkernelparam
xpu-smi vgpu -d [deviceId] -c -n [vGpuNumber] --lmem [vGpuMemorySize]
xpu-smi vgpu -d [pciBdfAddress] -c -n [vGpuNumber] --lmem [vGpuMemorySize]
xpu-smi vgpu -d [deviceId] -r
xpu-smi vgpu -d [pciBdfAddress] -r
xpu-smi vgpu -d [deviceId] -l
xpu-smi vgpu -d [pciBdfAddress] -l
Options:
-h,--help Print this help message and exit
-j,--json Print result in JSON format
--precheck Check if BIOS settings and kernel command line options are ready to create virtual GPUs
--addkernelparam Add the kernel command line parameters for the virtual GPUs
-d,--device Device ID or PCI BDF address
-c,--create Create the virtual GPUs
-n The number of virtual GPUs to create.
--lmem The memory size of each virtual GPUs, in MiB. For example, --lmem 500. This parameter is optional.
-l,--list List all virtual GPUs on the specified physical GPU
-r,--remove Remove all virtual GPUs on the specified physical GPU
-y,--assumeyes Assume that the answer to any question which would be asked is yes
```
### GPU SR-IOV precheck
```
xpu-smi vgpu --precheck
+------------------+-------------------------------------------------------------------------------+
| VMX Flag | Result: Pass |
| | Message: |
+------------------+-------------------------------------------------------------------------------+
| SR-IOV | Status: Fail |
| | Message: SR-IOV is disabled. Please set the related BIOS settings and |
| | kernel command line options. |
+------------------+-------------------------------------------------------------------------------+
| IOMMU | Status: Pass |
| | Message: |
+------------------+-------------------------------------------------------------------------------+
```
### Add the Linux kernel command line parameters for the virtual GPUs
```
sudo xpu-smi vgpu --addkernelparam -y
Do you want to add the required kernel command line parameters? (y/n) y
Succeed in adding the required kernel command line parameters, "intel_iommu=on i915.max_vfs=31". "intel_iommmu" is for IOMMU and "i915.max_vfs" is for SR-IOV. Please reboot OS to take effect.
```
### Create the virtual GPUs
After all GPU SR-IOV pre-checks pass, you may create the virtual GPUs.
```
sudo xpu-smi vgpu -d 0 -c -n 4 --lmem 500
+--------------------------------------------------------------------------------------------------+