
[Profiler] Use the Kineto to profile Triton XPU Kernel's accuracy execution time. #1066

Closed
chengjunlu opened this issue May 8, 2024 · 12 comments · Fixed by #1136
Labels: enhancement (New feature or request), performance

Comments

@chengjunlu
Contributor

There are no standalone profiler tools for Triton XPU at the moment.

We used to use:

  1. The legacy Torch profiler with the IPEX extension. (This is going to be removed by IPEX.)
  2. The new Torch profiler with Kineto extended by IPEX. (This depends on Kineto and Torch.)
  3. A synchronization wait on the host to measure performance. (This is not accurate because of host overheads.)

Triton has a new component for profiling the performance of Triton kernels. It is worth supporting it for Triton XPU.
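As a device-agnostic sketch of the host-synchronization approach (item 3 above): here `kernel` and `sync` are hypothetical stand-ins for a Triton kernel launch and a device synchronize call (e.g. `torch.xpu.synchronize`), which makes it visible why the host-side launch and wait overhead ends up inside the measurement:

```python
import time

def host_sync_time(kernel, sync, repeats=100):
    """Average kernel time measured with a host-side synchronization wait.

    The result includes kernel-launch and synchronization overhead on the
    host, which is why this method is inaccurate for short kernels.
    """
    sync()  # drain any previously submitted work first
    start = time.perf_counter()
    for _ in range(repeats):
        kernel()
    sync()  # block until every submitted kernel has finished
    return (time.perf_counter() - start) / repeats
```

Averaging over many launches amortizes some of the overhead, but never removes it, so short kernels are systematically overestimated.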

@vlad-penkin vlad-penkin added enhancement New feature or request performance labels May 8, 2024
@tdeng5 tdeng5 changed the title [Profiler] Support Triton XPU kernel profiling thru the Proton [Profiler] Provide an accuracy method to profile Triton XPU Kernel's execution time. May 16, 2024
@tdeng5

tdeng5 commented May 16, 2024

Collecting accurate Triton performance data is the highest priority for the coming Triton Demo on Jun 25.

@tdeng5 tdeng5 changed the title [Profiler] Provide an accuracy method to profile Triton XPU Kernel's execution time. [Profiler] Provide a method to profile Triton XPU Kernel's accuracy execution time. May 16, 2024
@etiotto etiotto reopened this May 17, 2024
@etiotto
Contributor

etiotto commented May 17, 2024

I have added post-review comments to the PR that closed this issue, see #1136 (comment).

I am concerned that the benchmarks compute timing in a different way than the do_bench Triton uses.
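For comparison, do_bench's overall strategy (warm up, then report a robust statistic such as the median over repeated runs) can be sketched with host timers. The real `triton.testing.do_bench` times each run with device events and flushes the L2 cache between runs, so this is only an illustration of the measurement shape, not a replacement:

```python
import statistics
import time

def do_bench_like(fn, warmup=5, rep=25):
    """Measure fn in milliseconds, do_bench style: warm up first, then
    time each of `rep` runs separately and return the median, so that
    outliers (e.g. a stray context switch) do not skew the result."""
    for _ in range(warmup):
        fn()
    times_ms = []
    for _ in range(rep):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1e3)
    return statistics.median(times_ms)
```

A benchmark that instead averages one long timed loop, or uses a different statistic, will report numbers that are not directly comparable to do_bench's.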

@chengjunlu
Contributor Author

As the legacy profiler in public Torch doesn't support XPU, we need to use Kineto to profile the Triton kernel. The alternative solution is to enable Proton, which is tracked in #1145.

Changed the title of this issue to be more precise.

@chengjunlu chengjunlu changed the title [Profiler] Provide a method to profile Triton XPU Kernel's accuracy execution time. [Profiler] Use the Kineto to profile Triton XPU Kernel's accuracy execution time. Jul 30, 2024
@chengjunlu
Contributor Author

The PyTorch Kineto for XPU requires a separate PTI package, intel-pti-dev_p_0.9.0.32, which is not included in the PTDB package so far.

I will try it with a Triton kernel to see if it works properly.

@chengjunlu
Contributor Author

Adding @ZzEeKkAa as an assignee to this issue because he has already worked on the Kineto profiler integration for Triton.

Here is PR #1905 from @ZzEeKkAa.

@chengjunlu chengjunlu linked a pull request Aug 23, 2024 that will close this issue
@ZzEeKkAa
Contributor

Speaking of PTI, can we use it for elapsed_time? That would unblock the long-running pytorch/pytorch#126456.

@vlad-penkin
Contributor

@chengjunlu could you please provide detailed instructions on how to use Kineto for PyTorch profiling in general and Triton kernel profiling in particular. The use cases in scope are:

  • Triton UT's
  • Triton Tutorial's
  • Torch Inductor UT's relevant to the Triton
  • PyTorch/Benchmark E2E tests

@chengjunlu
Contributor Author

Speaking of PTI, can we use it for elapsed_time? That would unblock the long-running pytorch/pytorch#126456.

The elapsed_time is more general: it can be used to profile the E2E GPU time, including bubble time that may be caused by kernel scheduling gaps. I am not sure that is possible with PTI.

@chengjunlu
Contributor Author

chengjunlu commented Aug 28, 2024

@chengjunlu could you please provide detailed instructions on how to use Kineto for PyTorch profiling in general and Triton kernel profiling in particular. The use cases in scope are:

  • Triton UT's
  • Triton Tutorial's
  • Torch Inductor UT's relevant to the Triton
  • PyTorch/Benchmark E2E tests

I am combining all the information on performance profiling here.

There are two ways used in Torch + Triton to measure GPU kernel performance:

  1. Diff the timestamps of two events via elapsed_time.
  2. Use Kineto to profile kernel time through PTI.

Note: PyTorch XPU 2.5 doesn't support elapsed_time and raises an exception. Triton XPU works around it approximately by using wall time instead of GPU timestamps. I will mark those use cases with Triton Workaround for notice.

Here are the cases using the 1st way:

  • Triton Tutorial's Triton Workaround
  • Triton UT's Triton Workaround
  • Torch Inductor UT's relevant to the Triton. Triton Workaround

Here are the cases using the 2nd way:

  • PyTorch/Benchmark E2E tests
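The Triton Workaround above (wall time in place of GPU timestamps when elapsed_time raises) can be sketched as a stand-in event class that mimics the record()/elapsed_time() shape of torch's Event API; the class name and details are illustrative, not the actual Triton XPU code:

```python
import time

class WallTimeEvent:
    """Fallback 'event' that records host wall time instead of a device
    timestamp, for backends where Event.elapsed_time is unsupported."""

    def __init__(self):
        self._t = None

    def record(self):
        self._t = time.perf_counter()

    def elapsed_time(self, end):
        # Match torch's Event.elapsed_time contract: milliseconds.
        return (end._t - self._t) * 1e3

start, end = WallTimeEvent(), WallTimeEvent()
start.record()
time.sleep(0.002)  # placeholder for launching and syncing a kernel
end.record()
elapsed_ms = start.elapsed_time(end)
```

Because this records host wall time, the measurement includes launch and synchronization overhead, which is why it is only an approximation of the GPU execution time.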

@whitneywhtsang
Contributor

Note: https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/scripts/patch-pytorch.sh can be applied to allow using elapsed_time for the cases specified above as using the 1st way. It is used in CI and in developer scripts.

@chengjunlu
Contributor Author

chengjunlu commented Aug 28, 2024

Kineto is blocked by an issue in Intel PTI: it is not able to trace Triton kernels launched by the SYCL API.

We can use the first way as a workaround to get approximate performance profiling with the patch https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/scripts/patch-pytorch.sh.

For PyTorch 2.5 out-of-the-box support, we have to use wall time as a workaround for now.

The changes have been pushed to PR #1905.
