
[Profiler] Use the Kineto to profile Triton XPU Kernel's accuracy execution time. #1066

Closed
chengjunlu opened this issue May 8, 2024 · 12 comments · Fixed by #1136
Labels: enhancement (New feature or request), performance

Comments

@chengjunlu
Contributor

There are no standalone profiler tools for Triton XPU at the moment.

We used to use:

  1. The legacy Torch profiler with the IPEX extension. (This is going to be removed by IPEX.)
  2. The new Torch profiler with Kineto extended by IPEX. (This depends on Kineto and Torch.)
  3. A synchronization wait on the host to measure performance. (This is not accurate because of host overheads.)

Triton has a new component for profiling the performance of Triton kernels. It is worth supporting it for Triton XPU.
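As a device-agnostic sketch of the host-synchronization approach (item 3 above): here `kernel` and `sync` are hypothetical stand-ins for a Triton kernel launch and a device synchronize call (e.g. `torch.xpu.synchronize`), which makes it visible why the host-side launch and wait overhead ends up inside the measurement:

```python
import time

def host_sync_time(kernel, sync, repeats=100):
    """Average kernel time measured with a host-side synchronization wait.

    The result includes kernel-launch and synchronization overhead on the
    host, which is why this method is inaccurate for short kernels.
    """
    sync()  # drain any previously submitted work first
    start = time.perf_counter()
    for _ in range(repeats):
        kernel()
    sync()  # block until every submitted kernel has finished
    return (time.perf_counter() - start) / repeats
```

Averaging over many launches amortizes some of the overhead, but never removes it, so short kernels are systematically overestimated.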

@vlad-penkin vlad-penkin added enhancement New feature or request performance labels May 8, 2024
@tdeng5 tdeng5 changed the title [Profiler] Support Triton XPU kernel profiling thru the Proton [Profiler] Provide an accuracy method to profile Triton XPU Kernel's execution time. May 16, 2024
@tdeng5

tdeng5 commented May 16, 2024

Collecting accurate Triton performance data is the highest priority for the coming Triton Demo on Jun 25.

@tdeng5 tdeng5 changed the title [Profiler] Provide an accuracy method to profile Triton XPU Kernel's execution time. [Profiler] Provide a method to profile Triton XPU Kernel's accuracy execution time. May 16, 2024
@etiotto etiotto reopened this May 17, 2024
@etiotto
Contributor

etiotto commented May 17, 2024

I have added post-review comments to the PR that closed this issue, see #1136 (comment).

I am concerned that the benchmarks compute timing in a different way than the do_bench Triton uses.
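For comparison, do_bench's overall strategy (warm up, then report a robust statistic such as the median over repeated runs) can be sketched with host timers. The real `triton.testing.do_bench` times each run with device events and flushes the L2 cache between runs, so this is only an illustration of the measurement shape, not a replacement:

```python
import statistics
import time

def do_bench_like(fn, warmup=5, rep=25):
    """Measure fn in milliseconds, do_bench style: warm up first, then
    time each of `rep` runs separately and return the median, so that
    outliers (e.g. a stray context switch) do not skew the result."""
    for _ in range(warmup):
        fn()
    times_ms = []
    for _ in range(rep):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1e3)
    return statistics.median(times_ms)
```

A benchmark that instead averages one long timed loop, or uses a different statistic, will report numbers that are not directly comparable to do_bench's.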

@chengjunlu
Contributor Author

As the legacy profiler in public Torch doesn't support XPU, we need to use Kineto to profile the Triton kernel. The alternative solution is to enable Proton, which is tracked in #1145.

Changed the title of this issue to be more precise.

@chengjunlu chengjunlu changed the title [Profiler] Provide a method to profile Triton XPU Kernel's accuracy execution time. [Profiler] Use the Kineto to profile Triton XPU Kernel's accuracy execution time. Jul 30, 2024
@chengjunlu
Contributor Author

The PyTorch Kineto for XPU requires a separate PTI package, intel-pti-dev_p_0.9.0.32, which is not included in the PTDB package so far.

I will try it with a Triton kernel to see if it works properly.

@chengjunlu
Contributor Author

Adding @ZzEeKkAa as an assignee to this issue because he has already worked on the Kineto profiler integration for Triton.

Here is PR #1905 from @ZzEeKkAa.

@chengjunlu chengjunlu linked a pull request Aug 23, 2024 that will close this issue
@ZzEeKkAa
Contributor

Speaking of PTI, can we use it for elapsed_time? That would unblock the long-running pytorch/pytorch#126456.

@vlad-penkin
Contributor

@chengjunlu could you please provide detailed instructions on how to use Kineto for PyTorch profiling in general and Triton kernel profiling in particular. The use cases in scope are:

  • Triton UT's
  • Triton Tutorial's
  • Torch Inductor UT's relevant to the Triton
  • PyTorch/Benchmark E2E tests

@chengjunlu
Contributor Author

Speaking of PTI, can we use it for elapsed_time? That would unblock the long-running pytorch/pytorch#126456.

The elapsed_time is more general: it can be used to profile the E2E GPU time, including bubble time that may be caused by kernel scheduling gaps. I am not sure that is possible with PTI.

@chengjunlu
Contributor Author

chengjunlu commented Aug 28, 2024

@chengjunlu could you please provide detailed instructions on how to use Kineto for PyTorch profiling in general and Triton kernel profiling in particular. The use cases in scope are:

  • Triton UT's
  • Triton Tutorial's
  • Torch Inductor UT's relevant to the Triton
  • PyTorch/Benchmark E2E tests

I am combining all the information on performance profiling here.

There are two ways used in Torch + Triton to measure GPU kernel performance:

  1. Diff the timestamps of two events via elapsed_time.
  2. Use Kineto to profile kernel time through PTI.

Note: PyTorch XPU 2.5 doesn't support elapsed_time and raises an exception. Triton XPU works around it approximately by using wall time instead of GPU timestamps. I will mark those use cases with Triton Workaround for notice.

Here are the cases using the 1st way:

  • Triton Tutorial's Triton Workaround
  • Triton UT's Triton Workaround
  • Torch Inductor UT's relevant to the Triton. Triton Workaround

Here are the cases using the 2nd way:

  • PyTorch/Benchmark E2E tests
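The Triton Workaround above (wall time in place of GPU timestamps when elapsed_time raises) can be sketched as a stand-in event class that mimics the record()/elapsed_time() shape of torch's Event API; the class name and details are illustrative, not the actual Triton XPU code:

```python
import time

class WallTimeEvent:
    """Fallback 'event' that records host wall time instead of a device
    timestamp, for backends where Event.elapsed_time is unsupported."""

    def __init__(self):
        self._t = None

    def record(self):
        self._t = time.perf_counter()

    def elapsed_time(self, end):
        # Match torch's Event.elapsed_time contract: milliseconds.
        return (end._t - self._t) * 1e3

start, end = WallTimeEvent(), WallTimeEvent()
start.record()
time.sleep(0.002)  # placeholder for launching and syncing a kernel
end.record()
elapsed_ms = start.elapsed_time(end)
```

Because this records host wall time, the measurement includes launch and synchronization overhead, which is why it is only an approximation of the GPU execution time.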

@whitneywhtsang
Contributor

Note: https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/scripts/patch-pytorch.sh can be applied to allow using elapsed_time for the cases specified above as using the 1st way. It is used in CI and in developer scripts.

@chengjunlu
Contributor Author

chengjunlu commented Aug 28, 2024

Kineto is blocked by an issue in Intel PTI: it is not able to trace Triton kernels launched by the SYCL API.

We can use the first way as a workaround to get approximate performance profiling with the patch https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/scripts/patch-pytorch.sh.

For PyTorch 2.5 out-of-the-box support, we have to use wall time as a workaround for now.

The changes have been pushed to PR #1905.
