[CUDA] Add an option for profiling cuda kernels #16061

Merged: 1 commit, Nov 5, 2023

Contributor

@echuraev echuraev commented Nov 3, 2023

Several NVIDIA tools, such as Nsight Systems and Nsight Compute, can be used for profiling CUDA kernels. NVIDIA Nsight Systems collects system-wide information about your program and GPU events and can help you find bottlenecks in your model. To profile an individual CUDA kernel, use NVIDIA Nsight Compute.

Without this patch, profiling a CUDA kernel from TVM with Nsight Compute shows only SASS instructions, not the source code. That is useful, but it is often easier to analyze the generated CUDA code than the instructions. This patch adds a new pass config option, `cuda.kernels_output_dir`, which specifies the directory where the CUDA source code is stored after the build. When the option is set, CUDA kernels are also compiled with `-lineinfo`, which plays a role similar to GCC's `-g` option: it embeds source line information in the compiled kernels, so Nsight Compute can map profile information to the source code. Note that to see the source code in Nsight Compute, you must set the `Import Source` parameter to `Yes` when configuring the profiling session.

Contributor Author

echuraev commented Nov 3, 2023

Here is an example of how NVIDIA tools can be used to analyze kernels in models compiled with TVM.

I took the code from the autotvm x86 tutorial for running the ResNet50-v2 model and changed the target to CUDA. Then I added the `cuda.kernels_output_dir` option to the PassContext during model compilation:

with tvm.transform.PassContext(opt_level=3, config={"cuda.kernels_output_dir": "___tmp_cuda_dumps"}):
    lib = relay.build(mod, target=target, params=params)

After running the script, the directory `___tmp_cuda_dumps` is created and a file with the CUDA kernels is stored in it.
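To confirm that the dump was produced, the output directory can be inspected after the build. The following is a minimal sketch using only the standard library. The helper name `list_dumped_files` is hypothetical, and nothing is assumed about the file names TVM chooses:

```python
from pathlib import Path


def list_dumped_files(dump_dir):
    """Return (file name, size in bytes) for every regular file in dump_dir."""
    root = Path(dump_dir)
    if not root.is_dir():
        # The directory only exists once a CUDA build has run with the option set.
        return []
    return sorted((p.name, p.stat().st_size) for p in root.iterdir() if p.is_file())


# Directory name taken from the PassContext config shown earlier.
for name, size in list_dumped_files("___tmp_cuda_dumps"):
    print(f"{name}: {size} bytes")
```

An empty result means either that the build has not run yet or that the option was not picked up by the PassContext.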

To profile the model, we can start with NVIDIA Nsight Systems. Run the model under the Nsight Systems profiler:

nsys profile python3 model_run.py

After this command finishes, a new file, report1.nsys-rep, is created in the working directory. You can open this report in NVIDIA Nsight Systems. The screenshot shows the main window after opening the report:
[Screenshot: Nsight Systems main window with the report opened]

You can zoom in on the trace window and expand the CUDA HW (...) row to see the sequence of CUDA kernels that were executed. After this high-level analysis, you might want to analyze an individual kernel. In that case, select the kernel and click Analyze the Selected Kernel with NVIDIA Nsight Compute:
[Screenshot: the Analyze the Selected Kernel with NVIDIA Nsight Compute action]

After that, you can either start the GUI of the installed NVIDIA Nsight Compute or display a command line for the Nsight Compute CLI. I prefer the GUI because it offers more tools for analysis.
[Screenshot: choice between the Nsight Compute GUI and the CLI command line]

After you select the GUI, an instance of Nsight Compute opens. Click the Connect button and configure your profiling session in the window that opens:

  1. If you profile remotely, configure the connection to the target machine.
  2. Specify environment variables. I usually add the following:
    PATH=<path_to_anaconda>/envs/<env_name>/bin:/usr/local/cuda/bin:/usr/bin;
    PYTHONPATH=<path_to_tvm>/python;
    LD_LIBRARY_PATH=<path_to_tvm>/build
    
    The Anaconda path is needed only if you use a specific Python environment. Variables must be separated by ;.
  3. In the profile configuration section, on the Other tab, set the Import Source parameter to Yes.
  4. After that, you should be able to launch kernel profiling.
    [Screenshot: Nsight Compute profiling session configuration]
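The session settings above can also be prepared programmatically. The following is a minimal sketch that formats the environment field and assembles a roughly equivalent Nsight Compute CLI command. The helpers `format_for_gui` and `build_ncu_command`, the report name `kernel_report` and the `<kernel_name>` placeholder are hypothetical. To my knowledge `--import-source yes` is the CLI counterpart of the Import Source setting, `--kernel-name` restricts profiling to one kernel and `--export` names the report, but check `ncu --help` for your version:

```python
def format_for_gui(variables):
    """Join KEY=VALUE pairs with ';', the separator used in the environment field."""
    return ";".join(f"{key}={value}" for key, value in variables.items())


def build_ncu_command(script, kernel_name, report_name):
    """Assemble an `ncu` command line that imports sources into the report."""
    return [
        "ncu",
        "--import-source", "yes",      # store source files in the report
        "--kernel-name", kernel_name,  # profile only this kernel
        "--export", report_name,       # report file to write
        "python3", script,
    ]


tvm_root = "<path_to_tvm>"  # placeholder, as in the list above
gui_env = {
    "PATH": "<path_to_anaconda>/envs/<env_name>/bin:/usr/local/cuda/bin:/usr/bin",
    "PYTHONPATH": f"{tvm_root}/python",
    "LD_LIBRARY_PATH": f"{tvm_root}/build",
}
print(format_for_gui(gui_env))

ncu_cmd = build_ncu_command("model_run.py", "<kernel_name>", "kernel_report")
print(" ".join(ncu_cmd))
```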

When profiling has finished, you see a detailed overview of your kernel. This page provides a lot of information about the kernel and hardware utilization:
[Screenshot: Nsight Compute kernel overview page]

After switching to the Source page, you can see the source code of the kernels alongside the SASS instructions:
[Screenshot: Nsight Compute Source page with CUDA source and SASS]

The screenshot shows that the code of our kernel starts at line 1329. This happens because TVM dumps the kernels for the whole network into a single file, so the source file contains every kernel from the model. However, there is a colorized area next to the scroll bar that marks where the executed code is located, so the relevant kernel is easy to find in this file.
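Outside Nsight Compute, a kernel can also be located in the dumped file by building a small index of kernel names and line numbers. The following is a minimal sketch, assuming the generated kernels are declared as `extern "C" __global__ void ... name(` with an optional `__launch_bounds__` attribute. The helper `index_kernels` and the sample source are hypothetical:

```python
import re

# Assumption: each kernel is declared as `__global__ void [__launch_bounds__(N)] name(`.
KERNEL_RE = re.compile(
    r"__global__\s+void\s+(?:__launch_bounds__\s*\([^)]*\)\s*)?(\w+)\s*\("
)


def index_kernels(cuda_source):
    """Map each kernel name to the 1-based line number of its first declaration."""
    index = {}
    for line_no, line in enumerate(cuda_source.splitlines(), start=1):
        match = KERNEL_RE.search(line)
        if match:
            # Keep the first match; it may be a forward declaration if one exists.
            index.setdefault(match.group(1), line_no)
    return index


sample = '''#include <cuda_fp16.h>
extern "C" __global__ void __launch_bounds__(64) add_kernel(float* a, float* b) {
  a[threadIdx.x] += b[threadIdx.x];
}
extern "C" __global__ void relu_kernel(float* a) {
  a[threadIdx.x] = max(a[threadIdx.x], 0.0f);
}
'''
print(index_kernels(sample))
```

For a real dump, read the file contents from the directory given in `cuda.kernels_output_dir` and pass them to `index_kernels`.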

Contributor Author

echuraev commented Nov 3, 2023

Here we can discuss the current implementation and whether the single file with all CUDA kernels should be split into separate files.

@echuraev echuraev marked this pull request as ready for review November 3, 2023 10:38
@echuraev echuraev requested a review from masahi November 3, 2023 10:38
@Hzfengsy Hzfengsy merged commit b144145 into apache:main Nov 5, 2023
7 checks passed