CUDA error: no kernel image is available for execution on the device #147

Open
leoauri opened this issue Jun 19, 2024 · 3 comments

Comments

@leoauri

leoauri commented Jun 19, 2024

Hi there,
I have copied s4.py and the kernel extension into another repository I am working on. I had the S4 components running (with CUDA), and then I installed the kernel extension. The build output was full of deprecation warnings that filled my terminal history, but it ends with:

...
creating build/lib.linux-x86_64-cpython-310
x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -g -fwrapv -O2 build/temp.linux-x86_64-cpython-310/cauchy.o build/temp.linux-x86_64-cpython-310/cauchy_cuda.o -L/usr/local/lib/python3.10/dist-packages/torch/lib -L/usr/local/cuda/lib64 -L/usr/lib/x86_64-linux-gnu -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/structured_kernels.cpython-310-x86_64-linux-gnu.so
creating build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-cpython-310/structured_kernels.cpython-310-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
creating stub loader for structured_kernels.cpython-310-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/structured_kernels.py to structured_kernels.cpython-310.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying structured_kernels.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying structured_kernels.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying structured_kernels.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying structured_kernels.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
__pycache__.structured_kernels.cpython-310: module references __file__
creating 'dist/structured_kernels-0.1.0-py3.10-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing structured_kernels-0.1.0-py3.10-linux-x86_64.egg
creating /usr/local/lib/python3.10/dist-packages/structured_kernels-0.1.0-py3.10-linux-x86_64.egg
Extracting structured_kernels-0.1.0-py3.10-linux-x86_64.egg to /usr/local/lib/python3.10/dist-packages
Adding structured-kernels 0.1.0 to easy-install.pth file

Installed /usr/local/lib/python3.10/dist-packages/structured_kernels-0.1.0-py3.10-linux-x86_64.egg
Processing dependencies for structured-kernels==0.1.0
Finished processing dependencies for structured-kernels==0.1.0

Also, I remember some warnings at the beginning because the CUDA version is 12.3 but PyTorch is built for 12.1...

In any case, when I now try to train with the S4 components, I get an error like:

...
  File "/workspace/cornbirdrave/RAVE/extensions/kernels/cauchy.py", line 96, in forward
    return cauchy_mult_sym_fwd(v, z, w)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Any idea how to work around this?
Thanks...

@leoauri
Author

leoauri commented Jun 19, 2024

I ran the installer again and did not get the same deprecation warnings. The output was:

$ python3.10 setup.py install 2>&1 | tee install.log
running install
/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!

        ********************************************************************************
        Please avoid running ``setup.py`` directly.
        Instead, use pypa/build, pypa/installer or other
        standards-based tools.

        See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
        ********************************************************************************

!!
  self.initialize_options()
/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!

        ********************************************************************************
        Please avoid running ``setup.py`` and ``easy_install``.
        Instead, use pypa/build, pypa/installer or other
        standards-based tools.

        See https://github.com/pypa/setuptools/issues/917 for details.
        ********************************************************************************

!!
  self.initialize_options()
running bdist_egg
running egg_info
writing structured_kernels.egg-info/PKG-INFO
writing dependency_links to structured_kernels.egg-info/dependency_links.txt
writing top-level names to structured_kernels.egg-info/top_level.txt
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:499: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
  warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'structured_kernels.egg-info/SOURCES.txt'
writing manifest file 'structured_kernels.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_ext
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:418: UserWarning: The detected CUDA version (12.3) has a minor version mismatch with the version that was used to compile PyTorch (12.1). Most likely this shouldn't be a problem.
  warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:428: UserWarning: There are no x86_64-linux-gnu-g++ version bounds defined for CUDA version 12.3
  warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
creating build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-cpython-310/structured_kernels.cpython-310-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
creating stub loader for structured_kernels.cpython-310-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/structured_kernels.py to structured_kernels.cpython-310.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying structured_kernels.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying structured_kernels.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying structured_kernels.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying structured_kernels.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
__pycache__.structured_kernels.cpython-310: module references __file__
creating 'dist/structured_kernels-0.1.0-py3.10-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing structured_kernels-0.1.0-py3.10-linux-x86_64.egg
removing '/usr/local/lib/python3.10/dist-packages/structured_kernels-0.1.0-py3.10-linux-x86_64.egg' (and everything under it)
creating /usr/local/lib/python3.10/dist-packages/structured_kernels-0.1.0-py3.10-linux-x86_64.egg
Extracting structured_kernels-0.1.0-py3.10-linux-x86_64.egg to /usr/local/lib/python3.10/dist-packages
Adding structured-kernels 0.1.0 to easy-install.pth file

Installed /usr/local/lib/python3.10/dist-packages/structured_kernels-0.1.0-py3.10-linux-x86_64.egg
Processing dependencies for structured-kernels==0.1.0
Finished processing dependencies for structured-kernels==0.1.0

In particular, the line `There are no x86_64-linux-gnu-g++ version bounds defined for CUDA version 12.3` jumps out at me. Could it have something to do with the error?

@leoauri
Author

leoauri commented Jun 20, 2024

Ah wait. The compilation job and the training job landed on different machines in the cluster with different GPU models. Probably the kernel has to be compiled for the actual GPU it will be used with...
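
One way to confirm this kind of mismatch is to compare the compute capability of the GPU on the training node with the capability the extension was built for. The snippet below is only a sketch: `runs_on` and `describe_gpu` are hypothetical helper names, the capabilities in the example are made up, and the check covers compiled kernels only (it ignores any PTX fallback). `torch.cuda.get_device_name` and `torch.cuda.get_device_capability` are the PyTorch calls assumed for reading the values on each machine.

```python
def runs_on(device_cap, built_caps):
    """Return True if a kernel built for any of built_caps can run on device_cap.

    Compiled kernels are only compatible within one major architecture,
    and only with the same or a newer minor revision.
    PTX fallback is deliberately ignored in this sketch.
    """
    major, minor = device_cap
    return any(b_major == major and b_minor <= minor
               for b_major, b_minor in built_caps)


def describe_gpu(index=0):
    """Return (name, capability) of a visible GPU; needs PyTorch with CUDA."""
    import torch  # imported lazily so runs_on() works on machines without a GPU
    return torch.cuda.get_device_name(index), torch.cuda.get_device_capability(index)


# Hypothetical cluster: extension built on a 7.0 node, job scheduled on an 8.6 node.
print(runs_on((8, 6), [(7, 0)]))  # False
print(runs_on((8, 6), [(8, 0)]))  # True
```

Running `describe_gpu()` once on the build node and once on the training node should show whether the two capabilities differ in the way that produces "no kernel image is available".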

@albertfgu
Contributor

Yes, it has to be compiled for the specific GPU. Sometimes there can be issues with versions managed in a cluster because of this. I recommend trying to create a separate environment for each machine type.
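
An alternative to one environment per machine type, offered here as a suggestion rather than something tested in this thread, is to build the extension once for every GPU architecture in the cluster. PyTorch's extension builder reads the `TORCH_CUDA_ARCH_LIST` environment variable for this. The sketch below only formats that value; `arch_list` is a hypothetical helper and the capabilities are example values, not the ones from this cluster.

```python
import os


def arch_list(capabilities):
    """Format (major, minor) compute capabilities as a TORCH_CUDA_ARCH_LIST value."""
    unique = sorted(set(capabilities))  # drop duplicates, keep a stable order
    return ";".join(f"{major}.{minor}" for major, minor in unique)


# Hypothetical cluster with 7.0, 8.0 and 8.6 GPUs (8.0 listed twice on purpose).
os.environ["TORCH_CUDA_ARCH_LIST"] = arch_list([(8, 0), (7, 0), (8, 6), (8, 0)])
print(os.environ["TORCH_CUDA_ARCH_LIST"])  # 7.0;8.0;8.6
```

The variable has to be set in the environment that runs the build, for example by exporting it in the shell before `python3.10 setup.py install`. Note that the second install log above goes straight from `build_ext` to copying the existing `.so` without a compile step, so it is probably worth deleting the `build/` directory first to force a recompile.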
