
Support for Ada Lovelace & Blackwell #6

Open
leedrake5 opened this issue Feb 26, 2025 · 5 comments

@leedrake5

This is perhaps better served as a feature request, but I can see wide interest in a proper FP8 implementation for the 4090/5090 series of GPUs. Is there any interest in adapting this for sm_89 & sm_100/a? Is it a steep challenge, or is it feasible?

Many thanks for providing this repo. It is a remarkable contribution to the field.

@LyricZhao
Collaborator

I'm not sure how hard it is to add support for other architectures while maximizing performance. We may release a new version if we get other architectures supported; open-source community PRs are also welcome.

@nupam

nupam commented Feb 26, 2025

When OpenCL + RISC-V? :P

@MakiSonomura

I'm working on adapting it to Ada Lovelace.

@leedrake5
Author

leedrake5 commented Mar 4, 2025

An update on my efforts:

The first step is to build CUTLASS within the third_party folder for the intended architecture:

cd ~/GitHub/DeepGEMM/third_party/cutlass
mkdir build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS="89"
make test_unit -j 18

I am using 18 jobs since I have an Intel i9, but do specify a number here: an unbounded -j will consume all available RAM during the tests if you don't tailor it to your system.
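
For example, a safer invocation (a sketch assuming GNU coreutils' nproc; the --ignore flag leaves a couple of cores free):

make test_unit -j"$(nproc --ignore=2)"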

Next, change references from SM90 & sm_90 to SM89 & sm_89 in the following files (a scripted version of the rename follows the list):

~/GitHub/DeepGEMM/deep_gemm/include/deep_gemm/fp8_gemm.cuh
~/GitHub/DeepGEMM/deep_gemm/include/deep_gemm/mma_utils.cuh
~/GitHub/DeepGEMM/deep_gemm/include/deep_gemm/tma_utils.cuh
~/GitHub/DeepGEMM/deep_gemm/jit/compiler.py
~/GitHub/DeepGEMM/deep_gemm/jit_kernels/gemm.py
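
A minimal sketch of the scripted rename (assuming GNU sed; this is a blanket substitution, so review the diff afterwards):

cd ~/GitHub/DeepGEMM
sed -i -e 's/SM90/SM89/g' -e 's/sm_90/sm_89/g' \
    deep_gemm/include/deep_gemm/fp8_gemm.cuh \
    deep_gemm/include/deep_gemm/mma_utils.cuh \
    deep_gemm/include/deep_gemm/tma_utils.cuh \
    deep_gemm/jit/compiler.py \
    deep_gemm/jit_kernels/gemm.py

Note that a blanket substitution also rewrites the cute include paths inside fp8_gemm.cuh, which appears to be what produces the cluster_sm89.hpp error shown further down.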

Then install. The recommended python setup.py install didn't work, but plain pip install did, though this may be connected to the later problems with test_core.py:

cd ~/GitHub/DeepGEMM
pip install .
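
To confirm which installed copy Python picks up (this should match the library path the tests print below):

python -c "import deep_gemm; print(deep_gemm.__file__)"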

Next, run tests. Right now test_jit.py works:

cd ~/GitHub/DeepGEMM/tests
python test_jit.py
NVCC compiler: ('/usr/local/cuda-12.6/bin/nvcc', '12.6')

Generated code:
// DeepGEMM auto-generated JIT CUDA source file

#include <cuda.h>
#include <cuda_fp8.h>
#include <cuda_runtime.h>
#include <iostream>

#include "cutlass/cutlass.h"

extern "C" void launch(void* __raw_lhs, void* __raw_rhs, void* __raw_scale, void* __raw_out, bool enable_double_streams, void* __raw_stream, int& __return_code) {
    // Cast raw types (if needed)
    auto lhs = reinterpret_cast<__nv_fp8_e4m3*>(__raw_lhs);
    auto rhs = reinterpret_cast<__nv_fp8_e4m3*>(__raw_rhs);
    auto scale = reinterpret_cast<float*>(__raw_scale);
    auto out = reinterpret_cast<__nv_bfloat16*>(__raw_out);
    auto stream = reinterpret_cast<cudaStream_t>(__raw_stream);

    std::cout << reinterpret_cast<uint64_t>(lhs) << std::endl;
    std::cout << reinterpret_cast<uint64_t>(rhs) << std::endl;
    std::cout << reinterpret_cast<uint64_t>(scale) << std::endl;
    std::cout << reinterpret_cast<uint64_t>(out) << std::endl;
    std::cout << enable_double_streams << std::endl;
    std::cout << reinterpret_cast<uint64_t>(stream) << std::endl;
}


Building ...
Running ...
JIT test passed

But test_core.py fails:

python test_core.py
Library path:
 > ['~/.local/lib/python3.11/site-packages/deep_gemm']

Testing GEMM:
In file included from ~/.deep_gemm/cache/kernel.gemm_fp8_fp8_bf16_nt.70b5a94ce876/kernel.cu:9:
~/.local/lib/python3.11/site-packages/deep_gemm/jit/../include/deep_gemm/fp8_gemm.cuh:8:10: fatal error: cute/arch/cluster_sm89.hpp: No such file or directory
    8 | #include <cute/arch/cluster_sm89.hpp>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
Traceback (most recent call last):
  File "~/GitHub/DeepGEMM/tests/test_core.py", line 156, in <module>
    test_gemm()
  File "~/GitHub/DeepGEMM/tests/test_core.py", line 70, in test_gemm
    deep_gemm.gemm_fp8_fp8_bf16_nt(x_fp8, y_fp8, out)
  File "~.local/lib/python3.11/site-packages/deep_gemm/jit_kernels/gemm.py", line 156, in gemm_fp8_fp8_bf16_nt
    runtime = jit_tuner.compile_and_tune(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.11/site-packages/deep_gemm/jit_kernels/tuner.py", line 40, in compile_and_tune
    kernels.append((build(name, arg_defs, code), tuned_keys))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.11/site-packages/deep_gemm/jit/compiler.py", line 139, in build
    return_code = subprocess.check_call(command)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/local/cuda-12.6/bin/nvcc', '~/.deep_gemm/cache/kernel.gemm_fp8_fp8_bf16_nt.70b5a94ce876/kernel.cu', '-o', '~/.deep_gemm/tmp/nvcc.tmp.003194d1-f5d3-4a66-a54c-866bb24b6506.fe7708dc26e3.so', '-std=c++17', '-shared', '-O3', '--expt-relaxed-constexpr', '--expt-extended-lambda', '-gencode=arch=compute_89,code=sm_89', '--ptxas-options=--register-usage-level=10', '--diag-suppress=177,174,940', '--compiler-options=-fPIC,-O3,-Wno-deprecated-declarations,-Wno-abi', '-I~/.local/lib/python3.11/site-packages/deep_gemm/jit/../include']' returned non-zero exit status 1.

This is because CUTLASS doesn't ship a cluster_sm89.hpp header analogous to cluster_sm90.hpp, presumably because these GPUs aren't intended to be server workhorses. So it is possible that cluster support won't work at all. But it might be possible to split the test file to see whether it runs on a single GPU. Working on that next. For now, I'm just encouraged that the JIT still works.
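
As a sanity check on what the hardware reports (assuming a driver recent enough to support the compute_cap query field; an RTX 4090 should report 8.9):

nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader

Note that the clusters in question are Hopper's thread-block clusters within a single GPU, not multi-GPU clusters, so the missing header matters even for single-GPU runs.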

My hope, which I'm pursuing right now, is that I just need to rebuild their version of CUTLASS for sm_89 and that these headers will then be available. I'm exploring ways to do that and will report progress, if any.
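
A quick check on whether the missing header is a build product or simply absent from the source tree (hypothetical commands run from the repo root):

find third_party/cutlass -name 'cluster_sm89.hpp'
find third_party/cutlass -name 'cluster_sm90.hpp'

The first should return nothing and the second should find include/cute/arch/cluster_sm90.hpp: the cute headers are checked into the repository rather than generated by the build, so rebuilding CUTLASS for sm_89 would not create an sm_89 variant.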

@leedrake5
Author

An update on my end, and a possibly fatal blocker. In fp8_gemm.cuh, the following headers must be included:

#include <cute/arch/cluster_sm90.hpp>
#include <cute/arch/copy_sm90_desc.hpp>
#include <cute/arch/copy_sm90_tma.hpp>

Now, for Ada Lovelace in particular, one could point to the sm80 versions of some of these files. But even after recompiling CUTLASS, I don't think there are sm_80/sm_89 versions of these headers to build; they are specific to sm_90 and sm_100. So unless a way around this can be found, I think we're stuck.
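
A quick way to see which per-architecture cute headers actually exist (hypothetical command against the third_party checkout):

ls third_party/cutlass/include/cute/arch/ | grep -E 'sm80|sm90'

This should list copy_sm80.hpp and mma_sm80.hpp alongside the sm90 headers, but no sm80 cluster or TMA variants: thread-block clusters and TMA are Hopper hardware features, so sm_80/sm_89 equivalents of cluster_sm90.hpp and copy_sm90_tma.hpp cannot exist.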
