
Fix bug for asort kernel & faster sampler with GPU sorting #2730

Open · wants to merge 3 commits into main

Conversation

guoqingbao
Contributor

This PR addresses an issue with the asort kernel, which could trigger a CUDA invalid-argument error when the size of the last dimension to be sorted exceeded the number of threads allowed per CUDA block. With the revised asort kernel, logit sampling can now be partially performed on the GPU (specifically, the sorting step). This optimization improves text generation performance by up to 25%, depending on the model and the number of tokens generated.
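For context, the sort is the expensive part of top-p (nucleus) sampling, which is why offloading it to the GPU helps. Below is a minimal plain-Rust sketch of the idea (a hypothetical helper, not candle's actual sampler; in this PR only the argsort step runs on the GPU, while the cumulative filtering stays on the CPU):

```rust
// Sketch of top-p (nucleus) filtering after a sort.
// Hypothetical helper for illustration, not candle's implementation.
fn top_p_indices(probs: &[f32], top_p: f32) -> Vec<usize> {
    // Argsort descending by probability (the step the GPU kernel performs).
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());

    // Keep the smallest prefix whose cumulative probability reaches top_p.
    let mut cum = 0.0f32;
    let mut kept = Vec::new();
    for &i in &idx {
        kept.push(i);
        cum += probs[i];
        if cum >= top_p {
            break;
        }
    }
    kept
}

fn main() {
    let probs = [0.1f32, 0.5, 0.05, 0.3, 0.05];
    println!("{:?}", top_p_indices(&probs, 0.8)); // [1, 3]
}
```

Since the vocabulary of a modern LLM is tens of thousands of tokens, sorting it on the CPU every step is a noticeable per-token cost, which matches the speedups reported below.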

Testing Cases

I conducted tests using the following models and commands:

GLM4-9B

cargo run --release --example glm4 --features cuda -- \
  --weight-path /home/glm-4-9b-chat/ \
  --prompt "Please talk about deep learning." \
  --sample-len 2000

Note: By default, the GLM4 model uses sampling parameters of temperature=0.8 and top_p=0.8.

LLaMa3.1 8B

cargo run --release --example llama --features cuda -- \
  --weight-path /home/Meta-Llama-3.1-8B-Instruct/ \
  --prompt "Please talk about deep learning." \
  --temperature 0.7 \
  --top-p 0.8 \
  --sample-len 2000

Performance Results

Original Implementation (Sampling Entirely on CPU)

GLM4-9B:

  • 100 tokens generated (26.52 tokens/s)
  • 500 tokens generated (25.43 tokens/s)
  • 2000 tokens generated (20.41 tokens/s)

LLaMa3.1 8B:

  • 100 tokens generated (40.14 tokens/s)
  • 500 tokens generated (38.62 tokens/s)
  • 607 tokens generated (38.28 tokens/s)
  • 2000 tokens generated (34.12 tokens/s)

After asort Kernel Fix & Faster Sampler with GPU Sort

GLM4-9B:

  • 100 tokens generated (29.93 tokens/s)
  • 500 tokens generated (28.52 tokens/s)
  • 2000 tokens generated (22.30 tokens/s)

LLaMa3.1 8B:

  • 100 tokens generated (50.68 tokens/s)
  • 500 tokens generated (48.81 tokens/s)
  • 712 tokens generated (47.88 tokens/s)
  • 2000 tokens generated (41.84 tokens/s)

Note: You can adjust the temperature parameter to influence the model's behavior so that generation reaches the full sample-len in the above comparisons. While sample-len sets the maximum number of tokens that can be generated, the actual output length may vary depending on the model (as well as the sampling process, parameters, etc.).

@LaurentMazare
Collaborator

I'm not sure I understand the changes properly, and when testing them they seem to produce weird results. If you run the tests with the cuda feature enabled, this will actually fail. Did you try it?

$ cargo test --features cuda
---- asort_gpu stdout ----
thread 'asort_gpu' panicked at candle-core/tests/tensor_tests.rs:133:5:
assertion `left == right` failed
  left: [[1, 3, 0, 2, 4], [1, 0, 2, 3, 4]]
 right: [[1, 3, 0, 2, 4], [1, 4, 0, 2, 3]]

Below is another example I tried where the results seem wrong. Fwiw, the original implementation comes from llama.cpp's argsort.cu, so if you get the kernel to work on arbitrary sizes, I would suggest also making a PR there, as it would be of interest to them.

Finally, could you make only the kernel changes in this PR, and defer the sampling bits to a PR for later once this has stabilized a bit?

use anyhow::Result;
use candle_core::{Device, Tensor};

fn main() -> Result<()> {
    let dev = Device::new_cuda(0)?;
    let values: Vec<u32> = (0..10u32).rev().collect();
    let a = Tensor::new(values, &dev)?;
    let a = a.arg_sort_last_dim(true)?;
    let a = a.to_vec1::<u32>()?;
    println!("{a:?}");
    Ok(())
}

@guoqingbao
Contributor Author

Thanks for the comments. I'm currently working on a fix.

@guoqingbao
Contributor Author

Hi @LaurentMazare,
I have fixed the bug in the asort kernel. The issue was caused by the following factors:

  1. Limited shared memory: the available shared memory is insufficient to cache even a single row of the input array (at most ~40k f32 elements), which complicates thread synchronization.
  2. Improper global memory access: bitonic sort requires the input size to be a power of 2, and this was not handled correctly.
  3. Padding issues: some values were not padded appropriately, which bitonic sort requires by nature.

These issues have been addressed by creating temporary row and index buffers (padded to a power-of-2 size) and copying the valid elements back.
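The padding strategy described above can be sketched as follows, in plain Rust rather than CUDA (names here are hypothetical; the real fix lives in the asort kernel): each row is padded to the next power of two with a sentinel that sorts to the end, so the bitonic compare-exchange network stays valid and the first n outputs are exactly the valid elements.

```rust
// Illustration of power-of-2 padding for bitonic sort.
// Hypothetical helper; the actual fix is in the CUDA asort kernel.
fn bitonic_pad(row: &[f32], ascending: bool) -> (Vec<f32>, usize) {
    let n = row.len();
    // Bitonic sort only works on power-of-two input sizes.
    let padded = n.next_power_of_two();
    // Pad with a sentinel that sorts past all valid values, so the
    // first `n` elements after sorting are exactly the valid results.
    let sentinel = if ascending { f32::INFINITY } else { f32::NEG_INFINITY };
    let mut buf = vec![sentinel; padded];
    buf[..n].copy_from_slice(row);
    (buf, n)
}

fn main() {
    let (buf, valid) = bitonic_pad(&[3.0, 1.0, 2.0], true);
    println!("{} {}", buf.len(), valid); // padded length 4, 3 valid elements
}
```

After the sort, only the first `valid` entries are copied back to the output, which is the "copying the valid elements back" step described above.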

@guoqingbao guoqingbao reopened this Jan 27, 2025