Improve parallelism in RoPE with pos_ids #609

nandor · 2024-11-13T16:13:23Z

The previous kernel was not parallelised sufficiently well for low batch sizes. Similarly to the regular rotary kernel, now all qo/kv heads are split across separate blocks.

In decode mode, the pos_ids kernel is now faster.
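As a rough illustration of why splitting heads across blocks helps at low batch sizes, the sketch below counts the thread blocks each mapping would launch. The function name, the tile size of 4 tokens per block, and the 132-SM figure (an H100 SXM) are assumptions for illustration, not values taken from the kernel.

```python
import math

def grid_blocks(nnz, num_qo_heads, num_kv_heads, tokens_per_block, head_parallel):
    """Count thread blocks launched for nnz tokens (batch_size * append_len)."""
    token_blocks = math.ceil(nnz / tokens_per_block)
    if head_parallel:
        # One block per (token tile, head) pair.
        return token_blocks * (num_qo_heads + num_kv_heads)
    # One block per token tile; that block loops over all heads itself.
    return token_blocks

if __name__ == "__main__":
    NUM_SMS = 132         # assumption: H100 SXM
    TOKENS_PER_BLOCK = 4  # hypothetical tile size
    for nnz in (1, 19, 128):
        shared = grid_blocks(nnz, 32, 8, TOKENS_PER_BLOCK, head_parallel=False)
        split = grid_blocks(nnz, 32, 8, TOKENS_PER_BLOCK, head_parallel=True)
        print(f"nnz={nnz}: {shared} blocks without head split, "
              f"{split} with it, {NUM_SMS} SMs available")
```

Under these assumptions a single decode token launches one block without the head split, so most SMs sit idle. With the split it launches 40.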


yzh119 · 2024-11-13T19:38:11Z

Hi @nandor, we use this parallelism scheme mainly to save sin/cos computation time (the same sin/cos values can be reused across multiple heads).
That said, I expect that using a different threadblock for each head will be faster for small batch sizes.

Would you mind running https://github.com/flashinfer-ai/flashinfer/blob/32d9510d67187f1f3a379cce81302cdd15a557d2/benchmarks/bench_rope.py ?
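The trade-off above can be sketched in plain Python: the rotation angles depend only on the token position, so they can be computed once and shared by every head, or recomputed per head when heads run in independent blocks. This is a minimal reference, assuming the non-interleaved ("rotate-half") layout and a base of 10000; the function names are hypothetical and this is not the FlashInfer kernel.

```python
import math

def rope_angles(pos, head_dim, theta=10000.0):
    """cos/sin tables for one token position (depends only on pos)."""
    half = head_dim // 2
    freqs = [theta ** (-2.0 * i / head_dim) for i in range(half)]
    return [math.cos(pos * f) for f in freqs], [math.sin(pos * f) for f in freqs]

def rotate_head(x, cos, sin):
    """Apply the rotation to one head vector, pairing x[i] with x[i + half]."""
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        a, b = x[i], x[i + half]
        out[i] = a * cos[i] - b * sin[i]
        out[i + half] = b * cos[i] + a * sin[i]
    return out

def rope_shared(heads, pos):
    """One sin/cos evaluation per token, reused by every head."""
    cos, sin = rope_angles(pos, len(heads[0]))
    return [rotate_head(h, cos, sin) for h in heads]

def rope_per_head(heads, pos):
    """Each head recomputes sin/cos, as independent blocks would."""
    return [rotate_head(h, *rope_angles(pos, len(h))) for h in heads]
```

Both variants produce the same result. The first does `head_dim / 2` sin/cos evaluations per token, the second does that many per head, which is the extra work the head-parallel mapping trades for more blocks in flight.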

nandor · 2024-11-13T23:23:24Z

You are right - saving sin and cos across 2-8 heads does yield a small speedup. But the finer-grained computation is significantly faster on an H100 already.

Unfortunately, this sort of batching is a bit more convoluted to implement in CUDA than in Triton, and internally we'll be relying on a Triton kernel instead.

yzh119

Sounds good, thank you!

> this sort of batching is a bit more convoluted to implement in CUDA than triton and internally we'll be relying on a Triton kernel instead.

Yes, I agree. We plan to port most of the kernels (except for sampling and attention) to Triton in v0.2.1 :)


james-p-xu · 2024-11-16T05:48:46Z

I'm actually seeing that this change causes a correctness issue wrt apply_rope_pos_ids.

Here's a sample comparison script, passing prior to this commit hash (32d9510) but failing post-change: https://github.com/sgl-project/sglang/blob/dd0d2a3af4967880362e3bad9d95cd14572c89ea/scripts/playground/compare_flashinfer_vllm_rope.py

yzh119 · 2024-11-16T07:44:33Z

@james-p-xu I'll fix it, thank you!


As observed by @james-p-xu, #609 produces wrong results for some input shapes. This PR fixes the correctness issue and adds an optimization that dispatches to different parallelism modes depending on the input shape.

For large inputs, the original implementation (re-using sin/cos across heads) is better. For small inputs, head parallelism is better.

Some results:

```
Before #609 (no head-parallelism, re-use sin/cos value)
-----------------
batch_size: 1, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 27us, throughput: 0.762GB/s
batch_size: 1, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 22us, throughput: 0.919GB/s
batch_size: 1, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 27us, throughput: 95.699GB/s
batch_size: 1, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 28us, throughput: 95.244GB/s
batch_size: 1, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 31us, throughput: 670.254GB/s
batch_size: 1, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 31us, throughput: 667.253GB/s
---
batch_size: 19, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 27us, throughput: 14.490GB/s
batch_size: 19, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 27us, throughput: 14.466GB/s
batch_size: 19, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 37us, throughput: 1344.086GB/s
batch_size: 19, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 37us, throughput: 1344.902GB/s
batch_size: 19, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 148us, throughput: 2699.475GB/s
batch_size: 19, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 147us, throughput: 2701.897GB/s
---
batch_size: 99, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 27us, throughput: 74.322GB/s
batch_size: 99, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 27us, throughput: 74.568GB/s
batch_size: 99, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 110us, throughput: 2352.352GB/s
batch_size: 99, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 110us, throughput: 2365.580GB/s
batch_size: 99, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 718us, throughput: 2893.608GB/s
batch_size: 99, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 717us, throughput: 2894.859GB/s
---
batch_size: 128, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 27us, throughput: 95.373GB/s
batch_size: 128, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 27us, throughput: 95.810GB/s
batch_size: 128, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 130us, throughput: 2583.872GB/s
batch_size: 128, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 129us, throughput: 2595.944GB/s
batch_size: 128, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 923us, throughput: 2907.408GB/s
batch_size: 128, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 924us, throughput: 2905.533GB/s

Head parallelism only (no dispatch)
---------------------
batch_size: 1, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 6us, throughput: 3.321GB/s
batch_size: 1, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 6us, throughput: 3.391GB/s
batch_size: 1, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 7us, throughput: 358.862GB/s
batch_size: 1, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 7us, throughput: 362.361GB/s
batch_size: 1, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 15us, throughput: 1413.175GB/s
batch_size: 1, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 15us, throughput: 1437.332GB/s
---
batch_size: 19, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 6us, throughput: 60.526GB/s
batch_size: 19, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 6us, throughput: 60.127GB/s
batch_size: 19, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 26us, throughput: 1897.923GB/s
batch_size: 19, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 24us, throughput: 2050.075GB/s
batch_size: 19, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 164us, throughput: 2431.650GB/s
batch_size: 19, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 147us, throughput: 2709.333GB/s
---
batch_size: 99, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 7us, throughput: 284.641GB/s
batch_size: 99, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 7us, throughput: 302.815GB/s
batch_size: 99, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 109us, throughput: 2391.712GB/s
batch_size: 99, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 97us, throughput: 2671.150GB/s
batch_size: 99, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 860us, throughput: 2413.211GB/s
batch_size: 99, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 828us, throughput: 2508.817GB/s
---
batch_size: 128, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 7us, throughput: 349.795GB/s
batch_size: 128, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 7us, throughput: 376.624GB/s
batch_size: 128, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 139us, throughput: 2413.690GB/s
batch_size: 128, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 124us, throughput: 2705.994GB/s
batch_size: 128, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 1110us, throughput: 2417.480GB/s
batch_size: 128, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 1063us, throughput: 2525.976GB/s

This PR (shape dispatch)
---------------------
batch_size: 1, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 28us, throughput: 0.728GB/s
batch_size: 1, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 6us, throughput: 3.451GB/s
batch_size: 1, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 7us, throughput: 359.759GB/s
batch_size: 1, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 7us, throughput: 361.286GB/s
batch_size: 1, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 15us, throughput: 1426.267GB/s
batch_size: 1, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 15us, throughput: 1433.691GB/s
---
batch_size: 19, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 6us, throughput: 60.390GB/s
batch_size: 19, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 6us, throughput: 59.937GB/s
batch_size: 19, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 26us, throughput: 1892.575GB/s
batch_size: 19, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 24us, throughput: 2049.735GB/s
batch_size: 19, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 148us, throughput: 2698.780GB/s
batch_size: 19, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 147us, throughput: 2701.558GB/s
---
batch_size: 99, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 7us, throughput: 285.335GB/s
batch_size: 99, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 7us, throughput: 303.373GB/s
batch_size: 99, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 110us, throughput: 2351.126GB/s
batch_size: 99, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 110us, throughput: 2362.898GB/s
batch_size: 99, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 717us, throughput: 2893.713GB/s
batch_size: 99, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 717us, throughput: 2894.902GB/s
---
batch_size: 128, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 7us, throughput: 350.720GB/s
batch_size: 128, append_len: 1, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 7us, throughput: 376.690GB/s
batch_size: 128, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 130us, throughput: 2584.221GB/s
batch_size: 128, append_len: 128, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 129us, throughput: 2596.612GB/s
batch_size: 128, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: False, latency: 924us, throughput: 2906.480GB/s
batch_size: 128, append_len: 1024, num_qo_heads: 32, num_kv_heads: 8, head_dim: 128, use_cos_sin_cache: True, latency: 924us, throughput: 2905.134GB/s
```

cc @nandor @james-p-xu
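The throughput figures in these results appear consistent with counting one read and one write of every q and k element at 2 bytes each. That accounting is inferred from the reported numbers, not taken from the benchmark script, so treat the sketch below as an assumption; the function names are hypothetical.

```python
def rope_bytes_moved(batch_size, append_len, num_qo_heads, num_kv_heads,
                     head_dim, dtype_bytes=2):
    """Bytes read plus bytes written, assuming q and k are each touched once."""
    nnz = batch_size * append_len
    elems = nnz * (num_qo_heads + num_kv_heads) * head_dim
    return 2 * elems * dtype_bytes  # factor 2: one read and one write

def throughput_gb_s(bytes_moved, latency_us):
    """Convert bytes and microseconds to GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / (latency_us * 1e-6) / 1e9

if __name__ == "__main__":
    moved = rope_bytes_moved(128, 1024, 32, 8, 128)
    # Reported: latency 923us, throughput 2907.408GB/s (latency is rounded).
    print(f"{throughput_gb_s(moved, 923):.1f} GB/s")
```

Read this way, a single decode token moves only 20480 bytes, so the `append_len: 1` rows measure launch and scheduling overhead rather than memory bandwidth.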

🤖 I have created a release *beep* *boop*

---

## [0.2.0](v0.1.6...v0.2.0) (2024-12-17)

[Release Blog](https://flashinfer.ai/2024/12/16/flashinfer-v02-release.html).

### Features

* add `rotary_dim` argument to rope APIs for partial apply rope ([#599](#599)) ([eb9bc71](eb9bc71))
* add a `use_softmax` field in variant class ([#533](#533)) ([d81af97](d81af97))
* add an option `non_blocking` to plan function ([#622](#622)) ([560af6f](560af6f))
* add gemma_rmsnorm and gemma_fused_add_rmsnorm ([#477](#477)) ([1a6b17e](1a6b17e))
* add group size 3 to GQA decode dispatch ([#558](#558)) ([6227562](6227562))
* add JIT compilation support for FA3 templates ([#672](#672)) ([d4e8d79](d4e8d79))
* allow the cascade kernels to be executed using varying sequence lenghts ([#627](#627)) ([92ac440](92ac440))
* CUDAGraph compatibility of multi-level cascade inference APIs ([#586](#586)) ([2332e8a](2332e8a))
* fix the maximal grid dimension in prefill planning with CUDA graphs ([#639](#639)) ([86ca89a](86ca89a))
* improve the precision of the FusedAddRMSNormKernel function ([#587](#587)) ([c7dc921](c7dc921))
* JIT compilation ([#507](#507)) ([3613a5b](3613a5b))
* modify group-gemm stage number ([#497](#497)) ([52dab1d](52dab1d))
* non-contiguous query with paged kv cache ([#553](#553)) ([89f2c4a](89f2c4a))
* pass a dynamic token count to the cascade kernels ([#635](#635)) ([5fe9f7d](5fe9f7d))
* simplify prefill JIT compilation ([#605](#605)) ([fe4f898](fe4f898))
* specify gemm backend ([#648](#648)) ([0cc1a51](0cc1a51))
* support cached cos/sin in rope APIs ([#585](#585)) ([83e541d](83e541d))
* support huggingface transformer style rope interface ([#568](#568)) ([4f40420](4f40420))
* support sm90 cutlass group gemm ([#509](#509)) ([794bdda](794bdda))
* torch custom_op fix for rope ([#569](#569)) ([3e104bc](3e104bc))
* torch custom_op support: norm ([#552](#552)) ([f6e0010](f6e0010))
* torch.compile and custom_op support ([#554](#554)) ([9bf916f](9bf916f))
* warmup for jit kernel tests ([#629](#629)) ([8f5f349](8f5f349))

### Bug Fixes

* AOT compiler flags on non-sm90 ([#522](#522)) ([0aa4726](0aa4726))
* batch decode kernel redundant store output to gmem ([#505](#505)) ([90e42a7](90e42a7))
* compatible with torch 2.2 ([#478](#478)) ([ac41d1b](ac41d1b))
* #452 ([b53a46f](b53a46f))
* remove redundant load ([#495](#495)) ([2de16b0](2de16b0))
* update bmm fp8 test ([#487](#487)) ([45eac04](45eac04))

### Performance Improvements

* accelerate JIT compilation speed ([#618](#618)) ([eaf73fd](eaf73fd))
* Dense and sparse customizable flashattention-3 template ([#667](#667)) ([51236c9](51236c9))
* fix prefill kernel performance degradation (step 1) ([#602](#602)) ([595cf60](595cf60))
* fix the performance issue of `append_paged_kv_cache` ([#588](#588)) ([e15f7c9](e15f7c9))
* improve parallelism in RoPE with pos_ids ([#609](#609)) ([ff05155](ff05155))
* improve plan performance by using non-blocking memcpy ([#547](#547)) ([41ebe6d](41ebe6d))
* reduce the read and write of shared memory in the FusedAddRMSNormKernel ([#592](#592)) ([2043ca2](2043ca2))
* reduce total_num_tiles_q by one ([#644](#644)) ([553ace5](553ace5))
* remove unnecessary contiguous operation in block sparse attention ([#561](#561)) ([7a7ad46](7a7ad46))
* speedup jit compilation of prefill attention kernels ([#632](#632)) ([a059586](a059586))
* use cuda-core implemention for io-bound block-sparse attention ([#560](#560)) ([3fbf028](3fbf028))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <[email protected]>

Improve parallelism in RoPE with pos_ids

afe3318

The previous kernel was not parallelised sufficiently well for low batch sizes. Similarly to the regular rotary kernel, now all qo/kv heads are split across separate blocks. In decode mode, the pos_ids kernel is now faster.

abcdabcd987 requested a review from yzh119 November 13, 2024 16:59

yzh119 approved these changes Nov 14, 2024


yzh119 merged commit ff05155 into flashinfer-ai:main Nov 14, 2024

github-actions bot mentioned this pull request Nov 14, 2024

chore(main): release 0.2.0 #476

Merged

james-p-xu added a commit to james-p-xu/sglang that referenced this pull request Nov 16, 2024

[REMOVE ME] Revert FlashInfer version

0e2c8bc

flashinfer-ai/flashinfer#609 potentially introduces correctness issues

yzh119 added a commit that referenced this pull request Nov 20, 2024

Revert "perf: improve parallelism in RoPE with pos_ids (#609)"

478db8c

This reverts commit ff05155.

yzh119 mentioned this pull request Nov 20, 2024

bugfix: fix the rope correctness issue introduced in #609 #619

Merged

github-actions bot mentioned this pull request Dec 25, 2024

chore(main): release 0.3.0 #698

Closed
