File "/home/ray/default/vllm/./benchmarks/kernels/benchmark_moe.py", line 70, in run
fused_moe(
File "/tmp/ray/session_2024-06-27_10-06-48_118980_5595/runtime_resources/working_dir_files/_ray_pkg_ef0e5109bc8b4140628503119c10e0b2c9ea3f17/vllm/model_executor/layers/fused_moe/fused_moe.py", line 519, in fused_moe
return fused_experts(hidden_states,
File "/tmp/ray/session_2024-06-27_10-06-48_118980_5595/runtime_resources/working_dir_files/_ray_pkg_ef0e5109bc8b4140628503119c10e0b2c9ea3f17/vllm/model_executor/layers/fused_moe/fused_moe.py", line 449, in fused_experts
invoke_fused_moe_kernel(intermediate_cache2,
File "/tmp/ray/session_2024-06-27_10-06-48_118980_5595/runtime_resources/working_dir_files/_ray_pkg_ef0e5109bc8b4140628503119c10e0b2c9ea3f17/vllm/model_executor/layers/fused_moe/fused_moe.py", line 245, in invoke_fused_moe_kernel
fused_moe_kernel[grid](
File "/home/ray/anaconda3/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
File "/home/ray/anaconda3/lib/python3.10/site-packages/triton/runtime/jit.py", line 425, in run
kernel.run(grid_0, grid_1, grid_2, kernel.num_warps, kernel.num_ctas, # number of warps/ctas per instance
File "/home/ray/anaconda3/lib/python3.10/site-packages/triton/compiler/compiler.py", line 255, in __getattribute__
self._init_handles()
File "/home/ray/anaconda3/lib/python3.10/site-packages/triton/compiler/compiler.py", line 250, in _init_handles
self.module, self.function, self.n_regs, self.n_spills = driver.utils.load_binary(
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Your current environment
🐛 Describe the bug
Illegal memory access in the MoE Triton kernel when the workload (e.g., batch size) is too large. To reproduce:
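The exact command is not preserved in this copy of the issue; the traceback above shows it was hit while running ./benchmarks/kernels/benchmark_moe.py. As a stand-in, here is a minimal, hypothetical Python sketch of the same failure mode: invoking fused_moe directly with a very large number of tokens. All shapes, the dtype, and the top-k value below are illustrative assumptions, not values taken from this issue.

```python
# Hypothetical repro sketch (assumed shapes; not the original benchmark command).
import torch

from vllm.model_executor.layers.fused_moe import fused_moe

num_tokens = 65536         # deliberately large "batch size" (number of tokens)
hidden_size = 4096         # assumed model hidden size
intermediate_size = 14336  # assumed MoE intermediate size
num_experts = 8            # assumed number of experts
topk = 2                   # assumed top-k routing

# Activations and expert weights in the layout fused_moe expects:
# w1: [num_experts, 2 * intermediate_size, hidden_size]
# w2: [num_experts, hidden_size, intermediate_size]
hidden_states = torch.randn(num_tokens, hidden_size, dtype=torch.float16, device="cuda")
w1 = torch.randn(num_experts, 2 * intermediate_size, hidden_size, dtype=torch.float16, device="cuda")
w2 = torch.randn(num_experts, hidden_size, intermediate_size, dtype=torch.float16, device="cuda")
gating_output = torch.randn(num_tokens, num_experts, dtype=torch.float16, device="cuda")

out = fused_moe(hidden_states, w1, w2, gating_output, topk, renormalize=True)
torch.cuda.synchronize()  # the illegal memory access surfaces when the Triton kernel runs
```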
Output (the full traceback is shown above):
We have seen this problem on L4 and A100 GPUs. I also tried to tune this particular workload using different block sizes, but none of the configs could bypass the error. Since we usually don't use such a large batch size (number of tokens), this bug should not be critical at least for now.
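For reference, the block sizes mentioned here are the Triton launch parameters that benchmark_moe.py tunes over and that vLLM stores per shape under vllm/model_executor/layers/fused_moe/configs/. A single config entry looks roughly like the sketch below; the specific values are illustrative assumptions, not the configurations that were actually tried.

```python
# Illustrative fused MoE kernel config (assumed values, for context only).
# Entries of this form are what the tuner searches over; per the report above,
# no candidate avoided the illegal memory access for this workload.
example_config = {
    "BLOCK_SIZE_M": 64,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 32,
    "GROUP_SIZE_M": 8,
    "num_warps": 4,
    "num_stages": 4,
}
```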
Also cc @pcmoritz @WoosukKwon @Yard1