[Bug]: enable_prefix_caching cause a triron crash #6099

sweetning0809 · 2024-07-03T09:10:43Z

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9)
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.17

Python version: 3.9.16 (main, Jul 10 2023, 11:13:07) [GCC 8.3.1 20190311 (Red Hat 8.3.1-3)] (64-bit runtime)
Python platform: Linux-4.18.0-147.20200626.413.el8_1.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB

Nvidia driver version: 470.103.01
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.9.2
/usr/lib64/libcudnn_adv_infer.so.8.9.2
/usr/lib64/libcudnn_adv_train.so.8.9.2
/usr/lib64/libcudnn_cnn_infer.so.8.9.2
/usr/lib64/libcudnn_cnn_train.so.8.9.2
/usr/lib64/libcudnn_ops_infer.so.8.9.2
/usr/lib64/libcudnn_ops_train.so.8.9.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 192
On-line CPU(s) list: 0-45
Off-line CPU(s) list: 46-191
Thread(s) per core: 0
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7642 48-Core Processor
Stepping: 0
Frequency boost: enabled
CPU MHz: 3291.355
CPU max MHz: 2300.0000
CPU min MHz: 1500.0000
BogoMIPS: 4591.38
Virtualization: AMD-V
L1d cache: 1.5 MiB
L1i cache: 1.5 MiB
L2 cache: 24 MiB
L3 cache: 256 MiB
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.25.0
[pip3] nvidia-nccl-cu11==2.20.5
[pip3] onnx==1.12.0
[pip3] onnx-graphsurgeon==0.3.12
[pip3] onnxruntime==1.15.1
[pip3] torch==2.3.0+cu118
[pip3] torchvision==0.14.1
[pip3] transformers==4.41.1
[pip3] triton==2.3.0
[pip3] vllm-nccl-cu11==2.18.1.0.4.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags:
CUDA Archs: ; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7 CPU Affinity NUMA Affinity
GPU0 X NV12 PXB PXB NODE NODE SYS SYS SYS SYS 0-47,96-143 0
GPU1 NV12 X PXB PXB NODE NODE SYS SYS SYS SYS 0-47,96-143 0
mlx5_0 PXB PXB X PIX NODE NODE SYS SYS SYS SYS
mlx5_1 PXB PXB PIX X NODE NODE SYS SYS SYS SYS
mlx5_2 NODE NODE NODE NODE X PIX SYS SYS SYS SYS
mlx5_3 NODE NODE NODE NODE PIX X SYS SYS SYS SYS
mlx5_4 SYS SYS SYS SYS SYS SYS X PIX NODE NODE
mlx5_5 SYS SYS SYS SYS SYS SYS PIX X NODE NODE
mlx5_6 SYS SYS SYS SYS SYS SYS NODE NODE X PIX
mlx5_7 SYS SYS SYS SYS SYS SYS NODE NODE PIX X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

Our model is llama-33B. When we add enable_prefix_caching=True to LLM(**kwargs) with batch_size = 100, it raises the error below, which never happens without enable_prefix_caching.
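A minimal reproduction sketch for the setup described above, in case it helps others try it. The model path, the shared-prefix prompts, and the helper names (`build_llm_kwargs`, `build_prompts`, `run_once`) are assumptions for illustration; the original workload is not shown in this report. The `tensor_parallel_size` and `gpu_memory_utilization` defaults are the values mentioned elsewhere in this thread.

```python
from typing import Any, Dict, List


def build_llm_kwargs(model_path: str,
                     enable_prefix_caching: bool,
                     tensor_parallel_size: int = 2,
                     gpu_memory_utilization: float = 0.95) -> Dict[str, Any]:
    """Collect the engine arguments mentioned in this report."""
    return {
        "model": model_path,
        "tensor_parallel_size": tensor_parallel_size,
        "gpu_memory_utilization": gpu_memory_utilization,
        "enable_prefix_caching": enable_prefix_caching,
    }


def build_prompts(shared_prefix: str, batch_size: int) -> List[str]:
    """Hypothetical workload: prompts that share one prefix so the cache is hit."""
    return [f"{shared_prefix}\nQuestion {i}: ..." for i in range(batch_size)]


def run_once(model_path: str, enable_prefix_caching: bool, batch_size: int = 100):
    """Run one batch; requires vLLM and GPUs, so the import is done lazily."""
    from vllm import LLM, SamplingParams

    llm = LLM(**build_llm_kwargs(model_path, enable_prefix_caching))
    prompts = build_prompts("You are a helpful assistant. " * 50, batch_size)
    return llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=32))
```

Calling `run_once(path, True)` and `run_once(path, False)` in separate processes would compare the two configurations; whether this particular prompt shape triggers the crash is not confirmed.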

ERROR 07-02 16:38:17 worker_base.py:145] Error executing method execute_model. This might cause deadlock in distributed execution.
ERROR 07-02 16:38:17 worker_base.py:145] Traceback (most recent call last):
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/vllm-0.4.2+cu118-py3.9-linux-x86_64.egg/vllm/worker/worker_base.py", line 137, in execute_method
ERROR 07-02 16:38:17 worker_base.py:145] return executor(*args, **kwargs)
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 07-02 16:38:17 worker_base.py:145] return func(*args, **kwargs)
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/vllm-0.4.2+cu118-py3.9-linux-x86_64.egg/vllm/worker/worker.py", line 262, in execute_model
ERROR 07-02 16:38:17 worker_base.py:145] output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 07-02 16:38:17 worker_base.py:145] return func(*args, **kwargs)
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/vllm-0.4.2+cu118-py3.9-linux-x86_64.egg/vllm/worker/model_runner.py", line 793, in execute_model
ERROR 07-02 16:38:17 worker_base.py:145] hidden_states = model_executable(**execute_model_kwargs)
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-02 16:38:17 worker_base.py:145] return self._call_impl(*args, **kwargs)
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-02 16:38:17 worker_base.py:145] return forward_call(*args, **kwargs)
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/vllm-0.4.2+cu118-py3.9-linux-x86_64.egg/vllm/model_executor/models/llama.py", line 364, in forward
ERROR 07-02 16:38:17 worker_base.py:145] hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-02 16:38:17 worker_base.py:145] return self._call_impl(*args, **kwargs)
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-02 16:38:17 worker_base.py:145] return forward_call(*args, **kwargs)
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/vllm-0.4.2+cu118-py3.9-linux-x86_64.egg/vllm/model_executor/models/llama.py", line 291, in forward
ERROR 07-02 16:38:17 worker_base.py:145] hidden_states, residual = layer(
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-02 16:38:17 worker_base.py:145] return self._call_impl(*args, **kwargs)
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-02 16:38:17 worker_base.py:145] return forward_call(*args, **kwargs)
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/vllm-0.4.2+cu118-py3.9-linux-x86_64.egg/vllm/model_executor/models/llama.py", line 233, in forward
ERROR 07-02 16:38:17 worker_base.py:145] hidden_states = self.self_attn(
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-02 16:38:17 worker_base.py:145] return self._call_impl(*args, **kwargs)
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-02 16:38:17 worker_base.py:145] return forward_call(*args, **kwargs)
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/vllm-0.4.2+cu118-py3.9-linux-x86_64.egg/vllm/model_executor/models/llama.py", line 167, in forward
ERROR 07-02 16:38:17 worker_base.py:145] attn_output = self.attn(q, k, v, kv_cache, attn_metadata,
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-02 16:38:17 worker_base.py:145] return self._call_impl(*args, **kwargs)
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-02 16:38:17 worker_base.py:145] return forward_call(*args, **kwargs)
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/vllm-0.4.2+cu118-py3.9-linux-x86_64.egg/vllm/attention/layer.py", line 48, in forward
ERROR 07-02 16:38:17 worker_base.py:145] return self.impl.forward(query, key, value, kv_cache, attn_metadata,
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/vllm-0.4.2+cu118-py3.9-linux-x86_64.egg/vllm/attention/backends/xformers.py", line 240, in forward
ERROR 07-02 16:38:17 worker_base.py:145] out = PagedAttention.forward_prefix(
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/vllm-0.4.2+cu118-py3.9-linux-x86_64.egg/vllm/attention/ops/paged_attn.py", line 177, in forward_prefix
ERROR 07-02 16:38:17 worker_base.py:145] context_attention_fwd(
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 07-02 16:38:17 worker_base.py:145] return func(*args, **kwargs)
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/vllm-0.4.2+cu118-py3.9-linux-x86_64.egg/vllm/attention/ops/prefix_prefill.py", line 753, in context_attention_fwd
ERROR 07-02 16:38:17 worker_base.py:145] _fwd_kernel[grid](
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/triton/runtime/jit.py", line 167, in <lambda>
ERROR 07-02 16:38:17 worker_base.py:145] return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/triton/runtime/jit.py", line 425, in run
ERROR 07-02 16:38:17 worker_base.py:145] kernel.run(grid_0, grid_1, grid_2, kernel.num_warps, kernel.num_ctas, # number of warps/ctas per instance
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/triton/compiler/compiler.py", line 255, in __getattribute__
ERROR 07-02 16:38:17 worker_base.py:145] self._init_handles()
ERROR 07-02 16:38:17 worker_base.py:145] File "/usr/local/lib/python3.9/site-packages/triton/compiler/compiler.py", line 250, in _init_handles
ERROR 07-02 16:38:17 worker_base.py:145] self.module, self.function, self.n_regs, self.n_spills = driver.utils.load_binary(
ERROR 07-02 16:38:17 worker_base.py:145] RuntimeError: Triton Error [CUDA]: device kernel image is invalid
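The traceback above is easier to read with the logger prefix removed. A small sketch that strips it, assuming every line uses the `LEVEL MM-DD HH:MM:SS file.py:line]` prefix seen in this log; `strip_log_prefix` is a hypothetical helper, not part of vLLM.

```python
import re
from typing import Iterable, List

# Matches prefixes such as "ERROR 07-02 16:38:17 worker_base.py:145] ".
LOG_PREFIX = re.compile(
    r"^(?:ERROR|WARNING|INFO|DEBUG) \d{2}-\d{2} \d{2}:\d{2}:\d{2} \S+:\d+\] ?"
)


def strip_log_prefix(lines: Iterable[str]) -> List[str]:
    """Remove the logger prefix from each line; other lines pass through unchanged."""
    return [LOG_PREFIX.sub("", line) for line in lines]
```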

sweetning0809 · 2024-07-03T09:12:54Z

I also tested a smaller batch size of 10, which works normally.
With tp = 2, tp = 4, and gpu_memory_utilization = 0.95, the same error occurs.
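Since a batch size of 10 works and 100 fails, a binary search between the two could narrow down where the crash starts. This is only a sketch: `crashes` is a hypothetical callback supplied by the caller, and the search assumes the failure is monotonic in batch size, which may not hold for this bug.

```python
from typing import Callable


def smallest_failing_batch(works: int, fails: int,
                           crashes: Callable[[int], bool]) -> int:
    """Binary search for the smallest batch size that crashes.

    Assumes `works` does not crash, `fails` does, and every size above
    the threshold also crashes (an assumption, not a verified property).
    """
    lo, hi = works, fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes(mid):
            hi = mid  # threshold is at or below mid
        else:
            lo = mid  # threshold is above mid
    return hi
```

In practice each trial should probably run in a fresh process, because a CUDA error can leave the current process unusable.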

jeejeelee · 2024-07-03T10:40:29Z

It seems that your GPU driver is too old.

sweetning0809 · 2024-07-04T02:22:09Z

> It seems that your GPU driver is too old.

Thanks.
However, inference starts normally and the problem only occurs partway through.
Is it caused by the driver version?
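One way to check the driver question is to compare the installed driver against the minimum that NVIDIA lists for the CUDA toolkit in use. A sketch under stated assumptions: the helper names are hypothetical, and the threshold is passed in by the caller. As far as I know, 520.61.05 is the Linux minimum listed for the CUDA 11.8 toolkit, but please verify it against the CUDA release notes.

```python
import subprocess
from typing import Tuple


def parse_driver_version(text: str) -> Tuple[int, ...]:
    """Turn a string such as '470.103.01' into (470, 103, 1)."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_at_least(installed: str, required: str) -> bool:
    """Return True if the installed driver is at or above the required one."""
    return parse_driver_version(installed) >= parse_driver_version(required)


def query_driver_version() -> str:
    """Ask nvidia-smi for the driver version of the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.splitlines()[0].strip()
```

For the environment in this report, `driver_at_least("470.103.01", "520.61.05")` is False. That only shows the driver is older than the assumed threshold; it does not prove the driver is the cause of the crash.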

jeejeelee · 2024-07-04T02:40:03Z

FYI: https://github.com/vllm-project/vllm/issues?q=device+kernel+image+is+invalid

sweetning0809 · 2024-07-08T06:53:49Z

> FYI: https://github.com/vllm-project/vllm/issues?q=device+kernel+image+is+invalid

We updated the GPU driver to 525 and CUDA to 12.1, but it did not help.

sweetning0809 · 2024-07-08T07:08:58Z

I noticed a similar issue: #5938.

comaniac · 2024-07-09T00:48:25Z

This should be different from #5938, which happens in the MoE kernel, unless this is a general issue in the Triton compiler, but I have no clue.

sweetning0809 added the bug label Jul 3, 2024

sweetning0809 changed the title ~~[Bug]:~~ [Bug]: enable_prefix_caching cause an triron crash Jul 3, 2024

sweetning0809 changed the title ~~[Bug]: enable_prefix_caching cause an triron crash~~ [Bug]: enable_prefix_caching cause a triron crash Jul 3, 2024

sweetning0809 closed this as completed Aug 12, 2024
