[Bug]: Crash with --enable-prefix-caching enabled #3944
Comments
Right now, APC is not supported on T4 because the attention kernels require compute capability >= 8.0 (Ampere). I added a check for this in #3903 with a better error message. Please open a feature request for T4 support if you're interested, or open a PR with T4 support if you'd like to contribute.
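For context, a minimal sketch of the kind of capability gate described above, assuming a single-GPU setup with torch available (the actual check in #3903 may differ):

```python
import torch

# Automatic prefix caching relies on Triton attention kernels that need
# compute capability >= 8.0 (Ampere). Hypothetical guard; the actual
# check added in #3903 may differ.
major, minor = torch.cuda.get_device_capability()
if (major, minor) < (8, 0):
    raise ValueError(
        f"--enable-prefix-caching requires compute capability >= 8.0, "
        f"but found {major}.{minor} on {torch.cuda.get_device_name()}")
```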
Upgrade Triton.
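For reference, a quick way to check the installed Triton version and compare it against the one bundled with your vLLM release:

```python
import triton

# Print the installed Triton version; vLLM pins a specific Triton
# version per release, so this should match your vLLM install.
print(triton.__version__)
```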
Same problem with a Tesla V100 32GB, compute capability 7.0. Is this due to my GPU's low compute capability?
Yes.
Thank you for your response, @robertgshaw2-neuralmagic. Does the limit only apply to the automatic prefix caching functionality? Can I get around the capability limit using the prefix_pos argument from previous versions?
I upgraded to v0.4.1 and prefix caching is working well on a T4. That is, I added the parameter and the server no longer crashes after the second message. It looks like it's working.
Thanks, @eByteTheDust. I think v0.4.1 included a Triton upgrade, which seems to have resolved the issues with the previous prefix-attention kernel.
@eByteTheDust That sounds great! Can you share your Triton version for reference?
@eByteTheDust @robertgshaw2-neuralmagic Just an update: GPU: Tesla V100, compute capability 7.0.
I have the same issue. Is there a solution?
Same issue. I've tried a range of vLLM versions with Triton; none of them works. Can anyone help? Thanks!
@robertgshaw2-neuralmagic
Can you share the specific request pattern you are sending? If I can reproduce the error, I can try to fix it.
Exactly the same error as @eByteTheDust. I tried the method @OUTHIM suggested, but it didn't work for me. Thanks for your reply, @robertgshaw2-neuralmagic!
An update from me:
Your current environment
🐛 Describe the bug
When --enable-prefix-caching is on, vLLM always crashes when the second message is sent to the server. The first message always works, and it always crashes on the second (I guess when it is trying to use the cache). I'm using Ray and v0.4.0.post1. When I remove --enable-prefix-caching, it works well.
python3 -m vllm.entrypoints.openai.api_server --model /maindir/Nous-Hermes-2-SOLAR-10.7B --tensor-parallel-size 2 --dtype half --gpu-memory-utilization 0.96 --max-model-len 4096 --enforce-eager --worker-use-ray --load-format auto --disable-log-stats --max-context-len-to-capture 4096 --enable-prefix-caching
python3: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector, llvm::SmallVector > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
*** SIGABRT received at time=1712674906 on cpu 3 ***
PC: @ 0x7e0584c969fc (unknown) pthread_kill
@ 0x7e0584c42520 (unknown) (unknown)
[2024-04-09 11:01:46,817 E 19542 19712] logging.cc:361: *** SIGABRT received at time=1712674906 on cpu 3 ***
[2024-04-09 11:01:46,817 E 19542 19712] logging.cc:361: PC: @ 0x7e0584c969fc (unknown) pthread_kill
[2024-04-09 11:01:46,817 E 19542 19712] logging.cc:361: @ 0x7e0584c42520 (unknown) (unknown)
Fatal Python error: Aborted
Stack (most recent call first):
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/triton/compiler/compiler.py", line 107 in ttgir_to_llir
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/triton/compiler/compiler.py", line 385 in
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/triton/compiler/compiler.py", line 476 in compile
File "", line 63 in _fwd_kernel
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/vllm/attention/ops/prefix_prefill.py", line 699 in context_attention_fwd
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/vllm/attention/ops/paged_attn.py", line 178 in forward_prefix
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/vllm/attention/backends/xformers.py", line 262 in forward
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/vllm/attention/layer.py", line 46 in forward
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 156 in forward
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 213 in forward
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 271 in forward
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 345 in forward
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 663 in execute_model
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/vllm/worker/worker.py", line 221 in execute_model
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/concurrent/futures/thread.py", line 58 in run
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/concurrent/futures/thread.py", line 83 in _worker
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/threading.py", line 982 in run
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/home/server11/miniconda3/envs/maindir/lib/python3.11/threading.py", line 1002 in _bootstrap
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, gmpy2.gmpy2, _brotli, zstandard.backend_c, charset_normalizer.md, yaml._yaml, sentencepiece._sentencepiece, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, markupsafe._speedups, pyarrow.lib, pyarrow._hdfsio, pyarrow._json, PIL._imaging, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._flinalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.optimize._direct, httptools.parser.parser, httptools.parser.url_parser, websockets.speedups (total: 104)
./solar-ray.sh: line 9: 19542 Aborted (core dumped) python3 -m vllm.entrypoints.openai.api_server --model /maindir/Nous-Hermes-2-SOLAR-10.7B --tensor-parallel-size 2 --dtype half --gpu-memory-utilization 0.96 --max-model-len 4096 --enforce-eager --worker-use-ray --load-format auto --disable-log-stats --max-context-len-to-capture 4096 --enable-prefix-caching
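For anyone trying to reproduce: a minimal sketch of the two-request pattern that triggers the crash, assuming the OpenAI-compatible server started above is listening on the default localhost:8000 (the prompts and the shared system prefix are placeholders):

```python
from openai import OpenAI

# Point the OpenAI client at the vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# A long shared prefix so the second request hits the prefix cache.
shared_prefix = "You are a helpful assistant. " * 50

# Per the report, the first request succeeds; the second request, which
# reuses the cached prefix, hits the Triton assertion on pre-Ampere GPUs
# when --enable-prefix-caching is set.
for question in ["What is vLLM?", "What is prefix caching?"]:
    resp = client.chat.completions.create(
        model="/maindir/Nous-Hermes-2-SOLAR-10.7B",
        messages=[
            {"role": "system", "content": shared_prefix},
            {"role": "user", "content": question},
        ],
    )
    print(resp.choices[0].message.content)
```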