RuntimeError: FlashAttention only supports Ampere GPUs or newer. #5985
Comments
Tesla V100 uses the Volta architecture, which is older than Ampere, hence the error. In the meantime, load the model without flash attention.
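For Qwen-7B-Chat specifically, one way to do that is sketched below. It assumes Qwen's remote modeling code honors a use_flash_attn config attribute; verify the attribute name against your checkpoint's config before relying on it.

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_dir = "/data/mlops/modelDir"  # path from this report
config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)
config.use_flash_attn = False  # assumption: Qwen's custom code checks this flag before importing flash_attn
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    config=config,
    trust_remote_code=True,
    device_map="auto",
)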
llama.cpp added an FP16 FlashAttention vector kernel, so older GPUs that lack tensor cores can run flash attention. It later added an FP32 FlashAttention vector kernel, so even Pascal GPUs (which lack FP16 performance) can now run flash attention. Once these changes make their way into text-generation-webui, FlashAttention could be supported on non-NVIDIA GPUs (including Apple Silicon) and on older pre-Ampere NVIDIA GPUs.
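If you are on the llama.cpp path rather than Transformers, a rough sketch with llama-cpp-python follows. Assumptions: your installed version exposes a flash_attn argument (check the Llama() signature of your build) and you have a GGUF file for the model; the file name here is only illustrative.

from llama_cpp import Llama

llm = Llama(
    model_path="qwen-7b-chat.Q4_K_M.gguf",  # hypothetical GGUF file name
    n_gpu_layers=-1,                         # offload all layers to the GPU
    flash_attn=True,                         # use llama.cpp's FlashAttention kernels if the build supports them
)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])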
I'm having the same issue right now with an error message calling out FlashAttention, and ChatGPT recommended the same thing: run the server without FlashAttention. But, for the life of me, I can't find a way to actually do this.
I had the same error with ModernBERT. You can actually get around FlashAttention and use SDPA or eager attention instead. Please check out this discussion: Good luck!
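For anyone landing here, that workaround can be written as a short sketch. The checkpoint name is only an example; attn_implementation is the standard Transformers from_pretrained argument for selecting the attention backend.

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "answerdotai/ModernBERT-base",   # example checkpoint; substitute your own
    attn_implementation="sdpa",      # or "eager" as a last resort on very old GPUs
)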
Describe the bug
python server.py --auto-devices --gpu-memory 80 80 --listen --api --model-dir /data/mlops/ --model modelDir --trust-remote-code
model: Qwen-7B-Chat
question
Traceback (most recent call last):
File "/app/modules/callbacks.py", line 61, in gentask
ret = self.mfunc(callback=_callback, *args, **self.kwargs)
File "/app/modules/text_generation.py", line 390, in generate_with_callback
shared.model.generate(**kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 1259, in generate
return super().generate(
File "/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1592, in generate
return self.sample(
File "/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2696, in sample
outputs = self(
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 1043, in forward
transformer_outputs = self.transformer(
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 891, in forward
outputs = block(
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 610, in forward
attn_outputs = self.attn(
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 499, in forward
attn_output = self.core_attention_flash(q, k, v, attention_mask=attention_mask)
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 191, in forward
output = flash_attn_func(q, k, v, dropout_p, softmax_scale=self.softmax_scale, causal=self.causal)
File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 825, in flash_attn_func
return FlashAttnFunc.apply(
File "/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 507, in forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_forward(
File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 51, in _flash_attn_forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
Output generated in 2.02 seconds (0.00 tokens/s, 0 tokens, context 60, seed 1536745076)
06:02:32-584051 INFO Deleted "logs/chat/Assistant/20240506-04-07-12.json".
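A quick way to confirm the root cause (a diagnostic sketch, not part of the original report): FlashAttention 2 requires compute capability 8.0 or newer, while a Tesla V100 reports 7.0 (Volta), which is exactly why flash_attn_cuda.fwd raises the error above.

import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
if major < 8:
    print("FlashAttention 2 is unsupported on this GPU; fall back to SDPA/eager attention.")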
Is there an existing issue for this?
Reproduction
gpu: V100-PCIE-32GB
python: 3.10
model: Qwen-7B-Chat
docker: docker run -it --rm --gpus='"device=0,3"' -v /root/wangbing/model/Qwen-7B-Chat/V1/:/data/mlops/modelDir -v /root/wangbing/sftmodel/qwen/V1:/data/mlops/adapterDir/ -p30901:5000 -p7901:7860 dggecr01.huawei.com:80/tbox/text-generation-webui:at-0.0.1 bash
app: python server.py --auto-devices --gpu-memory 80 80 --listen --api --model-dir /data/mlops/ --model modelDir --trust-remote-code
Screenshot
No response
Logs
System Info