[Installation]: Cannot compile flash attention when building from source #8878
Comments
That is an OOM kill.
Yeah, this seems like an OOM to me. The machine I originally used to report this issue has 128 GB of memory; I then switched to a cluster with 1 TB, and the peak usage while compiling the flash-attention kernel was around 300 GB. The doc update looks very helpful; I'll definitely try it out. By the way, I'm curious why the compilation guide says it can be done in 5-10 minutes, while on a powerful cluster it took me 2 hours. Did I do something terribly wrong?
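A common mitigation for this kind of build-time OOM is to cap the number of parallel compile jobs. A minimal sketch, assuming MAX_JOBS is honored by the vLLM build; the value 4 is only a guess to tune against available RAM, since each parallel nvcc job can use several GB:

```bash
export MAX_JOBS=4   # limit parallel compile jobs so peak memory stays within RAM
pip install -e .    # run from the vLLM source checkout, as in a normal source build
```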
Bumping this up: any idea why the compilation takes much longer than what is noted in the docs? Thanks!
That was an estimate from a long time ago; many kernels have been added since then. If you don't touch the kernels, I would suggest the Python-only build-from-source approach, see https://docs.vllm.ai/en/latest/getting_started/installation.html#build-from-source-without-compilation
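For reference, a rough sketch of what that Python-only workflow looks like; this assumes a vLLM source checkout and a version whose build supports the VLLM_USE_PRECOMPILED switch (the exact steps have varied across versions, so follow the linked page for your release):

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
# Reuse prebuilt kernels instead of compiling them; skips the long C++/CUDA build.
VLLM_USE_PRECOMPILED=1 pip install -e .
```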
I updated the description in #8931
Thanks!
Your current environment
The docs do not specify which CUDA version is required, but the recommended Docker image, nvcr.io/nvidia/pytorch:23.10-py3, ships with CUDA 12.2. Below is the output of python collect_env.py, run from inside the NVIDIA PyTorch 23.10 Docker image.

How you are installing vllm
Command (run from the NVIDIA PyTorch 23.10 Docker image):
Output (excluding installation of dependencies)
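The exact command is not preserved above, so the following is only a hypothetical reconstruction of a typical from-source install inside the recommended image; the image tag comes from the issue, while the MAX_JOBS value and repository layout are assumptions:

```bash
# Hypothetical reconstruction -- not the reporter's verbatim command.
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.10-py3 bash

# Inside the container:
nvcc --version                        # the 23.10 image ships CUDA 12.2
git clone https://github.com/vllm-project/vllm.git
cd vllm
python collect_env.py                 # produces the environment report shown above
export MAX_JOBS=4                     # cap parallel compile jobs; see the OOM discussion
pip install -e .                      # builds the CUDA kernels, including flash attention
```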
Before submitting a new issue...