[Bug]: Can't load gemma-2-9b-it with vllm 0.5.2 #6462
Comments
cc @tlrmchlsmth @mgoin for cutlass and fp8
The same operation gets this error:

File ~/.local/lib/python3.8/site-packages/vllm/entrypoints/llm.py:150, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, **kwargs)
File ~/.local/lib/python3.8/site-packages/vllm/engine/llm_engine.py:421, in LLMEngine.from_engine_args(cls, engine_args, usage_context)
File ~/.local/lib/python3.8/site-packages/vllm/engine/llm_engine.py:263, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, multimodal_config, speculative_config, decoding_config, observability_config, prompt_adapter_config, executor_class, log_stats, usage_context, stat_loggers)
File ~/.local/lib/python3.8/site-packages/vllm/engine/llm_engine.py:362, in LLMEngine._initialize_kv_caches(self)
File ~/.local/lib/python3.8/site-packages/vllm/executor/gpu_executor.py:78, in GPUExecutor.determine_num_available_blocks(self)
File ~/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
File ~/.local/lib/python3.8/site-packages/vllm/worker/worker.py:179, in Worker.determine_num_available_blocks(self)
File ~/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
File ~/.local/lib/python3.8/site-packages/vllm/worker/model_runner.py:923, in GPUModelRunnerBase.profile_run(self)
File ~/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
File ~/.local/lib/python3.8/site-packages/vllm/worker/model_runner.py:1299, in ModelRunner.execute_model(self, model_input, kv_caches, intermediate_tensors, num_steps)
TypeError: 'NoneType' object is not callable
@vlsav How did you install vLLM? If you installed vLLM from source like:
Then either try rerunning
@wlwqq and @ArlanCooper, those are separate errors -- please open up a new issue to keep the conversation focused.
@tlrmchlsmth By pip from PyPI, not from source.
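For reference, a quick way to confirm which vLLM build is actually being imported (a PyPI wheel resolves to site-packages, while a from-source editable install resolves to the repo checkout):

```python
# Minimal check of the vLLM version and install location being picked up.
import vllm

print(vllm.__version__)  # expected to print 0.5.2 in this setup
print(vllm.__file__)     # site-packages path for a wheel, repo path for an editable install
```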
Update: 3f3b6b2 made it into 0.5.1, which kind of breaks the version mismatch theory, unfortunately.
@vlsav what do you see when you run the following?
@tlrmchlsmth
I also encountered the same problem when deploying the gemma-2-9b-it model with vLLM 0.5.2.
+1
This looks right to me -- at least the function is present in the .so file. I'll try to reproduce the problem.
@jueming0312 and @twright8 it'd be helpful if you could share the output of
PyTorch version: 2.3.0+cu121
OS: Ubuntu 20.04.6 LTS (x86_64)
Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
Nvidia driver version: 550.90.07
I think I needed to upgrade FlashInfer, but I can't use it since it only supports Ampere (compute capability 8.0) and newer. There's a solution proposed here: #6173
@twright8 unfortunately, Ampere is needed for FP8 quantization support as well. @ArlanCooper it looks like you need to install FlashInfer (you'll want to download a wheel from https://github.com/flashinfer-ai/flashinfer/releases).
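A minimal sanity check, assuming FlashInfer should be importable in the same environment that runs vLLM with VLLM_ATTENTION_BACKEND=FLASHINFER:

```python
# Check that FlashInfer is installed and visible to this Python environment;
# the wheel has to match your torch and CUDA versions.
try:
    import flashinfer
    print("flashinfer:", getattr(flashinfer, "__version__", "version attribute not found"))
except ImportError as exc:
    print("flashinfer is not installed:", exc)
```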
@vlsav I haven't been able to reproduce your problem, either on an H100 or on an L40. What model exactly are you running with? I.e., what is
@tlrmchlsmth would it help if I install a newer version of FlashInfer with vLLM 0.5.2?
@tlrmchlsmth I have now tried to launch neuralmagic/gemma-2-9b-it-FP8 with vllm==0.5.2
I don't think that will resolve your issue. This is more of a linker issue with a C++ function that we compile and ship with the vLLM wheel file. One thing to check is to make sure there isn't a stale .so file left over from a previous installation.
It also occurs to me that the name
There is only one .so file.
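A small sketch of how one could check for stale or leftover compiled libraries in the installed package; it assumes a standard site-packages install and that the compiled ops live in the vllm._C extension, as in recent releases:

```python
# Locate vLLM's compiled extension and list every shared object shipped with the
# package, which helps spot a leftover .so from a previous build.
import importlib.util
import pathlib

spec = importlib.util.find_spec("vllm._C")
print("vllm._C resolves to:", spec.origin if spec else "not found")

import vllm

pkg_dir = pathlib.Path(vllm.__file__).parent
for so_path in sorted(pkg_dir.rglob("*.so")):
    print(so_path)
```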
@tlrmchlsmth I also found this in the vLLM output:
@vlsav That is actually your problem. I'll look into this -- it looks like we'd need to build our .so files on an OS with an earlier version of GLIBC, since you're on 2.28.
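A quick way to confirm which glibc version the host actually provides (standard library plus ctypes only; for comparison, Ubuntu 20.04 ships roughly glibc 2.31 and Ubuntu 22.04 ships 2.35):

```python
# Print the C library name/version Python was linked against, then ask glibc directly.
import ctypes
import platform

print(platform.libc_ver())  # e.g. ('glibc', '2.28') on the affected system

libc = ctypes.CDLL(None)
libc.gnu_get_libc_version.restype = ctypes.c_char_p
print(libc.gnu_get_libc_version().decode())
```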
FYI, with vllm 0.5.2, I get this warning on Ubuntu 20.04, but on Ubuntu 22.04 it works fine.
Thanks. Not sure if I will be able to do a system upgrade.
As a workaround, you can try installing from the wheel files on GitHub (https://github.com/vllm-project/vllm/releases/tag/v0.5.2), which were built on an older OS (Ubuntu 20.04). I think that should work for you.
Thanks. It works, both for gemma-2-9b-it and for neuralmagic/gemma-2-9b-it-FP8.
This issue should be fixed for most people in 0.5.3 and later, now that we are building on Ubuntu 20.04. I think we can go ahead and close this one.
Thanks. So far no issues with 0.5.3.
I am using v0.6.1 and get the same error with Gemma 2.
@yazdanbakhsh what OS are you running? The specific version is important here. And do you see a message like this in your log?
Thanks for looking into this -- here is the complete command (note that the exact same Docker setup works for at least 15-16 other, non-Gemma models); also, I used an offline checkpoint to load the model and tokenizer.
Main error:
OS: Linux isca3 5.15.0-1062-gcp #70~20.04.1-Ubuntu SMP Fri May 24 20:12:18 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Libraries: vllm 0.6.1; torch 2.4.0+cu121; NVIDIA-SMI 550.54.15; Driver Version: 550.54.15; CUDA Version: 12.4
Code snippet used for generation:
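A minimal offline-generation sketch in the spirit of the setup described above; the checkpoint path, prompt, and sampling settings are placeholders rather than the reporter's actual values:

```python
# Offline generation with a local checkpoint (no internet access required,
# provided the model and tokenizer files are already on disk).
from vllm import LLM, SamplingParams

llm = LLM(
    model="/local/checkpoints/gemma-2-9b-it",  # hypothetical local path
    tensor_parallel_size=1,
)
sampling = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Write a short haiku about GPUs."], sampling)
print(outputs[0].outputs[0].text)
```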
These are the models for which the exact same setup works:
Let me know if you need anything else from me.
One last piece of information: our machine does not have internet access. Not sure if this error has anything to do with that.
Also tried with both H100 and A100, and the error persists. @tlrmchlsmth
@yazdanbakhsh, this looks like a different problem (0.5.2 had some problems related to the glibc issues, but you're not running into those here), so I think we should track this in a separate issue. Feel free to make one and tag me there. I don't know what's going on here, but my first suggestion is to try nuking your cache.
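For what it's worth, two cache locations that commonly matter when torch.compile or Triton misbehaves, assuming default settings (TRITON_CACHE_DIR and TORCH_EXTENSIONS_DIR override them):

```python
# Default cache directories for Triton JIT kernels and torch C++/CUDA extensions.
import os

print(os.path.expanduser("~/.triton/cache"))            # Triton kernel cache
print(os.path.expanduser("~/.cache/torch_extensions"))  # torch extension build cache
```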
There is nothing there. It seems this is something that is being built after running the script.
How can I disable Triton for vLLM? Is there a way?
There's no way to disable Triton in general. But since this is only happening with Gemma-2, it might be the torch.compile used in GemmaRMSNorm. To test that out, and as a workaround, you could try modifying GemmaRMSNorm so that it skips torch.compile.
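An illustrative sketch of the kind of edit being suggested -- this is not the actual vLLM source, and it assumes GemmaRMSNorm's forward pass is wrapped in torch.compile, which is what pulls Triton in:

```python
import torch
import torch.nn as nn


class GemmaRMSNorm(nn.Module):
    """Illustrative Gemma-style RMSNorm (the real layer lives inside vLLM)."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        # Gemma stores the scale as (weight - 1), hence the (1 + weight) below.
        self.weight = nn.Parameter(torch.zeros(hidden_size))
        self.variance_epsilon = eps

    # The workaround amounts to removing a decorator like the one below, so the
    # layer runs eagerly and never JIT-compiles a Triton kernel:
    # @torch.compile
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        orig_dtype = x.dtype
        x = x.float()
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.variance_epsilon)
        return (x * (1.0 + self.weight.float())).to(orig_dtype)
```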
Update: I think it is related to the version of our drivers. Please do not extrapolate this case to other scenarios, as our setup is a little bit unique. Thanks for all the help.
Your current environment
🐛 Describe the bug
Successfully launched gemma-2-9b-it with vLLM 0.5.1.
The following script was used:
export VLLM_ATTENTION_BACKEND=FLASHINFER
python -m vllm.entrypoints.openai.api_server --port=8080 --host=0.0.0.0 --model /models/gemma-2-9b-it --quantization fp8 --enforce-eager --seed 1234 --served-model-name gemma-2-9b
No issues (except a sliding-window warning and the max length being capped to the sliding window size, 4096).
The same script after installing vLLM 0.5.2 gives this error message: