Loading the model turns out to be very slow after updating the version of vllm #2959
Comments
I tried 0.2.7 and it also works well, but with 0.3.0 it hangs for 40 minutes, so I'm sure some change in 0.3.0 leads to this issue.
Can confirm, distributed inference works with 0.2.7. Edit: distributed inference on GPUs on the same node works well; only when using GPUs across nodes did I run into this slow loading. Thank you!
We have added documentation for this situation in #5430. Please take a look.
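For reference, below is a minimal sketch of the kind of tensor-parallel load the comments above describe; the model ID and tensor_parallel_size are assumptions for illustration, not values taken from this thread:

```python
# Sketch of a tensor-parallel load with vLLM's offline API.
# The model ID and GPU count below are assumptions, not from this thread.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # assumed checkpoint
    tensor_parallel_size=8,                         # assumed GPU count
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```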
I am using vllm with Mixtral 8x7B. With vllm 0.2.6 it works well.
I tried to update to the latest version, 0.3.1; however, loading the model weights becomes very slow. It takes almost 40 minutes versus 5 minutes with the old version. I don't know why this happens, since I did not change any other environment settings or parameters.
I have checked the differences in the model-loading code in mixtral.py and did not find any clues.
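To make the 5-minute vs. 40-minute comparison precise, one rough sketch is to time only the engine construction (weight loading) under each vLLM version; the model path and parallel size here are placeholders, not the actual parameters used in this report:

```python
# Sketch: time only engine construction / weight loading to compare versions.
# The model path and tensor_parallel_size are placeholders.
import time
from vllm import LLM

start = time.perf_counter()
llm = LLM(
    model="/path/to/mixtral-8x7b",  # placeholder: local model directory
    tensor_parallel_size=8,         # placeholder: GPU count
)
print(f"Engine ready in {time.perf_counter() - start:.1f} s")
```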
I downloaded the model, and my parameters are:
The log hangs for about 40 minutes at:
More information:
The model I downloaded is stored on HDFS and mounted into the k8s pod via a PVC, which has a bandwidth limit of 800M/s. I checked the monitoring, and the bandwidth did not exceed the limit while the model was loading.
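One way to double-check that observation is to measure raw sequential read throughput from the mounted path directly; a rough sketch, where the mount point is an assumed placeholder:

```python
# Sketch: measure raw sequential read throughput from the PVC mount
# to rule out storage as the bottleneck. The mount path is an assumption.
import os
import time

MOUNT = "/mnt/models/mixtral-8x7b"  # assumed PVC mount point
read_bytes = 0
start = time.perf_counter()
for root, _, files in os.walk(MOUNT):
    for name in files:
        if name.endswith((".safetensors", ".bin")):
            with open(os.path.join(root, name), "rb") as f:
                while chunk := f.read(64 * 1024 * 1024):  # 64 MiB chunks
                    read_bytes += len(chunk)
elapsed = time.perf_counter() - start
print(f"Read {read_bytes / 1e9:.1f} GB in {elapsed:.0f} s "
      f"({read_bytes / 1e6 / elapsed:.0f} MB/s)")
```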
Sorry to bother you, but do you have any idea about this? The file (mixtral.py) has changed since 0.2.6. @WoosukKwon @zhuohan123 @pcmoritz @tterrysun