GPU consumption #550
I loaded OPT-125M with the vLLM API and it takes 21 GB on an RTX 6000, which seems strange.
Observing the same issue with an A100 (20Gi profile): loading OPT-125M eats up the entire GPU memory.
By default, vLLM allocates 90% of GPU memory for model inference and KV-cache blocks. So on an 80 GB A100, it will use at least 0.9 * 81920 = 73728 MiB.
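A quick sketch of that arithmetic for whatever GPU you are running on (PyTorch is used only to read the device's total memory; 0.9 is vLLM's default utilization fraction):

```python
import torch

# vLLM's default fraction of GPU memory reserved for weights + KV cache.
gpu_memory_utilization = 0.9

total_mib = torch.cuda.get_device_properties(0).total_memory / (1024 ** 2)
reserved_mib = gpu_memory_utilization * total_mib
print(f"Total: {total_mib:.0f} MiB, vLLM will reserve ~{reserved_mib:.0f} MiB")
# On an 80 GB A100: 0.9 * 81920 = 73728 MiB
```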
Is there any way to limit the GPU memory usage (if we can trade off throughput for memory)?
You can restrict GPU usage with the `gpu_memory_utilization` parameter.
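For example, a minimal sketch using the offline `LLM` entry point (the model name and the 0.5 fraction are illustrative, not recommendations):

```python
from vllm import LLM, SamplingParams

# Reserve only ~50% of the GPU instead of the default 90%.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.5)

sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```

The API server exposes the same knob as the `--gpu-memory-utilization` flag. A smaller fraction leaves fewer KV-cache blocks, so throughput drops accordingly.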
Thanks so much. But what if the memory needed exceeds the predefined gpu_memory_utilization? Will it raise an OOM error or automatically use more GPU memory?
Since vLLM supports continuous batching, it automatically schedules which requests run in each iteration, so OOM will not happen.
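A rough illustration of that behavior, again with placeholder model name, fraction, and prompt count: you can submit far more requests than fit in memory at once, and the scheduler spreads them across iterations instead of running out of memory.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.5)
sampling_params = SamplingParams(max_tokens=32)

# Submit many prompts at once; the continuous-batching scheduler decides how
# many sequences actually run concurrently in each iteration, bounded by the
# KV-cache blocks that fit in the reserved memory.
prompts = [f"Write one sentence about topic {i}." for i in range(1000)]
outputs = llm.generate(prompts, sampling_params)
print(len(outputs))  # 1000 completed requests
```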
Please refer to #241 for memory usage! We will add this to our documentation.
When I load 13B LLaMA in HF, GPU usage is about 26 GB.
However, when I load 13B LLaMA in vLLM, GPU usage is about 73 GB.
Is this usual?