
GPU consumption #550

Closed
David-Lee-1990 opened this issue Jul 23, 2023 · 8 comments

Comments

@David-Lee-1990

When I load 13B LLaMA in HF, GPU usage is about 26 GB.

However, when I load 13B LLaMA in vLLM, GPU usage is about 73 GB.

[screenshot of nvidia-smi output]

Is this usual?

@trannhatquy

I load OPT-125M in the vLLM API and it takes 21 GB on an RTX 6000, which is strange.

@rahuldshetty

> I load OPT-125M in the vLLM API and it takes 21 GB on an RTX 6000, which is strange.

Observing the same issue with an A100 20Gi profile. Using OPT-125M eats up the complete GPU memory.
Seems like a case of a memory leak?

@irasin
Contributor

irasin commented Jul 24, 2023

vLLM will allocate 90% of GPU memory for model inference and KV-cache blocks. So in the 80 GB A100 case, it will use at least 0.9 * 81920 = 73728 MiB.
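
For reference, a back-of-the-envelope check of that figure (a quick sketch assuming an 80 GB A100, i.e. 81920 MiB, and the default gpu_memory_utilization of 0.9):

```python
# Rough arithmetic behind the number above.
# Assumes an 80 GB A100 (81920 MiB total) and vLLM's default gpu_memory_utilization=0.9.
total_mib = 81920
gpu_memory_utilization = 0.9

preallocated_mib = gpu_memory_utilization * total_mib
print(f"{preallocated_mib:.0f} MiB")  # -> 73728 MiB, roughly the ~73 GB reported above
```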

@rahuldshetty

> vLLM will allocate 90% of GPU memory for model inference and KV-cache blocks. So in the 80 GB A100 case, it will use at least 0.9 * 81920 = 73728 MiB.

Any way to limit the GPU memory usage? (if we can trade off between throughput and memory)

@irasin
Contributor

irasin commented Jul 24, 2023

> vLLM will allocate 90% of GPU memory for model inference and KV-cache blocks. So in the 80 GB A100 case, it will use at least 0.9 * 81920 = 73728 MiB.
>
> Any way to limit the GPU memory usage? (if we can trade off between throughput and memory)

You can restrict the GPU usage with gpu_memory_utilization: https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py#L27
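
For example, a minimal sketch using the offline LLM entrypoint (the model name and the 0.5 value are just for illustration):

```python
from vllm import LLM, SamplingParams

# Ask vLLM to pre-allocate only ~50% of the GPU instead of the default 90%.
# A smaller budget means fewer KV-cache blocks, so peak throughput drops.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.5)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

The same knob should also be exposed as a --gpu-memory-utilization flag when launching the API server, since it comes from EngineArgs.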

@David-Lee-1990
Author

> vLLM will allocate 90% of GPU memory for model inference and KV-cache blocks. So in the 80 GB A100 case, it will use at least 0.9 * 81920 = 73728 MiB.
>
> Any way to limit the GPU memory usage? (if we can trade off between throughput and memory)
>
> You can restrict the GPU usage with gpu_memory_utilization: https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py#L27

Thanks so much. But what if the needed GPU memory exceeds the predefined gpu_memory_utilization? Will it raise an OOM error or automatically use more GPU memory?

@irasin
Contributor

irasin commented Jul 25, 2023

Since vLLM supports continuous batching, it will automatically schedule the requests to run in each iteration, so OOM will not happen.
However, if you do something like parallel sampling, you may see OOM happen if there are not enough cpu_cache blocks.
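
For context, "parallel sampling" here means requesting several completions per prompt via SamplingParams(n=...); a rough sketch (the model name is just for illustration):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

# n=4 requests four completions per prompt; each sequence needs its own
# KV-cache blocks, so large n values can exhaust the cache (and swap space).
params = SamplingParams(n=4, temperature=0.8, max_tokens=64)
outputs = llm.generate(["The capital of France is"], params)

for completion in outputs[0].outputs:
    print(completion.text)
```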

@zhuohan123
Member

Please refer to #241 for memory usage! We will add this to our documentation.
