[Core] Bump up the default of --gpu_memory_utilization to be more similar to TensorRT Triton's default #5158
Conversation
LGTM. Nice finding!
Meanwhile, since this changes the definition of --gpu-memory-utilization, we may want to document it somewhere (such as the release notes) to remind people.
@comaniac Thanks for the comment! I have updated the docs that refer to --gpu-memory-utilization and also added a cuda graph estimate to the peak_memory calculations.
vllm/worker/worker.py
Outdated
# In non-eager mode, we also need to add cuda graph
# memory to peak_memory
if not self.model_config.enforce_eager:
    cuda_graph_memory = 1 * (1000 * 1024 * 1024)  # 1GB estimate
This might not be accurate; it can vary according to the model size. Typically it takes less than 1 GB.
Yeah, the problem is that the capture happens only after the code determines gpu_blocks. I have tried this for a 70B model and it works. We can also increase the estimate to be on the conservative side (if needed).
Modified the code to estimate cuda graph memory based on peak_memory (which is a function of model size).
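For context, here is a minimal sketch of what such a size-dependent estimate could look like; the 5% fraction, the 256 MiB floor, and the helper name are illustrative assumptions, not the exact code in this PR.

import torch

def estimate_cuda_graph_memory(peak_memory: int) -> int:
    # Illustrative assumption: CUDA graph memory grows with model size,
    # approximated as 5% of the profiled peak memory with a 256 MiB floor.
    floor = 256 * 1024 * 1024
    return max(int(0.05 * peak_memory), floor)

# How the estimate could be folded into the memory profiling step:
torch.cuda.synchronize()
free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
peak_memory = total_gpu_memory - free_gpu_memory
peak_memory += estimate_cuda_graph_memory(peak_memory)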
How much memory does LoRA usually consume in addition to the model memory?
The reason we design …
@zhuohan123 thanks for providing this info, I was not aware that the GPU utilization parameter was used to partition a single GPU across multiple models. What if we introduce a new parameter, --gpu-fragment, that you can set to 0.5 to ensure the total memory used by the vLLM instance is 0.5 of the GPU, while the existing GPU utilization parameter controls the KV cache size? Btw, your proposal of increasing the default utilization is also good, since it will provide good performance out of the box. So both options are good; it would be helpful to get input from people on what is preferable.
To me it feels like the second parameter (the GPU utilization parameter for the KV cache) should always be 100% if we have fixed the fraction of GPU memory we are using. Is there any benefit for a user to adjust this parameter?
I think you're right. I will try to bump the default GPU utilization without changing its meaning and check whether I still get good performance. Will report back.
@zhuohan123 updated the PR based on your suggestion (to simply bump the default value up and not change the meaning) so we can preserve the use-case you described.
Force-pushed from 9b61ef4 to 1531426
…ory utilization is similar to TensorRT
Force-pushed from df794e0 to 528966e
Rebased to get the "[CI] Disable flash_attn backend for spec decode (#5286)" fix to allow speculative decoding tests to pass
@comaniac @zhuohan123 @simon-mo @youkaichao @robertgshaw2-neuralmagic all tests pass. I also re-ran the server benchmarks of TRT-Triton vs vLLM (updated PDF above) for both chunked and non-chunked prompt. We can see that vLLM with 0.98 gpu_util and chunked prompt is able to achieve the same e2e user-latency and throughput as TRT-Triton, which is encouraging to see.
It might be more prone to OOM when our estimation of memory consumption (e.g. cudagraph) is incorrect. Can we catch the error and tell the user to decrease the gpu memory utilization number?
Yeah, I can add the error catching when the capture happens.
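For illustration, the error handling could look roughly like this; capture_fn is a placeholder for whatever routine performs the actual capture, and the message wording is an assumption.

import torch

def capture_with_hint(capture_fn, gpu_memory_utilization: float):
    # Translate an OOM during CUDA graph capture into an actionable hint.
    try:
        capture_fn()
    except torch.cuda.OutOfMemoryError as err:
        raise RuntimeError(
            "CUDA out of memory while capturing CUDA graphs. Try lowering "
            f"--gpu_memory_utilization (currently {gpu_memory_utilization}) "
            "or running with --enforce-eager to skip graph capture."
        ) from err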
gb = 1024 * 1024 * 1024
min_dynamic_memory = int(1.5 * gb)
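As a rough illustration of how such a reservation could cap the KV cache budget; the surrounding variable names and example numbers below are hypothetical.

gb = 1024 * 1024 * 1024
min_dynamic_memory = int(1.5 * gb)

# Hypothetical numbers for a single A100-40GB, chosen so the cap binds:
# 2% headroom (0.8 GB) is smaller than the 1.5 GB reservation.
total_gpu_memory = 40 * gb
peak_memory = 20 * gb                 # measured during memory profiling
gpu_memory_utilization = 0.98

# Memory left for KV cache blocks under the utilization target ...
kv_cache_budget = int(total_gpu_memory * gpu_memory_utilization) - peak_memory
# ... but never eat into the headroom reserved for dynamic allocations
# (CUDA graphs, NCCL buffers, temporary activations).
kv_cache_budget = min(kv_cache_budget,
                      total_gpu_memory - peak_memory - min_dynamic_memory)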
My concern is that this can be inaccurate. It is reported in #5517 that different GPUs may consume different amounts of memory for cudagraph.
@alexm-neuralmagic instead of catching the error (which requires careful cleanup), can we use a hindsight approach where the default value stays at 0.9? So basically, default to 0.9.
@youkaichao this is a good idea, but it still requires detecting how much dynamic memory is used at peak. I am not sure how to do that accurately, which is also the issue this diff faces in avoiding the possible OOM issues that bumping the default up may generate.
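For illustration only, the hindsight check could be an advisory log emitted after engine initialization rather than an error path; the function name, the threshold, and where it would be called from are assumptions.

import logging

import torch

logger = logging.getLogger(__name__)

def maybe_suggest_higher_utilization(gpu_memory_utilization: float,
                                     headroom_gb: float = 2.0) -> None:
    # Called after the KV cache is allocated and CUDA graphs are captured,
    # i.e. after all dynamic allocations have happened, so the check is
    # purely advisory and cannot itself cause an OOM.
    free_bytes, _ = torch.cuda.mem_get_info()
    free_gb = free_bytes / (1024 ** 3)
    if free_gb > headroom_gb:
        logger.info(
            "%.1f GiB of GPU memory is still free after initialization; "
            "consider raising --gpu_memory_utilization above %.2f for a "
            "larger KV cache.", free_gb, gpu_memory_utilization)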
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has merge conflicts that must be resolved before it can be merged.
During our comparisons of vLLM vs TRT-LLM-Triton, we found that the default value of the --gpu_memory_utilization=0.9 parameter in vLLM is not optimal compared to TRT's usage of --kv_cache_free_gpu_mem_fraction. In both vLLM and TRT, the default value is 0.9; however, in TRT the 0.9 is applied to the memory that is available after the model is loaded, as described in tensorrtllm_backend:
In vLLM, --gpu_memory_utilization is applied before the model is loaded, which results in vLLM under-utilizing the GPU. For example, for an 81GB A100 and a 65GB model like Llama3 70B (split over 2 GPUs), GPU memory utilization for vLLM reaches 72GB (88%), while for TRT it reaches 80GB (98%). As a result, KV-cache batching is reduced for vLLM, which in turn hurts e2e user-latency and total throughput.
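As a simplified back-of-the-envelope comparison of the two semantics (ignoring activation, CUDA graph, and other runtime overheads, so these numbers only indicate the direction of the gap, not the exact 72GB/80GB figures measured above):

# One A100-80GB GPU holding half of a ~65GB Llama3 70B model (TP=2).
total_memory_gb = 81.0
model_shard_gb = 65.0 / 2          # ~32.5 GB of weights per GPU
fraction = 0.9

# vLLM: --gpu_memory_utilization bounds the *total* footprint per GPU.
vllm_total_used = total_memory_gb * fraction               # ~72.9 GB
vllm_kv_cache = vllm_total_used - model_shard_gb           # ~40.4 GB

# TRT-LLM: --kv_cache_free_gpu_mem_fraction applies to the memory that
# is still free *after* the model weights are loaded.
trt_kv_cache = (total_memory_gb - model_shard_gb) * fraction   # ~43.6 GB
trt_total_used = model_shard_gb + trt_kv_cache                 # ~76.2 GB

print(f"vLLM: ~{vllm_total_used:.1f} GB used, ~{vllm_kv_cache:.1f} GB KV cache")
print(f"TRT:  ~{trt_total_used:.1f} GB used, ~{trt_kv_cache:.1f} GB KV cache")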
This PR bumps up the --gpu_memory_utilization default value from 0.9 to 0.98, so that vLLM utilizes GPU memory in a way that is more similar to TensorRT. To avoid OOM errors due to the bump, logic was added to ensure that 1.5GB of GPU memory remains available for dynamic allocations.
To demonstrate the performance difference, an analysis of TTFT, TPOT, e2e-user-latency, and total-time is attached for Llama3 70B on 2xA100 GPUs for QPS=1, 2.5, 5, 7.5, 10. The results show the effect of bumping gpu_util to 0.98 for both the default scheduler and the chunked prompt scheduler. We can see that with 0.98 gpu_util and the chunked scheduler, vLLM is able to achieve the same e2e latency and throughput as the optimized TensorRT-LLM-Triton (with or without chunked prompt).
Original PDF for the benchmark:
TRT-LLM-Triton_vs_vLLM_old_vs_vLLM_new_for_Llama3_70B_2xA100_p256_o128.pdf
To reproduce these results, we have created a tensorrt-demo repository that outlines the steps that need to be taken to run TRT-LLM-Triton and vLLM.