
[Core] Bump up the default of --gpu_memory_utilization to be more similar to TensorRT Triton's default #5158

Closed

Conversation

@alexm-redhat (Collaborator) commented May 31, 2024

During our comparisons of vLLM vs TRT-LLM-Triton, we found that the default value of the --gpu_memory_utilization=0.9 parameter in vLLM is not optimal compared to TRT's use of --kv_cache_free_gpu_mem_fraction. In both vLLM and TRT, the default value is 0.9; however, in TRT the 0.9 is applied to the memory that is available after the model is loaded, as described in the tensorrtllm_backend documentation:

[image: excerpt from the tensorrtllm_backend documentation describing kv_cache_free_gpu_mem_fraction]

In vLLM, --gpu_memory_utilization is applied before the model is loaded, and this results in vLLM underutilizing the GPU. For example, for an 81GB A100 and a 65GB model like Llama3 70B (split over 2 GPUs), vLLM's GPU memory utilization reaches 72GB (88%), while TRT's reaches 80GB (98%). As a result, batching via the KV cache is reduced for vLLM, which in turn hurts e2e user latency and total throughput.

This PR bumps the --gpu_memory_utilization default value from 0.9 to 0.98, so that vLLM utilizes GPU memory in a way that is more similar to TensorRT. To avoid OOM errors due to the bump, logic was added to ensure that 1.5GB of GPU memory remains available for dynamic allocations.
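For illustration, here is the rough arithmetic behind the numbers above as a small Python sketch; it treats the 65GB as the per-GPU weight footprint and ignores activation/workspace overheads, so it is an approximation rather than a measurement:

```python
# Back-of-the-envelope comparison of the two definitions of the 0.9 default.
total_gb = 81      # A100 memory
weights_gb = 65    # assumed per-GPU weight footprint, taken from the example above
util = 0.9

# vLLM today: the 0.9 budget must cover weights *and* KV cache.
vllm_used = util * total_gb                             # ~72.9 GB -> "72GB (88%)"

# TRT-LLM-Triton: weights load first, then 0.9 of the *remaining* memory goes to KV cache.
trt_used = weights_gb + util * (total_gb - weights_gb)  # ~79.4 GB -> "80GB (98%)"

print(f"vLLM: {vllm_used:.1f} GB, TRT-LLM-Triton: {trt_used:.1f} GB")
```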

To demonstrate the performance difference, an analysis of TTFT, TPOT, e2e user latency and total time is attached for Llama3 70B on 2xA100 GPUs at QPS=1, 2.5, 5, 7.5, 10. The results show the effect of bumping gpu_util to 0.98 for both the default scheduler and the chunked-prompt scheduler. We can see that with 0.98 gpu_util and the chunked scheduler, vLLM achieves the same e2e latency and throughput as the optimized TensorRT-LLM-Triton (with or without chunked prompt).

[image: benchmark charts of TTFT, TPOT, e2e user latency and total time for Llama3 70B on 2xA100]

Original PDF for the benchmark:

TRT-LLM-Triton_vs_vLLM_old_vs_vLLM_new_for_Llama3_70B_2xA100_p256_o128.pdf

To reproduce these results, we have created a tensorrt-demo repository that outlines the steps that need to be taken to run TRT-LLM-Triton and vLLM.

@comaniac (Collaborator) left a comment

LGTM. Nice finding!
Meanwhile, since this changes the definition of --gpu-memory-utilization, we may want to document it somewhere (such as in the release notes) to remind people.

@alexm-redhat (Collaborator, Author)

@comaniac Thanks for the comment! I have updated the docs that refer to --gpu-memory-utilization and also added a CUDA graph estimate to the peak_memory calculation.

```python
# In non-eager mode, we also need to add cuda graph
# memory to peak_memory
if not self.model_config.enforce_eager:
    cuda_graph_memory = 1 * (1000 * 1024 * 1024)  # 1GB estimate
```
@youkaichao (Member) commented May 31, 2024

This might not be accurate; it can vary with the model size. Typically it takes less than 1 GB.

@alexm-redhat (Collaborator, Author)

Yeah, the problem is that the CUDA graph capture happens only after the code determines gpu_blocks. I have tried this with a 70B model and it works. We can also increase the estimate to be on the conservative side if needed.

@alexm-redhat (Collaborator, Author)

Modified the code to estimate CUDA graph memory based on peak_memory (which is a function of model size).
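Purely as an illustration of what a peak_memory-based estimate could look like (the scaling factor and floor below are assumptions, not the PR's actual constants):

```python
def estimate_cuda_graph_memory(peak_memory: int) -> int:
    # Scale the CUDA graph estimate with the profiled peak memory instead of a fixed 1GB.
    # The 2% factor and the 256MiB floor are placeholder values.
    return max(int(0.02 * peak_memory), 256 * 1024 * 1024)
```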

@alexm-redhat (Collaborator, Author)

How much memory does LoRA usually consume in addition to the model memory?

@zhuohan123 (Member) commented May 31, 2024

The reason we designed --gpu-memory-utilization in the current way is that if a user wants to run multiple instances of vLLM on a single GPU (say 2 instances), they can simply specify --gpu-memory-utilization 0.5 for both instances and it will work. With this PR, it will be harder to achieve this. I think the fix here should be to make the default utilization higher instead of changing its definition.
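For reference, the same split can be expressed through the Python API; a minimal sketch, where the model name is just a placeholder:

```python
from vllm import LLM

# Two processes like this can share one GPU, each capped at half of its memory.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.5)
```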

@alexm-redhat (Collaborator, Author) commented May 31, 2024

@zhuohan123 thanks for providing this info; I was not aware that the GPU utilization parameter was used to partition a single GPU across multiple models. What if we introduce a new parameter, --gpu-fragment, that you can set to 0.5 to cap the total memory used by the vLLM instance at 0.5 of the GPU, while the existing GPU utilization parameter controls the KV cache size?

Btw, your proposal of increasing the default utilization is also good, since it will provide good performance out of the box.

So both options are viable; it would be good to get input from others on which is preferable.

@zhuohan123 (Member)

To me, it feels like the second parameter ("GPU utilization for the KV cache") should always be 100% if we have already fixed the fraction of GPU memory we are using. Is there any benefit for a user to adjust this parameter?

@alexm-redhat (Collaborator, Author)

I think you're right. I will try to bump the default GPU utilization without changing its meaning and check whether I still get good performance. Will report back.

@alexm-redhat (Collaborator, Author)

@zhuohan123 updated the PR based on your suggestion (to simply bump the default value up and not change the meaning) so we can preserve the use-case you described.

alexm-redhat changed the title from "[Core] Optimize gpu_memory_utilization parameter to be applied on the GPU memory available after peak_memory" to "[Core] Bump up the default of gpu_memory_utilization to be more similar to TensorRT Triton's default" on Jun 3, 2024
alexm-redhat changed the title to "[Core] Bump up the default of --gpu_memory_utilization to be more similar to TensorRT Triton's default" on Jun 3, 2024
@alexm-redhat (Collaborator, Author)

Rebased to get the "[CI] Disable flash_attn backend for spec decode (#5286)" fix to allow speculative decoding tests to pass

@alexm-redhat (Collaborator, Author) commented Jun 7, 2024

@comaniac @zhuohan123 @simon-mo @youkaichao @robertgshaw2-neuralmagic all tests pass. I also re-ran the server benchmarks of TRT-Triton vs vLLM (updated PDF above) for both chunked and non-chunked prompts. We can see that vLLM with 0.98 gpu_util and chunked prompts achieves the same e2e user latency and throughput as TRT-Triton, which is encouraging to see.

@youkaichao (Member)

This might be more prone to OOM when our estimate of memory consumption (e.g. for CUDA graphs) is incorrect.

Can we catch the error and tell the user to decrease the GPU memory utilization value?

@alexm-redhat (Collaborator, Author)

Yeah, I can add error catching when the capture happens.

Comment on lines +175 to +176:

```python
gb = 1024 * 1024 * 1024
min_dynamic_memory = int(1.5 * gb)
```
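For context, here is a minimal sketch (with assumed function and variable names, not the PR's actual code) of how such a reservation could bound the KV cache budget:

```python
gb = 1024 * 1024 * 1024
min_dynamic_memory = int(1.5 * gb)  # head-room kept free for dynamic allocations

def kv_cache_budget(total_gpu_memory: int, peak_memory: int,
                    gpu_memory_utilization: float) -> int:
    """Bytes left for KV cache blocks after weights/activations and the reserve."""
    budget = int(total_gpu_memory * gpu_memory_utilization) - peak_memory
    # Never let the KV cache eat into the reserved dynamic-memory slack.
    return min(budget, total_gpu_memory - peak_memory - min_dynamic_memory)
```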
(Member)

My concern is that this can be inaccurate. It is reported in #5517 that CUDA graphs may cost different amounts of memory on different GPUs.

@youkaichao (Member)

@alexm-neuralmagic instead of catching the error (which requires careful cleanup), can we use a hindsight approach? The default value stays 0.9, but after we profile the model we know how much memory we can use, and we can print a message showing the user the value they could increase it to. They can then decide whether or not to change the value.

So basically, default to 0.9:

  • If OOM happens, print an error message guiding users to set a lower value.
  • If everything works, print an info message telling users the max value they can set.
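A minimal sketch of how such a suggestion could be computed (an assumed helper, not existing vLLM code), using torch.cuda.mem_get_info after the profiling run:

```python
import torch

def suggest_gpu_memory_utilization(headroom_gb: float = 1.5) -> float:
    """Rough upper bound for --gpu-memory-utilization that keeps `headroom_gb` GiB free."""
    _, total = torch.cuda.mem_get_info()  # total bytes on the current device
    headroom = headroom_gb * 1024 ** 3
    return round(min((total - headroom) / total, 0.99), 2)

# Illustrative use after the profiling run:
# print(f"INFO: you could raise --gpu-memory-utilization up to {suggest_gpu_memory_utilization()}")
```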

@alexm-redhat (Collaborator, Author)

@youkaichao this is a good idea, but it still requires detecting how much dynamic memory is used at peak. I'm not sure how to do that accurately, which is also the issue this diff faces in avoiding the possible OOM errors that bumping the default up may cause.


This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

github-actions bot added the stale label on Oct 26, 2024
mergify bot added the documentation label on Nov 26, 2024

mergify bot commented Nov 26, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alexm-neuralmagic.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
