
[Core] Bump up the default of --gpu_memory_utilization to be more similar to TensorRT Triton's default #5158

Closed

Conversation

@alexm-redhat (Collaborator) commented May 31, 2024

During our comparisons of vLLM vs TRT-LLM-Triton, we found that the default value of the --gpu_memory_utilization=0.9 parameter in vLLM is not optimal compared to TRT's use of --kv_cache_free_gpu_mem_fraction. In both vLLM and TRT, the default value is 0.9; however, in TRT the 0.9 is applied to the memory that is available after the model is loaded, as described in the tensorrtllm_backend documentation:

[image: excerpt from the tensorrtllm_backend documentation describing kv_cache_free_gpu_mem_fraction]

In vLLM, --gpu_memory_utilization is applied before the model is loaded, and this results in vLLM underutilizing the GPU. For example, for an 81GB A100 and a 65GB model like Llama3 70B (split over 2 GPUs), vLLM's GPU memory utilization reaches 72GB (88%), while TRT's reaches 80GB (98%). As a result, batching via the KV cache is reduced for vLLM, which in turn hurts e2e user latency and total throughput.

This PR bumps the --gpu_memory_utilization default value from 0.9 to 0.98, so that vLLM utilizes GPU memory in a way that is more similar to TensorRT. To avoid OOM errors due to the bump, logic was added to ensure that 1.5GB of GPU memory remains available for dynamic allocations.
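For illustration, here is the rough arithmetic behind the numbers above as a small Python sketch; it treats the 65GB as the per-GPU weight footprint and ignores activation/workspace overheads, so it is an approximation rather than a measurement:

```python
# Back-of-the-envelope comparison of the two definitions of the 0.9 default.
total_gb = 81      # A100 memory
weights_gb = 65    # assumed per-GPU weight footprint, taken from the example above
util = 0.9

# vLLM today: the 0.9 budget must cover weights *and* KV cache.
vllm_used = util * total_gb                             # ~72.9 GB -> "72GB (88%)"

# TRT-LLM-Triton: weights load first, then 0.9 of the *remaining* memory goes to KV cache.
trt_used = weights_gb + util * (total_gb - weights_gb)  # ~79.4 GB -> "80GB (98%)"

print(f"vLLM: {vllm_used:.1f} GB, TRT-LLM-Triton: {trt_used:.1f} GB")
```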

To demonstrate the performance difference, an analysis of TTFT, TPOT, e2e user latency and total time is attached for Llama3 70B on 2xA100 GPUs at QPS=1, 2.5, 5, 7.5, 10. The results show the effect of bumping gpu_util to 0.98 for both the default scheduler and the chunked-prompt scheduler. We can see that with 0.98 gpu_util and the chunked scheduler, vLLM achieves the same e2e latency and throughput as the optimized TensorRT-LLM-Triton (with or without chunked prompt).

[image: benchmark charts of TTFT, TPOT, e2e user latency and total time for Llama3 70B on 2xA100]

Original PDF for the benchmark:

TRT-LLM-Triton_vs_vLLM_old_vs_vLLM_new_for_Llama3_70B_2xA100_p256_o128.pdf

To reproduce these results, we have created a tensorrt-demo repository that outlines the steps that need to be taken to run TRT-LLM-Triton and vLLM.

@comaniac (Collaborator) left a comment

LGTM. Nice finding!
Meanwhile, since this changes the definition of --gpu-memory-utilization, we may want to document it somewhere (such as in the release notes) to remind people.

@alexm-redhat (Collaborator, Author)

@comaniac Thanks for the comment! I have updated the docs that refer to --gpu-memory-utilization and also added a CUDA graph estimate to the peak_memory calculation.

```python
# In non-eager mode, we also need to add cuda graph
# memory to peak_memory
if not self.model_config.enforce_eager:
    cuda_graph_memory = 1 * (1000 * 1024 * 1024)  # 1GB estimate
```
@youkaichao (Member) commented May 31, 2024

This might not be accurate; it can vary with the model size. Typically it takes less than 1 GB.

@alexm-redhat (Collaborator, Author)

Yeah, the problem is that the CUDA graph capture happens only after the code determines gpu_blocks. I have tried this with a 70B model and it works. We can also increase the estimate to be on the conservative side if needed.

@alexm-redhat (Collaborator, Author)

Modified the code to estimate CUDA graph memory based on peak_memory (which is a function of model size).
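Purely as an illustration of what a peak_memory-based estimate could look like (the scaling factor and floor below are assumptions, not the PR's actual constants):

```python
def estimate_cuda_graph_memory(peak_memory: int) -> int:
    # Scale the CUDA graph estimate with the profiled peak memory instead of a fixed 1GB.
    # The 2% factor and the 256MiB floor are placeholder values.
    return max(int(0.02 * peak_memory), 256 * 1024 * 1024)
```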

@alexm-redhat (Collaborator, Author)

How much memory does LoRA usually consume in addition to the model memory?

@zhuohan123 (Member) commented May 31, 2024

The reason we designed --gpu-memory-utilization in the current way is that if a user wants to run multiple instances of vLLM on a single GPU (say 2 instances), they can simply specify --gpu-memory-utilization 0.5 for both instances and it will work. With this PR, it will be harder to achieve this. I think the fix here should be to make the default utilization higher instead of changing its definition.
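For reference, the same split can be expressed through the Python API; a minimal sketch, where the model name is just a placeholder:

```python
from vllm import LLM

# Two processes like this can share one GPU, each capped at half of its memory.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.5)
```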

@alexm-redhat (Collaborator, Author) commented May 31, 2024

@zhuohan123 thanks for providing this info; I was not aware that the GPU utilization parameter was used to partition a single GPU across multiple models. What if we introduce a new parameter, --gpu-fragment, that you can set to 0.5 to cap the total memory used by the vLLM instance at 0.5 of the GPU, while the existing GPU utilization parameter controls the KV cache size?

Btw, your proposal of increasing the default utilization is also good, since it will provide good performance out of the box.

So both options are viable; it would be good to get input from others on which is preferable.

@zhuohan123 (Member)

To me, it feels like the second parameter ("GPU utilization for the KV cache") should always be 100% if we have already fixed the fraction of GPU memory we are using. Is there any benefit for a user to adjust this parameter?

@alexm-redhat (Collaborator, Author)

I think you're right. I will try to bump the default GPU utilization without changing its meaning and check whether I still get good performance. Will report back.

@alexm-redhat (Collaborator, Author)

@zhuohan123 updated the PR based on your suggestion (to simply bump the default value up and not change the meaning) so we can preserve the use-case you described.

alexm-redhat changed the title from "[Core] Optimize gpu_memory_utilization parameter to be applied on the GPU memory available after peak_memory" to "[Core] Bump up the default of gpu_memory_utilization to be more similar to TensorRT Triton's default" on Jun 3, 2024
alexm-redhat changed the title to "[Core] Bump up the default of --gpu_memory_utilization to be more similar to TensorRT Triton's default" on Jun 3, 2024
@alexm-redhat (Collaborator, Author)

Rebased to get the "[CI] Disable flash_attn backend for spec decode (#5286)" fix to allow speculative decoding tests to pass

@alexm-redhat (Collaborator, Author) commented Jun 7, 2024

@comaniac @zhuohan123 @simon-mo @youkaichao @robertgshaw2-neuralmagic all tests pass. I also re-ran the server benchmarks of TRT-Triton vs vLLM (updated PDF above) for both chunked and non-chunked prompts. We can see that vLLM with 0.98 gpu_util and chunked prompts achieves the same e2e user latency and throughput as TRT-Triton, which is encouraging to see.

@youkaichao (Member)

This might be more prone to OOM when our estimate of memory consumption (e.g. for CUDA graphs) is incorrect.

Can we catch the error and tell the user to decrease the GPU memory utilization value?

@alexm-redhat (Collaborator, Author)

Yeah, I can add error catching when the capture happens.

Comment on lines +175 to +176:

```python
gb = 1024 * 1024 * 1024
min_dynamic_memory = int(1.5 * gb)
```
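For context, here is a minimal sketch (with assumed function and variable names, not the PR's actual code) of how such a reservation could bound the KV cache budget:

```python
gb = 1024 * 1024 * 1024
min_dynamic_memory = int(1.5 * gb)  # head-room kept free for dynamic allocations

def kv_cache_budget(total_gpu_memory: int, peak_memory: int,
                    gpu_memory_utilization: float) -> int:
    """Bytes left for KV cache blocks after weights/activations and the reserve."""
    budget = int(total_gpu_memory * gpu_memory_utilization) - peak_memory
    # Never let the KV cache eat into the reserved dynamic-memory slack.
    return min(budget, total_gpu_memory - peak_memory - min_dynamic_memory)
```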
(Member)

My concern is that this can be inaccurate. It is reported in #5517 that CUDA graphs may cost different amounts of memory on different GPUs.

@youkaichao (Member)

@alexm-neuralmagic instead of catching the error (which requires careful cleanup), can we use a hindsight approach? The default value stays 0.9, but after we profile the model we know how much memory we can use, and we can print a message showing the user the value they could increase it to. They can then decide whether or not to change the value.

So basically, default to 0.9:

  • If OOM happens, print an error message guiding users to set a lower value.
  • If everything works, print an info message telling users the max value they can set.
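A minimal sketch of how such a suggestion could be computed (an assumed helper, not existing vLLM code), using torch.cuda.mem_get_info after the profiling run:

```python
import torch

def suggest_gpu_memory_utilization(headroom_gb: float = 1.5) -> float:
    """Rough upper bound for --gpu-memory-utilization that keeps `headroom_gb` GiB free."""
    _, total = torch.cuda.mem_get_info()  # total bytes on the current device
    headroom = headroom_gb * 1024 ** 3
    return round(min((total - headroom) / total, 0.99), 2)

# Illustrative use after the profiling run:
# print(f"INFO: you could raise --gpu-memory-utilization up to {suggest_gpu_memory_utilization()}")
```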

@alexm-redhat (Collaborator, Author)

@youkaichao this is a good idea, but it still requires detecting how much dynamic memory is used at peak. I'm not sure how to do that accurately, which is also the issue this diff faces in avoiding the possible OOM errors that bumping the default up may cause.


This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

github-actions bot added the stale label on Oct 26, 2024
mergify bot added the documentation label on Nov 26, 2024

mergify bot commented Nov 26, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alexm-neuralmagic.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
