Dynamic scheduler delay to improve ITL performance #3279
Conversation
Co-authored-by: Thomas Parnell <[email protected]>
Take a look also at the chunked prefill efforts to address this.
@robertgshaw2-neuralmagic Thanks, and agreed: chunked prefill may eventually solve this problem in a different way. We hope that this relatively simple, optional change can be used to improve performance in the meantime.
This might affect #3168 and IMO it's worth thinking about how to integrate these control changes with each other.
@tdoublep We were planning to upstream something similar, but instead of time we used the number of decode iterations ("schedule prefill iteration only after N decode iterations have been completed or there are no running sequences"). We believe that this scheme is more generic and easier to implement. I'd be happy to make a PR early next week, if you are interested in trying that out.
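For clarity, here is a minimal sketch (illustrative only, not vLLM or Anyscale code) of the iteration-count gating described above: a prefill is only admitted after N decode iterations have completed, or when nothing is running.

```python
# Sketch of iteration-count-based gating: schedule a prefill only after
# N decode iterations have completed, or when there are no running sequences.
class IterationGatedScheduler:
    def __init__(self, decode_iters_per_prefill: int = 8):
        self.decode_iters_per_prefill = decode_iters_per_prefill
        self.decode_iters_since_prefill = 0

    def should_schedule_prefill(self, num_running: int, num_waiting: int) -> bool:
        if num_waiting == 0:
            return False
        # If nothing is running, there is no decode progress to protect.
        if num_running == 0:
            return True
        return self.decode_iters_since_prefill >= self.decode_iters_per_prefill

    def on_decode_step(self) -> None:
        self.decode_iters_since_prefill += 1

    def on_prefill_step(self) -> None:
        self.decode_iters_since_prefill = 0
```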
@Yard1 could you elaborate on "more generic and easier to implement"? Isn't it completely generic and fairly trivial to implement in either case? We found the adaptive time-based approach to work very well, and it makes more sense to me intuitively at least. The goal is to prevent prefills from starving decode progress - the enforced delay is some fraction of the duration of the last prefill, and so it is equivalent to saying that no more than, say, 50% of the time can be spent in prefill. We chose this minimum delay to be half the last prefill time, which ensures at most 66% of the time is spent in prefill. Of course, like in your case, the minimum delay only applies while there are still running sequences.
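A minimal sketch of the adaptive time-based gating described in this comment (illustrative names, not the PR's actual implementation): the scheduler waits at least a fraction of the last prefill duration before scheduling the next prefill. With a factor of 0.5, a prefill of duration T is followed by a pause of at least T/2, so prefill occupies at most T / (T + T/2) = 2/3 of wall-clock time.

```python
import time

# Sketch of adaptive time-based gating: wait at least
# delay_factor * (duration of last prefill) before the next prefill,
# but only while there are still running sequences to decode.
class DelayGatedScheduler:
    def __init__(self, delay_factor: float = 0.5):
        self.delay_factor = delay_factor
        self.last_prefill_duration = 0.0
        self.last_prefill_end = 0.0

    def should_schedule_prefill(self, num_running: int, num_waiting: int) -> bool:
        if num_waiting == 0:
            return False
        if num_running == 0:
            # No decode work to starve, so prefill immediately.
            return True
        min_delay = self.delay_factor * self.last_prefill_duration
        return (time.monotonic() - self.last_prefill_end) >= min_delay

    def record_prefill(self, start: float, end: float) -> None:
        self.last_prefill_duration = end - start
        self.last_prefill_end = end
```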
Hmm, I now see the delay is dynamic. I think thinking in terms of model iterations is simpler, but I suppose that this approach should be just as good. @tdoublep would it be possible for you to open source your benchmarking tool?
@Yard1 Yes - we do plan to open-source the benchmarking tool. We are working through that process internally at the moment.
@tdoublep Which value of
@sh1ng
Based on the discussion here, it sounds like sorting the requests in the waiting queue will no longer be necessary once we merge #3236, which effectively removes the padding constraints via 1D query. We have run additional experiments to compare the performance when using 1D query from #3236, as well as to evaluate the performance if we enable the dynamic delay (from this PR) in combination with 1D query.

Conclusion: combining dynamic scheduler delay (#3279) with 1D query (#3236) is even more effective than combining it with sorting requests by length (#2357).
Update: Added a test case in
Now that 1D query has been merged, the changes from this PR can be effective when applied on top of the main branch. Here is the latest round of benchmarking results. I've also included performance data collected using TGIS (our fork of TGI) as an additional reference point. Some conclusions here:
Looks good. I think it would be even better if we didn't hardcode it to 0.5. I think we could make the argument a float, and if it is <=0, we don't apply the delay.
@Yard1 Good idea - there is no reason to assume that 0.5 is an optimum for all scenarios. I've updated the code accordingly.
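For illustration, a sketch of the flag shape agreed on above, assuming a float argument where any value <= 0 disables the delay (the exact argument name, default, and wiring in the merged code may differ):

```python
import argparse

# Sketch of the discussed interface: a float delay factor, where values <= 0
# disable the delay entirely. Not the actual vLLM argument parser.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--scheduler-use-delay",
    type=float,
    default=0.0,
    help="Fraction of the last prefill duration to wait before scheduling "
         "the next prefill batch. Values <= 0 disable the delay.")
args = parser.parse_args(["--scheduler-use-delay", "0.5"])

delay_factor = args.scheduler_use_delay
use_delay = delay_factor > 0.0
print(f"delay enabled: {use_delay}, factor: {delay_factor}")
```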
@Yard1 are you approving this PR?
@Yard1 thanks for the review and helpful discussion and suggestions.
@tdoublep Does vLLM have a doc about configuration? It feels like this would be worth adding there if so. I.e., there are config settings to optimize throughput over latency, or TTFT over ITL, or the other way around, but it seems like these things are not that well documented.
@rkooo567 I agree it would be good to have documentation like that. The closest thing I can find is the developer documentation, e.g.: Perhaps we should consider adding some more pages there to document the
I see. Yeah, +1 we need better docs for the configs, but it seems like there's no holistic page that explains this.
We have been benchmarking vLLM internally using a synthetic workload generator that has been fitted to mimic our production workloads. It stresses the inference server with a varying number of concurrent users; all users send requests drawn uniformly from a heterogeneous set of requests with different prompt lengths and numbers of generated tokens.
We have found that for these workloads, vLLM has extremely low TTFT (time to first token) but has relatively high ITL (inter-token latency). An in-depth analysis seems to show that vLLM tends to schedule prompts as soon as possible, resulting in very small prompt batches, which are processed very quickly, but end up starving the decoding phase.
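For reference, a sketch of how the two metrics mentioned here are typically computed from per-token timestamps; this is illustrative and not our internal benchmarking tool.

```python
# TTFT and ITL for a single request, computed from token arrival timestamps.
def ttft(request_start: float, token_times: list[float]) -> float:
    """Time to first token: first token timestamp minus request arrival."""
    return token_times[0] - request_start

def itl(token_times: list[float]) -> float:
    """Mean inter-token latency: average gap between consecutive tokens."""
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return sum(gaps) / len(gaps)

# Example: the first token arrives quickly (low TTFT), but later tokens are
# spaced out because decode steps are pre-empted by prompt processing.
print(ttft(0.0, [0.05, 0.40, 0.75]))  # 0.05
print(itl([0.05, 0.40, 0.75]))        # ~0.35
```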
This PR adds a new optional feature, `--scheduler-use-delay`, which, if enabled, creates an artificial delay before scheduling prompts. The delay is determined dynamically based on the time taken to perform the last prompt step. This delay allows the waiting queue to fill up with more requests, which gives the opportunity to make larger prompt batches, but due to the heterogeneous nature of the workload we then hit issues related to padding overhead. It is thus beneficial to combine this scheduler delay with the `--scheduler-policy=reorder` feature from #2357, which sorts the waiting queue by sequence length. This allows us to create much larger prompt batches whilst staying within the padding limits, and leads to significant improvements in ITL performance.

This ITL improvement comes at the expense of TTFT performance, since (a) we are applying an artificial delay before scheduling prompts and (b) we are now processing larger batches which take longer to process. Different use cases may have a preference towards either metric, which is why we feel this makes sense as an optional feature for now.
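To make the interaction concrete, here is a hedged sketch (illustrative names, not vLLM's internal scheduler API) of how the dynamic delay and the reorder policy could compose when forming a prompt batch:

```python
import time

# Sketch: apply the dynamic delay before admitting prompts, then sort the
# waiting queue by prompt length so the larger prompt batch stays within
# padding limits. All names and the budget value are illustrative.
def schedule_prompts(waiting, num_running, last_prefill_end,
                     last_prefill_duration, delay_factor=0.5,
                     padded_token_budget=8192):
    # Dynamic delay: while sequences are running, wait at least
    # delay_factor * last_prefill_duration before the next prefill.
    if num_running > 0:
        min_delay = delay_factor * last_prefill_duration
        if time.monotonic() - last_prefill_end < min_delay:
            return []
    # Reorder policy (#2357): group similar-length prompts to reduce padding.
    waiting = sorted(waiting, key=lambda seq: seq["prompt_len"])
    batch, max_len = [], 0
    for seq in waiting:
        new_max = max(max_len, seq["prompt_len"])
        # Padded batch cost is (batch size) x (longest prompt in the batch).
        if new_max * (len(batch) + 1) > padded_token_budget:
            break
        batch.append(seq)
        max_len = new_max
    return batch
```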
Benchmarking results (labels on each point indicate the number of concurrent users):