[V1][WIP] V1 sampler implements parallel sampling (PR 1/N for parallel sampling support) #10980
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 7cd9a24 to f506458 (Compare)
It seems like this PR is implementing ideas similar to those implemented in PR #9302 for the V0 engine. That PR created some issues that were addressed in PR #11898 and which may also exist in the proposed V1 code. In particular, the proposed code currently does not properly handle the case when a seed value is provided for the parent request: the seed value is duplicated across the child requests, leading to identical outputs in the child requests. The fix in #11898 was simply to move the copying of the […]. Additionally, the proposed code for the V1 engine defines […]
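Below is a minimal, self-contained sketch (plain PyTorch, not vLLM code) of the seed-duplication pitfall described in the comment above: if every child request reuses the parent's seed verbatim, all children sample identical tokens. The names, the toy distribution, and the per-child seed offset are illustrative assumptions, not the actual fix applied in #11898.

```python
import torch

vocab_size = 32
probs = torch.softmax(torch.randn(vocab_size), dim=-1)

def sample_child(seed: int, num_tokens: int = 5) -> list[int]:
    # Each child request gets its own generator; if all children share the
    # parent's seed, their draws are identical.
    gen = torch.Generator().manual_seed(seed)
    return [int(torch.multinomial(probs, 1, generator=gen)) for _ in range(num_tokens)]

parent_seed = 1234
child_a = sample_child(parent_seed)  # child request 1 reuses the parent seed
child_b = sample_child(parent_seed)  # child request 2 reuses the parent seed
assert child_a == child_b            # identical completions: the issue described above

# One illustrative remedy (not necessarily what #11898 does) is to derive a
# distinct seed per child, keeping seeded requests reproducible without
# collapsing every child to the same output.
child_c = sample_child(parent_seed + 1)
```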
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Andrew Feldman <[email protected]>
Force-pushed from 3d5b962 to bf3cfd0 (Compare)
The vLLM V1 sampler will support parallel sampling.

Currently the sampler consumes a logits tensor concatenated over all requests, i.e. of shape `total_batch_tokens x vocab_size`, and each request contributes a single completion. With this change the sampler will consume a logits tensor of shape `total_tokens x vocab_size`, where `total_tokens` is the sum of the sequence lengths of all ongoing completions across all requests, since a request may now have more than one ongoing completion (sketched below).

TODO: understand how to exploit prefix caching for parallel sampling.
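As a quick illustration of the bookkeeping above, here is a schematic sketch (hypothetical request names and lengths, not actual vLLM code) of how `total_tokens` would be computed once a request can carry several ongoing completions:

```python
import torch

vocab_size = 32_000

# Hypothetical batch: with parallel sampling, a request may carry several
# ongoing completions, each with its own sequence length.
completion_lens_per_request = {
    "req-a": [7],        # n=1: the existing single-completion case
    "req-b": [5, 5, 6],  # n=3: three parallel completions for one request
}

# total_tokens sums the sequence lengths of all ongoing completions across
# all requests, as described in the PR text. When every request has exactly
# one completion this reduces to the old total_batch_tokens.
total_tokens = sum(sum(lens) for lens in completion_lens_per_request.values())

logits = torch.empty(total_tokens, vocab_size)
print(logits.shape)  # torch.Size([23, 32000])
```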
NOTE: this PR depends on #9880
PR 1/N towards addressing the known need for vLLM V1 parallel sampling support (as described in the vLLM V1 Engine Architecture RFC).