Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Multi Step Scheduling #7000

Merged
merged 2 commits into from
Aug 19, 2024
Merged

Conversation

SolitaryThinker
Copy link
Contributor

@SolitaryThinker SolitaryThinker commented Jul 31, 2024

Adds initial multi step scheduling support to vLLM.
RFC: #6854

Current Status:

8/16: Initial support for chunked prefill thanks to @varun-sundar-rabindranath

8/14: Ready for another round of reviews! please review #7452
8/8: multi-node working
8/6: PP+TP working; PP+ray fixed; a few single GPU perf regressions (easy fix)
8/2 PP works with MP; Ready for initial pass on design
8/1 - PP is very close to working. We do get the desired interleaving of steps between microbatches which is great!
7/31 - Current branch is in very rough shape after getting the RFC design working. Will clean up after adding TP/PP support as there may be some refactors needed. However single GPU is ready for initial testing

Cmd:
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B --swap-space 16 --disable-log-requests --use-v2-block-manager --tensor-parallel-size 1 --worker-use-ray --pipeline-parallel-size 1 --gpu-memory-utilization 0.90 --num-scheduler-steps 8

Benchmark (8/16)
See: #7528
CP_1: Force Single Step: We force single step when there are prefill requests in a batch. This may work well for offline batching, but not good for online serving because new requests keep coming.

CP_2: Ignore Prefill (WIP): We ignore prefill requests since the second step, meaning that prefill requests do nothing in (k-1) steps. This may work better for online serving.

Single GPU Baseline (Req/s) Baseline+CP (Req/s) MS-8 (Req/s) MS-8+CP_1 (Req/s)
A10G 8B Llama 6.21 - 6.63 -
H100 8B Llama 25.96 27.82 44.44 31.4
H100 30B Llama 10.38 11.01 14.27 12.31
PP=2 Baseline (Req/s) MS-4 (Req/s) MS-8 (Req/s) MS-12 (Req/s) MS-16 (Req/s)
A10G 8B Llama (microbatch=128) 8.98 - 9.99 - -
H100 8B Llama 23 - 31 - - `
H100 70B Llama 3.09 3.13 3.13 - -
TP=2 Baseline (Req/s) MS-4 (Req/s) MS-8 (Req/s) MS-12 (Req/s) MS-16 (Req/s)
A10G 8B Llama 6.11 - 7.02 - -
TP=2, PP=2 Baseline (Req/s) MS-4 (Req/s) MS-8 (Req/s) MS-12 (Req/s) MS-16 (Req/s)
A10G 8B Llama (microbatch=128) 5.99 - 7.15 - -

TODO:
Milestone 1: POC

  • Add --max_forward_calls_per_step to cli argument, engine args, and schedulerConfig
  • Changes to SequenceGroupState in sequence.py to track multi-step state.
  • Add MultiStepWorker in worker/ to cache multi-step state
  • Changes to ModelRunner to handle multi step state
  • Reorganize input preparation in ModelRunner to reduce duplicate code
  • Async GPU->CPU transfer for sampled token
  • Async pythonization
  • Flash Attn backend
  • Cudagraph
  • Benchmarks (Ongoing)
  • TP
  • PP (works with MP and Ray, mem leak somewhere with RAY)
  • PP+TP
  • multi-node

Milstone 2: Mergeable

Follow up work: Tracking Issue #7528

Copy link

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which consists a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of default ones by unblocking the steps in your fast-check build on Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@rkooo567
Copy link
Collaborator

rkooo567 commented Aug 1, 2024

QQ: do you plan to split PRs to smaller pieces?

@SolitaryThinker
Copy link
Contributor Author

@rkooo567 If there are splits that makes sense I will definitely do that. Currently working on a small part here #6971

vllm/worker/multi_step_model_runner.py Outdated Show resolved Hide resolved
vllm/worker/multi_step_model_runner.py Outdated Show resolved Hide resolved
vllm/worker/multi_step_worker.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@comaniac comaniac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first batch of comments.

vllm/config.py Outdated Show resolved Hide resolved
vllm/engine/arg_utils.py Outdated Show resolved Hide resolved
vllm/engine/async_llm_engine.py Outdated Show resolved Hide resolved
vllm/engine/async_llm_engine.py Show resolved Hide resolved
vllm/model_executor/layers/sampler.py Outdated Show resolved Hide resolved
vllm/sequence.py Outdated Show resolved Hide resolved
vllm/sequence.py Outdated Show resolved Hide resolved
@SolitaryThinker SolitaryThinker changed the title [WIP] [core] Multi Step Scheduling [core] Multi Step Scheduling Aug 9, 2024
@SolitaryThinker
Copy link
Contributor Author

@zhuohan123 @rkooo567 @Yard1 @comaniac @alexm-neuralmagic rebased and ready for review

@SolitaryThinker
Copy link
Contributor Author

Working on a smaller PR that contains parts of this.

Copy link
Member

@zhuohan123 zhuohan123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First round of questions. Will add more tmrw.

vllm/config.py Outdated Show resolved Hide resolved
vllm/sequence.py Outdated Show resolved Hide resolved
vllm/sequence.py Outdated Show resolved Hide resolved
vllm/worker/worker.py Outdated Show resolved Hide resolved
vllm/worker/multi_step_worker.py Outdated Show resolved Hide resolved
Copy link
Member

@zhuohan123 zhuohan123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second batch of reviews

vllm/worker/multi_step_model_runner.py Outdated Show resolved Hide resolved
Execute the model for a single step and update multi-step
metadata
"""
assert num_steps == 1, "MultiStepModelRunner only supports num_steps=1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this assert mean the MultiStepModelRunner can only be run with one step? Can you elaborate on this?

Copy link
Contributor Author

@SolitaryThinker SolitaryThinker Aug 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MultiStepModelRunner only takes a single step internally before returning to AsyncLLMEngine. As the multi-step is done implicitly using stateful model inputs and SequenceGroup states.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the explaination!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing tho. IIRC, this was introduced by me for multi-step draft model runner? We should remove this argument and use stateful model inputs as the unify representation. Also cc @alexm-neuralmagic

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah let's remove this argument

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do this in a follow up PR as it will involve spec decode as well. Will add to TODO tracker

vllm/worker/multi_step_model_runner.py Outdated Show resolved Hide resolved
vllm/worker/multi_step_model_runner.py Outdated Show resolved Hide resolved
vllm/worker/multi_step_model_runner.py Show resolved Hide resolved
vllm/worker/multi_step_model_runner.py Show resolved Hide resolved
vllm/worker/worker_base.py Outdated Show resolved Hide resolved
vllm/engine/async_llm_engine.py Outdated Show resolved Hide resolved
vllm/engine/async_llm_engine.py Outdated Show resolved Hide resolved
vllm/engine/async_llm_engine.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@comaniac comaniac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for me. Leave to @zhuohan123

Copy link
Member

@zhuohan123 zhuohan123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hard work! In general LGTM. Please see my comments.

Comment on lines 889 to 891
if not self.use_v2_block_manager:
raise ValueError("BlockSpaceManagerV2 is required for "
"multi-step (--num-scheduler-steps > 1)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we auto-correct to v2 block manager and print a warning here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Execute the model for a single step and update multi-step
metadata
"""
assert num_steps == 1, "MultiStepModelRunner only supports num_steps=1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah let's remove this argument

vllm/engine/async_llm_engine.py Outdated Show resolved Hide resolved
vllm/engine/async_llm_engine.py Outdated Show resolved Hide resolved
@@ -997,7 +996,7 @@ class SamplerOutput:

# On-device tensor containing the sampled token ids.
sampled_token_ids: Optional[torch.Tensor] = None
sampled_token_ids_numpy: Optional[numpy.ndarray] = None
sampled_token_ids_cpu: Optional[torch.Tensor] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment to explain why we need this variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 100 to 105
class MutableModelInputForGPUWithMultiStepMetadata(BroadcastableModelInput):
# actual frozen model input dataclass passed to _base_model_runner
frozen_model_input: Optional[ModelInputForGPUWithSamplingMetadata] = None

# list of model outputs for each step, may not be all pythonized
outputs: List[ModelOutput] = field(default_factory=list)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If outputs is a part of this data structure, calling this class MutableModelInput seems confusing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the current class name is probably a bit too long. Maybe something like ModelRequests? Feel free to use any other name that makes more sense here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to StatefulModelInputs. The outputs is really a cache and needed for the next step. So renamed to cached_outputs

# Update GPU tensors
ops.advance_step(
num_seqs=num_seqs,
num_queries=num_queries,
block_size=self.block_size,
input_tokens=frozen_model_input.input_tokens,
sampled_token_ids=model_input.outputs[-1].sampled_token_ids,
input_positions=frozen_model_input.input_positions,
seq_lens=attn_metadata.seq_lens_tensor,
slot_mapping=attn_metadata.slot_mapping,
block_tables=attn_metadata.block_tables)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a review comment, just a question: Is this op attention-backend specific?

cc @WoosukKwon

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, eventually we will move this into attention backends API, but it may involve some refactoring to do cleanly. See some initial work done here: #7571

vllm/worker/multi_step_model_runner.py Show resolved Hide resolved
vllm/worker/multi_step_model_runner.py Outdated Show resolved Hide resolved
Comment on lines +117 to +121
model_input.last_sampled_token_ids = (
execute_model_req.last_sampled_token_ids.cuda())
model_input.add_sampler_output(
SamplerOutput(outputs=[], sampled_token_ids=None),
model_input.last_sampled_token_ids)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are not the last pipeline stage, why would we need to know last_sampled_token_ids? We are not running the sampler if we are not the last pipeline stage right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, however non-last PP stages need to use last_sampled_token_ids to perform in-place advance_step on GPU. And this is where we append to model_inputs so that every rank sees a consistent sampled_token_ids for the last step

@zhuohan123
Copy link
Member

zhuohan123 commented Aug 17, 2024

Also before merge, can you please verify the throughput (tokens/sec) gain in the following settings to make sure the PR is good performance-wise:

  • ShareGPT + Llama 8B + 1x H100/A100
  • ShareGPT + Llama 70B + 8x H100/A100

Also, can you add what are the dataset you are using in your original benchmark? Thanks!

server_cli_args: List[str]):

outputs = None
with RemoteOpenAIServer(model_name, server_cli_args) as server:
Copy link
Contributor

@afeldman-nm afeldman-nm Aug 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SolitaryThinker no need to block on this feedback - but if you have time - I would propose adding an example/offline_inference_multi_step.py example which instantiates an engine instance with multi-step enabled. Similar in structure to example/offline_inference.py.

An example of why this is useful - as part of the logprobs workstream, I am trying to step through the multi-step model runner with the python debugger & examine the output logprobs. I am using your multi_step/test_correctness.py in order to set up a server with multi-step enabled.

However, multi_step/test_correctness.py is an end-to-end client/server test & it is not straightforward (although technically doable) to step through the server code with the debugger because the server is in another process.

I will get around this by writing a short script which sets up an engine instance with multi-step enabled.

However, for someone else who is approaching this code for the first time, it could be helpful to have an example file (or unit test) which just sets up an engine instance with multi-step enabled and invokes inference using LLM.generate(). This could be a good way to facilitate quick debugging & also gives insight into how the server works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the offline_inference_multi_step.py script I wrote for myself to facilitate debugging, if you would like to use it.

'''
Example of setting up LLM with multi-step enabled.
In actuality, async engine would be a more sensible choice
for a real use-case. However this example is useful
for demonstration & debugging of multi-step code.
'''

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="JackFram/llama-160m",
          swap_space=16,
          tensor_parallel_size=1,
          gpu_memory_utilization=0.9,
          num_scheduler_steps=8,
          use_v2_block_manager=True,
          )
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Comment on lines +299 to +301
output[0].sampled_token_ids = None
output[0].sampled_token_probs = None
output[0].logprobs = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder if there's a more generic way of doing this. If this data structure gets modified somewhere else it will not be reflected here. Maybe a loop where we check the device if the object is a tensor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are optionals and only set if include_gpu_probs_tensor is set in the sampler.

SolitaryThinker and others added 2 commits August 19, 2024 10:55
remove some redundant test cases

set v2 blockmananger and fix rebase

Update vllm/engine/async_llm_engine.py

Co-authored-by: Zhuohan Li <[email protected]>

Update vllm/engine/async_llm_engine.py

Co-authored-by: Zhuohan Li <[email protected]>

Update vllm/worker/multi_step_model_runner.py

Co-authored-by: Zhuohan Li <[email protected]>

add comment

typo

rename to StatefulModelInput

renamed outputs to cached_outputs

Update vllm/worker/multi_step_model_runner.py

Co-authored-by: afeldman-nm <[email protected]>
@SolitaryThinker
Copy link
Contributor Author

Also before merge, can you please verify the throughput (tokens/sec) gain in the following settings to make sure the PR is good performance-wise:

ShareGPT + Llama 8B + 1x H100/A100
ShareGPT + Llama 70B + 8x H100/A100
Also, can you add what are the dataset you are using in your original benchmark? Thanks!

@zhuohan123
I'm using sharegpt for all the numbers. Benchmarked using the benchmark_serving.py script.
See below for single GPU numbers.
image

Copy link
Member

@zhuohan123 zhuohan123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the hard work! Please make sure to keep track of the TODOs we discussed in this PR.

@Yard1 Yard1 merged commit 47b65a5 into vllm-project:main Aug 19, 2024
65 checks passed
@SolitaryThinker SolitaryThinker deleted the multi-step branch August 19, 2024 22:16
@WoosukKwon
Copy link
Collaborator

[rank0]: File "/data/woosuk/workspace/vllm/vllm/engine/output_processor/multi_step.py", line 88, in process_outputs
[rank0]: assert valid_samples

@SolitaryThinker Huge thanks for the PR! QQ: I got the above error when running benchmark scripts with num_scheduler_steps > 1. Is this expected?

zifeitong pushed a commit to zifeitong/vllm that referenced this pull request Aug 20, 2024
@jiqing-feng
Copy link
Contributor

Hi @WoosukKwon . I see spec decode also has a class name MultiStepWorker, is there any relation with MultiStepWorker from vllm/worker/multi_step_worker.py in this PR?

fialhocoelho pushed a commit to opendatahub-io/vllm that referenced this pull request Aug 22, 2024
omrishiv pushed a commit to omrishiv/vllm that referenced this pull request Aug 26, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.