
Scheduler methods #1913

Closed

josephydu wants to merge 9 commits
Conversation

josephydu
Contributor

We have added two new load-balancing methods: resources_aware and pre_radix.

resources_aware

resources_aware takes GPU resource usage into account and schedules requests dynamically based on it. A simplified sketch of the dispatch idea is shown below, followed by the comparison results in the figure.
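To give a concrete picture, here is a minimal sketch of the dispatch idea (illustrative only; the field names and tie-breaking rule are assumptions, not the exact implementation):

from dataclasses import dataclass
from typing import List

@dataclass
class WorkerLoad:
    worker_id: int
    free_kv_cache_tokens: int  # free KV-cache capacity reported by the worker
    queue_len: int             # number of requests currently waiting on that worker

def pick_worker(loads: List[WorkerLoad]) -> int:
    # Prefer the worker with the most free KV cache; break ties by the shortest queue.
    best = max(loads, key=lambda w: (w.free_kv_cache_tokens, -w.queue_len))
    return best.worker_id

# Example: worker 0 has more free cache, so it receives the next request.
print(pick_worker([WorkerLoad(0, 120_000, 3), WorkerLoad(1, 80_000, 1)]))  # -> 0

In other words, instead of cycling through workers round-robin, the router keeps per-worker load statistics and sends each new request to the least loaded data-parallel worker.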

[figure: resources_aware benchmark comparison]

The scripts and environment that produce the results are as follows:
serving:
/workspace/bin/micromamba run -n sglang python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B --host 127.0.0.1 --port 8080 --mem-fraction-static 0.7 --dp-size 8 --load-balance-method resources_aware
bench:
/workspace/bin/micromamba run -n sglang python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8080 --dataset-name random --tokenizer meta-llama/Meta-Llama-3.1-8B --model meta-llama/Meta-Llama-3.1-8B --random-output-len 1024 --random-input-len 4096 --random-range-ratio 0.5 --seed 1234 --num-prompts 90000 --request-rate 15.7

pre_radix

pre_radix is built on top of resources_aware. It can greatly improve the KV cache hit rate and is mainly intended for multi-turn dialogue workloads. A simplified sketch of the routing idea is shown below, followed by its benchmark results in the figure.
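Roughly, the routing idea can be sketched like this (illustrative only; the actual implementation shares each worker's radix cache with the router, which is modeled here as plain token-id lists):

from typing import Dict, List

def shared_prefix_len(a: List[int], b: List[int]) -> int:
    # Length of the common token prefix of two sequences.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_worker_by_prefix(
    request_tokens: List[int],
    cached_prefixes: Dict[int, List[List[int]]],  # worker_id -> token sequences known to be cached
    fallback_worker: int,
) -> int:
    # Route to the worker whose cache shares the longest prefix with the request;
    # if nothing matches, fall back (e.g. to the resources_aware choice).
    best_worker, best_len = fallback_worker, 0
    for worker_id, prefixes in cached_prefixes.items():
        for p in prefixes:
            m = shared_prefix_len(request_tokens, p)
            if m > best_len:
                best_worker, best_len = worker_id, m
    return best_worker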
[figure: pre_radix benchmark results]

We also measured the cache hit rate during inference; the results are as follows:
[figure: round_robin cache hit rate]
[figure: pre_radix cache hit rate]

The scripts and environment that produce the results are as follows:
/workspace/bin/micromamba run -n sglang python3 -m sglang.launch_server --model-path Qwen/Qwen2-7B --host 127.0.0.1 --port 8080 --mem-fraction-static 0.7 --dp-size 8 --load-balance-method pre_radix

/workspace/bin/micromamba run -n sglang python3 /workspace/sglang/benchmark/multi_turn_chat/bench_sglang.py --tokenizer Qwen/Qwen2-7B --port 8080 --parallel 128 --min-len-q 128 --max-len-q 256 --min-len-a 256 --max-len-a 512 --turns 20 --num-qa 256

By the way, we modified the benchmark code so that the number of turns in each multi-turn dialogue is randomized, to make the experimental results more convincing.

@ByronHsu
Collaborator

ByronHsu commented Nov 4, 2024

https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2s97q9mki-hAaMglU8sV6pQvi3dttgIw

Very cool work! Are you on the Slack channel? Let's have an offline discussion.

yukavio mentioned this pull request on Nov 4, 2024
)

self.workers.append(send_to)
base_gpu_id += server_args.tp_size

if self.pre_raidx:
    import threading
Contributor

Move this import to the top of the file.

@merrymercy
Contributor

Can you fix the CI tests? If this is lightweight, we can also merge this.

try:
    node = deepcopy(self.tree_cache.root_node)
    send_data = RadixCacheSend(
        gpu_id=self.gpu_id, root_node=node, time=time.time()
Contributor

the radix cache will even contain GPU tensors. Please only send a simplified version without any GPU tensors.

josephydu
Contributor Author

I checked the PyTorch docs and found that maybe we can use torch.multiprocessing. What do you think?
https://pytorch.org/docs/stable/notes/multiprocessing.html
[screenshot of the linked PyTorch documentation]

merrymercy (Contributor) commented Nov 8, 2024

torch.multiprocessing is not helpful here because the best solution is to not transfer any TreeNode.value in the radix tree. You can implement a function to drop all TreeNode.value in the tree.
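For example, something along these lines would keep only the token structure (a sketch; it assumes TreeNode exposes key and children attributes, which may differ from the actual class):

def to_token_only_tree(node):
    # Recursively copy the radix tree, keeping token ids and tree structure but
    # dropping every node's value (the GPU-backed KV indices), so no GPU tensors
    # are pickled and sent across processes.
    return {
        "key": list(node.key),
        "children": {k: to_token_only_tree(child) for k, child in node.children.items()},
    }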

@ByronHsu
Collaborator

ByronHsu commented Nov 9, 2024

I did a quick test on long-system-prompt QA, and it looks like the performance is slightly worse than round robin (the benchmark simulates a long prefix followed by multiple relatively short QA turns). Also, I noticed the cache hit rate is unstable across workers.

[screenshot: benchmark results]

server launcher

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 127.0.0.1 --port 30000 --dp 8 --load-balance resources_aware
python -m sglang.launch_server --model-path  meta-llama/Meta-Llama-3.1-8B-Instruct  --host 127.0.0.1 --port 30000 --dp 8 --load-balance pre_radix

benchmark client

python long_prompt_multi_turn.py --port 30000 --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct

For the code, can you add more comments to make it easier to understand?

@merrymercy
Contributor

We will not accept this. Please merge the efforts into the new load balancer (#2114).

merrymercy closed this on Nov 22, 2024