Scheduler methods #1913
Conversation
Very cool work! Are you on the Slack channel? Let's have an offline discussion: https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2s97q9mki-hAaMglU8sV6pQvi3dttgIw
self.workers.append(send_to)
base_gpu_id += server_args.tp_size

if self.pre_raidx:
    import threading
Move this to the top.

Can you fix the CI tests? If this is lightweight, we can also merge this.
try:
    node = deepcopy(self.tree_cache.root_node)
    send_data = RadixCacheSend(
        gpu_id=self.gpu_id, root_node=node, time=time.time()
    )
The radix cache will even contain GPU tensors. Please only send a simplified version without any GPU tensors.
I checked the PyTorch docs and found that maybe we can use torch.multiprocessing. What do you think of it?
https://pytorch.org/docs/stable/notes/multiprocessing.html
torch.multiprocessing is not helpful here because the best solution is to not transfer any TreeNode.value in the radix tree. You can implement a function to drop all TreeNode.value in the tree.
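For illustration, a minimal sketch of such a function, assuming a TreeNode that keeps its payload in value and its children in a children dict (the helper name strip_gpu_values is made up here, not the PR's code):

```python
from copy import copy

def strip_gpu_values(node):
    """Recursively clone a radix-tree node while dropping the GPU tensors
    stored in `value`, so the copy can be pickled and sent cheaply."""
    stripped = copy(node)   # shallow copy keeps keys and metadata
    stripped.value = None   # drop the GPU tensor; only the tree structure is needed
    stripped.children = {
        key: strip_gpu_values(child) for key, child in node.children.items()
    }
    return stripped

# Hypothetical usage at the send site shown in the diff above:
# send_data = RadixCacheSend(
#     gpu_id=self.gpu_id,
#     root_node=strip_gpu_values(self.tree_cache.root_node),
#     time=time.time(),
# )
```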
I did a quick test on long-system-prompt QA, and it looks like the perf is slightly worse than round robin (the benchmark simulates a long prefix plus multiple relatively shorter QA turns). Also, I noticed the cache hit rate is unstable across workers.

Server launcher:

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 127.0.0.1 --port 30000 --dp 8 --load-balance resources_aware
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 127.0.0.1 --port 30000 --dp 8 --load-balance pre_radix

Benchmark client:

python long_prompt_multi_turn.py --port 30000 --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct

For the code, can you add more comments to make it easier to understand?
We will not accept this. Please merge the efforts into the new load balancer #2114.
We have added two new load balancing solutions: resources_aware and pre_radix.
resources_aware
resources_aware takes GPU resource usage into account to schedule requests dynamically. The comparison results for resources_aware are shown in the figure.
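As an illustration of the idea only (the WorkerLoad fields and pick_worker_resources_aware helper below are hypothetical, not the PR's actual API), a resources-aware policy essentially dispatches each request to the worker with the most headroom:

```python
import random
from dataclasses import dataclass

@dataclass
class WorkerLoad:
    num_running_reqs: int      # requests currently decoding on this worker
    num_waiting_reqs: int      # requests still queued for prefill
    available_kv_cache: int    # free KV-cache tokens reported by the scheduler

def pick_worker_resources_aware(loads: dict[int, WorkerLoad]) -> int:
    """Pick the data-parallel worker with the lightest load.

    Workers are ranked by queued + running requests first, then by free
    KV cache; ties are broken randomly so traffic does not stick to one GPU.
    """
    def rank(item):
        gpu_id, load = item
        pressure = load.num_running_reqs + load.num_waiting_reqs
        return (pressure, -load.available_kv_cache, random.random())

    gpu_id, _ = min(loads.items(), key=rank)
    return gpu_id
```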
The script and environment that produce the results are as follows:
serving:
/workspace/bin/micromamba run -n sglang python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B --host 127.0.0.1 --port 8080 --mem-fraction-static 0.7 --dp-size 8 --load-balance-method resources_aware
bench:
/workspace/bin/micromamba run -n sglang python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8080 --dataset-name random --tokenizer meta-llama/Meta-Llama-3.1-8B --model meta-llama/Meta-Llama-3.1-8B --random-output-len 1024 --random-input-len 4096 --random-range-ratio 0.5 --seed 1234 --num-prompts 90000 --request-rate 15.7
pre_radix
pre_radix is implemented on top of resources_aware. It can greatly improve the KV cache hit rate and is mainly intended for multi-turn dialogue workloads (a rough sketch of the routing idea is given after the figures below). Its results are as follows:

We also measured the cache hit rate during inference; the results are as follows:


round_robin cache hit rate
pre_radix cache hit rate
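A rough sketch of the routing idea, under the assumption that each worker's radix tree is mirrored on the router and that tree nodes expose key and children attributes (all helper names here are illustrative, not the PR's code):

```python
def match_prefix_len(root, token_ids):
    """Length of the longest prefix of `token_ids` already cached under `root`.
    Assumes children are keyed by the first token of their edge, as in a radix tree."""
    node, matched = root, 0
    while matched < len(token_ids):
        child = node.children.get(token_ids[matched])
        if child is None:
            break
        common = 0
        while (matched + common < len(token_ids)
               and common < len(child.key)
               and child.key[common] == token_ids[matched + common]):
            common += 1
        matched += common
        if common < len(child.key):   # stopped in the middle of an edge
            break
        node = child
    return matched

def pick_worker_pre_radix(trees, token_ids, fallback_gpu_id):
    """Route to the worker whose mirrored radix tree best matches the prompt;
    fall back to the resources-aware choice when nothing matches."""
    best_gpu, best_len = fallback_gpu_id, 0
    for gpu_id, root in trees.items():
        length = match_prefix_len(root, token_ids)
        if length > best_len:
            best_gpu, best_len = gpu_id, length
    return best_gpu
```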
The script and environment that produce the results are as follows:
/workspace/bin/micromamba run -n sglang python3 -m sglang.launch_server --model-path Qwen/Qwen2-7B --host 127.0.0.1 --port 8080 --mem-fraction-static 0.7 --dp-size 8 --load-balance-method pre_radix
/workspace/bin/micromamba run -n sglang python3 /workspace/sglang/benchmark/multi_turn_chat/bench_sglang.py --tokenizer Qwen/Qwen2-7B --port 8080 --parallel 128 --min-len-q 128 --max-len-q 256 --min-len-a 256 --max-len-a 512 --turns 20 --num-qa 256
By the way, we modified the benchmark code so that the number of rounds in each multi-turn dialogue is a random value, which makes the experimental results more convincing.
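The change could look roughly like the following (a hypothetical sketch of the benchmark tweak, not the actual diff):

```python
import random

def sample_num_turns(max_turns: int, min_turns: int = 2) -> int:
    """Draw a per-conversation number of rounds instead of using a fixed
    --turns value, so the prefix-reuse pattern varies across clients."""
    return random.randint(min_turns, max_turns)
```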