Flex scheduler #1142
base: main
Conversation
Hi @yukavio, can you briefly describe the context of this PR?
I am trying to implement a load balancing strategy that is better than round robin when using Data Parallel.
If you can provide some context, high-level descriptions, and performance numbers, it will help us understand this PR better.
No problem. The strategy is not yet finalized and I am still iterating on further optimizations. I will provide an overall description and corresponding performance data once it is complete.
Hi @yukavio Nice work! Could you resolve the conflicts? Thanks.
@yukavio Also, it would be better to benchmark Llama 3.1 8B Instruct and Llama 3.1 70B Instruct. Thanks.
@zhyncs I can provide performance comparison results on Llama 3.1 8B later. However, the improvements in this PR are mainly based on DP, and a 70B model would require multiple 8-GPU machines to start enough DP workers for that test. It is difficult for me to gather that many machines for testing.
@Ying1123 Could you help take a look? Thanks.
# """A scheduler which dispatch """ | ||
|
||
|
||
class ControllerMultiFlex: |
revert the name change
Please resolve the conflicts, fix the failed test cases, and add a new test case for data parallelism.
@@ -0,0 +1 @@
python3 -m sglang.launch_server --model-path Qwen/Qwen1.5-1.8B-Chat --host 0.0.0.0 --port 8080 --mem-fraction-static 0.6 --chunked-prefill-size 512
no need to check this in?
I met some problems when merging the upstream branch and I am trying to fix them. I will push the fix commit later and add a test case for it.
We ran into memory problems after merging the latest main branch; details can be found at #1405. It looks like the latest main branch has some issues with memory management that did not occur in the older version.
Motivation
Implement a better dispatch scheduler for DP mode, which dispatches new requests based on the remaining resources of the different inference processes. This helps the server achieve better TTFT under high request rates compared to the round-robin algorithm.
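To make the idea concrete, here is a minimal sketch (not the actual code in this PR) of how a resource-aware dispatcher differs from round robin. The `Worker` fields, the capacity formula, and the worker names are hypothetical placeholders; in the real scheduler the load signal would come from each DP worker's queue and memory pool.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical view of one data-parallel inference process."""
    name: str
    queued_requests: int = 0        # requests waiting to be scheduled
    used_kv_tokens: int = 0         # KV-cache tokens currently occupied
    total_kv_tokens: int = 100_000  # KV-cache capacity

    def remaining_capacity(self) -> float:
        # Fraction of KV cache still free, penalized by queue length.
        free = (self.total_kv_tokens - self.used_kv_tokens) / self.total_kv_tokens
        return free / (1 + self.queued_requests)


def dispatch_round_robin(workers: list[Worker], counter: int) -> Worker:
    """Baseline: ignore load and simply cycle through workers."""
    return workers[counter % len(workers)]


def dispatch_flex(workers: list[Worker]) -> Worker:
    """Resource-aware policy: send the request to the least-loaded worker."""
    return max(workers, key=lambda w: w.remaining_capacity())


if __name__ == "__main__":
    workers = [Worker(f"dp_worker_{i}") for i in range(4)]
    workers[0].queued_requests = 8      # simulate an overloaded worker
    workers[0].used_kv_tokens = 90_000

    print("flex picks:", dispatch_flex(workers).name)                   # avoids dp_worker_0
    print("round robin picks:", dispatch_round_robin(workers, 0).name)  # blindly picks dp_worker_0
```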
Modification
Checklist
pre-commit run --all-files or other linting tools are used to fix potential lint issues.