Flex scheduler #1142
base: main
Conversation
Hi @yukavio, can you briefly describe the context of this PR?
I am trying to implement a load balancing strategy that is better than round robin when using Data Parallel.
If you can provide some context, high-level descriptions, and performance numbers, it will help us understand this PR better.
No problem. The strategy is not yet finalized and I am still iterating on further optimizations. I will provide an overall description and corresponding performance data once it is complete.
Hi @yukavio Nice work! Could you resolve the conflicts? Thanks.
@yukavio Also, it would be better to benchmark Llama 3.1 8B Instruct and Llama 3.1 70B Instruct. Thanks.
@zhyncs I can provide performance comparison results on Llama 3.1 8B later. However, the improvements in this PR are mainly based on DP, and a 70B model would require multiple 8-GPU machines to start enough DP workers for that test. It is difficult for me to gather that many machines for testing.
@Ying1123 Could you help take a look? Thanks.
# """A scheduler which dispatch """ | ||
|
||
|
||
class ControllerMultiFlex: |
revert the name change
Please resolve the conflicts, fix the failed test cases, and add a new test case for data parallelism.
@@ -0,0 +1 @@
python3 -m sglang.launch_server --model-path Qwen/Qwen1.5-1.8B-Chat --host 0.0.0.0 --port 8080 --mem-fraction-static 0.6 --chunked-prefill-size 512
no need to check this in?
I met some problems when merging the upstream branch and I am trying to fix them. I will push the fix commit later and add a test case for it.
We ran into memory problems after merging the latest main branch; details can be found at #1405. It looks like the latest main branch has some issues with memory management that did not occur in the older version.
Motivation
Implement a better dispatch scheduler for DP mode, which dispatches new requests based on the remaining resources of the different inference processes. This helps the server achieve better TTFT under high request rates compared to the round-robin algorithm.
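To make the idea concrete, here is a minimal sketch (not the actual code in this PR) of how a resource-aware dispatcher differs from round robin. The `Worker` fields, the capacity formula, and the worker names are hypothetical placeholders; in the real scheduler the load signal would come from each DP worker's queue and memory pool.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical view of one data-parallel inference process."""
    name: str
    queued_requests: int = 0        # requests waiting to be scheduled
    used_kv_tokens: int = 0         # KV-cache tokens currently occupied
    total_kv_tokens: int = 100_000  # KV-cache capacity

    def remaining_capacity(self) -> float:
        # Fraction of KV cache still free, penalized by queue length.
        free = (self.total_kv_tokens - self.used_kv_tokens) / self.total_kv_tokens
        return free / (1 + self.queued_requests)


def dispatch_round_robin(workers: list[Worker], counter: int) -> Worker:
    """Baseline: ignore load and simply cycle through workers."""
    return workers[counter % len(workers)]


def dispatch_flex(workers: list[Worker]) -> Worker:
    """Resource-aware policy: send the request to the least-loaded worker."""
    return max(workers, key=lambda w: w.remaining_capacity())


if __name__ == "__main__":
    workers = [Worker(f"dp_worker_{i}") for i in range(4)]
    workers[0].queued_requests = 8      # simulate an overloaded worker
    workers[0].used_kv_tokens = 90_000

    print("flex picks:", dispatch_flex(workers).name)                   # avoids dp_worker_0
    print("round robin picks:", dispatch_round_robin(workers, 0).name)  # blindly picks dp_worker_0
```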
Modification
Checklist
pre-commit run --all-files or other linting tools are used to fix potential lint issues.