
[Develop] Performance Improving Feature #1105

Closed
yukavio opened this issue Aug 15, 2024 · 5 comments
yukavio commented Aug 15, 2024

I want to develop some features based on SGLang to improve the performance of srt.

  1. A new scheduler for ControllerMulti that more accurately identifies the resource utilization of each instance and dispatches incoming requests to processes with low utilization.
  2. SplitFuse, which enables decode tokens and extend (prefill) tokens to be computed in a single batch.
  3. Flexible request swapping. This feature allows a request to be transferred to another process for continued computation when the process it belongs to lacks sufficient resources to keep decoding, preventing the request from being halted. The transfer would be implemented via KV cache swapping to avoid extra computation.
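To make feature 1 concrete, here is a toy sketch of load-aware dispatch. All names (`Instance`, `dispatch`, the load scoring) are hypothetical illustrations, not SGLang's actual scheduler API; the utilization score combining queue pressure and KV-cache occupancy is an assumption about what "resource utilization" could mean here.

```python
class Instance:
    """Hypothetical worker handle; illustrative only, not SGLang's API."""
    def __init__(self, name):
        self.name = name
        self.queued_tokens = 0   # tokens waiting in this worker's queue
        self.kv_used = 0.0       # fraction of the KV cache pool in use

    def load(self):
        # Assumed utilization score: queue pressure plus memory pressure.
        return self.queued_tokens + 10_000 * self.kv_used

def dispatch(instances, request_tokens):
    """Send a new request to the least-loaded instance."""
    target = min(instances, key=lambda inst: inst.load())
    target.queued_tokens += request_tokens
    return target

instances = [Instance("worker-0"), Instance("worker-1")]
instances[0].kv_used = 0.8      # worker-0's KV cache is nearly full
chosen = dispatch(instances, request_tokens=512)
print(chosen.name)              # worker-1
```

A round-robin dispatcher would have sent this request to worker-0 regardless of its KV-cache pressure; scoring per-instance utilization is what lets the scheduler avoid that.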

Looking forward to everyone's suggestions.😊

merrymercy commented Aug 27, 2024

1 and 3 are interesting to us.
2 has been implemented here:

```python
parser.add_argument(
    "--enable-mixed-chunk",
    action="store_true",
    help="Enabling mixing prefill and decode in a batch when using chunked prefill.",
)
```

although there is still room for improvement.
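For readers unfamiliar with what `--enable-mixed-chunk` does conceptually, here is a toy sketch of mixed chunked-prefill batching. This is a simplification with made-up names, not SGLang's real batch builder: each running decode request contributes one token, and the prefilling request contributes up to one chunk of its remaining prompt.

```python
def build_mixed_batch(decode_reqs, prefill_req, chunk_size):
    """Toy sketch: combine decode tokens and a prefill chunk in one batch.

    Each decoding request contributes exactly one token; the prefilling
    request contributes up to `chunk_size` of its remaining prompt tokens.
    """
    batch = [(req_id, 1) for req_id in decode_reqs]  # one decode token each
    take = min(chunk_size, prefill_req["remaining_prompt_tokens"])
    if take > 0:
        batch.append((prefill_req["id"], take))
        prefill_req["remaining_prompt_tokens"] -= take
    return batch

decode_reqs = ["r0", "r1", "r2"]
prefill = {"id": "r3", "remaining_prompt_tokens": 900}
batch = build_mixed_batch(decode_reqs, prefill, chunk_size=512)
print(batch)  # [('r0', 1), ('r1', 1), ('r2', 1), ('r3', 512)]
```

The point of mixing is that the compute-heavy prefill chunk and the memory-bound decode steps share one forward pass instead of stalling each other.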

Please join our Slack channel, and we can have more discussions there: https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw

yukavio commented Aug 30, 2024


I have implemented plan 1 in this PR: #1142.
I am considering temporarily setting aside plans 2 and 3 because I believe speculative decoding is a more crucial feature: it could significantly enhance the throughput of the inference server. I am implementing speculative inference based on EAGLE2 now. I will open a PR later, and I estimate that the initial version will be completed within two weeks.
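As background on the draft-then-verify idea behind speculative decoding, here is a minimal greedy-verification sketch. This is a simplification: EAGLE2 actually drafts a dynamic token tree and verifies it against the target model, whereas this toy function only checks a single linear draft against the target model's greedy tokens.

```python
def verify_draft(draft_tokens, target_greedy_tokens):
    """Greedy verification sketch: accept the longest matching prefix of the
    draft, then take the target model's token at the first mismatch."""
    accepted = []
    for d, t in zip(draft_tokens, target_greedy_tokens):
        if d == t:
            accepted.append(d)       # draft token confirmed by the target
        else:
            accepted.append(t)       # target's correction ends this round
            break
    else:
        # Every draft token matched; the target yields one bonus token.
        if len(target_greedy_tokens) > len(draft_tokens):
            accepted.append(target_greedy_tokens[len(draft_tokens)])
    return accepted

draft = [5, 9, 2, 7]
target = [5, 9, 3, 1, 4]             # target disagrees at position 2
print(verify_draft(draft, target))   # [5, 9, 3]
```

The throughput win comes from the target model scoring all draft positions in one forward pass, so each accepted prefix yields multiple tokens per pass instead of one.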

zhyncs commented Aug 30, 2024

Contributions are very welcome! https://arxiv.org/pdf/2406.16858

zhyncs commented Aug 30, 2024

We very much welcome features that improve performance. Overall, we hope that submitted PRs can adhere to the following principles:

  1. If possible, provide profiling information from before and after the optimization, e.g. from nsys.
  2. Provide a benchmark comparison from before and after the optimization. If the changes are particularly large or complex but the overall improvement is less than 10% (or even 5%), we might not consider merging.
  3. Reuse existing components as much as possible.
  4. If you add new components or refactor existing ones, add corresponding unit tests.

Thanks!
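For principle 2, the before/after comparison boils down to a relative throughput improvement; a minimal way to compute the figure (function name and units are illustrative, assuming tokens/s measurements from the same benchmark):

```python
def improvement_pct(baseline_tps, optimized_tps):
    """Relative throughput improvement in percent, given before/after
    measurements from the same benchmark (e.g. tokens per second)."""
    return (optimized_tps - baseline_tps) / baseline_tps * 100

# Example: 1000 tok/s before, 1120 tok/s after the optimization.
print(round(improvement_pct(1000.0, 1120.0), 1))  # 12.0
```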

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
