
[Develop] Performance Improving Feature #1105

Closed
yukavio opened this issue Aug 15, 2024 · 5 comments
yukavio commented Aug 15, 2024

I want to develop some features based on SGLang to improve the performance of srt.

  1. A new scheduler for ControllerMulti that more accurately identifies the resource utilization of each instance and dispatches incoming requests to processes with low utilization.
  2. SplitFuse, which enables decode tokens and extend (prefill) tokens to be computed in a single batch.
  3. Flexible request swapping. This feature allows a request to be transferred to another process for continued computation when the process it belongs to lacks sufficient resources to keep decoding, preventing the request from being halted. The transfer would be implemented via KV cache swapping to avoid extra computation.
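To make feature 1 concrete, here is a toy sketch of load-aware dispatch. All names (`Instance`, `dispatch`, the load scoring) are hypothetical illustrations, not SGLang's actual scheduler API; the utilization score combining queue pressure and KV-cache occupancy is an assumption about what "resource utilization" could mean here.

```python
class Instance:
    """Hypothetical worker handle; illustrative only, not SGLang's API."""
    def __init__(self, name):
        self.name = name
        self.queued_tokens = 0   # tokens waiting in this worker's queue
        self.kv_used = 0.0       # fraction of the KV cache pool in use

    def load(self):
        # Assumed utilization score: queue pressure plus memory pressure.
        return self.queued_tokens + 10_000 * self.kv_used

def dispatch(instances, request_tokens):
    """Send a new request to the least-loaded instance."""
    target = min(instances, key=lambda inst: inst.load())
    target.queued_tokens += request_tokens
    return target

instances = [Instance("worker-0"), Instance("worker-1")]
instances[0].kv_used = 0.8      # worker-0's KV cache is nearly full
chosen = dispatch(instances, request_tokens=512)
print(chosen.name)              # worker-1
```

A round-robin dispatcher would have sent this request to worker-0 regardless of its KV-cache pressure; scoring per-instance utilization is what lets the scheduler avoid that.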

Looking forward to everyone's suggestions.😊

merrymercy commented Aug 27, 2024

1 and 3 are interesting to us.
2 has been implemented here:

```python
parser.add_argument(
    "--enable-mixed-chunk",
    action="store_true",
    help="Enabling mixing prefill and decode in a batch when using chunked prefill.",
)
```

although there is still room for improvement.
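For readers unfamiliar with what `--enable-mixed-chunk` does conceptually, here is a toy sketch of mixed chunked-prefill batching. This is a simplification with made-up names, not SGLang's real batch builder: each running decode request contributes one token, and the prefilling request contributes up to one chunk of its remaining prompt.

```python
def build_mixed_batch(decode_reqs, prefill_req, chunk_size):
    """Toy sketch: combine decode tokens and a prefill chunk in one batch.

    Each decoding request contributes exactly one token; the prefilling
    request contributes up to `chunk_size` of its remaining prompt tokens.
    """
    batch = [(req_id, 1) for req_id in decode_reqs]  # one decode token each
    take = min(chunk_size, prefill_req["remaining_prompt_tokens"])
    if take > 0:
        batch.append((prefill_req["id"], take))
        prefill_req["remaining_prompt_tokens"] -= take
    return batch

decode_reqs = ["r0", "r1", "r2"]
prefill = {"id": "r3", "remaining_prompt_tokens": 900}
batch = build_mixed_batch(decode_reqs, prefill, chunk_size=512)
print(batch)  # [('r0', 1), ('r1', 1), ('r2', 1), ('r3', 512)]
```

The point of mixing is that the compute-heavy prefill chunk and the memory-bound decode steps share one forward pass instead of stalling each other.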

Please join our Slack channel, and we can have more discussions there: https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw

yukavio commented Aug 30, 2024


I have implemented plan 1 in this PR: #1142.
I am considering temporarily setting aside plans 2 and 3 because I believe speculative decoding is a more crucial feature: it could significantly enhance the throughput of the inference server. I am implementing speculative inference based on EAGLE2 now. I will open a PR later, and I estimate that the initial version will be completed within two weeks.
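As background on the draft-then-verify idea behind speculative decoding, here is a minimal greedy-verification sketch. This is a simplification: EAGLE2 actually drafts a dynamic token tree and verifies it against the target model, whereas this toy function only checks a single linear draft against the target model's greedy tokens.

```python
def verify_draft(draft_tokens, target_greedy_tokens):
    """Greedy verification sketch: accept the longest matching prefix of the
    draft, then take the target model's token at the first mismatch."""
    accepted = []
    for d, t in zip(draft_tokens, target_greedy_tokens):
        if d == t:
            accepted.append(d)       # draft token confirmed by the target
        else:
            accepted.append(t)       # target's correction ends this round
            break
    else:
        # Every draft token matched; the target yields one bonus token.
        if len(target_greedy_tokens) > len(draft_tokens):
            accepted.append(target_greedy_tokens[len(draft_tokens)])
    return accepted

draft = [5, 9, 2, 7]
target = [5, 9, 3, 1, 4]             # target disagrees at position 2
print(verify_draft(draft, target))   # [5, 9, 3]
```

The throughput win comes from the target model scoring all draft positions in one forward pass, so each accepted prefix yields multiple tokens per pass instead of one.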

zhyncs commented Aug 30, 2024

Contributions are very welcome! https://arxiv.org/pdf/2406.16858

zhyncs commented Aug 30, 2024

We very much welcome features that improve performance. Overall, we hope that submitted PRs can adhere to the following principles:

  1. If possible, provide profiling information from before and after the optimization, e.g. from nsys.
  2. Provide a benchmark comparison from before and after the optimization. If the changes are particularly large or complex but the overall improvement is less than 10% (or even 5%), we might not consider merging.
  3. Reuse existing components as much as possible.
  4. If you add new components or refactor existing ones, add corresponding unit tests.

Thanks!
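For principle 2, the before/after comparison boils down to a relative throughput improvement; a minimal way to compute the figure (function name and units are illustrative, assuming tokens/s measurements from the same benchmark):

```python
def improvement_pct(baseline_tps, optimized_tps):
    """Relative throughput improvement in percent, given before/after
    measurements from the same benchmark (e.g. tokens per second)."""
    return (optimized_tps - baseline_tps) / baseline_tps * 100

# Example: 1000 tok/s before, 1120 tok/s after the optimization.
print(round(improvement_pct(1000.0, 1120.0), 1))  # 12.0
```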

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
