We have added two new load-balancing methods: `resources_aware` and `pre_radix`.
resources_aware
`resources_aware` takes GPU resource usage into account to dynamically schedule requests across data-parallel workers. The comparison results for `resources_aware` are shown in the figure.
The scripts and environment that produce these results are as follows:
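To illustrate the idea, here is a minimal sketch of resource-aware dispatch. It is not the actual SGLang implementation: the class and its load accounting (an in-flight token count as a stand-in for real GPU metrics) are hypothetical, purely to show how a least-loaded policy differs from round-robin.

```python
import random

class ResourcesAwareRouter:
    """Illustrative sketch: route each request to the least-loaded worker.

    The real resources_aware policy inspects live GPU resource usage; here
    we track a simple in-flight token count per worker as a stand-in.
    """

    def __init__(self, num_workers):
        self.load = [0] * num_workers  # estimated outstanding tokens per worker

    def dispatch(self, request_tokens):
        # Choose the worker with the lowest current load; break ties randomly.
        min_load = min(self.load)
        candidates = [i for i, l in enumerate(self.load) if l == min_load]
        worker = random.choice(candidates)
        self.load[worker] += request_tokens
        return worker

    def finish(self, worker, request_tokens):
        # Release the load once the request completes.
        self.load[worker] -= request_tokens

router = ResourcesAwareRouter(num_workers=4)
w = router.dispatch(100)  # all workers idle, so any worker may be chosen
router.dispatch(50)       # goes to one of the remaining idle workers
```

Unlike round-robin, the dispatch decision reacts to how much work each replica currently holds, so long-generation requests stop piling up on one GPU.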
serving:

```shell
/workspace/bin/micromamba run -n sglang python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B --host 127.0.0.1 --port 8080 --mem-fraction-static 0.7 --dp-size 8 --load-balance-method resources_aware
```

bench:

```shell
/workspace/bin/micromamba run -n sglang python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8080 --dataset-name random --tokenizer meta-llama/Meta-Llama-3.1-8B --model meta-llama/Meta-Llama-3.1-8B --random-output-len 1024 --random-input-len 4096 --random-range-ratio 0.5 --seed 1234 --num-prompts 90000 --request-rate 15.7
```
pre_radix
`pre_radix` is built on top of `resources_aware`. It greatly improves the KV cache hit rate and is mainly intended for multi-turn dialogue workloads. Its results are as follows:
We also measured the cache hit rate during inference; the results are as follows:
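The core idea can be sketched as prefix-affinity routing: send a request to the worker whose radix cache shares the longest prefix with it, so later turns of a dialogue land on the worker that already cached the earlier turns. The class below is an illustrative toy, not SGLang's code; it tracks each worker's served sequences in a plain list instead of a radix tree.

```python
def shared_prefix_len(a, b):
    """Length of the common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PreRadixRouter:
    """Illustrative sketch of prefix-affinity routing (not SGLang's code).

    Each worker remembers the token sequences it has served; a new request
    goes to the worker whose history shares the longest prefix with it, so
    multi-turn dialogues keep hitting the same worker's KV cache.
    """

    def __init__(self, num_workers):
        self.histories = [[] for _ in range(num_workers)]  # served sequences per worker

    def dispatch(self, tokens):
        best_worker, best_len = 0, -1
        for i, seqs in enumerate(self.histories):
            match = max((shared_prefix_len(tokens, s) for s in seqs), default=0)
            if match > best_len:
                best_worker, best_len = i, match
        self.histories[best_worker].append(list(tokens))
        return best_worker

router = PreRadixRouter(num_workers=2)
first = router.dispatch([1, 2, 3])         # empty caches, worker 0 by default
follow = router.dispatch([1, 2, 3, 4, 5])  # shares prefix [1, 2, 3] -> same worker
```

A real implementation would fall back to a resource-aware choice when no worker has a meaningful prefix match, which is why `pre_radix` is layered on `resources_aware`.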
round_robin cache hit rate
pre_radix cache hit rate
The scripts and environment that produce these results are as follows:

```shell
/workspace/bin/micromamba run -n sglang python3 -m sglang.launch_server --model-path Qwen/Qwen2-7B --host 127.0.0.1 --port 8080 --mem-fraction-static 0.7 --dp-size 8 --load-balance-method pre_radix
```

```shell
/workspace/bin/micromamba run -n sglang python3 /workspace/sglang/benchmark/multi_turn_chat/bench_sglang.py --tokenizer Qwen/Qwen2-7B --port 8080 --parallel 128 --min-len-q 128 --max-len-q 256 --min-len-a 256 --max-len-a 512 --turns 20 --num-qa 256
```
Note: we also modified the benchmark code so that the number of turns in each multi-turn dialogue is randomized rather than fixed, to make the experimental results more robust.
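The randomization described above amounts to replacing the fixed `--turns` value with a per-conversation draw. This is a hypothetical sketch of that tweak (the function name and parameters are illustrative, not the actual benchmark code):

```python
import random

def gen_turn_counts(num_qa, min_turns, max_turns, seed=1234):
    """Sketch of the benchmark tweak: instead of a fixed --turns value,
    draw a random turn count for each of the num_qa conversations."""
    rng = random.Random(seed)  # seeded so the workload is reproducible
    return [rng.randint(min_turns, max_turns) for _ in range(num_qa)]

turns = gen_turn_counts(num_qa=256, min_turns=2, max_turns=20)
```

Seeding the generator keeps the randomized workload reproducible across runs, so different load-balancing methods still see the same request stream.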