
Scheduler methods #13

Closed · wants to merge 16 commits

Conversation

josephydu

We have added two new load-balancing policies: resources_aware and pre_radix.

resources_aware

resources_aware takes GPU resource usage into account to schedule requests dynamically. Its benchmark results are shown in the figure below.

[image: resources_aware benchmark comparison]
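To make the mechanism concrete, here is a minimal sketch of the resources-aware idea. All names here (ResourcesAwareBalancer, available_kv_cache, num_running_reqs) are illustrative, not the actual SGLang internals: each data-parallel worker reports its free KV-cache capacity and in-flight request count, and new requests go to the least-loaded worker.

    # Minimal sketch of resources-aware dispatch. Names are illustrative,
    # not the actual SGLang implementation.
    class ResourcesAwareBalancer:
        def __init__(self, num_workers: int):
            self.num_workers = num_workers
            # Refreshed from periodic status reports sent by each worker.
            self.available_kv_cache = [0] * num_workers  # free KV-cache tokens
            self.num_running_reqs = [0] * num_workers    # in-flight requests

        def update(self, worker_id: int, free_kv_tokens: int, running: int) -> None:
            self.available_kv_cache[worker_id] = free_kv_tokens
            self.num_running_reqs[worker_id] = running

        def dispatch(self) -> int:
            # Pick the worker with the fewest running requests, breaking ties
            # by the largest amount of free KV-cache memory.
            return min(
                range(self.num_workers),
                key=lambda i: (self.num_running_reqs[i], -self.available_kv_cache[i]),
            )

A real scheduler would also track queued prefill tokens and report staleness; the sketch only shows the ranking principle.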

The commands and environment that produce these results are as follows:

serving:
/workspace/bin/micromamba run -n sglang python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B --host 127.0.0.1 --port 8080 --mem-fraction-static 0.7 --dp-size 8 --load-balance-method resources_aware

bench:
/workspace/bin/micromamba run -n sglang python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 8080 --dataset-name random --tokenizer meta-llama/Meta-Llama-3.1-8B --model meta-llama/Meta-Llama-3.1-8B --random-output-len 1024 --random-input-len 4096 --random-range-ratio 0.5 --seed 1234 --num-prompts 90000 --request-rate 15.7

pre_radix

pre_radix is implemented on top of resources_aware. It can greatly improve the KV-cache hit rate and is mainly intended for multi-turn dialogue workloads. Its results are as follows:
[image: pre_radix benchmark comparison]
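The core idea can be sketched as follows. This is a minimal illustration under assumed names (PreRadixRouter, prefix_to_worker, and fallback_dispatch are hypothetical, not the actual implementation): requests whose prompt shares a known prefix are routed to the worker that served that prefix before, so its radix cache can be reused; unseen prefixes fall back to resources-aware dispatch.

    from typing import Callable, Dict

    # Illustrative sketch of the pre_radix idea (hypothetical names): remember
    # which worker last served each dialogue prefix so later turns land on the
    # worker whose radix cache already holds that prefix; unknown prefixes
    # fall back to a resources-aware choice.
    class PreRadixRouter:
        def __init__(self, fallback_dispatch: Callable[[], int], prefix_len: int = 64):
            self.fallback_dispatch = fallback_dispatch  # e.g. ResourcesAwareBalancer.dispatch
            self.prefix_len = prefix_len
            self.prefix_to_worker: Dict[str, int] = {}

        def dispatch(self, prompt: str) -> int:
            key = prompt[: self.prefix_len]
            if key not in self.prefix_to_worker:
                # First turn of this dialogue: no worker holds the prefix yet.
                self.prefix_to_worker[key] = self.fallback_dispatch()
            return self.prefix_to_worker[key]

Routing by prefix affinity is what raises the cache hit rate in multi-turn chat: every follow-up turn repeats the conversation history, which is exactly what the radix cache stores.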

We also measured the cache hit rate during inference; the results are as follows:
[image: round_robin cache hit rate]
[image: pre_radix cache hit rate]

The commands and environment that produce these results are as follows:
/workspace/bin/micromamba run -n sglang python3 -m sglang.launch_server --model-path Qwen/Qwen2-7B --host 127.0.0.1 --port 8080 --mem-fraction-static 0.7 --dp-size 8 --load-balance-method pre_radix

/workspace/bin/micromamba run -n sglang python3 /workspace/sglang/benchmark/multi_turn_chat/bench_sglang.py --tokenizer Qwen/Qwen2-7B --port 8080 --parallel 128 --min-len-q 128 --max-len-q 256 --min-len-a 256 --max-len-a 512 --turns 20 --num-qa 256

By the way, we modified the benchmark code to randomize the number of turns in each multi-turn dialogue, to make the experimental results more convincing.
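Concretely, the tweak amounts to something like the following sketch, where the variable names are hypothetical and --turns becomes an upper bound rather than a fixed count:

    import random

    # Sketch of the benchmark change: sample a random turn count per dialogue
    # instead of a fixed one, so cache-hit results do not depend on every
    # conversation having the same length. (Hypothetical names.)
    num_turns = random.randint(2, args.turns)  # was: num_turns = args.turns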

Review comment on the scheduler code:

    self.main_available_kv_cache = available_mem.copy()
    if self.pre_available_kv_cache != available_mem:
        self.pre_available_kv_cache = available_mem.copy()
        self.main_available_kv_cache = available_mem.copy()
Owner:

Only one copy needs to remain here; the duplicated assignment can be removed.
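One reading of that suggestion, keeping a single assignment to main_available_kv_cache (an interpretation, not necessarily the merged code):

    # The unconditional assignment already covers both branches, so the
    # duplicated copy inside the if-block can be dropped.
    self.main_available_kv_cache = available_mem.copy()
    if self.pre_available_kv_cache != available_mem:
        self.pre_available_kv_cache = available_mem.copy()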

@josephydu closed this on Oct 23, 2024.