Integrate vllm/ns into llm-on-ray #267

jiafuzha · 2024-07-16T08:52:46Z

This PR replaces the closed PR #264, which was based on an old branch. It also merges some enhancements from the NS main branch.

Reshaped neural-speed into a fully functional inference engine for vLLM.
Integrated the vLLM NS extension into llm-on-ray and optimized deployment with Ray.
Optimized neural-speed in several places, including compute graph construction, multi-NUMA-node deployment, and enabling the flash attention kernel on Llama-3-8b.
Updated and fixed some benchmark scripts for the IDC test and the OpenAI mode test, including support for multiple messages with different roles, removal of empty chunks, and fixes for wrong first token latency and next token latency in OpenAI mode.
Only Llama-2-7b-chat-hf and Llama-3-8b-instruct are supported for now, but support can quickly be extended to other models.
Addressed some review comments from the last closed PR.
2X performance improvement compared to plain vLLM on CPU.
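
For reviewers who want a picture of the latency fix mentioned above, here is a minimal sketch of how first token latency and next token latency can be measured over a streamed response while skipping empty chunks. It is not the code in this PR's benchmark script: the function name `measure_stream_latency`, the injectable `clock` parameter, and the assumption that each non-empty chunk carries one token are all hypothetical and used only for illustration.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def measure_stream_latency(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], Optional[float], int]:
    """Return (first_token_latency, mean_next_token_latency, token_count).

    Hypothetical helper: every non-empty chunk is assumed to be one token.
    Empty chunks are ignored so they cannot be mistaken for the first token.
    """
    start = clock()
    arrivals: List[float] = []
    for chunk in chunks:
        if not chunk:
            # Skip empty chunks; they carry no generated text.
            continue
        arrivals.append(clock())

    if not arrivals:
        # Nothing was generated, so no latency can be reported.
        return None, None, 0

    first_token_latency = arrivals[0] - start
    if len(arrivals) > 1:
        # Average gap between consecutive tokens after the first one.
        next_token_latency = (arrivals[-1] - arrivals[0]) / (len(arrivals) - 1)
    else:
        next_token_latency = None
    return first_token_latency, next_token_latency, len(arrivals)
```

The point of the sketch is that the timer for the first token stops at the first non-empty chunk, and the next token latency is averaged only over the gaps that follow it. In a real benchmark the `chunks` iterable would be the text deltas read from the streaming HTTP response.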

KepingYan and others added 30 commits April 17, 2024 09:12

add benchmark run script, visualize script

d2d1f20

upd

88cc01e

update multi replicas

083ae60

use --result-dir to parse results

4c6fa74

fix ci proxy

1b3b13a

add test ci

184e00e

add license

bd85b7d

fix

38c52ed

fix

78dc091

add autoscaling config

7cc0de0

fix ci

e241b25

fix ci

3eb1c08

add package matplotlib

882ff4d

verify CI test

21994cd

verify CI test

d688804

create assets folder to place pictures

c8eabbc

verify CI test

3905082

support openai autoscaling

97ec06a

remove

606f286

integrate vllm and ns

55c1dd1

Signed-off-by: Jiafu Zhang <[email protected]>

update config file

e709010

integrate vllm and ns

5b1bd85

Signed-off-by: Jiafu Zhang <[email protected]>

integrate vllm and ns

eb71ace

Signed-off-by: Jiafu Zhang <[email protected]>

remove .eggs

a969f7f

Signed-off-by: Jiafu Zhang <[email protected]>

integration adjustment

1b6aba3

Signed-off-by: Jiafu Zhang <[email protected]>

llm on ray deployed

ce3ac61

Signed-off-by: Jiafu Zhang <[email protected]>

llm on ray deployed

213ad89

Signed-off-by: Jiafu Zhang <[email protected]>

llm on ray deployed

9b4884f

Signed-off-by: Jiafu Zhang <[email protected]>

more doc

3cb6f64

Signed-off-by: Jiafu Zhang <[email protected]>

merge with master

3f9ba62

Signed-off-by: Jiafu Zhang <[email protected]>

jiafuzha and others added 25 commits June 27, 2024 07:49

fix formatting issue

5ac7907

Signed-off-by: Jiafu Zhang <[email protected]>

fix merge error

19fc069

Signed-off-by: JoshuaL3000 <[email protected]>

Merge remote-tracking branch 'refs/remotes/origin/vllm-ns-perf-test' …

76fe811

…into vllm-ns-perf-test

add vllm-ns ci

5760c65

Signed-off-by: Jiafu Zhang <[email protected]>

remove unnecessary logs

30efd3f

Signed-off-by: Jiafu Zhang <[email protected]>

remove some debug code

1d9b4e3

Signed-off-by: Jiafu Zhang <[email protected]>

add '--privileged' to docker run

a14a146

Signed-off-by: Jiafu Zhang <[email protected]>

set unlimited max lock memory for neural speed engine

4f59cb8

Signed-off-by: Jiafu Zhang <[email protected]>

merged with master

4df4f85

Signed-off-by: Jiafu Zhang <[email protected]>

llama-3-8B support

e781d0b

Signed-off-by: Jiafu Zhang <[email protected]>

extend token length limit to 8192 for mha

af7730a

Signed-off-by: Jiafu Zhang <[email protected]>

extend token length limit to 8192 for mha

a92f019

Signed-off-by: Jiafu Zhang <[email protected]>

extend token length limit to 8192 for mha (fix) and support different…

77ee207

… threads for prompt decoding and next token decoding Signed-off-by: Jiafu Zhang <[email protected]>

extend token length limit to 8192 for mha (fix) and support different…

5154887

… threads for prompt decoding and next token decoding Signed-off-by: Jiafu Zhang <[email protected]>

add llama3 for plain cpu

f8e51a2

Signed-off-by: Jiafu Zhang <[email protected]>

benchmark idc simple/medium/complex/verycomplex prompts

4ab3b0a

Signed-off-by: Jiafu Zhang <[email protected]>

benchmark idc simple/medium/complex/verycomplex prompts

5476705

Signed-off-by: Jiafu Zhang <[email protected]>

benchmark idc simple/medium/complex/verycomplex prompts

ea02ef3

Signed-off-by: Jiafu Zhang <[email protected]>

add inference_engine resource and app_router resource to distinct eng…

7952602

…ine worker and router worker since they have different resource config Signed-off-by: Jiafu Zhang <[email protected]>

Merge remote-tracking branch 'refs/remotes/origin/vllm-ns-merged-209-…

c5c6a12

…7d49516' into vllm-ns-merged-209-7d49516

enhanced benchmark script to support IDC test data

58ad614

Signed-off-by: Jiafu Zhang <[email protected]>

updated ray startup script to add resources for app_router and infere…

724eced

…nce_engine Signed-off-by: Jiafu Zhang <[email protected]>

fix first token latency and next token latency issue in open-ai mode …

7a4d7fd

…in benchmark script Signed-off-by: Jiafu Zhang <[email protected]>

updated ray startup script to add resources for app_router and infere…

5a59427

…nce_engine Signed-off-by: Jiafu Zhang <[email protected]>

addressed some review comments

d5694c2

Signed-off-by: Jiafu Zhang <[email protected]>

jiafuzha requested review from KepingYan, xwu99 and carsonwang July 16, 2024 08:52

jiafuzha added 2 commits July 16, 2024 09:12

fix lint issue

10f0f7c

Signed-off-by: Jiafu Zhang <[email protected]>

address review comment by getting number of threads from ray num-cpus…

52c3451

… and setting threads for cases not with ray Signed-off-by: Jiafu Zhang <[email protected]>
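
The last commit above derives the engine thread count from Ray's num-cpus setting and sets threads separately when Ray is not used. A minimal sketch of that kind of fallback logic follows. The function name `resolve_num_threads` and its parameter are hypothetical, and using `os.cpu_count()` as the non-Ray fallback is an assumption; the actual code in this PR may choose differently.

```python
import os
from typing import Optional


def resolve_num_threads(ray_num_cpus: Optional[float] = None) -> int:
    """Pick a thread count for the inference engine (hypothetical helper).

    ray_num_cpus: the CPU count assigned to the worker by Ray, or None when
    the engine runs without Ray. Fractional values are rounded down, with a
    floor of one thread.
    """
    if ray_num_cpus is not None:
        # With Ray: follow the CPU resources assigned to this worker.
        return max(1, int(ray_num_cpus))
    # Without Ray (assumption): fall back to the CPUs visible to the process.
    return os.cpu_count() or 1
```

For example, `resolve_num_threads(8)` yields 8 threads, while `resolve_num_threads()` uses whatever CPU count the host reports.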
