Support qwen2 model, optimize phi3 model, revise model loading strategy #46

guoqingbao · 2024-07-04T09:59:34Z

Key changes for #44:

- Added the Qwen2 model
- Optimized Phi3 performance
- Removed a redundant transpose in the decoding stage of paged attention
- Changed the model loading strategy: for a given model type, you specify either a local weight path or a concrete model id. This allows loading different models that share the same architecture, for example llama3 under the model type "llama"
- Updated the README to reflect the newly supported models and their usage
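
The loading strategy above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `WeightSource` and `resolve_weights` are hypothetical names, and letting the local weight path take precedence when both options are given is an assumption.

```rust
use std::path::PathBuf;

/// Where the weights for a model type come from.
/// Hypothetical type, used only for this illustration.
#[derive(Debug, PartialEq)]
enum WeightSource {
    /// Weights already on disk (the `--weight-path` case).
    LocalPath(PathBuf),
    /// A concrete model id to fetch (the `--model-id` case).
    ModelId(String),
}

/// Pick a weight source for `model_type`.
/// Assumption: a local path wins over a model id when both are supplied,
/// and supplying neither is treated as an error.
fn resolve_weights(
    model_type: &str,
    weight_path: Option<&str>,
    model_id: Option<&str>,
) -> Result<WeightSource, String> {
    match (weight_path, model_id) {
        (Some(path), _) => Ok(WeightSource::LocalPath(PathBuf::from(path))),
        (None, Some(id)) => Ok(WeightSource::ModelId(id.to_string())),
        (None, None) => Err(format!(
            "no weight path or model id given for model type `{model_type}`"
        )),
    }
}
```

Because the model type only selects the architecture, the same `"llama"` type could be paired with any compatible weight path or model id in this sketch.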

Tested cases:

cargo run --release -- --port 2000 --weight-path /home/qwen2-1.8b/ qwen2 --repeat-last-n 64

or

cargo run --release -- --port 2000 --model-id Qwen/Qwen1.5-1.8B-Chat qwen2 --repeat-last-n 64

Achieved around 150 tokens/s for Qwen2 1.8B on an A100 (mixed precision, FP32 and BF16).

EricLBuehler · 2024-07-04T10:06:54Z

Thank you for adding this, it is great. There are a bunch of unused items warnings when you build this, can you please check that?

guoqingbao · 2024-07-04T10:28:14Z

> Thank you for adding this, it is great. There are a bunch of unused items warnings when you build this, can you please check that?

Fixed

EricLBuehler

Thank you!

guoqingbao added 3 commits July 4, 2024 17:33

Support qwen2 model, optimize phi3 model, revise model loading strategy

659b6f7

Default 1.8B model for qwen2

e19bf33

Typo fix

52b6b12

This was referenced Jul 4, 2024

Support chat serving for more models #44

Open

Support using arbitrary derivative models #34

Closed

guoqingbao added 2 commits July 4, 2024 18:29

Remove unused

7cffa66

Support rope scaling for phi3 models (Phi3 128k)

d763d4c

EricLBuehler approved these changes Jul 5, 2024


EricLBuehler merged commit 211346e into EricLBuehler:master Jul 5, 2024
5 checks passed

EricLBuehler mentioned this pull request Jul 5, 2024

LongRope support for Phi 3 #47

Closed
