
Unified pipeline for models & support phi3 model #45

Merged
merged 24 commits into from
Jul 3, 2024

Conversation

guoqingbao
Collaborator

This PR provides a unified pipeline for models and adds the phi3 model under it. The major changes are:

  1. The LLaMaPipeline is removed and replaced with DefaultPipeline (and default loader), which can serve different models such as LLaMa and phi3.
  2. The phi3 model is added and tested, achieving around 100 tokens/second generation speed on A100 (BF16).
  3. Configuration for models is simplified.
  4. The padding strategy is optimized.
  5. The README is revised to reflect the recent changes.

More models are expected to be added using the DefaultPipeline.
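As a rough illustration of what a unified pipeline enables, the sketch below shows one common pattern: a single loader that dispatches on the architecture string from a model's `config.json`. The names `DefaultLoader` and `ModelKind` here are hypothetical, not the PR's actual API; the architecture strings follow the usual Hugging Face convention.

```rust
/// Hypothetical sketch: one loader serving several model families,
/// selected by the `architectures` entry in config.json.
#[derive(Debug, PartialEq)]
enum ModelKind {
    Llama,
    Phi3,
}

struct DefaultLoader;

impl DefaultLoader {
    /// Map an architecture string to a supported model variant.
    fn select(architecture: &str) -> Option<ModelKind> {
        match architecture {
            "LlamaForCausalLM" => Some(ModelKind::Llama),
            "Phi3ForCausalLM" => Some(ModelKind::Phi3),
            _ => None, // unsupported architecture
        }
    }
}
```

Adding a new model under this scheme means adding one enum variant and one match arm, rather than a whole new pipeline type.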

Mixed precision is used for the phi3 model because I found that rope and rmsnorm require at least FP32 accuracy for long-sequence generation (e.g., prompts over 2,000 tokens).
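To make the mixed-precision point concrete, here is a minimal standalone sketch of rmsnorm with the reduction carried out in full precision. In the real model the activations would be BF16 and get upcast before this computation; this toy version just works on `f32` slices and is not the PR's implementation.

```rust
/// Minimal rmsnorm sketch: normalize x by its root-mean-square, then
/// apply a learned per-element weight. The sum of squares is the step
/// that suffers most from low precision on long sequences.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    // Accumulate the mean of squares in full precision.
    let mean_sq: f32 = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    x.iter()
        .zip(weight)
        .map(|(v, w)| v * scale * w)
        .collect()
}
```

With BF16's ~8 bits of mantissa, accumulating thousands of squared terms loses significant digits, which is why upcasting this reduction (and the rope angle computation) helps at long context lengths.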

Command line to run the Phi3 3.8B chat service:

cargo run --release -- --port 2000 --weight-path /home/phi3-3.8b/ phi3 --repeat-last-n 64

Please ignore the previous commit messages :)

Mention #44

@guoqingbao guoqingbao changed the title Unfied pipeline for models & support phi3 model Unified pipeline for models & support phi3 model Jul 2, 2024
Owner

@EricLBuehler EricLBuehler left a comment


Thank you!

@EricLBuehler EricLBuehler merged commit 743a8b2 into EricLBuehler:master Jul 3, 2024
5 checks passed