
Support chat serving for more models #44

Open
guoqingbao opened this issue Jul 2, 2024 · 7 comments

@guoqingbao (Collaborator)

Opening this issue to track the progress of model support in candle-vllm.

@guoqingbao (Collaborator, Author)

The Phi3 model is added in PR #45.

Command line to run the Phi3 3.8B chat service:

cargo run --release -- --port 2000 --weight-path /home/phi3-3.8b/ phi3 --repeat-last-n 64

It uses mixed precision (F32 for rope/rmsnorm and BF16 for the rest) for long-sequence generation (e.g., prompts over 2k tokens). Tested speed on an A100: 99 tokens/s for decoding.

You may run Phi3 7B with a different weight-path, since the pipeline loads models according to the corresponding config.json (I haven't tested Phi3 7B, but it should work in theory).
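
Once the server is running, the chat service can be exercised with any OpenAI-compatible client. Below is a minimal sketch of a request against the /v1/chat/completions route on the chosen port; the model name and max_tokens value are placeholders, and the exact set of request fields accepted may differ.

curl http://localhost:2000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "phi3",
        "messages": [{"role": "user", "content": "Summarize paged attention in one sentence."}],
        "max_tokens": 128
      }'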

@guoqingbao (Collaborator, Author)

The Qwen2 model is added in PR #46.

Command line to run the Qwen2 1.8B chat service:

cargo run --release -- --port 2000 --weight-path /home/qwen2-1.8b/ qwen2 --repeat-last-n 64

or

cargo run --release -- --port 2000 --model-id Qwen/Qwen1.5-1.8B-Chat qwen2 --repeat-last-n 64

Tested speed on an A100: ~150 tokens/s for decoding.
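
For interactive clients, the same endpoint can also be driven in streaming mode. The sketch below assumes the server honors the standard OpenAI stream field (not confirmed in this thread); curl's -N flag disables output buffering so tokens are printed as they arrive.

curl -N http://localhost:2000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2",
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "stream": true
      }'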

@guoqingbao (Collaborator, Author)

Mistral, Yi, and StableLM are supported in #53 and #57.

Example run commands:

cargo run --release -- --port 2000 --weight-path /home/mistral_7b/ mistral --repeat-last-n 32 --penalty 1.1 --temperature 0.8
cargo run --release -- --port 2000 --weight-path /home/yi-6b/ yi --repeat-last-n 32
cargo run --release -- --port 2000 --weight-path /home/stablelm-zephyr-3b/ stable-lm --repeat-last-n 32
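
The --penalty and --temperature flags above set server-side defaults. Assuming the OpenAI-compatible request schema also accepts the usual sampling fields (an assumption, not confirmed in this thread), the same knobs can be supplied per request, for example:

curl http://localhost:2000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistral",
        "messages": [{"role": "user", "content": "List three benefits of KV-cache paging."}],
        "temperature": 0.8,
        "top_p": 0.95
      }'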

guoqingbao self-assigned this Jul 19, 2024
guoqingbao added the enhancement (New feature or request) label Jul 19, 2024
@guoqingbao (Collaborator, Author)

LLaMa3/LLaMa3.1 are supported in #67.

Tested case:

cargo run --release -- --port 2000 --weight-path /home/Meta-Llama-3.1-8B-Instruct/ llama3 --repeat-last-n 64

65 tokens/s on an A100 (BF16).

@guoqingbao (Collaborator, Author)

We have added support for quantized models; refer to #77.

@EricLBuehler (Owner)

@guoqingbao nice work with #77!

@guoqingbao (Collaborator, Author)

> @guoqingbao nice work with #77!

I'm planning to parallelize the model loading process, specifically for in-situ quantization. The current strategy of loading model weights layer by layer (borrowed from candle) is unnecessary and inefficient.
