FastChat is an open platform for training, serving, and evaluating large language model based chatbots. You can find detailed information on its homepage.
IPEX-LLM can be easily integrated into FastChat so that users can use IPEX-LLM as a serving backend in their deployments.
You may install ipex-llm with FastChat as follows:
pip install --pre --upgrade ipex-llm[serving]
# Or
pip install --pre --upgrade ipex-llm[all]
To add GPU support for FastChat, you may install ipex-llm as follows:
pip install --pre --upgrade ipex-llm[xpu,serving] -f https://developer.intel.com/ipex-whl-stable-xpu
You first need to run the FastChat controller:
python3 -m fastchat.serve.controller
Using IPEX-LLM in FastChat does not impose any new limitations on model usage; therefore, all Hugging Face Transformers models can be used in FastChat.
Warning: This method has been deprecated; please use the IPEX-LLM worker instead.
FastChat determines the Model adapter to use through path matching. Therefore, in order to load models using IPEX-LLM, you need to make some modifications to the model's name.
For instance, assume you have downloaded llama-7b-hf from Hugging Face. To use IPEX-LLM as the backend, you need to rename llama-7b-hf to ipex-7b. The key point is that the model's path should include "ipex" and should not include paths matched by other model adapters. We will then use ipex-7b as the model path.
Note: This is caused by the priority of the name-matching list. The newly added IPEX-LLM adapter is at the tail of the name-matching list, so it has the lowest priority. If the model path contains other keywords like vicuna, which match an adapter with higher priority, the IPEX-LLM adapter will not be used.
A special case is ChatGLM models. For these models, you do not need to make any changes after downloading the model; the IPEX-LLM backend will be used automatically.
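The path-matching behavior described above can be sketched as follows. This is a simplified illustration only, not FastChat's actual implementation; the adapter names and keyword list here are assumptions for the example:

```python
# Simplified sketch of FastChat-style adapter selection by path matching.
# Adapter names and keywords are illustrative, not FastChat's real registry.
ADAPTERS = [
    ("vicuna", "VicunaAdapter"),    # higher-priority adapters come first
    ("chatglm", "ChatGLMAdapter"),
    ("ipex", "IpexLLMAdapter"),     # IPEX-LLM adapter sits at the tail (lowest priority)
]

def select_adapter(model_path: str) -> str:
    """Return the first adapter whose keyword appears in the model path."""
    path = model_path.lower()
    for keyword, adapter in ADAPTERS:
        if keyword in path:
            return adapter
    return "BaseAdapter"  # fallback when no keyword matches

print(select_adapter("PATH/TO/ipex-7b"))         # IPEX-LLM adapter is picked
print(select_adapter("PATH/TO/vicuna-ipex-7b"))  # "vicuna" matches first, so IPEX-LLM loses
```

This is why a path containing both vicuna and ipex will not use the IPEX-LLM backend: the earlier keyword wins.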
Then we can run the model workers:
# On CPU
python3 -m ipex_llm.serving.fastchat.model_worker --model-path PATH/TO/ipex-7b --device cpu
# On GPU
python3 -m ipex_llm.serving.fastchat.model_worker --model-path PATH/TO/ipex-7b --device xpu
If the model loads successfully with the IPEX-LLM backend, you will see output in the log like this:
INFO - Converting the current model to sym_int4 format......
Note: We currently only support int4 quantization for this method.
To integrate IPEX-LLM with FastChat efficiently, we have provided a new model worker implementation named ipex_llm_worker.py.
To run the ipex_llm_worker on CPU, use the following commands:
source ipex-llm-init -t
# Available low_bit format including sym_int4, sym_int8, bf16 etc.
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "sym_int4" --trust-remote-code --device "cpu"
For GPU example:
# Available low_bit format including sym_int4, sym_int8, fp16 etc.
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "sym_int4" --trust-remote-code --device "xpu"
For a full list of accepted arguments, you can refer to the main method of ipex_llm_worker.py.
We also provide the vllm_worker, which uses the vLLM engine for better hardware utilization. To run the vllm_worker, you do not need to change the model name; simply use the following command:
# On CPU
python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device cpu
# On GPU
python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device xpu
To serve using the web UI, launch the Gradio web server:
python3 -m fastchat.serve.gradio_web_server
This is the user interface that users will interact with.
By following these steps, you will be able to serve your models using the web UI with IPEX-LLM as the backend. You can open your browser and chat with a model now.
To start an OpenAI API server that provides compatible APIs using the IPEX-LLM backend, you can launch the openai_api_server and follow this doc to use it.
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
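Once the server is running, you can query it with any OpenAI-compatible client. Below is a minimal standard-library sketch; the model name lmsys/vicuna-7b-v1.5 is assumed to match the model loaded by the worker above, so adjust it to your deployment:

```python
import json
import urllib.request

# Build a standard OpenAI chat-completions request for the local server.
payload = {
    "model": "lmsys/vicuna-7b-v1.5",  # must match the model served by your worker
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "temperature": 0.7,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the openai_api_server above is running:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```

Because the API is OpenAI-compatible, the official openai Python client can also be pointed at http://localhost:8000/v1 instead of the hand-rolled request above.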