Name		Name	Last commit message	Last commit date
parent directory ..
README.md		README.md
__init__.py		__init__.py
bigdl_llm_model.py		bigdl_llm_model.py
ipex_llm_worker.py		ipex_llm_worker.py
model_worker.py		model_worker.py
vllm_worker.py		vllm_worker.py

README.md

Serving using IPEX-LLM and FastChat

FastChat is an open platform for training, serving, and evaluating large language model based chatbots. You can find the detailed information at their homepage.

IPEX-LLM can be easily integrated into FastChat so that user can use IPEX-LLM as a serving backend in the deployment.

Table of contents

Install
Start the service

Install

You may install ipex-llm with FastChat as follows:

pip install --pre --upgrade ipex-llm[serving]

# Or
pip install --pre --upgrade ipex-llm[all]

To add GPU support for FastChat, you may install ipex-llm as follows:

pip install --pre --upgrade ipex-llm[xpu,serving] -f https://developer.intel.com/ipex-whl-stable-xpu

Start the service

Launch controller

You need first run the fastchat controller

python3 -m fastchat.serve.controller

Launch model worker(s) and load models

Using IPEX-LLM in FastChat does not impose any new limitations on model usage. Therefore, all Hugging Face Transformer models can be utilized in FastChat.

IPEX-LLM model worker (deprecated)

details

Warning: This method has been deprecated, please change to use IPEX-LLM worker instead.

FastChat determines the Model adapter to use through path matching. Therefore, in order to load models using IPEX-LLM, you need to make some modifications to the model's name.

For instance, assuming you have downloaded the llama-7b-hf from HuggingFace. Then, to use the IPEX-LLM as backend, you need to change name from llama-7b-hf to ipex-7b.The key point here is that the model's path should include "ipex" and should not include paths matched by other model adapters.

Then we will use ipex-7b as model-path.

note: This is caused by the priority of name matching list. The new added IPEX-LLM adapter is at the tail of the name-matching list so that it has the lowest priority. If model path contains other keywords like vicuna which matches to another adapter with higher priority, then the IPEX-LLM adapter will not work.

A special case is ChatGLM models. For these models, you do not need to do any changes after downloading the model and the IPEX-LLM backend will be used automatically.

Then we can run model workers

# On CPU
python3 -m ipex_llm.serving.fastchat.model_worker --model-path PATH/TO/ipex-7b --device cpu

# On GPU
python3 -m ipex_llm.serving.fastchat.model_worker --model-path PATH/TO/ipex-7b --device xpu

If you run successfully using IPEX backend, you can see the output in log like this:

INFO - Converting the current model to sym_int4 format......

note: We currently only support int4 quantization for this method.

IPEX-LLM worker

To integrate IPEX-LLM with FastChat efficiently, we have provided a new model_worker implementation named ipex_llm_worker.py.

To run the ipex_llm_worker on CPU, using the following code:

source ipex-llm-init -t

# Available low_bit format including sym_int4, sym_int8, bf16 etc.
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "sym_int4" --trust-remote-code --device "cpu"

For GPU example:

# Available low_bit format including sym_int4, sym_int8, fp16 etc.
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "sym_int4" --trust-remote-code --device "xpu"

For a full list of accepted arguments, you can refer to the main method of the ipex_llm_worker.py

IPEX-LLM vLLM worker

We also provide the vllm_worker which uses the vLLM engine for better hardware utilization.

To run using the vLLM_worker, we don't need to change model name, just simply uses the following command:

# On CPU
python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device cpu

# On GPU
python3 -m ipex_llm.serving.fastchat.vllm_worker --model-path REPO_ID_OR_YOUR_MODEL_PATH --device xpu

Launch Gradio web server

python3 -m fastchat.serve.gradio_web_server

This is the user interface that users will interact with.

By following these steps, you will be able to serve your models using the web UI with IPEX-LLM as the backend. You can open your browser and chat with a model now.

Launch RESTful API server

To start an OpenAI API server that provides compatible APIs using IPEX-LLM backend, you can launch the openai_api_server and follow this doc to use it.

python3 -m fastchat.serve.openai_api_server --host localhost --port 8000

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fastchat

fastchat

README.md

Serving using IPEX-LLM and FastChat

Install

Start the service

Launch controller

Launch model worker(s) and load models

IPEX-LLM model worker (deprecated)

IPEX-LLM worker

IPEX-LLM vLLM worker

Launch Gradio web server

Launch RESTful API server

Files

fastchat

Directory actions

More options

Directory actions

More options

Latest commit

History

fastchat

Folders and files

parent directory

README.md

Serving using IPEX-LLM and FastChat

Install

Start the service

Launch controller

Launch model worker(s) and load models

IPEX-LLM model worker (deprecated)

IPEX-LLM worker

IPEX-LLM vLLM worker

Launch Gradio web server

Launch RESTful API server