Hugging Face Text Generation Inference (TGI) supports the XPU backend starting from version 2.4.0
(see huggingface/tgi#2561). Previously, support was available through Intel Extension for PyTorch (IPEX); it has since been extended to cover direct integration of the XPU backend in PyTorch (available starting from PyTorch 2.4).
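To confirm that the installed PyTorch build actually exposes the XPU backend (an assumption for all of the steps below), a quick check along these lines can be used:
python -c "import torch; print(torch.xpu.is_available())"
This should print True on a system with a PyTorch 2.4+ build with XPU support and the Intel GPU driver stack installed.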
At the moment, TGI must be built and installed from source to run against the PyTorch XPU backend, since pre-built Docker images are not yet available for this configuration:
- Install Rust. On Linux, execute:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
- Clone TGI, then build and install it by reusing the "cpu" installation path:
git clone https://github.com/huggingface/text-generation-inference.git tgi && cd tgi
make install-cpu
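After the build completes, it is worth checking that the launcher ended up on PATH, for example:
text-generation-launcher --help
If the command is not found, make sure the Rust/cargo environment (typically $HOME/.cargo/env) is sourced in the current shell.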
Start the TGI server for one of the supported models as follows:
text-generation-launcher --model-id meta-llama/Llama-3.2-3B-Instruct --port 8080
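Note that meta-llama/Llama-3.2-3B-Instruct is a gated model on the Hugging Face Hub, so an access token usually needs to be available before launching; one common way (the token value below is a placeholder) is to export it in the environment:
export HF_TOKEN=<your_hf_access_token>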
On successful launch, output like the following will appear at the end of the console log:
...
2024-12-10T21:49:50.317182Z INFO text_generation_router::server: router/src/server.rs:2402: Connected
2024-12-10T21:50:03.478010Z INFO chat_completions{total_time="735.200735ms" validation_time="955.594µs" queue_time="73.176µs" inference_time="734.172108ms" time_per_token="36.708605ms" seed="Some(13465365087046561476)"}: text_generation_router::server: router/src/server.rs:622: Success
Verify the connection to the TGI server (a couple of additional quick checks are shown after these examples):
- Using the chat completions API:
curl localhost:8080/v1/chat/completions \
-X POST \
-d '{
"model": "tgi",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is deep learning?"
}
],
"stream": true,
"max_tokens": 20
}' \
-H 'Content-Type: application/json'
- Using the generate API:
curl 127.0.0.1:8080/generate \
-X POST \
-d '{
"inputs":"What is Deep Learning?",
"parameters":{
"max_new_tokens":20
}
}' \
-H 'Content-Type: application/json'
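Beyond the two calls above, the server also exposes lightweight routes that can serve as quick checks; for example (route names as implemented in current TGI releases, worth double-checking against your version):
curl localhost:8080/health
curl localhost:8080/info
The health route should return an empty 200 response, and the info route a JSON document describing the loaded model.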
The TGI server will log the connections and some associated stats. GPU load should be visible in a monitoring tool running in parallel on the same system (see the example below), and the output from the generate route should contain the inference result.
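For example, one way to watch GPU utilization on an Intel GPU system, assuming the respective tools are installed, is:
xpu-smi stats -d 0
or, alternatively, intel_gpu_top from the intel-gpu-tools package.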
The limitations related to missing attention kernels still hold. Models with a list of values as eos_token_id (such as Llama 3) are now supported; issue huggingface/tgi#2440 has been resolved.
The following limitations can be noted for the XPU backend:
- Models requiring attention are not supported due to the lack of TGI custom attention kernels for the XPU PyTorch backend. Some models might underperform for the same reason, since execution falls back to the non-attention path. Check server/text_generation_server/models/__init__.py for details.
- Models which have a list of values as eos_token_id (such as Llama 3) were not supported on the TGI no-attention path (see huggingface/tgi#2440); as noted above, this has since been resolved.