IPEX-LLM

ipex-llm is a library for running LLMs (large language models) on Intel XPU (from laptop to GPU to cloud) using INT4 with very low latency [1] (for any PyTorch model).

It is built on top of the excellent work of llama.cpp, gptq, ggml, llama-cpp-python, bitsandbytes, qlora, gptq_for_llama, chatglm.cpp, redpajama.cpp, gptneox.cpp, bloomz.cpp, etc.

Demos

See the optimized performance of chatglm2-6b and llama-2-13b-chat models on 12th Gen Intel Core CPU and Intel Arc GPU below.

[Demo animations: chatglm2-6b and llama-2-13b-chat on 12th Gen Intel Core CPU; chatglm2-6b and llama-2-13b-chat on Intel Arc GPU]

Verified models

Over 20 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, ChatGLM/ChatGLM2, Mistral, Falcon, MPT, Dolly, StarCoder, Whisper, Baichuan, InternLM, Qwen, Aquila, MOSS, and more; see the complete list below.

Model CPU Example GPU Example
LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.) link1, link2 link
LLaMA 2 link1, link2 link1, link2-low GPU memory example
ChatGLM link
ChatGLM2 link link
ChatGLM3 link link
Mistral link link
Falcon link link
MPT link link
Dolly-v1 link link
Dolly-v2 link link
Replit Code link link
RedPajama link1, link2
Phoenix link1, link2
StarCoder link1, link2 link
Baichuan link link
Baichuan2 link link
InternLM link link
Qwen link link
Qwen1.5 link link
Qwen-VL link link
Aquila link link
Aquila2 link link
MOSS link
Whisper link link
Phi-1_5 link link
Flan-t5 link link
LLaVA link link
CodeLlama link link
Skywork link
InternLM-XComposer link
WizardCoder-Python link
CodeShell link
Fuyu link
Distil-Whisper link link
Yi link link
BlueLM link link
Mamba link link
SOLAR link link
Phixtral link link
InternLM2 link link
RWKV4 link
RWKV5 link
Bark link link
SpeechT5 link
DeepSeek-MoE link
Ziya-Coding-34B-v1.0 link
Phi-2 link link
Yuan2 link link
DeciLM-7B link link
Deepseek link link

Working with ipex-llm

Table of Contents

Install

CPU

You may install ipex-llm on Intel CPU as follows:

pip install --pre --upgrade ipex-llm[all]

Note: ipex-llm has been tested on Python 3.9

GPU

You may install ipex-llm on Intel GPU as follows:

# the command below installs intel_extension_for_pytorch==2.0.110+xpu by default
# you can install a specific ipex/torch version if needed
pip install --pre --upgrade ipex-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu

Note: ipex-llm has been tested on Python 3.9
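
After installing, you may want to confirm that the environment matches what the notes above describe. The following is a minimal sketch that uses only the Python standard library; the helper name `check_environment` is hypothetical, and it assumes the importable module names are `ipex_llm` and `intel_extension_for_pytorch`, as they appear in the import statements later in this README.

```python
import importlib.util
import sys


def check_environment():
    """Report the Python version and whether the relevant modules can be found."""

    def found(module_name):
        # find_spec locates a top-level module without importing it
        return importlib.util.find_spec(module_name) is not None

    return {
        "python": "%d.%d" % sys.version_info[:2],
        # this README states that ipex-llm has been tested on Python 3.9
        "tested_python": sys.version_info[:2] == (3, 9),
        "ipex_llm": found("ipex_llm"),
        # only needed for the GPU (xpu) install
        "intel_extension_for_pytorch": found("intel_extension_for_pytorch"),
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A `False` entry for `intel_extension_for_pytorch` is expected on a CPU-only install.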

Run Model

You may run the models using ipex-llm through one of the following APIs:

  1. Hugging Face transformers API
  2. Native INT4 Model
  3. LangChain API
  4. CLI (command line interface) Tool

1. Hugging Face transformers API

You may run any Hugging Face Transformers model as follows:

CPU INT4

You may apply INT4 optimizations to any Hugging Face Transformers model on Intel CPU as follows.

# load a Hugging Face Transformers model with INT4 optimizations
from ipex_llm.transformers import AutoModelForCausalLM
model_path = '/path/to/model/'
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)

# run the optimized model on Intel CPU (input_str is the prompt text)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...)
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)

See the complete examples here.
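
The CPU snippet above leaves several arguments elided. One possible way to fill them in is sketched below. `run_int4` is a hypothetical helper name, and `return_tensors="pt"`, `max_new_tokens` and `skip_special_tokens=True` are standard Hugging Face Transformers arguments chosen here for illustration, not values prescribed by this README. The imports are deferred into the function so that it can be defined even where ipex-llm is not installed.

```python
def run_int4(model_path, prompt, max_new_tokens=32):
    """Load a model with INT4 optimizations on CPU and complete `prompt`."""
    # Deferred imports: defining this function does not require ipex-llm
    from ipex_llm.transformers import AutoModelForCausalLM
    from transformers import AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # return_tensors="pt" asks the tokenizer for a PyTorch tensor of token ids
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)

    # batch_decode returns one string per sequence; there is only one here
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

A call would look like `run_int4('/path/to/model/', 'Once upon a time,')`; the returned string contains the prompt followed by the generated continuation.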

GPU INT4

You may apply INT4 optimizations to any Hugging Face Transformers model on Intel GPU as follows.

# load a Hugging Face Transformers model with INT4 optimizations
from ipex_llm.transformers import AutoModelForCausalLM
# imported for its side effect: it makes the 'xpu' device available to PyTorch
import intel_extension_for_pytorch
model_path = '/path/to/model/'
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)

# run the optimized model on Intel GPU
model = model.to('xpu')

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
# the input tensor must be on the same device as the model (input_str is the prompt text)
input_ids = tokenizer.encode(input_str, ...).to('xpu')
output_ids = model.generate(input_ids, ...)
# move the result back to the CPU before decoding
output = tokenizer.batch_decode(output_ids.cpu())

See the complete examples here.
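
The CPU and GPU snippets differ only in where the model and the input tensor are placed. If you want one script for both, a small device-selection helper is one option. The sketch below is an assumption-laden illustration: `pick_device` is a hypothetical name, and it falls back to `'cpu'` whenever PyTorch or intel_extension_for_pytorch cannot be imported or no XPU device is reported.

```python
def pick_device():
    """Return 'xpu' when an Intel GPU is usable from PyTorch, else 'cpu'."""
    try:
        import torch
        # imported for its side effect of enabling the 'xpu' device
        import intel_extension_for_pytorch  # noqa: F401
    except (ImportError, OSError):
        # either package is missing or failed to load its native libraries
        return "cpu"

    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"
    return "cpu"


if __name__ == "__main__":
    print("selected device:", pick_device())
```

With this helper, `model.to(pick_device())` and `input_ids.to(pick_device())` would replace the hard-coded `'xpu'` strings in the snippet above.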

More Low-Bit Support
  • Save and load

    After the model is optimized using ipex-llm, you may save and load the model as follows:

    model.save_low_bit(model_path)
    new_model = AutoModelForCausalLM.load_low_bit(model_path)

    See the complete example here.

  • Additional data types

    In addition to INT4, you may apply other low-bit optimizations (such as INT8, INT5, NF4, etc.) as follows:

    model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int8")

    See the complete example here.
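
To give a feel for what choosing a bit width trades off, the following sketch quantizes a handful of weights symmetrically to 4 and 8 bits in plain Python. It is an illustration only and not ipex-llm's implementation: it assumes that the `sym_` prefix in `"sym_int8"` refers to symmetric quantization (which this README does not spell out), it uses a single scale for all values where real low-bit formats typically keep one scale per small block of weights, and all function names are hypothetical.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits, 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


def max_error(values, bits):
    """Largest absolute difference between original and round-tripped values."""
    quantized, scale = quantize_symmetric(values, bits)
    restored = dequantize(quantized, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


weights = [0.12, -0.53, 0.97, -0.08, 0.33, -0.71]
for bits in (4, 8):
    integers, _ = quantize_symmetric(weights, bits)
    print(f"{bits}-bit: {integers}, max error {max_error(weights, bits):.4f}")
```

Fewer bits mean a smaller model and less memory traffic, at the cost of a larger rounding error per weight.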

2. Native INT4 model

You may also convert Hugging Face Transformers models into native INT4 model format for maximum performance as follows.

Note: Currently only the llama/bloom/gptneox/starcoder/chatglm model families are supported; for other models, you may use the Hugging Face transformers model format as described above.

#convert the model
from ipex_llm import llm_convert
ipex_llm_path = llm_convert(model='/path/to/model/',
        outfile='/path/to/output/', outtype='int4', model_family="llama")

#load the converted model
#switch to ChatGLMForCausalLM/GptneoxForCausalLM/BloomForCausalLM/StarcoderForCausalLM to load other models
from ipex_llm.transformers import LlamaForCausalLM
llm = LlamaForCausalLM.from_pretrained("/path/to/output/model.bin", native=True, ...)
  
#run the converted model
input_ids = llm.tokenize(prompt)
output_ids = llm.generate(input_ids, ...)
output = llm.batch_decode(output_ids)

See the complete example here.
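
The comment in the snippet above says to switch the loader class by model family. One way to make that switch explicit is a lookup table, sketched below. The class names and the `native=True` argument are taken from the snippet above; the helper names `native_class_name` and `load_native` are hypothetical, and the import is deferred so that the lookup itself works without ipex-llm installed.

```python
import importlib

# model family -> loader class name, as listed in the snippet above
NATIVE_CLASS_BY_FAMILY = {
    "llama": "LlamaForCausalLM",
    "chatglm": "ChatGLMForCausalLM",
    "gptneox": "GptneoxForCausalLM",
    "bloom": "BloomForCausalLM",
    "starcoder": "StarcoderForCausalLM",
}


def native_class_name(model_family):
    """Return the loader class name for a supported native INT4 model family."""
    try:
        return NATIVE_CLASS_BY_FAMILY[model_family.lower()]
    except KeyError:
        supported = ", ".join(sorted(NATIVE_CLASS_BY_FAMILY))
        raise ValueError(
            f"unsupported model family {model_family!r}; expected one of: {supported}"
        ) from None


def load_native(model_family, model_file, **kwargs):
    """Load a converted native INT4 model file with the matching class."""
    # deferred import: only needed when a model is actually loaded
    module = importlib.import_module("ipex_llm.transformers")
    loader = getattr(module, native_class_name(model_family))
    return loader.from_pretrained(model_file, native=True, **kwargs)
```

For example, `load_native("llama", "/path/to/output/model.bin")` would correspond to the `LlamaForCausalLM.from_pretrained(...)` call above, while an unsupported family fails early with a readable error.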

3. LangChain API

You may run the models using the LangChain API in ipex-llm.

  • Using Hugging Face transformers model

    You may run any Hugging Face Transformers model (with INT4 optimizations applied) using the LangChain API as follows:

    from ipex_llm.langchain.llms import TransformersLLM
    from ipex_llm.langchain.embeddings import TransformersEmbeddings
    from langchain.chains.question_answering import load_qa_chain
    
    embeddings = TransformersEmbeddings.from_model_id(model_id=model_path)
    ipex_llm = TransformersLLM.from_model_id(model_id=model_path, ...)
    
    doc_chain = load_qa_chain(ipex_llm, ...)
    output = doc_chain.run(...)

    See the examples here.

  • Using native INT4 model

    You may also convert Hugging Face Transformers models into native INT4 format, and then run the converted models using the LangChain API as follows.

    Note: Currently only the llama/bloom/gptneox/starcoder/chatglm model families are supported; for other models, you may use the Hugging Face transformers model format as described above.

    from ipex_llm.langchain.llms import LlamaLLM
    from ipex_llm.langchain.embeddings import LlamaEmbeddings
    from langchain.chains.question_answering import load_qa_chain
    
    #switch to ChatGLMEmbeddings/GptneoxEmbeddings/BloomEmbeddings/StarcoderEmbeddings to load other models
    embeddings = LlamaEmbeddings(model_path='/path/to/converted/model.bin')
    #switch to ChatGLMLLM/GptneoxLLM/BloomLLM/StarcoderLLM to load other models
    ipex_llm = LlamaLLM(model_path='/path/to/converted/model.bin')
    
    doc_chain = load_qa_chain(ipex_llm, ...)
    output = doc_chain.run(...)

    See the examples here.

4. CLI Tool

Note: Currently the ipex-llm CLI supports the LLaMA (e.g., vicuna), GPT-NeoX (e.g., redpajama), BLOOM (e.g., phoenix) and GPT2 (e.g., starcoder) model architectures; for other models, you may use the Hugging Face transformers or LangChain APIs.

  • Convert model

    You may convert the downloaded model into native INT4 format using llm-convert.

    #convert PyTorch (fp16 or fp32) model
    #llama/bloom/gptneox/starcoder model family is currently supported
    llm-convert "/path/to/model/" --model-format pth --model-family "bloom" --outfile "/path/to/output/"
    
    #convert GPTQ-4bit model
    #only llama model family is currently supported
    llm-convert "/path/to/model/" --model-format gptq --model-family "llama" --outfile "/path/to/output/"

  • Run model

    You may run the converted model using llm-cli or llm-chat (built on top of main.cpp in llama.cpp):

    #help
    #llama/bloom/gptneox/starcoder model family is currently supported
    llm-cli -x gptneox -h
    
    #text completion
    #llama/bloom/gptneox/starcoder model family is currently supported
    llm-cli -t 16 -x gptneox -m "/path/to/output/model.bin" -p 'Once upon a time,'
    
    #chat mode
    #llama/gptneox model family is currently supported
    llm-chat -m "/path/to/output/model.bin" -x llama
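
If you drive the CLI tools from a Python script, building the argument list programmatically avoids shell-quoting mistakes in prompts and paths. The sketch below only assembles the commands shown above; it uses no flags beyond those that appear in this README (`--model-format`, `--model-family`, `--outfile`, `-t`, `-x`, `-m`, `-p`), and the function names are hypothetical.

```python
import shlex


def convert_command(model_dir, model_family, outfile, model_format="pth"):
    """Argument list for llm-convert, mirroring the conversion examples above."""
    return [
        "llm-convert", model_dir,
        "--model-format", model_format,
        "--model-family", model_family,
        "--outfile", outfile,
    ]


def completion_command(model_file, model_family, prompt, threads=16):
    """Argument list for llm-cli text completion, mirroring the example above."""
    return [
        "llm-cli",
        "-t", str(threads),
        "-x", model_family,
        "-m", model_file,
        "-p", prompt,
    ]


cmd = completion_command("/path/to/output/model.bin", "gptneox", "Once upon a time,")
# shlex.join renders the list as a single, correctly quoted shell command
print(shlex.join(cmd))
```

To actually run a command, pass the list to `subprocess.run(cmd, check=True)`; because the arguments are passed as a list, no shell is involved and the prompt needs no manual quoting.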

ipex-llm API Doc

See the initial ipex-llm API Doc here.

Footnotes

  1. Performance varies by use, configuration and other factors. ipex-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.