This repo demonstrates an LLM optimization method based on custom ops for OpenVINO. To run LLM inference efficiently, it introduces a new op called MHA (multi-head attention) and reconstructs the LLM graph around this new op.
The environment and benchmarks can be built inside Docker (section 1) or on a bare-metal Linux/Windows system (section 2).
Build a Docker image with all dependencies, the custom OpenVINO build, and the custom ops installed:
docker build -t openvino-llm .
With this built Docker image, you can then generate the optimized OpenVINO IR (optionally with weight compression) as follows. Note that we assume you have already downloaded a model from the Hugging Face Hub into a local cache on the host (e.g. using huggingface-cli download) and that you volume-mount it into the Docker container using the Docker -v argument as illustrated below.
mkdir -p $HOME/models
docker run --rm -v $HOME/.cache/huggingface:/cache/huggingface -v $HOME/models:/models -it openvino-llm \
python3 models/llama.py \
--quant_type nncf_w8 \
--org_model_path /cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/08751db2aca9bf2f7f80d2e516117a53d7450235 \
--ov_model_path /models/llama-2-7b-chat-ov
This will put the optimized OpenVINO IR (and associated tokenizer) into ~/models/llama-2-7b-chat-ov
on the host.
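If the model is not yet in the local Hugging Face cache, it can be pre-fetched on the host before launching the container, either with `huggingface-cli download` or from Python. A minimal sketch, assuming the `huggingface_hub` package is installed on the host (gated models such as Llama-2 also require logging in to the Hub first):

```python
from huggingface_hub import snapshot_download

# Downloads into ~/.cache/huggingface/hub by default, which is the
# directory volume-mounted into the container above.
local_path = snapshot_download(repo_id="meta-llama/Llama-2-7b-chat-hf")
print(local_path)  # pass a snapshot path like this to --org_model_path
```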
Now we're ready to run the benchmarks using this model, again mounting that directory into the launched Docker container. The following command runs the benchmark on the compressed model for 3 iterations with BF16 precision.
docker run --privileged --rm -v $HOME/models:/models -v $(pwd):/results -it openvino-llm \
python3 llm_pipeline.py -m /models/llama-2-7b-chat-ov/nncf_w8 --bf16 -r 3 --greedy -p "What is OpenVINO?" --output-results /results/results.csv
A sample output is below:
[setupvars.sh] OpenVINO environment initialized
Using pad_token, but it is not set yet.
Init OpenVINO model ...
VNode_14200 created executor: llm::experimental::MultiHeadAttention,LLMDNN,BF16
Start test ...
round 0:
[1, 689+15] 4685.5ms = 3773.2ms + 245.3ms + (47.6ms x 14) + 1.2ms
0. [' Hello! I am an AI assistant, How can I help you?']
...
Since we applied --output-results
above, you will find the results in the results.csv
file.
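The exact column layout of results.csv is defined by llm_pipeline.py; a minimal sketch for inspecting it on the host with the Python standard library:

```python
import csv

# Print the header and the rows collected via --output-results.
with open("results.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)
```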
See section 2 below for more examples of model generation and benchmark options.
Refer to build_linux for more details. Set the install directory for OpenVINO, and make sure the GCC version is at least 11.2.
git clone https://github.com/usstq/openvino.git -b vnode-lc
cd openvino && git submodule update --init --recursive
python3 -m pip install -U pip
python3 -m pip install -r ./src/bindings/python/src/compatibility/openvino/requirements-dev.txt
python3 -m pip install -r ./src/bindings/python/wheel/requirements-dev.txt
python3 -m pip install -r ./src/bindings/python/requirements.txt
mkdir build && cd build
cmake -DENABLE_INTEL_GPU=OFF -DENABLE_INTEL_GNA=OFF -DENABLE_PYTHON=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=<ov install dir> ..
# if you want to run the model on multiple NUMA nodes, use the following instead
# cmake -DENABLE_INTEL_GPU=OFF -DENABLE_PYTHON=ON -DCMAKE_BUILD_TYPE=Release -DTHREADING=OMP -DCMAKE_INSTALL_PREFIX=<ov install dir> ..
make --jobs=$(nproc --all)
make install
cd <ov install dir>/tools/
python3 -m pip install openvino*.whl
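As a quick sanity check that the freshly built wheel is the one on your Python path, a minimal sketch (it only imports the runtime and queries the CPU plugin):

```python
from openvino.runtime import Core

core = Core()
print(core.get_versions("CPU"))   # build/version info of the installed runtime
print(core.available_devices)     # should include "CPU"
```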
Remember to enable the customized OpenVINO environment for this repo:
source <ov install dir>/setupvars.sh
cd custom_ops
mkdir build && cd build
cmake ..
make -j8
# custom_ops/build/libov-cpu-llm-experimental.so
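The shared library above is the custom-ops extension that lets the runtime load IRs containing the experimental MHA op. The pipeline scripts in this repo are expected to handle this, but if you read a generated IR in your own Python code, the extension can be registered explicitly; a minimal sketch, using the build output path shown above (the IR path is a placeholder):

```python
from openvino.runtime import Core

core = Core()
# Register the custom-ops extension so IRs with the experimental MHA op
# can be read and compiled.
core.add_extension("custom_ops/build/libov-cpu-llm-experimental.so")

# model = core.read_model("<path to generated IR>.xml")
# compiled = core.compile_model(model, "CPU")
```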
Refer to build_windows for more details. Set the install directory for OpenVINO, and make sure the MSVC version is at least Visual Studio 16 2019.
git clone https://github.com/usstq/openvino.git -b vnode-lc
cd openvino && git submodule update --init --recursive
python3 -m pip install -U pip
python3 -m pip install -r ./src/bindings/python/src/compatibility/openvino/requirements-dev.txt
python3 -m pip install -r ./src/bindings/python/wheel/requirements-dev.txt
python3 -m pip install -r ./src/bindings/python/requirements.txt
mkdir build && cd build
cmake -G "Visual Studio 16 2019" -DENABLE_INTEL_GPU=OFF -DENABLE_INTEL_GNA=OFF -DENABLE_PYTHON=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=<ov install dir> ..
# if you want to run the model on multiple NUMA nodes, use the following instead
# cmake -G "Visual Studio 16 2019" -DENABLE_INTEL_GPU=OFF -DENABLE_INTEL_GNA=OFF -DENABLE_PYTHON=ON -DCMAKE_BUILD_TYPE=Release -DTHREADING=OMP -DCMAKE_INSTALL_PREFIX=<ov install dir> ..
cmake --build . --config Release --verbose -j8
cmake --install .
cd <ov install dir>/tools/
python3 -m pip install openvino*.whl
Remember to enable the customized OpenVINO environment for this repo:
<ov install dir>/setupvars.bat
cd custom_ops
mkdir build && cd build
cmake -G "Visual Studio 16 2019" ..
cmake --build . --config Release --verbose -j8
# custom_ops\build\Release\ov-cpu-llm-experimental.dll
Install the Python environment:
pip3 install -r requirements.txt
pip3 install -e .
Convert the original model into OpenVINO FP32 IR:
python models/gptj.py
python models/gptneox.py
python models/falcon.py
python models/llama.py
python models/chatglm2.py
Convert the original model into OpenVINO IR with weight compression (e.g. INT8/INT4):
python models/gptj.py --quant_type=Q4_1 # valid types: F16/Q8_C/Q4_C/Q8_0/Q4_0/Q4_1/nncf_w8
python models/gptneox.py --quant_type=Q4_1
python models/falcon.py --quant_type=Q4_1
python models/llama.py --quant_type=Q4_1
python models/chatglm2.py --quant_type=Q4_1
# greedy search: f32/bf16
numactl -N 0 --membind=0 python llm_pipeline.py -m ./gen/gptj_6b/ -p "What's Oxygen?" -r 3 --greedy
numactl -N 0 --membind=0 python llm_pipeline.py -m ./gen/gptj_6b/ -p "What's Oxygen?" -r 3 --greedy --bf16
# beam search: f32/bf16
numactl -N 0 --membind=0 python llm_pipeline.py -m ./gen/gptj_6b/ -p "What's Oxygen?" -r 3
numactl -N 0 --membind=0 python llm_pipeline.py -m ./gen/gptj_6b/ -p "What's Oxygen?" -r 3 --bf16
# specific input token lengths (supports multiple lengths and multiple rounds)
numactl -N 0 --membind=0 python llm_pipeline.py -m ./gen/gptj_6b/ -pl 32 512 1024 2016 8192 -r 3 --bf16
# run on all NUMA nodes
python llm_pipeline.py -m ./gen/falcon_40b -bs 1 --bf16 -pl 8000
Inspired by the excellent llama.cpp project, we use the following quantization scheme (a numpy sketch of the block formats follows the table below):
- Weights are quantized offline
- Activations are quantized dynamically at runtime
| quant_type | description |
|---|---|
| `F16` | FP16 weight format |
| `Q8_C` | per-output-channel symmetric weight quantization |
| `Q4_C` | per-output-channel asymmetric weight quantization |
| `Q8_0`, `Q4_0` | llama.cpp-style per-32-weights symmetric weight quantization |
| `Q4_1` | llama.cpp-style per-32-weights asymmetric weight quantization |
| `nncf_w8` | per-output-channel asymmetric weight quantization from NNCF |
Note:
- Asymmetric quantization improves accuracy (PPL) at lower bit widths, so Q4_C uses asymmetric quantization (with an integer zero-point, which gives higher accuracy than a non-integer zero-point).
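The table only names the weight formats; the numpy sketch below illustrates the idea behind the llama.cpp-style per-32-weights formats (a symmetric Q8_0-like block and an asymmetric Q4_1-like block). It is an illustration of the scheme, not the kernels used by this repo:

```python
import numpy as np

BLOCK = 32  # group size along the input-channel axis

def quant_sym_q8(block: np.ndarray):
    """Q8_0-style: per-block symmetric quantization, scale only."""
    scale = float(np.max(np.abs(block))) / 127.0
    scale = scale if scale > 0 else 1.0
    q = np.clip(np.round(block / scale), -127, 127).astype(np.int8)
    return q, scale                      # dequantize: q * scale

def quant_asym_q4(block: np.ndarray):
    """Q4_1-style: per-block asymmetric quantization, scale + min."""
    lo, hi = float(block.min()), float(block.max())
    scale = (hi - lo) / 15.0
    scale = scale if scale > 0 else 1.0
    q = np.clip(np.round((block - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo                  # dequantize: q * scale + lo

w = np.random.randn(BLOCK).astype(np.float32)
q8, s8 = quant_sym_q8(w)
q4, s4, m4 = quant_asym_q4(w)
print("Q8_0-like max abs error:", np.abs(q8 * s8 - w).max())
print("Q4_1-like max abs error:", np.abs(q4 * s4 + m4 - w).max())
```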
# performance
numactl -C0-15 python llm_pipeline.py -m ./gen/llama-2-7b-chat/Q8_0/ -p "I am retail store manager with new ice cream flavor Super Sweet White Coffee. Can you generate a twitter post to promote it?" -r 1 --greedy -al 32
# perplexity
# download wikitext-2-raw from :
# https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
# https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/
python ./llm_perplexity.py -f=./wikitext-2-raw/wiki.test.raw -ov ./gen/llama-2-7b-chat/F16/
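For reference, perplexity is the exponential of the average negative log-likelihood of the test tokens; the exact tokenization and context windowing over wikitext-2 are defined by llm_perplexity.py:

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\right)$$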