Add gpu+npu amd hybrid support #252

Merged · 2 commits · Dec 21, 2024
4 changes: 4 additions & 0 deletions .github/actions/server-testing/action.yml
@@ -15,6 +15,10 @@ inputs:
required: false
default: ""
description: "Location of the OGA for RyzenAI NPU install directory on disk"
amd_oga_hybrid:
required: false
default: ""
description: "Location of the OGA for RyzenAI Hybrid install directory on disk"
hf_token:
required: false
default: ""
109 changes: 109 additions & 0 deletions docs/ort_genai_hybrid.md
@@ -0,0 +1,109 @@
# Introduction

[onnxruntime-genai (aka OGA)](https://github.com/microsoft/onnxruntime-genai/tree/main?tab=readme-ov-file) is a new framework created by Microsoft for running ONNX LLMs.

## Hybrid instructions

### Warnings

- The OGA wheels must be installed in the exact order shown below, or you will end up with the wrong packages in your environment. If you see pip dependency errors, delete your conda environment and start over with a fresh one.

### Requirements
- [NPU Drivers (version .237)](https://ryzenai.docs.amd.com/en/latest/inst.html#install-npu-drivers)
- [Hybrid LLM artifacts package](https://github.com/aigdat/ryzenai-sw-ea/blob/main/ryzen_ai_13_ga/hybrid-llm-artifacts_1.3.0.zip)

### Installation

1. NOTE: ⚠️ DO THESE STEPS IN EXACTLY THIS ORDER ⚠️
1. Install `lemonade`:
1. Create a conda environment: `conda create -n oga-hybrid python=3.10` (Python 3.10 is required)
1. Activate: `conda activate oga-hybrid`
1. `cd REPO_ROOT`
1. `pip install -e .[llm-oga-hybrid]`
1. Download required OGA packages
1. Access the [Hybrid LLM artifacts package](https://account.amd.com/en/member/ryzenai-sw-ea.html#tabs-a5e122f973-item-4757898120-tab) and download `hybrid-llm-artifacts_1.3.0.zip` and `onnxruntime_vitisai-1.19.0.dev20241217-cp310-cp310-win_amd64.whl`.
1. Unzip `hybrid-llm-artifacts_1.3.0.zip`
1. Copy the `onnxruntime_vitisai-1.19.0.dev20241217-cp310-cp310-win_amd64.whl` file into the `hybrid-llm-artifacts_1.3.0\hybrid-llm-artifacts\onnxruntime_genai\wheel` folder.
1. Create the system environment variable `AMD_OGA_HYBRID` and set it to the path of the unzipped `hybrid-llm-artifacts_1.3.0` folder (a quick way to verify this is sketched after this list).
1. Restart your terminal
1. Install the wheels:
1. `cd hybrid-llm-artifacts_1.3.0\hybrid-llm-artifacts\onnxruntime_genai\wheel`
1. `pip install onnxruntime_genai_directml-0.4.0.dev0-cp310-cp310-win_amd64.whl`
1. `pip install onnxruntime_vitisai-1.19.0.dev20241217-cp310-cp310-win_amd64.whl`
1. Install driver
1. Download NPU driver from [NPU Drivers (version .237)](https://ryzenai.docs.amd.com/en/latest/inst.html#install-npu-drivers)
1. Unzip `NPU_RAI1.3.zip`
1. Right click `kipudrv.inf` and select `Install`
1. Check under `Device Manager` to ensure that `NPU Compute Accelerator` is using version `32.0.203.237`.
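
After finishing the list above, a quick sanity check can confirm that the environment variable, the artifacts folder, and the wheels are all in place before running any workloads. The sketch below is illustrative only: it assumes the module name `onnxruntime_genai` and the `onnx_custom_ops.dll` location described above (the same relative path that `oga.py` uses).

```
# sanity_check_hybrid.py -- illustrative environment check, not part of lemonade
import importlib
import os

# AMD_OGA_HYBRID must point at the unzipped hybrid-llm-artifacts_1.3.0 folder
oga_hybrid = os.environ.get("AMD_OGA_HYBRID")
if not oga_hybrid:
    raise SystemExit("AMD_OGA_HYBRID is not set; set it and restart your terminal.")

# The custom ops DLL should exist inside the artifacts folder (assumed layout)
custom_ops = os.path.join(
    oga_hybrid, "hybrid-llm-artifacts", "onnx_utils", "bin", "onnx_custom_ops.dll"
)
print(f"onnx_custom_ops.dll found: {os.path.exists(custom_ops)}")

# The OGA wheel should be importable in the active conda environment
try:
    importlib.import_module("onnxruntime_genai")
    print("onnxruntime_genai imported successfully")
except ImportError as err:
    print(f"onnxruntime_genai is not importable: {err}")
```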

### Runtime

To test basic functionality, point lemonade to any of the models under [quark-quantized-onnx-hybrid-llms-for-ryzen-ai-1.3](https://huggingface.co/collections/amd/quark-awq-g128-int4-asym-fp16-onnx-hybrid-13-674b307d2ffa21dd68fa41d5):

```
lemonade -i amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid oga-load --device hybrid --dtype int4 llm-prompt -p "hello whats your name?" --max-new-tokens 15
```

```
Building "amd_Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid"
✓ Loading OnnxRuntime-GenAI model
✓ Prompting LLM

amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid:
<built-in function input> (executed 1x)
Build dir: C:\Users\ramkr\.cache\lemonade\amd_Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid
Status: Successful build!
Dtype: int4
Device: hybrid
Response: hello whats your name? i'm a robot, and i'm here to help you with any questions



Woohoo! Saved to ~\.cache\lemonade\amd_Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid
```

To test/use the websocket server:

```
lemonade -i amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid oga-load --device hybrid --dtype int4 serve --max-new-tokens 50
```

Then open the address (http://localhost:8000) in a browser and chat with it.
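
If you would rather script against the server than use the browser page, a minimal client sketch is shown below. It assumes the third-party `websockets` package and a plain-text message exchange on the `/ws` endpoint shown in the log that follows; the exact message format is not documented here, so treat this as a starting point only.

```
# ws_client.py -- illustrative client (assumes `pip install websockets`)
import asyncio

import websockets


async def main():
    # The /ws endpoint comes from the server log below; the message format is an assumption
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        await ws.send("hello whats your name?")
        reply = await ws.recv()
        print(reply)


asyncio.run(main())
```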

```
Building "amd_Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid"
✓ Loading OnnxRuntime-GenAI model
Launching LLM Server

INFO: Started server process [8704]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
INFO: ::1:57038 - "GET / HTTP/1.1" 200 OK
INFO: ('::1', 57042) - "WebSocket /ws" [accepted]
INFO: connection open
```

To run a single MMLU test:

```
lemonade -i amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid oga-load --device hybrid --dtype int4 accuracy-mmlu --tests management
```

```
Building "amd_Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid"
✓ Loading OnnxRuntime-GenAI model
✓ Measuring accuracy with MMLU

amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid:
<built-in function input> (executed 1x)
Build dir: C:\Users\ramkr\.cache\lemonade\amd_Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid
Status: Successful build!
Dtype: int4
Device: hybrid
Mmlu Management Accuracy: 49.515 %



Woohoo! Saved to ~\.cache\lemonade\amd_Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid
```
2 changes: 1 addition & 1 deletion docs/ort_genai_igpu.md
@@ -1,6 +1,6 @@
# OnnxRuntime GenAI (OGA) for iGPU and CPU

onnxruntime-genai (aka OGA) is a new framework created by Microsoft for running ONNX LLMs: https://github.com/microsoft/onnxruntime-genai/tree/main?tab=readme-ov-file
[onnxruntime-genai (aka OGA)](https://github.com/microsoft/onnxruntime-genai/tree/main?tab=readme-ov-file) is a new framework created by Microsoft for running ONNX LLMs

## Installation

73 changes: 37 additions & 36 deletions docs/ort_genai_npu.md
@@ -1,6 +1,6 @@
# Introduction

onnxruntime-genai (aka OGA) is a new framework created by Microsoft for running ONNX LLMs: https://github.com/microsoft/onnxruntime-genai/tree/main?tab=readme-ov-file
[onnxruntime-genai (aka OGA)](https://github.com/microsoft/onnxruntime-genai/tree/main?tab=readme-ov-file) is a new framework created by Microsoft for running ONNX LLMs

## NPU instructions

@@ -15,10 +15,10 @@ onnxruntime-genai (aka OGA) is a new framework created by Microsoft for running
1. Create a conda environment: `conda create -n oga-npu python=3.10` (Python 3.10 is required)
1. Activate: `conda activate oga-npu`
1. `cd REPO_ROOT`
1. `pip install -e .[oga-npu]`
1. `pip install -e .[llm-oga-npu]`
1. Download required OGA packages
1. Access the [AMD RyzenAI EA Lounge](https://account.amd.com/en/member/ryzenai-sw-ea.html#tabs-a5e122f973-item-4757898120-tab) and download `amd_oga_Oct4_2024.zip` from `Ryzen AI 1.3 EA Release`.
1. Unzip `amd_oga_Oct4_2024.zip`
1. Access the [AMD RyzenAI EA Lounge](https://account.amd.com/en/member/ryzenai-sw-ea.html#tabs-a5e122f973-item-4757898120-tab) and download `npu-llm-artifacts_1.3.0.zip` from `Ryzen AI 1.3 Model Release`.
1. Unzip `npu-llm-artifacts_1.3.0.zip`
1. Setup your folder structure:
1. Copy the `amd_oga` folder from the above zip file, if desired
1. Create the system environment variable `AMD_OGA` and set it to the path to the `amd_oga` folder
@@ -28,79 +28,80 @@ onnxruntime-genai (aka OGA) is a new framework created by Microsoft for running
1. `pip install onnxruntime_vitisai-1.20.0-cp310-cp310-win_amd64.whl`
1. `pip install voe-1.2.0-cp310-cp310-win_amd64.whl`
1. Install driver
1. Access the [AMD RyzenAI EA Lounge](https://account.amd.com/en/member/ryzenai-sw-ea.html#tabs-a5e122f973-item-4757898120-tab) and download `Win24AIDriver.zip` from `Ryzen AI 1.3 Preview Release`.
1. Unzip `Win24AIDriver.zip`
1. Download NPU driver from [NPU Drivers (version .237)](https://ryzenai.docs.amd.com/en/latest/inst.html#install-npu-drivers)
1. Unzip `NPU_RAI1.3.zip`
1. Right click `kipudrv.inf` and select `Install`
1. Check under `Device Manager` to ensure that `NPU Compute Accelerator` is using version `32.0.203.219`.
1. Check under `Device Manager` to ensure that `NPU Compute Accelerator` is using version `32.0.203.237`.

### Runtime

To test basic functionality, point lemonade to any of the models under [quark-quantized-onnx-llms-for-ryzen-ai-1.3-ea](https://huggingface.co/collections/amd/quark-quantized-onnx-llms-for-ryzen-ai-13-ea-66fc8e24927ec45504381902):
To test basic functionality, point lemonade to any of the models under [quark_awq_g128_int4_asym_bf16_onnx_npu 1.3](https://huggingface.co/collections/amd/quark-awq-g128-int4-asym-bf16-onnx-npu-13-6759f510b8132db53e044aaf):

```
lemonade -i amd/Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix oga-load --device npu --dtype int4 llm-prompt -p "hello whats your name?" --max-new-tokens 15
lemonade -i amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-bf16-onnx-ryzen-strix oga-load --device npu --dtype int4 llm-prompt -p "hello whats your name?" --max-new-tokens 15
```

```
Building "amd_Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix"
[Vitis AI EP] No. of Operators : CPU 73 MATMULNBITS 99
[Vitis AI EP] No. of Subgraphs :MATMULNBITS 33
Building "amd_Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid"
✓ Loading OnnxRuntime-GenAI model
✓ Prompting LLM

amd/Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix:
amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-bf16-onnx-ryzen-strix:
<built-in function input> (executed 1x)
Build dir: C:\Users\danie/.cache/lemonade\amd_Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix
Build dir: C:\Users\ramkr\.cache\lemonade\amd_Llama-3.2-1B-Instruct-awq-g128-int4-asym-bf16-onnx-ryzen-strix
Status: Successful build!
Dtype: int4
Device: npu
Response: hello whats your name?
Hi, I'm a 21 year old male from the
Response: hello whats your name? i'm a robot, and i'm here to help you with any questions



Woohoo! Saved to ~\.cache\lemonade\amd_Llama-3.2-1B-Instruct-awq-g128-int4-asym-bf16-onnx-ryzen-strix
```

To test/use the websocket server:

```
lemonade -i amd/Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix oga-load --device npu --dtype int4 serve --max-new-tokens 50
lemonade -i amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-bf16-onnx-ryzen-strix oga-load --device npu --dtype int4 serve --max-new-tokens 50
```

Then open the address (http://localhost:8000) in a browser and chat with it.

```
Building "amd_Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix"
[Vitis AI EP] No. of Operators : CPU 73 MATMULNBITS 99
[Vitis AI EP] No. of Subgraphs :MATMULNBITS 33
Building "amd_Llama-3.2-1B-Instruct-awq-g128-int4-asym-bf16-onnx-ryzen-strix"
✓ Loading OnnxRuntime-GenAI model
Launching LLM Server


INFO: Started server process [27752]
INFO: Started server process [8704]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
INFO: ::1:54973 - "GET / HTTP/1.1" 200 OK
INFO: ('::1', 54975) - "WebSocket /ws" [accepted]
INFO: ::1:57038 - "GET / HTTP/1.1" 200 OK
INFO: ('::1', 57042) - "WebSocket /ws" [accepted]
INFO: connection open
I'm a newbie here. I'm looking for a good place to buy a domain name. I've been looking around and i've found a few good places.
```

To run a single MMLU test:

```
lemonade -i amd/Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix oga-load --device npu --dtype int4 accuracy-mmlu --tests management
lemonade -i amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-bf16-onnx-ryzen-strix oga-load --device npu --dtype int4 accuracy-mmlu --tests management
```

```
Building "amd_Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix"
[Vitis AI EP] No. of Operators : CPU 73 MATMULNBITS 99
[Vitis AI EP] No. of Subgraphs :MATMULNBITS 33
✓ Loading OnnxRuntime-GenAI model
✓ Measuring accuracy with MMLU
Building "amd_Llama-3.2-1B-Instruct-awq-g128-int4-asym-bf16-onnx-ryzen-strix"
✓ Loading OnnxRuntime-GenAI model
✓ Measuring accuracy with MMLU

amd/Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix:
amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-bf16-onnx-ryzen-strix:
<built-in function input> (executed 1x)
Build dir: C:\Users\danie/.cache/lemonade\amd_Llama-2-7b-hf-awq-g128-int4-asym-fp32-onnx-ryzen-strix
Build dir: C:\Users\ramkr\.cache\lemonade\amd_Llama-3.2-1B-Instruct-awq-g128-int4-asym-bf16-onnx-ryzen-strix
Status: Successful build!
Dtype: int4
Device: npu
Mmlu Management Accuracy: 56.31 %
Dtype: int4
Device: npu
Mmlu Management Accuracy: 49.515 %



Woohoo! Saved to ~\.cache\lemonade\amd_Llama-3.2-1B-Instruct-awq-g128-int4-asym-bf16-onnx-ryzen-strix

```
6 changes: 6 additions & 0 deletions setup.py
@@ -100,6 +100,12 @@
"fastapi",
"uvicorn[standard]",
],
"llm-oga-hybrid": [
"transformers",
"torch",
"onnx==1.16.1",
"numpy==1.26.4",
],
},
classifiers=[],
entry_points={
76 changes: 73 additions & 3 deletions src/turnkeyml/llm/tools/ort_genai/oga.py
@@ -12,6 +12,7 @@
import os
import time
import json
import shutil
from fnmatch import fnmatch
from queue import Queue
from huggingface_hub import snapshot_download
@@ -35,7 +36,13 @@
oga_model_builder_cache_path = "model_builder"

# Mapping from processor to execution provider, used in pathnames and by model_builder
execution_providers = {"cpu": "cpu", "npu": "npu", "igpu": "dml", "cuda": "cuda"}
execution_providers = {
"cpu": "cpu",
"npu": "npu",
"igpu": "dml",
"hybrid": "hybrid",
"cuda": "cuda",
}


class OrtGenaiTokenizer(TokenizerAdapter):
@@ -248,7 +255,7 @@ def parser(add_help: bool = True) -> argparse.ArgumentParser:
parser.add_argument(
"-d",
"--device",
choices=["igpu", "npu", "cpu", "cuda"],
choices=["igpu", "npu", "cpu", "hybrid", "cuda"],
default="igpu",
help="Which device to load the model on to (default: igpu)",
)
@@ -312,8 +319,10 @@ def run(
"cpu": {"int4": "*/*", "fp32": "*/*"},
"igpu": {"int4": "*/*", "fp16": "*/*"},
"npu": {"int4": "amd/**-onnx-ryzen-strix"},
"hybrid": {"int4": "amd/**-hybrid"},
"cuda": {"int4": "*/*", "fp16": "*/*"},
}

hf_supported = (
device in hf_supported_models
and dtype in hf_supported_models[device]
@@ -358,7 +367,7 @@ def run(
)

# Download the model from HF
if device == "npu":
if device == "npu" or device == "hybrid":

# NPU models on HF are ready to go and HF does its own caching
full_model_path = snapshot_download(
@@ -367,6 +376,67 @@ def run(
)
oga_models_subfolder = None

if device == "hybrid":
# Locate the hybrid-llm-artifacts_1.3.0 directory via the AMD_OGA_HYBRID environment variable
hybrid_artifacts_path = os.environ.get("AMD_OGA_HYBRID")

if hybrid_artifacts_path is None:
raise RuntimeError(
"Could not find hybrid-llm-artifacts_1.3.0 in system PATH. "
"Please ensure it is added to your PATH environment variable."
)

if hybrid_artifacts_path:
# Construct the path to onnx_custom_ops.dll
custom_ops_path = os.path.join(
hybrid_artifacts_path,
"hybrid-llm-artifacts",
"onnx_utils",
"bin",
"onnx_custom_ops.dll",
)

config_path = os.path.join(full_model_path, "genai_config.json")

# Check if the config file exists
if os.path.exists(config_path):
with open(config_path, "r", encoding="utf-8") as f:
config = json.load(f)

# Modify the custom_ops_library under decoder -> session_options
if (
"model" in config
and "decoder" in config["model"]
and "session_options" in config["model"]["decoder"]
):
config["model"]["decoder"]["session_options"][
"custom_ops_library"
] = custom_ops_path

# Write the changes back to the file
with open(config_path, "w", encoding="utf-8") as f:
json.dump(config, f, indent=4)

# Copy DirectML.dll from lib to bin folder
src_dll = os.path.join(
hybrid_artifacts_path,
"hybrid-llm-artifacts",
"onnxruntime_genai",
"lib",
"DirectML.dll",
)
dst_dll = os.path.join(
hybrid_artifacts_path,
"hybrid-llm-artifacts",
"onnx_utils",
"bin",
"DirectML.dll",
)

# Create destination directory if it doesn't exist
os.makedirs(os.path.dirname(dst_dll), exist_ok=True)
shutil.copy2(src_dll, dst_dll)
else:
# device is 'cpu' or 'igpu'

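
For reviewers, the net effect of the hybrid branch above on a downloaded model's `genai_config.json` can be sketched as follows. The artifact path is purely illustrative; the key layout mirrors the keys the code reads and writes (`model` → `decoder` → `session_options` → `custom_ops_library`).

```
# Illustrative only: shape of the genai_config.json fragment after the hybrid branch runs
import json

patched_fragment = {
    "model": {
        "decoder": {
            "session_options": {
                # oga.py points this at onnx_custom_ops.dll inside the hybrid artifacts
                "custom_ops_library": r"C:\path\to\hybrid-llm-artifacts_1.3.0"
                r"\hybrid-llm-artifacts\onnx_utils\bin\onnx_custom_ops.dll",
            }
        }
    }
}
print(json.dumps(patched_fragment, indent=4))
```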