Run Multi LoRA with ONNX models

Setup

  1. Install Olive

    This installs Olive from main. Replace with version 0.8.0 when it is released.

    pip install git+https://github.com/microsoft/olive
  2. Install ONNX Runtime generate()

    pip install onnxruntime-genai
    
  3. Install other dependencies

    pip install optimum peft
  4. Downgrade torch and transformers

    TODO: There is an export bug with torch 2.5.0 and an incompatibility with transformers>=4.45.0

    pip uninstall torch
    pip install torch==2.4
    pip uninstall transformers
    pip install transformers==4.44
  5. Choose a model

    In this example we'll use Llama-3.1-8B-Instruct (https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

    You need to register with Meta for a license to use this model. You can do this by visiting the model page on Hugging Face, signing in, and registering for access. Access should be granted quickly. Ensure that the huggingface-cli is installed (pip install huggingface-hub[cli]) and that you are logged in via huggingface-cli login (see the commands after this list).

  6. Locate datasets and/or existing adapters

    In this example, we will use two pre-tuned adapters: Coldstart/Llama-3.1-8B-Instruct-Surfer-Dude-Personality and Coldstart/Llama-3.1-8B-Instruct-Hillbilly-Personality.
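
If you have not yet authenticated with Hugging Face, the commands referenced in Step 5 are:

    pip install "huggingface-hub[cli]"
    huggingface-cli login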

Generate model and adapters in ONNX format

Convert existing adapters into ONNX format

Note the output path cannot have any period (.) characters.

Note also that this step requires 63GB of memory on the machine on which it is running.

  1. Export the model to ONNX format

    Note: add --use_model_builder when this is ready

    olive capture-onnx-graph -m meta-llama/Llama-3.1-8B-Instruct --adapter_path Coldstart/Llama-3.1-8B-Instruct-Surfer-Dude-Personality -o models\Llama-3-1-8B-Instruct-LoRA --torch_dtype float32 --use_ort_genai
  2. (Optional) Quantize the model

    olive quantize -m models\Llama-3-1-8B-Instruct-LoRA --algorithm rtn --implementation matmul4 -o models\Llama-3-1-8B-Instruct-LoRA-int4
  3. Adapt model

    olive generate-adapter -m models\Llama-3-1-8B-Instruct-LoRA-int4 -o models\Llama-3-1-8B-Instruct-LoRA-int4\adapted
  4. Convert adapters to ONNX

    This step assumes you quantized the model in Step 2. If you skipped Step 2, remove the --quantize_int4 argument.

    olive convert-adapters --adapter_path Coldstart/Llama-3.1-8B-Instruct-Surfer-Dude-Personality --output_path adapters\Llama-3.1-8B-Instruct-Surfer-Dude-Personality --dtype float32 --quantize_int4
    olive convert-adapters --adapter_path Coldstart/Llama-3.1-8B-Instruct-Hillbilly-Personality --output_path adapters\Llama-3.1-8B-Instruct-Hillbilly-Personality --dtype float32 --quantize_int4
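
To sanity-check a converted adapter, you can load it alongside the adapted model with the ONNX Runtime generate() Python API. This is a minimal sketch assuming the og.Model/og.Adapters API from onnxruntime-genai and the output paths used above; adjust the paths to match your layout.

    import onnxruntime_genai as og

    # Load the adapted base model produced by `olive generate-adapter`
    model = og.Model(r"models\Llama-3-1-8B-Instruct-LoRA-int4\adapted\model")

    # Load the converted adapter weights and register them under a name
    adapters = og.Adapters(model)
    adapters.load(r"adapters\Llama-3.1-8B-Instruct-Hillbilly-Personality.onnx_adapter", "hillbilly")
    print("Adapter loaded successfully")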

Write your application

See app.py as an example.
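
The sketch below shows the general shape of a multi-LoRA application using the onnxruntime-genai Python API; app.py is the authoritative example and also handles the -t/-s/-p arguments shown in the next section. The adapter name ("hillbilly") and search options here are illustrative assumptions.

    import onnxruntime_genai as og

    model = og.Model(r"models\Llama-3-1-8B-Instruct-LoRA-int4\adapted\model")
    tokenizer = og.Tokenizer(model)

    # Register one or more converted adapters by name
    adapters = og.Adapters(model)
    adapters.load(r"adapters\Llama-3.1-8B-Instruct-Hillbilly-Personality.onnx_adapter", "hillbilly")

    # Build the prompt using the Llama 3.1 chat template (the -t/-s/-p arguments below)
    system = "You are a friendly chatbot"
    user = "Hi, how are you today?"
    prompt = (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=256)

    generator = og.Generator(model, params)
    generator.set_active_adapter(adapters, "hillbilly")  # switch adapters per request
    generator.append_tokens(tokenizer.encode(prompt))

    while not generator.is_done():
        generator.generate_next_token()

    print(tokenizer.decode(generator.get_sequence(0)))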

Call the application

python app.py -m models\Llama-3-1-8B-Instruct-LoRA-int4\adapted\model -a adapters\Llama-3.1-8B-Instruct-Hillbilly-Personality.onnx_adapter -t "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -s "You are a friendly chatbot" -p "Hi, how are you today?"

Appendix:

Fine-tune your own data

Note: this requires CUDA

Use the olive finetune command: https://microsoft.github.io/Olive/features/cli.html#finetune

Here is an example usage of the command:

olive finetune --method qlora -m meta-llama/Meta-Llama-3-8B -d nampdn-ai/tiny-codes --train_split "train[:4096]" --eval_split "train[4096:4224]" --text_template "### Language: {programming_language} \n### Question: {prompt} \n### Answer: {response}" --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --max_steps 150 --logging_steps 50 -o adapters\tiny-codes
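
Once fine-tuning completes, the resulting adapter can be converted to ONNX format with the same olive convert-adapters command used above. A sketch, assuming the fine-tuned adapter weights end up under adapters\tiny-codes (adjust the path to wherever the adapter files were actually written):

    olive convert-adapters --adapter_path adapters\tiny-codes --output_path adapters\tiny-codes-onnx --dtype float32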