[NPU] Add Optimized Support for Llama3.2-1B/3B on NPU (#12339)
* Add initial support for Llama3.2-1B/3B

* Move Llama3.2 support into the current llama_mp implementation
sgwhat authored Nov 6, 2024
1 parent 872a744 commit a7b6668
Showing 6 changed files with 360 additions and 127 deletions.
19 changes: 19 additions & 0 deletions python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
@@ -7,6 +7,8 @@ In this directory, you will find examples on how to directly run HuggingFace `transformers`
|------------|----------------------------------------------------------------|
| Llama2 | [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) |
| Llama3 | [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) |
| Llama3.2-1B | [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) |
| Llama3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) |
| Chatglm3 | [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b) |
| Chatglm2 | [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b) |
| Qwen2 | [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct), [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) |
@@ -33,6 +35,9 @@ conda activate llm
:: install ipex-llm with 'npu' option
pip install --pre --upgrade ipex-llm[npu]
:: [optional] for Llama-3.2-1B-Instruct & Llama-3.2-3B-Instruct
pip install transformers==4.45.0 accelerate==0.33.0
```

## 2. Runtime Configurations
@@ -82,6 +87,8 @@ done
The examples below show how to run the **_optimized HuggingFace model implementations_** on Intel NPU, including
- [Llama2-7B](./llama.py)
- [Llama3-8B](./llama.py)
- [Llama3.2-1B](./llama.py)
- [Llama3.2-3B](./llama.py)
- [Qwen2-1.5B](./qwen.py)
- [Qwen2.5-7B](./qwen.py)
- [MiniCPM-1B](./minicpm.py)
@@ -106,6 +113,12 @@ python llama.py
:: to run Meta-Llama-3-8B-Instruct (LNL driver version: 32.0.101.2715)
python llama.py --repo-id-or-model-path meta-llama/Meta-Llama-3-8B-Instruct
:: to run Llama-3.2-1B-Instruct
python llama.py --repo-id-or-model-path meta-llama/Llama-3.2-1B-Instruct
:: to run Llama-3.2-3B-Instruct
python llama.py --repo-id-or-model-path meta-llama/Llama-3.2-3B-Instruct
:: to run Qwen2-1.5B-Instruct (LNL driver version: 32.0.101.2715)
python qwen.py
@@ -145,6 +158,12 @@ python llama.py --disable-transpose-value-cache
:: to run Meta-Llama-3-8B-Instruct (LNL driver version: 32.0.101.2715)
python llama.py --repo-id-or-model-path meta-llama/Meta-Llama-3-8B-Instruct --disable-transpose-value-cache
:: to run Llama-3.2-1B-Instruct
python llama.py --repo-id-or-model-path meta-llama/Llama-3.2-1B-Instruct --disable-transpose-value-cache
:: to run Llama-3.2-3B-Instruct
python llama.py --repo-id-or-model-path meta-llama/Llama-3.2-3B-Instruct --disable-transpose-value-cache
:: to run Qwen2-1.5B-Instruct (LNL driver version: 32.0.101.2715)
python qwen.py --disable-transpose-value-cache
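The commands above invoke the bundled `llama.py` example, which follows the standard `from_pretrained` / `generate` flow through IPEX-LLM's NPU `AutoModelForCausalLM` wrapper. The sketch below is a minimal illustration of that flow, not the example script itself; the keyword arguments shown (low-bit format, output-length budget, and so on) are assumptions, so consult `llama.py` for the exact set it passes.

```python
# Minimal sketch: load one of the newly supported Llama-3.2 checkpoints through
# IPEX-LLM's NPU AutoModel wrapper and run a single generation. Argument names and
# values are illustrative assumptions; llama.py remains the authoritative reference.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers.npu_model import AutoModelForCausalLM

model_path = "meta-llama/Llama-3.2-1B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    load_in_low_bit="sym_int4",   # assumed low-bit weight format
    optimize_model=True,          # route through the fused NPU implementation
    max_output_len=1024,          # assumed KV-cache / output-length budget
    transpose_value_cache=True,   # counterpart of --disable-transpose-value-cache
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

inputs = tokenizer("What is AI?", return_tensors="pt")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```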
2 changes: 2 additions & 0 deletions python/llm/example/NPU/HF-Transformers-AutoModels/README.md
@@ -10,6 +10,8 @@ This folder contains examples of running IPEX-LLM on Intel NPU:
|------------|----------------------------------------------------------------|
| Llama2 | [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) |
| Llama3 | [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) |
| Llama3.2-1B | [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) |
| Llama3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) |
| Chatglm3 | [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b) |
| Chatglm2 | [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b) |
| Qwen2 | [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct), [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) |
18 changes: 14 additions & 4 deletions python/llm/src/ipex_llm/transformers/npu_models/convert_mp.py
@@ -173,7 +173,8 @@ def convert_llama(
     intra_pp=None,
     transpose_value_cache=True,
 ):
-    from ipex_llm.transformers.npu_models.llama_mp import gen_llama_fused_model_forward
+    from ipex_llm.transformers.npu_models.llama_mp import gen_llama_fused_model_forward,\
+        gen_llama_32_fused_model_forward
     from ipex_llm.transformers.npu_models.llama_mp import DecodeRunner, PrefillRunner
     from transformers.models.llama.modeling_llama import LlamaModel

@@ -193,9 +194,18 @@ def convert_llama(
         max_prompt_len=max_prompt_len,
         transpose_value_cache=transpose_value_cache,
     )
-    llama_model_forward = gen_llama_fused_model_forward(
-        prefill_runner=prefill_runner, decode_runner=decode_runner
-    )
+    from packaging import version
+    import transformers
+    trans_version = transformers.__version__
+    if version.parse(trans_version) == version.parse("4.45.0"):
+        # llama-3.2-3B & llama-3.2-1B
+        llama_model_forward = gen_llama_32_fused_model_forward(
+            prefill_runner=prefill_runner, decode_runner=decode_runner
+        )
+    else:
+        llama_model_forward = gen_llama_fused_model_forward(
+            prefill_runner=prefill_runner, decode_runner=decode_runner
+        )
     convert_forward(model, LlamaModel, llama_model_forward)
     from transformers.models.llama.modeling_llama import LlamaForCausalLM
     from ipex_llm.transformers.npu_models.llama_mp import llama2_casullm_forward
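The dispatch added above keys off the installed `transformers` version with `packaging.version`, which compares release numbers numerically instead of lexicographically. The snippet below is a standalone illustration of that comparison behaviour and of the dispatch shape (simplified to return labels); it is not part of the patch.

```python
# Why packaging.version is used for the dispatch above: plain string comparison
# orders multi-digit components incorrectly, version.parse does not.
from packaging import version

print("4.45.0" > "4.9.0")                                # False -- lexicographic, misleading
print(version.parse("4.45.0") > version.parse("4.9.0"))  # True  -- numeric, correct


def pick_llama_forward(trans_version: str) -> str:
    """Simplified mirror of the dispatch in convert_llama, returning a label."""
    if version.parse(trans_version) == version.parse("4.45.0"):
        # transformers 4.45.0 is the version pinned for Llama-3.2-1B/3B
        return "gen_llama_32_fused_model_forward"
    return "gen_llama_fused_model_forward"


print(pick_llama_forward("4.45.0"))  # gen_llama_32_fused_model_forward
print(pick_llama_forward("4.44.2"))  # gen_llama_fused_model_forward
```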
5 changes: 4 additions & 1 deletion python/llm/src/ipex_llm/transformers/npu_models/kv.py
@@ -145,7 +145,7 @@ class DynamicFusedNormalCache(DynamicCache):
     # Experimental support for fused decoderlayer implementation on NPU
     # Currently only for llama2
 
-    def __init__(self) -> None:
+    def __init__(self, num_hidden_layers: Optional[int] = None) -> None:
         self.key_cache: Dict[int, torch.Tensor] = {}
         self.value_cache: Dict[int, torch.Tensor] = {}
         self.min_layer_idx = sys.maxsize
@@ -158,6 +158,9 @@ def update(
         cache_kwargs: Optional[Dict[str, Any]]=None,
     ) -> Tuple[torch.Tensor, torch.Tensor]:
 
+        if key_states == []:
+            return key_states, value_states
+
         batch_size, num_heads, seq_len, head_dim = key_states.shape
 
         max_seq_length = cache_kwargs["max_seq_len"] if "max_seq_len" in cache_kwargs else None
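The new guard in `update` returns empty inputs untouched instead of trying to unpack their shape, so a caller with nothing to cache for a layer can pass an empty list straight through. Below is a simplified, self-contained sketch of that update pattern; the class name, signature, and append logic are illustrative stand-ins, not the real `DynamicFusedNormalCache`.

```python
# Simplified stand-in for the cache-update pattern above: empty inputs are handed
# back unchanged, otherwise new states are appended to the per-layer cache along
# the sequence dimension. Hypothetical class, not the real DynamicFusedNormalCache.
from typing import Dict, List, Tuple, Union

import torch


class FusedCacheSketch:
    def __init__(self) -> None:
        self.key_cache: Dict[int, torch.Tensor] = {}
        self.value_cache: Dict[int, torch.Tensor] = {}

    def update(
        self,
        key_states: Union[torch.Tensor, List],
        value_states: Union[torch.Tensor, List],
        layer_idx: int,
    ) -> Tuple:
        # Same intent as the `key_states == []` guard in the diff: nothing to cache,
        # so return the (empty) inputs to the caller as-is.
        if isinstance(key_states, list) and len(key_states) == 0:
            return key_states, value_states

        if layer_idx not in self.key_cache:
            self.key_cache[layer_idx] = key_states
            self.value_cache[layer_idx] = value_states
        else:
            # Shapes are (batch, num_heads, seq_len, head_dim); grow along seq_len.
            self.key_cache[layer_idx] = torch.cat(
                [self.key_cache[layer_idx], key_states], dim=2)
            self.value_cache[layer_idx] = torch.cat(
                [self.value_cache[layer_idx], value_states], dim=2)
        return self.key_cache[layer_idx], self.value_cache[layer_idx]


cache = FusedCacheSketch()
k = v = torch.zeros(1, 8, 4, 64)
print(cache.update(k, v, layer_idx=0)[0].shape)  # torch.Size([1, 8, 4, 64])
print(cache.update([], [], layer_idx=0))         # ([], []) -- passes straight through
```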
