Enable fp8 inference for Llava-Next and add Fused_SDPA (#1120)
Co-authored-by: regisss <[email protected]>
tthakkal and regisss authored Jul 15, 2024
1 parent 609e450 commit c495f47
Showing 9 changed files with 483 additions and 17 deletions.
120 changes: 111 additions & 9 deletions examples/image-to-text/README.md
@@ -15,28 +15,68 @@ limitations under the License.
-->

# Image to Text Examples

This directory contains a script that showcases how to use the Transformers pipeline API to perform image-to-text generation on Intel® Gaudi® AI Accelerators (HPUs).
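
Under the hood, the script builds a standard Transformers `pipeline`. A minimal sketch is shown below; it is illustrative only, assuming an `optimum-habana` installation, and the `device="hpu"` and `torch_dtype` arguments mirror common usage rather than the exact code of `run_pipeline.py` (which adds argument parsing, HPU graphs, warm-up, and quantization hooks).

```python
# Illustrative sketch, not the full run_pipeline.py
import torch
from transformers import pipeline

from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

# Swap stock Transformers classes for their Gaudi-optimized counterparts
adapt_transformers_to_gaudi()

generator = pipeline(
    "image-to-text",
    model="Salesforce/blip-image-captioning-large",
    torch_dtype=torch.bfloat16,
    device="hpu",
)
print(generator("https://ankur3107.github.io/assets/images/image-captioning-example.png"))
```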

## Single-HPU inference

Models that have been validated:
- [nlpconnect/vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning)
- [Salesforce/blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large)
- [Salesforce/blip-image-captioning-base](https://huggingface.co/Salesforce/blip-image-captioning-base)
- [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf)
- [llava-hf/llava-1.5-13b-hf](https://huggingface.co/llava-hf/llava-1.5-13b-hf)
- [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf)
- [llava-hf/llava-v1.6-vicuna-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf)
- [llava-hf/llava-v1.6-vicuna-13b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf)

### Inference with BF16

To run Salesforce/blip-image-captioning-large inference, use the following command:
```bash
python3 run_pipeline.py \
--model_name_or_path Salesforce/blip-image-captioning-large \
--image_path "https://ankur3107.github.io/assets/images/image-captioning-example.png" \
--use_hpu_graphs \
--bf16
```
To run Llava-1.5-7b inference, use the following command:
```bash
python3 run_pipeline.py \
--model_name_or_path llava-hf/llava-1.5-7b-hf \
--use_hpu_graphs \
--bf16
```

To run Llava-1.5-13b inference, use the following command:
```bash
python3 run_pipeline.py \
--model_name_or_path llava-hf/llava-1.5-13b-hf \
--use_hpu_graphs \
--bf16
```

To run Llava-v1.6-mistral-7b inference, use the following command:
```bash
python3 run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
--use_hpu_graphs \
--bf16
```

To run Llava-v1.6-vicuna-13b inference, use the following command:
```bash
python3 run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
--use_hpu_graphs \
--bf16
```

### Inference with FP8

Inference for Llava-1.5-7b, Llava-1.5-13b, Llava-v1.6-mistral-7b, and Llava-v1.6-vicuna-13b in FP8 precision is enabled using the Quantization Toolkit (HQT), which provides model measurement and quantization capabilities in PyTorch.

More information on enabling FP8 in SynapseAI is available here:
https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html
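
As a rough sketch of how the measurement and quantization flow hooks into the script: `finish_measurements` appears in this commit's changes to `run_pipeline.py`, while `prep_model` is assumed from the toolkit's usual usage and is not shown in this diff.

```python
# Sketch of the HQT hook-up, assuming the habana_quantization_toolkit module from the Gaudi software stack.
import os

import habana_quantization_toolkit


def generate_with_hqt(generator, images, prompt, generate_kwargs):
    """Run the image-to-text pipeline with optional HQT measurement or quantization."""
    if os.getenv("QUANT_CONFIG"):
        # Instruments the model for measurement, or applies FP8 quantization,
        # depending on the JSON config pointed to by QUANT_CONFIG.
        habana_quantization_toolkit.prep_model(generator.model)  # assumed counterpart to finish_measurements

    output = generator(images, prompt=prompt, generate_kwargs=generate_kwargs)

    if os.getenv("QUANT_CONFIG"):
        # Dumps the collected tensor statistics when running in measurement mode.
        habana_quantization_toolkit.finish_measurements(generator.model)
    return output
```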

Here is an example to measure the tensor quantization statistics on Llava-1.5-7b:
```bash
QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
--model_name_or_path llava-hf/llava-1.5-7b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16
```

Here is an example to quantize the model based on previous measurements for Llava-1.5-7b:
```bash
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_pipeline.py \
--model_name_or_path llava-hf/llava-1.5-7b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16
```


Here is an example to measure the tensor quantization statistics on Llava-v1.6-mistral-7b:
```bash
QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16
```

Here is an example to quantize the model based on previous measurements for Llava-v1.6-mistral-7b:
```bash
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16
```

Here is an example to measure the tensor quantization statistics on Llava-v1.6-vicuna-13b:
```bash
QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16
```

Here is an example to quantize the model based on previous measurements for Llava-v1.6-vicuna-13b:
```bash
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16
```

### Inference with FusedSDPA

Habana FusedSDPA is a fused and optimized implementation of `torch.nn.functional.scaled_dot_product_attention()` for Gaudi. For more details, refer to the [Gaudi online documentation](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_PyTorch_Models.html?highlight=fusedsdpa#using-fused-scaled-dot-product-attention-fusedsdpa). Currently, FusedSDPA works with BF16 precision for Llava models.
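
As a rough illustration of what `--use_flash_attention` toggles inside the Gaudi attention modules (the `FusedSDPA` import path and `apply` signature below are assumptions based on Habana's public examples, not code from this commit):

```python
# Illustrative only: choosing between stock PyTorch SDPA and Habana's fused kernel.
import torch.nn.functional as F

try:
    from habana_frameworks.torch.hpex.kernels import FusedSDPA  # assumed import path
except ImportError:
    FusedSDPA = None


def attention(query, key, value, attn_mask=None, use_flash_attention=False):
    if use_flash_attention and FusedSDPA is not None:
        # Fused, memory-optimized scaled dot-product attention on HPU (BF16 for Llava models).
        return FusedSDPA.apply(query, key, value, attn_mask, 0.0, False)
    return F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask)
```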

Use the following command to run Llava-1.5-7b inference with FusedSDPA:
```bash
python3 run_pipeline.py \
--model_name_or_path llava-hf/llava-1.5-7b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16 \
--use_flash_attention
```


Use the following command to run Llava-v1.6-mistral-7b inference with FusedSDPA:
```bash
python3 run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16 \
--use_flash_attention
```
10 changes: 8 additions & 2 deletions examples/image-to-text/run_pipeline.py
@@ -91,6 +91,12 @@ def main():
action="store_true",
help="Whether to ignore eos, set False to disable it.",
)
parser.add_argument(
"--use_flash_attention",
action="store_true",
help="Whether to enable Habana Flash Attention, provided that the model supports it.",
)

args = parser.parse_args()

# set args.quant_config with env variable if it is set
@@ -109,7 +115,7 @@ def main():
args.prompt = "<image>\nUSER: What's the content of the image?\nASSISTANT:"
elif args.prompt is None and model_type == "llava_next":
args.prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
if args.model_name_or_path == "llava-hf/llava-v1.6-vicuna-13b-hf":
if args.model_name_or_path in ["llava-hf/llava-v1.6-vicuna-13b-hf", "llava-hf/llava-v1.6-vicuna-7b-hf"]:
args.prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>\nWhat is shown in this image? ASSISTANT:"

image_paths = args.image_path
@@ -149,6 +155,7 @@ def main():
"hpu_graphs": args.use_hpu_graphs,
"max_new_tokens": args.max_new_tokens,
"ignore_eos": args.ignore_eos,
"use_flash_attention": args.use_flash_attention,
}
if args.use_hpu_graphs:
from habana_frameworks.torch.hpu import wrap_in_hpu_graph
@@ -165,7 +172,6 @@ def main():
    # warm up
    for i in range(args.warmup):
        generator(images, prompt=args.prompt, batch_size=args.batch_size, generate_kwargs=generate_kwargs)
    torch.hpu.synchronize()
    if args.quant_config:
        habana_quantization_toolkit.finish_measurements(generator.model)
10 changes: 10 additions & 0 deletions optimum/habana/transformers/modeling_utils.py
@@ -28,7 +28,12 @@
from .models import (
    GaudiBloomForCausalLM,
    GaudiBloomMLP,
    GaudiCLIPAttention,
    GaudiCLIPEncoder,
    GaudiCLIPEncoderLayer,
    GaudiCLIPVisionEmbeddings,
    GaudiCLIPVisionModel,
    GaudiCLIPVisionTransformer,
    GaudiCodeGenAttention,
    GaudiCodeGenForCausalLM,
    GaudiFalconAttention,
@@ -376,6 +381,11 @@ def adapt_transformers_to_gaudi():

    # Optimization for Clip on Gaudi
    transformers.models.clip.modeling_clip.CLIPVisionEmbeddings = GaudiCLIPVisionEmbeddings
    transformers.models.clip.modeling_clip.CLIPAttention = GaudiCLIPAttention
    transformers.models.clip.modeling_clip.CLIPEncoderLayer = GaudiCLIPEncoderLayer
    transformers.models.clip.modeling_clip.CLIPEncoder = GaudiCLIPEncoder
    transformers.models.clip.modeling_clip.CLIPVisionTransformer = GaudiCLIPVisionTransformer
    transformers.models.clip.modeling_clip.CLIPVisionModel = GaudiCLIPVisionModel

    # Optimization for falcon generation on Gaudi
    transformers.models.falcon.modeling_falcon.FalconAttention = GaudiFalconAttention
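
For context on how the CLIP patching above takes effect: `adapt_transformers_to_gaudi()` simply rebinds the class symbols in `transformers.models.clip.modeling_clip`, so any model instantiated afterwards (including the CLIP vision tower inside Llava and Llava-Next) picks up the Gaudi variants. The snippet below is a hypothetical, heavily simplified illustration of that pattern; the real classes live in `optimum/habana/transformers/models/clip/modeling_clip.py` and additionally route attention through FusedSDPA when requested.

```python
# Hypothetical sketch of the override pattern (not the real GaudiCLIPAttention).
import transformers
from transformers.models.clip.modeling_clip import CLIPAttention


class SketchGaudiCLIPAttention(CLIPAttention):
    def forward(self, hidden_states, attention_mask=None, causal_attention_mask=None, output_attentions=False):
        # A real Gaudi override would compute q/k/v as in the parent class and
        # call Habana's FusedSDPA here when --use_flash_attention is enabled.
        return super().forward(hidden_states, attention_mask, causal_attention_mask, output_attentions)


# Rebinding the module attribute is all the "patching" amounts to:
transformers.models.clip.modeling_clip.CLIPAttention = SketchGaudiCLIPAttention
```
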
9 changes: 8 additions & 1 deletion optimum/habana/transformers/models/__init__.py
@@ -31,7 +31,14 @@
    gaudi_bloom_convert_to_standard_cache,
    gaudi_bloom_model_forward,
)
from .clip import (
    GaudiCLIPAttention,
    GaudiCLIPEncoder,
    GaudiCLIPEncoderLayer,
    GaudiCLIPVisionEmbeddings,
    GaudiCLIPVisionModel,
    GaudiCLIPVisionTransformer,
)
from .codegen import (
    GaudiCodeGenAttention,
    GaudiCodeGenForCausalLM,
9 changes: 8 additions & 1 deletion optimum/habana/transformers/models/clip/__init__.py
@@ -1 +1,8 @@
from .modeling_clip import (
    GaudiCLIPAttention,
    GaudiCLIPEncoder,
    GaudiCLIPEncoderLayer,
    GaudiCLIPVisionEmbeddings,
    GaudiCLIPVisionModel,
    GaudiCLIPVisionTransformer,
)
