Enhance Qwen2 with fastsoftmax and bf16 RoPE and cache optimization #1087

Zhiwei35 · 2024-06-19T03:50:48Z

What does this PR do?

Sync the Qwen2 implementation with the Llama one and add the following three features:

fastsoftmax
cache optimization
bf16 RoPE

validated results

cmd

python run_generation.py --model_name_or_path /home/Qwen1.5-7B-Chat/ --use_kv_cache --max_new_tokens 100 --bf16 --batch_size 4 --use_hpu_graph --trim_logits --bucket_size 128 --bucket_internal --reuse_cache --use_flash_attention --flash_attention_fast_softmax

w/ this PR

06/19/2024 03:35:28 - INFO - __main__ - Time to first token = 372.23238800652325ms
Warming up
06/19/2024 03:35:29 - INFO - __main__ - Time to first token = 48.07302700646687ms
Warming up
06/19/2024 03:35:30 - INFO - __main__ - Time to first token = 12.022547001834027ms
06/19/2024 03:35:30 - INFO - __main__ - Running generate...
06/19/2024 03:35:31 - INFO - __main__ - Time to first token = 11.604846993577667ms
06/19/2024 03:35:31 - INFO - __main__ - Time to first token = 11.934215988731012ms
06/19/2024 03:35:32 - INFO - __main__ - Time to first token = 11.152709004818462ms
06/19/2024 03:35:33 - INFO - __main__ - Time to first token = 11.47253200178966ms
06/19/2024 03:35:34 - INFO - __main__ - Time to first token = 11.617230004048906ms

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ("DeepSpeed is a machine learning framework that provides high-performance training and inference for deep learning models. It is built on top of PyTorch, a popular deep learning library, and leverages NVIDIA's GPU acceleration through the use of NVIDIA's Deep Learning SDK (cuDNN).\n\nDeepSpeed introduces several key features to improve performance:\n\n1. **Zero-Shot Optimization**: DeepSpeed supports dynamic mixed-precision training, allowing you to train with lower precision (e.g., FP16) while still achieving high accuracy. This reduces memory",)

input 2: ('He is working on',)
output 1: ('He is working on a new____（项目）．\nproject\n\n\n\n\n getchar()函数的作用是（ ）\nA. 从键盘上读取一个字符\nB. 从键盘上读取一行数据\nC. 从文件中读取一个字符\nD. 从文件中读取一行数据\n\n答案是A，从键盘上读取一个字符。`getchar()`函数通常用于C语言等标准输入流中，每次',)

input 3: ('He has a',)
output 1: ('He has a lot of work to do, so he can’t go out with you tonight. A. mustn’t B. needn’t C. can’t D. shouldn’t\n\n句意：他有许多工作要做，所以他今晚不能和你出去。can’t意为“不能”，符合题意，故选C。\n\nC智慧职教: 在Word中，要将文档中的所有段落首行缩进2个字符，应使用（ ）命令。智慧职',)

input 4: ('He got all',)
output 1: ('He got all the facts right, ________ one small mistake.\nA. except for\nB. besides\nC. beside\nD. except\n答案：A\n解析：except for意为“除了……之外”，后接名词或代词，表示整体上肯定，局部有例外；besides意为“除了……之外还有……”；beside意为“在旁边”；except意为“除了……之外”，后面不接宾语。根据句意“他把所有的',)


Stats:
-------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 518.894516133132 tokens/second
Number of HPU graphs                = 17
Memory allocated                    = 15.1 GB
Max memory allocated                = 15.17 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 2.9683832399896346 seconds
-------------------------------------------------------------------------------------------------------------

w/o this PR

06/19/2024 03:28:36 - INFO - __main__ - Time to first token = 376.61240801389795ms
Warming up
06/19/2024 03:28:37 - INFO - __main__ - Time to first token = 48.32350900687743ms
Warming up
06/19/2024 03:28:38 - INFO - __main__ - Time to first token = 11.489711978356354ms
06/19/2024 03:28:38 - INFO - __main__ - Running generate...
06/19/2024 03:28:38 - INFO - __main__ - Time to first token = 11.71710601192899ms
06/19/2024 03:28:39 - INFO - __main__ - Time to first token = 11.269912996795028ms
06/19/2024 03:28:40 - INFO - __main__ - Time to first token = 11.530675008543767ms
06/19/2024 03:28:41 - INFO - __main__ - Time to first token = 11.291447997791693ms
06/19/2024 03:28:41 - INFO - __main__ - Time to first token = 11.283341998932883ms

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ("DeepSpeed is a machine learning framework that provides high-performance training and inference for deep learning models. It is built on top of PyTorch, a popular deep learning library, and leverages NVIDIA's GPU acceleration through the use of NVIDIA's Deep Learning SDK (cuDNN).\n\nDeepSpeed introduces several key features to improve performance:\n\n1. **Zero-Shot Optimization**: DeepSpeed supports dynamic mixed-precision training, allowing you to train with lower precision (e.g., FP16) while still achieving high accuracy. This reduces memory",)

input 2: ('He is working on',)
output 1: ('He is working on a new____（项目）．\nproject\n\n\n\n\n getchar()函数的作用是（ ）\nA. 从键盘上读取一个字符，存储到指定的字符变量中\nB. 从键盘上读取一行数据，存储到指定的字符串变量中\nC. 从屏幕输出一个字符\nD. 从文件中读取一个字符\n\n答案：A\n食管癌的典型症状是(',)

input 3: ('He has a',)
output 1: ('He has a lot of work to do, so he can’t go out with you tonight. A. must B. can C. need D. may\n\n句意：他有许多工作要做，所以他今晚不能和你出去。A. 必须；B. 能；C. 需要；D. 可能。根据句意可知，这里表示“不能”，故选B。\n\nB智慧职教: 2018年1月1日，甲',)

input 4: ('He got all',)
output 1: ('He got all the facts right, ______ some important details. A．except for B．except that C．but for D．but\n\n正确答案是A．except for后面接的词或短语通常表示整体中除去的部分，而except后面接的是从整体中被排除出去的细节，故选A． getchar()函数的作用是（ ）。\n\n从标准输入设备读取一个字符并返回智慧职教: 在Word2010中，要将文档',)


Stats:
--------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 518.6809485476322 tokens/second
Number of HPU graphs                = 17
Memory allocated                    = 15.1 GB
Max memory allocated                = 15.17 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 2.979208919001394 seconds
--------------------------------------------------------------------------------------------------------------
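
For readers who want to compare the two runs numerically, here is a minimal sketch that pulls the steady-state "Time to first token" values out of the logs above and computes the relative throughput change. It is not part of the PR or of run_generation.py; the helper names (`steady_state_ttft`, `relative_change_pct`) are made up for illustration, and the numbers are copied from the logs and stats above. Each configuration was run once here, so small differences may simply be run-to-run noise.

```python
import re
from statistics import mean

# Log excerpts copied from the two runs above (measured iterations only).
WITH_PR_LOG = """\
06/19/2024 03:35:30 - INFO - __main__ - Running generate...
06/19/2024 03:35:31 - INFO - __main__ - Time to first token = 11.604846993577667ms
06/19/2024 03:35:31 - INFO - __main__ - Time to first token = 11.934215988731012ms
06/19/2024 03:35:32 - INFO - __main__ - Time to first token = 11.152709004818462ms
06/19/2024 03:35:33 - INFO - __main__ - Time to first token = 11.47253200178966ms
06/19/2024 03:35:34 - INFO - __main__ - Time to first token = 11.617230004048906ms
"""

WITHOUT_PR_LOG = """\
06/19/2024 03:28:38 - INFO - __main__ - Running generate...
06/19/2024 03:28:38 - INFO - __main__ - Time to first token = 11.71710601192899ms
06/19/2024 03:28:39 - INFO - __main__ - Time to first token = 11.269912996795028ms
06/19/2024 03:28:40 - INFO - __main__ - Time to first token = 11.530675008543767ms
06/19/2024 03:28:41 - INFO - __main__ - Time to first token = 11.291447997791693ms
06/19/2024 03:28:41 - INFO - __main__ - Time to first token = 11.283341998932883ms
"""

# Throughput values (tokens/second) copied from the two "Stats" blocks.
WITH_PR_THROUGHPUT = 518.894516133132
WITHOUT_PR_THROUGHPUT = 518.6809485476322

TTFT_RE = re.compile(r"Time to first token = ([0-9.]+)ms")


def steady_state_ttft(log_text):
    """Return the TTFT values (ms) logged after the 'Running generate...' line,
    which skips the warm-up iterations."""
    _, _, measured = log_text.partition("Running generate...")
    return [float(value) for value in TTFT_RE.findall(measured)]


def relative_change_pct(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


with_pr_ttft = mean(steady_state_ttft(WITH_PR_LOG))
without_pr_ttft = mean(steady_state_ttft(WITHOUT_PR_LOG))
throughput_delta = relative_change_pct(WITH_PR_THROUGHPUT, WITHOUT_PR_THROUGHPUT)

print(f"mean TTFT w/ PR:  {with_pr_ttft:.3f} ms")
print(f"mean TTFT w/o PR: {without_pr_ttft:.3f} ms")
print(f"throughput change: {throughput_delta:+.3f} %")
```

Splitting on the "Running generate..." line keeps the warm-up iterations (which include graph compilation) out of the average, so only the five measured iterations of each run are compared.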

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

libinta · 2024-06-25T01:28:34Z

@Zhiwei35 can you please add a test case too?

regisss · 2024-07-03T17:42:38Z

@Zhiwei35 Please run make style

HuggingFaceDocBuilderDev · 2024-07-03T17:44:24Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Zhiwei35 · 2024-07-09T09:51:57Z

@libinta @regisss A test case has been added and the style errors are fixed.

enhance qwen2 with fastsoftmax and bf16 rope and cache optimization

507bdc3

Zhiwei35 requested a review from regisss as a code owner June 19, 2024 03:50

libinta approved these changes Jun 25, 2024

libinta added the run-test label (Run CI for PRs from external contributors) Jun 26, 2024

mgonchar reviewed Jul 3, 2024

optimum/habana/transformers/models/qwen2/modeling_qwen2.py

Zhiwei35 added 2 commits July 9, 2024 09:18

add test case

f0384d6

Merge branch 'main' into enhance_qwen2

ce796d0

fix style error

0ee5ce4

regisss approved these changes Jul 10, 2024

regisss merged commit 5660db6 into huggingface:main Jul 10, 2024
2 of 3 checks passed

abhatkal mentioned this pull request Jul 22, 2024

Starcoder2 : KVCache and flash attention (FusedSDPA) enablement #1149

Merged
