Enhance Qwen2 with fastsoftmax and bf16 RoPE and cache optimization #1087

Merged
regisss merged 4 commits into huggingface:main on Jul 10, 2024

Zhiwei35
Contributor

What does this PR do?

This PR syncs the Qwen2 implementation with the Llama one and adds the following three features:

  1. fastsoftmax
  2. cache optimization
  3. bf16 RoPE
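
For readers unfamiliar with the third item, the following is a minimal sketch of what rotary position embedding (RoPE) computes, written in plain Python in the common "rotate half" formulation used by Llama-style models. It is not the code from this PR: the function names, the `base=10000.0` default, and the use of Python floats are illustrative assumptions, and the sketch does not model bf16 arithmetic or any HPU kernel.

```python
import math


def rotate_half(x):
    """Map a vector split into halves [x1, x2] to [-x2, x1]."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rope(x, position, base=10000.0):
    """Rotate one head vector `x` (even length) for token index `position`.

    `base` is an illustrative default, not a value taken from this PR.
    """
    dim = len(x)
    # One frequency per pair of dimensions: base^(-2i/dim), i = 0 .. dim/2 - 1.
    inv_freq = [base ** (-2.0 * i / dim) for i in range(dim // 2)]
    # Each angle is shared by dimension i and dimension i + dim/2.
    angles = [position * f for f in inv_freq] * 2
    rotated = rotate_half(x)
    return [
        x[i] * math.cos(angles[i]) + rotated[i] * math.sin(angles[i])
        for i in range(dim)
    ]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Attention scores depend only on the relative offset (2 in both cases).
score_a = dot(apply_rope(q, 3), apply_rope(k, 1))
score_b = dot(apply_rope(q, 5), apply_rope(k, 3))
print(abs(score_a - score_b) < 1e-9)
```

Because the rotation depends only on cosine and sine tables indexed by position, the data type in which those tables and the multiply-adds are held is an implementation choice; "bf16 RoPE" presumably refers to that choice on HPU.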

Validation results

  1. Command
python run_generation.py --model_name_or_path /home/Qwen1.5-7B-Chat/ --use_kv_cache --max_new_tokens 100 --bf16 --batch_size 4 --use_hpu_graph --trim_logits --bucket_size 128 --bucket_internal --reuse_cache --use_flash_attention --flash_attention_fast_softmax
  1. w/ this PR
06/19/2024 03:35:28 - INFO - __main__ - Time to first token = 372.23238800652325ms
Warming up
06/19/2024 03:35:29 - INFO - __main__ - Time to first token = 48.07302700646687ms
Warming up
06/19/2024 03:35:30 - INFO - __main__ - Time to first token = 12.022547001834027ms
06/19/2024 03:35:30 - INFO - __main__ - Running generate...
06/19/2024 03:35:31 - INFO - __main__ - Time to first token = 11.604846993577667ms
06/19/2024 03:35:31 - INFO - __main__ - Time to first token = 11.934215988731012ms
06/19/2024 03:35:32 - INFO - __main__ - Time to first token = 11.152709004818462ms
06/19/2024 03:35:33 - INFO - __main__ - Time to first token = 11.47253200178966ms
06/19/2024 03:35:34 - INFO - __main__ - Time to first token = 11.617230004048906ms

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ("DeepSpeed is a machine learning framework that provides high-performance training and inference for deep learning models. It is built on top of PyTorch, a popular deep learning library, and leverages NVIDIA's GPU acceleration through the use of NVIDIA's Deep Learning SDK (cuDNN).\n\nDeepSpeed introduces several key features to improve performance:\n\n1. **Zero-Shot Optimization**: DeepSpeed supports dynamic mixed-precision training, allowing you to train with lower precision (e.g., FP16) while still achieving high accuracy. This reduces memory",)

input 2: ('He is working on',)
output 1: ('He is working on a new____(项目).\nproject\n\n\n\n\n getchar()函数的作用是( )\nA. 从键盘上读取一个字符\nB. 从键盘上读取一行数据\nC. 从文件中读取一个字符\nD. 从文件中读取一行数据\n\n答案是A,从键盘上读取一个字符。`getchar()`函数通常用于C语言等标准输入流中,每次',)

input 3: ('He has a',)
output 1: ('He has a lot of work to do, so he can’t go out with you tonight. A. mustn’t B. needn’t C. can’t D. shouldn’t\n\n句意:他有许多工作要做,所以他今晚不能和你出去。can’t意为“不能”,符合题意,故选C。\n\nC智慧职教: 在Word中,要将文档中的所有段落首行缩进2个字符,应使用( )命令。智慧职',)

input 4: ('He got all',)
output 1: ('He got all the facts right, ________ one small mistake.\nA. except for\nB. besides\nC. beside\nD. except\n答案:A\n解析:except for意为“除了……之外”,后接名词或代词,表示整体上肯定,局部有例外;besides意为“除了……之外还有……”;beside意为“在旁边”;except意为“除了……之外”,后面不接宾语。根据句意“他把所有的',)


Stats:
-------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 518.894516133132 tokens/second
Number of HPU graphs                = 17
Memory allocated                    = 15.1 GB
Max memory allocated                = 15.17 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 2.9683832399896346 seconds
-------------------------------------------------------------------------------------------------------------
  1. w/o this PR
06/19/2024 03:28:36 - INFO - __main__ - Time to first token = 376.61240801389795ms
Warming up
06/19/2024 03:28:37 - INFO - __main__ - Time to first token = 48.32350900687743ms
Warming up
06/19/2024 03:28:38 - INFO - __main__ - Time to first token = 11.489711978356354ms
06/19/2024 03:28:38 - INFO - __main__ - Running generate...
06/19/2024 03:28:38 - INFO - __main__ - Time to first token = 11.71710601192899ms
06/19/2024 03:28:39 - INFO - __main__ - Time to first token = 11.269912996795028ms
06/19/2024 03:28:40 - INFO - __main__ - Time to first token = 11.530675008543767ms
06/19/2024 03:28:41 - INFO - __main__ - Time to first token = 11.291447997791693ms
06/19/2024 03:28:41 - INFO - __main__ - Time to first token = 11.283341998932883ms

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ("DeepSpeed is a machine learning framework that provides high-performance training and inference for deep learning models. It is built on top of PyTorch, a popular deep learning library, and leverages NVIDIA's GPU acceleration through the use of NVIDIA's Deep Learning SDK (cuDNN).\n\nDeepSpeed introduces several key features to improve performance:\n\n1. **Zero-Shot Optimization**: DeepSpeed supports dynamic mixed-precision training, allowing you to train with lower precision (e.g., FP16) while still achieving high accuracy. This reduces memory",)

input 2: ('He is working on',)
output 1: ('He is working on a new____(项目).\nproject\n\n\n\n\n getchar()函数的作用是( )\nA. 从键盘上读取一个字符,存储到指定的字符变量中\nB. 从键盘上读取一行数据,存储到指定的字符串变量中\nC. 从屏幕输出一个字符\nD. 从文件中读取一个字符\n\n答案:A\n食管癌的典型症状是(',)

input 3: ('He has a',)
output 1: ('He has a lot of work to do, so he can’t go out with you tonight. A. must B. can C. need D. may\n\n句意:他有许多工作要做,所以他今晚不能和你出去。A. 必须;B. 能;C. 需要;D. 可能。根据句意可知,这里表示“不能”,故选B。\n\nB智慧职教: 2018年1月1日,甲',)

input 4: ('He got all',)
output 1: ('He got all the facts right, ______ some important details. A.except for B.except that C.but for D.but\n\n正确答案是A.except for后面接的词或短语通常表示整体中除去的部分,而except后面接的是从整体中被排除出去的细节,故选A. getchar()函数的作用是( )。\n\n从标准输入设备读取一个字符并返回智慧职教: 在Word2010中,要将文档',)


Stats:
--------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 518.6809485476322 tokens/second
Number of HPU graphs                = 17
Memory allocated                    = 15.1 GB
Max memory allocated                = 15.17 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 2.979208919001394 seconds
--------------------------------------------------------------------------------------------------------------
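
To make the two runs above easier to compare, here is a small sketch that averages the time-to-first-token values logged after the "Running generate..." line and computes the relative throughput difference. The log excerpts and throughput figures are copied from the output above; the helper name `steady_state_ttft` and the choice to ignore warm-up iterations are assumptions made for illustration.

```python
import re
from statistics import mean

TTFT_RE = re.compile(r"Time to first token = ([0-9.]+)ms")


def steady_state_ttft(log_text):
    """Return TTFT values (ms) logged after the 'Running generate...' line."""
    _, _, measured = log_text.partition("Running generate...")
    return [float(v) for v in TTFT_RE.findall(measured)]


WITH_PR_LOG = """\
06/19/2024 03:35:30 - INFO - __main__ - Time to first token = 12.022547001834027ms
06/19/2024 03:35:30 - INFO - __main__ - Running generate...
06/19/2024 03:35:31 - INFO - __main__ - Time to first token = 11.604846993577667ms
06/19/2024 03:35:31 - INFO - __main__ - Time to first token = 11.934215988731012ms
06/19/2024 03:35:32 - INFO - __main__ - Time to first token = 11.152709004818462ms
06/19/2024 03:35:33 - INFO - __main__ - Time to first token = 11.47253200178966ms
06/19/2024 03:35:34 - INFO - __main__ - Time to first token = 11.617230004048906ms
"""

WITHOUT_PR_LOG = """\
06/19/2024 03:28:38 - INFO - __main__ - Time to first token = 11.489711978356354ms
06/19/2024 03:28:38 - INFO - __main__ - Running generate...
06/19/2024 03:28:38 - INFO - __main__ - Time to first token = 11.71710601192899ms
06/19/2024 03:28:39 - INFO - __main__ - Time to first token = 11.269912996795028ms
06/19/2024 03:28:40 - INFO - __main__ - Time to first token = 11.530675008543767ms
06/19/2024 03:28:41 - INFO - __main__ - Time to first token = 11.291447997791693ms
06/19/2024 03:28:41 - INFO - __main__ - Time to first token = 11.283341998932883ms
"""

with_pr = steady_state_ttft(WITH_PR_LOG)
without_pr = steady_state_ttft(WITHOUT_PR_LOG)

# Throughput (tokens/second) as reported in the two Stats blocks.
throughput_with = 518.894516133132
throughput_without = 518.6809485476322
delta_pct = (throughput_with / throughput_without - 1) * 100

print(f"mean TTFT with PR:    {mean(with_pr):.2f} ms")
print(f"mean TTFT without PR: {mean(without_pr):.2f} ms")
print(f"throughput change:    {delta_pct:+.2f}%")
```

On these logs the mean steady-state time to first token is about 11.56 ms with the PR and about 11.42 ms without it, and throughput differs by roughly 0.04%. With only five samples per run, this suggests the two configurations perform at parity for this command rather than showing a measurable speedup or regression.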

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@Zhiwei35 Zhiwei35 requested a review from regisss as a code owner June 19, 2024 03:50
@libinta
Collaborator

libinta commented Jun 25, 2024

@Zhiwei35 can you please add a test case too?

@libinta libinta added the run-test Run CI for PRs from external contributors label Jun 26, 2024
@regisss
Collaborator

regisss commented Jul 3, 2024

@Zhiwei35 Please run make style

@Zhiwei35
Contributor Author

Zhiwei35 commented Jul 9, 2024

@libinta @regisss A test case has been added and the style errors are fixed.

@regisss regisss merged commit 5660db6 into huggingface:main Jul 10, 2024
2 of 3 checks passed