GPT-J support reuse_cache #1094
Conversation
RoPE and FusedSDPA enabling will be committed in other PRs.
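For readers new to the flag: `reuse_cache` preallocates the key/value cache once at its maximum size and writes into it in place each decode step, instead of growing it by concatenation. Below is a minimal PyTorch sketch of that idea; the `StaticKVCache` class and its dimensions are illustrative inventions, not the optimum-habana implementation.

```python
# Minimal sketch of the reuse_cache idea (illustrative only, NOT the
# optimum-habana implementation): allocate the KV cache once at its
# maximum size and write into it in place every step.
import torch

class StaticKVCache:
    def __init__(self, batch, heads, max_len, head_dim, dtype=torch.bfloat16):
        # One allocation for the whole generation loop; reused across steps.
        self.k = torch.zeros(batch, heads, max_len, head_dim, dtype=dtype)
        self.v = torch.zeros_like(self.k)

    def update(self, k_step, v_step, pos):
        # In-place write at the current position -- no reallocation, so
        # peak memory stays flat over the generation loop.
        self.k[:, :, pos : pos + k_step.shape[2]] = k_step
        self.v[:, :, pos : pos + v_step.shape[2]] = v_step
        return self.k, self.v

# Usage: prefill writes the prompt's KV at pos=0, then each decode step
# writes a single position. Dims loosely follow GPT-J 6B (16 heads, 256
# head_dim) but are arbitrary here.
cache = StaticKVCache(batch=2, heads=16, max_len=256, head_dim=256)
k0 = torch.randn(2, 16, 128, 256, dtype=torch.bfloat16)
cache.update(k0, k0, pos=0)    # prefill (input length 128)
k1 = torch.randn(2, 16, 1, 256, dtype=torch.bfloat16)
cache.update(k1, k1, pos=128)  # first decode step
```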
@atakaha, can you provide performance numbers for the configurations below and add a test case? Model | Batch Size | TP (1) | Input Length | Output Length
@libinta, here is the performance table. All runs have reuse_cache enabled.

| Model | Batch Size | Input Length | Output Length | Precision | Throughput (tokens/sec) | Mem Alloc (GB) | Max Mem Alloc (GB) |
|---|---|---|---|---|---|---|---|
| GPT-J 6B | 512 | 128 | 128 | bf16 | 8076.863 | 73.87 | 88.08 |
| GPT-J 6B | 512 | 128 | 128 | fp8 | 14694.365 | 40.42 | 66.34 |
| GPT-J 6B | 32 | 128 | 2048 | bf16 | 1561.541 | 41.34 | 42.71 |
| GPT-J 6B | 32 | 128 | 2048 | fp8 | 2859.099 | 21.04 | 21.88 |
| GPT-J 6B | 32 | 2048 | 128 | bf16 | 765.932 | 49.63 | 76.06 |
| GPT-J 6B | 32 | 2048 | 128 | fp8 | 1257.655 | 26.9 | 52.82 |
| GPT-J 6B | 16 | 2048 | 2048 | bf16 | 799.621 | 43.58 | 56.59 |
| GPT-J 6B | 16 | 2048 | 2048 | fp8 | 1450.878 | 22.92 | 35.68 |
@atakaha, how did you run the benchmarks for this performance table? I see some mismatch. For example, GPT-J 6B with BS=16 and IN/OUT=2048 uses more than 80 GB of memory. Could you share your command?
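For context, a back-of-the-envelope estimate of the KV-cache footprint for the disputed configuration. The model dimensions are from the public GPT-J 6B config (28 layers, 4096 hidden); the arithmetic is illustrative only and does not account for weights, activations, or allocator overhead.

```python
# KV-cache footprint for GPT-J 6B, batch 16, 2048-in/2048-out, bf16.
n_layers, hidden, batch = 28, 4096, 16
seq_len = 2048 + 2048          # prompt + generated tokens
bytes_per_elem = 2             # bf16
kv_bytes = 2 * n_layers * batch * seq_len * hidden * bytes_per_elem  # K and V
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB")   # ~28.0 GiB
# With reuse_cache the cache is allocated once at this size; without it,
# step-by-step concatenation can transiently hold old+new copies, which
# is one way peak usage can climb well past the steady-state figure.
```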
@yafshar
It seems the bf16 2048/2048 numbers are not as good as A100 fp16 (882 tokens/sec; see TensorRT-LLM/docs/source/performance/perf-overview.md at main · NVIDIA/TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-overview.md), and fp8 is only half of H100.
There are more optimizations that can be ported; please check other models such as Llama, Falcon, Mistral, and Qwen2.
@libinta, for further optimization, we are going to add FusedSDPA and other improvements after this PR. Is this OK?
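For reference, FusedSDPA on Gaudi plays the same role as PyTorch's stock `scaled_dot_product_attention`: one fused kernel for the whole attention computation. The sketch below uses the standard PyTorch API as a stand-in; the HPU-specific kernel's import path and signature are deliberately not shown here.

```python
# Sketch of the attention computation a fused-SDPA kernel replaces,
# expressed with the stock PyTorch API for illustration. On Gaudi,
# FusedSDPA is the drop-in fused equivalent of this call.
import torch
import torch.nn.functional as F

batch, heads, q_len, kv_len, head_dim = 2, 16, 1, 2048, 256
q = torch.randn(batch, heads, q_len, head_dim)
k = torch.randn(batch, heads, kv_len, head_dim)
v = torch.randn(batch, heads, kv_len, head_dim)

# One fused op for softmax(q @ k^T / sqrt(d)) @ v, instead of separate
# matmul/softmax/matmul kernels -- the fusion is where the win comes from.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 16, 1, 256])
```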
Force-pushed from a12a2ee to f7d6343.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Signed-off-by: Takahashi, Akihiro <[email protected]>
Co-authored-by: Lau, Kiangpeng <[email protected]>
Co-authored-by: Yaser Afshar <[email protected]>
GPT-J support reuse_cache huggingface#1094
LGTM