GPT-J support reuse_cache #1094
Conversation
RoPE and FusedSDPA enabling will be committed in other PRs.
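For readers new to the flag: `reuse_cache` preallocates the key/value cache once at its maximum size and writes into it in place each decode step, instead of growing it by concatenation. Below is a minimal PyTorch sketch of that idea; the `StaticKVCache` class and its dimensions are illustrative inventions, not the optimum-habana implementation.

```python
# Minimal sketch of the reuse_cache idea (illustrative only, NOT the
# optimum-habana implementation): allocate the KV cache once at its
# maximum size and write into it in place every step.
import torch

class StaticKVCache:
    def __init__(self, batch, heads, max_len, head_dim, dtype=torch.bfloat16):
        # One allocation for the whole generation loop; reused across steps.
        self.k = torch.zeros(batch, heads, max_len, head_dim, dtype=dtype)
        self.v = torch.zeros_like(self.k)

    def update(self, k_step, v_step, pos):
        # In-place write at the current position -- no reallocation, so
        # peak memory stays flat over the generation loop.
        self.k[:, :, pos : pos + k_step.shape[2]] = k_step
        self.v[:, :, pos : pos + v_step.shape[2]] = v_step
        return self.k, self.v

# Usage: prefill writes the prompt's KV at pos=0, then each decode step
# writes a single position. Dims loosely follow GPT-J 6B (16 heads, 256
# head_dim) but are arbitrary here.
cache = StaticKVCache(batch=2, heads=16, max_len=256, head_dim=256)
k0 = torch.randn(2, 16, 128, 256, dtype=torch.bfloat16)
cache.update(k0, k0, pos=0)    # prefill (input length 128)
k1 = torch.randn(2, 16, 1, 256, dtype=torch.bfloat16)
cache.update(k1, k1, pos=128)  # first decode step
```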
@atakaha, can you provide performance numbers for the configurations below and add a test case? Model | Batch Size | TP (1) | Input Length | Output Length
@libinta, here is the performance table. All runs have reuse_cache enabled.

| Model | Batch Size | Input Length | Output Length | Precision | Throughput (tokens/sec) | Mem Alloc (GB) | Max Mem Alloc (GB) |
|---|---|---|---|---|---|---|---|
| GPT-J 6B | 512 | 128 | 128 | bf16 | 8076.863 | 73.87 | 88.08 |
| GPT-J 6B | 512 | 128 | 128 | fp8 | 14694.365 | 40.42 | 66.34 |
| GPT-J 6B | 32 | 128 | 2048 | bf16 | 1561.541 | 41.34 | 42.71 |
| GPT-J 6B | 32 | 128 | 2048 | fp8 | 2859.099 | 21.04 | 21.88 |
| GPT-J 6B | 32 | 2048 | 128 | bf16 | 765.932 | 49.63 | 76.06 |
| GPT-J 6B | 32 | 2048 | 128 | fp8 | 1257.655 | 26.9 | 52.82 |
| GPT-J 6B | 16 | 2048 | 2048 | bf16 | 799.621 | 43.58 | 56.59 |
| GPT-J 6B | 16 | 2048 | 2048 | fp8 | 1450.878 | 22.92 | 35.68 |
@atakaha, how did you run the benchmarks for this performance table? I see some mismatch. For example, GPT-J 6B with BS=16 and IN/OUT=2048 uses more than 80 GB of memory. Could you share your command?
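For context, a back-of-the-envelope estimate of the KV-cache footprint for the disputed configuration. The model dimensions are from the public GPT-J 6B config (28 layers, 4096 hidden); the arithmetic is illustrative only and does not account for weights, activations, or allocator overhead.

```python
# KV-cache footprint for GPT-J 6B, batch 16, 2048-in/2048-out, bf16.
n_layers, hidden, batch = 28, 4096, 16
seq_len = 2048 + 2048          # prompt + generated tokens
bytes_per_elem = 2             # bf16
kv_bytes = 2 * n_layers * batch * seq_len * hidden * bytes_per_elem  # K and V
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB")   # ~28.0 GiB
# With reuse_cache the cache is allocated once at this size; without it,
# step-by-step concatenation can transiently hold old+new copies, which
# is one way peak usage can climb well past the steady-state figure.
```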
@yafshar
It seems the bf16 2048/2048 numbers are not as good as A100 fp16 (882 tokens/sec; see TensorRT-LLM/docs/source/performance/perf-overview.md at main · NVIDIA/TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-overview.md), and fp8 is only half of H100.
There are more optimizations that can be ported; please check other models such as Llama, Falcon, Mistral, and Qwen2.
@libinta, for further optimization, we are going to add FusedSDPA and other improvements after this PR. Is this OK?
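For reference, FusedSDPA on Gaudi plays the same role as PyTorch's stock `scaled_dot_product_attention`: one fused kernel for the whole attention computation. The sketch below uses the standard PyTorch API as a stand-in; the HPU-specific kernel's import path and signature are deliberately not shown here.

```python
# Sketch of the attention computation a fused-SDPA kernel replaces,
# expressed with the stock PyTorch API for illustration. On Gaudi,
# FusedSDPA is the drop-in fused equivalent of this call.
import torch
import torch.nn.functional as F

batch, heads, q_len, kv_len, head_dim = 2, 16, 1, 2048, 256
q = torch.randn(batch, heads, q_len, head_dim)
k = torch.randn(batch, heads, kv_len, head_dim)
v = torch.randn(batch, heads, kv_len, head_dim)

# One fused op for softmax(q @ k^T / sqrt(d)) @ v, instead of separate
# matmul/softmax/matmul kernels -- the fusion is where the win comes from.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 16, 1, 256])
```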
Force-pushed from a12a2ee to f7d6343.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Signed-off-by: Takahashi, Akihiro <[email protected]>
Co-authored-by: Lau, Kiangpeng <[email protected]>
Co-authored-by: Yaser Afshar <[email protected]>
GPT-J support reuse_cache huggingface#1094
LGTM