
GPT-J support reuse_cache #1094

Merged · 3 commits merged into huggingface:main from gptj on Aug 2, 2024

Conversation

@atakaha (Contributor) commented on Jun 24, 2024

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@atakaha (Contributor, Author) commented on Jun 24, 2024

RoPE and FusedSDPA enablement will be committed in separate PRs.
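For context, here is a minimal conceptual sketch of the reuse_cache idea (an illustration only: the class name and shapes below are made up and this is not the GPT-J code changed by this PR). The KV cache is allocated once at its maximum length and updated in place at each decode step, instead of being re-concatenated per token, which also keeps tensor shapes static.

```python
import torch

# Hypothetical illustration of a reused, pre-allocated KV cache.
class StaticKVCache:
    def __init__(self, batch, n_heads, max_len, head_dim, dtype=torch.bfloat16):
        # Allocate the full-size key/value buffers once, up front.
        shape = (batch, n_heads, max_len, head_dim)
        self.key = torch.zeros(shape, dtype=dtype)
        self.value = torch.zeros(shape, dtype=dtype)

    def update(self, new_key, new_value, pos):
        # Write the new tokens' K/V in place at position `pos`; the same buffers
        # are reused for every decode step, so there is no per-step reallocation
        # or concatenation.
        length = new_key.shape[2]
        self.key[:, :, pos : pos + length] = new_key
        self.value[:, :, pos : pos + length] = new_value
        return self.key[:, :, : pos + length], self.value[:, :, : pos + length]
```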

@libinta (Collaborator) commented on Jun 25, 2024

@atakaha, can you provide performance numbers for the configurations below and add a test case?

| Model | Batch Size | TP | Input Length | Output Length |
|---|---|---|---|---|
| GPT-J 6B | 512 | 1 | 128 | 128 |
| GPT-J 6B | 32 | 1 | 128 | 2048 |
| GPT-J 6B | 32 | 1 | 2048 | 128 |
| GPT-J 6B | 16 | 1 | 2048 | 2048 |

@atakaha (Contributor, Author) commented on Jun 25, 2024

@libinta, here is the performance table. All runs have reuse_cache enabled.

| Model | Batch Size | Input Length | Output Length | Precision | Throughput (tokens/sec) | Mem Alloc (GB) | Max Mem Alloc (GB) |
|---|---|---|---|---|---|---|---|
| GPT-J 6B | 512 | 128 | 128 | bf16 | 8076.863 | 73.87 | 88.08 |
| GPT-J 6B | 512 | 128 | 128 | fp8 | 14694.365 | 40.42 | 66.34 |
| GPT-J 6B | 32 | 128 | 2048 | bf16 | 1561.541 | 41.34 | 42.71 |
| GPT-J 6B | 32 | 128 | 2048 | fp8 | 2859.099 | 21.04 | 21.88 |
| GPT-J 6B | 32 | 2048 | 128 | bf16 | 765.932 | 49.63 | 76.06 |
| GPT-J 6B | 32 | 2048 | 128 | fp8 | 1257.655 | 26.9 | 52.82 |
| GPT-J 6B | 16 | 2048 | 2048 | bf16 | 799.621 | 43.58 | 56.59 |
| GPT-J 6B | 16 | 2048 | 2048 | fp8 | 1450.878 | 22.92 | 35.68 |

@yafshar (Contributor) commented on Jul 11, 2024

(quoting @atakaha's performance table above)

@atakaha, how did you run to get this performance table? I see some mismatch: for example, GPT-J 6B with BS=16 and IN/OUT=2048 uses more than 80 GB of memory. Would you provide me with your command?

@atakaha (Contributor, Author) commented on Jul 11, 2024

@yafshar, the sample command lines I used are:

  • measurement
    QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py --model_name_or_path EleutherAI/gpt-j-6b --use_hpu_graphs --use_kv_cache --reuse_cache --limit_hpu_graphs --max_input_tokens 128 --max_new_tokens 128 --batch_size 1 --bf16

  • bf16
    python run_generation.py --model_name_or_path EleutherAI/gpt-j-6b --use_hpu_graphs --use_kv_cache --reuse_cache --limit_hpu_graphs --max_input_tokens 2048 --max_new_tokens 2048 --batch_size 16 --bf16

  • fp8
    QUANT_CONFIG=quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path EleutherAI/gpt-j-6b --use_hpu_graphs --use_kv_cache --reuse_cache --limit_hpu_graphs --max_input_tokens 2048 --max_new_tokens 2048 --batch_size 16 --bf16
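As a rough sanity check on the throughput numbers above, here is a minimal sketch of how a tokens/sec figure can be derived from a timed generation run. It assumes throughput ≈ batch_size × max_new_tokens / generation time; this is an approximation for illustration, not the exact accounting done by run_generation.py, and `generate_fn` is a hypothetical placeholder.

```python
import time

def rough_throughput(generate_fn, batch_size: int, max_new_tokens: int) -> float:
    """Time one generation call and return generated tokens per second.

    `generate_fn` is a hypothetical placeholder for something like
    `lambda: model.generate(**inputs, max_new_tokens=max_new_tokens)`.
    """
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    # Assumes every sequence in the batch generates the full max_new_tokens.
    return batch_size * max_new_tokens / elapsed

# Example: batch_size=16 and max_new_tokens=2048 at ~1450 tokens/sec implies
# roughly 16 * 2048 / 1450 ≈ 22.6 s of generation time per run.
```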

@libinta (Collaborator) commented on Jul 11, 2024 via email

@atakaha (Contributor, Author) commented on Jul 12, 2024

@libinta, for more optimization, we are going to add FusedSDPA and other improvements after this PR. Is this OK?

@atakaha force-pushed the gptj branch 2 times, most recently from a12a2ee to f7d6343 on July 15, 2024 at 17:37
@yafshar (Contributor) left a comment

Looks good for now!

@atakaha will follow up with more optimizations in separate PRs to make the performance competitive.
@regisss, would you please check this? Since the performance is not yet competitive, please don't put it on the front page. Thanks!

@libinta added the run-test (Run CI for PRs from external contributors) and synapse1.17 (PR that should be available along with Synapse 1.17 but has no dependency on Synapse 1.17 content) labels and removed the review and wip labels on Jul 24, 2024.
@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

atakaha and others added 3 commits on July 30, 2024 at 11:14
Signed-off-by: Takahashi, Akihiro <[email protected]>
Co-authored-by: Lau, Kiangpeng <[email protected]>
Signed-off-by: Takahashi, Akihiro <[email protected]>
Co-authored-by: Lau, Kiangpeng <[email protected]>
Co-authored-by: Yaser Afshar <[email protected]>
emascarenhas added a commit to emascarenhas/optimum-habana that referenced this pull request Aug 1, 2024
vidyasiv added a commit to emascarenhas/optimum-habana that referenced this pull request Aug 2, 2024
@regisss (Collaborator) left a comment

LGTM

@regisss merged commit abc4f0e into huggingface:main on Aug 2, 2024
4 checks passed
Labels: run-test, synapse1.17
6 participants