
gpt_bigcode: added FusedSDPA kernel #1138

Merged: 1 commit merged into huggingface:main on Jul 29, 2024

Conversation

@mgonchar (Contributor) commented on Jul 17, 2024:

Added support for the following options in the gpt_bigcode (starcoderbase) model (a rough sketch of how such flags reach the attention call is given below):

  • use_flash_attention
  • flash_attention_recompute
  • flash_attention_fast_softmax
  • flash_attention_causal_mask
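For readers unfamiliar with these flags, here is a minimal, hypothetical sketch of how such options could be routed into the model's attention forward and dispatched to Habana's FusedSDPA kernel. This is not the PR's actual diff: the import path and the FusedSDPA.apply() argument order are assumptions, and flash_attention_recompute is omitted because it is typically toggled outside the attention call (e.g. via an HPU-side context manager).

import torch

try:
    # Assumed import path for the Habana fused SDPA kernel.
    from habana_frameworks.torch.hpex.kernels import FusedSDPA
except ImportError:
    FusedSDPA = None


def attention_forward(
    query,
    key,
    value,
    attention_mask=None,
    use_flash_attention=False,
    flash_attention_fast_softmax=False,
    flash_attention_causal_mask=False,
    scale=None,
):
    if use_flash_attention and FusedSDPA is not None:
        # With causal masking handled inside the kernel, the full attention
        # bias tensor never has to be materialized; "fast" softmax trades a
        # little accuracy for speed.
        softmax_mode = "fast" if flash_attention_fast_softmax else "None"
        return FusedSDPA.apply(
            query,
            key,
            value,
            None if flash_attention_causal_mask else attention_mask,
            0.0,                          # dropout_p
            flash_attention_causal_mask,  # is_causal
            scale,
            softmax_mode,
        )
    # Eager fallback: standard scaled dot-product attention.
    return torch.nn.functional.scaled_dot_product_attention(
        query, key, value, attn_mask=attention_mask, scale=scale
    )

The benchmark commands further down in this thread exercise both paths: the first run uses the eager fallback, the second adds --use_flash_attention.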

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@mgonchar force-pushed the gpt_bigcode_fusedsdpa branch 2 times, most recently from 80da185 to 416ad8a, on July 17, 2024 at 18:23
@mgonchar (Contributor, Author) commented:
Original implementation

python run_generation.py \
    --model_name_or_path bigcode/starcoderbase-3b \
    --use_hpu_graphs \
    --use_kv_cache \
    --batch_size 42 \
    --max_new_tokens 1024 \
    --do_sample \
    --bf16

Stats

---------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 3970.2326582228557 tokens/second
Number of HPU graphs                = 15
Memory allocated                    = 84.78 GB
Max memory allocated                = 94.62 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 64.8335093039932 seconds
---------------------------------------------------------------------------------------------------------------

FusedSDPA

python run_generation.py \
    --model_name_or_path bigcode/starcoderbase-3b \
    --use_hpu_graphs \
    --use_kv_cache \
    --batch_size 42 \
    --max_new_tokens 1024 \
    --do_sample \
    --bf16 \
    --use_flash_attention

Stats

--------------------------------------------------------------------------------------------------------------
Throughput (including tokenization) = 4175.830644970393 tokens/second
Number of HPU graphs                = 15
Memory allocated                    = 13.41 GB
Max memory allocated                = 94.45 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 75.58716204599841 seconds
--------------------------------------------------------------------------------------------------------------

@vidyasiv (Contributor) left a comment:

@mgonchar, thank you for your PR.

  • Please run make style prior to future submissions; it takes care of code formatting fixes.
  • Please verify that bigcode/starcoder, which also uses this model file, runs correctly for you. I am seeing an error RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generic failure]. for python run_generation.py --model_name_or_path bigcode/starcoder --batch_size 1 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --bf16 --use_flash_attention. I suspect this has to do with the "starcoderbase" check: it works if substituted with "starcoder", but we need a better match, perhaps because we want to avoid starcoder2 (one possible check is sketched after this comment).
  • Testing: please add a test for starcoder and starcoderbase with the flash attention options in tests/test_text_generation_example.py.

I will do another pass after we resolve these.
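Not part of the PR itself, but to illustrate the naming concern above: a minimal sketch of a check that accepts both starcoder and starcoderbase checkpoints while rejecting starcoder2, assuming the gate is keyed on the checkpoint name string (the helper name is hypothetical).

import re


def is_starcoder_v1(model_name_or_path: str) -> bool:
    # Match "starcoder" or "starcoderbase" (with optional size suffixes such
    # as "-3b"), but reject names where "starcoder" is immediately followed
    # by a digit, e.g. "starcoder2".
    return re.search(r"starcoder(?!\d)", model_name_or_path.lower()) is not None


assert is_starcoder_v1("bigcode/starcoder")
assert is_starcoder_v1("bigcode/starcoderbase-3b")
assert not is_starcoder_v1("bigcode/starcoder2-7b")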

@mgonchar force-pushed the gpt_bigcode_fusedsdpa branch 2 times, most recently from eb41c76 to 815c896, on July 18, 2024 at 23:46
@mgonchar (Contributor, Author) commented:

@vidyasiv I've updated this PR based on your feedback. Please have a look.

@vidyasiv (Contributor) left a comment:

Minor typo. I was able to run the tests and so far LGTM.

@vidyasiv (Contributor) left a comment:

Please re-run make style

@mgonchar (Contributor, Author) commented:

@vidyasiv done

@vidyasiv (Contributor) commented:

@regisss , please take a look

@libinta added the run-test (Run CI for PRs from external contributors) and synapse1.17 (PR that should be available along with Synapse 1.17 but has no dependency on Synapse 1.17 content) labels and removed the review and wip labels on Jul 24, 2024
@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

- Added support for the following options to gpt_bigcode (starcoder class of models):
  use_flash_attention,
  flash_attention_recompute,
  flash_attention_fast_softmax,
  flash_attention_causal_mask

- Updated test for starcoder model

@mgonchar (Contributor, Author) commented:

PR rebased. I've rechecked the rebased code; no regressions found. @regisss, please have a look.

@regisss (Collaborator) left a comment:

LGTM!

@regisss merged commit 59d182d into huggingface:main on Jul 29, 2024
4 checks passed
@mgonchar deleted the gpt_bigcode_fusedsdpa branch on July 29, 2024 at 22:35