
gpt_big_code: make flash attention impl quantization friendly #1282

Merged

Conversation

mgonchar
Contributor

  • introduce the GaudiGPTBigCodeAttention class
  • wrap the FusedSDPA kernel call in a separate ModuleFusedSDPA class (see the sketch below)
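A minimal sketch of the wrapping idea, for context (this is an illustration rather than the exact code in this PR; the FusedSDPA import path, the forward signature, and the attribute name are assumptions based on similar wrappers in this repository):

import torch

# Gaudi fused scaled-dot-product-attention kernel (import path assumed for illustration)
from habana_frameworks.torch.hpex.kernels import FusedSDPA


class ModuleFusedSDPA(torch.nn.Module):
    """Thin nn.Module wrapper around the functional FusedSDPA kernel."""

    def __init__(self, fused_kernel=FusedSDPA):
        super().__init__()
        self._hpu_kernel_fsdpa = fused_kernel

    def forward(self, query, key, value, attn_mask, dropout_p, is_causal, scale):
        # Delegate to the fused kernel; because the wrapper is an nn.Module, the
        # call site is now a submodule that quantization tooling can discover and patch.
        return self._hpu_kernel_fsdpa.apply(query, key, value, attn_mask, dropout_p, is_causal, scale)


# GaudiGPTBigCodeAttention would then hold the wrapper as a submodule, e.g.
#   self.fused_scaled_dot_product_attention = ModuleFusedSDPA(FusedSDPA)
# and call it from forward() instead of invoking the kernel function directly.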

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@mgonchar mgonchar force-pushed the main_gpt_bigcode_quant_friendly_fsdpa branch from bb9418e to 31b1591 Compare August 21, 2024 12:38
@imangohari1
Contributor

imangohari1 commented Aug 21, 2024

@mgonchar
this PR needs testing results for both the transformers and language-modeling tests, similar to what was done in #1234.
Please run these and make sure both functionality and performance are intact.

@mgonchar
Contributor Author

> @mgonchar this PR needs testing results for both the transformers and language-modeling tests, similar to what was done in #1234. Please run these and make sure both functionality and performance are intact.

Hi @imangohari1, the PR you are referring to fixes two particular test cases for a particular model. Which tests exactly do you want me to run?

@imangohari1
Contributor

imangohari1 commented Aug 22, 2024

> > @mgonchar this PR needs testing results for both the transformers and language-modeling tests, similar to what was done in #1234. Please run these and make sure both functionality and performance are intact.
>
> Hi @imangohari1, the PR you are referring to fixes two particular test cases for a particular model. Which tests exactly do you want me to run?

  1. Please modify the code here for gpt_bigcode and/or find a case where we can test these changes.
  2. Please run `GAUDI2_CI=1 RUN_SLOW=true python -m pytest tests/transformers/tests/models/ -s -v` before and after the changes and make sure no new failures are introduced.

@mgonchar
Contributor Author

mgonchar commented Aug 22, 2024

> 1. Please modify the code here for gpt_bigcode and/or find a case where we can test these changes.
>
> 2. Please run `GAUDI2_CI=1 RUN_SLOW=true python -m pytest tests/transformers/tests/models/ -s -v` before and after the changes and make sure no new failures are introduced.

Sorry, I don't quite understand what you mean in item (1); which code should I modify to test it?

I did manual runs and will provide logs for (2).

@imangohari1
Contributor

> Sorry, I don't quite understand what you mean in item (1); which code should I modify to test it?
>
> I did manual runs and will provide logs for (2).

Thanks.
#1234 (comment)
I would recommend modifying this ^ or finding a similar case where we can test these changes.

@mgonchar
Contributor Author

mgonchar commented Aug 23, 2024

> Thanks. #1234 (comment) I would recommend modifying this ^ or finding a similar case where we can test these changes.

Hi @imangohari1, here are the results:

My git history during the test:

31b1591c (HEAD -> main) gpt_big_code: make flash attention impl quantization friendly
afa3d221 (origin/main, origin/HEAD) Fix cache position issue in mixtral (#1272)
  1. I used test_text_generation_example, which I previously updated with starcoder-specific throughput data; no change in pass rate with 1360e94b or with 485908be.

  2. Launched tests/transformers/tests/models/ as you suggested:

31b1591c 25 failed, 874 passed, 326 skipped, 1 xpassed, 59 warnings
afa3d221 27 failed, 872 passed, 326 skipped, 1 xpassed, 58 warnings

So the results actually get slightly better with my change :)

In fact, I believe this is just noise in the pass rates; the delta consists of these tests, which are not related to my change at all:

FAILED tests/transformers/tests/models/wav2vec2/test_modeling_wav2vec2.py::Wav2Vec2ModelTest::test_save_load
FAILED tests/transformers/tests/models/wav2vec2/test_modeling_wav2vec2.py::Wav2Vec2RobustModelTest::test_ctc_train

@mgonchar mgonchar force-pushed the main_gpt_bigcode_quant_friendly_fsdpa branch from 31b1591 to 745f74d Compare August 29, 2024 15:07
@mgonchar
Contributor Author

Rebased and retested; the results are the same.

@imangohari1
Contributor

@mgonchar
Thanks.
I am still trying to understand how we can test these changes.
Can you point me to some tests you ran from the example folders with/without this change?

@mgonchar
Contributor Author

mgonchar commented Sep 10, 2024

Hi @imangohari1, this change does not do much: it just wraps the fused kernel call in a separate class that inherits from nn.Module, that's basically it.
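
(As a generic illustration of why the nn.Module boundary helps quantization, and not code from this PR: once the kernel call sits behind an nn.Module, a module-swapping flow can find and replace it like any other submodule. QuantizedFusedSDPA below is a hypothetical placeholder.)

import torch


def swap_submodules(model: torch.nn.Module, target_cls, make_replacement):
    # Walk the module tree and replace every instance of target_cls with the
    # object produced by make_replacement(child), e.g. a quantized or
    # measurement-instrumented version of the fused SDPA wrapper.
    for name, child in model.named_children():
        if isinstance(child, target_cls):
            setattr(model, name, make_replacement(child))
        else:
            swap_submodules(child, target_cls, make_replacement)


# Hypothetical usage:
#   swap_submodules(model, ModuleFusedSDPA, lambda m: QuantizedFusedSDPA(m))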

To test it you may try to launch

python run_generation.py \
    --model_name_or_path bigcode/starcoderbase-3b \
    --use_hpu_graphs \
    --use_kv_cache \
    --batch_size 42 \
    --max_new_tokens 1024 \
    --do_sample \
    --bf16 \
    --use_flash_attention

and

python run_generation.py \
    --model_name_or_path bigcode/starcoderbase-3b \
    --use_hpu_graphs \
    --use_kv_cache \
    --batch_size 42 \
    --max_new_tokens 1024 \
    --do_sample \
    --bf16

In the second case flash attention is not used and my code is not involved in the computation. To test for possible side effects of this change, you may run

python run_generation.py \
    --model_name_or_path bigcode/starcoderbase-3b \
    --use_hpu_graphs \
    --use_kv_cache \
    --batch_size 42 \
    --max_new_tokens 1024 \
    --do_sample \
    --bf16 \
    --use_flash_attention

with and without my commit. There is also a test covering a gpt_bigcode-based model, and it passes as well.

I ran all of this during my manual tests and everything worked.

@imangohari1
Contributor

> Hi @imangohari1, this change does not do much: it just wraps the fused kernel call in a separate class that inherits from nn.Module [...] I ran all of this during my manual tests and everything worked.

I have tested these commands before and after the change. I think they are fine.
Please respond to @jiminha's comments.

P.S. This PR has been tested with an integration branch and the results looked fine. CI #255
https://github.com/huggingface/optimum-habana/compare/main...imangohari1:optimum-habana:ig/ci_20240911?expand=1

@mgonchar
Contributor Author

Hi @imangohari1, thanks for having a look. There are no comments from @jiminha in this PR; you've probably mixed it up with another one here. If you've tested it, I think this PR should be good to go to CI and merge. Can you please trigger it?

@imangohari1 (Contributor) left a comment

LGTM!
The PR has been tested both locally and with integration CI #255.

@libinta added the run-test (Run CI for PRs from external contributors) label and removed the review, wip labels Sep 24, 2024
@regisss
Collaborator

regisss commented Sep 25, 2024

Please run `make style`

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

- introduce GaudiGPTBigCodeAttention class
- wrap FusedSDPA kernel in a separate ModuleFusedSDPA class
@mgonchar mgonchar force-pushed the main_gpt_bigcode_quant_friendly_fsdpa branch from 745f74d to 88ad54e Compare September 25, 2024 11:40
@mgonchar
Contributor Author

@regisss rebased, retested, style corrected

@regisss regisss merged commit c31dfab into huggingface:main Sep 25, 2024
3 of 4 checks passed
@mgonchar mgonchar deleted the main_gpt_bigcode_quant_friendly_fsdpa branch September 25, 2024 16:02