
gpt_big_code: make flash attention impl quantization friendly #1282

Merged

Conversation

mgonchar
Contributor

  • introduce the GaudiGPTBigCodeAttention class
  • wrap the FusedSDPA kernel call in a separate ModuleFusedSDPA class (see the sketch below)
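A minimal sketch of the wrapping idea, for context (this is an illustration rather than the exact code in this PR; the FusedSDPA import path, the forward signature, and the attribute name are assumptions based on similar wrappers in this repository):

import torch

# Gaudi fused scaled-dot-product-attention kernel (import path assumed for illustration)
from habana_frameworks.torch.hpex.kernels import FusedSDPA


class ModuleFusedSDPA(torch.nn.Module):
    """Thin nn.Module wrapper around the functional FusedSDPA kernel."""

    def __init__(self, fused_kernel=FusedSDPA):
        super().__init__()
        self._hpu_kernel_fsdpa = fused_kernel

    def forward(self, query, key, value, attn_mask, dropout_p, is_causal, scale):
        # Delegate to the fused kernel; because the wrapper is an nn.Module, the
        # call site is now a submodule that quantization tooling can discover and patch.
        return self._hpu_kernel_fsdpa.apply(query, key, value, attn_mask, dropout_p, is_causal, scale)


# GaudiGPTBigCodeAttention would then hold the wrapper as a submodule, e.g.
#   self.fused_scaled_dot_product_attention = ModuleFusedSDPA(FusedSDPA)
# and call it from forward() instead of invoking the kernel function directly.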

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@mgonchar mgonchar force-pushed the main_gpt_bigcode_quant_friendly_fsdpa branch from bb9418e to 31b1591 Compare August 21, 2024 12:38
@imangohari1
Contributor

imangohari1 commented Aug 21, 2024

@mgonchar
this PR needs testing results for both the transformers and language-modeling tests, similar to what was done in #1234.
Please run these and make sure both functionality and performance are intact.

@mgonchar
Contributor Author

> @mgonchar this PR needs testing results for both the transformers and language-modeling tests, similar to what was done in #1234. Please run these and make sure both functionality and performance are intact.

Hi @imangohari1, the PR you are referring to fixes two particular test cases for a particular model. Which tests exactly do you want me to run?

@imangohari1
Contributor

imangohari1 commented Aug 22, 2024

> > @mgonchar this PR needs testing results for both the transformers and language-modeling tests, similar to what was done in #1234. Please run these and make sure both functionality and performance are intact.
>
> Hi @imangohari1, the PR you are referring to fixes two particular test cases for a particular model. Which tests exactly do you want me to run?

  1. Please modify the code here for gpt_bigcode and/or find a case where we can test these changes.
  2. Please run `GAUDI2_CI=1 RUN_SLOW=true python -m pytest tests/transformers/tests/models/ -s -v` before and after the changes and make sure no new failures are introduced.

@mgonchar
Contributor Author

mgonchar commented Aug 22, 2024

> 1. Please modify the code here for gpt_bigcode and/or find a case where we can test these changes.
>
> 2. Please run `GAUDI2_CI=1 RUN_SLOW=true python -m pytest tests/transformers/tests/models/ -s -v` before and after the changes and make sure no new failures are introduced.

Sorry, I don't quite understand what you mean in item (1); which code should I modify to test it?

I did manual runs and will provide logs for (2).

@imangohari1
Contributor

> Sorry, I don't quite understand what you mean in item (1); which code should I modify to test it?
>
> I did manual runs and will provide logs for (2).

Thanks.
#1234 (comment)
I would recommend modifying this ^ or finding a similar case where we can test these changes.

@mgonchar
Contributor Author

mgonchar commented Aug 23, 2024

> Thanks. #1234 (comment) I would recommend modifying this ^ or finding a similar case where we can test these changes.

Hi @imangohari1, here are the results:

My git history during the test:

31b1591c (HEAD -> main) gpt_big_code: make flash attention impl quantization friendly
afa3d221 (origin/main, origin/HEAD) Fix cache position issue in mixtral (#1272)
  1. I used test_text_generation_example, which I previously updated with starcoder-specific throughput data; no change in pass rate with 1360e94b or with 485908be.

  2. Launched tests/transformers/tests/models/ as you suggested:

31b1591c 25 failed, 874 passed, 326 skipped, 1 xpassed, 59 warnings
afa3d221 27 failed, 872 passed, 326 skipped, 1 xpassed, 58 warnings

So the results actually get slightly better with my change :)

In fact, I believe this is just noise in the pass rates; the delta consists of these tests, which are not related to my change at all:

FAILED tests/transformers/tests/models/wav2vec2/test_modeling_wav2vec2.py::Wav2Vec2ModelTest::test_save_load
FAILED tests/transformers/tests/models/wav2vec2/test_modeling_wav2vec2.py::Wav2Vec2RobustModelTest::test_ctc_train

@mgonchar mgonchar force-pushed the main_gpt_bigcode_quant_friendly_fsdpa branch from 31b1591 to 745f74d Compare August 29, 2024 15:07
@mgonchar
Contributor Author

Rebased and retested; the results are the same.

@imangohari1
Contributor

@mgonchar
Thanks.
I am still trying to understand how we can test these changes.
Can you point me to some tests you ran from the example folders with/without this change?

@mgonchar
Contributor Author

mgonchar commented Sep 10, 2024

Hi @imangohari1, this change does not do much: it just wraps the fused kernel call in a separate class that inherits from nn.Module, that's basically it.
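
(As a generic illustration of why the nn.Module boundary helps quantization, and not code from this PR: once the kernel call sits behind an nn.Module, a module-swapping flow can find and replace it like any other submodule. QuantizedFusedSDPA below is a hypothetical placeholder.)

import torch


def swap_submodules(model: torch.nn.Module, target_cls, make_replacement):
    # Walk the module tree and replace every instance of target_cls with the
    # object produced by make_replacement(child), e.g. a quantized or
    # measurement-instrumented version of the fused SDPA wrapper.
    for name, child in model.named_children():
        if isinstance(child, target_cls):
            setattr(model, name, make_replacement(child))
        else:
            swap_submodules(child, target_cls, make_replacement)


# Hypothetical usage:
#   swap_submodules(model, ModuleFusedSDPA, lambda m: QuantizedFusedSDPA(m))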

To test it you may try to launch

python run_generation.py \
    --model_name_or_path bigcode/starcoderbase-3b \
    --use_hpu_graphs \
    --use_kv_cache \
    --batch_size 42 \
    --max_new_tokens 1024 \
    --do_sample \
    --bf16 \
    --use_flash_attention

and

python run_generation.py \
    --model_name_or_path bigcode/starcoderbase-3b \
    --use_hpu_graphs \
    --use_kv_cache \
    --batch_size 42 \
    --max_new_tokens 1024 \
    --do_sample \
    --bf16

In the second case flash attention is not used and my code is not involved in the computation. To test for possible side effects of this change, you may run

python run_generation.py \
    --model_name_or_path bigcode/starcoderbase-3b \
    --use_hpu_graphs \
    --use_kv_cache \
    --batch_size 42 \
    --max_new_tokens 1024 \
    --do_sample \
    --bf16 \
    --use_flash_attention

with and without my commit. There is also a test covering a gpt_bigcode-based model, and it passes as well.

I ran all of this during my manual tests and everything worked.

@imangohari1
Contributor

> Hi @imangohari1, this change does not do much: it just wraps the fused kernel call in a separate class that inherits from nn.Module [...] I ran all of this during my manual tests and everything worked.

I have tested these commands before and after the change. I think they are fine.
Please respond to @jiminha's comments.

P.S. This PR has been tested with an integration branch and the results looked fine. CI #255
https://github.com/huggingface/optimum-habana/compare/main...imangohari1:optimum-habana:ig/ci_20240911?expand=1

@mgonchar
Contributor Author

Hi @imangohari1, thanks for having a look. There are no comments from @jiminha in this PR; you've probably mixed it up with another one here. If you've tested it, I think this PR should be good to go to CI and merge. Can you please trigger it?

@imangohari1 (Contributor) left a comment

LGTM!
The PR has been tested both locally and with integration CI #255.

@libinta added the run-test (Run CI for PRs from external contributors) label and removed the review, wip labels Sep 24, 2024
@regisss
Collaborator

regisss commented Sep 25, 2024

Please run `make style`

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

- introduce GaudiGPTBigCodeAttention class
- wrap FusedSDPA kernel in a separate ModuleFusedSDPA class
@mgonchar mgonchar force-pushed the main_gpt_bigcode_quant_friendly_fsdpa branch from 745f74d to 88ad54e Compare September 25, 2024 11:40
@mgonchar
Contributor Author

@regisss rebased, retested, style corrected

@regisss regisss merged commit c31dfab into huggingface:main Sep 25, 2024
3 of 4 checks passed
@mgonchar mgonchar deleted the main_gpt_bigcode_quant_friendly_fsdpa branch September 25, 2024 16:02