Fix deepspeed crash with Sentence Transformer Trainer #1328
@nngokhale after checking the eval results, you may need to set the learning rate to something like 1e-7 to get reasonable results. Everything else seems OK... args = SentenceTransformerGaudiTrainingArguments(
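For reference, the learning-rate suggestion above could be applied roughly as follows. This is only a sketch: the helper `build_training_kwargs` and the output path are hypothetical, and the call to `SentenceTransformerGaudiTrainingArguments` is left as a comment because it needs a Gaudi environment. `output_dir` and `learning_rate` are assumed to be accepted as in the standard training arguments.

```python
# Hypothetical helper (not part of this PR): collects the keyword arguments
# that would be handed to the training-arguments class.
def build_training_kwargs(output_dir, learning_rate=1e-7):
    """Return kwargs using the small learning rate suggested in the review."""
    if learning_rate <= 0:
        raise ValueError("learning_rate must be positive")
    return {
        "output_dir": output_dir,        # assumed standard field
        "learning_rate": learning_rate,  # 1e-7 is the value suggested above
    }


# The output path is an illustrative placeholder.
kwargs = build_training_kwargs("output/training_nli_lora")

# On a Gaudi machine these could then be passed on, for example:
# args = SentenceTransformerGaudiTrainingArguments(**kwargs)
```

Keeping the value in one place makes it easy to try other learning rates if 1e-7 turns out to be too conservative for a different model.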
examples/sentence-transformers-training/nli/training_nli_lora.py
Co-authored-by: Yaser Afshar <[email protected]>
@nngokhale I tested with your latest update and the results all seem reasonable. Thanks for the update!
@nngokhale, thanks for addressing the comments. It is a very nice contribution. I will finish the review in a bit.
@nngokhale with the new addition and using |
Please also run make test_installs
python -m pytest tests/sentence_transformers/test_training_nli.py
python -m pytest tests/sentence_transformers/test_training_stsbenchmark.py
Surprisingly, PEFT is already installed in the 1.17 Gaudi PyTorch Docker image, so I didn't need to install it. This may be due to the inclusion of neural compressor (Required-by: neural_compressor_3x_pt). Should I still create a requirements.txt?
Added a PEFT test to each of the above tests.

tests/sentence_transformers/test_training_nli.py .. [100%]
==================== warnings summary ====================
tests/sentence_transformers/test_training_nli.py::test_training_nli[False]
tests/sentence_transformers/test_training_nli.py::test_training_nli[False]
tests/sentence_transformers/test_training_nli.py::test_training_nli[False]
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

tests/sentence_transformers/test_training_stsbenchmark.py .. [100%]
==================== warnings summary ====================
tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[False]
tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[False]
tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[False]
tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[False]
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
@nngokhale I ran your two test cases and both passed. Thanks for adding the tests!
tests/sentence_transformers/test_training_nli.py ..
tests/sentence_transformers/test_training_stsbenchmark.py ..
No need, thanks
LGTM!
@regisss this PR is ready, could you please take a look?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Updated the loss model override logic and overrode model saving for Gaudi so that it handles state_dict.
Update training_nli.py:
command line:
python ../../gaudi_spawn.py --world_size 2 --use_deepspeed training_nli.py
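If the launch needs to be scripted (for example to vary the number of devices), the command above can be assembled programmatically. A minimal sketch, assuming the same relative path to `gaudi_spawn.py` and the same flags as in the command line above; the helper `build_launch_command` is hypothetical and not part of this PR.

```python
import shlex


# Hypothetical helper: builds the argv list for the launch command shown above.
def build_launch_command(script, world_size=2, spawn_script="../../gaudi_spawn.py"):
    """Return the gaudi_spawn.py command with DeepSpeed enabled, as a list."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return [
        "python",
        spawn_script,
        "--world_size", str(world_size),
        "--use_deepspeed",
        script,
    ]


cmd = build_launch_command("training_nli.py")
print(shlex.join(cmd))
# prints: python ../../gaudi_spawn.py --world_size 2 --use_deepspeed training_nli.py
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` from the example directory, which avoids shell quoting issues.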
Fixes the following crashes when using deepspeed zero2