
Fix deepspeed crash with Sentence Transformer Trainer #1328

Merged
merged 18 commits into huggingface:main from SentenceTransformerDeepSpeedFix
Sep 24, 2024

Conversation

@nngokhale (Contributor) commented Sep 12, 2024

The loss-model override logic is updated, and model saving is overridden for Gaudi to handle the state_dict correctly.
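As a rough illustration of the first half of the fix, here is a minimal sketch of what such an override guard could look like. This is not the PR's actual code; override_loss_model, loss_fn, and model are hypothetical names.

# Hypothetical sketch, not the PR's code: guard the loss-model override so a
# DeepSpeed-wrapped model is unwrapped before being handed to the loss module.
# DeepSpeedEngine proxies attribute access to the wrapped module, and pointing
# the loss at the engine itself can end in the infinite __getattr__ recursion
# shown in the first traceback below.
from deepspeed import DeepSpeedEngine


def override_loss_model(loss_fn, model):
    # Use the underlying module when the model is wrapped by DeepSpeed.
    loss_fn.model = model.module if isinstance(model, DeepSpeedEngine) else model
    return loss_fn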

Update training_nli.py:

...
args = SentenceTransformerGaudiTrainingArguments(
    ...
    deepspeed="deepspeed_zero_2.json",
)

Command line:
python ../../gaudi_spawn.py --world_size 2 --use_deepspeed training_nli.py
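
For reference, the deepspeed argument should also accept an inline dict instead of a JSON path, assuming SentenceTransformerGaudiTrainingArguments inherits the standard transformers.TrainingArguments behavior. The values below are illustrative and are not the contents of the PR's deepspeed_zero_2.json:

# Illustrative ZeRO-2 config passed inline; values are assumptions, and other
# Gaudi-specific arguments from the example script are omitted.
args = SentenceTransformerGaudiTrainingArguments(
    output_dir="output/training_nli",  # hypothetical path
    deepspeed={
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "zero_optimization": {"stage": 2},
        "bf16": {"enabled": True},
    },
)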

Fixes the following crashes when using DeepSpeed ZeRO-2: an infinite __getattr__ recursion in the loss forward pass, and an invalid-storage error when saving a checkpoint.

[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/optimum-habana/examples/sentence-transformers-training/nli/training_nli.py", line 125, in <module>
[rank0]:     main()
[rank0]:   File "/root/optimum-habana/examples/sentence-transformers-training/nli/training_nli.py", line 106, in main
[rank0]:     trainer.train()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 553, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 978, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1575, in training_step
[rank0]:     loss = self.compute_loss(model, inputs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/sentence_transformers/st_gaudi_trainer.py", line 325, in compute_loss
[rank0]:     loss = loss_fn(features, labels)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1544, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/sentence_transformers/losses/SoftmaxLoss.py", line 107, in forward
[rank0]:     reps = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/sentence_transformers/losses/SoftmaxLoss.py", line 107, in <listcomp>
[rank0]:     reps = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1585, in _call_impl
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1856, in forward
[rank0]:     if self.module.training:
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 484, in __getattr__
[rank0]:     return getattr(self, name)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 484, in __getattr__
[rank0]:     return getattr(self, name)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 484, in __getattr__
[rank0]:     return getattr(self, name)
[rank0]:   [Previous line repeated 486 more times]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 483, in __getattr__
[rank0]:     if name in dir(self):
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2546, in __dir__
[rank0]:     module_attrs = dir(self.__class__)
[rank0]: RecursionError: maximum recursion depth exceeded while calling a Python object
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/optimum-habana/examples/sentence-transformers-training/nli/training_nli.py", line 125, in <module>
[rank0]:     main()
[rank0]:   File "/root/optimum-habana/examples/sentence-transformers-training/nli/training_nli.py", line 106, in main
[rank0]:     trainer.train()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 553, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1052, in _inner_training_loop
[rank0]:     self._maybe_log_save_evaluate(tr_loss, _grad_norm, model, trial, epoch, ignore_keys_for_eval)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1269, in _maybe_log_save_evaluate
[rank0]:     self._save_checkpoint(model, trial, metrics=metrics)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1327, in _save_checkpoint
[rank0]:     self.save_model(output_dir, _internal_call=True)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1630, in save_model
[rank0]:     self._save(output_dir, state_dict=state_dict)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/sentence_transformers/st_gaudi_trainer.py", line 724, in _save
[rank0]:     self.model.save_pretrained(output_dir, safe_serialization=self.args.save_safetensors)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/sentence_transformers/SentenceTransformer.py", line 1072, in save_pretrained
[rank0]:     self.save(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/sentence_transformers/SentenceTransformer.py", line 1037, in save
[rank0]:     module.save(model_path, safe_serialization=safe_serialization)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/sentence_transformers/models/Transformer.py", line 180, in save
[rank0]:     self.auto_model.save_pretrained(output_path, safe_serialization=safe_serialization)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2691, in save_pretrained
[rank0]:     state_dict_split = split_torch_state_dict_into_shards(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/serialization/_torch.py", line 330, in split_torch_state_dict_into_shards
[rank0]:     return split_state_dict_into_shards_factory(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/serialization/_base.py", line 108, in split_state_dict_into_shards_factory
[rank0]:     storage_id = get_storage_id(tensor)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/serialization/_torch.py", line 359, in get_torch_storage_id
[rank0]:     unique_id = storage_ptr(tensor)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/serialization/_torch.py", line 410, in storage_ptr
[rank0]:     return tensor.storage().data_ptr()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 956, in data_ptr
[rank0]:     return self._data_ptr()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 960, in _data_ptr
[rank0]:     return self._untyped_storage.data_ptr()
[rank0]: RuntimeError: Attempted to access the data pointer on an invalid python storage.
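
The second crash comes from the save path visible in the traceback: _save in st_gaudi_trainer.py ignores the state_dict that save_model passes in and lets each module re-serialize its live tensors. As a minimal sketch of how the override might honor that argument (an assumption, not the PR's actual implementation):

# Hypothetical sketch of an overridden trainer method, not the PR's code.
# save_model (shown in the traceback above) already calls
# _save(output_dir, state_dict=state_dict); loading the gathered state_dict
# back into the unwrapped SentenceTransformer gives save_pretrained valid
# tensor storages to serialize.
def _save(self, output_dir, state_dict=None):
    if state_dict is not None:
        self.model.load_state_dict(state_dict)
    self.model.save_pretrained(output_dir, safe_serialization=self.args.save_safetensors)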

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@nngokhale force-pushed the SentenceTransformerDeepSpeedFix branch from 88f4100 to e9fd825 on September 12, 2024 at 02:06
@ZhengHongming888 (Contributor)

@nngokhale after checking the eval results, you may need to set the learning rate to something like 1e-7 to get reasonable results. Everything else seems OK...

args = SentenceTransformerGaudiTrainingArguments(
    # Required parameter:
    learning_rate=1e-7,
    output_dir=output_dir,
    ...

@ZhengHongming888 (Contributor)

@nngokhale I tested with your latest update and the results all seem reasonable. Thanks for your update!

@yafshar (Contributor) commented Sep 18, 2024

@nngokhale, thanks for addressing the comments. It is a very nice contribution. I will finish the review in a bit.

@yafshar (Contributor) commented Sep 18, 2024

@nngokhale the new addition uses peft, which is missing from the default optimum-habana installation. Please add a requirements.txt file and pin the correct version.

@yafshar (Contributor) commented Sep 18, 2024

@nngokhale

  • Please run make style and fix the errors.
  • It would be great if you could add a test, or update/complete the available ones and integrate them with test_sentence_transformers.

Please also run

make test_installs
python -m pytest tests/sentence_transformers/test_training_nli.py 
python -m pytest tests/sentence_transformers/test_training_stsbenchmark.py

@nngokhale (Contributor, Author) commented Sep 19, 2024

@nngokhale the new addition uses peft, which is missing from the default optimum-habana installation. Please add a requirements.txt file and pin the correct version.

Surprisingly, PEFT is already installed by the 1.17 Gaudi PyTorch Docker image; I didn't need to install it. This may be due to the inclusion of neural compressor (Required-by: neural_compressor_3x_pt). Should I still create a requirements.txt?

@nngokhale (Contributor, Author)

@nngokhale

  • Please run make style and fix the errors.
  • It would be great if you could add a test, or update/complete the available ones and integrate them with test_sentence_transformers.

Please also run

make test_installs
python -m pytest tests/sentence_transformers/test_training_nli.py 
python -m pytest tests/sentence_transformers/test_training_stsbenchmark.py

Added a peft test to each of the above tests.
Test results:
=============================================================================================================================== test session starts ================================================================================================================================
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.5.0
rootdir: /root/optimum-habana
configfile: setup.cfg
collected 2 items

tests/sentence_transformers/test_training_nli.py .. [100%]

================================================================================================================================= warnings summary =================================================================================================================================
tests/sentence_transformers/test_training_nli.py::test_training_nli[False]
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn(

tests/sentence_transformers/test_training_nli.py::test_training_nli[False]
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(

tests/sentence_transformers/test_training_nli.py::test_training_nli[False]
tests/sentence_transformers/test_training_nli.py::test_training_nli[True]
/root/optimum-habana/optimum/habana/transformers/training_args.py:296: FutureWarning: --use_hpu_graphs is deprecated and will be removed in a future version of 🤗 Optimum Habana. Use --use_hpu_graphs_for_training or --use_hpu_graphs_for_inference instead.
warnings.warn(

tests/sentence_transformers/test_training_nli.py::test_training_nli[False]
tests/sentence_transformers/test_training_nli.py::test_training_nli[True]
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1150: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================================================================== 2 passed, 6 warnings in 117.77s (0:01:57) =====================================================================================================================
=============================================================================================================================== test session starts ================================================================================================================================
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.5.0
rootdir: /root/optimum-habana
configfile: setup.cfg
collected 2 items

tests/sentence_transformers/test_training_stsbenchmark.py .. [100%]

================================================================================================================================= warnings summary =================================================================================================================================
tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[False]
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn(

tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[False]
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(

tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[False]
tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[True]
/root/optimum-habana/optimum/habana/transformers/training_args.py:296: FutureWarning: --use_hpu_graphs is deprecated and will be removed in a future version of 🤗 Optimum Habana. Use --use_hpu_graphs_for_training or --use_hpu_graphs_for_inference instead.
warnings.warn(

tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[False]
tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[True]
/root/optimum-habana/optimum/habana/transformers/training_args.py:366: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
warnings.warn(

tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[False]
tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[True]
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1150: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================================================================== 2 passed, 8 warnings in 127.43s (0:02:07) =====================================================================================================================

@ZhengHongming888 (Contributor)

@nngokhale I confirmed your two test cases and both passed. Thanks for adding the tests!

tests/sentence_transformers/test_training_nli.py ..
================================================================== 2 passed, 7 warnings in 141.40s (0:02:21)

tests/sentence_transformers/test_training_stsbenchmark.py ..
================================================================== 2 passed, 9 warnings in 174.33s (0:02:54) =======

@yafshar (Contributor) commented Sep 19, 2024

@nngokhale the new addition uses peft, which is missing from the default optimum-habana installation. Please add a requirements.txt file and pin the correct version.

Surprisingly, PEFT is already installed by the 1.17 Gaudi PyTorch Docker image; I didn't need to install it. This may be due to the inclusion of neural compressor (Required-by: neural_compressor_3x_pt). Should I still create a requirements.txt?

No need, thanks

@yafshar (Contributor) left a comment


LGTM!

@regisss this PR is ready, would you please check it?

@libinta added the run-test label (Run CI for PRs from external contributors) and removed the review wip labels on Sep 20, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@regisss merged commit 091c8a5 into huggingface:main on Sep 24, 2024 (3 of 4 checks passed)