
Fix deepspeed crash with Sentence Transformer Trainer #1328

Merged
merged 18 commits into huggingface:main from SentenceTransformerDeepSpeedFix
Sep 24, 2024

Conversation

@nngokhale (Contributor) commented Sep 12, 2024

The loss-model override logic is updated, and model saving is overridden for Gaudi to handle the state_dict correctly.
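As a rough illustration of the first half of the fix, here is a minimal sketch of what such an override guard could look like. This is not the PR's actual code; override_loss_model, loss_fn, and model are hypothetical names.

# Hypothetical sketch, not the PR's code: guard the loss-model override so a
# DeepSpeed-wrapped model is unwrapped before being handed to the loss module.
# DeepSpeedEngine proxies attribute access to the wrapped module, and pointing
# the loss at the engine itself can end in the infinite __getattr__ recursion
# shown in the first traceback below.
from deepspeed import DeepSpeedEngine


def override_loss_model(loss_fn, model):
    # Use the underlying module when the model is wrapped by DeepSpeed.
    loss_fn.model = model.module if isinstance(model, DeepSpeedEngine) else model
    return loss_fn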

Update training_nli.py:

...
args = SentenceTransformerGaudiTrainingArguments(
    ...
    deepspeed="deepspeed_zero_2.json",
)

Command line:
python ../../gaudi_spawn.py --world_size 2 --use_deepspeed training_nli.py
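
For reference, the deepspeed argument should also accept an inline dict instead of a JSON path, assuming SentenceTransformerGaudiTrainingArguments inherits the standard transformers.TrainingArguments behavior. The values below are illustrative and are not the contents of the PR's deepspeed_zero_2.json:

# Illustrative ZeRO-2 config passed inline; values are assumptions, and other
# Gaudi-specific arguments from the example script are omitted.
args = SentenceTransformerGaudiTrainingArguments(
    output_dir="output/training_nli",  # hypothetical path
    deepspeed={
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "zero_optimization": {"stage": 2},
        "bf16": {"enabled": True},
    },
)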

Fixes the following crashes when using DeepSpeed ZeRO-2: an infinite __getattr__ recursion in the loss forward pass, and an invalid-storage error when saving a checkpoint.

[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/optimum-habana/examples/sentence-transformers-training/nli/training_nli.py", line 125, in <module>
[rank0]:     main()
[rank0]:   File "/root/optimum-habana/examples/sentence-transformers-training/nli/training_nli.py", line 106, in main
[rank0]:     trainer.train()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 553, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 978, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1575, in training_step
[rank0]:     loss = self.compute_loss(model, inputs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/sentence_transformers/st_gaudi_trainer.py", line 325, in compute_loss
[rank0]:     loss = loss_fn(features, labels)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1544, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/sentence_transformers/losses/SoftmaxLoss.py", line 107, in forward
[rank0]:     reps = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/sentence_transformers/losses/SoftmaxLoss.py", line 107, in <listcomp>
[rank0]:     reps = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1585, in _call_impl
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1856, in forward
[rank0]:     if self.module.training:
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 484, in __getattr__
[rank0]:     return getattr(self, name)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 484, in __getattr__
[rank0]:     return getattr(self, name)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 484, in __getattr__
[rank0]:     return getattr(self, name)
[rank0]:   [Previous line repeated 486 more times]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 483, in __getattr__
[rank0]:     if name in dir(self):
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2546, in __dir__
[rank0]:     module_attrs = dir(self.__class__)
[rank0]: RecursionError: maximum recursion depth exceeded while calling a Python object
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/optimum-habana/examples/sentence-transformers-training/nli/training_nli.py", line 125, in <module>
[rank0]:     main()
[rank0]:   File "/root/optimum-habana/examples/sentence-transformers-training/nli/training_nli.py", line 106, in main
[rank0]:     trainer.train()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 553, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1052, in _inner_training_loop
[rank0]:     self._maybe_log_save_evaluate(tr_loss, _grad_norm, model, trial, epoch, ignore_keys_for_eval)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1269, in _maybe_log_save_evaluate
[rank0]:     self._save_checkpoint(model, trial, metrics=metrics)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1327, in _save_checkpoint
[rank0]:     self.save_model(output_dir, _internal_call=True)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1630, in save_model
[rank0]:     self._save(output_dir, state_dict=state_dict)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/sentence_transformers/st_gaudi_trainer.py", line 724, in _save
[rank0]:     self.model.save_pretrained(output_dir, safe_serialization=self.args.save_safetensors)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/sentence_transformers/SentenceTransformer.py", line 1072, in save_pretrained
[rank0]:     self.save(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/sentence_transformers/SentenceTransformer.py", line 1037, in save
[rank0]:     module.save(model_path, safe_serialization=safe_serialization)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/sentence_transformers/models/Transformer.py", line 180, in save
[rank0]:     self.auto_model.save_pretrained(output_path, safe_serialization=safe_serialization)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2691, in save_pretrained
[rank0]:     state_dict_split = split_torch_state_dict_into_shards(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/serialization/_torch.py", line 330, in split_torch_state_dict_into_shards
[rank0]:     return split_state_dict_into_shards_factory(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/serialization/_base.py", line 108, in split_state_dict_into_shards_factory
[rank0]:     storage_id = get_storage_id(tensor)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/serialization/_torch.py", line 359, in get_torch_storage_id
[rank0]:     unique_id = storage_ptr(tensor)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/serialization/_torch.py", line 410, in storage_ptr
[rank0]:     return tensor.storage().data_ptr()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 956, in data_ptr
[rank0]:     return self._data_ptr()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 960, in _data_ptr
[rank0]:     return self._untyped_storage.data_ptr()
[rank0]: RuntimeError: Attempted to access the data pointer on an invalid python storage.
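
The second crash comes from the save path visible in the traceback: _save in st_gaudi_trainer.py ignores the state_dict that save_model passes in and lets each module re-serialize its live tensors. As a minimal sketch of how the override might honor that argument (an assumption, not the PR's actual implementation):

# Hypothetical sketch of an overridden trainer method, not the PR's code.
# save_model (shown in the traceback above) already calls
# _save(output_dir, state_dict=state_dict); loading the gathered state_dict
# back into the unwrapped SentenceTransformer gives save_pretrained valid
# tensor storages to serialize.
def _save(self, output_dir, state_dict=None):
    if state_dict is not None:
        self.model.load_state_dict(state_dict)
    self.model.save_pretrained(output_dir, safe_serialization=self.args.save_safetensors)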

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@nngokhale force-pushed the SentenceTransformerDeepSpeedFix branch from 88f4100 to e9fd825 on September 12, 2024 at 02:06
@ZhengHongming888 (Contributor)

@nngokhale after checking the eval results, you may need to set the learning rate to something like 1e-7 to get reasonable results. Everything else seems OK...

args = SentenceTransformerGaudiTrainingArguments(
    # Required parameter:
    learning_rate=1e-7,
    output_dir=output_dir,
    ...

@ZhengHongming888 (Contributor)

@nngokhale I tested with your latest update and the results all seem reasonable. Thanks for your update!

@yafshar (Contributor) commented Sep 18, 2024

@nngokhale, thanks for addressing the comments. It is a very nice contribution. I will finish the review in a bit.

@yafshar (Contributor) commented Sep 18, 2024

@nngokhale the new addition uses peft, which is missing from the default optimum-habana installation. Please add a requirements.txt file and pin the correct version.

@yafshar (Contributor) commented Sep 18, 2024

@nngokhale

  • Please run make style and fix the errors.
  • It would be great if you could add a test, or update/complete the available ones and integrate them with test_sentence_transformers.

Please also run

make test_installs
python -m pytest tests/sentence_transformers/test_training_nli.py 
python -m pytest tests/sentence_transformers/test_training_stsbenchmark.py

@nngokhale (Contributor, Author) commented Sep 19, 2024

@nngokhale the new addition uses peft, which is missing from the default optimum-habana installation. Please add a requirements.txt file and pin the correct version.

Surprisingly, PEFT is already installed by the 1.17 Gaudi PyTorch Docker image; I didn't need to install it. This may be due to the inclusion of neural compressor (Required-by: neural_compressor_3x_pt). Should I still create a requirements.txt?

@nngokhale (Contributor, Author)

@nngokhale

  • Please run make style and fix the errors.
  • It would be great if you could add a test, or update/complete the available ones and integrate them with test_sentence_transformers.

Please also run

make test_installs
python -m pytest tests/sentence_transformers/test_training_nli.py 
python -m pytest tests/sentence_transformers/test_training_stsbenchmark.py

Added a peft test to each of the above tests.
Test results:
=============================================================================================================================== test session starts ================================================================================================================================
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.5.0
rootdir: /root/optimum-habana
configfile: setup.cfg
collected 2 items

tests/sentence_transformers/test_training_nli.py .. [100%]

================================================================================================================================= warnings summary =================================================================================================================================
tests/sentence_transformers/test_training_nli.py::test_training_nli[False]
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn(

tests/sentence_transformers/test_training_nli.py::test_training_nli[False]
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(

tests/sentence_transformers/test_training_nli.py::test_training_nli[False]
tests/sentence_transformers/test_training_nli.py::test_training_nli[True]
/root/optimum-habana/optimum/habana/transformers/training_args.py:296: FutureWarning: --use_hpu_graphs is deprecated and will be removed in a future version of 🤗 Optimum Habana. Use --use_hpu_graphs_for_training or --use_hpu_graphs_for_inference instead.
warnings.warn(

tests/sentence_transformers/test_training_nli.py::test_training_nli[False]
tests/sentence_transformers/test_training_nli.py::test_training_nli[True]
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1150: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================================================================== 2 passed, 6 warnings in 117.77s (0:01:57) =====================================================================================================================
=============================================================================================================================== test session starts ================================================================================================================================
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.5.0
rootdir: /root/optimum-habana
configfile: setup.cfg
collected 2 items

tests/sentence_transformers/test_training_stsbenchmark.py .. [100%]

================================================================================================================================= warnings summary =================================================================================================================================
tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[False]
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn(

tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[False]
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(

tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[False]
tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[True]
/root/optimum-habana/optimum/habana/transformers/training_args.py:296: FutureWarning: --use_hpu_graphs is deprecated and will be removed in a future version of 🤗 Optimum Habana. Use --use_hpu_graphs_for_training or --use_hpu_graphs_for_inference instead.
warnings.warn(

tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[False]
tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[True]
/root/optimum-habana/optimum/habana/transformers/training_args.py:366: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
warnings.warn(

tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[False]
tests/sentence_transformers/test_training_stsbenchmark.py::test_training_stsbenchmark[True]
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1150: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================================================================== 2 passed, 8 warnings in 127.43s (0:02:07) =====================================================================================================================

@ZhengHongming888 (Contributor)

@nngokhale I confirmed your two test cases and both passed. Thanks for adding the tests!

tests/sentence_transformers/test_training_nli.py ..
================================================================== 2 passed, 7 warnings in 141.40s (0:02:21)

tests/sentence_transformers/test_training_stsbenchmark.py ..
================================================================== 2 passed, 9 warnings in 174.33s (0:02:54) =======

@yafshar (Contributor) commented Sep 19, 2024

@nngokhale the new addition uses peft, which is missing from the default optimum-habana installation. Please add a requirements.txt file and pin the correct version.

Surprisingly, PEFT is already installed by the 1.17 Gaudi PyTorch Docker image; I didn't need to install it. This may be due to the inclusion of neural compressor (Required-by: neural_compressor_3x_pt). Should I still create a requirements.txt?

No need, thanks

@yafshar (Contributor) left a comment


LGTM!

@regisss this PR is ready, would you please check it?

@libinta added the run-test label (Run CI for PRs from external contributors) and removed the review wip labels on Sep 20, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@regisss merged commit 091c8a5 into huggingface:main on Sep 24, 2024 (3 of 4 checks passed)