When I follow the same process as step 1, it's OK for me to set nproc_per_node to 1 in base_training_args.sh (and export CUDA_VISIBLE_DEVICES to my custom device). However, when I set it to a value larger than 1 (and set CUDA_VISIBLE_DEVICES at the same time), it always gets stuck at this point:
[train set] examples: 13533; # avg tokens: 370.9773254394531
[train set] examples: 13533; # avg completion tokens: 105.39820861816406
/mnt/workspace/anaconda3/envs/LESS/lib/python3.9/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
warnings.warn(
[INFO|trainer.py:568] 2024-06-28 22:31:18,153 >> Using auto half precision backend
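For reference, the launch setup looks roughly like this (a minimal sketch; the torchrun module path and the variable names are my paraphrase, not the exact contents of base_training_args.sh):

```bash
# Sketch only -- names and the module path are assumed, not verbatim from the repo.
export CUDA_VISIBLE_DEVICES=0,1,2,3   # works fine when reduced to a single device

ID=$RANDOM
header="torchrun --nproc_per_node 4 --nnodes 1 \
    --rdzv-id=$ID --rdzv_backend c10d \
    -m less.train.train"              # training entry point (assumed path)

# With nproc_per_node > 1 this hangs right after "Using auto half precision backend":
$header $base_training_args "$@"
```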
Also, to avoid another issue, I added base_training_args="$base_training_args --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune" before setting training_args.
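Concretely, the addition sits roughly here (sketch only; everything other than the --fsdp flags is illustrative, not the actual script contents):

```bash
# Added before training_args is assembled; surrounding flags/variables are placeholders.
base_training_args="$base_training_args --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune"

training_args="$base_training_args \
    --model_name_or_path $model_path \
    --output_dir $output_dir"         # remaining task-specific flags omitted
```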
The experiment was done on 4 H100 GPUs. The Python version is 3.9.0, and the full pip list is below:
How did you solve this problem? I trained this step on two A6000 cards and it got stuck at the same position.
[INFO|trainer.py:568] 2024-11-08 10:51:53,438 >> Using auto half precision backend
What should I do to make it run on multiple GPUs? By the way, it works correctly on a 2×A100 server, though the environment may not be exactly the same.