At step 1, single GPU works while multiple GPUs get stuck. #22

Open
timturing opened this issue Jun 28, 2024 · 2 comments

I am following the same process as step 1. It works when I set nproc_per_node to 1 in base_training_args.sh (and export CUDA_VISIBLE_DEVICES to my chosen device). However, when I set it to a value larger than 1 (and set CUDA_VISIBLE_DEVICES accordingly), it always gets stuck at this point:

[train set] examples: 13533; # avg tokens: 370.9773254394531
[train set] examples: 13533; # avg completion tokens: 105.39820861816406
/mnt/workspace/anaconda3/envs/LESS/lib/python3.9/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
  warnings.warn(
[INFO|trainer.py:568] 2024-06-28 22:31:18,153 >> Using auto half precision backend
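
For reference, my setup looks roughly like this (just a sketch; the exact torchrun line in base_training_args.sh may differ, and the device IDs are placeholders):

export CUDA_VISIBLE_DEVICES=0,1,2,3          # the GPUs I want to use (placeholder IDs)
# in less/scripts/train/base_training_args.sh, one process per visible GPU:
torchrun --nnodes 1 --nproc_per_node 4 ...   # works with 1, hangs for me with any value > 1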

Also, to avoid another issue, I added base_training_args="$base_training_args --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune" before setting training_args.
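
Concretely, the local edit looks roughly like this (a sketch only; the surrounding variable names follow the repo's scripts but may differ):

# added before the training arguments are assembled:
base_training_args="$base_training_args --fsdp 'full_shard auto_wrap' --fsdp_config llama_finetune"
training_args="$base_training_args ..."      # the remaining arguments are unchanged
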
The experiment was run on 4 H100 GPUs. The Python version is 3.9.0, and the full pip list is below:

accelerate               0.28.0
aiohttp                  3.9.5
aiosignal                1.3.1
async-timeout            4.0.3
attrs                    23.2.0
bitsandbytes             0.40.0
certifi                  2024.6.2
charset-normalizer       3.3.2
click                    8.1.7
datasets                 2.20.0
dill                     0.3.8
docker-pycreds           0.4.0
fast_jl                  0.1.3
filelock                 3.15.4
frozenlist               1.4.1
fsspec                   2024.5.0
gitdb                    4.0.11
GitPython                3.1.43
huggingface-hub          0.23.4
idna                     3.7
Jinja2                   3.1.4
less                     0.1         /mnt/workspace/LESS
MarkupSafe               2.1.5
mpmath                   1.3.0
multidict                6.0.5
multiprocess             0.70.16
networkx                 3.2.1
numpy                    1.26.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.18.1
nvidia-nvjitlink-cu12    12.5.40
nvidia-nvtx-cu12         12.1.105
packaging                24.1
pandas                   2.2.2
peft                     0.7.1
pip                      24.0
platformdirs             4.2.2
protobuf                 5.27.2
psutil                   6.0.0
pyarrow                  16.1.0
pyarrow-hotfix           0.6
python-dateutil          2.9.0.post0
pytz                     2024.1
PyYAML                   6.0.1
regex                    2024.5.15
requests                 2.32.3
safetensors              0.4.3
scipy                    1.13.1
sentry-sdk               2.7.1
setproctitle             1.3.3
setuptools               69.5.1
six                      1.16.0
smmap                    5.0.1
sympy                    1.12.1
tokenizers               0.15.2
torch                    2.1.2
tqdm                     4.66.4
traker                   0.1.3
transformers             4.36.2
triton                   2.1.0
typing_extensions        4.12.2
tzdata                   2024.1
urllib3                  2.2.2
wandb                    0.17.3
wheel                    0.43.0
xxhash                   3.4.1
yarl                     1.9.4

What should I do to make it run on multiple GPUs? By the way, it works correctly on a server with 2 A100s, though that environment may not be exactly the same.


Zrc007 commented Sep 25, 2024

You could change nproc_per_node in less/scripts/train/base_training_args.sh, for example:
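
As a sketch (the actual line in the script may look different), assuming two visible GPUs:

torchrun --nnodes 1 --nproc_per_node 2 ...   # match this to the number of GPUs exported via CUDA_VISIBLE_DEVICES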


QinWHang commented Nov 8, 2024

How did you solve this problem? I ran this step on two A6000 cards and it got stuck at the same place:
[INFO|trainer.py:568] 2024-11-08 10:51:53,438 >> Using auto half precision backend
