
Error when running sh run_qwen.sh #487

Open
CharlesJhonson opened this issue Dec 18, 2024 · 3 comments
Labels
good first issue Good for newcomers

Comments

@CharlesJhonson

I ran sh run_qwen.sh locally on a GPU machine and got the errors below. Could someone help?

```
conda list | grep trl
trl                       0.13.0                   pypi_0    pypi
conda list | grep transformers
transformers              4.47.1                   pypi_0    pypi
sh run_qwen.sh
********************
It's effective
********************
Applied Liger kernels to Qwen2
Loading checkpoint shards: 100%|████████████████| 4/4 [00:00<00:00, 10.01it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/Liger-Kernel-main/examples/huggingface/training.py", line 81, in <module>
[rank0]:     train()
[rank0]:   File "/home/Liger-Kernel-main/examples/huggingface/training.py", line 67, in train
[rank0]:     trainer = SFTTrainer(
[rank0]:   File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank0]:     return func(*args, **kwargs)
[rank0]: TypeError: SFTTrainer.__init__() got an unexpected keyword argument 'max_seq_length'
E1218 16:54:26.878000 140467821201216 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 102105) of binary: /home/miniforge3/envs/ligerkernel/bin/python
Traceback (most recent call last):
  File "/home/miniforge3/envs/ligerkernel/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
training.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-18_16:54:26
  host      : 23
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 102105)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
@bboyleonp666
Contributor

@CharlesJhonson, I checked the documentation for trl. There was a change to trl.SFTTrainer in v0.13.0. I haven't dug into the details yet, but max_seq_length has been removed from trl.SFTTrainer and now lives in trl.SFTConfig.

@Tcc0403
Collaborator

Tcc0403 commented Dec 21, 2024

huggingface/trl#2306
huggingface/trl@5e90682#diff-67e157adfcd37d677fba66f610e3dfb56238cc550f221e8683fcfa0556e0f7caL150
It seems max_seq_length, a deprecated argument, was removed in that patch.

```python
max_seq_length=custom_args.max_seq_length,
```

Deleting this line and checking which arguments should be moved into the training_args dict should fix the issue.

Some links that might be helpful: trl.SFTTrainer, trl.SFTConfig, transformers.HfArgumentParser, transformers.TrainingArguments
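For anyone picking this up, the fix is roughly the following (a sketch only; I'm assuming the example's training.py constructs its trainer from a custom_args object and a training_args config, and the exact surrounding keyword arguments in the repo may differ):

```diff
-training_args = TrainingArguments(**vars(training_args))
+# max_seq_length is no longer accepted by SFTTrainer in trl v0.13.0;
+# pass it through SFTConfig (which subclasses TrainingArguments) instead.
+training_args = SFTConfig(
+    max_seq_length=custom_args.max_seq_length,
+    **vars(training_args),
+)
 trainer = SFTTrainer(
     model=model,
     args=training_args,
-    max_seq_length=custom_args.max_seq_length,
     train_dataset=train_dataset,
 )
```

The key point is that the keyword moves from the SFTTrainer constructor into the SFTConfig passed as args; everything else stays the same.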

@Tcc0403 Tcc0403 added the good first issue Good for newcomers label Dec 21, 2024
@CharlesJhonson
Author

OK, thanks very much! @bboyleonp666 @Tcc0403
I will try it.
