Multiple training errors in the pre-training code #24

Open
HelloWorldLTY opened this issue Aug 28, 2024 · 3 comments

Comments

@HelloWorldLTY

Hi, I found several errors in the pre-training code (the file run.sh and the code it calls). I have mentioned one in the pull request. Furthermore, it seems that we should use $PATH_TO_DATA_DICT to specify the variable in the shell script.

After correcting the path and file name, I found another error in the training stage:

=41667/41667=Iterations/Batches
Iteration:   0%|                                                                                 | 0/41667 [00:00<?, ?it/s]Finish Epoch:  0
Iteration:   0%|                                                                                 | 0/41667 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/gpfs/radev/scratch/ying_rex/tl688/dnaberts/DNABERT_S/train/pretrain/main.py", line 85, in <module>
    run(args)
  File "/gpfs/radev/scratch/ying_rex/tl688/dnaberts/DNABERT_S/train/pretrain/main.py", line 44, in run
    trainer.val()
  File "/gpfs/radev/scratch/ying_rex/tl688/dnaberts/DNABERT_S/train/pretrain/training.py", line 189, in val
    self.model.module.dnabert2.load_state_dict(torch.load(load_dir+'/pytorch_model.bin'))
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/radev/project/ying_rex/tl688/llm/lib/python3.11/site-packages/torch/serialization.py", line 998, in load
    with _open_file_like(f, 'rb') as opened_file:
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/radev/project/ying_rex/tl688/llm/lib/python3.11/site-packages/torch/serialization.py", line 445, in _open_file_like
    return _open_file(name_or_buffer, mode)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/radev/project/ying_rex/tl688/llm/lib/python3.11/site-packages/torch/serialization.py", line 426, in __init__
    super().__init__(open(name, mode))
                     ^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './results/epoch1.train_2w.csv.lr3e-06.lrscale100.bs48.maxlength2000.tmp0.05.seed1.con_methodsame_species.mixTrue.mix_layer_num-1.curriculumTrue/10000/pytorch_model.bin'

Could you please share your thoughts on how to address this? Thanks.

@github-staff github-staff deleted a comment Aug 28, 2024
@github-staff github-staff deleted a comment from ViniciusSCG Oct 1, 2024
@Andyargueasae

Hi @HelloWorldLTY, I also encountered the same problem at the end of the first epoch, and I am still waiting for an answer.

@Andyargueasae

It looks like the code has no recognizable step that saves pytorch_model.bin; it tries to load the file directly.
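One way to confirm this is to check which step folders under the run directory actually contain a weights file before val() tries to load them. The following is only a minimal sketch: the helper name find_saved_checkpoints is hypothetical and not part of the repository, and it assumes that checkpoints live in subfolders named by step count (such as the 10000 folder in the traceback above).

```python
from pathlib import Path


def find_saved_checkpoints(run_dir, weights_name="pytorch_model.bin"):
    """Return step subfolders of run_dir that contain weights_name, sorted by step.

    Hypothetical helper for diagnosis; assumes checkpoint folders are named
    with the step count (e.g. ".../10000/"), as in the traceback.
    """
    run_dir = Path(run_dir)
    if not run_dir.is_dir():
        # The run directory itself is missing, so nothing was saved there.
        return []
    found = []
    for sub in run_dir.iterdir():
        # Keep only numeric step folders that really hold a weights file.
        if sub.is_dir() and sub.name.isdigit() and (sub / weights_name).is_file():
            found.append(sub)
    return sorted(found, key=lambda p: int(p.name))


if __name__ == "__main__":
    # Point this at the run folder printed in the FileNotFoundError message.
    for ckpt in find_saved_checkpoints("./results"):
        print(ckpt)
```

If this returns an empty list for the run folder, no checkpoint was written during training, which would be consistent with the FileNotFoundError. In that case the validation loop would need either a save step during training or a check that skips step folders with no weights file.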

@HelloWorldLTY
Author

Hi, I eventually dropped dnabert-s and am focusing on dnabert2, which seems more feasible.
