
Trouble following documentation #226

Open
ccsiszer opened this issue Oct 3, 2022 · 13 comments

@ccsiszer

ccsiszer commented Oct 3, 2022

Hi. I am trying to follow the documentation to install and train the model.
I have successfully installed everything and have run the following commands so far, also successfully:
bash scripts/download_alphafold_dbs.sh data/
bash scripts/download_mmseqs_dbs.sh data/
bash scripts/prep_mmseqs_dbs.sh data/

In my data directory, I have the following:
total 407176420
drwxrwxr-x 2 ubuntu ubuntu 6144 Oct 3 19:16 bfd
drwxrwxr-x 2 ubuntu ubuntu 6144 Oct 3 14:05 colabfold
-rw-rw-r-- 1 ubuntu ubuntu 117965643010 Sep 30 21:20 colabfold_envdb_202108.tar.gz
drwxrwxr-x 2 ubuntu ubuntu 38912 Oct 1 19:03 mmseqs_dbs
drwxrwxr-x 5 ubuntu ubuntu 6144 Oct 1 18:45 tmp
drwxrwxr-x 2 ubuntu ubuntu 6144 Oct 3 15:03 uniref30
-rw-rw-r-- 1 ubuntu ubuntu 149491476480 Sep 30 16:23 uniref30_2103.tar
-rw-rw-r-- 1 ubuntu ubuntu 149491476480 Oct 1 09:47 uniref30_2103.tar.gz

I am now trying to run the training part, but I feel I am missing the data I need. For instance, I thought I was going to be able to do this:
python3 scripts/precompute_alignments_mmseqs.py input.fasta \
    data/mmseqs_dbs \
    uniref30_2103_db \
    alignment_dir \
    ~/MMseqs2/build/bin/mmseqs \
    /usr/bin/hhsearch \
    --env_db colabfold_envdb_202108_db \
    --pdb70 data/pdb70/pdb70

But I don't seem to have what I need to create the input.fasta file, and I also don't have colabfold_envdb_202108_db or data/pdb70/pdb70.

Can someone kindly point me in the right direction? I am not a data scientist; I am a data engineer/IT/wear-many-hats person, so I apologize if I say something that doesn't make much sense in terms of models, etc.

Thank you.

@gahdritz
Collaborator

gahdritz commented Oct 6, 2022

Hm. PDB70 should be downloaded by download_alphafold_dbs.sh and the ColabFold database by download_mmseqs_dbs.sh. If those two didn't end up in your data/ directory, the downloads must have failed for some reason. Do you still have the console output from when you ran the download scripts? If so, post it here and I can try to determine what went wrong. If not, try re-running the commands in each of those scripts corresponding to the two databases (both of those scripts are just lists of calls to database-specific scripts).
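For example, something like the following (the per-database script names here are assumptions; check the actual calls inside the two wrapper scripts in your checkout):

bash scripts/download_pdb70.sh data/            # assumed name; see download_alphafold_dbs.sh
bash scripts/download_colabfold_envdb.sh data/  # assumed name; see download_mmseqs_dbs.sh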

The input FASTA file should contain the sequences for which you compute alignments, and so isn't included by default in the downloaded data. If you just want a large database of any alignments, I recommend checking out OpenProteinSet, our database of 4.5 million precomputed MSAs, of which 400k also come with template hits and AF structure predictions: https://registry.opendata.aws/openfold/
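For reference, a minimal input.fasta is just a header line followed by a query sequence per entry; both the name and the sequence below are made up:

>query_chain_A
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR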

@ccsiszer
Author

ccsiszer commented Oct 7, 2022

Hi Gahdritz,
Unfortunately, I don't have the console output of the download scripts. To give you context, all I am trying to do is see if I can train the model on a small data set as a proof-of-concept project. Following your advice, I downloaded data from https://registry.opendata.aws/openfold/. Specifically, I created 3 directories in my "data" directory called pdb, uniclust30 and uniclust30_overflow. Inside each, I put 1000 entries from the respective locations in the bucket. However, I am still struggling to get this right. For instance, I am trying to run this command:
python3 train_openfold.py mmcif_dir/ alignment_dir/ template_mmcif_dir/ output_dir/ \
    2021-10-10 \
    --template_release_dates_cache_path mmcif_cache.json \
    --precision bf16 \
    --gpus 8 --replace_sampler_ddp=True \
    --seed 4242022 \
    --deepspeed_config_path deepspeed_config.json \
    --checkpoint_every_epoch \
    --resume_from_ckpt ckpt_dir/ \
    --train_chain_data_cache_path chain_data_cache.json \
    --obsolete_pdbs_file_path obsolete.dat

(In multi-GPU settings, the seed must be specified.)
Before I can do that, though, I need the mmcif_cache.json file, for example, but I can't generate it because I don't seem to have any .cif files. Can you guide me on how to do this with the data I have from https://registry.opendata.aws/openfold/?
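For reference, here is roughly how I pulled the data down (destination paths illustrative; the bucket is public, so --no-sign-request is enough):

aws s3 cp --no-sign-request --recursive s3://openfold/pdb/ data/pdb/
aws s3 cp --no-sign-request --recursive s3://openfold/uniclust30/ data/uniclust30/
aws s3 cp --no-sign-request --recursive s3://openfold/uniclust30_overflow/ data/uniclust30_overflow/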

@gahdritz
Collaborator

gahdritz commented Oct 7, 2022

If you ran the download scripts, you probably already have the Protein Data Bank mmCIF files. If not, you can run scripts/download_pdb_mmcif.sh to fetch them.

The uniclust30 MSAs are bundled with .pdb files---you should use these in place of mmCIF files for those chains, which are from UniProt, not the PDB. Just dump the mmCIF files and the .pdb files into the so-called mmcif_dir/ when you run the training command, making sure that every subdirectory of alignment_dir, each of which should correspond to a single chain, has a corresponding structural data file in mmcif_dir. See the sketch below.
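Schematically, something like this (the chain names here are made up):

alignment_dir/
    1abc_A/                    # one subdirectory per chain
        bfd_uniclust_hits.a3m
        mgnify_hits.a3m
        uniref90_hits.a3m
    Q12345/
        ...
mmcif_dir/
    1abc.cif                   # mmCIF file for the PDB chain
    Q12345.pdb                 # .pdb file bundled with the uniclust30 MSAs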

@ccsiszer
Author

ccsiszer commented Oct 7, 2022

This is what I see now:
ubuntu@run-63400feaae186841a97025d6-k2xjc:/opt/openfold$ /opt/conda/bin/python3 train_openfold.py /domino/datasets/local/openfold/mmcif_dir/ /domino/datasets/local/openfold/alignment_dir/ /domino/datasets/local/openfold/mmcif_dir/ /domino/datasets/local/openfold/output_dir/ 2021-10-10 --template_release_dates_cache_path /domino/datasets/local/openfold/mmcif_cache.json --train_chain_data_cache_path /domino/datasets/local/openfold/chain_data_cache.json
WARNING:root:Removing 3 alignment entries (mgnify_hits.a3m, bfd_uniclust_hits.a3m, uniref90_hits.a3m) with no corresponding entries in chain_data_cache (/domino/datasets/local/openfold/chain_data_cache.json).
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/core/datamodule.py:470: LightningDeprecationWarning: DataModule.setup has already been called, so it will not be called again. In v1.6 this behavior will change to always call DataModule.setup.
f"DataModule.{name} has already been called, so it will not be called again. "

  | Name  | Type          | Params
------------------------------------
0 | model | AlphaFold     | 93.2 M
1 | loss  | AlphaFoldLoss | 0

93.2 M Trainable params
0 Non-trainable params
93.2 M Total params
372.916 Total estimated model params size (MB)
Traceback (most recent call last):
File "train_openfold.py", line 573, in
main(args)
File "train_openfold.py", line 364, in main
ckpt_path=ckpt_path,
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
self.fit_loop.run()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 140, in run
self.on_run_start(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 197, in on_run_start
self.trainer.reset_train_val_dataloaders(self.trainer.lightning_module)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 595, in reset_train_val_dataloaders
self.reset_train_dataloader(model=model)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 365, in reset_train_dataloader
self.train_dataloader = self.request_dataloader(RunningStage.TRAINING, model=model)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 611, in request_dataloader
dataloader = source.dataloader()
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 300, in dataloader
return method()
File "/opt/openfold/openfold/data/data_modules.py", line 726, in train_dataloader
return self._gen_dataloader("train")
File "/opt/openfold/openfold/data/data_modules.py", line 703, in _gen_dataloader
dataset.reroll()
File "/opt/openfold/openfold/data/data_modules.py", line 416, in reroll
datapoint_idx = next(samples)
File "/opt/openfold/openfold/data/data_modules.py", line 369, in looped_samples
candidate_idx = next(idx_iter)
File "/opt/openfold/openfold/data/data_modules.py", line 355, in looped_shuffled_dataset_idx
generator=self.generator,
RuntimeError: cannot sample n_sample <= 0 samples
ubuntu@run-63400feaae186841a97025d6-k2xjc:/opt/openfold$

Here's what my directories look like:
ubuntu@run-63400feaae186841a97025d6-k2xjc:/domino/datasets/local/openfold$ ls -l mmcif_dir/
total 372
-rw-rw-r-- 1 ubuntu ubuntu 377497 Jan 8 2022 11gs.cif
ubuntu@run-63400feaae186841a97025d6-k2xjc:/domino/datasets/local/openfold$

ubuntu@run-63400feaae186841a97025d6-k2xjc:/domino/datasets/local/openfold$ ls -l alignment_dir/
total 3192
-rw-rw-r-- 1 ubuntu ubuntu 425707 Oct 7 14:07 bfd_uniclust_hits.a3m
-rw-rw-r-- 1 ubuntu ubuntu 194812 Oct 7 14:07 mgnify_hits.a3m
-rw-rw-r-- 1 ubuntu ubuntu 2643379 Oct 7 14:07 uniref90_hits.a3m
ubuntu@run-63400feaae186841a97025d6-k2xjc:/domino/datasets/local/openfold$

What am I missing? Thank you so much for your help!

@ccsiszer
Author

ccsiszer commented Oct 11, 2022

@gahdritz, can you please help?

Starting from the beginning, I downloaded all the data available here: https://registry.opendata.aws/openfold/

Specifically,
aws s3 ls --no-sign-request s3://openfold/
PRE openfold_params/
PRE pdb/
PRE uniclust30/
PRE uniclust30_overflow/
2022-06-17 03:35:44 18657 LICENSE
2022-08-28 21:57:09 4524064 duplicate_pdb_chains.txt

I downloaded the pdb, uniclust30 and uniclust30_overflow directories. I am just trying to test things out, so instead of attempting to train the model on all the data, I moved 1000 directories from each of the directories above (pdb, uniclust30 and uniclust30_overflow) to another location, giving me smaller versions of each (roughly as sketched below).
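Something like this, per directory (illustrative only; the destination name is made up):

# move the first 1000 chain directories into a smaller copy
mkdir -p pdb_small
ls pdb/ | head -n 1000 | while read d; do mv "pdb/$d" pdb_small/; done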

Since then, I have been trying to run the training script.

Following the documentation, I was able to run this:
python3 scripts/generate_mmcif_cache.py \
    mmcif_dir/ \
    mmcif_cache.json \
    --no_workers 16

But only after downloading the .cif files corresponding to some entries in the pdb directory. I downloaded those cif files from here: s3://pdbsnapshots/20220103/pub/pdb/data/structures/all/mmCIF/
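For reference, roughly the command I used for a single entry (11gs is the one shown in my mmcif_dir listing above; the snapshot files are gzipped):

aws s3 cp --no-sign-request \
    s3://pdbsnapshots/20220103/pub/pdb/data/structures/all/mmCIF/11gs.cif.gz .
gunzip 11gs.cif.gz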

After generating the mmcif_cache.json file, I was able to run this:
python3 scripts/generate_chain_data_cache.py \
    mmcif_dir/ \
    chain_data_cache.json \
    --cluster_file clusters-by-entity-40.txt \
    --no_workers 16

Now I am trying to run this:
python3 train_openfold.py mmcif_dir/ alignment_dir/ template_mmcif_dir/ output_dir/ \
    2021-10-10 \
    --template_release_dates_cache_path mmcif_cache.json \
    --precision bf16 \
    --gpus 8 --replace_sampler_ddp=True \
    --seed 4242022 \
    --deepspeed_config_path deepspeed_config.json \
    --checkpoint_every_epoch \
    --resume_from_ckpt ckpt_dir/ \
    --train_chain_data_cache_path chain_data_cache.json \
    --obsolete_pdbs_file_path obsolete.dat

(In multi-GPU settings, the seed must be specified.)

But I keep getting the sampling error (RuntimeError: cannot sample n_sample <= 0 samples).

Can you please point me in the right direction, keeping in mind that I am not familiar at all with openfold? Thank you so much.

@vetmax7

vetmax7 commented Sep 21, 2023

Hello!

Were you able to find a solution?

@RJ3

RJ3 commented Aug 23, 2024

@vetmax7 @gahdritz did you find a solution?

@vetmax7

vetmax7 commented Aug 25, 2024

Hello @RJ3
Yes. As I remember, there were 2 possible reasons:

  1. Incorrect data structure for the datasets.
  2. Access rights to the directories used. I set chmod 777 on everything; see the sketch below.
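Something like this, with directory names per the training command above (illustrative):

chmod -R 777 mmcif_dir/ alignment_dir/ template_mmcif_dir/ output_dir/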

By the way, are you using the latest version from the main branch? I ask because I got this error: OSError: /opt/conda/lib/python3.9/site-packages/torch/lib/../../../../libcublas.so.11: symbol cublasLtGetStatusString version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference

Did you install OpenFold with the default environment.yml? What versions of numpy, pandas and pytorch-lightning do you have?

@RJ3

RJ3 commented Aug 25, 2024

Thanks! The RuntimeError: cannot sample n_sample <= 0 samples was caused by a pathname typo.
On to the next error now...

@vetmax7

vetmax7 commented Aug 26, 2024

@RJ3 Which version are you trying to install, CUDA 12 or CUDA 11?

@RJ3

RJ3 commented Aug 26, 2024

I'm trying CUDA 12 and the pl_upgrades branch on H100.
I seem to be getting a similar error to the one in #473

@vetmax7

vetmax7 commented Aug 26, 2024

@RJ3 I have:

#468 (comment)

@abhinavb22

@RJ3 were you able to figure out the out-of-memory error? I am trying to train on A100 (40 GB) GPUs with a crop size of 384 and it's crashing with an out-of-memory error.
