Splitting large genomes up by chromosome #68
Comments
Hi Sheina -- the software generally expects the genome to be complete, so splitting by chromosome likely would cause issues. Which task(s) are running slow?
Hello, thanks for your response! Attached is my nextflow.log. The process with the longest run time seems to be "gnomon_plane:chainer:run_chainer", at 16 hours. I'm also trying to annotate a much smaller genome (400 Mb), and its "rnaseq_short_plane:star:run_star" took 16 hours. For the only genome I've run where EGAPx finished (500 Mb), the same processes took ~15-20 minutes for gnomon_plane:chainer:run_chainer and ~5 hours for rnaseq_short_plane:star:run_star. I am not able to upload the run.timeline.html to GitHub, but I can send it by email if that would be useful. Best wishes,
Are you running this on a single machine, or on a cluster? If a cluster, which type?

run_star's runtime depends largely on the number of runs and reads. STAR will typically align 10-30 million reads per hour, so a single run with, say, 150M reads might take 5-15 hours (your mileage may vary). On a cluster it will align multiple runs in parallel; on a single machine, I think STAR is set to take 32 cores, so you might get 2-4 runs aligning in parallel. More runs will add up. I don't know how the performance scales with your 9 Gbp genome and the increased max_intron size we discussed. We also run STAR with altered parameters and apply some downstream filtering logic to remove certain types of artifacts, but that adds some overhead that might be affected by your large genome.

It's possible the large max_intron size is causing chainer to take longer, or it's just the size of the genome. If you're running on a cluster we might be able to split chainer into more jobs; I'm checking whether we've exposed a way for you to do that. When we annotated Schistocerca gregaria, with a 600 kb max intron size, STAR took 10 hours, likely aligning all 29 runs in parallel or nearly so; SRR15423967 is 186M spots, so that was probably the limiting run. Chainer then took 2 hours, but our RefSeq implementation splits the work over more nodes than EGAPx does, so that might be the difference.

Please do e-mail me the run.timeline.html file. That may be informative.
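As a rough illustration of that arithmetic, here is a back-of-the-envelope sketch using the figures from the comment above (a hypothetical 150M-read run and an assumed 10-30 million reads/hr STAR throughput; actual speed will vary with genome size and parameters):

```bash
# Back-of-the-envelope STAR wall-clock estimate for a single run
reads_millions=150   # e.g. one SRA run with ~150M reads (assumption for illustration)
rate_low=10          # assumed lower bound: million reads aligned per hour
rate_high=30         # assumed upper bound: million reads aligned per hour
echo "Estimated alignment time: $((reads_millions / rate_high))-$((reads_millions / rate_low)) hours"
```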
I'm running on an HPC on a "large memory queue" (default memory per core: 16000 MB), requesting 1 node and 96 ntasks per node. The job scheduler is SLURM and the egapx.py engine is set to singularity. I did set the max_intron size to 2500000. I'll send you that run.timeline.html file now -- here it is as a Box link. Thanks so much for looking into this!
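For reference, a hedged sketch of what that setup might look like as a SLURM batch script. The partition name, input file name, and work/output directories are placeholders, the egapx.py flags (-e, -w, -o) follow the EGAPx README, and where exactly max_intron is set depends on your EGAPx version, so check all of these against your installation:

```bash
#!/bin/bash
#SBATCH --partition=largemem          # placeholder for the large-memory queue
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=96
#SBATCH --mem-per-cpu=16000

# Hypothetical invocation: input.yaml holds the genome, reads, and parameter overrides
python3 ui/egapx.py input.yaml -e singularity -w egapx_work -o egapx_out
```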
Can you provide us with logs for at least one failed job? The logs are in the work directory of the job: the file is called .command.log, and there may be one for stderr as well, .command.err. To find the work directory for the job, look in the run.trace.txt file and find the failed job; it should have 'chainer' in its name. Take the corresponding hash (second column) for the job. This hash is the beginning of the work directory name. E.g. if your Nextflow work directory is 'work' and the hash is '17/a4d8b3', there should be a directory beginning with this hash, e.g. 'work/17/a4d8b3dd1a697cf9d4bdc56017c010'. In this directory there are files for intermediate output, final output, staged input, logs, etc. We're interested in the files .command.out and .command.err.
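For example, a minimal shell sketch of those lookup steps, assuming the default trace layout (hash in the second column) and a Nextflow work directory named 'work'; the hash 17/a4d8b3 is the hypothetical value from the comment above:

```bash
# List chainer tasks in the trace, then note the hash (second column) of the failed one
grep -i chainer run.trace.txt

# Expand the hash prefix to the full work directory
ls -d work/17/a4d8b3*

# Inspect the job's log and error output
cat work/17/a4d8b3*/.command.log
cat work/17/a4d8b3*/.command.err
```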
Hi Victzh, I just "resumed" the job and I don't have any jobs that have "failed" yet; they all say "cached" at the moment. Looking at a different run (a ~400 Mb genome, which had a run time of 16 hours for rnaseq_short_plane:star:run_star), the .command.out and .command.err are attached. command_err.txt Best wishes,
Referenced code locations: nf/subworkflows/ncbi/gnomon/chainer_wnode/main.nf, line 8; ui/assets/config/process_resources.config, line 22.
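If you want to see exactly what those references point to, something like the following works from a local checkout of the egapx repository (paths relative to the repo root):

```bash
# Print the referenced lines from a local clone of the egapx repo
sed -n '8p'  nf/subworkflows/ncbi/gnomon/chainer_wnode/main.nf
sed -n '22p' ui/assets/config/process_resources.config
```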
Hello! I'm running into an issue with one large genome (~9 Gb) with 13 chromosomes and 1,738 unplaced contigs that is taking ages to finish. Is splitting the FASTA by chromosome and then running EGAPx on each chromosome separately a fair strategy to employ in this case?
Thanks!
Sheina