Splitting large genomes up by chromosome #68

Open
sheinasim opened this issue Dec 13, 2024 · 7 comments

Comments

@sheinasim

Hello! I'm running into an issue with one large genome (~9 Gbp) with 13 chromosomes and 1,738 unplaced contigs that is taking ages to finish. Is splitting the fasta by chromosome and then running egapx on each chromosome separately a fair strategy in this case?

Thanks!
Sheina

@murphyte

Hi Sheina -- the software generally expects the genome to be complete, so splitting by chromosome likely would cause issues. Which task(s) are running slow?

@sheinasim
Author

Hello, thanks for your response!

Attached is my nextflow.log. The process with the longest run time seems to be "gnomon_plane:chainer:run_chainer", at 16 hours.

I'm also trying to annotate a much smaller genome (400 Mb), and its "rnaseq_short_plane:star:run_star" took 16 hours.

For the only genome where egapx finished (500 Mb), the same processes took ~15-20 minutes for "gnomon_plane:chainer:run_chainer" and ~5 hours for "rnaseq_short_plane:star:run_star".

I am not able to upload the run.timeline.html to GitHub, but I can send it by email if that would be useful.

nextflow.log

Best wishes,
Sheina

@murphyte

Are you running this on a single machine, or on a cluster? If a cluster, which type?

run_star runtime is largely dependent on the number of runs and reads. STAR will typically align 10-30 million reads per hour, so a single run with, say, 150M reads might take 5-15 hrs (your mileage may vary). On a cluster it will align multiple runs in parallel. On a single machine, I think STAR is set to take 32 cores, so you might get 2-4 runs aligning in parallel. More runs will add up. I don't know how the performance scales with your 9 Gbp genome and with upping the max_intron size as we discussed. We also run STAR with altered parameters and apply some downstream filtering logic to remove certain types of artifacts, but that adds some overhead that might be affected by your large genome.
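A back-of-the-envelope version of that estimate, as a Groovy sketch (the rates are the ones quoted above; real throughput depends on genome size, max_intron, and I/O):

```groovy
// Rough STAR wall-clock estimate from the 10-30 M reads/hr figure above.
// Illustrative numbers only; actual throughput varies per run.
def reads    = 150_000_000   // reads in a single run (example value)
def slowRate =  10_000_000   // reads aligned per hour, pessimistic
def fastRate =  30_000_000   // reads aligned per hour, optimistic
def low  = (reads / fastRate) as int
def high = (reads / slowRate) as int
println "~${low}-${high} hours for this run; a cluster aligns multiple runs in parallel"
```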

It's possible the large max_intron size is causing chainer to take longer, or it's just the size of the genome. If you're running on a cluster we might be able to split chainer into more jobs; I'm checking whether we've exposed a way for you to do that.

When we annotated Schistocerca gregaria, with a 600 kb max intron size, STAR took 10 hrs, likely aligning all 29 runs in parallel or nearly so. SRR15423967 is 186M spots, so that was probably the limiting run. Chainer then took 2 hrs, but our RefSeq implementation splits the work over more nodes than EGAPx does, so that might be the difference.

Please do e-mail me the run.timeline.html file. That may be informative.

@sheinasim
Author

I'm running on an HPC in a "large memory queue" (default memory per core 16000 MB), requesting 1 node and 96 ntasks per node. The job scheduler is Slurm and the egapx.py engine is set to singularity.

I did set the max_intron size to 2500000.

Will send you that run.timeline.html file now.

Here's the run.timeline.html file as a box link. Thanks so much for looking into this!

@victzh

victzh commented Dec 17, 2024

Can you provide us with logs for at least one failed job? The logs are in the job's work directory: the file is called .command.log, and there may also be one for stderr, .command.err. To find the work directory for the job, look in the run.trace.txt file for the failed job; it should have 'chainer' in its name. Take the corresponding hash (second column) for that job. The hash is the beginning of the work directory name. E.g. if your Nextflow work directory is 'work' and the hash is '17/a4d8b3', there should be a directory beginning with that hash, e.g. 'work/17/a4d8b3dd1a697cf9d4bdc56017c010'. That directory contains intermediate output, final output, staged input, logs, etc. We're interested in the .command.out and .command.err files.
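A small Groovy sketch of that lookup, assuming Nextflow's default trace columns (hash in column 2, name in column 4, status in column 5) and a work directory named 'work'; check the header of your run.trace.txt and adjust if the columns differ:

```groovy
// Find the work directories of failed 'chainer' tasks from the trace file,
// following the procedure described above. Column positions assume the
// default Nextflow trace fields; verify them against your file's header.
new File('run.trace.txt').readLines().drop(1).each { line ->
    def cols   = line.split('\t')
    def hash   = cols[1]   // e.g. '17/a4d8b3'
    def name   = cols[3]
    def status = cols[4]
    if (name.contains('chainer') && status == 'FAILED') {
        def (top, prefix) = hash.split('/').toList()
        new File("work/${top}").listFiles()
            ?.findAll { it.name.startsWith(prefix) }
            ?.each { dir -> println "${dir}/.command.out and ${dir}/.command.err" }
    }
}
```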

@sheinasim
Author

Hi Victzh,

I just "resumed" the job and I don't have any jobs that have "failed" yet. They all say "cached" at the moment.

Looking at a different run (~400 Mb genome, which had a run time of 16 hours for "rnaseq_short_plane:star:run_star"), the .command.out and .command.err are attached.

command_err.txt
command_out.txt

Best wishes,
Sheina

@boukn

boukn commented Dec 18, 2024

In nf/subworkflows/ncbi/gnomon/chainer_wnode/main.nf, line 8 sets split_count=16; try setting that to 3, or even 1. The job splitting wasn't tuned for the "one very big node" use case.
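A sketch of that change (only the parameter line is shown; the exact surrounding code in main.nf may differ):

```groovy
// nf/subworkflows/ncbi/gnomon/chainer_wnode/main.nf, around line 8
// split_count = 16   // default: many chunks, tuned for spreading across cluster nodes
split_count = 3       // fewer, larger chunks for a single big node; even 1 is reasonable
```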

In ui/assets/config/process_resources.config, line 22 has time = 16.h, which is the time limit for known long jobs. You can bump that up to whatever runtime you can tolerate. Line 7 has time = 6.h, the time limit for jobs not flagged as long. If your workload is this big, some of the other tasks might need the extra time as well.
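In Nextflow config syntax the edits look roughly like this; the selector name below is an assumption for illustration, so in practice just raise the existing time values at the lines cited above:

```groovy
// ui/assets/config/process_resources.config -- illustrative sketch only.
// The real file already defines these limits; bump the existing values
// rather than pasting this block in verbatim.
process {
    time = 12.h                  // was 6.h: limit for jobs not flagged as long
    withLabel: 'long_job' {      // label name is an assumption
        time = 48.h              // was 16.h: limit for known long jobs
    }
}
```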
