Splitting large genomes up by chromosome #68

Open
sheinasim opened this issue Dec 13, 2024 · 7 comments

Comments

@sheinasim

Hello! I'm running into an issue with one large genome (~9 Gbp) with 13 chromosomes and 1,738 unplaced contigs that is taking ages to finish. Is splitting the fasta by chromosome and then running egapx on each chromosome separately a fair strategy in this case?

Thanks!
Sheina

@murphyte

Hi Sheina -- the software generally expects the genome to be complete, so splitting by chromosome likely would cause issues. Which task(s) are running slow?

@sheinasim
Author

Hello, thanks for your response!

Attached is my nextflow.log. The process with the longest run time seems to be "gnomon_plane:chainer:run_chainer", at 16 hours.

I'm also trying to annotate a much smaller genome (400 Mb), and its "rnaseq_short_plane:star:run_star" took 16 hours.

For the only genome where egapx finished (500 Mb), the same processes took ~15-20 minutes for "gnomon_plane:chainer:run_chainer" and ~5 hours for "rnaseq_short_plane:star:run_star".

I am not able to upload the run.timeline.html to GitHub, but I can send it by email if that would be useful.

nextflow.log

Best wishes,
Sheina

@murphyte

Are you running this on a single machine, or on a cluster? If a cluster, which type?

run_star runtime is largely dependent on the number of runs and reads. STAR will typically align 10-30 million reads per hour, so a single run with, say, 150M reads might take 5-15 hrs (your mileage may vary). On a cluster it will align multiple runs in parallel. On a single machine, I think STAR is set to take 32 cores, so you might get 2-4 runs aligning in parallel. More runs will add up. I don't know how the performance scales with your 9 Gbp genome and with upping the max_intron size as we discussed. We also run STAR with altered parameters and apply some downstream filtering logic to remove certain types of artifacts, but that adds some overhead that might be affected by your large genome.
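A back-of-the-envelope version of that estimate, as a Groovy sketch (the rates are the ones quoted above; real throughput depends on genome size, max_intron, and I/O):

```groovy
// Rough STAR wall-clock estimate from the 10-30 M reads/hr figure above.
// Illustrative numbers only; actual throughput varies per run.
def reads    = 150_000_000   // reads in a single run (example value)
def slowRate =  10_000_000   // reads aligned per hour, pessimistic
def fastRate =  30_000_000   // reads aligned per hour, optimistic
def low  = (reads / fastRate) as int
def high = (reads / slowRate) as int
println "~${low}-${high} hours for this run; a cluster aligns multiple runs in parallel"
```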

It's possible the large max_intron size is causing chainer to take longer, or it's just the size of the genome. If you're running on a cluster we might be able to split chainer into more jobs; I'm checking whether we've exposed a way for you to do that.

When we annotated Schistocerca gregaria, with a 600 kb max intron size, STAR took 10 hrs, likely aligning all 29 runs in parallel or nearly so. SRR15423967 is 186M spots, so that was probably the limiting run. Chainer then took 2 hrs, but our RefSeq implementation splits the work over more nodes than EGAPx does, so that might be the difference.

Please do e-mail me the run.timeline.html file. That may be informative.

@sheinasim
Author

I'm running on an HPC in a "large memory queue" (default memory per core 16000 MB), requesting 1 node and 96 ntasks per node. The job scheduler is Slurm and the egapx.py engine is set to singularity.

I did set the max_intron size to 2500000.

Will send you that run.timeline.html file now.

Here's the run.timeline.html file as a box link. Thanks so much for looking into this!

@victzh

victzh commented Dec 17, 2024

Can you provide us with logs for at least one failed job? The logs are in the job's work directory: the file is called .command.log, and there may also be one for stderr, .command.err. To find the work directory for the job, look in the run.trace.txt file for the failed job; it should have 'chainer' in its name. Take the corresponding hash (second column) for that job. The hash is the beginning of the work directory name. E.g. if your Nextflow work directory is 'work' and the hash is '17/a4d8b3', there should be a directory beginning with that hash, e.g. 'work/17/a4d8b3dd1a697cf9d4bdc56017c010'. That directory contains intermediate output, final output, staged input, logs, etc. We're interested in the .command.out and .command.err files.
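A small Groovy sketch of that lookup, assuming Nextflow's default trace columns (hash in column 2, name in column 4, status in column 5) and a work directory named 'work'; check the header of your run.trace.txt and adjust if the columns differ:

```groovy
// Find the work directories of failed 'chainer' tasks from the trace file,
// following the procedure described above. Column positions assume the
// default Nextflow trace fields; verify them against your file's header.
new File('run.trace.txt').readLines().drop(1).each { line ->
    def cols   = line.split('\t')
    def hash   = cols[1]   // e.g. '17/a4d8b3'
    def name   = cols[3]
    def status = cols[4]
    if (name.contains('chainer') && status == 'FAILED') {
        def (top, prefix) = hash.split('/').toList()
        new File("work/${top}").listFiles()
            ?.findAll { it.name.startsWith(prefix) }
            ?.each { dir -> println "${dir}/.command.out and ${dir}/.command.err" }
    }
}
```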

@sheinasim
Author

Hi Victzh,

I just "resumed" the job and I don't have any jobs that have "failed" yet. They all say "cached" at the moment.

Looking at a different run (~400 Mb genome, which had a run time of 16 hours for "rnaseq_short_plane:star:run_star"), the .command.out and .command.err are attached.

command_err.txt
command_out.txt

Best wishes,
Sheina

@boukn

boukn commented Dec 18, 2024

In nf/subworkflows/ncbi/gnomon/chainer_wnode/main.nf, line 8 sets split_count=16; try setting that to 3, or even 1. The job splitting wasn't tuned for the "one very big node" use case.
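A sketch of that change (only the parameter line is shown; the exact surrounding code in main.nf may differ):

```groovy
// nf/subworkflows/ncbi/gnomon/chainer_wnode/main.nf, around line 8
// split_count = 16   // default: many chunks, tuned for spreading across cluster nodes
split_count = 3       // fewer, larger chunks for a single big node; even 1 is reasonable
```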

In ui/assets/config/process_resources.config, line 22 has time = 16.h, which is the time limit for known long jobs. You can bump that up to whatever runtime you can tolerate. Line 7 has time = 6.h, the time limit for jobs not flagged as long. If your workload is this big, some of the other tasks might need the extra time as well.
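In Nextflow config syntax the edits look roughly like this; the selector name below is an assumption for illustration, so in practice just raise the existing time values at the lines cited above:

```groovy
// ui/assets/config/process_resources.config -- illustrative sketch only.
// The real file already defines these limits; bump the existing values
// rather than pasting this block in verbatim.
process {
    time = 12.h                  // was 6.h: limit for jobs not flagged as long
    withLabel: 'long_job' {      // label name is an assumption
        time = 48.h              // was 16.h: limit for known long jobs
    }
}
```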
