Issue on low performance running of RepeatMasker on HPC cluster #276

manighanipoor opened this issue Sep 10, 2024 · 7 comments

@manighanipoor

I am running RepeatMasker on a snake genome with a de novo TE library (created by RepeatModeler2) on an HPC cluster using 20 CPUs. The speed is very low and I encounter node failures. I contacted HPC support and they believe this is caused by system overhead from the high number of batches. I noticed that RepeatMasker performs better if I increase the "-frag" option to 1000000, since that reduces the number of batches. Do you think this would affect TE identification sensitivity or accuracy?

Cheers,
Mani

@rmhubley
Member

We use clusters at UCSC, Texas Tech, and the University of Arizona, and I haven't seen an issue with batch overhead, but perhaps your cluster has some restrictive quotas that are interfering with the runs. With any cluster I would recommend making sure you are running on a local disk (local to the machine) for speed, and breaking your sequence up into batches of 50 MB (or larger) and running them independently through RepeatMasker on different nodes, leaving the default -frag parameter (a rough sketch of this is below). We have a Nextflow script that does this for you on SLURM-based clusters ( https://github.com/Dfam-consortium/RepeatMasker_Nextflow ).

If you change the -frag parameter, you increase the size of the window over which the GC background value is determined. That value is used to select the appropriate scoring matrix for aligning the consensus sequences, so if you increase it too much you will probably lose some lower-scoring annotations from your output.
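
A rough sketch of that manual split-and-run approach, in case it helps. It assumes a SLURM scheduler and that UCSC's faSplit is on your PATH; the genome and library file names are placeholders, and the Nextflow pipeline handles the splitting and result merging more carefully than this:

# Split the assembly into files of roughly 50 MB each. faSplit 'about' only
# breaks between records, so scaffolds larger than 50 MB stay in one piece.
faSplit about snake_genome.fa 50000000 chunk_

# Submit each chunk as an independent RepeatMasker job, leaving -frag at its
# default. With RMBlast each -pa search job uses roughly 4 cores, so -pa 5
# is a reasonable fit for a 20-CPU allocation.
for f in chunk_*.fa; do
  sbatch --cpus-per-task=20 --wrap \
    "RepeatMasker -pa 5 -lib snake_TE_library.fa -dir ${f%.fa}_RM $f"
done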

@manighanipoor
Author

Thanks,

How can we configure the RepeatMasker_Nextflow script to run batches on different nodes? It doesn't seem to be preconfigured for that. Should I ask HPC support to do it?

Cheers,
Mani

@rmhubley
Member

That's exactly what it's meant to do. We regularly run it on hundreds of nodes. There is an option "--cluster" that currently accepts either "local" or one of several cluster names that we use. You will need to edit the RepeatMasker_Nextflow.nf file and configure it for your needs. For instance, look in the script for where quanah is defined:

///////                
/////// CUSTOMIZE CLUSTER ENVIRONMENT HERE BY ADDING YOUR OWN
/////// CLUSTER NAMES OR USE 'local' TO RUN ON THE CURRENT 
/////// MACHINE.
///////                                              
// No cluster...just local execution
if ( params.cluster == "local" ) {
...
}else if ( params.cluster == "quanah" || params.cluster == "nocona" ){
  thisExecutor = "slurm"
  thisQueue = params.cluster                                                                   
  thisOptions = "--tasks=1 -N 1 --cpus-per-task=${proc} --exclude=cpu-23-1"
  thisAdjOptions = "--tasks=1 -N 1 --cpus-per-task=2 --exclude=cpu-23-1"       
  ucscToolsDir="/lustre/work/daray/software/ucscTools"             
  repeatMaskerDir="/lustre/work/daray/software/RepeatMasker-4.1.2-p1"    
  thisScratch = false                                                
}

You would modify this block to accept the name of your cluster and set its parameters. Nextflow supports quite a few cluster job managers; the example above uses SLURM. Once you have made your changes, simply use the "--cluster myclustername" option when you run.
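
For example, once you have added a branch for "myclustername", the launch might look something like this. Only --cluster is confirmed above; the input and library parameter names here are placeholders, so check the params defaults at the top of RepeatMasker_Nextflow.nf for the real ones:

nextflow run RepeatMasker_Nextflow.nf \
    --cluster myclustername \
    --inputSequence snake_genome.fa \
    --library snake_TE_library.fa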

@manighanipoor
Author

Thanks for your help.

@manighanipoor
Author

You mentioned:
With any cluster I would recommend making sure you are running on a local disk (local to the machine) for speed

Just wondering, how can we run it on a local disk on the cluster?

@rmhubley
Member

rmhubley commented Oct 3, 2024

This depends on your cluster's architecture. Most often the individual compute nodes have a hard drive (or SSD) attached, though the administrators may have decided not to make it accessible to jobs running on the node. On many clusters, the admins have set up a scratch area on those local drives where files can be copied and processes can create temporary files more efficiently than over NFS. If your cluster supports this, the Nextflow script can take advantage of it.
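
A sketch of what that can look like in a SLURM job script (the node-local path is cluster-specific; $TMPDIR here is an assumption, and the file names are placeholders):

#!/bin/bash
#SBATCH --cpus-per-task=20
# Stage inputs onto node-local scratch, run there, then copy the results back.
# $TMPDIR is an assumption; the node-local path may be /tmp, /scratch/local,
# or something else entirely on your cluster, so ask your admins.
WORKDIR="${TMPDIR:-/tmp}/rm_${SLURM_JOB_ID}"
mkdir -p "$WORKDIR"
cp snake_chunk.fa snake_TE_library.fa "$WORKDIR"
cd "$WORKDIR"
RepeatMasker -pa 5 -lib snake_TE_library.fa snake_chunk.fa
# RepeatMasker writes its outputs (.out, .masked, .tbl, .cat.gz) next to the input.
cp snake_chunk.fa.* "$SLURM_SUBMIT_DIR"/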

@manighanipoor
Author

Thanks for the comment
