Test DIAMOND #27

Open
LilyAnderssonLee opened this issue Aug 8, 2023 · 7 comments
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@LilyAnderssonLee

LilyAnderssonLee commented Aug 8, 2023

DIAMOND is a program for finding homologs of protein and DNA sequences in a reference database.

Run DIAMOND and compare with Kraken2 results.

TO DO:
1: build the protein database
2: Run diamond for clinical samples within the #196939

@LilyAnderssonLee
Author

LilyAnderssonLee commented Aug 8, 2023

Have built the DIAMOND DB based on refseq complete nonredundant protein sequences, ~87GB
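
For reference, a database of this kind is normally built with `diamond makedb`. The sketch below only assembles such a command line and does not run it. It is not the exact command used for this database, and all file names in the usage lines are hypothetical placeholders. The taxonomy options are included because DIAMOND needs them at build time for taxonomy-aware output.

```python
import shlex


def build_makedb_command(fasta, db, taxonmap, taxonnodes, taxonnames=None, threads=None):
    """Assemble a `diamond makedb` command line that embeds taxonomy information.

    All paths are supplied by the caller; nothing is executed here.
    """
    cmd = [
        "diamond", "makedb",
        "--in", fasta,              # protein FASTA input
        "--db", db,                 # output database name
        "--taxonmap", taxonmap,     # accession-to-taxid mapping file
        "--taxonnodes", taxonnodes, # NCBI taxonomy nodes.dmp
    ]
    if taxonnames:
        cmd += ["--taxonnames", taxonnames]  # NCBI taxonomy names.dmp
    if threads:
        cmd += ["--threads", str(threads)]
    return cmd


# Hypothetical file names, for illustration only.
command = build_makedb_command(
    fasta="refseq_protein.faa.gz",
    db="complete_nonredundant_protein_db",
    taxonmap="prot.accession2taxid.FULL.gz",
    taxonnodes="nodes.dmp",
    taxonnames="names.dmp",
    threads=16,
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the paths point at real files.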

@LilyAnderssonLee
Author

LilyAnderssonLee commented Aug 8, 2023

With --run_diamond turned on, the Taxprofiler process was terminated due to a lack of memory on the server.

I suspect this happened because DIAMOND runs in BLASTX mode; for some reason, we cannot run blastn/blastx on hasta when the reference is too large.

@sofstam We need to address this issue with scilifelab IT since Blast will be used in validating Taxprofiler results in the future.

@LilyAnderssonLee LilyAnderssonLee self-assigned this Aug 8, 2023
@LilyAnderssonLee LilyAnderssonLee added the bug and enhancement labels and then removed the bug label Aug 8, 2023
@LilyAnderssonLee
Author

LilyAnderssonLee commented Nov 17, 2023

☝️ Memory issue has been resolved.

Some error messages from the tests:

  • Diamond UPPMAX database doesn't work.

    • (Error: Options require taxonomy information included in the database. Please use the respective options to build this information into the database when running diamond makedb: taxonomy mapping information (--taxonmap option), taxonomy nodes information (--taxonnodes option))
  • Diamond version 2.1.8 has an error.

    • Error: Loading query sequences... Error: Unequal number of sequences in paired read files.
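
To rule out a genuine mismatch in the input before blaming the DIAMOND version, the read counts of the two files can be compared directly. This is a minimal sketch that assumes standard four-line FASTQ records and treats a `.gz` suffix as gzip compression. The function names are mine and not part of DIAMOND or taxprofiler.

```python
import gzip
from pathlib import Path


def count_fastq_records(path):
    """Count records in a FASTQ file, assuming exactly 4 lines per record."""
    path = Path(path)
    # Pick a gzip-aware opener for compressed files, the builtin otherwise.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def paired_counts_match(read1, read2):
    """Return (is_equal, n_read1, n_read2) for a pair of FASTQ files."""
    n1 = count_fastq_records(read1)
    n2 = count_fastq_records(read2)
    return n1 == n2, n1, n2
```

If the counts are equal and version 2.1.8 still reports unequal paired files, the problem is more likely in how that version is invoked or reads its input than in the data itself. That is a guess; the thread does not establish the cause.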

Conclusions from standalone tests. Database: mentioned above (complete_nonredundant_protein_db). Diamond version 2.0.15 (the same version as the one in nf-core/taxprofiler v1.1.0) works fine.

Conclusions from nextflow run nf-core/taxprofiler:

  • The DIAMOND_BLASTX process was killed because it hit the max time limit. DIAMOND_BLASTX is labeled as process_medium, so we should increase its CPU, memory, and time limits.
  • To do this, clone the taxprofiler repo and modify the DIAMOND_BLASTX settings in base.config:

withName: 'DIAMOND_BLASTX' {
    cpus = { check_max( 36 * task.attempt, 'cpus' ) }
    memory = { check_max( 120.GB * task.attempt, 'memory' ) }
    time = { check_max( 72.h * task.attempt, 'time' ) }
}
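
As I understand the nf-core template, `check_max` caps each request at the pipeline-wide maximum (`params.max_cpus`, `params.max_memory`, `params.max_time`), and `task.attempt` is 1 on the first try and increases on each retry. The Python sketch below mimics that arithmetic for the values above. The caps are hypothetical, and the real helper does more (for example, type handling) than this `min`.

```python
def check_max(requested, cap):
    """Cap a requested resource at the configured maximum (simplified sketch)."""
    return min(requested, cap)


# Hypothetical caps standing in for params.max_cpus / max_memory / max_time.
MAX_CPUS = 36
MAX_MEMORY_GB = 180
MAX_TIME_H = 120


def diamond_blastx_resources(attempt):
    """Resources requested by the config above for a given task attempt."""
    return {
        "cpus": check_max(36 * attempt, MAX_CPUS),
        "memory_gb": check_max(120 * attempt, MAX_MEMORY_GB),
        "time_h": check_max(72 * attempt, MAX_TIME_H),
    }


for attempt in (1, 2):
    print(attempt, diamond_blastx_resources(attempt))
```

With caps like these, a retry cannot ask for more than the server allows, so the first-attempt values have to be large enough for routine samples.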

The time this process takes depends on the size of the input files. In most of our routine cases, the unmapped reads from Bowtie2/align are smaller than 2.5 GB.
Here are reference timings from my tests (read1/read2 of one sample):

  • ~13 GB: 2 d 15 h
  • ~2.5 GB: 6 h
  • ~9 GB: 24 h

The running time also increases with the size of the database. For instance, it takes about 28 h for read1/read2 of 2.5 GB using the refseq protein data.
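
The reported timings can be turned into rough hours-per-GB figures and compared against the 72 h `time` value in the DIAMOND_BLASTX config. This is only arithmetic on the numbers in this comment; the helper names are mine, and the sizes are taken as stated without assuming how read1 and read2 are split.

```python
# Timings as reported in this thread: (label, input size in GB, wall time in hours).
# "2 d 15 h" is converted to 2 * 24 + 15 = 63 h.
REPORTED_RUNS = [
    ("~13 GB", 13.0, 2 * 24 + 15),
    ("~2.5 GB", 2.5, 6),
    ("~9 GB", 9.0, 24),
    ("~2.5 GB, refseq protein data", 2.5, 28),
]

TIME_LIMIT_H = 72  # `time` requested for DIAMOND_BLASTX on the first attempt


def hours_per_gb(size_gb, hours):
    """Wall-clock hours spent per GB of input."""
    return hours / size_gb


def fits_time_limit(hours, limit_h=TIME_LIMIT_H):
    """True if a run of this length finishes within the time limit."""
    return hours <= limit_h


for label, size_gb, hours in REPORTED_RUNS:
    print(
        f"{label}: {hours} h, {hours_per_gb(size_gb, hours):.2f} h/GB, "
        f"within {TIME_LIMIT_H} h limit: {fits_time_limit(hours)}"
    )
```

On these figures the cost per GB is not constant (about 2.4 h/GB at 2.5 GB versus about 4.8 h/GB at 13 GB), so a linear extrapolation to larger inputs would likely underestimate the runtime.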

@sofstam

sofstam commented Nov 17, 2023

Shall we update this config and ask in taxprofiler to change the label of the process?

@LilyAnderssonLee
Author

LilyAnderssonLee commented Nov 17, 2023

I plan to discuss that in the Slack channel once I finish all tests.

Yes, for us, we need to update the above config.

@sofstam

sofstam commented Nov 17, 2023

Sounds great!

@LilyAnderssonLee
Author

LilyAnderssonLee commented Nov 23, 2023

So, from a practical point of view, we should use a complete non-redundant protein database.
Update the config by adding these lines:

process {
    withName: 'DIAMOND_BLASTX' {
        cpus = { check_max( 36 * task.attempt, 'cpus' ) }
        memory = { check_max( 72.GB * task.attempt, 'memory' ) }
        time = { check_max( 72.h * task.attempt, 'time' ) }
    }
}

@sofstam What do you think?

@sofstam sofstam added this to the Release 2 milestone Jan 16, 2024

2 participants