Test DIAMOND #27

LilyAnderssonLee · 2023-08-08T12:51:59Z

DIAMOND is a program for finding homologs of protein and DNA sequences in a reference database.

Run DIAMOND and compare with Kraken2 results.

TO DO:
1. Build the protein database
2. Run DIAMOND on the clinical samples within #196939

LilyAnderssonLee · 2023-08-08T13:20:34Z

I have built the DIAMOND database from the RefSeq complete non-redundant protein sequences (~87 GB).
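The exact build command is not recorded in this thread. As a rough sketch, a taxonomy-aware database could be built with `diamond makedb` using its `--taxonmap`, `--taxonnodes`, and `--taxonnames` options. The helper below only assembles the command line; the function name and every file name in the usage are hypothetical placeholders, not the paths used on hasta.

```python
import shlex


def build_makedb_command(fasta, db, taxonmap, taxonnodes, taxonnames=None, threads=None):
    """Assemble a `diamond makedb` command that embeds taxonomy information.

    All paths are supplied by the caller; nothing here is specific to hasta.
    """
    cmd = [
        "diamond", "makedb",
        "--in", fasta,              # protein FASTA to index
        "--db", db,                 # output database name
        "--taxonmap", taxonmap,     # accession-to-taxid mapping file
        "--taxonnodes", taxonnodes, # NCBI taxonomy nodes.dmp
    ]
    if taxonnames is not None:
        cmd += ["--taxonnames", taxonnames]  # NCBI taxonomy names.dmp
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


# Hypothetical file names, for illustration only.
example = build_makedb_command(
    fasta="complete_nonredundant_protein.faa.gz",
    db="complete_nonredundant_protein_db",
    taxonmap="prot.accession2taxid.FULL.gz",
    taxonnodes="nodes.dmp",
    taxonnames="names.dmp",
    threads=16,
)
print(shlex.join(example))
```

The printed command can be pasted into a job script, or the list can be passed to `subprocess.run(example, check=True)` on a node where DIAMOND is installed.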

LilyAnderssonLee · 2023-08-08T13:30:40Z

The Taxprofiler process was terminated when --run_diamond was turned on due to a lack of memory on the server.

I suspect this happened because DIAMOND runs in BLASTX mode; for some reason, we cannot run blastn/blastx on hasta when the reference is too large.

@sofstam We need to address this issue with SciLifeLab IT, since BLAST will be used to validate Taxprofiler results in the future.

LilyAnderssonLee · 2023-11-17T09:17:25Z

☝️ Memory issue has been resolved.

Some error messages from the tests:

- The DIAMOND database on UPPMAX doesn't work:
  Error: Options require taxonomy information included in the database. Please use the respective options to build this information into the database when running diamond makedb: taxonomy mapping information (--taxonmap option), taxonomy nodes information (--taxonnodes option)
- DIAMOND version 2.1.8 fails while loading the query sequences:
  Error: Unequal number of sequences in paired read files.
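Before attributing the paired-read error to the DIAMOND version, one way to rule out a genuine mismatch in the inputs is to count the records in both FASTQ files. The following is a minimal sketch, assuming standard four-line FASTQ records and optional gzip compression; the function names are illustrative.

```python
import gzip


def count_fastq_records(path):
    """Count records in a FASTQ file, assuming exactly 4 lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def paired_counts_match(read1, read2):
    """Return (match, n_read1, n_read2) for a pair of FASTQ files."""
    n1 = count_fastq_records(read1)
    n2 = count_fastq_records(read2)
    return n1 == n2, n1, n2
```

If `paired_counts_match("sample_R1.fastq.gz", "sample_R2.fastq.gz")` reports equal counts (the file names here are placeholders), the error is more likely related to how that DIAMOND version reads the input than to the data itself.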

Conclusions from the standalone tests (database: complete_nonredundant_protein_db, mentioned above):

DIAMOND version 2.0.15 (the same version as the one in nf-core/taxprofiler v1.1.0) works fine.

DIAMOND takes significantly longer than Kraken2, approximately 10 times as long, as stated in the paper Benchmarking Metagenomics Tools for Taxonomic Classification.

Conclusions from nextflow run nf-core/taxprofiler:

The DIAMOND_BLASTX process was killed because it reached the max time limit. DIAMOND_BLASTX is labeled as process_medium, so we should increase its max CPU, memory, and time.
To do this, clone the taxprofiler repo and modify the DIAMOND_BLASTX settings in base.config:

withName: 'DIAMOND_BLASTX' {
    cpus   = { check_max( 36     * task.attempt, 'cpus'   ) }
    memory = { check_max( 120.GB * task.attempt, 'memory' ) }
    time   = { check_max( 72.h   * task.attempt, 'time'   ) }
}
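To illustrate how these requests behave across retries, here is a simplified Python sketch of the arithmetic. It assumes that `check_max` caps each request at the pipeline-wide limits (`--max_cpus`, `--max_memory`, `--max_time`), as in the nf-core template; the limit values below are illustrative placeholders, not the settings on hasta.

```python
def check_max(requested, limit):
    """Simplified stand-in for the nf-core helper: cap a request at the limit."""
    return min(requested, limit)


# Base requests taken from the DIAMOND_BLASTX block above.
BASE = {"cpus": 36, "memory_gb": 120, "time_h": 72}

# ASSUMPTION: placeholder limits standing in for --max_cpus, --max_memory
# and --max_time. Replace them with the values configured on your cluster.
LIMITS = {"cpus": 16, "memory_gb": 128, "time_h": 240}


def resources_for_attempt(attempt, base=BASE, limits=LIMITS):
    """Resources actually granted on a given attempt (1 = first run)."""
    return {key: check_max(value * attempt, limits[key]) for key, value in base.items()}


for attempt in (1, 2, 3):
    print(attempt, resources_for_attempt(attempt))
```

With limits like these, the 36-CPU request would be capped from the first attempt onward, so raising the per-process values only helps if the pipeline-wide maximums are raised as well.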

The time taken by this process depends on the size of the input files. In most of our routine cases, the unmapped reads from Bowtie2/align are smaller than 2.5 GB.
Here are the timings from my tests (size of read1/read2 for one sample: run time):

- ~2.5 GB: 6 h
- ~9 GB: 24 h
- ~13 GB: 2 d 15 h

The running time also increases with the size of the database. For instance, it takes about 28 h for read1/read2 of 2.5 GB using the RefSeq protein data.
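As a rough check of these numbers against the 72 h first-attempt limit, the sketch below converts the three reported timings into hours per GB. It uses only the figures quoted above (2 d 15 h = 63 h); the variable and function names are illustrative.

```python
# (input size of read1/read2 in GB, wall time in hours) from the tests above.
TIMINGS = [(2.5, 6.0), (9.0, 24.0), (13.0, 63.0)]

# First-attempt time limit from the DIAMOND_BLASTX config (72.h).
TIME_LIMIT_H = 72.0


def hours_per_gb(timings):
    """Return (size_gb, hours, hours_per_gb) for each reported run."""
    return [(size, hours, hours / size) for size, hours in timings]


for size, hours, rate in hours_per_gb(TIMINGS):
    fits = "within" if hours <= TIME_LIMIT_H else "over"
    print(f"{size:>5.1f} GB: {hours:>4.0f} h total, {rate:.2f} h/GB ({fits} the limit)")
```

In these three data points the hours per GB rise with input size (from 2.4 to about 4.8), so extrapolating linearly from small inputs may underestimate the run time of large samples. Three runs are too few to fit a reliable model.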

sofstam · 2023-11-17T09:22:59Z

Shall we update this config and ask in taxprofiler to change the label of the process?

LilyAnderssonLee · 2023-11-17T09:33:09Z

I plan to discuss that in the Slack channel once I finish all tests.

Yes, for us, we need to update the above config.

sofstam · 2023-11-17T09:33:44Z

Sounds great!

LilyAnderssonLee · 2023-11-23T11:52:46Z

So, from a practical point of view, we should use a complete non-redundant protein database.
We should also update the config by adding these lines:

process {
    withName: 'DIAMOND_BLASTX' {
        cpus   = { check_max( 36    * task.attempt, 'cpus'   ) }
        memory = { check_max( 72.GB * task.attempt, 'memory' ) }
        time   = { check_max( 72.h  * task.attempt, 'time'   ) }
    }
}

@sofstam What do you think?

LilyAnderssonLee self-assigned this Aug 8, 2023

LilyAnderssonLee added the bug and enhancement labels and removed the bug label Aug 8, 2023

sofstam added this to the Release 2 milestone Jan 16, 2024
