Merge pull request #477 from nf-core/improve-database-handling
Standardises GTDB execution and allows pre-uncompressed GTDB input
jfy133 authored Aug 10, 2023
2 parents 2627a90 + e6a2c71 commit b004f03
Showing 20 changed files with 73 additions and 46 deletions.
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -11,8 +11,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#422](https://github.com/nf-core/mag/pull/422) - Adds support for normalization of read depth with BBNorm (added by @erikrikarddaniel and @fabianegli)
- [#439](https://github.com/nf-core/mag/pull/439) - Adds ability to enter the pipeline at the binning stage by providing a CSV of pre-computed assemblies (by @prototaxites)
- [#459](https://github.com/nf-core/mag/pull/459) - Adds ability to skip damage correction step in the ancient DNA workflow and just run pyDamage (by @jfy133)
- [#364](https://github.com/nf-core/mag/pull/364) - Added geNomad nf-core modules for identifying viruses in assemblies (by @PhilPalmer and @CarsonJM)
- [#364](https://github.com/nf-core/mag/pull/364) - Adds geNomad nf-core modules for identifying viruses in assemblies (by @PhilPalmer and @CarsonJM)
- [#481](https://github.com/nf-core/mag/pull/481) - Adds MetaEuk for annotation of eukaryotic MAGs, and MMSeqs2 to enable downloading databases for MetaEuk (by @prototaxites)
- [#437](https://github.com/nf-core/mag/pull/429) - `--gtdb_db` now also accepts a directory containing a pre-uncompressed GTDB archive (reported by @alneberg, fix by @jfy133)

### `Changed`

@@ -22,6 +23,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#442](https://github.com/nf-core/mag/pull/442) - Remove warning when BUSCO finds no genes in bins, as this can be expected in some datasets (reported by @Lumimar, fix by @jfy133).
- [#444](https://github.com/nf-core/mag/pull/444) - Moved BUSCO bash code to script (by @jfy133)
- [#428](https://github.com/nf-core/mag/pull/429) - Update to nf-core 2.9 `TEMPLATE` (by @jfy133)
- [#437](https://github.com/nf-core/mag/pull/429) - `--gtdb` parameter is split into `--skip_gtdbtk` and `--gtdb_db` to allow finer control over GTDB database retrieval (fix by @jfy133)
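
For anyone updating an existing custom config, the parameter split described in the entry above could look roughly like the fragment below. This is an illustrative sketch only: the local path is a hypothetical placeholder, and only the parameter names `skip_gtdbtk` and `gtdb_db` come from this change.

```groovy
// Hypothetical user config illustrating the parameter split (the path is a placeholder)
params {
    // Before this change, one parameter did both jobs:
    //   gtdb = false            -> GTDB-Tk disabled
    //   gtdb = '<path or URL>'  -> location of the database archive

    // After this change, skipping and database location are separate.
    skip_gtdbtk = false                        // set to true to skip GTDB-Tk entirely
    gtdb_db     = '/path/to/gtdbtk_r214_data'  // assumed local, already-extracted directory
}
```

Leaving `gtdb_db` unset should fall back to the default archive URL set in `nextflow.config`.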

### `Fixed`

2 changes: 1 addition & 1 deletion conf/test.config
@@ -28,6 +28,6 @@ params {
max_unbinned_contigs = 2
busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
busco_clean = true
gtdb = false
skip_gtdbtk = true
skip_concoct = true
}
2 changes: 1 addition & 1 deletion conf/test_adapterremoval.config
@@ -28,7 +28,7 @@ params {
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
gtdb = false
skip_gtdbtk = true
clip_tool = 'adapterremoval'
skip_concoct = true
bin_domain_classification = true
2 changes: 1 addition & 1 deletion conf/test_ancient_dna.config
@@ -27,7 +27,7 @@ params {
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
gtdb = false
skip_gtdbtk = true
ancient_dna = true
binning_map_mode = 'own'
skip_spades = false
2 changes: 1 addition & 1 deletion conf/test_bbnorm.config
@@ -34,7 +34,7 @@ params {
max_unbinned_contigs = 2
busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
busco_clean = true
gtdb = false
skip_gtdbtk = true
bbnorm = true
coassemble_group = true
}
2 changes: 1 addition & 1 deletion conf/test_binrefinement.config
@@ -28,7 +28,7 @@ params {
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
gtdb = false
skip_gtdbtk = true
refine_bins_dastool = true
refine_bins_dastool_threshold = 0
postbinning_input = 'both'
2 changes: 1 addition & 1 deletion conf/test_busco_auto.config
@@ -24,7 +24,7 @@ params {
skip_spades = true
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
gtdb = false
skip_gtdbtk = true
skip_prokka = true
skip_prodigal = true
skip_quast = true
2 changes: 1 addition & 1 deletion conf/test_full.config
@@ -22,7 +22,7 @@ params {
centrifuge_db = "s3://ngi-igenomes/test-data/mag/p_compressed+h+v.tar.gz"
kraken2_db = "s3://ngi-igenomes/test-data/mag/minikraken_8GB_202003.tgz"
cat_db = "s3://ngi-igenomes/test-data/mag/CAT_prepare_20210107.tar.gz"
gtdb = "s3://ngi-igenomes/test-data/mag/gtdbtk_r202_data.tar.gz"
gtdb_db = "s3://ngi-igenomes/test-data/mag/gtdbtk_r202_data.tar.gz"

// reproducibility options for assembly
spades_fix_cpus = 10
2 changes: 1 addition & 1 deletion conf/test_host_rm.config
@@ -25,6 +25,6 @@ params {
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
gtdb = false
skip_gtdbtk = true
skip_concoct = true
}
2 changes: 1 addition & 1 deletion conf/test_hybrid.config
@@ -24,6 +24,6 @@ params {
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
busco_reference = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz"
gtdb = false
skip_gtdbtk = true
skip_concoct = true
}
1 change: 1 addition & 0 deletions conf/test_hybrid_host_rm.config
@@ -26,4 +26,5 @@ params {
max_unbinned_contigs = 2
skip_binqc = true
skip_concoct = true
skip_gtdbtk = true
}
2 changes: 1 addition & 1 deletion conf/test_nothing.config
@@ -38,6 +38,6 @@ params {
skip_concoct = true
skip_prokka = true
skip_binqc = true
gtdb = false
skip_gtdbtk = true
skip_concoct = true
}
2 changes: 1 addition & 1 deletion conf/test_virus_identification.config
@@ -27,7 +27,7 @@ params {
// For computational efficiency
reads_minlength = 150
coassemble_group = true
gtdb = false
skip_gtdbtk = true
skip_binning = true
skip_prokka = true
skip_spades = true
4 changes: 2 additions & 2 deletions lib/WorkflowMag.groovy
@@ -119,8 +119,8 @@ class WorkflowMag {
Nextflow.error('Both --busco_auto_lineage_prok and --busco_reference are specified! Invalid combination, please specify either --busco_auto_lineage_prok or --busco_reference.')
}

if (params.skip_binqc && params.gtdb) {
log.warn '--skip_binqc and --gtdb are specified! GTDB-tk will be omitted because GTDB-tk bin classification requires bin filtering based on BUSCO or CheckM QC results to avoid GTDB-tk errors.'
if (params.skip_binqc && !params.skip_gtdbtk) {
log.warn '--skip_binqc is specified, but --skip_gtdbtk is not! GTDB-tk will be omitted because GTDB-tk bin classification requires bin filtering based on BUSCO or CheckM QC results to avoid GTDB-tk errors.'
}

// Check if CAT parameters are valid
2 changes: 1 addition & 1 deletion modules/local/gtdbtk_db_preparation.nf
@@ -10,7 +10,7 @@ process GTDBTK_DB_PREPARATION {
path(database)

output:
tuple val("${database.toString().replace(".tar.gz", "")}"), path("database/*")
tuple val("${database.toString().replace(".tar.gz", "")}"), path("database/*"), emit: db

script:
"""
3 changes: 2 additions & 1 deletion nextflow.config
@@ -84,7 +84,8 @@ params {
cat_db_generate = false
cat_official_taxonomy = false
save_cat_db = false
gtdb = "https://data.ace.uq.edu.au/public/gtdb/data/releases/release202/202.0/auxillary_files/gtdbtk_r202_data.tar.gz"
skip_gtdbtk = false
gtdb_db = "https://data.ace.uq.edu.au/public/gtdb/data/releases/release214/214.1/auxillary_files/gtdbtk_r214_data.tar.gz"
gtdbtk_min_completeness = 50.0
gtdbtk_max_contamination = 10.0
gtdbtk_min_perc_aa = 10
12 changes: 8 additions & 4 deletions nextflow_schema.json
@@ -511,11 +511,15 @@
"type": "boolean",
"description": "Only return official taxonomic ranks (Kingdom, Phylum, etc.) when running CAT."
},
"gtdb": {
"skip_gtdbtk": {
"type": "boolean",
"description": "Skip the running of GTDB, as well as the automatic download of the database",
"default": "false"
},
"gtdb_db": {
"type": "string",
"default": "https://data.gtdb.ecogenomic.org/releases/release202/202.0/auxillary_files/gtdbtk_r202_data.tar.gz",
"description": "GTDB database for taxonomic classification of bins with GTDB-tk.",
"help_text": "For information which GTDB reference databases are compatible with the used GTDB-tk version see https://ecogenomics.github.io/GTDBTk/installing/index.html#gtdb-tk-reference-data."
"description": "Specify the location of a GTDBTK database. Can be either an uncompressed directory or a `.tar.gz` archive. If not specified will be downloaded for you when GTDBTK or binning QC is not skipped.",
"default": "https://data.ace.uq.edu.au/public/gtdb/data/releases/release214/214.1/auxillary_files/gtdbtk_r214_data.tar.gz"
},
"gtdbtk_min_completeness": {
"type": "number",
4 changes: 2 additions & 2 deletions subworkflows/local/binning.nf
@@ -130,9 +130,9 @@ workflow BINNING {
ch_versions = ch_versions.mix(GUNZIP_UNBINS.out.versions.first())

emit:
bins = ch_binning_results_gunzipped.dump(tag: "ch_binning_results_gunzipped")
bins = ch_binning_results_gunzipped
bins_gz = ch_binning_results_gzipped_final
unbinned = ch_splitfasta_results_gunzipped.dump(tag: "ch_splitfasta_results_gunzipped")
unbinned = ch_splitfasta_results_gunzipped
unbinned_gz = SPLIT_FASTA.out.unbinned
metabat2depths = METABAT2_JGISUMMARIZEBAMCONTIGDEPTHS.out.depth
versions = ch_versions
18 changes: 16 additions & 2 deletions subworkflows/local/gtdbtk.nf
@@ -59,10 +59,24 @@ workflow GTDBTK {
return [it[0], it[1]]
}

GTDBTK_DB_PREPARATION ( gtdb )
if ( gtdb.extension == 'gz' ) {
// Expects to be tar.gz!
ch_db_for_gtdbtk = GTDBTK_DB_PREPARATION ( gtdb ).db
} else if ( gtdb.isDirectory() ) {
// Make up meta id to match expected channel cardinality for GTDBTK
ch_db_for_gtdbtk = Channel
.of(gtdb)
.map{
[ it.toString().split('/').last(), it ]
}
.collect()
} else {
error("Unsupported object given to --gtdb_db, database must be supplied as either a directory or a .tar.gz file!")
}

GTDBTK_CLASSIFYWF (
ch_filtered_bins.passed.groupTuple(),
GTDBTK_DB_PREPARATION.out
ch_db_for_gtdbtk
)

GTDBTK_SUMMARY (
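
The archive-or-directory branching added to `subworkflows/local/gtdbtk.nf` can be tried outside Nextflow with a plain Groovy sketch. The helper name `classifyGtdbInput` is hypothetical and not part of the pipeline. It also assumes a `.tar.gz` suffix check, which is stricter than the `extension == 'gz'` test in the diff.

```groovy
import java.nio.file.Files

// Hypothetical helper (not in nf-core/mag): decide how a --gtdb_db value would be handled.
// Assumption: archives are recognised by a '.tar.gz' suffix, stricter than the diff's 'gz' check.
String classifyGtdbInput(File db) {
    if (db.name.endsWith('.tar.gz')) {
        return 'archive'      // would go through GTDBTK_DB_PREPARATION for extraction
    } else if (db.isDirectory()) {
        return 'directory'    // would be used as-is, paired with an id taken from its name
    }
    throw new IllegalArgumentException(
        "Unsupported --gtdb_db input '${db}': expected a directory or a .tar.gz file")
}

// Usage: a temporary directory stands in for a pre-uncompressed database
File extracted = Files.createTempDirectory('gtdbtk_db').toFile()
println classifyGtdbInput(extracted)
println classifyGtdbInput(new File('gtdbtk_r214_data.tar.gz'))
```

Checking the suffix before the directory test keeps the same order as the diff, so anything that looks like an archive is always sent for extraction.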
49 changes: 27 additions & 22 deletions workflows/mag.nf
@@ -31,7 +31,7 @@ log.info logo + paramsSummaryLog(workflow) + citation
WorkflowMag.initialise(params, log, hybrid)

// Check input path parameters to see if they exist
def checkPathParamList = [ params.input, params.multiqc_config, params.phix_reference, params.host_fasta, params.centrifuge_db, params.kraken2_db, params.cat_db, params.gtdb, params.lambda_reference, params.busco_reference ]
def checkPathParamList = [ params.input, params.multiqc_config, params.phix_reference, params.host_fasta, params.centrifuge_db, params.kraken2_db, params.cat_db, params.gtdb_db, params.lambda_reference, params.busco_reference ]
for (param in checkPathParamList) { if (param) { file(param, checkIfExists: true) } }

/*
@@ -205,13 +205,12 @@ if (params.genomad_db){
ch_genomad_db = Channel.empty()
}

gtdb = params.skip_binqc ? false : params.gtdb
gtdb = ( params.skip_binqc || params.skip_gtdbtk ) ? false : params.gtdb_db

if (gtdb) {
ch_gtdb = Channel
.value(file( "${gtdb}" ))
gtdb = file( "${gtdb}", checkIfExists: true)
} else {
ch_gtdb = Channel.empty()
gtdb = []
}

if(params.metaeuk_db && !params.skip_metaeuk) {
@@ -720,12 +719,12 @@ workflow MAG {


} else {
ch_binning_results_bins = BINNING.out.bins.dump(tag: 'BINNING.out.bins')
ch_binning_results_bins = BINNING.out.bins
.map { meta, bins ->
def meta_new = meta + [domain: 'unclassified']
[meta_new, bins]
}
ch_binning_results_unbins = BINNING.out.unbinned.dump(tag: 'BINNING.out.unbins')
ch_binning_results_unbins = BINNING.out.unbinned
.map { meta, bins ->
def meta_new = meta + [domain: 'unclassified']
[meta_new, bins]
@@ -877,25 +876,31 @@ workflow MAG {
/*
* GTDB-tk: taxonomic classifications using GTDB reference
*/
ch_gtdbtk_summary = Channel.empty()
if ( gtdb ){

ch_gtdb_bins = ch_input_for_postbinning_bins_unbins
.filter { meta, bins ->
meta.domain != "eukarya"
}
if ( !params.skip_gtdbtk ) {

GTDBTK (
ch_gtdb_bins,
ch_busco_summary,
ch_checkm_summary,
ch_gtdb
)
ch_versions = ch_versions.mix(GTDBTK.out.versions.first())
ch_gtdbtk_summary = GTDBTK.out.summary
ch_gtdbtk_summary = Channel.empty()
if ( gtdb ){

ch_gtdb_bins = ch_input_for_postbinning_bins_unbins
.filter { meta, bins ->
meta.domain != "eukarya"
}

GTDBTK (
ch_gtdb_bins,
ch_busco_summary,
ch_checkm_summary,
gtdb
)
ch_versions = ch_versions.mix(GTDBTK.out.versions.first())
ch_gtdbtk_summary = GTDBTK.out.summary
}
} else {
ch_gtdbtk_summary = Channel.empty()
}

if ( ( !params.skip_binqc ) || !params.skip_quast || gtdb){
if ( ( !params.skip_binqc ) || !params.skip_quast || !params.skip_gtdbtk){
BIN_SUMMARY (
ch_input_for_binsummary,
ch_busco_summary.ifEmpty([]),
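
How the two skip flags interact with `--gtdb_db` in `workflows/mag.nf` can be summarised with a small plain-Groovy sketch. The function `resolveGtdbDb` and the example paths are hypothetical. Only the ternary expression mirrors the diff.

```groovy
// Hypothetical helper mirroring the ternary in the diff: GTDB-Tk only receives a
// database when neither bin QC nor GTDB-Tk itself is skipped.
def resolveGtdbDb(Map params) {
    (params.skip_binqc || params.skip_gtdbtk) ? false : params.gtdb_db
}

// Placeholder parameter sets (paths are made up for illustration)
def runAll   = [skip_binqc: false, skip_gtdbtk: false, gtdb_db: '/refs/gtdbtk_r214']
def noBinQc  = [skip_binqc: true,  skip_gtdbtk: false, gtdb_db: '/refs/gtdbtk_r214']
def noGtdbtk = [skip_binqc: false, skip_gtdbtk: true,  gtdb_db: '/refs/gtdbtk_r214']

println resolveGtdbDb(runAll)    // database path is kept
println resolveGtdbDb(noBinQc)   // false: GTDB-Tk needs bin QC results for filtering
println resolveGtdbDb(noGtdbtk)  // false: GTDB-Tk explicitly skipped
```

Returning `false` keeps the sketch close to the diff, where a falsy `gtdb` is then replaced by an empty list.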
