Longread only functionality #718

Open
wants to merge 35 commits into
base: dev
Commits
5ffb327
Add longread only test config
muabnezor Nov 25, 2024
695da2b
Change samplesheet validation schema to allow for longread-only sampl…
muabnezor Nov 25, 2024
f67b6b3
Check if both short, and long reads are given
muabnezor Nov 25, 2024
bb78aab
fix validation schema for samplesheet
muabnezor Nov 25, 2024
992b01e
Merge remote-tracking branch 'upstream/dev' into longread_only
muabnezor Nov 26, 2024
bfd404c
Add separate wubworkflow for longread host removal
muabnezor Nov 26, 2024
387284d
Add longread meta-assemblers metaflye and metamdbg
muabnezor Nov 26, 2024
6948316
Add longread assemble config
muabnezor Nov 26, 2024
762f12b
Prepare binning for longread assemblies
muabnezor Nov 27, 2024
c7a5cdf
Fix config and how long reads are prepared for binning
muabnezor Nov 28, 2024
18ae00d
change jgi_summarize_bam_contig_depths --percentidentity default from…
muabnezor Nov 28, 2024
8a22939
Fix longread binning preparation from all combination of assemblies, …
muabnezor Nov 28, 2024
038a87b
format
muabnezor Nov 29, 2024
3ff8de2
Fix longread hostremoval, and fix --longread_percentidentiy parameter…
muabnezor Nov 29, 2024
e277290
Add test_longread to profiles, and ci testing
muabnezor Nov 29, 2024
75a36f2
Fix validation schema, and fix assembly channels for longread and sho…
muabnezor Nov 29, 2024
67d44d0
Change logic in samplesheet validation, ch_raw_short_reads should be …
muabnezor Nov 29, 2024
ca4d101
fix custom samtools view module
muabnezor Nov 29, 2024
cfd43e7
Merge branch 'dev' into longread_only
muabnezor Dec 5, 2024
6765ab1
Make sure filtlong works without short reads. the join operator shoul…
muabnezor Dec 6, 2024
e73975c
Make sure FILTLONG is not run when there are no long reads
muabnezor Dec 6, 2024
686fafe
Fix grouping logic for channels in longreads_binning_preparation subw…
muabnezor Dec 11, 2024
a0279fe
make assembly into subworkflow
muabnezor Dec 12, 2024
28f72af
Fix long read assembly input
muabnezor Dec 12, 2024
1c4b364
Update docs
muabnezor Dec 12, 2024
34067db
Update citations for tools
muabnezor Dec 12, 2024
26a0fe7
Fix versions channel in subworkflows
muabnezor Dec 13, 2024
d9564da
Fix bug when running with --keep_phix, make ch_phix_db_file empty Cha…
muabnezor Jan 7, 2025
9087f2f
Merge branch 'dev' into longread_only
muabnezor Jan 7, 2025
ba0f831
fix modules.config
muabnezor Jan 7, 2025
d46aa6d
Fix linting
muabnezor Jan 7, 2025
f90e5f5
Use nf-core official module for samtools fastq
muabnezor Jan 9, 2025
85c2e82
Change modules config
muabnezor Jan 9, 2025
7398ba3
change hybrid logic
muabnezor Jan 10, 2025
ae1953d
fix samplesheet validation
muabnezor Jan 10, 2025
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -109,6 +109,7 @@ jobs:
test_virus_identification,
test_single_end,
test_concoct,
test_longread,
]
steps:
- name: Free some space
23 changes: 16 additions & 7 deletions CHANGELOG.md
@@ -24,10 +24,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#707](https://github.com/nf-core/mag/pull/707) - Make Bin QC a subworkflow (added by @dialvarezs)
- [#707](https://github.com/nf-core/mag/pull/707) - Added CheckM2 as an alternative bin completeness and QC tool (added by @dialvarezs)
- [#708](https://github.com/nf-core/mag/pull/708) - Added `--exclude_unbins_from_postbinning` parameter to exclude unbinned contigs from post-binning processes, speeding up Prokka in some cases (added by @dialvarezs)
- [#718](https://github.com/nf-core/mag/pull/718) - Added metaMDBG and Flye as longread assemblers (added by @muabnezor)
- [#718](https://github.com/nf-core/mag/pull/718) - Added host removal for long reads using minimap2 as aligner (added by @muabnezor)
- [#732](https://github.com/nf-core/mag/pull/732) - Added support for Prokka's compliance mode with `--prokka_with_compliance --prokka_compliance_centre <xyz>` (reported by @audy and @Thomieh73, added by @jfy133)

### `Changed`

- [#718](https://github.com/nf-core/mag/pull/718) - Allow long-read-only input, so short reads are no longer required in the samplesheet (added by @muabnezor)
- [#731](https://github.com/nf-core/mag/pull/731) - Updated to nf-core 3.1.0 `TEMPLATE` (by @jfy133)

### `Fixed`
@@ -37,16 +40,22 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#716](https://github.com/nf-core/mag/pull/692) - Make short read processing a subworkflow (added by @muabnezor)
- [#708](https://github.com/nf-core/mag/pull/708) - Fixed channel passed as GUNC input (added by @dialvarezs)
- [#729](https://github.com/nf-core/mag/pull/729) - Fixed misspecified multi-FASTQ input for single-end data in MEGAHIT (reported by John Richards, fix by @jfy133)
- [#718](https://github.com/nf-core/mag/pull/718) - Refactored assembly into a subworkflow (added by @muabnezor)

### `Dependencies`

| Tool | Previous version | New version |
| ------- | ---------------- | ----------- |
| CheckM | 1.2.1 | 1.2.3 |
| CheckM2 | | 1.0.2 |
| chopper | | 0.9.0 |
| GUNC | 1.0.5 | 1.0.6 |
| nanoq | | 0.10.0 |
| Tool | Previous version | New version |
| -------- | ---------------- | ----------- |
| chopper | | 0.9.0 |
| nanoq | | 0.10.0 |
| flye | | 2.9.5 |
| metamdbg | | 1.0 |
| minimap2 | | 2.28 |
| CheckM | 1.2.1 | 1.2.3 |
| CheckM2 | | 1.0.2 |
| chopper | | 0.9.0 |
| GUNC | 1.0.5 | 1.0.6 |
| nanoq | | 0.10.0 |

### `Deprecated`

12 changes: 12 additions & 0 deletions CITATIONS.md
@@ -66,6 +66,10 @@

- [Filtlong](https://github.com/rrwick/Filtlong)

- [Flye](https://www.nature.com/articles/s41592-020-00971-x)

> Kolmogorov, M., Bickhart, D.M., Behsaz, B. et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat Methods 17, 1103–1110 (2020). https://doi.org/10.1038/s41592-020-00971-x

- [Freebayes](https://arxiv.org/abs/1207.3907)

> Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN] 2012
@@ -106,6 +110,14 @@

> Levy Karin, E., Mirdita, M. & Söding, J. MetaEuk—sensitive, high-throughput gene discovery, and annotation for large-scale eukaryotic metagenomics. Microbiome 8, 48 (2020). 10.1186/s40168-020-00808-x

- [metaMDBG](https://www.nature.com/articles/s41587-023-01983-6)

> Benoit, G., Raguideau, S., James, R. et al. High-quality metagenome assembly from long accurate reads with metaMDBG. Nat Biotechnol 42, 1378–1383 (2024). https://doi.org/10.1038/s41587-023-01983-6

- [minimap2](https://academic.oup.com/bioinformatics/article/34/18/3094/4994778?login=true)

> Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics , 34(18), 3094–3100. https://doi.org/10.1093/bioinformatics/bty191

- [MMseqs2](https://www.nature.com/articles/nbt.3988)

> Steinegger, M., Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35, 1026–1028 (2017).10.1038/nbt.3988
6 changes: 3 additions & 3 deletions assets/schema_input.json
@@ -42,11 +42,11 @@
"pattern": "^\\S+\\.f(ast)?q\\.gz$"
}
},
"required": ["sample", "group", "short_reads_1"]
"required": ["sample", "group"],
"anyOf": [{ "required": ["short_reads_1"] }, { "required": ["long_reads"] }]
},
"uniqueEntries": ["sample", "run"],
"dependentRequired": {
"short_reads_2": ["short_reads_1"],
"long_reads": ["short_reads_1", "short_reads_2"]
"short_reads_2": ["short_reads_1"]
}
}
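A minimal sketch of what the revised row rules amount to, written in plain Python rather than with a JSON Schema library. The function name `check_row` and the dict-based row are illustrative assumptions, not part of the pipeline, and empty cells are assumed to be treated as absent when the samplesheet is loaded. Only the field names and the three rules (`required`, `anyOf`, `dependentRequired`) come from the schema in this diff.

```python
def check_row(row):
    """Return a list of rule violations for one samplesheet row.

    `row` maps column names to values; empty strings and None count as absent.
    Hypothetical helper that mirrors the schema rules, not the real validator.
    """
    present = {key for key, value in row.items() if value not in (None, "")}
    errors = []

    # "required": ["sample", "group"]
    for field in ("sample", "group"):
        if field not in present:
            errors.append(f"missing required field: {field}")

    # "anyOf": at least one of short_reads_1 / long_reads must be given
    if not ({"short_reads_1", "long_reads"} & present):
        errors.append("need short_reads_1 or long_reads")

    # "dependentRequired": short_reads_2 is only valid alongside short_reads_1
    if "short_reads_2" in present and "short_reads_1" not in present:
        errors.append("short_reads_2 requires short_reads_1")

    return errors
```

Under these rules a row with only `long_reads` passes, while a row with neither read type, or with `short_reads_2` but no `short_reads_1`, is rejected.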
72 changes: 69 additions & 3 deletions conf/modules.config
@@ -179,13 +179,13 @@ process {
}

withName: FILTLONG {
ext.args = [
ext.args = { [
"--min_length ${params.longreads_min_length}",
"--keep_percent ${params.longreads_keep_percent}",
"--trim",
shortreads ? "--trim" : "",
"--length_weight ${params.longreads_length_weight}",
params.longreads_min_quality ? "--min_mean_q ${params.longreads_min_quality}" : ''
].join(' ').trim()
].join(' ').trim() }
publishDir = [
path: { "${params.outdir}/QC_longreads/Filtlong" },
mode: params.publish_dir_mode,
@@ -286,6 +286,60 @@ process {
]
}

withName: MINIMAP2_HOST_INDEX {
ext.args = '-x map-ont'
// publishDir = [
// path: { "${params.outdir}/QC_longreads/minimap2/index" },
// mode: params.publish_dir_mode,
// pattern: '*.mmi',
// enabled: params.save_hostremoved_reads
// ]
}

withName: MINIMAP2_HOST_ALIGN {
ext.prefix = { "${meta.id}_run${meta.run}.host.minimap" }
publishDir = [
path: { "${params.outdir}/QC_longreads/minimap2/align" },
mode: params.publish_dir_mode,
pattern: "*.bam",
enabled: params.save_hostremoved_reads
]
}

withName: MINIMAP2_ASSEMBLY_ALIGN {
ext.prefix = { "${meta2.assembler}-${meta2.id}-${meta.id}" }
publishDir = [
path: { "${params.outdir}/Assembly/${meta2.assembler}/QC/${meta2.id}" },
mode: params.publish_dir_mode,
pattern: "*.{bam,bai}",
enabled: params.save_assembly_mapped_reads
]
}

withName: SAMTOOLS_HOSTREMOVED_VIEW {
ext.args = '-f 4'
ext.prefix = { "${meta.id}_${meta.run}.hostremoved" }
}

withName: SAMTOOLS_HOSTREMOVED_FASTQ {
ext.prefix = { "${meta.id}_${meta.run}.hostremoved" }
publishDir = [
path: { "${params.outdir}/QC_longreads/samtools/fastq" },
mode: params.publish_dir_mode,
pattern: '*_other.fastq.gz',
enabled: params.save_hostremoved_reads
]
}

withName: SAMTOOLS_HOSTREMOVED_STATS {
ext.prefix = { "${meta.id}_${meta.run}" }
publishDir = [
path: { "${params.outdir}/QC_longreads/samtools/stats" },
mode: params.publish_dir_mode,
pattern: '*stats'
]
}

withName: CENTRIFUGE_CENTRIFUGE {
publishDir = [path: { "${params.outdir}/Taxonomy/centrifuge/${meta.id}" }, mode: params.publish_dir_mode, pattern: "*.txt"]
}
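For readers unfamiliar with the `-f 4` argument given to `SAMTOOLS_HOSTREMOVED_VIEW` above: it keeps only records whose SAM FLAG has the "unmapped" bit (0x4) set, i.e. reads that did not align to the host reference. The sketch below imitates that filter on plain SAM text lines. The function name `keep_unmapped` is hypothetical, and header lines are simply dropped for brevity.

```python
UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped


def keep_unmapped(sam_lines):
    """Yield alignment records whose FLAG has the 0x4 bit set.

    Illustrative stand-in for `samtools view -f 4` on text SAM input.
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & UNMAPPED:
            yield line


records = [
    "@HD\tVN:1.6",
    "read1\t0\thost_chr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGCATTGCA\tIIIIIIIIII",
]
non_host = list(keep_unmapped(records))
```

Here only `read2` survives, which is the behaviour host removal relies on before the reads are converted back to FASTQ.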
@@ -330,6 +384,17 @@ process {
publishDir = [path: { "${params.outdir}/Assembly/SPAdesHybrid" }, mode: params.publish_dir_mode, pattern: "*.{fasta.gz,gfa.gz,fa.gz,log}"]
}

withName: FLYE {
ext.args = ' --meta'
ext.prefix = { "FLYE-${meta.id}" }
publishDir = [path: { "${params.outdir}/Assembly/FLYE" }, mode: params.publish_dir_mode, pattern: "*.{fasta.gz,gfa.gz,log}"]
}

withName: METAMDBG_ASM {
ext.prefix = { "METAMDBG-${meta.id}" }
publishDir = [path: { "${params.outdir}/Assembly/METAMDBG" }, mode: params.publish_dir_mode, pattern: "*.{fasta.gz,log}"]
}

withName: QUAST {
publishDir = [path: { "${params.outdir}/Assembly/${meta.assembler}/QC/${meta.id}" }, mode: params.publish_dir_mode, saveAs: { filename -> filename.equals('versions.yml') ? null : filename }]
}
@@ -580,6 +645,7 @@ process {
}

withName: METABAT2_JGISUMMARIZEBAMCONTIGDEPTHS {
ext.args = { meta.assembler in ['FLYE', 'METAMDBG'] ? "--percentIdentity ${params.longread_percentidentity}" : '' }
publishDir = [path: { "${params.outdir}/GenomeBinning/depths/contigs" }, mode: params.publish_dir_mode, pattern: '*-depth.txt.gz']
ext.prefix = { "${meta.assembler}-${meta.id}-depth" }
}
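The new `ext.args` closure for `METABAT2_JGISUMMARIZEBAMCONTIGDEPTHS` only passes `--percentIdentity` when the assembly came from a long-read assembler. A rough Python equivalent of that branch, for illustration only: `depth_args` and `LONG_READ_ASSEMBLERS` are made-up names, and the threshold is passed in explicitly because its default is not visible in this diff.

```python
LONG_READ_ASSEMBLERS = {"FLYE", "METAMDBG"}


def depth_args(assembler, longread_percentidentity):
    """Return extra arguments for jgi_summarize_bam_contig_depths.

    Long-read assemblies get an explicit --percentIdentity threshold;
    every other assembler keeps the tool's built-in default.
    """
    if assembler in LONG_READ_ASSEMBLERS:
        return f"--percentIdentity {longread_percentidentity}"
    return ""
```

Keying the decision on `meta.assembler` means short-read and hybrid assemblies in the same run are unaffected by `--longread_percentidentity`.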
34 changes: 34 additions & 0 deletions conf/test_longread.config
@@ -0,0 +1,34 @@
/*
========================================================================================
Nextflow config file for running minimal tests
========================================================================================
Defines input files and everything required to run a fast and simple pipeline test.

Use as follows:
nextflow run nf-core/mag -profile test_longread,<docker/singularity> --outdir <OUTDIR>

----------------------------------------------------------------------------------------
*/

// Limit resources so that this can run on GitHub Actions
process {
resourceLimits = [
cpus: 4,
memory: '15.GB',
time: '1.h'
]
}

params {
config_profile_name = 'Test profile'
config_profile_description = 'Minimal test dataset to check pipeline function'

// Input data
input = params.pipelines_testdata_base_path + 'mag/samplesheets/samplesheet.long_read.csv'
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
busco_db = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2024-01-08.tar.gz"
skip_gtdbtk = true
gtdbtk_min_completeness = 0.01
skip_concoct = true
}
31 changes: 31 additions & 0 deletions docs/output.md
@@ -263,6 +263,37 @@ SPAdesHybrid is a part of the [SPAdes](http://cab.spbu.ru/software/spades/) soft

</details>

</details>

### Flye

[Flye](https://github.com/mikolmogorov/Flye) assembler is optionally run on long reads.

<details markdown="1">
<summary>Output files</summary>

- `Assembly/FLYE/`
  - `[sample/group].assembly_graph.gfa.gz`: Compressed assembly graph in gfa format
  - `[sample/group].assembly.fa.gz`: Compressed assembled contigs in fasta format
  - `[sample/group].flye.log`: Log file
  - `QC/[sample/group]/`: Directory containing QUAST files

</details>

### metaMDBG

[metaMDBG](https://github.com/GaetanBenoitDev/metaMDBG) assembler is optionally run on long reads.

<details markdown="1">
<summary>Output files</summary>

- `Assembly/METAMDBG/`
  - `[sample/group].contigs.fa.gz`: Compressed assembled contigs in fasta format
  - `[sample/group].metaMDBG.log`: Log file
  - `QC/[sample/group]/`: Directory containing QUAST files

</details>

### Metagenome QC with QUAST

[QUAST](http://cab.spbu.ru/software/quast/) is a tool that evaluates metagenome assemblies by computing various metrics. The QUAST output is also included in the MultiQC report, as well as in the assembly directories themselves.
11 changes: 10 additions & 1 deletion docs/usage.md
@@ -43,14 +43,23 @@ sample2,0,0,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz,data/sample2.fastq
sample3,1,0,data/sample3_R1.fastq.gz,data/sample3_R2.fastq.gz,
```

If only long-read data is available, leave the `short_reads_1` and `short_reads_2` columns empty:

```csv title="samplesheet.csv"
sample,run,group,short_reads_1,short_reads_2,long_reads
sample1,1,0,,,data/sample1.fastq.gz
sample1,2,0,,,data/sample1.fastq.gz
sample2,0,0,,,data/sample2.fastq.gz
sample3,1,0,,,data/sample3.fastq.gz
```

Please note the following requirements:

- A minimum of 5 comma-separated columns
- Valid file extension: `.csv`
- Must contain the header `sample,group,short_reads_1,short_reads_2,long_reads` (where `run` can be optionally added)
- Run IDs must be unique within a multi-run sample. A sample with multiple runs will be automatically concatenated.
- FastQ files must be compressed (`.fastq.gz`, `.fq.gz`)
- `long_reads` can only be provided in combination with paired-end short read data
- Within one samplesheet either only single-end or only paired-end reads can be specified
- If single-end reads are specified, the command line parameter `--single_end` must be specified as well

40 changes: 40 additions & 0 deletions modules.json
@@ -127,6 +127,11 @@
"git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
"installed_by": ["modules"]
},
"flye": {
"branch": "master",
"git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
"installed_by": ["modules"]
},
"freebayes": {
"branch": "master",
"git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
@@ -207,6 +212,21 @@
"git_sha": "30d06da5bd7ae67be32758bf512cd75a4325d386",
"installed_by": ["modules"]
},
"metamdbg/asm": {
"branch": "master",
"git_sha": "7c08494acb5aba0763c5c6db87f82b249de87ea8",
"installed_by": ["modules"]
},
"minimap2/align": {
"branch": "master",
"git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
"installed_by": ["modules"]
},
"minimap2/index": {
"branch": "master",
"git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
"installed_by": ["modules"]
},
"mmseqs/databases": {
"branch": "master",
"git_sha": "699e078133f580548aeb43114f93ac29928c6143",
@@ -267,6 +287,26 @@
"git_sha": "fd742419940e01ba1c5ecb172c3e32ec840662fe",
"installed_by": ["modules"]
},
"samtools/fastq": {
"branch": "master",
"git_sha": "b13f07be4c508d6ff6312d354d09f2493243e208",
"installed_by": ["modules"]
},
"samtools/index": {
"branch": "master",
"git_sha": "b13f07be4c508d6ff6312d354d09f2493243e208",
"installed_by": ["modules"]
},
"samtools/stats": {
"branch": "master",
"git_sha": "2d20463181b1c38981a02e90d3084b5f9fa8d540",
"installed_by": ["modules"]
},
"samtools/view": {
"branch": "master",
"git_sha": "2d20463181b1c38981a02e90d3084b5f9fa8d540",
"installed_by": ["modules"]
},
"seqtk/mergepe": {
"branch": "master",
"git_sha": "911696ea0b62df80e900ef244d7867d177971f73",
5 changes: 5 additions & 0 deletions modules/nf-core/flye/environment.yml