diff --git a/README.md b/README.md
index be185169..06e1c8f5 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,6 @@
[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX)
-[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A522.10.1-23aa62.svg)](https://www.nextflow.io/)
+
+[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A521.10.3-23aa62.svg)](https://www.nextflow.io/)
[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
[![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
@@ -7,18 +8,252 @@
## Introduction
-**sanger-tol/genomeassembly** is a bioinformatics pipeline for a genome assembly for HiFi, Illumina 10x (optional), and HiC data. It performs the following steps: raw assembly, purging from haplotigs, optional polishing, and scaffolding.
+
+
+**sanger-tol/genomeassembly** is a bioinformatics pipeline for genome assembly from HiFi, Illumina 10x (optional), and HiC data. It performs the following steps: raw assembly, purging of haplotigs, optional polishing, and scaffolding.
+
+The initial assembly of the HiFi reads is performed with the [hifiasm](https://hifiasm.readthedocs.io) assembler in two modes: original and, optionally, with HiC data integration. The assembly is then purged of alternative haplotigs using [purge_dups](https://github.com/dfguan/purge_dups). An optional next step is polishing of the purged assembly using Illumina 10X read sequencing: the 10X reads are mapped to the full assembly (purged + haplotigs) with [Longranger](https://support.10xgenomics.com/genome-exome/software/pipelines/latest/what-is-long-ranger), and polishing is performed with [Freebayes](https://github.com/freebayes/freebayes). The HiC reads are then mapped with [bwamem2](https://github.com/bwa-mem2/bwa-mem2) to the primary contigs, which are scaffolded with [YaHS](https://github.com/c-zhou/yahs) using the provided HiC data.
+Polished and scaffolded assemblies are evaluated using [GFASTATS](https://github.com/vgl-hub/gfastats), [BUSCO](https://busco.ezlab.org/) and [MERQURY.FK](https://github.com/thegenemyers/MERQURY.FK).
The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from [nf-core/modules](https://github.com/nf-core/modules) in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!
-## Usage
+
+
+On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources.
+
+## Pipeline summary
+
+1. Parse and transform the input into the required data structures.
+2. Run hifiasm in original mode.
+3. Produce numerical stats, BUSCO score, QV and completeness metrics, and kmer spectra for [2].
+4. If the `hifiasm_hic_on` option is set, run hifiasm in HiC mode (see the example command after this list).
+5. If the `hifiasm_hic_on` option is set, produce numerical stats, BUSCO score, QV and completeness metrics, and kmer spectra for [4].
+6. Purge primary contigs from [2], treat the resulting primary contigs as the primary assembly.
+7. Take haplotigs from [6], merge them with haplotigs from [2] and purge, treat the resulting primary contigs as the assembly haplotigs.
+8. Produce numerical stats, BUSCO score, QV and completeness metrics, and kmer spectra for the primary contigs and haplotigs from [6] and [7].
+9. If the `polishing_on` option is set, map the provided 10X Illumina reads to the joined primary and alternative contigs.
+10. If the `polishing_on` option is set, polish the initial assembly based on the alignment produced in [9], then separate the polished primary contigs and haplotigs.
+11. If the `polishing_on` option is set, produce numerical stats, BUSCO score, QV and completeness metrics, and kmer spectra for [10].
+12. Map HiC data onto the primary contigs.
+13. Run scaffolding of the primary contigs based on the results of [12].
+14. Produce numerical stats, BUSCO score, QV and completeness metrics, and kmer spectra for [13].
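+
+As a sketch, the optional modes above can be toggled on the command line with the corresponding pipeline parameters (input path, output directory and profile below are placeholders):
+
+```bash
+# enable hifiasm HiC mode and polishing on top of the default steps
+nextflow run sanger-tol/genomeassembly \
+    --input assets/dataset.yaml \
+    --outdir <OUTDIR> \
+    --hifiasm_hic_on true \
+    --polishing_on true \
+    -profile singularity
+```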
+
+
+
+
+
+## Workflow output summary
+```bash
+test
+├── hifiasm
+│ ├── baUndUnlc1.asm.a_ctg.assembly_summary
+│ ├── baUndUnlc1.asm.a_ctg.fa
+│ ├── baUndUnlc1.asm.a_ctg.gfa
+│ ├── baUndUnlc1.asm.ec.bin
+│ ├── baUndUnlc1.asm.ovlp.reverse.bin
+│ ├── baUndUnlc1.asm.ovlp.source.bin
+│ ├── baUndUnlc1.asm.p_ctg.assembly_summary
+│ ├── baUndUnlc1.asm.p_ctg.fa
+│ ├── baUndUnlc1.asm.p_ctg.gfa
+│ ├── baUndUnlc1.asm.p_utg.gfa
+│ ├── baUndUnlc1.asm.r_utg.gfa
+│ ├── baUndUnlc1.p_ctg.bacteria_odb10.busco
+│ │ ├── baUndUnlc1-bacteria_odb10-busco
+│ │ ├── baUndUnlc1-bacteria_odb10-busco.batch_summary.txt
+│ │ ├── short_summary.specific.bacteria_odb10.baUndUnlc1.asm.p_ctg.fa.json
+│ │ └── short_summary.specific.bacteria_odb10.baUndUnlc1.asm.p_ctg.fa.txt
+│ ├── baUndUnlc1.p_ctg.ccs.merquryk
+│ │ ├── baUndUnlc1.baUndUnlc1.asm.a_ctg_only.bed
+│ │ ├── baUndUnlc1.baUndUnlc1.asm.a_ctg.qv
+│ │ ├── baUndUnlc1.baUndUnlc1.asm.a_ctg.spectra-cn.fl.png
+│ │ ├── baUndUnlc1.baUndUnlc1.asm.a_ctg.spectra-cn.ln.png
+│ │ ├── baUndUnlc1.baUndUnlc1.asm.a_ctg.spectra-cn.st.png
+│ │ ├── baUndUnlc1.baUndUnlc1.asm.p_ctg_only.bed
+│ │ ├── baUndUnlc1.baUndUnlc1.asm.p_ctg.qv
+│ │ ├── baUndUnlc1.baUndUnlc1.asm.p_ctg.spectra-cn.fl.png
+│ │ ├── baUndUnlc1.baUndUnlc1.asm.p_ctg.spectra-cn.ln.png
+│ │ ├── baUndUnlc1.baUndUnlc1.asm.p_ctg.spectra-cn.st.png
+│ │ ├── baUndUnlc1.completeness.stats
+│ │ ├── baUndUnlc1.qv
+│ │ ├── baUndUnlc1.spectra-asm.fl.png
+│ │ ├── baUndUnlc1.spectra-asm.ln.png
+│ │ └── baUndUnlc1.spectra-asm.st.png
+│ ├── polishing
+│ │ ├── baUndUnlc1
+│ │ ├── baUndUnlc1.consensus.fa
+│ │ ├── baUndUnlc1.polished.bacteria_odb10.busco
+│ │ ├── baUndUnlc1.polished.ccs.merquryk
+│ │ ├── chunks
+│ │ ├── haplotigs.assembly_summary
+│ │ ├── haplotigs.fa
+│ │ ├── merged.vcf.gz
+│ │ ├── merged.vcf.gz.tbi
+│ │ ├── primary.assembly_summary
+│ │ ├── primary.fa
+│ │ ├── refdata-baUndUnlc1
+│ │ └── vcf
+│ ├── purging
+│ │ ├── baUndUnlc1.purged.bacteria_odb10.busco
+│ │ ├── baUndUnlc1.purged.ccs.merquryk
+│ │ ├── coverage
+│ │ ├── coverage.htigs
+│ │ ├── purged.assembly_summary
+│ │ ├── purged.fa
+│ │ ├── purged.htigs.assembly_summary
+│ │ ├── purged.htigs.fa
+│ │ ├── purge_dups
+│ │ ├── purge_dups.htigs
+│ │ ├── seqs
+│ │ ├── seqs.htigs
+│ │ ├── split_aln
+│ │ └── split_aln.htigs
+│ └── scaffolding
+│ ├── baUndUnlc1.baUndUnlc1_scaffolds_final.ccs.merquryk
+│ ├── baUndUnlc1.cram.crai
+│ ├── baUndUnlc1.flagstat
+│ ├── baUndUnlc1.idxstats
+│ ├── baUndUnlc1.markdup.bam
+│ ├── baUndUnlc1.sorted.bed
+│ ├── baUndUnlc1.stats
+│ └── yahs
+├── hifiasm-hic
+│ ├── baUndUnlc1.asm.ec.bin
+│ ├── baUndUnlc1.asm.hic.a_ctg.assembly_summary
+│ ├── baUndUnlc1.asm.hic.a_ctg.fa
+│ ├── baUndUnlc1.asm.hic.a_ctg.gfa
+│ ├── baUndUnlc1.asm.hic.hap1.p_ctg.gfa
+│ ├── baUndUnlc1.asm.hic.hap2.p_ctg.gfa
+│ ├── baUndUnlc1.asm.hic.p_ctg.assembly_summary
+│ ├── baUndUnlc1.asm.hic.p_ctg.fa
+│ ├── baUndUnlc1.asm.hic.p_ctg.gfa
+│ ├── baUndUnlc1.asm.hic.p_utg.gfa
+│ ├── baUndUnlc1.asm.hic.r_utg.gfa
+│ ├── baUndUnlc1.asm.ovlp.reverse.bin
+│ └── baUndUnlc1.asm.ovlp.source.bin
+├── kmer
+│ ├── baUndUnlc1_fk.hist
+│ ├── baUndUnlc1_fk.ktab
+│ ├── baUndUnlc1.hist
+│ ├── baUndUnlc1_linear_plot.png
+│ ├── baUndUnlc1_log_plot.png
+│ ├── baUndUnlc1_model.txt
+│ ├── baUndUnlc1_summary.txt
+│ ├── baUndUnlc1_transformed_linear_plot.png
+│ └── baUndUnlc1_transformed_log_plot.png
+└── pipeline_info
+ ├── execution_report_2023-05-24_16-09-38.html
+ ├── execution_timeline_2023-05-24_16-09-38.html
+ ├── execution_trace_2023-05-24_16-00-56.txt
+ ├── execution_trace_2023-05-24_16-09-38.txt
+ ├── pipeline_dag_2023-05-24_16-09-38.html
+ └── software_versions.yml
+```
+
+## Subworkflows input summary
+PREPARE_INPUT
+* `ch_input` - [YAML file](input_yaml) with definition of the dataset: channel: datafile(yaml)
+
+GENOMESCOPE_MODEL
+* `reads` - Paths to reads: channel: [ val(meta), [ datafile(path) ] ]
+
+RAW_ASSEMBLY
+* `hifi_reads` - List of files containing paths to HiFi reads: channel: [ val(meta), [ datafile(path) ] ]
+* `hic_reads` - List of files containing paths to HiC reads: channel: [ datafile(cram) ]
+* `hifiasm_hic_on` - Switch HiC mode on/off: val: Boolean
+
+PURGE_DUPS
+* `reads_plus_assembly_ch` - Paths to HiFi reads, primary asm, haplotigs, genomescope model: channel: [ val(meta), [ datafile(reads) ], [ datafile(pri), datafile(alt) ], datafile(model) ]
+* `prefix` - Prefix for the output files: channel: val(prefix)
+
+POLISHING
+* `fasta_in` - Assembly in FASTA format with index file: channel: [ val(meta), datafile(fasta), datafile(fai) ]
+* `reads_10X` - Path to folder with Illumina 10X FASTQ files and indices: channel: datafile(path)
+* `bed_chunks_polishing` - Number of chunks to split fasta into: val: Int
+
+ALIGN_SHORT
+* `fasta` - Primary contigs: channel: [ val(meta), datafile(fasta) ]
+* `reads` - HiC reads in CRAM format: channel: [ val(meta), [ datafile(cram) ] ]
+
+SCAFFOLDING
+* `bed_in` - Alignment coordinates after markdup: channel: [ val(meta), datafile(bed) ]
+* `fasta_in` - Assembly in FASTA format: channel: datafile(fasta)
+* `cool_bin` - Bin size for cooler: val(cool_bin)
+
+GENOME_STATISTICS
+* `assembly` - Primary contigs and haplotigs (optional): channel: [ val(meta), datafile(pri), datafile(alt) ]
+* `lineage` - Path to BUSCO database (optional) and name of the BUSCO dataset: channel: [ val(meta), datafile(path), val(lineage) ]
+* `hist` - FASTK .hist file: channel: [ val(meta), datafile(hist) ]
+* `ktab` - FASTK .ktab file: channel: [ val(meta), datafile(ktab) ]
-> **Note**
-> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how
-> to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline)
-> with `-profile test` before running the workflow on actual data.
-Currently, it is advised to run the pipeline with docker or singularity as a small number of major modules do not currently have a conda env associated with them.
+## Subworkflows output summary
+PREPARE_INPUT
+* `hifi` - paths to HiFi reads
+* `hic` - paths to HiC reads, meta contains the read group used for read mapping
+* `illumina_10X` - paths to the folder with 10X reads
+* `busco` - path to the BUSCO database (optional), name of the ODB lineage (e.g. bacteria_odb10)
+* `primary_asm` - primary assembly and its indices, if provided (currently not used down the pipeline)
+* `haplotigs_asm` - haplotigs assembly and its indices, if provided (currently not used down the pipeline)
+
+GENOMESCOPE_MODEL
+* `model` - genomescope model
+* `hist` - FASTK kmer histogram
+* `ktab` - FASTK kmer table
+
+RAW_ASSEMBLY
+* `raw_unitigs` - hifiasm raw unitigs in GFA format
+* `source_overlaps` - hifiasm binary database of overlaps
+* `reverse_overlaps` - hifiasm binary database of reverse-complement overlaps
+* `corrected_reads` - hifiasm binary database of corrected reads
+* `primary_contigs_gfa` - hifiasm primary contigs in GFA format
+* `alternate_contigs_gfa` - hifiasm haplotigs in GFA format
+* `processed_unitigs` - hifiasm processed unitigs in GFA format
+* `primary_hic_contigs_gfa` - hifiasm primary contigs produced with integration of HiC data, in GFA format
+* `alternate_hic_contigs_gfa` - hifiasm haplotigs produced with integration of HiC data, in GFA format
+* `phased_hic_contigs_hap1_gfa` - fully phased first haplotype
+* `phased_hic_contigs_hap2_gfa` - fully phased second haplotype
+* `primary_contigs` - primary contigs (`primary_contigs_gfa`) in FASTA format
+* `alternate_contigs` - haplotigs (`alternate_contigs_gfa`) in FASTA format
+* `primary_hic_contigs` - primary contigs (`primary_hic_contigs_gfa`) in FASTA format
+* `alternate_hic_contigs` - haplotigs (`alternate_hic_contigs_gfa`) in FASTA format
+
+The [hifiasm documentation](https://hifiasm.readthedocs.io/en/latest/interpreting-output.html) contains more details.
+
+PURGE_DUPS
+* `pri` - purged primary contigs
+* `alt` - purged haplotigs
+
+POLISHING
+* `fasta` - polished contigs
+* `versions` - versions of software used in the analysis
+
+ALIGN_SHORT
+* `bed` - BED file of alignments after merging and markduplicates
+* `cram` - CRAM representation of alignments
+* `crai` - index for the CRAM file
+* `stats` - output of samtools stats
+* `idxstats` - output of samtools idxstats
+* `flagstat` - output of samtools flagstat
+
+SCAFFOLDING
+* `alignments_sorted` - output of JUICER_PRE, a text file of pairs of alignment coordinates suitable for Juicer
+* `fasta` - final scaffolds
+* `chrom_sizes` - sizes of scaffolds
+* `cool` - path to .cool file from COOLER_CLOAD
+* `mcool` - path to .mcool file from COOLER_ZOOMIFY
+* `snapshots` - image of the Pretext map
+* `hic` - contact map in .hic format
+* `versions` - versions of software used in the analysis
+
+GENOME_STATISTICS
+* `busco` - BUSCO summary in JSON format
+* `merquryk_completeness` - text file with the Merqury completeness score
+* `merquryk_qv` - text file with the Merqury QV score
+* `assembly_stats_pri` - assembly stats for the primary assembly
+* `assembly_stats_alt` - assembly stats for the haplotigs (if provided)
+* `versions` - versions of software used in the analysis
+
+## Quick Start
1. Install [`Nextflow`](https://www.nextflow.io/docs/latest/getstarted.html#installation) (`>=21.10.3`)
@@ -26,7 +261,7 @@ Currently, it is advised to run the pipeline with docker or singularity as a sma
3. Download the pipeline and test it on a minimal dataset with a single command:
- ```bash
+ ```console
nextflow run sanger-tol/genomeassembly -profile test,YOURPROFILE --outdir <OUTDIR>
```
@@ -37,21 +272,31 @@ Currently, it is advised to run the pipeline with docker or singularity as a sma
> - If you are using `singularity`, please use the [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use) command to download images first, before running the pipeline. Setting the [`NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir`](https://www.nextflow.io/docs/latest/singularity.html?#singularity-docker-hub) Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
> - If you are using `conda`, it is highly recommended to use the [`NXF_CONDA_CACHEDIR` or `conda.cacheDir`](https://www.nextflow.io/docs/latest/conda.html) settings to store the environments in a central location for future pipeline runs.
-> **Warning:**
-> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those
-> provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
-> see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).
+4. Start running your own analysis!
+
+
+
+ ```console
+ nextflow run sanger-tol/genomeassembly --input sample_dataset.yaml --outdir <OUTDIR> --polishing_on true -profile <PROFILE>
+ ```
## Credits
-sanger-tol/genomeassembly was originally written by @ksenia-krasheninnikova based on the ToL Genome Engine procedures.
+sanger-tol/genomeassembly was originally written by @ksenia-krasheninnikova.
+
We thank the following people for their extensive assistance in the development of this pipeline:
-@mcshane - For the original implementation of the genomeassembly pipeline
-@priyanka-surana - For code reviews and code support
-@mahesh-panchal - For nextflow implementation of the purge_dups pipeline that was re-used here,
- as well as for the implementation of input parsing subworkflow which was further adapted for the current pipeline
+@priyanka-surana for the Nextflow implementation of the HiC mapping pipeline, extensive guidance, code review, and brilliant suggestions.
+
+@mcshane and @c-zhou for designing and implementing the original pipelines for purging (@mcshane), polishing (@mcshane) and scaffolding (@c-zhou).
+
+@mahesh-panchal for the Nextflow implementation of the purging pipeline, code review, and valuable suggestions for the input subworkflow.
+
+@muffato for code review and suggestions about versioning.
+
+
+
## Contributions and Support
@@ -59,15 +304,14 @@ If you would like to contribute to this pipeline, please see the [contributing g
## Citations
-
-
-If you use sanger-tol/genomeassembly for your analysis, please cite it using the following doi: [10.5281/zenodo.XXXXXX](https://doi.org/10.5281/zenodo.XXXXXX)
+
+
-### Tools
+
An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
-You can cite the `nf-core` publication as follows:
+This pipeline uses code and infrastructure developed and maintained by the [nf-core](https://nf-co.re) community, reused here under the [MIT license](https://github.com/nf-core/tools/blob/master/LICENSE).
> **The nf-core framework for community-curated bioinformatics pipelines.**
>
diff --git a/docs/README.md b/docs/README.md
index df92463a..2d2bc147 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -5,4 +5,4 @@ The sanger-tol/genomeassembly documentation is split into the following pages:
- [Usage](usage.md)
- An overview of how the pipeline works, how to run it and a description of all of the different command-line flags.
- [Output](output.md)
- - An overview of the different results produced by the pipeline and how to interpret them.
+ - An overview of the pipeline structure, a description of the results it produces, and how to interpret them.
diff --git a/docs/images/mqc_fastqc_adapter.png b/docs/images/mqc_fastqc_adapter.png
deleted file mode 100755
index 361d0e47..00000000
Binary files a/docs/images/mqc_fastqc_adapter.png and /dev/null differ
diff --git a/docs/images/mqc_fastqc_counts.png b/docs/images/mqc_fastqc_counts.png
deleted file mode 100755
index cb39ebb8..00000000
Binary files a/docs/images/mqc_fastqc_counts.png and /dev/null differ
diff --git a/docs/images/mqc_fastqc_quality.png b/docs/images/mqc_fastqc_quality.png
deleted file mode 100755
index a4b89bf5..00000000
Binary files a/docs/images/mqc_fastqc_quality.png and /dev/null differ
diff --git a/docs/images/v1/genome_statistics.drawio b/docs/images/v1/genome_statistics.drawio
new file mode 100644
index 00000000..1c4832f0
--- /dev/null
+++ b/docs/images/v1/genome_statistics.drawio
@@ -0,0 +1,143 @@
+[draw.io XML content not shown]
diff --git a/docs/images/v1/genome_statistics.png b/docs/images/v1/genome_statistics.png
new file mode 100644
index 00000000..1da359b3
Binary files /dev/null and b/docs/images/v1/genome_statistics.png differ
diff --git a/docs/images/v1/genomescope_model.drawio b/docs/images/v1/genomescope_model.drawio
new file mode 100644
index 00000000..b71f5da2
--- /dev/null
+++ b/docs/images/v1/genomescope_model.drawio
@@ -0,0 +1,77 @@
+[draw.io XML content not shown]
diff --git a/docs/images/v1/genomescope_model.png b/docs/images/v1/genomescope_model.png
new file mode 100644
index 00000000..004a180c
Binary files /dev/null and b/docs/images/v1/genomescope_model.png differ
diff --git a/docs/images/v1/hic-mapping.drawio b/docs/images/v1/hic-mapping.drawio
new file mode 100644
index 00000000..7806b879
--- /dev/null
+++ b/docs/images/v1/hic-mapping.drawio
@@ -0,0 +1,271 @@
+[draw.io XML content not shown]
diff --git a/docs/images/v1/hic-mapping.png b/docs/images/v1/hic-mapping.png
new file mode 100644
index 00000000..7a1cf69e
Binary files /dev/null and b/docs/images/v1/hic-mapping.png differ
diff --git a/docs/images/v1/organelles.drawio b/docs/images/v1/organelles.drawio
new file mode 100644
index 00000000..1c4ca31d
--- /dev/null
+++ b/docs/images/v1/organelles.drawio
@@ -0,0 +1,93 @@
+[draw.io XML content not shown]
diff --git a/docs/images/v1/organelles.png b/docs/images/v1/organelles.png
new file mode 100644
index 00000000..36fa4dc7
Binary files /dev/null and b/docs/images/v1/organelles.png differ
diff --git a/docs/images/v1/polishing.drawio b/docs/images/v1/polishing.drawio
new file mode 100644
index 00000000..bfa0e4f5
--- /dev/null
+++ b/docs/images/v1/polishing.drawio
@@ -0,0 +1,312 @@
+[draw.io XML content not shown]
diff --git a/docs/images/v1/polishing.png b/docs/images/v1/polishing.png
new file mode 100644
index 00000000..3f638ce2
Binary files /dev/null and b/docs/images/v1/polishing.png differ
diff --git a/docs/images/v1/purge_dups.drawio b/docs/images/v1/purge_dups.drawio
new file mode 100644
index 00000000..9bd192a8
--- /dev/null
+++ b/docs/images/v1/purge_dups.drawio
@@ -0,0 +1,212 @@
+[draw.io XML content not shown]
diff --git a/docs/images/v1/purge_dups.png b/docs/images/v1/purge_dups.png
new file mode 100644
index 00000000..962dfb0d
Binary files /dev/null and b/docs/images/v1/purge_dups.png differ
diff --git a/docs/images/v1/raw_assembly.drawio b/docs/images/v1/raw_assembly.drawio
new file mode 100644
index 00000000..7d314c47
--- /dev/null
+++ b/docs/images/v1/raw_assembly.drawio
@@ -0,0 +1,108 @@
+[draw.io XML content not shown]
diff --git a/docs/images/v1/raw_assembly.png b/docs/images/v1/raw_assembly.png
new file mode 100644
index 00000000..aa8407b1
Binary files /dev/null and b/docs/images/v1/raw_assembly.png differ
diff --git a/docs/images/v1/scaffolding.drawio b/docs/images/v1/scaffolding.drawio
new file mode 100644
index 00000000..bf6130f8
--- /dev/null
+++ b/docs/images/v1/scaffolding.drawio
@@ -0,0 +1,395 @@
+[draw.io XML content not shown]
diff --git a/docs/images/v1/scaffolding.png b/docs/images/v1/scaffolding.png
new file mode 100644
index 00000000..7fd6b946
Binary files /dev/null and b/docs/images/v1/scaffolding.png differ
diff --git a/docs/output.md b/docs/output.md
index 64968fda..ce904171 100644
--- a/docs/output.md
+++ b/docs/output.md
@@ -2,59 +2,192 @@
## Introduction
-This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
+This document describes the output produced by the genomeassembly pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
-## Pipeline overview
+## Subworkflows
-The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
+The pipeline is built using [Nextflow](https://www.nextflow.io/) DSL2.
-- [FastQC](#fastqc) - Raw read QC
-- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
-- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
+
+### PREPARE_INPUT
+This subworkflow processes the input YAML and generates the input channels used by the other subworkflows.
-### FastQC
+### GENOMESCOPE_MODEL
-Output files
+ Output files
+
+ - `kmer/*ktab`
+   - kmer table file
+ - `kmer/*hist`
+   - kmer histogram file
+ - `kmer/*model.txt`
+   - genomescope model in text format
+ - `kmer/*[linear,log]_plot.png`
+   - genomescope kmer plots
+
+
+
+This subworkflow generates a kmer database and coverage model used in [PURGE_DUPS](#purge_dups) and [GENOME_STATISTICS](#genome_statistics).
-- `fastqc/`
- - `*_fastqc.html`: FastQC report containing quality metrics.
- - `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images.
+![Subworkflow for kmer profile](https://raw.githubusercontent.com/sanger-tol/genomeassembly/documentation/docs/images/v1/genomescope_model.png)
+
+### RAW_ASSEMBLY
+
+ Output files
+
+ - `.*hifiasm.*/.*p_ctg.[g]fa`
+   - primary assembly in GFA and FASTA format; for more details refer to [hifiasm output](https://hifiasm.readthedocs.io/en/latest/interpreting-output.html)
+ - `.*hifiasm.*/.*a_ctg.[g]fa`
+   - haplotigs in GFA and FASTA format; for more details refer to [hifiasm output](https://hifiasm.readthedocs.io/en/latest/interpreting-output.html)
+ - `.*hifiasm.*/.*bin`
+   - internal binary hifiasm files; for more details refer [here](https://hifiasm.readthedocs.io/en/latest/faq.html#id12)
+
-[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
+This subworkflow generates the raw assembly (or assemblies). First, hifiasm is run on the input HiFi reads, then the raw contigs are converted from GFA into FASTA format; this assembly is subject to purging, optional polishing, and scaffolding further down the pipeline.
+If hifiasm HiC mode is switched on, it is performed as an extra step, with its results stored in the hifiasm-hic folder.
+
+![Raw assembly subworkflow](https://raw.githubusercontent.com/sanger-tol/genomeassembly/documentation/docs/images/v1/raw_assembly.png)
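+
+As a minimal sketch of what this subworkflow runs (flags and file names are illustrative, not the pipeline's exact invocation; the GFA-to-FASTA conversion follows the hifiasm documentation):
+
+```bash
+# original mode: assemble HiFi reads, then convert the GFA contigs to FASTA
+hifiasm -o baUndUnlc1.asm -t 16 hifi_reads.fasta
+awk '/^S/ {print ">"$2"\n"$3}' baUndUnlc1.asm.p_ctg.gfa > baUndUnlc1.asm.p_ctg.fa
+
+# optional HiC mode: integrate HiC read pairs during the assembly
+hifiasm -o baUndUnlc1.asm.hic -t 16 --h1 hic_R1.fastq.gz --h2 hic_R2.fastq.gz hifi_reads.fasta
+```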
-![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)
-![MultiQC - FastQC mean quality scores plot](images/mqc_fastqc_quality.png)
+### PURGE_DUPS
+
+ Output files
+
+ - `*.hifiasm.*/purged.fa`
+   - purged primary contigs
+ - `*.hifiasm.*/purged.htigs.fa`
+   - haplotigs after purging
+ - other files from the purge_dups pipeline
+   - for details refer [here](https://github.com/dfguan/purge_dups)
+
-![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png)
+Retained haplotypes are identified in the primary assembly, and the alternate contigs are updated correspondingly.
+The subworkflow relies on the kmer coverage model to identify coverage thresholds. For more details see [purge_dups](https://github.com/dfguan/purge_dups).
+
-> **NB:** The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality.
+![Subworkflow for purging haplotigs](https://raw.githubusercontent.com/sanger-tol/genomeassembly/documentation/docs/images/v1/purge_dups.png)
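+
+For orientation, the manual purge_dups steps that this subworkflow automates look roughly as follows (a sketch following the purge_dups README; file names are illustrative, and the pipeline derives the coverage cutoffs from the genomescope model rather than calcuts defaults):
+
+```bash
+# map HiFi reads to the assembly and compute read-depth statistics
+minimap2 -x map-hifi asm.fa hifi_reads.fasta | gzip -c > reads.paf.gz
+pbcstat reads.paf.gz            # writes PB.base.cov and PB.stat
+calcuts PB.stat > cutoffs
+
+# self-align the split assembly, then purge and extract the sequences
+split_fa asm.fa > asm.split.fa
+minimap2 -x asm5 -DP asm.split.fa asm.split.fa | gzip -c > self.paf.gz
+purge_dups -2 -T cutoffs -c PB.base.cov self.paf.gz > dups.bed
+get_seqs -e dups.bed asm.fa     # writes purged.fa and hap.fa
+```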
-### MultiQC
+### POLISHING
-Output files
+ Output files
+
+ - `*.hifiasm.*/polishing/.*consensus.fa`
+   - polished joined primary and haplotigs assembly
+ - `*.hifiasm.*/polishing/merged.vcf.gz`
+   - unfiltered variants
+ - `*.hifiasm.*/polishing/merged.vcf.gz.tbi`
+   - index file
+ - `*.hifiasm.*/polishing/refdata-*`
+   - Longranger assembly indices
+
+
+
+
+This subworkflow uses read mappings of the Illumina 10X short-read data to fix short errors in the primary contigs and haplotigs.
+
+![Polishing subworkflow](https://raw.githubusercontent.com/sanger-tol/genomeassembly/documentation/docs/images/v1/polishing.png)
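+
+A rough sketch of the underlying steps, assumed from the tools named above (not the pipeline's exact commands):
+
+```bash
+# index the joined primary + haplotigs assembly and align the 10X reads with Longranger
+longranger mkref combined.fa
+longranger align --id=baUndUnlc1 --fastqs=/path/to/10x_fastqs --reference=refdata-combined
+
+# call short variants against the assembly and build a polished consensus
+freebayes -f combined.fa baUndUnlc1/outs/possorted_bam.bam | bgzip > merged.vcf.gz
+tabix -p vcf merged.vcf.gz
+bcftools consensus -f combined.fa merged.vcf.gz > consensus.fa
+```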
+
+### HIC_MAPPING
+
+
+ Output files
+
+ - `*.hifiasm.*/scaffolding/.*_merged_sorted.bed`
+   - BED file obtained from the merged markdup BAM
+ - `*.hifiasm.*/scaffolding/.*mkdup.bam`
+   - final BAM of mapped reads after merging and duplicate marking
+
+
+This subworkflow implements alignment of the Illumina HiC short reads to the primary assembly. It uses [`CONVERT_STATS`](#convert_stats) as an internal subworkflow to calculate read mapping stats.
+
+![HiC mapping subworkflow](https://raw.githubusercontent.com/sanger-tol/genomeassembly/documentation/docs/images/v1/hic-mapping.png)
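+
+Conceptually the mapping is equivalent to the following sketch (the pipeline reads CRAM input and merges per-lane BAMs before duplicate marking):
+
+```bash
+# index the primary assembly
+bwa-mem2 index primary.fa
+
+# align the HiC reads; -S -P disables mate rescue/pairing, since HiC ends are not a regular read pair
+samtools fastq hic_reads.cram \
+    | bwa-mem2 mem -t 16 -S -P -p primary.fa - \
+    | samtools fixmate -m - - \
+    | samtools sort -o hic_sorted.bam -
+
+# mark duplicates
+samtools markdup hic_sorted.bam hic_mkdup.bam
+```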
+
+
+### CONVERT_STATS
-- `multiqc/`
- - `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
- - `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
- - `multiqc_plots/`: directory containing static images from the report in various formats.
+
+ Output files
+ - `*.hifiasm.*/scaffolding/.*.stats`
+   - output of samtools stats
+ - `*.hifiasm.*/scaffolding/.*.idxstats`
+   - output of samtools idxstats
+ - `*.hifiasm.*/scaffolding/.*.flagstat`
+   - output of samtools flagstat
+
+
+This subworkflow produces statistics for a BAM file containing read mappings. It is executed within the [`HIC_MAPPING`](#hic_mapping) subworkflow.
+
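+The statistics come from standard samtools commands, along these lines:
+
+```bash
+samtools index hic_mkdup.bam
+samtools stats hic_mkdup.bam > hic_mkdup.stats          # full alignment statistics
+samtools idxstats hic_mkdup.bam > hic_mkdup.idxstats    # per-sequence mapped/unmapped read counts
+samtools flagstat hic_mkdup.bam > hic_mkdup.flagstat    # counts per FLAG category
+```
+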
+### SCAFFOLDING
+
+ Output files
+
+ - `*.hifiasm.*/scaffolding/yahs/out.break.yahs/out_scaffolds_final.fa`
+   - scaffolds in FASTA format
+ - `*.hifiasm.*/scaffolding/yahs/out.break.yahs/out_scaffolds_final.agp`
+   - coordinates of contigs relative to scaffolds
+ - `*.hifiasm.*/scaffolding/yahs/out.break.yahs/alignments_sorted.txt`
+   - alignments for Juicer in text format
+ - `*.hifiasm.*/scaffolding/yahs/out.break.yahs/yahs_scaffolds.hic`
+   - Juicer HiC map
+ - `*.hifiasm.*/scaffolding/yahs/out.break.yahs/*cool`
+   - HiC map for cooler
+ - `*.hifiasm.*/scaffolding/yahs/out.break.yahs/*.FullMap.png`
+   - Pretext snapshot
+
+
+The subworkflow performs scaffolding of the primary contigs using the HiC mapping generated in [`HIC_MAPPING`](#hic_mapping). It also performs postprocessing steps such as generating cooler and Pretext files.
+
+![Scaffolding subworkflow](https://raw.githubusercontent.com/sanger-tol/genomeassembly/documentation/docs/images/v1/scaffolding.png)
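+
+At its core the scaffolding step is a YaHS call on the primary contigs and the HiC alignments, roughly as follows (file names are illustrative):
+
+```bash
+samtools faidx purged.fa
+yahs -o out purged.fa hic_mkdup.bam    # BED/BIN alignment input is also accepted
+# main outputs: out_scaffolds_final.fa and out_scaffolds_final.agp
+```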
+
+### GENOME_STATISTICS
+
+
+ Output files
+
+ - `.*.assembly_summary`
+   - numeric statistics for pri and alt sequences
+ - `.*ccs.merquryk`
+   - folder with Merqury plots and kmer statistics
+ - `.*busco`
+   - folder with BUSCO results
+
+
+
+This subworkflow is used to evaluate the quality of sequences. It is performed after the intermediate steps, such as raw assembly generation, purging and polishing, and also at the end of the pipeline when scaffolds are produced.
+
+![Genome statistics subworkflow](https://raw.githubusercontent.com/sanger-tol/genomeassembly/documentation/docs/images/v1/genome_statistics.png)
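+
+For reference, the individual evaluation tools can be invoked along these lines (a sketch; in the pipeline, MERQURY.FK additionally consumes the FASTK outputs of GENOMESCOPE_MODEL):
+
+```bash
+gfastats purged.fa > purged.assembly_summary            # contiguity statistics (N50, sizes, gaps)
+busco -i purged.fa -l bacteria_odb10 -m genome -c 16    # gene-set completeness
+```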
+
+### ORGANELLES
+
+ Output files
+
+ - `*.hifiasm.*/mito..*/final_mitogenome.fasta`
+   - organelle assembly
+ - `*.hifiasm.*/mito..*/final_mitogenome.[gb,gff]`
+   - organelle gene annotation
+ - `*.hifiasm.*/mito..*/contigs_stats.tsv`
+   - summary of mitochondrial findings
+ - the output also includes other files produced by MitoHiFi
+
-[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
+This subworkflow implements assembly of organelles. In the main pipeline it is called twice: to assemble the mitochondrion from HiFi reads and, as an alternative, to identify the mitochondrion in the genome assembly.
-Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see .
+![Organelles subworkflow](https://raw.githubusercontent.com/sanger-tol/genomeassembly/documentation/docs/images/v1/organelles.png)
### Pipeline information
+[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
+
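+The same reports can also be requested explicitly on any run with Nextflow's built-in tracing options:
+
+```bash
+nextflow run sanger-tol/genomeassembly --input assets/dataset.yaml --outdir <OUTDIR> -profile singularity \
+    -with-report report.html -with-timeline timeline.html -with-trace -with-dag flowchart.html
+```
+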
Output files
@@ -64,5 +197,3 @@ Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQ
- Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
-
-[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
diff --git a/docs/usage.md b/docs/usage.md
index 62718d03..84afb2c1 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -4,99 +4,127 @@
## Introduction
-
-
-## Samplesheet input
-
-You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row as shown in the examples below.
+## Workflow input
+
+### Parameters summary
+
+ Details
+
+The workflow accepts the following parameters:
+* `input` - (required) YAML file containing the description of the dataset, incl. ToLID, paths to the raw data etc.
+* `bed_chunks_polishing` - a number of chunks to split contigs into for polishing (default 100)
+* `cool_bin` - a bin size for cooler (default 1000)
+* `organelles_on` - set `True` for running the organelles subworkflow
+* `polishing_on` - set `True` for polishing
+* `hifiasm_hic_on` - set `True` to run hifiasm in HiC mode
+
+NB: hifiasm in the original mode is used as the main assembly even if the `hifiasm_hic_on` flag is set.
+
+
+### Full samplesheet
+The input dataset is described in YAML format, which stands for "YAML Ain't Markup Language". It is a human-readable file which contains information
+about the locations of the raw data (HiFi, 10X, HiC) used for the genome assembly. It can also contain meta information such as HiC restriction motifs,
+the BUSCO lineage, the mitochondrial code etc. For more information see [Input YAML definition](#input-yaml-definition).
+
+### Input YAML definition
+
+- `dataset.id`
+  - is used as the sample id throughout the pipeline. ToLID should be used in ToL datasets.
+- `dataset.illumina_10X.reads`
+  - is necessary in case polishing is applied; this field should point to the path of the folder containing the 10X reads. The sample identifier in the Illumina reads should coincide with the top-level ID. For use with the Longranger software the reads should follow [the 10X FASTQ file naming convention](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/fastq-input).
+- `dataset.pacbio.reads`
+  - contains the list (`- reads`) of the HiFi reads in FASTA (or gzipped FASTA) format. The pipeline implementation is based on the assumption that the reads have gone through adapter/barcode checks.
+- `dataset.HiC.reads`
+  - contains the list (`- reads`) of the HiC reads in the indexed CRAM format.
+- `dataset.hic_motif`
+  - is a comma-separated list of restriction sites. The pipeline was tested with the Arima dataset, but it should also work with other HiC libraries.
+- `dataset.busco.lineage`
+  - specifies the name of the BUSCO dataset (i.e. bacteria_odb10).
+- `dataset.busco.lineage_path`
+  - is an optional field containing the path to the folder with pre-downloaded BUSCO lineages.
+- `dataset.mito.species`
+  - is the Latin name of the species to look up the mitogenome reference for in the organelles subworkflow. Normally this parameter will contain the Latin name of the species whose genome is being assembled.
+- `dataset.mito.min_length`
+  - sets the minimum length of the mitogenome, e.g. 15 kb.
+- `dataset.mito.code`
+  - is the mitochondrial code for the mitogenome annotation. See [here](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi) for reference.
+
+
+### An example of the input YAML
+
+ Details
+
+Example is based on [test.yaml](https://github.com/sanger-tol/genomeassembly/blob/814f6cd30e29a95e1abaea7a434bd1b445f8f63b/assets/test.yaml).
```bash
---input '[path to samplesheet file]'
+dataset:
+ id: baUndUnlc1
+ illumina_10X:
+ reads: /lustre/scratch123/tol/resources/nextflow/test-data/Undibacterium_unclassified/genomic_data/baUndUnlc1/10x/
+ pacbio:
+ reads:
+ - reads: /lustre/scratch123/tol/resources/nextflow/test-data/Undibacterium_unclassified/genomic_data/baUndUnlc1/pacbio/fasta/HiFi.reads.fasta
+ HiC:
+ reads:
+ - reads: /lustre/scratch123/tol/resources/nextflow/test-data/Undibacterium_unclassified/genomic_data/baUndUnlc1/hic-arima2/41741_2#7.sub.cram
+hic_motif: GATC,GANTC,CTNAG,TTAA
+busco:
+ lineage: bacteria_odb10
+mito:
+ species: Caradrina clavipalpis
+ min_length: 15000
+ code: 5
```
+
-### Multiple runs of the same sample
-
-The `sample` identifiers have to be the same when you have re-sequenced the same sample more than once e.g. to increase sequencing depth. The pipeline will concatenate the raw reads before performing any downstream analysis. Below is an example for the same sample sequenced across 3 lanes:
-
-```console
-sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
-CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz
-CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz
-```
+
+## Usage
-### Full samplesheet
+### Local testing
-The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 3 columns to match those defined in the table below.
+
+ Details
-A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 6 samples, where `TREATMENT_REP3` has been sequenced twice.
+The pipeline can be tested locally using a provided small test dataset:
-```console
-sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
-CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz
-CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz
-TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz,
-TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz,
-TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz,
-TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz,
```
+cd ${GENOMEASSEMBLY_TEST_DATA}
+curl https://darwin.cog.sanger.ac.uk/genomeassembly_test_data.tar.gz | tar xzf -
-| Column | Description |
-| --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). |
-| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
-| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
+git clone git@github.com:sanger-tol/genomeassembly.git
+cd genomeassembly/
+sed -i "s|/home/runner/work/genomeassembly/genomeassembly|${GENOMEASSEMBLY_TEST_DATA}|" assets/test_github.yaml
+nextflow run main.nf -profile test_github,singularity --outdir ${OUTDIR} {OTHER ARGUMENTS}
+```
+These command-line steps will first download and decompress the test data, then download the pipeline and modify the YAML so that it matches the dataset location on your file system.
+The last command runs the test.
-An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline.
+You should now be able to run the pipeline as you see fit.
-## Running the pipeline
+
+### Running the pipeline
The typical command for running the pipeline is as follows:
-```bash
-nextflow run sanger-tol/genomeassembly --input samplesheet.csv --outdir --genome GRCh37 -profile docker
+```console
+nextflow run sanger-tol/genomeassembly --input assets/dataset.yaml --outdir <OUTDIR> -profile docker,sanger
```
-This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.
+This will launch the pipeline with the `docker` configuration profile, also using your institution's profile if available (see [nf-core/configs](#nf-core_configs)). See below for more information about profiles.
Note that the pipeline will create the following files in your working directory:
-```bash
+```console
work # Directory containing the nextflow working files
<OUTDIR>        # Finished results in specified location (defined with --outdir)
.nextflow_log # Log file from Nextflow
# Other nextflow hidden files, eg. history of pipeline runs and old logs.
```
-If you wish to repeatedly use the same parameters for multiple runs, rather than specifying each flag in the command, you can specify these in a params file.
-
-Pipeline settings can be provided in a `yaml` or `json` file via `-params-file `.
-
-> ⚠️ Do not use `-c ` to specify parameters as this will result in errors. Custom config files specified with `-c` must only be used for [tuning process resource specifications](https://nf-co.re/docs/usage/configuration#tuning-workflow-resources), other infrastructural tweaks (such as output directories), or module arguments (args).
-> The above pipeline run specified with a params file in yaml format:
-
-```bash
-nextflow run sanger-tol/genomeassembly -profile docker -params-file params.yaml
-```
-
-with `params.yaml` containing:
-
-```yaml
-input: './samplesheet.csv'
-outdir: './results/'
-genome: 'GRCh37'
-input: 'data'
-<...>
-```
-
-You can also generate such `YAML`/`JSON` files via [nf-core/launch](https://nf-co.re/launch).
-
### Updating the pipeline
When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:
-```bash
+```console
nextflow pull sanger-tol/genomeassembly
```
@@ -104,13 +132,9 @@ nextflow pull sanger-tol/genomeassembly
It is a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.
-First, go to the [sanger-tol/genomeassembly releases page](https://github.com/sanger-tol/genomeassembly/releases) and find the latest pipeline version - numeric only (eg. `1.3.1`). Then specify this when running the pipeline with `-r` (one hyphen) - eg. `-r 1.3.1`. Of course, you can switch to another version by changing the number after the `-r` flag.
+First, go to the [sanger-tol/genomeassembly releases page](https://github.com/sanger-tol/genomeassembly/releases) and find the latest version number - numeric only (eg. `1.3.1`). Then specify this when running the pipeline with `-r` (one hyphen) - eg. `-r 1.3.1`.
-This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future. For example, at the bottom of the MultiQC reports.
-
-To further assist in reproducbility, you can use share and re-use [parameter files](#running-the-pipeline) to repeat pipeline runs with the same settings without having to write out a command with every single parameter.
-
-> 💡 If you wish to share such profile (such as upload as supplementary material for academic publications), make sure to NOT include cluster specific paths to files, nor institutional specific profiles.
+This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future.
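+
+For example, to pin that version:
+
+```console
+nextflow run sanger-tol/genomeassembly -r 1.3.1 --input assets/dataset.yaml --outdir <OUTDIR> -profile singularity
+```
+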
## Core Nextflow arguments
@@ -120,7 +144,7 @@ To further assist in reproducbility, you can use share and re-use [parameter fil
Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.
-Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Apptainer, Conda) - see below.
+Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Conda) - see below. When using Biocontainers, most of these software packaging methods pull Docker containers from quay.io, e.g. [FastQC](https://quay.io/repository/biocontainers/fastqc), except for Singularity, which directly downloads Singularity images via https hosted by the [Galaxy project](https://depot.galaxyproject.org/singularity/), and Conda, which downloads and installs software locally from [Bioconda](https://bioconda.github.io/).
> We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported.
@@ -129,11 +153,8 @@ The pipeline also dynamically loads configurations from [https://github.com/nf-c
Note that multiple profiles can be loaded, for example: `-profile test,docker` - the order of arguments is important!
They are loaded in sequence, so later profiles can overwrite earlier profiles.
-If `-profile` is not specified, the pipeline will run locally and expect all software to be installed and available on the `PATH`. This is _not_ recommended, since it can lead to different results on different machines dependent on the computer enviroment.
+If `-profile` is not specified, the pipeline will run locally and expect all software to be installed and available on the `PATH`. This is _not_ recommended.
-- `test`
- - A profile with a complete configuration for automated testing
- - Includes links to test data so needs no other parameters
- `docker`
- A generic configuration profile to be used with [Docker](https://docker.com/)
- `singularity`
@@ -144,10 +165,11 @@ If `-profile` is not specified, the pipeline will run locally and expect all sof
- A generic configuration profile to be used with [Shifter](https://nersc.gitlab.io/development/shifter/how-to-use/)
- `charliecloud`
- A generic configuration profile to be used with [Charliecloud](https://hpc.github.io/charliecloud/)
-- `apptainer`
- - A generic configuration profile to be used with [Apptainer](https://apptainer.org/)
- `conda`
- - A generic configuration profile to be used with [Conda](https://conda.io/docs/). Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter, Charliecloud, or Apptainer.
+ - A generic configuration profile to be used with [Conda](https://conda.io/docs/). Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud.
+- `test`
+ - A profile with a complete configuration for automated testing
+ - Includes links to test data so needs no other parameters
### `-resume`
@@ -163,23 +185,11 @@ Specify the path to a specific config file (this is a core Nextflow command). Se
### Resource requests
-Whilst the default requirements set within the pipeline will hopefully work for most people and with most input data, you may find that you want to customise the compute resources that the pipeline requests. Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with any of the error codes specified [here](https://github.com/nf-core/rnaseq/blob/4c27ef5610c87db00c3c5a3eed10b1d161abf575/conf/base.config#L18) it will automatically be resubmitted with higher requests (2 x original, then 3 x original). If it still fails after the third attempt then the pipeline execution is stopped.
+Whilst the default requirements set within the pipeline will hopefully work for most people and with most input data, you may find that you want to customise the compute resources that the pipeline requests. Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with any of the error codes specified [here](https://github.com/sanger-tol/blobtoolkit/blob/56906ffb5737e4b985797bb5fb4b9c94cfe69600/conf/base.config#L18) it will automatically be resubmitted with higher requests (2 x original, then 3 x original). If it still fails after the third attempt then the pipeline execution is stopped.
To change the resource requests, please see the [max resources](https://nf-co.re/docs/usage/configuration#max-resources) and [tuning workflow resources](https://nf-co.re/docs/usage/configuration#tuning-workflow-resources) section of the nf-core website.
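+
+As a sketch, an individual process's resources can be raised through a custom config supplied with `-c` (the `withName` selector here is an assumed example; check the pipeline's `conf/base.config` for the real process labels):
+
+```console
+cat > custom_resources.config <<'EOF'
+process {
+    withName: 'HIFIASM' {
+        cpus   = 32
+        memory = 128.GB
+        time   = 48.h
+    }
+}
+EOF
+nextflow run sanger-tol/genomeassembly -c custom_resources.config --input assets/dataset.yaml --outdir <OUTDIR> -profile singularity
+```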
-### Custom Containers
-
-In some cases you may wish to change which container or conda environment a step of the pipeline uses for a particular tool. By default nf-core pipelines use containers and software from the [biocontainers](https://biocontainers.pro/) or [bioconda](https://bioconda.github.io/) projects. However in some cases the pipeline specified version maybe out of date.
-
-To use a different container from the default container or conda environment specified in a pipeline, please see the [updating tool versions](https://nf-co.re/docs/usage/configuration#updating-tool-versions) section of the nf-core website.
-
-### Custom Tool Arguments
-
-A pipeline might not always support every possible argument or option of a particular tool used in pipeline. Fortunately, nf-core pipelines provide some freedom to users to insert additional parameters that the pipeline does not include by default.
-
-To learn how to provide additional arguments to a particular tool of the pipeline, please see the [customising tool arguments](https://nf-co.re/docs/usage/configuration#customising-tool-arguments) section of the nf-core website.
-
-### nf-core/configs
+### nf-core/configs
In most cases, you will only need to create a custom config as a one-off but if you and others within your organisation are likely to be running nf-core pipelines regularly and need to use the same settings regularly it may be a good idea to request that your custom config file is uploaded to the `nf-core/configs` git repository. Before you do this please can you test that the config file works with your pipeline of choice using the `-c` parameter. You can then create a pull request to the `nf-core/configs` repository with the addition of your config file, associated documentation file (see examples in [`nf-core/configs/docs`](https://github.com/nf-core/configs/tree/master/docs)), and amending [`nfcore_custom.config`](https://github.com/nf-core/configs/blob/master/nfcore_custom.config) to include your custom profile.
@@ -187,14 +197,6 @@ See the main [Nextflow documentation](https://www.nextflow.io/docs/latest/config
If you have any questions or issues please send us a message on [Slack](https://nf-co.re/join/slack) on the [`#configs` channel](https://nfcore.slack.com/channels/configs).
-## Azure Resource Requests
-
-To be used with the `azurebatch` profile by specifying the `-profile azurebatch`.
-We recommend providing a compute `params.vm_type` of `Standard_D16_v3` VMs by default but these options can be changed if required.
-
-Note that the choice of VM size depends on your quota and the overall workload during the analysis.
-For a thorough list, please refer the [Azure Sizes for virtual machines in Azure](https://docs.microsoft.com/en-us/azure/virtual-machines/sizes).
-
## Running in the background
Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.
@@ -209,6 +211,6 @@ Some HPC setups also allow you to run nextflow within a cluster job submitted yo
In some cases, the Nextflow Java virtual machines can start to request a large amount of memory.
We recommend adding the following line to your environment to limit this (typically in `~/.bashrc` or `~./bash_profile`):
-```bash
+```console
NXF_OPTS='-Xms1g -Xmx4g'
```