Merge branch 'documentation' of github.com:sanger-tol/genomeassembly into documentation
Ksenia Krasheninnikova committed Nov 15, 2023
2 parents 28c718d + 4fa07c4 commit 8bc2264
Showing 1 changed file with 61 additions and 41 deletions.
102 changes: 61 additions & 41 deletions docs/output.md
@@ -14,19 +14,21 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) DSL2.


### PREPARE_INPUT
Here the input YAML is being processed. Thr subworkflow generate the input channels used as by the other subworkflows.</p>
Here the input YAML is being processed. This subworkflow generates the input channels used as by the other subworkflows.</p>


### GENOMESCOPE_MODEL
<details markdown="1">
<summary>Output files</summary>

- <code>model</code>
- kmer coverage model
- <code>ktab</code>
- <code>kmer/*ktab</code>
- kmer table file
- <code>hist</code>
- <code>kmer/*hist</code>
- kmer histogram file
- <code>kmer/*model.txt</code>
- genomescope model in text format
- <code>kmer/*[linear,log]_plot.png</code>
- genomescope kmer plots

</details>

@@ -39,19 +41,17 @@ This subworkflow generates a KMER database and coverage model used in [PURGE_DUP
<details markdown="1">
<summary>Output files</summary>

- <code>primary_contigs</code>
- primary assembly in FASTA format
- <code>alternate_contigs</code>
- haplotigs in FASTA format
- <code>primary_hic_contigs</code>
- primary assembly in FASTA format for hifiasm-hic mode
- <code>alternate_hic_contigs</code>
- haplotigs in FASTA format for hifiasm-hic mode
- <code>.\*hifiasm.\*/.*p_ctg.[g]fa</code>
- primary assembly in GFA and FASTA format; for more details refer to [hifiasm output](https://hifiasm.readthedocs.io/en/latest/interpreting-output.html)
- <code>.\*hifiasm.\*/.*a_ctg.[g]fa</code>
- haplotigs in GFA and FASTA format; for more details refer to [hifiasm output](https://hifiasm.readthedocs.io/en/latest/interpreting-output.html)
- <code>.\*hifiasm.\*/.*bin</code>
- internal binary hifiasm files; for more details refer [here](https://hifiasm.readthedocs.io/en/latest/faq.html#id12)

</details>

Raw assembly(-ies) is generated here. hifiasm is run on the input HiFi reads then raw contigs are converted from GFA into FASTA format.
In case hifiasm HiC mode is switched on tun hifiasm with HiC data</p>
This subworkflow generates a raw assembly(-ies). First, hifiasm is run on the input HiFi reads then raw contigs are converted from GFA into FASTA format, this assembly is due to purging, polishing (optional) and scaffolding further down the pipeline.
In case hifiasm HiC mode is switched on, it is performed as an extra step with results stored in hifiasm-hic folder.</p>

![Raw assembly subworkflow](https://raw.githubusercontent.com/sanger-tol/genomeassembly/documentation/docs/images/v1/raw_assembly.png)

@@ -60,11 +60,12 @@ In case hifiasm HiC mode is switched on tun hifiasm with HiC data</p>
<details markdown="1">
<summary>Output files</summary>

- <code>pri</code>
- <code>\*.hifiasm..\*/purged.fa</code>
- purged primary contigs
- <code>alt</code>
- <code>\*.hifiasm..\*/purged.htigs.fa</code>
- haplotigs after purging

- other files from the purge_dups pipeline
- for details refer [here](https://github.com/dfguan/purge_dups)
</details>

Retained haplotype is identified in primary assembly. The alternate contigs are updated correspondingly.
@@ -74,26 +75,35 @@ The subworkflow relies on kmer coverage model to identify coverage thresholds. F
![Subworkflow for purging haplotigs](https://raw.githubusercontent.com/sanger-tol/genomeassembly/documentation/docs/images/v1/purge_dups.png)

### POLISHING
Uses Illumina 10X reads to fix short errors in primary contigs and haplotigs.</p>

<details markdown="1">
<summary>Output files</summary>

- <code>\*.hifiasm..\*/polishing/.*consensus.fa</code>
- polished joined primary and haplotigs assembly
- <code>\*.hifiasm..\*/polishing/merged.vcf.gz</code>
- unfiltered variants
- <code>\*.hifiasm..\*/polishing/merged.vcf.gz.tbi</code>
- index file
- <code>\*.hifiasm..\*/polishing/refdata-*</code>
- Longranger assembly indices


</details>

This subworkflow uses read mapping of the Illumina 10X short read data to fix short errors in primary contigs and haplotigs.</p>

![Subworkflow for purging haplotigs](https://raw.githubusercontent.com/sanger-tol/genomeassembly/documentation/docs/images/v1/polishing.png)

### HIC_MAPPING

<details markdown="1">
<summary>Output files</summary>

- <code>bed</code>
- <code>\*.hifiasm..\*/scaffolding/.*_merged_sorted.bed</code>
- bed file obtained from merged mkdup bam
- <code>cram</code>
- reads mapped to the reference
- <code>crai</code>
- index file for the mapped cram
- <code>stats</code>
- see [`CONVERT_STATS`](#convert_stats) output section
- <code>idxstats</code>
- output of samtools stats
- <code>flagstat</code>
- output of samtools flagstat

- <code>\*.hifiasm..\*/scaffolding/.*mkdup.bam</code>
- final read mapping bam with mapped reads
</details>

This subworkflow implements alignment of the Illumina HiC short reads to the primary assembly. Uses [`CONVERT_STATS`](#convert_stats) as internal subworkflow to calculate read mapping stats.</p>
@@ -105,11 +115,11 @@ This subworkflow implements alignment of the Illumina HiC short reads to the pri

<details markdown="1">
<summary>Output files</summary>
- <code>stats</code>
- <code>\*.hifiasm..\*/scaffolding/.*.stats</code>
- output of samtools stats
- <code>idxstats</code>
- <code>\*.hifiasm..\*/scaffolding/.*.idxstats</code>
- output of samtools idxstats
- <code>flagstat</code>
- <code>\*.hifiasm..\*/scaffolding/.*.flagstat</code>
- output of samtools flagstat
</details>

@@ -119,8 +129,18 @@ This subworkflow produces statistcs for a bam file containing read mapping. It i
<details markdown="1">
<summary>Output files</summary>

- <code>scaffolds</code>
- <code>\*.hifiasm..\*/scaffolding/yahs/out.break.yahs/out_scaffolds_final.fa</code>
- scaffolds in FASTA format
- <code>\*.hifiasm..\*/scaffolding/yahs/out.break.yahs/out_scaffolds_final.agp</code>
- coordinates of contigs relative to scaffolds
- <code>\*.hifiasm..\*/scaffolding/yahs/out.break.yahs/alignments_sorted.txt</code>
- Alignments for Juicer in text format
- <code>\*.hifiasm..\*/scaffolding/yahs/out.break.yahs/yahs_scaffolds.hic</code>
- Juicer HiC map
- <code>\*.hifiasm..\*/scaffolding/yahs/out.break.yahs/*cool</code>
- HiC map for cooler
- <code>\*.hifiasm..\*/scaffolding/yahs/out.break.yahs/*.FullMap.png</code>
- Pretext snapshot

</details>
The subworkflow performs scaffolding of the primary contigs using HiC mapping generated in [`HIC_MAPPING`](hic_mapping). It also performs some postprocessing steps such as generating cooler and pretext files</p>
@@ -132,11 +152,11 @@ The subworkflow performs scaffolding of the primary contigs using HiC mapping ge
<details markdown="1">
<summary>Output files</summary>

- <code>*.assembly_summary</code>
- <code>.*.assembly_summary</code>
- numeric statistics for pri and alt sequences
- <code>*ccs.merquryk</code>
- <code>.*ccs.merquryk</code>
- folder with merqury plots and kmer statistics
- <code>*busco</code>
- <code>.*busco</code>
- folder with BUSCO results

</details>
@@ -150,11 +170,11 @@ This subworkflow is used to evaluate the quality of sequences. It is performed a
<details markdown="1">
<summary>Output files</summary>

- <code>final_mitogenome.fasta</code>
- <code>\*.hifiasm.\*/mito..*/final_mitogenome.fasta</code>
- organelle assembly
- <code>final_mitogenome.[gb,gff]</code>
- <code>\*.hifiasm.\*/mito..*/final_mitogenome.[gb,gff]</code>
- organelle gene annotation
- <code>contigs_stats.tsv</code>
- <code>\*.hifiasm.\*/mito..*/contigs_stats.tsv</code>
- summary of mitochondrial findings
- output also includes other output files produced by MitoHiFi
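
The output lists in this diff name several files by their final path component. As a rough illustration only, the sketch below searches a results directory recursively for a few of those file names and reports which are absent. The function names, the label keys, and the choice of files are assumptions made for this example and are not part of the pipeline. The directory prefixes are deliberately not matched, because the documented patterns mix glob and regular-expression syntax.

```python
from pathlib import Path

# File names copied from the output lists in this diff.
# The label keys on the left are hypothetical, chosen only for this example.
DOCUMENTED_OUTPUTS = {
    "purged_primary": "purged.fa",
    "purged_haplotigs": "purged.htigs.fa",
    "scaffolds": "out_scaffolds_final.fa",
    "scaffolds_agp": "out_scaffolds_final.agp",
    "mitogenome": "final_mitogenome.fasta",
    "mito_stats": "contigs_stats.tsv",
}


def find_documented_outputs(outdir):
    """Map each label to the sorted paths with that exact file name under outdir."""
    root = Path(outdir)
    return {
        label: sorted(root.rglob(name))
        for label, name in DOCUMENTED_OUTPUTS.items()
    }


def missing_outputs(outdir):
    """Return the sorted labels for which no file was found under outdir."""
    found = find_documented_outputs(outdir)
    return sorted(label for label, paths in found.items() if not paths)
```

Calling `missing_outputs("results")` on a hypothetical `results` directory would list the labels with no matching file. An optional step that was not run (for example polishing or organelle assembly) would be expected to leave some labels unmatched, so an entry in that list is not necessarily an error.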
