[New Workflow] Adding the TheiaMeta_Panel_Illumina_PE Workflow #656

Open: wants to merge 50 commits into base: main
50 commits
7b265b2
make theiameta_panel
sage-wright Sep 24, 2024
b3bd529
rename taxon id vars in org param
sage-wright Sep 24, 2024
593e5b5
language
sage-wright Sep 24, 2024
01c1223
progress
sage-wright Sep 26, 2024
2c63022
notes
sage-wright Oct 9, 2024
83e8add
finish
sage-wright Oct 10, 2024
e8e757d
does this work?
sage-wright Oct 10, 2024
8416a8e
set required for now
sage-wright Oct 10, 2024
8ddf11b
correct terrible spelling
sage-wright Oct 10, 2024
236230f
add runtime
cimendes Oct 11, 2024
3d23bce
start documentation
sage-wright Oct 11, 2024
75c7224
add information on workflow tasks to documentation
sage-wright Oct 15, 2024
c177267
Merge branch 'main' into smw-theiameta-panel-dev
sage-wright Oct 15, 2024
e8312a5
remove krona
sage-wright Oct 15, 2024
b5260e4
add + to everything????
sage-wright Oct 16, 2024
aad8ec4
remove from one array
sage-wright Oct 16, 2024
991d540
also remove from that one too
sage-wright Oct 16, 2024
f56ca68
trying something cRaZy
sage-wright Oct 16, 2024
a04af99
it doesn't work
sage-wright Oct 17, 2024
498d07c
more crazy ideas?
cimendes Oct 17, 2024
688f89b
maybe basename is a good idea
cimendes Oct 17, 2024
fce8abb
change to json
cimendes Oct 17, 2024
cc97d70
sort of works but is ugly
cimendes Oct 17, 2024
9a7086a
IT WORKS
cimendes Oct 17, 2024
bc96474
clean up
cimendes Oct 17, 2024
93bb88b
add dummy genome length & logic block consensus qc
sage-wright Oct 21, 2024
c4cf61b
remove null values from identified_organisms otuput
sage-wright Oct 21, 2024
4e1c373
add versioning
sage-wright Oct 21, 2024
148cb9d
up to 1000
sage-wright Oct 21, 2024
8c7de78
make theiameta_panel fault-resistant, has impacts on theiameta_illumi…
cimendes Oct 21, 2024
365837e
add catch if assembly file is empty
cimendes Oct 21, 2024
5bcb25b
remove exit 1 because it's causing task to fail
sage-wright Oct 21, 2024
166e9fb
update contributions
sage-wright Oct 21, 2024
05d55f7
add warnings to gathered output
sage-wright Oct 21, 2024
ff73187
bump up al
sage-wright Oct 21, 2024
2e25834
work on inputs
sage-wright Oct 22, 2024
c04fa48
hide some optional inputs
sage-wright Oct 23, 2024
43e0efd
add inputs and outputs to docs
sage-wright Oct 23, 2024
4c8d373
Merge branch 'main' into smw-theiameta-panel-dev
sage-wright Oct 23, 2024
547a920
enable searchable
sage-wright Oct 23, 2024
82695a7
set default, expand docs
sage-wright Oct 24, 2024
033cbc0
update contributors
sage-wright Oct 28, 2024
c8b658a
input explosion
sage-wright Oct 28, 2024
92acb24
make good
sage-wright Oct 28, 2024
a3c7c52
document the explosion
sage-wright Oct 28, 2024
b0498ae
optionalize extracted reads
sage-wright Oct 28, 2024
48a26a2
add flu outputs to gather scatter
sage-wright Oct 28, 2024
fcc17d9
finish documentation
sage-wright Oct 28, 2024
8bc64fb
clean up docs
sage-wright Oct 28, 2024
931e815
update md5sums
sage-wright Oct 28, 2024
5 changes: 5 additions & 0 deletions .dockstore.yml
@@ -225,6 +225,11 @@ workflows:
primaryDescriptorPath: /workflows/theiameta/wf_theiameta_illumina_pe.wdl
testParameterFiles:
- /tests/inputs/empty.json
- name: TheiaMeta_Panel_Illumina_PE_PHB
subclass: WDL
primaryDescriptorPath: /workflows/theiameta/wf_theiameta_panel_illumina_pe.wdl
testParameterFiles:
- /tests/inputs/empty.json
- name: Snippy_Streamline_PHB
subclass: WDL
primaryDescriptorPath: /workflows/phylogenetics/wf_snippy_streamline.wdl
7 changes: 5 additions & 2 deletions README.md
@@ -47,13 +47,14 @@ You can expect a careful review of every PR and feedback as needed before mergin
* **Sage Wright** ([@sage-wright](https://github.com/sage-wright)) - Conceptualization, Software, Validation, Supervision
* **Inês Mendes** ([@cimendes](https://github.com/cimendes)) - Software, Validation
* **Curtis Kapsak** ([@kapsakcj](https://github.com/kapsakcj)) - Conceptualization, Software, Validation
* **James Otieno** ([@jrotieno](https://github.com/jrotieno)) - Software, Validation
* **Frank Ambrosio** ([@frankambrosio3](https://github.com/frankambrosio3)) - Conceptualization, Software, Validation
* **Michelle Scribner** ([@michellescribner](https://github.com/michellescribner)) - Software, Validation
* **Kevin Libuit** ([@kevinlibuit](https://github.com/kevinlibuit)) - Conceptualization, Project Administration, Software, Validation, Supervision
* **Emma Doughty** ([@emmadoughty](https://github.com/emmadoughty)) - Software, Validation
* **Andrew Page** ([@andrewjpage](https://github.com/andrewjpage)) - Project Administration, Software, Supervision
* **Fraser Combe** ([@fraser-combe](https://github.com/fraser-combe)) - Software, Validation
* **Michal Babinski** ([@Michal-Babins](https://github.com/Michal-Babins)) - Software, Validation
* **Andrew Lang** ([@AndrewLangVt](https://github.com/AndrewLangVt)) - Software, Supervision
* **Andrew Page** ([@andrewjpage](https://github.com/andrewjpage)) - Project Administration, Software, Supervision
* **Kelsey Kropp** ([@kelseykropp](https://github.com/kelseykropp)) - Validation
* **Emily Smith** ([@emily-smith1](https://github.com/emily-smith1)) - Validation
* **Joel Sevinsky** ([@sevinsky](https://github.com/sevinsky)) - Conceptualization, Project Administration, Supervision
@@ -62,7 +63,9 @@ You can expect a careful review of every PR and feedback as needed before mergin

We would like to gratefully acknowledge the following individuals from the public health community for their contributions to the PHB repository:

* **James Otieno** ([@jrotieno](https://github.com/jrotieno))
* **Robert Petit** ([@rpetit3](https://github.com/rpetit3))
* **Emma Doughty** ([@emmadoughty](https://github.com/emmadoughty))
* **Ash O'Farrel** ([@aofarrel](https://github.com/aofarrel))
* **Sam Baird** ([@sam-baird](https://github.com/sam-baird))
* **Holly Halstead** ([@HNHalstead](https://github.com/HNHalstead))
2 changes: 1 addition & 1 deletion docs/contributing/doc_contribution.md
@@ -44,7 +44,7 @@ A brief description of the documentation structure is as follows:
- `assets/` - Contains images and other files used in the documentation.
- `figures/` - Contains images, figures, and workflow diagrams used in the documentation. For workflows that contain many images (such as BaseSpace_Fetch), it is recommended to create a subdirectory for the workflow.
- `files/` - Contains files that are used in the documentation. This may include example outputs or templates. For workflows that contain many files (such as TheiaValidate), it is recommended to create a subdirectory for the workflow.
- `logos/` - Contains Theiagen logos and symbols used int he documentation.
- `logos/` - Contains Theiagen logos and symbols used in the documentation.
- `metadata_formatters/` - Contains the most up-to-date metadata formatters for our submission workflows.
- `new_workflow_template.md` - A template for adding a new workflow page to the documentation.
- `contributing/` - Contains the Markdown files for our contribution guides, such as this file
8 changes: 5 additions & 3 deletions docs/index.md
@@ -65,13 +65,13 @@ You can expect a careful review of every PR and feedback as needed before mergin
- **Sage Wright** ([@sage-wright](https://github.com/sage-wright)) - Conceptualization, Software, Validation, Supervision
- **Inês Mendes** ([@cimendes](https://github.com/cimendes)) - Software, Validation
- **Curtis Kapsak** ([@kapsakcj](https://github.com/kapsakcj)) - Conceptualization, Software, Validation
- **James Otieno** ([@jrotieno](https://github.com/jrotieno)) - Software, Validation
- **Frank Ambrosio** ([@frankambrosio3](https://github.com/frankambrosio3)) - Conceptualization, Software, Validation
- **Michelle Scribner** ([@michellescribner](https://github.com/michellescribner)) - Software, Validation
- **Kevin Libuit** ([@kevinlibuit](https://github.com/kevinlibuit)) - Conceptualization, Project Administration, Software, Validation, Supervision
- **Emma Doughty** ([@emmadoughty](https://github.com/emmadoughty)) - Software, Validation
- **Andrew Page** ([@andrewjpage](https://github.com/andrewjpage)) - Project Administration, Software, Supervision
- **Fraser Combe** ([@fraser-combe](https://github.com/fraser-combe)) - Software, Validation
- **Michal Babinski** ([@Michal-Babins](https://github.com/Michal-Babins)) - Software, Validation
- **Andrew Lang** ([@AndrewLangVt](https://github.com/AndrewLangVt)) - Software, Supervision
- **Andrew Page** ([@andrewjpage](https://github.com/andrewjpage)) - Project Administration, Software, Supervision
- **Kelsey Kropp** ([@kelseykropp](https://github.com/kelseykropp)) - Validation
- **Emily Smith** ([@emily-smith1](https://github.com/emily-smith1)) - Validation
- **Joel Sevinsky** ([@sevinsky](https://github.com/sevinsky)) - Conceptualization, Project Administration, Supervision
@@ -80,7 +80,9 @@ You can expect a careful review of every PR and feedback as needed before mergin

We would like to gratefully acknowledge the following individuals from the public health community for their contributions to the PHB repository:

- **James Otieno** ([@jrotieno](https://github.com/jrotieno))
- **Robert Petit** ([@rpetit3](https://github.com/rpetit3))
- **Emma Doughty** ([@emmadoughty](https://github.com/emmadoughty))
- **Ash O'Farrel** ([@aofarrel](https://github.com/aofarrel))
- **Sam Baird** ([@sam-baird](https://github.com/sam-baird))
- **Holly Halstead** ([@HNHalstead](https://github.com/HNHalstead))
@@ -65,4 +65,8 @@ This workflow runs on the sample level.
| **pangolin_updates** | String | Result of Pangolin Update (lineage changed versus unchanged) with lineage assignment and date of analysis |
| **pangolin_versions** | String | All Pangolin software and database versions |

</div>
</div>

## References

> **Pangolin**: Rambaut A, Holmes EC, O'Toole Á, Hill V, McCrone JT, Ruis C, du Plessis L, Pybus OG. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol. 2020 Nov;5(11):1403-1407. doi: 10.1038/s41564-020-0770-5. Epub 2020 Jul 15. PMID: 32669681; PMCID: PMC7610519.
3 changes: 2 additions & 1 deletion docs/workflows/genomic_characterization/theiacov.md
@@ -900,6 +900,7 @@ All input reads are processed through "core tasks" in the TheiaCoV Illumina, ONT
| Task | [task_pangolin.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/species_typing/betacoronavirus/task_pangolin.wdl) |
| Software Source Code | [Pangolin on GitHub](https://github.com/cov-lineages/pangolin) |
| Software Documentation | [Pangolin website](https://cov-lineages.org/resources/pangolin.html) |
| Original Publication(s) | [A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology](https://doi.org/10.1038/s41564-020-0770-5) |

??? task "`nextclade`"

@@ -1138,7 +1139,7 @@ All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)
| nextclade_json_flu_ha | File | Nextclade output in JSON file format, specific to Flu HA segment | ONT, PE |
| nextclade_json_flu_na | File | Nextclade output in JSON file format, specific to Flu NA segment | ONT, PE |
| nextclade_lineage | String | Nextclade lineage designation | CL, FASTA, ONT, PE, SE |
| nextclade_qc | String | QC metric as determined by Nextclade. (For Flu, this output will be specific to HA segment) | CL, FASTA, ONT, PE, SE |
| nextclade_qc | String | QC metric as determined by Nextclade. Will be blank for Flu | CL, FASTA, ONT, PE, SE |
| nextclade_qc_flu_ha | String | QC metric as determined by Nextclade, specific to Flu HA segment | ONT, PE |
| nextclade_qc_flu_na | String | QC metric as determined by Nextclade, specific to Flu NA segment | ONT, PE |
| nextclade_tsv | File | Nextclade output in TSV file format. (For Flu, this output will be specific to HA segment) | CL, FASTA, ONT, PE, SE |
50 changes: 45 additions & 5 deletions docs/workflows/genomic_characterization/theiameta.md
@@ -241,22 +241,62 @@ The TheiaMeta_Illumina_PE workflow processes Illumina paired-end (PE) reads ge
#### Assembly

??? task "`metaspades`: _De Novo_ Metagenomic Assembly"
While metagenomics has emerged as a technology of choice for analyzing bacterial populations, the assembly of metagenomic data remains challenging. A dedicated metagenomic assembly algorithm is necessary to circumvent the challenge of interpreting variation. metaSPAdes addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes.

While metagenomics has emerged as a technology of choice for analyzing bacterial populations, the assembly of metagenomic data remains challenging. A dedicated metagenomic assembly algorithm is necessary to circumvent the challenge of interpreting variation. metaSPAdes addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes.
`metaspades` is a _de novo_ assembler that first constructs a de Bruijn graph of all the reads using the SPAdes algorithm. Through various graph simplification procedures, paths in the assembly graph are reconstructed that correspond to long genomic fragments within the metagenome. For more details, please see the original publication.

!!! techdetails "MetaSPAdes Technical Details"

| | Links |
| --- | --- |
| Task | [task_metaspades.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/assembly/task_metaspades.wdl) |
| Software Source Code | [SPAdes on GitHub](https://github.com/ablab/spades) |
| Software Documentation | <https://github.com/ablab/spades/blob/spades_3.15.5/README.md> |
| Original Publication(s) | [metaSPAdes: a new versatile metagenomic assembler](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411777/) |
| Software Documentation | [SPAdes Manual](https://ablab.github.io/spades/index.html) |
| Original Publication(s) | [metaSPAdes: a new versatile metagenomic assembler](http://www.genome.org/cgi/doi/10.1101/gr.213959.116) |

??? task "`minimap2`: Assembly Alignment and Contig Filtering (if a reference is provided)"
??? task "`minimap2`: Assembly Alignment and Contig Filtering"

If a reference genome is provided through the **`reference`** optional input, the assembly produced with `metaspades` will be mapped to the reference genome with `minimap2`. The contigs which align to the reference are retrieved and returned in the **`assembly_fasta`** output.

`minimap2` is a popular aligner that is used for correcting the assembly produced by metaSPAdes. This is done by aligning the reads back to the generated assembly or a reference genome.

In minimap2, "modes" are a group of preset options. Two different modes are used in this task depending on whether a reference genome is provided.

If a reference genome is _not_ provided, the only mode used in this task is `sr` which is intended for "short single-end reads without splicing". The `sr` mode indicates the following parameters should be used: `-k21 -w11 --sr --frag=yes -A2 -B8 -O12,32 -E2,1 -b0 -r100 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g100 -2K50m --heap-sort=yes --secondary=no`. The output file is in SAM format.

If a reference genome is provided, then after the draft assembly polishing with `pilon`, this task runs again with the mode set to `asm20` which is intended for "long assembly to reference mapping". The `asm20` mode indicates the following parameters should be used: `-k19 -w10 -U50,500 --rmq -r100k -g10k -A1 -B4 -O6,26 -E2,1 -s200 -z200 -N50`. The output file is in PAF format.

For more information, please see the [minimap2 manpage](https://lh3.github.io/minimap2/minimap2.html).

!!! techdetails "minimap2 Technical Details"
| | Links |
|---|---|
| Task | [task_minimap2.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/alignment/task_minimap2.wdl) |
| Software Source Code | [minimap2 on GitHub](https://github.com/lh3/minimap2) |
| Software Documentation | [minimap2](https://lh3.github.io/minimap2) |
| Original Publication(s) | [Minimap2: pairwise alignment for nucleotide sequences](https://academic.oup.com/bioinformatics/article/34/18/3094/4994778) |
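
The mode selection described above can be sketched as a small helper that builds a `minimap2` command line. This is only an illustration, not the contents of `task_minimap2.wdl`: the function name, file names, and argument layout are assumptions. The `-x` (preset) and `-a` (SAM output) options are standard minimap2 options, and PAF is what minimap2 writes when `-a` is omitted.

```python
from typing import List, Optional


def build_minimap2_command(
    query: str,
    target: str,
    mode: str,
    query2: Optional[str] = None,
) -> List[str]:
    """Return a minimap2 argument list for a preset (hypothetical helper)."""
    if mode not in {"sr", "asm20"}:
        raise ValueError(f"unsupported mode: {mode}")
    cmd = ["minimap2", "-x", mode]
    if mode == "sr":
        # Short reads against the draft assembly: request SAM output with -a.
        cmd.append("-a")
    # Without -a (the asm20 case here), minimap2 emits PAF by default.
    # minimap2 expects the target (reference) first, then the query file(s).
    cmd += [target, query]
    if query2 is not None:
        cmd.append(query2)
    return cmd


# Reads mapped back to the draft assembly (file names are placeholders).
sr_cmd = build_minimap2_command(
    "reads_1.fastq.gz", "draft_assembly.fasta", "sr", "reads_2.fastq.gz"
)
# Polished assembly mapped to a user-supplied reference.
asm_cmd = build_minimap2_command("polished_assembly.fasta", "reference.fasta", "asm20")
```

Returning an argument list rather than a shell string keeps the sketch safe to pass to `subprocess.run` without quoting concerns.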

??? task "`samtools`: SAM File Conversion"
This task converts the SAM file produced by minimap2 to a BAM file. It then sorts the BAM file based on the read names and generates an index file.

!!! techdetails "samtools Technical Details"
| | Links |
|---|---|
| Task | [task_samtools.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/utilities/data_handling/task_parse_mapping.wdl) |
| Software Source Code | [samtools on GitHub](https://github.com/samtools/samtools) |
| Software Documentation | [samtools](https://www.htslib.org/doc/samtools.html) |
| Original Publication(s) | [The Sequence Alignment/Map format and SAMtools](https://doi.org/10.1093/bioinformatics/btp352)<br>[Twelve Years of SAMtools and BCFtools](https://doi.org/10.1093/gigascience/giab008) |

??? task "`pilon`: Assembly Polishing"
`pilon` is a tool that uses read alignment to correct errors in an assembly. It is used to polish the assembly produced by metaSPAdes. The inputs to Pilon are the sorted BAM file produced by `samtools` and the original draft assembly produced by `metaspades`.

!!! techdetails "pilon Technical Details"
| | Links |
|---|---|
| Task | [task_pilon.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/quality_control/read_filtering/task_pilon.wdl) |
| Software Source Code | [Pilon on GitHub](https://github.com/broadinstitute/pilon) |
| Software Documentation | [Pilon Wiki](https://github.com/broadinstitute/pilon/wiki) |
| Original Publication(s) | [Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement](https://doi.org/10.1371/journal.pone.0112963) |
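
As a rough sketch of how the two inputs named above reach Pilon, the helper below assembles a command line. It is an assumption-laden illustration rather than the code in `task_pilon.wdl`: the function name, file names, and the choice of `--frags` (Pilon's option for a paired-end fragment BAM) are hypothetical. `--genome`, `--frags`, and `--output` are documented Pilon options.

```python
from typing import List


def build_pilon_command(
    draft_assembly: str, sorted_bam: str, output_prefix: str
) -> List[str]:
    """Return a Pilon argument list (hypothetical helper, not the task's code)."""
    return [
        "pilon",
        "--genome", draft_assembly,  # draft assembly from metaspades
        "--frags", sorted_bam,       # sorted, indexed BAM of paired-end reads
        "--output", output_prefix,   # prefix for the polished FASTA
    ]


# File names are placeholders chosen for the example.
pilon_cmd = build_pilon_command("draft_assembly.fasta", "sample.sorted.bam", "sample_polished")
```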

#### Assembly QC

??? task "`quast`: Assembly Quality Assessment"