Home
(serious constellations of reoccurring phylogenetically-independent origin)
Scorpio is a tool for classifying, haplotyping and defining Variants of Concern or Variants of Interest for a species. It was designed in the context of SARS-CoV-2, but is not species-specific: all SARS-CoV-2-specific information can be installed via constellations.
It currently includes the following commands:
- classify - takes a set of lineage-defining constellations with rules and classifies sequences by them.
- haplotype - takes a set of constellations and writes haplotypes (either as strings or individual columns).
- list - prints the mrca_lineage and output_name of constellations as a single column to stdout.
- define - takes a CSV with a group column and a mutations column and extracts the common mutations within the group, optionally with reference to a specified outgroup.
It takes as input a multiple sequence alignment FASTA in reference coordinates. For this reason it currently only supports typing SNP mutations and deletions (not insertions). This style of MSA has been commonly used within the SARS-CoV-2 pandemic as it can be generated by combining consensus-to-reference mappings instead of all-against-all mappings and therefore scales much better with millions of sequences. This MSA can be generated from unaligned consensus sequences using the following command:
minimap2 -t <threads> -a --secondary=no -x asm20 --score-N=0 <reference_fasta> <sequence_fasta> \
| gofasta sam toMultiAlign -t <threads> --reference <reference_fasta> --pad -o alignment.fasta
Or potentially using MAFFT with the --keeplength option ("Keep alignment length" in the web app).
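Because the alignment is in reference coordinates, every record should have exactly the same length as the reference. A minimal sketch of a sanity check for this, assuming a plain FASTA file; the function names read_fasta and off_coordinate_records are hypothetical helpers and not part of Scorpio:

```python
def read_fasta(lines):
    """Parse FASTA text lines into a dict of {header: sequence}."""
    records = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif line and name is not None:
            # Sequences may be wrapped over several lines.
            records[name] += line
    return records


def off_coordinate_records(records, reference_length):
    """Return names of records whose length differs from the reference."""
    return sorted(
        name for name, seq in records.items() if len(seq) != reference_length
    )


if __name__ == "__main__":
    example = [">a", "ACGT", "NN", ">b", "ACG-"]
    # Record "b" is shorter than the 6-base reference assumed here.
    print(off_coordinate_records(read_fasta(example), 6))
```

Any name reported here points to a record that was probably not padded to the reference length (for example, if --pad was omitted).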
Classify counts up the number of reference, alternative, ambiguous and other alleles at each of the defining sites of each constellation, and summarizes whether each sequence can be classified as belonging to each constellation based on sets of rules.
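The counting step can be sketched as follows. This is an illustration only, not Scorpio's implementation: it handles SNP sites only, represents each defining site as a hypothetical (position, ref, alt) tuple with 1-based positions, and assumes that N counts as ambiguous and anything else that is neither ref nor alt counts as other.

```python
def count_alleles(sequence, sites):
    """Count ref/alt/ambig/other calls at the defining sites of one constellation.

    sequence: one aligned sequence in reference coordinates.
    sites: list of (position, ref, alt) tuples, positions are 1-based.
    """
    counts = {"ref": 0, "alt": 0, "ambig": 0, "other": 0}
    for position, ref, alt in sites:
        base = sequence[position - 1].upper()
        if base == ref:
            counts["ref"] += 1
        elif base == alt:
            counts["alt"] += 1
        elif base == "N":  # assumption: only N is treated as ambiguous
            counts["ambig"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    sites = [(1, "A", "T"), (2, "A", "C"), (5, "G", "T"), (4, "C", "G")]
    print(count_alleles("ACGTN", sites))
```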
If it meets the criteria set in the rules for several constellations, a winning constellation is chosen by default as the constellation with the most rules met and with the best support (#alt/#sites). The default output is a single summary file, with optional additional columns. Individual counts and True/False classifications for each constellation can be output in individual CSV files.
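As a rough sketch of the winner selection described above: among constellations whose rules are satisfied, prefer the one with the most rules met, and break ties by support (#alt/#sites). The dictionary keys used here (name, classified, rules_met, alt, sites) are hypothetical, and the exact tie-breaking order in Scorpio may differ.

```python
def pick_winner(candidates):
    """Return the name of the winning constellation, or None if none passes.

    candidates: list of dicts with keys
      name (str), classified (bool), rules_met (int), alt (int), sites (int).
    """
    passing = [c for c in candidates if c["classified"] and c["sites"] > 0]
    if not passing:
        return None
    # Compare by number of rules met first, then by support = alt / sites.
    best = max(passing, key=lambda c: (c["rules_met"], c["alt"] / c["sites"]))
    return best["name"]


if __name__ == "__main__":
    candidates = [
        {"name": "X", "classified": True, "rules_met": 2, "alt": 8, "sites": 10},
        {"name": "Y", "classified": True, "rules_met": 2, "alt": 9, "sites": 10},
        {"name": "Z", "classified": False, "rules_met": 3, "alt": 10, "sites": 10},
    ]
    print(pick_winner(candidates))
```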
- Create individual count files for each of the Omicron and Delta constellations. Note that the -n flag specifies a list of names in the format specified by the label in the constellation JSON files.
scorpio classify -i alignment.fa --prefix scorpio_classify --output-counts -n "Delta (B.1.617.2-like)" "Omicron (B.1.1.529-like)" "Omicron (BA.1-like)" "Omicron (BA.2-like)" "Omicron (BA.3-like)" "Omicron (Unassigned)"
- Create a single file with the winning classification for each sample, but include count information for the winner using --long.
scorpio classify -i alignment.fa --prefix scorpio_classify --long
- View the output as constellations are loaded but stop before classifying samples.
scorpio classify -i alignment.fa --prefix scorpio_classify --long
Haplotype creates barcode strings for each sample for each constellation. These strings are ordered by position in the definition files and can help to resolve why a sample is failing to be classified as a given constellation: amplicon dropout, potential recombination or contamination.
Options include combining constellations and creating a single barcode/set of haplotypes for the ordered list of defining sites of all constellations, splitting barcodes into a column per site, and outputting a file per constellation containing counts of ref, alt, ambig and other alleles.
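To illustrate the idea of a barcode, here is a minimal sketch that reads off the observed character at each defining site, in definition order, either as one string or as one column per site. The encoding is an assumption for illustration (Scorpio's actual barcode alphabet may differ), and the site tuples and column names such as A2C are hypothetical.

```python
def barcode(sequence, sites):
    """Concatenate the observed base at each defining site, in the given order."""
    return "".join(sequence[position - 1].upper() for position, _ref, _alt in sites)


def barcode_columns(sequence, sites):
    """Same information as barcode(), split into one named column per site."""
    return {
        f"{ref}{position}{alt}": sequence[position - 1].upper()
        for position, ref, alt in sites
    }


if __name__ == "__main__":
    sites = [(2, "A", "C"), (5, "G", "T"), (4, "C", "G")]
    print(barcode("ACGTN", sites))
    print(barcode_columns("ACGTN", sites))
```

Read this way, a run of N at neighbouring sites might suggest amplicon dropout, while a block of reference calls inside an otherwise alternative barcode could be worth checking for recombination or contamination.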
- Create a single summary file with a haplotype barcode for each of the Omicron and Delta constellations for each sample. Note that the -n flag specifies a list of names in the format specified by the label in the constellation JSON files.
scorpio haplotype -i alignment.fa --prefix scorpio_haplotype -n "Delta (B.1.617.2-like)" "Omicron (B.1.1.529-like)" "Omicron (BA.1-like)" "Omicron (BA.2-like)" "Omicron (BA.3-like)"
- Create a file per constellation with a column containing the genotype call for each defining mutation site, and a summary of the counts of ref, alt, ambig and other alleles.
scorpio haplotype -i alignment.fa --prefix scorpio_haplotype --append-genotypes --output-counts
- Create a single file with a barcode representing the union of Delta and the Omicron parent lineage (B.1.1.529)
scorpio haplotype -i alignment.fa --prefix scorpio_haplotype --combination -n "Omicron (Unassigned)" "Delta (B.1.617.2-like)"
List prints to stdout a single-column list of the mrca_lineage and output_name for each constellation. This can then be parsed for downstream analysis, e.g. this is used by Pangolin to get a list of the lineages we have constellations for in order to remove false positive lineage assignments. The output_name corresponds to the label in the constellation JSON unless another field is specified with --label.
Define identifies the common mutations within a group of sequences. This command assumes that the mutations for each sample have already been found and are provided as a pipe-separated list in a column called nucleotide_mutations. If required, the user can specify an outgroup, and mutations which are common to this outgroup are placed in a separate ancestral site list which is used by classify but not haplotype in order to retain sensitivity whilst removing noise from haplotype barcodes.
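The logic described above can be sketched as a set intersection. This is a simplified illustration, not Scorpio's code: it assumes a mutation must be present in every sample of the group to count as common (Scorpio may use a frequency threshold instead), and the row key group is a hypothetical name for the grouping column.

```python
def shared_mutations(rows, group):
    """Mutations present in every sample of the given group."""
    sets = [
        {m for m in row["nucleotide_mutations"].split("|") if m}
        for row in rows
        if row["group"] == group
    ]
    return set.intersection(*sets) if sets else set()


def define_group(rows, group, outgroup=None):
    """Split a group's common mutations into (defining, ancestral) lists."""
    common = shared_mutations(rows, group)
    # Mutations also common to the outgroup go to the ancestral list.
    ancestral = common & shared_mutations(rows, outgroup) if outgroup else set()
    return sorted(common - ancestral), sorted(ancestral)


if __name__ == "__main__":
    rows = [
        {"group": "A", "nucleotide_mutations": "C241T|A23403G|G28881A"},
        {"group": "A", "nucleotide_mutations": "C241T|A23403G"},
        {"group": "B", "nucleotide_mutations": "C241T|T100C"},
    ]
    print(define_group(rows, "A", outgroup="B"))
```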
The following two examples show different ways to create a constellation definition file from a GISAID download named sequences.fasta. Please first check that your FASTA has no spaces or weird symbols in header names ([A-Za-z0-9_-|] are fine).
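The header check above could be automated with a short script. This is a sketch under the assumption that only the characters listed (letters, digits, underscore, hyphen and pipe) are acceptable; the function name bad_headers is hypothetical.

```python
import re

# Hyphen is placed last in the class so it is a literal, not a range.
ALLOWED_HEADER = re.compile(r"^[A-Za-z0-9_|-]+$")


def bad_headers(fasta_lines):
    """Return FASTA header names containing characters outside the allowed set."""
    bad = []
    for line in fasta_lines:
        if line.startswith(">"):
            name = line[1:].rstrip("\r\n")
            if not ALLOWED_HEADER.match(name):
                bad.append(name)
    return bad


if __name__ == "__main__":
    example = [">sample_1|2021-01-01\n", "ACGT\n", ">sample 2\n", "ACGT\n"]
    print(bad_headers(example))
```

It can be run over the open file directly, e.g. bad_headers(open("sequences.fasta")), and any names returned should be renamed before continuing.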
If it is not already installed, please install gofasta using conda install bioconda::gofasta.
You will also require local reference files, e.g. MN908947.fa and MN908947.gff are the reference genome files for SARS-CoV-2 (GenBank accession MN908947; note that the downloads have an extra newline at the end which has to be deleted).
minimap2 -a -x asm20 --sam-hit-only --secondary=no --score-N=0 MN908947.fa sequences.fasta -o aligned.sam
gofasta sam variants -a MN908947.gff -r MN908947.fa -s aligned.sam -o variants.csv
Download the COG-UK datapipe and install its dependencies using conda env create -f environment.yml && conda activate datapipe. We will be using nextflow to run the variant calling module (https://github.com/COG-UK/datapipe/blob/main/modules/align_and_variant_call.nf).
- We need a basic CSV for datapipe including a sequence_name column with names which correspond to the FASTA. Run e.g.
cat sequences.fasta | grep ">" | cut -f1 > sequences.csv
then strip the leading ">" using find and replace, add a header row, and manually add a lineage column.
- Create the nucleotide_mutations column using the following datapipe command:
NXF_VER=20.10.0 nextflow run modules/align_and_variant_call.nf --uk_fasta sequences.fasta --uk_metadata sequences.csv
Then find the output from align_and_variant_call:add_nucleotide_mutations_to_metadata, whose filename ends in .with_nuc_mutations.csv.
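As an alternative to the manual find-and-replace step for building sequences.csv, the metadata file could be generated with a short script. This is a sketch only: the column names sequence_name and lineage come from the steps above, while the function name build_metadata and the lineage_by_name mapping are hypothetical. As with cut -f1, only the part of each header before the first tab is kept.

```python
import csv
import io


def build_metadata(fasta_lines, lineage_by_name, out):
    """Write a CSV with sequence_name and lineage columns from FASTA headers."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["sequence_name", "lineage"])
    for line in fasta_lines:
        if line.startswith(">"):
            # Drop the leading ">" and keep the first tab-delimited field.
            name = line[1:].rstrip("\r\n").split("\t")[0]
            # Sequences without a known lineage get an empty value.
            writer.writerow([name, lineage_by_name.get(name, "")])


if __name__ == "__main__":
    fasta = [">s1\n", "ACGT\n", ">s2\textra\n", "ACGA\n"]
    buffer = io.StringIO()
    build_metadata(fasta, {"s1": "BA.2"}, buffer)
    print(buffer.getvalue())
```

For real data, pass open file handles in place of the list and the StringIO buffer, and fill lineage_by_name from whatever lineage assignments you have.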
- Generate a new_constellation.json file based on the mutations in variants.csv, excluding those already defined in the parent constellation BA.2 (already installed in constellations) OR in the local constellation file cBA.2.json.
scorpio define -i variants.csv --outgroup-json BA.2
scorpio define -i variants.csv --outgroup-json cBA.2.json
- Generate constellation files for all groups defined by a lineage column in variants.csv, e.g. the output of datapipe.
scorpio define -i variants.csv