Infer a succinct tree sequence from SARS-CoV-2 variation data
This is an early alpha version and is not intended for production use.
If you are interested in helping to develop sc2ts or would like to work with the inferred ARGs, please get in touch.
To run the downstream analysis utilities, install from pip using
python3 -m pip install sc2ts[analysis]
This installs matplotlib and some other heavyweight dependencies.
To run only the inference tools, use
python3 -m pip install sc2ts
Inference is intended to be run primarily from the command line, most likely orchestrated by a shell script, a Snakemake file, or similar.
The CLI is split into subcommands. Get help by running the CLI without arguments:
python3 -m sc2ts
Metadata for all samples must be provided in a tab-separated file. We convert this text file to a SQLite database so that we can quickly search for strains collected on a given day without loading the entire set each time.
python3 -m sc2ts import-metadata data/metadata.tsv data/metadata.db
TODO: Document required fields
To provide fast access to the individual alignments, we store them in a local database file. These must be imported before inference can be performed.
The basic approach is to use the import-alignments command, with a path to an alignments.db file which we are creating, and one or more FASTA files that we are importing into it.
python3 -m sc2ts import-alignments data/alignments.db data/alignments/*.fasta
By default the database file is updated each time, so this can be done in stages.
TODO discuss the storage and time requirements for this step!
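Because the database file is updated in place by default, the import can be split across several invocations. The following is a minimal sketch of that idea and is not part of sc2ts: the function name import_in_stages, the DRY_RUN switch, and the batch file names are hypothetical, and only the import-alignments invocation is taken from the text above.

```shell
#!/bin/bash
set -e

# Import FASTA files one at a time into the same alignments database.
# Set DRY_RUN=1 (a hypothetical switch) to print the commands instead of running them.
import_in_stages() {
    local db=$1
    shift
    local fasta
    for fasta in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "python3 -m sc2ts import-alignments $db $fasta"
        else
            python3 -m sc2ts import-alignments "$db" "$fasta"
        fi
    done
}

# Example (file names are placeholders):
# import_in_stages data/alignments.db data/alignments/batch1.fasta data/alignments/batch2.fasta
```

Running one file per invocation means an interrupted import can be picked up from the next file, assuming the default update behaviour described above.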
The basic approach is to run the daily-extend command, which runs the basic extension operation day by day using the information in the metadata DB.
python3 -m sc2ts daily-extend data/alignments.db data/metadata.db results/output-prefix
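Since inference is expected to be driven from a script, the three steps described so far can be chained in one place. This is only a sketch under stated assumptions: the helper names sc2ts_step and run_pipeline and the DRY_RUN switch are hypothetical, and the paths are the placeholder paths used in the commands above.

```shell
#!/bin/bash
set -e

# Placeholder paths, reused from the commands shown above.
metadata_tsv=data/metadata.tsv
metadata_db=data/metadata.db
alignments_db=data/alignments.db
results_prefix=results/output-prefix

# Run one sc2ts subcommand, or only print it when DRY_RUN=1 (hypothetical switch).
sc2ts_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "python3 -m sc2ts $*"
    else
        python3 -m sc2ts "$@"
    fi
}

# The steps in the order described: metadata, then alignments, then inference.
# Arguments to run_pipeline are the FASTA files to import.
run_pipeline() {
    sc2ts_step import-metadata "$metadata_tsv" "$metadata_db"
    sc2ts_step import-alignments "$alignments_db" "$@"
    sc2ts_step daily-extend "$alignments_db" "$metadata_db" "$results_prefix"
}

# Example (file name is a placeholder):
# run_pipeline data/alignments/batch1.fasta
```

With set -e, a failure in either import step stops the script before daily-extend is attempted.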
Here is a script used to run the inference for the Long ARG in the preprint:
#!/bin/bash
set -e
precision=12
mismatches=3
max_submission_delay=30
max_daily_samples=1000
num_threads=40
datadir=data
run_id=upgma-mds-$max_daily_samples-md-$max_submission_delay-mm-$mismatches
resultsdir=results/$run_id
results_prefix=$resultsdir/$run_id-
logfile=logs/$run_id.log
# Setup the options
options="--num-threads $num_threads -vv -l $logfile "
options+="--max-submission-delay $max_submission_delay "
options+="--max-daily-samples $max_daily_samples "
options+="--precision $precision --num-mismatches $mismatches"
# Create the results dir and data paths
mkdir -p $resultsdir
alignments=$datadir/alignments2.db
metadata=$datadir/metadata.filtered.db
# NOTE: we can also start from a given date with the -b option
# basets="$results_prefix"2022-01-24.ts
# options+=" -b $basets"
python3 -m sc2ts daily-extend $alignments $metadata $results_prefix $options
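The commented-out lines in the script suggest that a run can be resumed by passing an earlier output with -b. A possible helper for picking that file is sketched below. It assumes outputs are named with the results prefix followed by an ISO date and a .ts suffix, which is inferred from the commented-out basets line and is not confirmed elsewhere in this document. The function name latest_base_ts is hypothetical.

```shell
#!/bin/bash
set -e

# Print the newest tree sequence file for a results prefix, or an empty line
# if none exist. Assumes names of the form <prefix><YYYY-MM-DD>.ts, so that
# the sorted order of the glob expansion is also date order.
latest_base_ts() {
    local prefix=$1
    local latest=""
    local f
    for f in "$prefix"*.ts; do
        # An unmatched glob stays literal, so check that the file really exists.
        [ -e "$f" ] || continue
        latest=$f
    done
    echo "$latest"
}

# Example: add -b only when an earlier output is found.
# basets=$(latest_base_ts "$results_prefix")
# if [ -n "$basets" ]; then options+=" -b $basets"; fi
```

The helper only selects a file name; whether a given output is a valid starting point is left to daily-extend.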
The code is marked as licensed under the MIT license, but because the current implementation uses the matching engine from tsinfer (which is GPL licensed), this code is therefore also GPL.
However, we plan to switch out the matching engine for an implementation provided by tskit, which is MIT licensed. This will be done before the first official release.