Finding subclonal variants

Caleb Lareau edited this page Jun 27, 2022 · 7 revisions

Finding informative variants using mgatk

Current best practices

The Seurat/Signac packages provide compatible interactive workflows for mtDNA variant analysis with mtscATAC-seq. Specifically, we recommend these functions:

  • ReadMGATK imports the output files of an mgatk run and stores them in the Seurat object.
  • IdentifyVariants uses the strand concordance and VMR (variance-mean ratio) statistics across an mtscATAC-seq library to identify high-quality subclonal variants.
  • FindClonotypes takes the high-confidence variants from the preceding function and infers clones by constructing a cell-cell neighbor graph in heteroplasmy space.
  • AlleleFreq computes the allele frequency of each variant in each cell.
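To make the last step concrete, here is a minimal NumPy sketch of what a per-cell allele frequency (heteroplasmy) is: reads supporting the alternate allele divided by total coverage at that position. This is an illustration only, not Signac's AlleleFreq implementation; the function name, array layout, and toy counts are all assumptions made for the example.

```python
import numpy as np

def allele_frequency(alt_counts, coverage):
    """Per-cell heteroplasmy: alt reads / total reads at the position.

    Both inputs are (variants x cells) arrays. Entries with zero coverage
    come back as NaN rather than 0, so "no data" is not mistaken for
    "variant absent".
    """
    alt = np.asarray(alt_counts, dtype=float)
    cov = np.asarray(coverage, dtype=float)
    af = np.full(alt.shape, np.nan)
    # Divide only where there is coverage; other entries keep their NaN.
    np.divide(alt, cov, out=af, where=cov > 0)
    return af

# Hypothetical toy data: 2 variants x 3 cells.
alt = [[5, 0, 10],
       [0, 2, 0]]
cov = [[50, 40, 20],
       [30, 0, 25]]
af = allele_frequency(alt, cov)
print(af)
```

Returning NaN for uncovered positions is a design choice of this sketch; it keeps low-coverage cells from looking like confident wild-type calls in downstream clustering.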

Automatic variant calling in mgatk

As of version 0.6.0, automated subclonal variant calling is part of the standard execution of the mgatk tenx mode, which should be the go-to for mtscATAC-seq libraries. Note that this mode of variant calling isn't applicable to scRNA-derived libraries; see the note below.

A plot of strand correlation against variance-mean ratio (VMR) is the most useful view for identifying informative mtDNA variants, and it is reported as the “.vmr_strand_plot.png” file in the default output. Specifically, the x-axis represents the Pearson correlation between a variant's forward and reverse strand read counts across cells. This metric separates low-quality variants from high-quality ones based on the overall concordance of heteroplasmy between strands.

Overall, we expect the called variants to show a pattern of substitutions in which some classes are more common than others (specifically, transitions rather than transversions). A plot of this variant signature can be generated rapidly from the “.variant_stats.tsv.gz” and “refAllele.txt” files returned in the mgatk output.
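The two statistics behind that plot can be sketched in a few lines of NumPy. This is a rough illustration, not mgatk's code: the function names and toy counts are hypothetical, and computing the VMR on per-cell heteroplasmy (rather than on raw counts) is an assumption of this sketch.

```python
import numpy as np

def strand_correlation(fwd_counts, rev_counts):
    """Pearson correlation, across cells, between the forward-strand and
    reverse-strand read counts supporting one variant."""
    fwd = np.asarray(fwd_counts, dtype=float)
    rev = np.asarray(rev_counts, dtype=float)
    if fwd.std() == 0 or rev.std() == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return float(np.corrcoef(fwd, rev)[0, 1])

def variance_mean_ratio(heteroplasmy):
    """Variance of a variant's heteroplasmy across cells divided by its mean
    (population variance; assumption of this sketch)."""
    h = np.asarray(heteroplasmy, dtype=float)
    mean = h.mean()
    return float(h.var() / mean) if mean > 0 else float("nan")

# Hypothetical variant seen on both strands in the same cells.
corr_good = strand_correlation([0, 0, 8, 9, 0, 1], [0, 1, 7, 10, 0, 0])
# Hypothetical artifact: each cell shows it on one strand only.
corr_bad = strand_correlation([4, 0, 5, 0, 6, 0], [0, 5, 0, 4, 0, 6])

# Subclonal variant (present in a subset of cells) vs. near-homoplasmic one.
vmr_subclonal = variance_mean_ratio([0, 0, 0.8, 0.9, 0, 0.05])
vmr_homoplasmic = variance_mean_ratio([0.98, 0.99, 1.0, 0.97, 0.99, 0.98])

print(corr_good, corr_bad, vmr_subclonal, vmr_homoplasmic)
```

In this toy data, the strand-concordant variant has a correlation close to 1 and the strand-discordant one is negative, while the variant confined to a subset of cells has a much larger VMR than the one present at similar levels in every cell. Variants scoring high on both statistics are the ones worth carrying forward.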

Workflow used in the mtscATAC-seq paper (Lareau et al. 2020, Nature Biotechnology)

An example of this workflow is provided in the vignette here: CRC tumor vignette. The vignette contains several sections specific to the dataset at hand, but skipping to the Find mtDNA variants section is the quickest way to get from a base mgatk execution to high-quality variants.

The core function for performing variant calling is available as an R script that you can source. You can quickly download this file like so:

wget https://gist.githubusercontent.com/caleblareau/baee9629b9bf4c8ada1a833174ddef3e/raw/7e280c170128404789e0e62a1e1ef0dce1bdb09b/variant_calling.R

Important: this approach doesn't really work with droplet scRNA-seq since only one strand is being sequenced. Thus, the whole philosophy of strand concordance dissolves!! I don't know of the best way to de novo find subclonal variants in droplet-based single-cell RNA-seq, and generally would advise against even trying... it's cost me more hours of pain than I care to admit!