Skip to content

Large deletion calling and heteroplasmy estimation

Caleb Lareau edited this page Mar 27, 2023 · 10 revisions

Overview

This page describes the additional modules dedicated to detecting and quantifying mtDNA deletions from sequencing data. We've most thoroughly evaluated our computational approach using the mscATAC-seq approach, but the general approach can be applied in other settings, such as exome sequencing data. Overall, three main execution steps are required: 1) deletion detection (mgatk-del find); 2) single-cell genotyping (using the mtscATAC-seq | mgatk workflow; and 3) quantification of deletion heteroplasmy in single cells. Here's a graphical overview of this workflow:

Graphical overview of mgatk-del workflow

We note that we've explicitly validated this tool using mtscATAC-seq data from patients with KSS, CPEO, and Pearson Syndrome, where all deletions occur between Oh and OL. However, through simulated data, we verified that deletion junctions spanning the ends of the reference genome can also be detected using this workflow and quantified with no issues.


Step 1. Detect deletions in bulk .bam file

The first step is to identify deletions. The mgatk-del-find executable takes in a .bam file (presumably containing many single-cells for a single-cell assay) and returns the most likely positions that are participating in a deletion. An example of this functionality is available in tests from this repository. Execute the command like so:

mgatk-del-find -i pearsonbam/CACCACTAGGAGGCGA-1.qc.bam

This will produce five output files in the same file path as the original .bam file, including three plots and two raw data tables. First, let's look at the output data table:

head mgatkdel_find.clip.tsv

position	coverage	clip_count	SA
16569	83	82	6
13095	208	62	20
1	65	61	5
6073	182	54	17
10934	131	41	0
3571	172	37	0
301	164	31	0
13753	230	22	0
296	216	21	0

This file shows the number of reads that cover each position as well as the number of reads that are clipped at the particular position, sorted by the number of clipped reads. Unsurprisingly, positions 1 and 16569 come near the top, which is to be expected as the mtDNA is circular and is linked at these two positions. Next, the SA column shows the number of times that clipped position is supported by a secondary alignment. For true deletions, the other half of the alignment will align to chrM, which is recorded by this column. While this table is good for parsing and downstream analyses, we include a summary visualization in the *.clipped_viz.pdf file to rapidly determine potential mtDNA deletion points:

clips

Here, the plot shows the positions of the mtDNA genome along the x-axis and the number of clipped reads where a clip occurs at the particular position. The top ten most clipped loci are shown with the position annotated (excluding the circular genome junction in the d-loop). After evaluating several different datasets for putative mtDNA deletions, were observed recurrent positions that are nominated, which are shown in grey. In other words, we've identified several positions that correspond to a "filter list" that we de-emphasize with this color scheme. From this visualization, we can see that 6073-13095 is the most likely deletion for this sample.

We can verify that the purported clips correspond to a loss of coverage in the putative deleted area by looking at the *.coverage_viz.pdf file:

coverage

So, these images strongly implicate the chrM:6073-13095 region as an mtDNA deletion in this sample. When we combine these altogether to make a call:

sa_viz

Note that these "split alignments" can be sourced via this file: (head mgatkdel_find.SA.tsv) where each row is a different read that shows the spliced alignment (clip) on either column.

out1	out2
16569	1
16569	1
16569	1
16569	1
16569	1
16569	1
1684	685
1684	685
652	1721

Step 2. Process mtDNA sequencing data using mgatk (optional)

To quantify heteroplasmy in individual samples/cells for specific deletions, we utilize the mgatk-del module, which is shown in Step 3. The input that module is a folder of quality-controlled .bam files, where it is assumed that one cell is present in each .bam file, and we further recommend removing PCR duplicates. If your sample comes from a 10x mscATAC-seq run, we generate these files as an intermediate, which can be accessible using the regular mgatk execution (see here in the wiki for more information). Critically, one must throw the -qc or --keep-qc-bams flag to save these intermediate files as they are ordinary deleted in the standard execution. An example execution is shown below:

mgatk bcall -i path_to_full_10x_run.bam -o regular_mgatk_call -n regular_mgatk_call -c 4 \
     -g rCRS -qc -bt CB -b barcode/path_to_full_10x_runbarcodes.tsv 

In brief, this produces all of the standard output for an mgatk execution but further results in a folder (regular_mgatk_call/qc_bam) that can be used as the input for the final tool.

For clarity, this step is optional as one may prepare .bam files in other workflows, but for genotyping with mscATAC-seq assay, we recommend executing the mgatk command similar to that shown above.


Step 3. Genotype single cells with mgatk-del

Finally, after the mtDNA deletions are identified (Step 1) and the input files are pre-processed (Step 2), we can execute mgatk-del. In brief, the tool requires an input (-i) of a folder of .bam files and the left (-lc) and right (-rc) coordinates of putative mtDNA deletions for quantification. For our example on this page (in tests), we recommend the following execution:

mgatk-del -i pearsonbam -lc 6073 -rc 13095

By specifying the output folder (regular_mgatk_call) to be the same as what was declared in Step 2, the .deletion_heteroplasmy.tsv file will be present in the /final/ folder with the other genotyping files. We recommend this for convenience in organization, but the -o argument can be any output folder destination.

Here, mgatk-del produces only a single output file, which is shown here:

cat mgatk_out/final/mgatk_del.deletion_heteroplasmy.tsv
cell_id	heteroplasmy	reads_del	reads_wt	reads_all	total_clipped_bases	deletion	version
CTAACTTAGAGCCACA-1	42.24	68	93	161	1718	del6073-13095	improved
CTAACTTAGAGCCACA-1	42.24	68	93	161	NA	del6073-13095	naive
GCCTAGGCAGTTCGGC-1	51.2	64	61	125	1375	del6073-13095	improved
GCCTAGGCAGTTCGGC-1	51.59	65	61	126	NA	del6073-13095	naive
CACCACTAGGAGGCGA-1	34.05	63	122	185	1412	del6073-13095	improved
CACCACTAGGAGGCGA-1	34.05	63	122	185	NA	del6073-13095	naive

Each line shows a unique cell/deletion combination and includes all relevant information about the deletion heteroplasmy, including the number of reads supporting a "wild-type" allele and the number supporting a deletion allele. The heteroplasmy is defined as the ratio of reads_del to reads_all multiplied by 100%. Finally, after additional simulations, we report an "improved" heteroplasmy tab that provides further stringency for estimation, particularly of low effect (rare) deletions.

We can also specify the quantification of multiple deletions by supplying multiple coordinates to the -lc and -rc flags like so:

mgatk-del -i pearsonbam -z -lc 6073,6000 -rc 13095,14000

This is particularly useful for complex single-cell experiment settings or when evaluating many different potential deletions. We can then examine the output:

cat mgatk_out/final/mgatk_del.deletion_heteroplasmy.tsv | grep improved
CTAACTTAGAGCCACA-1	42.24	68	93	161	1718	del6073-13095	improved
CTAACTTAGAGCCACA-1	0	0	171	171	0	del6000-14000	improved
GCCTAGGCAGTTCGGC-1	51.2	64	61	125	1375	del6073-13095	improved
GCCTAGGCAGTTCGGC-1	0	0	165	165	0	del6000-14000	improved
CACCACTAGGAGGCGA-1	34.05	63	122	185	1412	del6073-13095	improved
CACCACTAGGAGGCGA-1	0	0	208	208	0	del6000-14000	improved

Again, we see that each line is a unique cell/deletion combination with all appropriate summary statistics.

Here, we verify that the heteroplasmy is identically zero for all cells for the del6000-14000, which is expected as these positions had no evidence of contributing to a deletion from the analyses upstream (we note however that coverage ased estimates would suggest that the deletion would be supported. We emphasize that the deletions are parsed from -lc and -rc such that the order of the arguments for both parameters correspond to one deletion (e.g. a chrM:6073-14000 deletion was not considered since 6073 was first in the -lc comma-separated string but 14000 was second in the -rc string). Thus, a parameterization of -lc 6073,6073 -rc 13095,14000 would be required to compute both of the chrM:6073-13095 and chrM:6073-14000 heteroplasmies.

Coverage-based deletion calling

In the original development of mgatk-del, we advocate for a clipped-read-based means of estimating deletion heteroplasmy. However, after feedback from various individuals, we note that there are a few scenarios where it may be advantageous to estimate deletion heteroplasmy using a coverage-based estimation

Here are settings where computing a coverage-based heteroplasmy may be advantageous compared to a clipped read version:

  • If you have many (i.e. >20,000 cells) per experiment that deletion heteroplasmy should be quantified
  • You have full confidence in the exact breakpoints (e.g. from mgatk-del-find or a similar workflow)
  • You are more interested in the relative heteroplasmy of cells rather than if it's identically zero, especially in low-coverage (i.e. <20x coverage / cell).

If your experiment doesn't satisfy one or more of these criteria, we recommend running the mgatk-del pipeline as instructed above utilizing clipped reads. With these caveats in mind, here's how we recommend most efficiently quantifying deletion heteroplasmy in single-cells. First, process your sample with the standard mgatk tenx workflow that will produce a variety of plaintext and binarized .rds files that summarize the per-base, per-cell coverage. Next, utilize this function in R to estimate the coverage-based heteroplasmy:

library(SummarizedExperiment)
library(Matrix)

estimate_coverage_heteroplasmy <- function(coord1, coord2, mat){
  idx <- 1:dim(mat)[1]
  in_del_boo <- (coord1 <= idx & coord2 >= idx)
  in_del <- colSums(mat[in_del_boo,])/sum(in_del_boo)
  out_del <- colSums(mat[!in_del_boo,])/sum(!in_del_boo)
  x <- in_del/out_del
  ifelse(x >1, 0, 1-x ) *100
}
se1 <- readRDS("sample_mgatk.rds")
het1 <- estimate_coverage_heteroplasmy(6073,13095, assays(se1)[["coverage"]])
summary(het1)

Here, het1 will be a vector of heteroplasmy (between 0 and 100%) inferred using the ratio of bases within the called deletion (e.g. chrM:6073-13095 in this example) compared to outside the region. One other note: we have not benchmarked this approach without prior alignment to a hard-masked reference genome, so be sure that your input into the mgatk workflow has accounted for NUMTs-- the expectation is that the coverage results will be errant without properly accounting for multi-mapping reads in these data.

Our simulations show that certain deletions may have different properties when using these estimators. However, there are three major considerations: precision near 0%, sensitivity near 0%, and absolute accuracy in terms of the fraction of deleted molecules. For 3 deletions, we show these simulated properties here:


Simulations comparing coverage and clipped-based heteroplasmy estimation

Reproduced from the original report of the mgatk-del tool, we note the performance of these different methods of heteroplasmy estimation for 3 deletions derived from individuals with Pearson Syndrome:

Additional performance metrics for clipped and coverage heteroplasmy

Based on these simulations and our other analyses, we currently recommend the following:

  • If you are interested in studying low deletion heteroplasmy (e.g. purifying selection or alleles near 0) and want to minimize the possibility of false positive heteroplasmy, we recommend the clip-based heteroplasmy metric. However, if you are okay with some false positives and want the most sensitive measure, particularly at low coverages (<20x), a coverage-based estimator is a reasonable approach.
  • If you are interested in the most accurate quantification metrics at a range of values (e.g. fraction of deleted molecules in a cybrid experiment), we generally recommend the coverage-based heteroplasmy estimate as we find it to work off the shelf most consistently (i.e. lowest mean absolute error, typically, and at lower coverages, as shown above).
  • If you have to process lots of cells within an experiment (i.e. >20,000 cells), the coverage-based estimation is more computationally efficient.

We note that one can utilize our published simulation framework to explicitly model your scenario (which deletion; sequencing chemistry read lengths) to explicitly quantify pros and cons of either method for heteroplasmy estimation.


Other notes

Deletion spanning the d-loop

  • During our evaluation and benchmarking of the mgatk-del, we examined whether our workflow could detect deletions that span the ends of the mtDNA reference contig (in the d-loop). Indeed, we were sensitive to detect this junction and verified that the standard workflow could be applied.
span-0

Shown are the results of the simulated del15000-1000 mapped to the default hg38 genome and then quantified with mgatk-del. The red bars represent the abundance of clipped reads at the simulated deletion junctions, verifying that our deletion method would identify the junctions.

Thus, while we expect most deletions to occur between the origins of replication (Oh and Ol), we anticipate that mgatk-del can accurately identify and quantify deletions wherever they may occur.