-
Notifications
You must be signed in to change notification settings - Fork 27
Large deletion calling and heteroplasmy estimation
This page describes the additional modules dedicated to detecting and quantifying mtDNA deletions from sequencing data. We've most thoroughly evaluated our computational approach using the mscATAC-seq approach, but the general approach can be applied in other settings, such as exome sequencing data. Overall, three main execution steps are required: 1) deletion detection (mgatk-del find
); 2) single-cell genotyping (using the mtscATAC-seq | mgatk
workflow; and 3) quantification of deletion heteroplasmy in single cells. Here's a graphical overview of this workflow:
We note that we've explicitly validated this tool using mtscATAC-seq data from patients with KSS, CPEO, and Pearson Syndrome, where all deletions occur between Oh and OL. However, through simulated data, we verified that deletion junctions spanning the ends of the reference genome can also be detected using this workflow and quantified with no issues.
The first step is to identify deletions. The mgatk-del-find
executable takes in a .bam
file (presumably containing many single-cells for a single-cell assay) and returns the most likely positions that are participating in a deletion. An example of this functionality is available in tests
from this repository. Execute the command like so:
mgatk-del-find -i pearsonbam/CACCACTAGGAGGCGA-1.qc.bam
This will produce five output files in the same file path as the original .bam
file, including three plots and two raw data tables. First, let's look at the output data table:
head mgatkdel_find.clip.tsv
position coverage clip_count SA
16569 83 82 6
13095 208 62 20
1 65 61 5
6073 182 54 17
10934 131 41 0
3571 172 37 0
301 164 31 0
13753 230 22 0
296 216 21 0
This file shows the number of reads that cover each position as well as the number of reads that are clipped at the particular position, sorted by the number of clipped reads. Unsurprisingly, positions 1 and 16569 come near the top, which is to be expected as the mtDNA is circular and is linked at these two positions. Next, the SA column shows the number of times that clipped position is supported by a secondary alignment. For true deletions, the other half of the alignment will align to chrM, which is recorded by this column. While this table is good for parsing and downstream analyses, we include a summary visualization in the *.clipped_viz.pdf
file to rapidly determine potential mtDNA deletion points:
Here, the plot shows the positions of the mtDNA genome along the x-axis and the number of clipped reads where a clip occurs at the particular position. The top ten most clipped loci are shown with the position annotated (excluding the circular genome junction in the d-loop). After evaluating several different datasets for putative mtDNA deletions, were observed recurrent positions that are nominated, which are shown in grey. In other words, we've identified several positions that correspond to a "filter list" that we de-emphasize with this color scheme. From this visualization, we can see that 6073-13095
is the most likely deletion for this sample.
We can verify that the purported clips correspond to a loss of coverage in the putative deleted area by looking at the *.coverage_viz.pdf
file:
So, these images strongly implicate the chrM:6073-13095
region as an mtDNA deletion in this sample. When we combine these altogether to make a call:
Note that these "split alignments" can be sourced via this file: (head mgatkdel_find.SA.tsv
) where each row is a different read that shows the spliced alignment (clip) on either column.
out1 out2
16569 1
16569 1
16569 1
16569 1
16569 1
16569 1
1684 685
1684 685
652 1721
To quantify heteroplasmy in individual samples/cells for specific deletions, we utilize the mgatk-del
module, which is shown in Step 3. The input that module is a folder of quality-controlled .bam
files, where it is assumed that one cell is present in each .bam
file, and we further recommend removing PCR duplicates. If your sample comes from a 10x mscATAC-seq run, we generate these files as an intermediate, which can be accessible using the regular mgatk
execution (see here in the wiki for more information). Critically, one must throw the -qc
or --keep-qc-bams
flag to save these intermediate files as they are ordinary deleted in the standard execution. An example execution is shown below:
mgatk bcall -i path_to_full_10x_run.bam -o regular_mgatk_call -n regular_mgatk_call -c 4 \
-g rCRS -qc -bt CB -b barcode/path_to_full_10x_runbarcodes.tsv
In brief, this produces all of the standard output for an mgatk
execution but further results in a folder (regular_mgatk_call/qc_bam
) that can be used as the input for the final tool.
For clarity, this step is optional as one may prepare .bam
files in other workflows, but for genotyping with mscATAC-seq
assay, we recommend executing the mgatk
command similar to that shown above.
Finally, after the mtDNA deletions are identified (Step 1) and the input files are pre-processed (Step 2), we can execute mgatk-del
. In brief, the tool requires an input (-i
) of a folder of .bam
files and the left (-lc
) and right (-rc
) coordinates of putative mtDNA deletions for quantification. For our example on this page (in tests), we recommend the following execution:
mgatk-del -i pearsonbam -lc 6073 -rc 13095
By specifying the output folder (regular_mgatk_call
) to be the same as what was declared in Step 2, the .deletion_heteroplasmy.tsv
file will be present in the /final/
folder with the other genotyping files. We recommend this for convenience in organization, but the -o
argument can be any output folder destination.
Here, mgatk-del
produces only a single output file, which is shown here:
cat mgatk_out/final/mgatk_del.deletion_heteroplasmy.tsv
cell_id heteroplasmy reads_del reads_wt reads_all total_clipped_bases deletion version
CTAACTTAGAGCCACA-1 42.24 68 93 161 1718 del6073-13095 improved
CTAACTTAGAGCCACA-1 42.24 68 93 161 NA del6073-13095 naive
GCCTAGGCAGTTCGGC-1 51.2 64 61 125 1375 del6073-13095 improved
GCCTAGGCAGTTCGGC-1 51.59 65 61 126 NA del6073-13095 naive
CACCACTAGGAGGCGA-1 34.05 63 122 185 1412 del6073-13095 improved
CACCACTAGGAGGCGA-1 34.05 63 122 185 NA del6073-13095 naive
Each line shows a unique cell/deletion combination and includes all relevant information about the deletion heteroplasmy, including the number of reads supporting a "wild-type" allele and the number supporting a deletion allele. The heteroplasmy is defined as the ratio of reads_del
to reads_all
multiplied by 100%
. Finally, after additional simulations, we report an "improved" heteroplasmy tab that provides further stringency for estimation, particularly of low effect (rare) deletions.
We can also specify the quantification of multiple deletions by supplying multiple coordinates to the -lc
and -rc
flags like so:
mgatk-del -i pearsonbam -z -lc 6073,6000 -rc 13095,14000
This is particularly useful for complex single-cell experiment settings or when evaluating many different potential deletions. We can then examine the output:
cat mgatk_out/final/mgatk_del.deletion_heteroplasmy.tsv | grep improved
CTAACTTAGAGCCACA-1 42.24 68 93 161 1718 del6073-13095 improved
CTAACTTAGAGCCACA-1 0 0 171 171 0 del6000-14000 improved
GCCTAGGCAGTTCGGC-1 51.2 64 61 125 1375 del6073-13095 improved
GCCTAGGCAGTTCGGC-1 0 0 165 165 0 del6000-14000 improved
CACCACTAGGAGGCGA-1 34.05 63 122 185 1412 del6073-13095 improved
CACCACTAGGAGGCGA-1 0 0 208 208 0 del6000-14000 improved
Again, we see that each line is a unique cell/deletion combination with all appropriate summary statistics.
Here, we verify that the heteroplasmy is identically zero for all cells for the del6000-14000
, which is expected as these positions had no evidence of contributing to a deletion from the analyses upstream (we note however that coverage ased estimates would suggest that the deletion would be supported. We emphasize that the deletions are parsed from -lc
and -rc
such that the order of the arguments for both parameters correspond to one deletion (e.g. a chrM:6073-14000 deletion was not considered since 6073 was first in the -lc
comma-separated string but 14000 was second in the -rc
string). Thus, a parameterization of -lc 6073,6073 -rc 13095,14000
would be required to compute both of the chrM:6073-13095 and chrM:6073-14000 heteroplasmies.
In the original development of mgatk-del
, we advocate for a clipped-read-based means of estimating deletion heteroplasmy. However, after feedback from various individuals, we note that there are a few scenarios where it may be advantageous to estimate deletion heteroplasmy using a coverage-based estimation
Here are settings where computing a coverage-based heteroplasmy may be advantageous compared to a clipped read version:
- If you have many (i.e. >20,000 cells) per experiment that deletion heteroplasmy should be quantified
- You have full confidence in the exact breakpoints (e.g. from
mgatk-del-find
or a similar workflow) - You are more interested in the relative heteroplasmy of cells rather than if it's identically zero, especially in low-coverage (i.e. <20x coverage / cell).
If your experiment doesn't satisfy one or more of these criteria, we recommend running the mgatk-del
pipeline as instructed above utilizing clipped reads. With these caveats in mind, here's how we recommend most efficiently quantifying deletion heteroplasmy in single-cells. First, process your sample with the standard mgatk tenx
workflow that will produce a variety of plaintext and binarized .rds
files that summarize the per-base, per-cell coverage. Next, utilize this function in R
to estimate the coverage-based heteroplasmy:
library(SummarizedExperiment)
library(Matrix)
estimate_coverage_heteroplasmy <- function(coord1, coord2, mat){
idx <- 1:dim(mat)[1]
in_del_boo <- (coord1 <= idx & coord2 >= idx)
in_del <- colSums(mat[in_del_boo,])/sum(in_del_boo)
out_del <- colSums(mat[!in_del_boo,])/sum(!in_del_boo)
x <- in_del/out_del
ifelse(x >1, 0, 1-x ) *100
}
se1 <- readRDS("sample_mgatk.rds")
het1 <- estimate_coverage_heteroplasmy(6073,13095, assays(se1)[["coverage"]])
summary(het1)
Here, het1
will be a vector of heteroplasmy (between 0 and 100%) inferred using the ratio of bases within the called deletion (e.g. chrM:6073-13095
in this example) compared to outside the region. One other note: we have not benchmarked this approach without prior alignment to a hard-masked reference genome, so be sure that your input into the mgatk workflow has accounted for NUMTs-- the expectation is that the coverage results will be errant without properly accounting for multi-mapping reads in these data.
Our simulations show that certain deletions may have different properties when using these estimators. However, there are three major considerations: precision near 0%, sensitivity near 0%, and absolute accuracy in terms of the fraction of deleted molecules. For 3 deletions, we show these simulated properties here:
Reproduced from the original report of the mgatk-del tool, we note the performance of these different methods of heteroplasmy estimation for 3 deletions derived from individuals with Pearson Syndrome:
Based on these simulations and our other analyses, we currently recommend the following:
- If you are interested in studying low deletion heteroplasmy (e.g. purifying selection or alleles near 0) and want to minimize the possibility of false positive heteroplasmy, we recommend the clip-based heteroplasmy metric. However, if you are okay with some false positives and want the most sensitive measure, particularly at low coverages (<20x), a coverage-based estimator is a reasonable approach.
- If you are interested in the most accurate quantification metrics at a range of values (e.g. fraction of deleted molecules in a cybrid experiment), we generally recommend the coverage-based heteroplasmy estimate as we find it to work off the shelf most consistently (i.e. lowest mean absolute error, typically, and at lower coverages, as shown above).
- If you have to process lots of cells within an experiment (i.e. >20,000 cells), the coverage-based estimation is more computationally efficient.
We note that one can utilize our published simulation framework to explicitly model your scenario (which deletion; sequencing chemistry read lengths) to explicitly quantify pros and cons of either method for heteroplasmy estimation.
- During our evaluation and benchmarking of the
mgatk-del
, we examined whether our workflow could detect deletions that span the ends of the mtDNA reference contig (in the d-loop). Indeed, we were sensitive to detect this junction and verified that the standard workflow could be applied.
Shown are the results of the simulated del15000-1000
mapped to the default hg38 genome and then quantified with mgatk-del. The red bars represent the abundance of clipped reads at the simulated deletion junctions, verifying that our deletion method would identify the junctions.
Thus, while we expect most deletions to occur between the origins of replication (Oh and Ol), we anticipate that mgatk-del
can accurately identify and quantify deletions wherever they may occur.
Please raise an issue here