Skip to content

Latest commit

 

History

History
313 lines (217 loc) · 30 KB

05.methods.md

File metadata and controls

313 lines (217 loc) · 30 KB

Materials and Methods

Identifying de novo germline mutations in the BXDs

The BXD resource currently comprises a total of 152 recombinant inbred lines (RILs). BXDs were derived from either F2 or advanced intercrosses, and subsequently inbred by brother-sister mating for up to 180 generations [@PMID:33472028]. BXDs were generated in distinct breeding "epochs," which were each initiated with a distinct cross of C57BL/6J and DBA/2J parents; epochs 1, 2, 4, and 6 were derived from F2 crosses, while epochs 3 and 5 were derived from advanced intercrosses [@PMID:33472028]. Previously, we analyzed whole-genome sequencing data from the BXDs and identified candidate de novo germline mutations in each line [@PMID:35545679]. A detailed description of the methods used for DNA extraction, sequencing, alignment, and variant processing, as well as the characteristics of the de novo mutations, are available in a previous manuscript [@PMID:35545679].

Briefly, we identified private single-nucleotide mutations in each BXD that were absent from all other BXDs, as well as from the C57BL/6J and DBA/2J parents. We required each private variant to be meet the following criteria:

  • genotyped as either homozygous or heterozygous for the alternate allele, with at least 90% of sequencing reads supporting the alternate allele

  • supported by at least 10 sequencing reads

  • Phred-scaled genotype quality of at least 20

  • must not overlap regions of the genome annotated as segmental duplications or simple repeats in GRCm38/mm10

  • must occur on a parental haplotype that was inherited by at least one other BXD at the same locus; these other BXDs must be homozygous for the reference allele at the variant site

A new approach to discover germline mutator alleles

Calculating aggregate mutation spectrum distance

We developed a new approach to discover loci that affect the germline de novo mutation spectrum in biparental RILs (Figure @fig:distance-method).

We assume that a collection of haplotypes has been genotyped at informative markers, and that de novo germline mutations have been identified on each haplotype.

At each informative marker, we divide haplotypes into two groups based on the parental allele that they inherited. We then compute a $k$-mer mutation spectrum using the aggregate mutation counts in each haplotype group. The $k$-mer mutation spectrum contains the frequency of every possible $k$-mer mutation type in a collection of mutations, and can be represented as a vector of size $6 \times 4^{k - 1}$ after collapsing by strand complement. For example, the 1-mer mutation spectrum is a 6-element vector that contains the frequencies of C>T, C>G, C>A, A>G, A>T, and A>C mutations. Since C>T transitions at CpG nucleotides are often caused by a distinct mechanism (spontaneous deamination of methylated cytosine), we expand the 1-mer mutation spectrum to include a separate category for CpG>TpG mutations [@PMID:19488047].

At each marker, we then compute the cosine distance between the two aggregate spectra. The cosine distance between two vectors $\mathbf{A}$ and $\mathbf{B}$ is defined as

$$D^C = 1 - \frac{\mathbf{A} \cdot \mathbf{B}}{||\mathbf{A}|| \ ||\mathbf{B}||}$$

where $||\mathbf{A}||$ and $||\mathbf{B}||$ are the $L^2$ (or Euclidean) norms of $\mathbf{A}$ and $\mathbf{B}$, respectively. The cosine distance metric has a number of favorable properties for comparing mutation spectra. Since it adjusts for the magnitude of the two input vectors, cosine distance can be used to compare two spectra with unequal total mutation counts (even if those total counts are relatively small). Additionally, by calculating the cosine distance between mutation spectra, we avoid the need to perform separate comparisons of mutation counts at each individual $k$-mer mutation type.

Inspired by methods from QTL mapping [@PMID:7851788;@PMID:30591514], we use permutation tests to establish genome-wide cosine distance thresholds. In each of $N$ permutation trials, we randomly shuffle the per-haplotype mutation data such that haplotype labels no longer correspond to the correct mutation counts. Using the shuffled mutation data, we perform a genome-wide scan as described above, and record the maximum cosine distance observed at any locus. After $N$ permutations (usually 10,000), we compute the $1 - p$ percentile of the distribution of maximum statistics, and use that percentile value as a genome-wide significance threshold (for example, at $p = 0.05$).

Estimating confidence intervals around AMSD peaks

If we identified an adjusted cosine distance peak on a particular chromosome, we used a bootstrap resampling approach [@PMID:8725246] to estimate confidence intervals. In each of $N = 10,000$ trials, we resampled the mutation spectrum data and corresponding marker genotypes (on the chromosome of interest) with replacement. Using those resampled spectra and genotypes, we performed an aggregate mutation spectrum distance scan on the chromosome of interest and recorded the position of the marker with the largest adjusted cosine distance value. We then defined a 90% confidence interval by finding two marker locations between which 90% of all $N$ bootstrap samples produced a peak cosine distance value. In other words, we estimated the bounds of the 90% confidence interval by finding the markers that defined the 5th and 95th percentiles of the distribution of maximum adjusted cosine distance values across $N$ bootstrap trials. We note, however, that the bootstrap can exhibit poor performance in QTL mapping studies [@PMID:16783000]; namely, bootstrap confidence intervals tend to be larger than those estimated using either a "LOD drop" method or a Bayes credible interval, and can exhibit poorer-than-expected coverage (a measure of whether the confidence interval contains the true QTL location).

Accounting for relatedness between strains

We expect each BXD to derive approximately 50% of its genome from C57BL/6J and 50% from DBA/2J. As a result, every pair of BXDs will likely have identical genotypes at a fraction of markers. Pairs of more genetically similar BXDs may also have more similar mutation spectra, potentially due to shared polygenic effects on the mutation process. Therefore, at a given marker, if the BXDs that inherited D alleles are more genetically dissimilar from those that inherited B alleles (considering all loci throughout the genome in our measurement of genetic similarity), we might expect the aggregate mutation spectra in the two groups to also be more dissimilar.

We implemented a simple approach to account for these potential issues of relatedness. At each marker $g_i$, we divide BXD haplotypes into two groups based on the parental allele they inherited. As before, we first compute the aggregate mutation spectrum in each group of haplotypes and calculate the cosine distance between the two aggregate spectra ($D^{C}_{i}$). Then, within each group of haplotypes, we calculate the allele frequency of the D allele at every marker along the genome to obtain a vector of length $n$, where $n$ is the number of genotyped markers. To quantify the genetic similarity between the two groups of haplotypes, we calculate the Pearson correlation coefficient $r_i$ between the two vectors of marker-wide D allele frequencies.

Put another way, at every marker $g_i$ along the genome, we divide BXD haplotypes into two groups and compute two metrics: $D^{C}{i}$ (the cosine distance between the two groups' aggregate spectra) and $r_i$ (the correlation between genome-wide D allele frequencies in the two groups). To control for the potential effects of genetic similarity on cosine distances, we regress $\left(D^C{1}, D^C_{2}, \ldots D^C_{n} \right)$ on $\left( r_1, r_2, \ldots r_n \right)$ for all $n$ markers using an ordinary least-squares model. We then use the residuals from the fitted model as the "adjusted" cosine distance values for each marker. If genome-wide genetic similarity between haplotypes perfectly predicts cosine distances at each marker, these residuals will all be 0 (or very close to 0). If genome-wide genetic similarity has no predictive power, the residuals will simply represent the difference between the observed cosine distance at a single marker and the marker-wide mean of cosine distances.

Accounting for BXD population structure due to breeding epochs

The current BXD family was generated in six breeding "epochs." As discussed previously, each epoch was initiated with a distinct cross of C57BL/6J and DBA/2J parents; BXDs in four of the epochs were generated following F2 crosses of C57BL/6J and DBA/2J, and BXDs in the other two were generated following advanced intercrosses. Due to this breeding approach the BXD epochs differ from each other in a few important ways. For example, BXDs derived in epochs 3 and 5 (i.e., from advanced intercross) harbor larger numbers of fixed recombination breakpoints than those from epochs 1, 2, 4, and 6 [@PMID:33472028]. Although the C57BL/6J and DBA/2J parents used to initialize each epoch were completely inbred, they each possessed a small number unique de novo germline mutations that were subsequently inherited by many of their offspring. A number of these "epoch-specific" variants have also been linked to phenotypic variation observed between BXDs from different epochs [@PMID:33472028;@PMID:31274109;@PMID:30984232;@PMID:31057322].

To account for potential population structure, as well as these epoch-specific effects, we introduced the ability to perform stratified permutation tests in the aggregate mutation spectrum distance approach. Normally, in each of N permutations we shuffle the per-haplotype mutation spectrum data such that haplotype labels no longer correspond to the correct mutation spectra (i.e., shuffle mutation spectra across epochs). In the stratified approach, we instead shuffle per-haplotype mutation data within epochs, preserving epoch structure while still enabling mutation spectra permutations.

We used this epoch-aware approach for all permutation tests presented in this manuscript.

Implementation and source code

The aggregate mutation spectrum distance method was implemented in Python, and relies heavily on the following Python libraries: numpy, pandas, matplotlib, scikit-learn, pandera, seaborn, and numba [@doi:10.1038/s41586-020-2649-2;@doi:10.5281/zenodo.3509134;@doi:10.1109/MCSE.2007.55;@url:https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html;@doi:10.25080/Majora-342d178e-010;@doi:10.21105/joss.03021;@doi:10.1145/2833157.2833162].

The code underlying AMSD, as well as documentation of the method, is available on GitHub (https://github.com/quinlan-lab/proj-mutator-mapping). We have also deposited a reproducible Snakemake [@doi:10.12688/f1000research.29032.1] workflow for running reproducing all analyses and figures presented in the manuscript.

Simulations to assess the power of the aggregate mutation spectrum distance approach

We performed a series of simple simulations to estimate our power to detect alleles that affect the germline mutation spectrum using the aggregate mutation spectrum distance method.

Simulating genotypes

First, we simulate genotypes on a population of haplotypes at a collection of sites. We define a matrix $G$ of size $(s, h)$, where $s$ is the number of sites and $h$ is the number of haplotypes. We assume that every site is biallelic, and that the minor allele frequency at every site is 0.5. For every entry $G_{i,j}$, we take a single draw from a uniform distribution in the interval $[0.0, 1.0)$. If the value of that draw is less than 0.5, we assign the value of $G_{i,j}$ to be $1$. Otherwise, we assign the value of $G_{i,j}$ to be $0$.

Defining expected mutation type probabilities

Next, we define a vector of 1-mer mutation probabilities:

$$P = \left( 0.29, \ 0.17, \ 0.12, \ 0.075, \ 0.1, \ 0.075, \ 0.17 \right)$$

These probabilities sum to 1 and roughly correspond to the expected frequencies of C>T, CpG>TpG, C>A, C>G, A>T, A>C, and A>G de novo germline mutations in mice, respectively [@PMID:31492841]. If we are simulating the 3-mer mutation spectrum, we modify the vector of mutation probabilities $P$ to be length 96, and assign every 3-mer mutation type a value of $\frac{P_c}{16}$, where $P_c$ is the probability of the "central" mutation type associated with the 3-mer mutation type. In other words, each of the 16 possible NCN>NTN 3-mer mutation types would be assigned a mutation probability of $\frac{P_c}{16} = \frac{0.46}{16} = 0.02875$. We then generate a vector of lambda values by scaling the mutation probabilities by the number of mutations we wish to simulate ($m$):

$$\lambda = Pm$$

We also create a second vector of lambda values ($\lambda^{\prime}$), in which we multiply the $\lambda$ value of a single mutation type by the mutator effect size $e$.

Rather than simulate the same mean number of mutations ($m$) on every haplotype, we also performed a series of simulations in which the mean number of mutations on each haplotype was allowed to vary. The BXD RILs were inbred for variable numbers of generations, and each BXD therefore accumulated a variable number of de novo germline mutations [@PMID:35545679]. To more closely approximate the BXD haplotypes, we performed simulations in which the number of mutations ($m$) on each haplotype was drawn from a uniform distribution from $m$ to $20m$. In other words, we created a vector of mutation counts $M$ containing $h$ evenly-spaced integers from $m$ to $20m$, where $h$ is the number of simulated haplotypes. Thus, if we simulated between 100 and 2,000 mutations on 50 haplotypes, the $i$th entry of $M$ would be $100 + \frac{(2,000 - 100)}{50}i$. Each haplotype's mean number of mutations was then assigned by looking up the haplotype's index $i$ in $M$.

In our simulations, we assume that genotypes at a single site (the "mutator locus") are associated with variation in the mutation spectrum. That is, at a single site $s_i$, all of the haplotypes with $1$ alleles should have elevated rates of a particular mutation type and draw their mutation counts from $\lambda^{\prime}$, while all of the haplotypes with $0$ alleles should have "wild-type" rates of that mutation type and draw their mutation counts from $\lambda$. We therefore pick a random site $s_i$ to be the "mutator locus," and identify the indices of haplotypes in $G$ that were assigned $1$ alleles at $s_i$. We call these indices $h_{mut}$.

Simulating mutation spectra

To simulate the mutation spectrum on our toy population of haplotypes, we define a matrix $C$ of size $(h, n)$, where $n = 6 \times 4^{k - 1}$ (or if $k = 1$ and we include CpG>TpG mutations, $6 \times 4^{k - 1} + 1$).

Then, we populate the matrix $C$ separately for mutator and wild-type haplotypes. For every row $i$ in the matrix (i.e., for every haplotype), we first ask if $i$ is in $h_{mut}$ (that is, if the haplotype at index $i$ was assigned a $1$ allele at the "mutator locus"). If so, we set the values of $C_i$ to be the results of a single Poisson draw from $\lambda^{\prime}$. If row $i$ is not in $h_{mut}$, we set the values of $C_i$ to be the results of a single Poisson draw from $\lambda$.

Assessing power to detect a simulated mutator allele using AMSD

For each combination of parameters (number of simulated haplotypes, number of simulated markers, mutator effect size, etc.), we run 100 independent trials. In each trial, we simulate the genotype matrix $G$ and the mutation counts $C$. We calculate a "focal" cosine distance as the cosine distance between the aggregate mutation spectra of haplotypes with either genotype at $s_i$ (the site at which we artificially simulated an association between genotypes and mutation spectrum variation). We then perform an aggregate mutation spectrum distance scan using $N = 1,000$ permutations. If fewer than 5% of the $N$ permutations produced a cosine distance greater than or equal to the focal distance, we say that the approach successfully identified the mutator allele in that trial.

Assessing power to detect a simulated mutator allele using quantitative trait locus (QTL) mapping

Using simulated data, we also assessed the power of traditional quantitative trait locus (QTL) mapping to detect a locus associated with mutation spectrum variation. As described above, we simulated both genotype and mutation spectra for a population of haplotypes under various conditions (number of mutations per haplotype, mutator effect size, etc.). Using those simulated data, we used R/qtl2 [@PMID:30591514] to perform a genome scan for significant QTL as follows; we assume that the simulated genotype markers are evenly spaced (in physical Mbp coordinates) on a single chromosome. First, we calculate the fraction of each haplotype's de novo mutations that belong to each of the $6 \times 4^{k-1}$ possible $k$-mer mutation types. We then convert the simulated genotypes at each marker to genotype probabilities using the calc_genoprob function in R/qtl2, with map_function = "c-f" and error_prob = 0. For every $k$-mer mutation type, we use genotype probabilities and per-haplotype mutation fractions to perform a scan for QTL with the scan1 function; to make the results more comparable to those from the AMSD method, we do not include any covariates or kinship matrices in these QTL scans. We then use the scan1perm function to perform 1,000 permutations of the per-haplotype mutation fractions and calculate log-odds (LOD) thresholds for significance. We consider the QTL scan to be "successful" if it produces a LOD score above the significance threshold (defined using $\alpha = \frac{0.05}{7}$) for the marker at which we simulated an association with mutation spectrum variation.

Note: In our simulations, we augment the mutation rate of a single $k$-mer mutation type on haplotypes carrying the simulated mutator allele. However, in an experimental setting, we would not expect to have a priori knowledge of the mutation type affected by the mutator. Thus, by using an alpha threshold of 0.05 in our simulations, we would likely over-estimate the power of QTL mapping for detecting the mutator. Since we would need to perform 7 separate QTL scans (one for each 1-mer mutation type plus CpG>TpG) in an experimental setting, we calculate QTL LOD thresholds at a Bonferroni-corrected alpha value of $\alpha = \frac{0.05}{7}$.

Applying the aggregate mutation spectrum distance method to the BXDs

We downloaded previously-generated BXD de novo germline mutation data from the GitHub repository associated with our previous manuscript, which was also archived at Zenodo [@url:https://github.com/tomsasani/bxd_mutator_manuscript;@doi:10.5281/zenodo.5941048;@PMID:35545679], and downloaded a CSV file of BXD genotypes at ~7,300 informative markers from GeneNetwork [@url:http://gn1.genenetwork.org/dbdoc/BXDGeno.html;@PMID:27933521]. We also downloaded relevant metadata about each BXD from the manuscript describing the updated BXD resource [@PMID:33472028]. These files are included in the GitHub repository associated with this manuscript.

As in our previous manuscript [@PMID:35545679], we included mutation data from a subset of the 152 BXDs in our aggregate mutation spectrum distance scans. Specifically, we removed BXDs that were backcrossed to a C57BL/6J or DBA/2J parent at any point during the inbreeding process (usually, in order to rescue that BXD from inbreeding depression [@PMID:33472028]). We also removed BXD68 from our genome-wide scans, since we previously discovered a hyper-mutator phenotype in that line; the C>A germline mutation rate in BXD68 is over 5 times the population mean, likely due to a private deleterious nonsynonymous mutation in Mutyh [@PMID:35545679]. In our previous manuscript, we removed any BXDs that had been inbred for fewer than 20 generations, as it takes approximately 20 generations of strict brother-sister mating for an RIL genome to become >98% homozygous [@url:https://link.springer.com/book/10.1007/978-1-349-04904-2]. As a result, any potential mutator allele would almost certainly be either fixed or lost after 20 generations; if fixed, the allele would remain linked to any excess mutations it causes for the duration of subsequent inbreeding. In other words, the de novo mutations present in the genome of a "young" BXD (i.e., a BXD that was inbred for fewer than 20 generations) would not reflect a mutator allele's activity as strongly as the mutations present in the genome of a much older BXD. This presented a challenge when we used quantitative trait locus mapping to discover mutator alleles in our previous manuscript, since the phenotypes (i.e., C>A mutation rates) of young and old BXDs were weighted equally; thus, we simply removed the younger BXDs from our analysis to avoid using their especially noisy mutation spectra. Since AMSD computes an aggregate mutation spectrum using all BXDs that inherited a particular allele at a locus, and can overcome the sparsity and noise of individual mutation spectra, we chose to include these younger BXDs in our genome-wide scans in this study.

In total, we included 117 BXDs in our genome-wide scans.

Identifying candidate single-nucleotide mutator alleles overlapping the chromosome 6 peak

We investigated the region implicated by our aggregate mutation spectrum distance approach on chromosome 6 by subsetting the joint-genotyped BXD VCF file (European Nucleotide Archive accession PRJEB45429 [@url:https://www.ebi.ac.uk/ena/browser/view/PRJEB45429]) using bcftools [@PMID:33590861]. We defined the candidate interval surrounding the cosine distance peak on chromosome 6 as the 90% bootstrap confidence interval (extending from approximately 95 Mbp to 114 Mbp). To predict the functional impacts of both single-nucleotide variants and indels on splicing, protein structure, etc., we annotated variants in the BXD VCF using the following snpEff [@PMID:22728672] command:

 java -Xmx16g -jar /path/to/snpeff/jarfile GRCm38.75 /path/to/bxd/vcf > /path/to/uncompressed/output/vcf

and used cyvcf2 [@PMID:28165109] to iterate over the annotated VCF file in order to identify nonsynonymous fixed differences between the parental C57BL/6J and DBA/2J strains.

Identifying candidate structural variant alleles overlapping the chromosome 6 peak

We downloaded summary VCFs containing insertion, deletion and inversion structural variants (identified via high-quality, long-read assembly of inbred laboratory mouse strains [@doi:10.1016/j.xgen.2023.100291]) from the Zenodo link associated with the Ferraj et al. manuscript: https://doi.org/10.5281/zenodo.7644286.

We then downloaded a TSV file containing RefSeq gene predictions in GRCm39/mm39 from the UCSC Table Browser [@PMID:14681465], and used the bx-python library [@url:https://github.com/bxlab/bx-python] to intersect the interval spanned by each structural variant with the intervals spanned by the txStart and txEnd of every RefSeq entry.

We queried all structural variants within the 90% bootstrap confidence interval on chromosome 6.

Extracting mutation signatures

We used SigProfilerExtractor (v.1.1.21) [@PMID:30371878] to extract mutation signatures from the BXD mutation data. After converting the BXD mutation data to the "matrix" input format expected by SigProfilerExtractor, we ran the sigProfilerExtractor method as follows:

# install the mm10 mouse reference data
genInstall.install('mm10')

# run mutation signature extraction
sig.sigProfilerExtractor(
    'matrix',
    /path/to/output/directory,
    /path/to/input/mutations,
    maximum_signatures=10,
    nmf_replicates=100,
    opportunity_genome="mm10",
)

Comparing mutation spectra between Mouse Genomes Project strains

We downloaded mutation data from a previously published analysis [@PMID:30753674] (Supplementary File 1, Excel Table S3) that identified strain-private mutations in 29 strains that were originally whole-genome sequenced as part of the Sanger Mouse Genomes (MGP) project [@PMID:21921910]. When comparing counts of each mutation type between MGP strains that harbored either D or B alleles at the chromosome 4 or chromosome 6 mutator loci, we adjusted mutation counts by the number of callable A, T, C, or G nucleotides in each strain as described previously [@PMID:35545679].

Querying GeneNetwork for eQTLs at the mutator locus

We used the online GeneNetwork resource [@PMID:27933521], which contains array- and RNA-seq-derived expression measurements in a wide variety of tissues, to find cis-eQTLs for the DNA repair genes we implicated under the cosine distance peak on chromosome 6. On the GeneNetwork homepage (genenetwork.org), we selected the "BXD Family" Group and used the Type dropdown menu to select each of the specific expression datasets described in Table @tbl:eqtl-provenance. In the Get Any text box, we then entered the listed gene name and clicked Search. After selecting the appropriate trait ID on the next page, we used the Mapping Tools dropdown to run Hayley-Knott regression [@PMID:16718932] with default parameters: 1,000 permutations, interval mapping, no cofactors, and WGS-based genotypes (2022).

If we discovered a significant cis-eQTL for the gene of interest (that is, a locus on chromosome 6 with an LRS greater than or equal to the "significant LRS" genome-wide threshold), we then performed a second genome-wide association test for the trait of interest using GEMMA [@PMID:2453419] with the following parameters: WGS-based marker genotypes, a minor allele frequency threshold of 0.05, and leave-one-chromosome-out (LOCO). By using both Haley-Knott regression and GEMMA, we could first discover loci that exceeded a genome-wide LRS threshold, and then more precisely estimate the effect of those loci on gene expression [@doi:10.1101/2020.12.23.424047].

The exact names of the expression datasets we used for each tissue are shown in Table @tbl:eqtl-provenance below:

Tissue name Complete name of GeneNetwork expression data
Kidney Mouse kidney M430v2 Sex Balanced (Aug06) RMA
Gastrointestinal UTHSC Mouse BXD Gastrointestinal Affy MoGene 1.0 ST Gene Level (Apr14) RMA
Hematopoetic stem cells UMCG Stem Cells ILM6v1.1 (Apr09) transformed
Spleen UTHSC Affy MoGene 1.0 ST Spleen (Dec10) RMA
Liver UTHSC BXD Liver RNA-Seq Avg (Oct19) TPM Log2
Heart NHLBI BXD All Ages Heart RNA-Seq (Nov20) TMP Log2 **
Hippocampus Hippocampus Consortium M430v2 (Jun06) RMA

Table: Names of gene expression datasets used for each tissue type on GeneNetwork {#tbl:eqtl-provenance}

Calculating the frequencies of candidate mutator alleles in wild mice

To determine the frequencies of the Ogg1 and Setmar nonsynonymous mutations in other populations of mice, we queried a VCF file containing genome-wide variation in 67 wild-derived mice from four species of Mus [@PMID:27622383]. We calculated the allele frequency of each nonsynonymous mutation in each of the four species or subspecies (Mus musculus domesticus, Mus musculus musculus, Mus musculus castaneus, and Mus spretus), including genotypes that met the following criteria:

  • supported by at least 10 sequencing reads

  • Phred-scaled genotype quality of at least 20

Testing for epistasis between the two mutator loci

To test for statistical epistasis between the mutator loci on chromosome 4 and chromosome 6, we modeled C>A mutation rates in the BXDs as a function of genotypes at either locus. Specifically, we tested for statistical interaction between genotypes by fitting a generalized linear model in the R statistical language as follows:

m1 <- glm(Count ~ offset(log(ADJ_AGE)) + Genotype_A * Genotype_B, data = data, family=poisson())

In this model, Count is the count of C>A de novo mutations observed in each BXD. ADJ_AGE is the product of the number of "callable" cytosine/guanine nucleotides in each BXD (i.e., the total number of cytosines/guanines covered by at least 10 sequencing reads) and the number of generations for which the BXD was inbred. We included the logarithm of ADJ_AGE as an "offset" in order to model the response variable as a rate (expressed per base-pair, per generation) rather than an absolute count; the BXDs differ in both their durations of inbreeding and the proportions of their genomes that were sequenced to sufficient depth, which influences the number of mutations we observe in each BXD. The Genotype_A and Genotype_B terms represent the genotypes of BXDs at markers rs27509845 and rs46276051 (the markers with peak cosine distances on chromosomes 4 and 6 in the two aggregate mutation spectrum distance scans). We limited our analysis to the n = 108 BXDs that were homozygous at both sites, allowing us to model genotypes at either locus as binary variables ("B" or "D"). Using analysis of variance (ANOVA), we then compared the model including an interaction effect to a model including only additive effects:

m2 <- glm(Count ~ offset(log(ADJ_AGE)) + Genotype_A + Genotype_B, data = data, family=poisson())
anova(m1, m2, test="Chisq")

If model m1 is a significantly better fit to the data than m2, we can reject the null hypothesis that the effect of D genotypes at both markers is equal to the sum of the marginal effects of D genotypes at either rs27509845 or rs46276051. In other words, if m1 is a better fit than m2, then the combined effect of D genotypes at both markers is non-additive, and indicative of statistical epistasis.

We tested for epistasis in the Sanger Mouse Genomes Project (MGP) strains using a nearly-identical approach. In this analysis, we fit two models as follows:

m1 <- glm(Count ~ offset(log(CALLABLE_C)) + Genotype_A * Genotype_B, data = data, family=poisson())

m2 <- glm(Count ~ offset(log(CALLABLE_C)) + Genotype_A + Genotype_B, data = data, family=poisson())

where Count is the count of strain-private C>A mutations observed in each MGP strain [@PMID:30753674]. The CALLABLE_C term represents the total number of cytosine and guanine nucleotides that were accessible for mutation calling in each strain, and the Genotype_A and Genotype_B terms represent MGP genotypes at the chromosome 4 and chromosome 6 mutator loci, respectively. We compared the two models using ANOVA as described above.