Fitness effects of SARS-CoV-2 amino-acid mutations estimated from observed versus expected mutation counts
This repository estimates the fitness effects of mutations to all SARS-CoV-2 proteins as described in this paper. It does this by analyzing mutations in millions human SARS-CoV-2 sequences, and quantifying how the observed counts of each mutation compares to the expected counts from the underlying neutral mutation rate.
The analysis is by Jesse Bloom and Richard Neher, and makes use of the SARS-CoV-2 mutation-annotated tree provided by the developers of UShER.
For details, see the following references:
- The main paper describing this work is Bloom and Neher (2023).
- Secondary references:
- The approach builds on ideas initially described in Neher (2022).
- The estimation of the neutral mutation rate is done as described in Bloom, Beichman, Neher, & Harris (2023).
- The mutational data are extracted from publicly available SARS-CoV-2 sequences using the UShER package described by Turakhia et al (2021).
The easiest way to access the results is through a set of interactive plots available at https://jbloomlab.github.io/SARS2-mut-fitness. You can also visualize the fitness effects of mutations in the context of a 3D protein structure here using the web-tool dms-viz
.
Here are links to files with major numerical results:
- Final estimates of relative fitnesses of each amino acid. For most purposes, you probably want to use this file for the final "best" estimate of the amino-acid fitnesses. Note that estimates are more accurate for amino acids with larger values of expected_count. Sites in ORF1ab / nsp proteins are listed both with the ORF1ab and nsp numbering.
- Estimates of the effects of amino-acid mutations aggregated across all clades, for each individual clade, and for just subsets of sequences from specific countries.
- Estimates of the relative neutral mutation rates for different types of mutations for each clade, as estimated at four-fold degenerate synonymous sites, are here.
- The counts of each mutation at each site in each clade are here; note this file also contains excluded sites.
- The observed counts of each mutation versus the counts expected from the underlying neutral mutation rate are here; note that this file include excluded sites.
- The nucleotide identities at each position in the "founder" sequence for each viral clade are here. This file also indicates which sites are four-fold degenerate.
- The deep mutational scanning results processed and aggregated from published experimental studies are in this subdirectory.
You can run the pipeline for multiple mutation-annotated tree datasets.
Specify the dataset to show by default in the interactive plots and to store results in ./results/ for as the current_mat
in config.yaml.
Other datasets specifies under the mat_trees
key in config.yaml will also be analyzed, and their results go in subdirectories called ./results_{mat}/
and their interactive plots are available via the GitHub pages interactive rendering at https://jbloomlab.github.io/SARS2-mut-fitness in a list at the bottom of the page.
The analysis is entirely reproducible from the code provided in this GitHub repository.
First build the conda environment in environment.yml. This requires you to install conda, and then run:
conda env create -f environment.yml
That command will create a conda
environment named SARS2-mut-fitness
which you can activate with:
conda activate SARS2-mut-fitness
Then run the snakemake pipeline in Snakefile, which reads its configuration from config.yaml by running:
snakemake -j <n_cpus> --use-conda
where <n_cpus>
is the number of CPUs to use.
Note that the pipeline uses Python scripts in ./scripts/ and Jupyter notebooks in ./notebooks/.
The created files are placed in ./results/. Only some of those results files are tracked in this repo (others are too large or numerous to track). The output interactive HTML altair plots are placed in ./docs/ where they are displayed via GitHub pages at https://jbloomlab.github.io/SARS2-mut-fitness
In addition to the configuration in config.yaml, there is also some input data / specifications in ./data/. This includes the file[data/docs_plot_annotations.yaml usher_masked_sites.yaml](data/docs_plot_annotations.yaml usher_masked_sites.yaml) that specifies how to layout and label the plots rendered on GitHub pages.
Basic steps, performed for each Nextstrain clade:
-
Count occurrences of all mutations along the UShER tree with some filtering to remove spurious ones.
-
Estimate the underlying mutation rates using 4-fold degenerate synonymous sites.
-
Based on the mutation rates and total number of observed synonymous mutations at those sites, calculate expected number of occurrences of each nucleotide mutation.
-
Compare observed to expected mutation counts to estimate the fitness effects of amino-acid mutations.
-
Make overall estimates of amino-acid fitnesses by aggregating mutation-effect estimates across clades.
-
Some additional plotting and analyses, as well as comparison to experimental measurements from deep mutational scanning studies.
Mutation counts are extracted from the pre-built mutation-annotated tree that is made available for use with the UShER package. This tree contains all public access SARS-CoV-2 sequences, with mutations annotated on branches. For the specific version of the tree used here, see the config.yaml file.
We analyze mutations grouping sequences at the level of Nextstrain clades, which are already annotated on the pre-built mutation-annotated tree. For each Nextstrain clade, we use the clade founder genotype manually defined by Neher (2022) and available at the URL indicated in config.yaml, or for newer clades the sequences from Cornelius Roemer here.
We perform some crucial filtering to remove spurious mutations as can arise from bad sequencing, base calling to reference, etc:
-
We ignore all mutations on any branches with high numbers of total mutations, or mutations to either the reference or the founder sequence of the clade in question. The exact settings for this filtering are in config.yaml.
-
We ignore any branches with multiple mutations at the same codon.
-
If a switch is set in config.yaml (it currently is), we specify to exclude any mutations that are reversions from the clade founder to the reference, and also the reverse complement of these mutations. This is designed to remove missing bases called to reference, and also complements of those mutations induced by spurious nodes with such miscalls on downstream branches in the tree.
-
We specify to exclude all mutations at error-prone or problematic sites as manually specified in config.yaml. We also exclude the masked sites in the
UShER
pre-built tree specified here. We also ignore for specific clades any mutations at sites that are masked in theUShER
pipeline for those clades as specified in data/usher_masked_sites.yaml. -
We ignore any clades with small numbers of sequence samples as indicated in config.yaml as these are expected to have too much noise.
-
Note also that indels are ignored, as they are not captured in the mutation-annotated tree.
-
Although the main analysis here uses the total counts of each mutation, we also keep track of how many of these counts are on non-terminal (interior) branches of the tree versus terminal (tip) branches, and also the mean log descendants defined as the log of the number of leaves sharing mutations from each mutated branch in the UShER tree (with log zero values set to one).
The above mutation counts both for all sequences for a clade, and for the sample subsets defined in config.yaml are stored in results/mutation_counts/aggregated.csv. Note that mutations are annotated by the protein(s) they affect, if they are synonymous, if they are at 4-fold degenerate sites, if they are at an excluded site, etc.
We then analyze the mutation spectra and rate of mutations. For this analysis, we only consider synonymous mutations at sites (third codon positions) that are four-fold degenerate in the founder sequence for each clade. The file results/clade_founder_nts/clade_founder_nts.csv specifies which these sites are for each clade.
Specifically, we determine the relative fraction of all 4-fold synonymous mutations that are each type of nucleotide change, and also the relative rates of the different types of mutations, which are just computed as the fraction of all mutations of that type normalized by the composition of the sequence at these 4-fold degenerate sites. These rates for each clade with sufficient counts are written to the file results/synonymous_mut_rates/rates_by_clade.csv.
We also perform analyses for subsets of sequences from different regions (as specified in config.yaml) as well as for the the genome partitioned into halves--these analyses are designed to check that results are not due to artifacts related to sequencing pipelines or hotspots in the genome. For all of these analyses, we only include subsets/partitions with at least the minimum number of mutations indicated in config.yaml.
Most of the analysis of the synonymous mutation spectrum is done by notebooks/synonymous_mut_rates.ipynb.
A plot of the neutral mutation rates is available at https://jbloomlab.github.io/SARS2-mut-fitness
We next compute the "expected" number of observations of each mutation in the absence of selection based on the underlying mutation rates.
Specifically, above we have computed the relative rates of each type of mutation at 4-fold degenerate sites (they are normalized to the frequency of the parent nucleotide).
To get the expected numbers, for each clade, we first compute
The expected number of mutations at each site (under neutrality) from the parental identity of
We compute these expected numbers of mutations versus the actual numbers of mutations at each site, only considering actual mutations that are single nucleotide changes from the clade founder codon.
The expected and actual number of nucleotide mutation counts at each site are in results/expected_vs_actual_mut_counts/expected_vs_actual_mut_counts.csv.
We then collapse the expected and actual counts for each amino-acid mutation, excluding the small number of sites that are in overlapping reading frames and are specified to exclude in config.yaml under the gene_overlaps
key.
Note that in this aggregation, we exclude any amino acids with a codon for which at least one constituent nucleotide site is masked in UShER
.
We estimate the fitness effect fitness_pseudocount
, and we are using the natural log.
So mutations with more counts than expected will have positive fitness effects, and those with less counts than expected will have negative fitness effects.
Note that these fitness effects will only be accurate of the number of expected counts is reasonably high.
The resulting fitness effect estimates are written to the following files:
- results/aa_fitness/aamut_fitness_all.csv: estimates aggregating across all clades.
- results/aa_fitness/aamut_fitness_by_clade.csv: estimates for each individual clade.
- results/aa_fitness/aamut_fitness_by_subset.csv: estimates splitting out by sequence subset (country).
Note also that the above files contain mutations in both numbering of the ORF1ab polypeptide and the nsp proteins contained within it. The nsp protein mutations are a subset of the ORF1ab mutations, so if you examine both you would be double counting mutations. The config.yaml file specifies the conversion from ORF1ab to nsp numbering.
Summaries are also plotted and are available
For each clade have estimated the change in fitness
However, we would like to aggregate the data across multiple clades to estimate amino-acid fitness values at a site under the assumption that these are constant across clades.
Now things get more complicated.
For instance, let's say at our site of interest, the clade founder amino acid is
From these sets of mutation fitness changes, we'd like to estimate the fitness
First, we choose one amino acid to have a fitness value of zero, since the scale of the
Next, we choose the
Finally, we would still like to report an equivalent of the
The resulting amino-acid fitness values (aggregated across all clades) are in the following file:
The are also plotted in the heat maps at https://jbloomlab.github.io/SARS2-mut-fitness
We compare the estimated fitness values to those extracted from a set of deep mutational scanning studies as specified under dms_datasets
in config.yaml.
The processed deep mutational scanning mutation effects are in processed.csv
files in subdirectories of ./results/dms/.
The correlation of the fitness estimates to the deep mutational scanning are plotted at https://jbloomlab.github.io/SARS2-mut-fitness
We compare the fitness effects of mutations to how often the mutation is observed on non-terminal versus terminal branches of the tree, and the mean log tip descendants sharing the mutations for each mutated branch. See the plot linked at https://jbloomlab.github.io/SARS2-mut-fitness
None of these are expected to seriously affect the accuracy of the current analysis, but they could become problematic if the same analysis is applied to substantially more diverged clades:
-
For computing the rates of mutations from 4-fold degenerate synonymous sites, we do not exclude sites / mutations to exclude from the computation of the sequence composition that is used to normalize the counts of different mutations to rates. This would become a problem if you start to specify a very large number of sites to exclude in config.yaml, or if the sequences become highly diverged from the reference (as we exclude reversions to reference and their complement).
-
Multiple mutations in same codon on a branch are excluded from analysis. This is not expected to have much effect, as this is rare.
-
Indels are ignored.
-
The way that codon positions are assigned to identify synonymous sites will fail if there are non-in-frame indels.
-
Four-fold synonymous sites are identified in the clade founder, which could lead to mis-identification if seuqences in a clade become highly diverged from the founder.
-
Multiple mutations in the same codon in a clade can violate the assumptions about how sites are defined as synonymous, etc. For this reason they are excluded, and so we only include mutations that are from the clade founder amino acid identity.
-
We don't consider non-uniformity in mutation rate across the primary sequence.
-
If there are partial sequences in the tree (such that some sites are observed more than others), that is not accounted for.