Add specific recommendations to use methods
Reorganizes the discussion to include a new paragraph of specific
recommendations about how to use embedding methods for specific types of
problems and which parameters to use.

Closes #85
huddlej committed Feb 3, 2024
1 parent dcb2c0a commit 4cada3f
Showing 2 changed files with 25 additions and 8 deletions.
11 changes: 11 additions & 0 deletions manuscript/cartography.bib
@@ -885,3 +885,14 @@ @Article{Lees2019
Pages="304--316",
Month="Feb"
}

@Article{Rambaut2016,
Author="Rambaut, A. and Lam, T. T. and Max Carvalho, L. and Pybus, O. G.",
Title="{{E}xploring the temporal structure of heterochronous sequences using {T}emp{E}st (formerly {P}ath-{O}-{G}en)}",
Journal="Virus Evol",
Year="2016",
Volume="2",
Number="1",
Pages="vew007",
Month="Jan"
}
22 changes: 14 additions & 8 deletions manuscript/cartography.tex
@@ -600,17 +600,23 @@ \section*{Discussion}
% What we learned and pros of methods
We applied four standard dimensionality reduction methods to simulated and natural genome sequences of two relevant human pathogenic viruses and found that the resulting embeddings could reflect pairwise genetic relationships between samples and capture previously identified genetic groups.
From our analysis of simulated influenza- and coronavirus-like sequences, we found that each method produced consistent embeddings of genetic sequences for two distinct pathogens, more than 55 years of evolution, and a wide range of practical method parameters.
These results suggest that researchers could apply these biologically-uninformed methods to a broad range of human pathogenic viruses with minimal tuning of the method parameters.
Of the four methods, MDS most accurately reflected pairwise genetic distances between simulated samples in its embeddings.
From our analysis of natural populations of seasonal influenza H3N2 HA and SARS-CoV-2 sequences, we confirmed that MDS most reliably reflected pairwise genetic distances.
We found that clusters from t-SNE embeddings most accurately recapitulated previously defined genetic groups at the resolution of WHO variants and Nextstrain clades and consistently produced clusters that corresponded to monophyletic groups in phylogenies.
These differences highlight that embedding selection should be informed by the question under investigation.
In some applications, the amount of genetic distance separating samples provides important proxy information about epidemiological dynamics.
In such cases, an embedding such as MDS that not only clusters similar samples, but meaningfully communicates the relative difference between clusters, may be most appropriate.
In other scenarios, the clear assignment of samples into high-level, easily differentiable groups may be the primary task, and an embedding such as t-SNE may be preferable.
Clusters from both MDS and t-SNE embeddings of H3N2 HA and NA sequences accurately matched reassortment clades identified by a biologically-informed model based on ancestral reassortment graphs.
MDS embeddings consistently placed known recombinant lineages of SARS-CoV-2 between their parental lineages, while t-SNE clusters most accurately captured recombinant lineages.
From these results, we conclude that tree-free dimensionality reduction methods can provide valuable biological insights for human pathogenic viruses through easily interpretable visualizations of genetic relationships and the ability to account for genetic variation that phylogenetic methods cannot use, including indels, reassortment, and recombination.
These results show that tree-free dimensionality reduction methods can provide valuable biological insights for human pathogenic viruses through easily interpretable visualizations of genetic relationships and the ability to account for genetic variation that phylogenetic methods cannot use, including indels, reassortment, and recombination.

% Recommendations for application to new pathogens
From these results, we can also make the following recommendations about how to apply these methods to other viral pathogens.
First, sample the available genome sequences evenly across time and geography to minimize bias in embeddings.
Then, choose which embedding method to use based on the question under investigation.
For analyses that require the most accurate low-dimensional representation of pairwise genetic distances across local and global scales, use MDS with 3 dimensions.
For analyses that need to find clusters of closely related samples, use t-SNE with a perplexity of 100 (or less, if using fewer than 100 samples) and a learning rate that scales with the number of samples in the data.
In all cases, plot the relationship between pairwise genetic distances and Euclidean distances in each embedding.
These plots reveal the range of genetic distances that an embedding can represent linearly and act as a sanity check akin to plotting the temporal signal present in samples prior to inferring a time-scaled phylogeny \cite{Rambaut2016,Sagulenko2018}.
Before finding clusters in the t-SNE embedding, determine the minimum genetic distance desired between clusters, and use the pairwise genetic and Euclidean distance plot to find the corresponding Euclidean distance to use as a threshold for HDBSCAN.
While HDBSCAN clusters require this pathogen-specific tuning, the linear relationship between Euclidean and genetic distance remains robust to changes in method parameters.
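
The threshold selection described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the manuscript's implementation: the helper names (`hamming`, `fit_line`, `euclidean_threshold`), the toy sequences, and the toy embedding coordinates are all hypothetical, and genetic distance is assumed here to be a simple Hamming distance between aligned sequences of equal length.

```python
from itertools import combinations
from math import dist


def hamming(a, b):
    """Count mismatched positions between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all genetic distances are identical; cannot fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def euclidean_threshold(sequences, embedding, min_genetic_distance):
    """Map a desired minimum genetic distance to a Euclidean distance.

    Fits a line through (genetic distance, Euclidean distance) for every
    pair of samples and evaluates it at min_genetic_distance.
    """
    pairs = list(combinations(range(len(sequences)), 2))
    genetic = [hamming(sequences[i], sequences[j]) for i, j in pairs]
    euclidean = [dist(embedding[i], embedding[j]) for i, j in pairs]
    slope, intercept = fit_line(genetic, euclidean)
    return slope * min_genetic_distance + intercept


# Hypothetical toy data: four aligned sequences and their 2D embedding.
sequences = ["AAAA", "AAAT", "AATT", "ATTT"]
embedding = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (6.0, 0.0)]

# Euclidean distance corresponding to two mutations between clusters
# (approximately 4.0 for this toy data).
threshold = euclidean_threshold(sequences, embedding, min_genetic_distance=2)
print(round(threshold, 6))
```

In practice the same pairs would also be plotted to check how far the linear range extends before relying on the fit. The resulting value could then be supplied to the clustering step, for example as the `cluster_selection_epsilon` parameter of the `hdbscan` package's `HDBSCAN` class; that mapping is an assumption about how the threshold is applied and is not stated in the text above.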

% Limitations of methods and analysis
Despite the promise of these simple methods to answer important public health questions about human pathogenic viruses, both the methods and our analyses have inherent limitations.
@@ -623,7 +629,7 @@ \section*{Discussion}
As a result, viewers cannot know that samples mapping far apart in a t-SNE or UMAP embedding are as genetically distant as they appear.
In maintaining a linear relationship between Euclidean and genetic distances, MDS sacrifices the ability to form more accurate genetic clusters for viruses with large genomes like SARS-CoV-2.
Given these limitations, we do not expect these methods to replace biologically-informed methods that provide more meaningful parameters to tune their algorithms.
Instead, we expect that researchers can use these methods for rapid visualization and clustering of their genome sequences as the first step prior to analysis with more sophisticated and computationally intensive methods.
Instead, these methods provide an easy first step to produce interpretable visualizations and clusters of genome sequences, prior to analysis with more sophisticated methods with biological models.

We note that our analysis reflects a small subset of human pathogenic viruses and dimensionality reduction methods.
We focused on analysis of two respiratory RNA viruses that contribute dramatically to seasonal human morbidity and mortality, but numerous alternative pathogens would also have been relevant subjects.
@@ -639,7 +645,7 @@ \section*{Discussion}
In the short term, researchers can immediately apply the methods we describe here to seasonal influenza and SARS-CoV-2 genomes to identify biologically relevant clusters.
Researchers can also apply these methods to find relevant clusters for other viruses by evaluating the pairwise Euclidean and genetic distances for each virus and tuning the Euclidean distance thresholds for HDBSCAN to capture the desired granularity of genetic clusters.
In the long term, we expect researchers will benefit from expanding the breadth of dimensionality reduction methods applied to viruses and the breadth of viral diversity assessed by these methods.
Additionally, the combination of dimensionality reduction methods and clustering with HDBSCAN provides the foundation for future methods to automatically identify reassortment groups and recombinant lineages.
Additionally, the combination of dimensionality reduction methods and clustering with HDBSCAN provides the foundation for future methods to automatically identify reassortant and recombinant lineages.

\section*{Conclusion}
