From e4b95614ea7715fbf8054f08b6c5d14be12e0c69 Mon Sep 17 00:00:00 2001
From: John Huddleston
Date: Wed, 14 Aug 2024 14:35:25 -0700
Subject: [PATCH] Clarify the discussion re: within-clade results

---
 manuscript/cartography.tex | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/manuscript/cartography.tex b/manuscript/cartography.tex
index 9b366b13..85a276b6 100644
--- a/manuscript/cartography.tex
+++ b/manuscript/cartography.tex
@@ -444,7 +444,7 @@ \subsection{SARS-CoV-2 clusters recapitulate broad genetic groups corresponding
 The VI distance between Pango lineages and clusters without the unclustered sequences was 0.13, confirming that one quarter of the distance between t-SNE clusters and Pango lineages above came from unclustered sequences.
 Of the 38 Pango lineages with a t-SNE cluster, 30 lineages (79\%) had a single corresponding t-SNE cluster, seven lineages (18\%) had two or three t-SNE clusters, and one lineage (B.1.617.2) had five t-SNE clusters (Supplementary Fig.~\ref{S_Fig_sarscov2_single_clade_embeddings_tsne_counts}).
 Of the 28 t-SNE clusters, 21 clusters (75\%) had a single corresponding Pango lineage, six (21\%) mapped to two or three Pango lineages, and one (cluster 27) mapped to 18 Pango lineages with most sequences from B.1.617.2 and AY.4.
-These results suggest that clusters from t-SNE embeddings can capture Pango-resolution genetic groups when analyzing sequences within a specific Nextstrain clade.
+These results suggest that clusters from t-SNE embeddings can capture more Pango-resolution genetic groups by analyzing sequences within a specific Nextstrain clade.
 
 \subsection{Distance-based embeddings reflect SARS-CoV-2 recombination events}
 
@@ -492,7 +492,8 @@ \subsection{Limitations of methods and analysis}
 
 Despite the promise of these simple methods to answer important public health questions about human pathogenic viruses, these methods and our analyses suffer from inherent limitations.
 The lack of an underlying biological model is both a strength and the clearest limitation of the dimensionality reduction methods we considered here.
-For example, embeddings of SARS-CoV-2 genomes cannot capture the same fine-grained genetic resolution as Pango lineage annotations.
+For example, embeddings of SARS-CoV-2 genomes that represent the complete circulating diversity at a given time cannot capture the same fine-grained genetic resolution as Pango lineage annotations.
+Only t-SNE clusters of SARS-CoV-2 genomes within a single Nextstrain clade get close to defining Pango-resolution genetic groups.
 Each method provides only a few parameters to tune its embeddings and these parameters have little effect on the qualitative outcome.
 In maintaining a linear relationship between Euclidean and genetic distances, MDS sacrifices the ability to form more accurate genetic clusters for viruses with large genomes like SARS-CoV-2.
 Neither t-SNE nor UMAP maintain a linear relationship between pairwise Euclidean and genetic distances across the observed range of genetic distances.