Commit

Clarify the discussion re: within-clade results
huddlej committed Aug 14, 2024
1 parent adbab82 commit e4b9561
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions manuscript/cartography.tex
@@ -444,7 +444,7 @@ \subsection{SARS-CoV-2 clusters recapitulate broad genetic groups corresponding
The VI distance between Pango lineages and clusters without the unclustered sequences was 0.13, confirming that one quarter of the distance between t-SNE clusters and Pango lineages above came from unclustered sequences.
Of the 38 Pango lineages with a t-SNE cluster, 30 lineages (79\%) had a single corresponding t-SNE cluster, seven lineages (18\%) had two or three t-SNE clusters, and one lineage (B.1.617.2) had five t-SNE clusters (Supplementary Fig.~\ref{S_Fig_sarscov2_single_clade_embeddings_tsne_counts}).
Of the 28 t-SNE clusters, 21 clusters (75\%) had a single corresponding Pango lineage, six (21\%) mapped to two or three Pango lineages, and one (cluster 27) mapped to 18 Pango lineages with most sequences from B.1.617.2 and AY.4.
-These results suggest that clusters from t-SNE embeddings can capture Pango-resolution genetic groups when analyzing sequences within a specific Nextstrain clade.
+These results suggest that clusters from t-SNE embeddings can capture more Pango-resolution genetic groups by analyzing sequences within a specific Nextstrain clade.

\subsection{Distance-based embeddings reflect SARS-CoV-2 recombination events}

@@ -492,7 +492,8 @@ \subsection{Limitations of methods and analysis}

Despite the promise of these simple methods to answer important public health questions about human pathogenic viruses, these methods and our analyses suffer from inherent limitations.
The lack of an underlying biological model is both a strength and the clearest limitation of the dimensionality reduction methods we considered here.
-For example, embeddings of SARS-CoV-2 genomes cannot capture the same fine-grained genetic resolution as Pango lineage annotations.
+For example, embeddings of SARS-CoV-2 genomes that represent the complete circulating diversity at a given time cannot capture the same fine-grained genetic resolution as Pango lineage annotations.
+Only t-SNE clusters of SARS-CoV-2 genomes within a single Nextstrain clade get close to defining Pango-resolution genetic groups.
Each method provides only a few parameters to tune its embeddings and these parameters have little effect on the qualitative outcome.
In maintaining a linear relationship between Euclidean and genetic distances, MDS sacrifices the ability to form more accurate genetic clusters for viruses with large genomes like SARS-CoV-2.
Neither t-SNE nor UMAP maintain a linear relationship between pairwise Euclidean and genetic distances across the observed range of genetic distances.