Refine abstract
huddlej committed Oct 25, 2023
1 parent d5a87a1 commit b1a24b5
Showing 1 changed file with 7 additions and 11 deletions.
18 changes: 7 additions & 11 deletions docs/cartography.tex
@@ -225,22 +225,18 @@
 \end{flushleft}
 % Please keep the abstract below 300 words
 \section*{Abstract}
-\jhc{At 350 words, 50 words over the limit}
+\jhc{274 words, limit is 300}
 Public health studies commonly infer phylogenies from viral genomes to understand transmission dynamics and identify clusters of genetically-related samples.
-However, viruses that reassort or recombine violate phylogenetic assumptions and require more sophisticated network-based methods.
-Even when phylogenies are appropriate, they can be unnecessary; pairwise distances between genomes in multiple sequence alignments can identify clusters of related genomes or assign new genomes to existing phylogenetic clusters.
-Here, we tested whether dimensionality reduction methods could be applied to viral genomes as alternatives when phylogenetic methods are not appropriate or necessary.
-Specifically, we sought to understand how well the resulting embeddings captured known genetic distances and groups.
-We applied principal component analysis (PCA), multidimensional scaling (MDS), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP) to genome sequences for two viruses with well-defined phylogenetic clades and either reassortment (seasonal influenza A/H3N2) or recombination (SARS-CoV-2).
+However, viruses that reassort or recombine violate phylogenetic assumptions and require more sophisticated methods.
+Even when phylogenies are appropriate, they can be unnecessary; pairwise distances between sequences can identify clusters of related samples or assign new samples to existing phylogenetic clusters.
+Here, we tested whether dimensionality reduction methods could capture known genetic distances and groups of two human pathogenic viruses that cause substantial human morbidity and mortality: seasonal influenza A/H3N2 and SARS-CoV-2.
+We applied principal component analysis (PCA), multidimensional scaling (MDS), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP) to sequences with well-defined phylogenetic clades and either reassortment (H3N2) or recombination (SARS-CoV-2).
 For each low-dimensional embedding of sequences, we calculated the correlation between pairwise genetic and Euclidean distances in the embedding and applied a hierarchical clustering method to identify clusters in the embedding.
 We measured the accuracy of these clusters compared to previously defined phylogenetic clades, reassortment clusters, or recombinant lineages.
-First, we applied all methods to H3N2 hemagglutinin sequences for which high-quality expert clade annotations from the World Health Organization exist.
-Next, we applied the same methods to concatenated H3N2 hemagglutinin and neuraminidase sequences and compared the resulting clusters to reassortment groups identified by an ancestral reassortment graph method, TreeKnit.
-Finally, we applied each method to SARS-CoV-2 sequences and compared the resulting clusters to previously defined lineages with and without recombination.
 We found that MDS maintained the strongest correlation between pairwise genetic and Euclidean distances between sequences and best captured the intermediate placement of recombinant lineages between parental lineages.
 However, clusters from t-SNE and UMAP most accurately recapitulated known phylogenetic clades and reassortment groups.
-We show that simple statistical methods without an underlying biological model can accurately represent known genetic relationships for relevant human pathogenic viruses.
-Our open source implementation of these methods for analysis of viral genome sequences ("pathogen-embed") can be easily integrated into other research projects and used for analyses where phylogenetic methods are either unnecessary or inappropriate.
+We show that simple statistical methods without a biological model can accurately represent known genetic relationships for relevant human pathogenic viruses.
+Our open source implementation of these methods for analysis of viral genome sequences can be easily applied when phylogenetic methods are either unnecessary or inappropriate.
 
 % Please keep the Author Summary between 150 and 200 words
 % Use first person. PLOS ONE authors please skip this step.