Update abstract, fig captions, and some supp figs
Need to finish updating supp fig captions in the README.
huddlej committed Aug 28, 2024
1 parent 4d71090 commit 7a42428
Showing 1 changed file with 41 additions and 12 deletions.
53 changes: 41 additions & 12 deletions README.md
@@ -19,9 +19,8 @@ In this work, we tested whether dimensionality reduction methods could capture k
We applied principal component analysis (PCA), multidimensional scaling (MDS), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP) to sequences with well-defined phylogenetic clades and either reassortment (H3N2) or recombination (SARS-CoV-2).
For each low-dimensional embedding of sequences, we calculated the correlation between pairwise genetic and Euclidean distances in the embedding and applied a hierarchical clustering method to identify clusters in the embedding.
We measured the accuracy of clusters compared to previously defined phylogenetic clades, reassortment clusters, or recombinant lineages.
We found that MDS maintained the strongest correlation between pairwise genetic and Euclidean distances between sequences and best captured the intermediate placement of recombinant lineages between parental lineages
Clusters from t-SNE most accurately recapitulated known phylogenetic clades and recombinant lineages.
Both MDS and t-SNE accurately identified reassortment groups.
We found that MDS embeddings accurately represented pairwise genetic distances including the intermediate placement of recombinant SARS-CoV-2 lineages between parental lineages.
Clusters from t-SNE embeddings accurately recapitulated known phylogenetic clades, H3N2 reassortment groups, and SARS-CoV-2 recombinant lineages.
We show that simple statistical methods without a biological model can accurately represent known genetic relationships for relevant human pathogenic viruses.
Our open source implementation of these methods for analysis of viral genome sequences can be easily applied when phylogenetic methods are either unnecessary or inappropriate.
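
The distance-preservation measure described in the abstract can be sketched as follows. This is a minimal illustration, not the project's implementation: the helper names (`hamming`, `pearson`, `distance_correlation`), the toy sequences, and the toy embedding coordinates are all hypothetical, and the choice of Hamming distance and Pearson's coefficient is an assumption, since the text above does not name a specific genetic distance or correlation statistic.

```python
from itertools import combinations
from math import dist, sqrt


def hamming(a, b):
    """Count mismatched positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def distance_correlation(sequences, embedding):
    """Correlate pairwise genetic distances with Euclidean distances
    between the same pairs of samples in a low-dimensional embedding."""
    genetic = []
    euclidean = []
    for a, b in combinations(sorted(sequences), 2):
        genetic.append(hamming(sequences[a], sequences[b]))
        euclidean.append(dist(embedding[a], embedding[b]))
    return pearson(genetic, euclidean)


# Toy aligned sequences and a made-up 2D embedding (illustration only).
sequences = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "TCGAACGA",
    "s4": "TCGATCGA",
}
embedding = {
    "s1": (0.0, 0.0),
    "s2": (1.0, 0.0),
    "s3": (3.0, 0.0),
    "s4": (4.0, 0.0),
}

r = distance_correlation(sequences, embedding)
print(f"correlation between genetic and Euclidean distances: {r:.3f}")
```

In this toy case the embedding distances were chosen to match the Hamming distances exactly, so the correlation is at its maximum. Real embeddings would distort some pairs and yield a lower value.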

@@ -73,18 +72,48 @@ Explore the phylogenetic trees and embeddings on Nextstrain.

### Main figures

- [Fig 2. **Phylogeny of early (2016–2018) influenza H3N2 HA sequences plotted by nucleotide substitutions per site on the x-axis (top) and low-dimensional embeddings of the same sequences by PCA (middle left), MDS (middle right), t-SNE (bottom left), and UMAP (bottom right).**](https://blab.github.io/cartography/flu-2016-2018-ha-embeddings-by-clade.html) Tips in the tree and embeddings are colored by their Nextstrain clade assignment. Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods). Line colors represent the clade membership of the most ancestral node in the pair of nodes connected by the segment. Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.
- [Fig 4. **Phylogenetic trees (left) and embeddings (right) of early (2016–2018) influenza H3N2 HA sequences colored by HDBSCAN cluster.**](https://blab.github.io/cartography/flu-2016-2018-ha-embeddings-by-cluster.html) Normalized VI values per embedding reflect the distance between clusters and known genetic groups (Nextstrain clades). Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods). Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.
- [Fig 5. **Phylogenetic trees (left) and embeddings (right) of late (2018–2020) H3N2 HA sequences colored by HDBSCAN cluster.**](https://blab.github.io/cartography/flu-2018-2020-ha-embeddings-by-cluster.html) Normalized VI values per embedding reflect the distance between clusters and known genetic groups (Nextstrain clades). Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods). Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.
- [Fig 6. **Phylogeny of early (2016–2018) influenza H3N2 HA sequences plotted by nucleotide substitutions per site on the x-axis (top) and low-dimensional embeddings of the same HA sequences concatenated with matching NA sequences by PCA (middle left), MDS (middle right), t-SNE (bottom left), and UMAP (bottom right).**](https://blab.github.io/cartography/flu-2016-2018-ha-na-embeddings-by-mcc.html) Tips in the tree and embeddings are colored by their TreeKnit Maximally Compatible Clades (MCCs) label which represents putative HA/NA reassortment groups. The first normalized VI values per embedding reflect the distance between HA/NA clusters and known genetic groups (MCCs). VI values in parentheses reflect the distance between HA-only clusters and known genetic groups. "A2" and "A2/re" labels indicate a known reassortment event ([Potter et al. 2019](https://doi.org/10.1093/ve/vez046)).
- [Fig 7. **Phylogeny of early (2020–2022) SARS-CoV-2 sequences plotted by number of nucleotide substitutions from the most recent common ancestor on the x-axis (top) and low-dimensional embeddings of the same sequences by PCA (middle left), MDS (middle right), t-SNE (bottom left), and UMAP (bottom right).**](https://blab.github.io/cartography/sarscov2-embeddings-by-Nextstrain_clade-clade.html) Tips in the tree and embeddings are colored by their Nextstrain clade assignment. Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods). Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.
- [Fig 9. **Phylogenetic trees (left) and embeddings (right) of early (2020–2022) SARS-CoV-2 sequences colored by HDBSCAN cluster.**](https://blab.github.io/cartography/sarscov2-embeddings-by-cluster-vs-Nextstrain_clade.html) Normalized VI values per embedding reflect the distance between clusters and known genetic groups (Nextstrain clades). Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods). Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.
- [Fig 10. **Phylogenetic trees (left) and embeddings (right) of late (2022–2023) SARS-CoV-2 sequences colored by HDBSCAN cluster.**](https://blab.github.io/cartography/sarscov2-test-embeddings-by-cluster-vs-Nextstrain_clade.html) Normalized VI values per embedding reflect the distance between clusters and known genetic groups (Nextstrain clades).
- [Fig 2. **Phylogeny of early (2016--2018) influenza H3N2 HA sequences plotted by nucleotide substitutions per site on the x-axis (top) and low-dimensional embeddings of the same sequences by PCA (middle left), MDS (middle right), t-SNE (bottom left), and UMAP (bottom right).**](https://blab.github.io/cartography/flu-2016-2018-ha-embeddings-by-clade.html) Tips in the tree and embeddings are colored by their Nextstrain clade assignment.
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line colors represent the clade membership of the most ancestral node in the pair of nodes connected by the segment.
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
Clade labels appear in the tree at the earliest ancestral node of the tree for each clade.
Clade labels appear in each embedding at the average position on the x and y axis for sequences in a given clade.
- [Fig 4. **Phylogenetic trees (left) and embeddings (right) of early (2016--2018) influenza H3N2 HA sequences colored by HDBSCAN cluster.**](https://blab.github.io/cartography/flu-2016-2018-ha-embeddings-by-cluster.html) Normalized VI values per embedding reflect the distance between clusters and known genetic groups (Nextstrain clades).
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
- [Fig 5. **Phylogenetic trees (left) and embeddings (right) of late (2018--2020) H3N2 HA sequences colored by HDBSCAN cluster.**](https://blab.github.io/cartography/flu-2018-2020-ha-embeddings-by-cluster.html) Normalized VI values per embedding reflect the distance between clusters and known genetic groups (Nextstrain clades).
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
- [Fig 6. **Phylogeny of early (2016--2018) influenza H3N2 HA sequences plotted by nucleotide substitutions per site on the x-axis (top) and low-dimensional embeddings of the same HA sequences concatenated with matching NA sequences by PCA (middle left), MDS (middle right), t-SNE (bottom left), and UMAP (bottom right).**](https://blab.github.io/cartography/flu-2016-2018-ha-na-embeddings-by-mcc.html) Tips in the tree and embeddings are colored by their TreeKnit Maximally Compatible Clades (MCCs) label which represents putative HA/NA reassortment groups.
Tips from MCCs with fewer than 10 sequences are colored as ``unassigned''.
The first normalized VI values per embedding reflect the distance between HA/NA clusters and known genetic groups (MCCs).
VI values in parentheses reflect the distance between HA-only clusters and known genetic groups.
MCC labels appear in the tree and each embedding for larger pairs of reassortment events.
MCC 9 represents two Nextstrain clades, so its labels appear twice in the tree.
MCCs 14 and 11 represent a previously published reassortment event within Nextstrain clade A2 ([Potter et al. 2019](https://doi.org/10.1093/ve/vez046)).
Labels for MCC 14 represent the subset of its sequences from clade A2.
- [Fig 7. **Phylogeny of early (2020--2022) SARS-CoV-2 sequences plotted by number of nucleotide substitutions from the most recent common ancestor on the x-axis (top) and low-dimensional embeddings of the same sequences by PCA (middle left), MDS (middle right), t-SNE (bottom left), and UMAP (bottom right).**](https://blab.github.io/cartography/sarscov2-embeddings-by-Nextstrain_clade-clade.html) Tips in the tree and embeddings are colored by their Nextstrain clade assignment.
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
Clade labels in the tree and embeddings highlight larger clades.
- [Fig 9. **Phylogenetic trees (left) and embeddings (right) of early (2020--2022) SARS-CoV-2 sequences colored by HDBSCAN cluster.**](https://blab.github.io/cartography/sarscov2-embeddings-by-cluster-vs-Nextstrain_clade.html) Normalized VI values per embedding reflect the distance between clusters and known genetic groups (Nextstrain clades).
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
- [Fig 10. **Phylogenetic trees (left) and embeddings (right) of late (2022--2023) SARS-CoV-2 sequences colored by HDBSCAN cluster.**](https://blab.github.io/cartography/sarscov2-test-embeddings-by-cluster-vs-Nextstrain_clade.html) Normalized VI values per embedding reflect the distance between clusters and known genetic groups (Nextstrain clades).
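
Several captions above report normalized VI (variation of information) values as the distance between HDBSCAN clusters and known genetic groups. The sketch below shows one way to compute such a value. It assumes VI is normalized by dividing by the natural log of the number of samples, which bounds it between 0 (identical groupings) and 1. The README does not state which normalization the authors use, and the function name `normalized_vi` and the toy labels are hypothetical.

```python
from collections import Counter
from math import log


def normalized_vi(labels_a, labels_b):
    """Variation of information between two groupings of the same samples,
    divided by log(n) so the result falls between 0 and 1.

    VI = H(A|B) + H(B|A); 0 means the groupings match up to relabeling.
    The log(n) normalization is an assumption of this sketch.
    """
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("both label lists must describe the same samples")
    if n < 2:
        return 0.0

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi / log(n)


# Toy labels: cluster ids from an embedding vs. clade names for four tips.
clusters = ["c1", "c1", "c2", "c2"]
clades = ["A", "A", "B", "B"]
vi_same = normalized_vi(clusters, clades)

# Every sample in its own cluster vs. a single clade for all samples.
vi_far = normalized_vi([1, 2, 3, 4], ["A", "A", "A", "A"])

print(f"matching groupings: {vi_same:.3f}, maximally split: {vi_far:.3f}")
```

Lower values indicate clusters that agree more closely with the reference groups, which is how the captions use the measure.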

### Supplemental figures

- [S4 Fig. **MDS embeddings for early (2016–2018) influenza H3N2 HA sequences showing all three components.**](https://blab.github.io/cartography/flu-2016-2018-mds-by-clade.html) Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods). Line colors represent the clade membership of the most ancestral node in the pair of nodes connected by the segment. Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.
- [S6 Fig. **Phylogeny of late (2018–2020) influenza H3N2 HA sequences plotted by nucleotide substitutions per site on the x-axis (top) and low-dimensional embeddings of the same sequences by PCA (middle left), MDS (middle right), t-SNE (bottom left), and UMAP (bottom right).**](https://blab.github.io/cartography/flu-2018-2020-ha-embeddings-by-clade.html) Tips in the tree and embeddings are colored by their Nextstrain clade assignment. Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods). Line colors represent the clade membership of the most ancestral node in the pair of nodes connected by the segment. Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.
- [S4 Fig. **MDS embeddings for early (2016--2018) influenza H3N2 HA sequences showing all three components.**](https://blab.github.io/cartography/flu-2016-2018-mds-by-clade.html) Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line colors represent the clade membership of the most ancestral node in the pair of nodes connected by the segment.
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
Clade labels appear in the tree at the earliest ancestral node of the tree for each clade.
Clade labels appear in each embedding at the average position on the x and y axis for sequences in a given clade.
- [S7 Fig. **Phylogeny of late (2018--2020) influenza H3N2 HA sequences plotted by nucleotide substitutions per site on the x-axis (top) and low-dimensional embeddings of the same sequences by PCA (middle left), MDS (middle right), t-SNE (bottom left), and UMAP (bottom right).**](https://blab.github.io/cartography/flu-2018-2020-ha-embeddings-by-clade.html) Tips in the tree and embeddings are colored by their Nextstrain clade assignment.
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line colors represent the clade membership of the most ancestral node in the pair of nodes connected by the segment.
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
Clade labels appear in the tree at the earliest ancestral node of the tree for each clade.
Clade labels appear in each embedding at the average position on the x and y axis for sequences in a given clade.
- [S7 Fig. **MDS embeddings for late (2018–2020) influenza H3N2 HA sequences showing all three components.**](https://blab.github.io/cartography/flu-2018-2020-mds-by-clade.html) Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods). Line colors represent the clade membership of the most ancestral node in the pair of nodes connected by the segment. Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.
- [S9 Fig. **Embeddings influenza H3N2 HA-only (left) and combined HA/NA (right) showing the effects of additional NA genetic information on the
placement of reassortment events detected by TreeKnit (MCCs).**](https://blab.github.io/cartography/flu-2016-2018-ha-na-all-embeddings-by-mcc.html)
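
The captions above describe how tree structure is drawn into each embedding: internal nodes sit at the mean position of their immediate descendants, and line thickness scales with the square root of the number of descending leaves. A minimal sketch of that layout rule follows. The tree representation (a dict of child lists), the function name `place_internal_nodes`, and the toy coordinates are assumptions made for illustration, not the project's code.

```python
from math import sqrt


def place_internal_nodes(children, tip_positions, root):
    """Return (positions, leaf_counts) for every node in the tree.

    Each internal node is placed at the mean of its immediate children's
    positions in every embedding dimension. Leaf counts give the number
    of tips descending from each node.
    """
    positions = dict(tip_positions)
    leaf_counts = {}

    def visit(node):
        kids = children.get(node, [])
        if not kids:
            leaf_counts[node] = 1
            return
        # Post-order traversal: children must be placed before their parent.
        for kid in kids:
            visit(kid)
        dims = len(positions[kids[0]])
        positions[node] = tuple(
            sum(positions[kid][d] for kid in kids) / len(kids)
            for d in range(dims)
        )
        leaf_counts[node] = sum(leaf_counts[kid] for kid in kids)

    visit(root)
    return positions, leaf_counts


# Toy tree: root -> (n1, t3), n1 -> (t1, t2), with made-up 2D tip positions.
children = {"root": ["n1", "t3"], "n1": ["t1", "t2"]}
tip_positions = {"t1": (0.0, 0.0), "t2": (2.0, 0.0), "t3": (4.0, 3.0)}

positions, leaf_counts = place_internal_nodes(children, tip_positions, "root")

# Line thickness for the segment leading to each node.
thickness = {node: sqrt(count) for node, count in leaf_counts.items()}
print(positions["n1"], positions["root"], thickness["root"])
```

Averaging over immediate descendants rather than over all descending tips means the root here lands at (2.5, 1.5), not at the centroid of the three tips.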
