Skip to content

Commit

Permalink
Merge pull request #123 from blab/cluster-within-sc2-clade
Browse files Browse the repository at this point in the history
Evaluate within-clade t-SNE cluster accuracy for SARS-CoV-2
  • Loading branch information
huddlej authored Aug 14, 2024
2 parents 5588f10 + e4b9561 commit 04b9eab
Show file tree
Hide file tree
Showing 14 changed files with 2,420 additions and 3 deletions.
3 changes: 3 additions & 0 deletions manuscript/Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,7 @@ rule concat_HDBSCAN_tables:
"sars-cov-2-nextstrain/results/optimal_cluster_accuracy_and_parameters_for_Nextclade_pango_collapsed.csv",
"sars-cov-2-nextstrain-2022-2023/results/full_HDBSCAN_metadata_for_Nextstrain_clade.csv",
"sars-cov-2-nextstrain-2022-2023/results/full_HDBSCAN_metadata_for_Nextclade_pango_collapsed.csv",
"sars-cov-2-nextstrain/results/single_clade/cluster_accuracy_t-sne_for_Nextclade_pango_collapsed.csv",
]
output:
fullKDE = "manuscript/tables/total_HDBSCAN_table.csv",
Expand All @@ -83,6 +84,7 @@ rule concat_HDBSCAN_tables:
"early SARS-CoV-2 (2020-2022)",
"late SARS-CoV-2 (2022-2023)",
"late SARS-CoV-2 (2022-2023)",
"SARS-CoV-2 21J (Delta) only",
],
genetic_group_types = [
"Nextstrain clade",
Expand All @@ -93,6 +95,7 @@ rule concat_HDBSCAN_tables:
"Pango",
"Nextstrain clade",
"Pango",
"Pango",
],
conda: "../cartography.yml"
shell:
Expand Down
13 changes: 12 additions & 1 deletion manuscript/cartography.tex
Original file line number Diff line number Diff line change
Expand Up @@ -436,6 +436,16 @@ \subsection{SARS-CoV-2 clusters recapitulate broad genetic groups corresponding
With the exception of t-SNE's performance, these results replicate the patterns we observed with early SARS-CoV-2 data where clusters from embeddings more effectively represented broader genetic diversity than the finer resolution diversity denoted by Pango lineages.
Unlike the Pango lineages in the early SARS-CoV-2 data, the lineages from the later data exhibited fewer pairwise genetic distances between samples in each lineage than samples in Nextstrain clades or any embedding cluster (Supplementary Fig.~S\ref{S_Fig_sarscov2_within_between_group_distances}).

To understand whether t-SNE clusters could capture Pango-resolution genetic groups within a single Nextstrain clade, we evenly sampled approximately 2,000 sequences from a dominant Nextstrain clade with many Pango lineages, 21J (Delta), and identified clusters from a t-SNE embedding of those data.
Within the 1,992 sequences sampled from 21J (Delta), we found 38 Pango lineages after collapsing lineages with fewer than 10 sequences into their parent lineages.
We found 28 t-SNE clusters representing 1,806 sequences (91\%) with 186 sequences (9\%) assigned to the unclustered ``-1'' label (Supplementary Fig.~\ref{S_Fig_sarscov2_single_clade_embeddings_tsne_counts}).
The VI distance between Pango lineages and all clusters (including the unclustered group) was 0.17 (Supplementary Table~\ref{S_Table_optimal_cluster_parameters}).
This distance was consistent with the distance of 0.14 between Pango lineages and t-SNE clusters from both the full early and late SARS-CoV-2 datasets.
The VI distance between Pango lineages and clusters without the unclustered sequences was 0.13, confirming that one quarter of the distance between t-SNE clusters and Pango lineages above came from unclustered sequences.
Of the 38 Pango lineages with a t-SNE cluster, 30 lineages (79\%) had a single corresponding t-SNE cluster, seven lineages (18\%) had two or three t-SNE clusters, and one lineage (B.1.617.2) had five t-SNE clusters (Supplementary Fig.~\ref{S_Fig_sarscov2_single_clade_embeddings_tsne_counts}).
Of the 28 t-SNE clusters, 21 clusters (75\%) had a single corresponding Pango lineage, six (21\%) mapped to two or three Pango lineages, and one (cluster 27) mapped to 18 Pango lineages with most sequences from B.1.617.2 and AY.4.
These results suggest that clusters from t-SNE embeddings can capture more Pango-resolution genetic groups by analyzing sequences within a specific Nextstrain clade.

\subsection{Distance-based embeddings reflect SARS-CoV-2 recombination events}

Finally, we tested the ability of sequence embeddings to place known recombinant lineages of SARS-CoV-2 between their parental lineages in Euclidean space.
Expand Down Expand Up @@ -482,7 +492,8 @@ \subsection{Limitations of methods and analysis}

Despite the promise of these simple methods to answer important public health questions about human pathogenic viruses, these methods and our analyses suffer from inherent limitations.
The lack of an underlying biological model is both a strength and the clearest limitation of the dimensionality reduction methods we considered here.
For example, embeddings of SARS-CoV-2 genomes cannot capture the same fine-grained genetic resolution as Pango lineage annotations.
For example, embeddings of SARS-CoV-2 genomes that represent the complete circulating diversity at a given time cannot capture the same fine-grained genetic resolution as Pango lineage annotations.
Only t-SNE clusters of SARS-CoV-2 genomes within a single Nextstrain clade get close to defining Pango-resolution genetic groups.
Each method provides only a few parameters to tune its embeddings and these parameters have little effect on the qualitative outcome.
In maintaining a linear relationship between Euclidean and genetic distances, MDS sacrifices the ability to form more accurate genetic clusters for viruses with large genomes like SARS-CoV-2.
Neither t-SNE nor UMAP maintain a linear relationship between pairwise Euclidean and genetic distances across the observed range of genetic distances.
Expand Down
7 changes: 7 additions & 0 deletions manuscript/cartography_supplement.tex
Original file line number Diff line number Diff line change
Expand Up @@ -209,6 +209,13 @@ \section*{Supplementary data}
\caption{{\bf Number of late SARS-CoV-2 strains per combination of recombinant Pango lineage and t-SNE cluster where the count was at least 10 strains.}}\label{S_Fig_sarscov2_late_embeddings_tsne_recombinant_counts}
\end{figure}

\begin{figure}[!h]
\includegraphics[width=\columnwidth]{figures/sarscov2-single-clade-lineages-and-clusters-counts.png}
\caption{{\bf Number of SARS-CoV-2 sequences from Nextstrain clade 21J (Delta) per combination of Pango lineage and t-SNE cluster showing the degree to which clusters of sequences from a single Nextstrain clade can capture Pango-resolution genetic groups.}
The size of each circle represents the number of sequences associated with each pair of t-SNE cluster and Pango lineage.
The cluster label ``-1'' represents sequences that could not be assigned to a cluster.}\label{S_Fig_sarscov2_single_clade_embeddings_tsne_counts}
\end{figure}

\begin{figure}[!h]
\includegraphics[width=0.75\columnwidth]{figures/sarscov2-test-mds-reassortment-XE-BA1-BA2.png}\\
\includegraphics[width=0.75\columnwidth]{figures/sarscov2-test-mds-reassortment-XAS-BA2-BA5.png}
Expand Down
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
1 change: 1 addition & 0 deletions manuscript/tables/accuracy_table.tex
Original file line number Diff line number Diff line change
Expand Up @@ -42,5 +42,6 @@
& & MDS & 18 & 0.37 & \\
& & genetic & 15 & 0.43 & \\
& & PCA & 7 & 0.46 & \\
SARS-CoV-2 21J (Delta) only & Pango & t-SNE & 28 & 0.17 & \\
\botrule
\end{tabular}
1 change: 1 addition & 0 deletions manuscript/tables/total_HDBSCAN_table.csv
Original file line number Diff line number Diff line change
Expand Up @@ -39,3 +39,4 @@
37,MDS,0.37,18,,,
38,genetic,0.43,15,,,
39,PCA,0.46,7,,,
40,t-SNE,0.17,28,,SARS-CoV-2 21J (Delta) only,Pango
5 changes: 3 additions & 2 deletions notebooks/scripts/concat_KDE_tables.py
Original file line number Diff line number Diff line change
Expand Up @@ -42,8 +42,9 @@
df = df.loc[:, [column for column in columns_to_use if column != "distance_threshold"]].copy()
df["distance_threshold"] = ""

df["Pathogen Dataset"] = [pathogen] + [""] * (len(embedding_name_by_abbreviation) - 1)
df["Genetic Group Type"] = [genetic_group_type] + [""] * (len(embedding_name_by_abbreviation) - 1)
number_of_methods = df["method"].drop_duplicates().shape[0]
df["Pathogen Dataset"] = [pathogen] + [""] * (number_of_methods - 1)
df["Genetic Group Type"] = [genetic_group_type] + [""] * (number_of_methods - 1)
tables.append(df)

df = pd.concat(tables, ignore_index=True)
Expand Down
245 changes: 245 additions & 0 deletions sars-cov-2-nextstrain/Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,11 @@ rule sarscov2:
"auspice/cartography_ncov-2020-2022.json",
expand("sars-cov-2-nextstrain/results/mutation_table_for_{clade_membership}.csv", clade_membership=SARS_COV_2_CLADE_MEMBERSHIPS),
"sars-cov-2-nextstrain/results/monophyletic_clusters.csv",
"sars-cov-2-nextstrain/results/single_clade/cluster_accuracy_t-sne_for_Nextclade_pango_collapsed.csv",
"sars-cov-2-nextstrain/results/single_clade/cluster_accuracy_without_unclustered_t-sne_for_Nextclade_pango_collapsed.csv",
"sars-cov-2-nextstrain/results/single_clade/lineages_and_clusters_counts.tsv",
"sars-cov-2-nextstrain/results/single_clade/lineages_and_clusters_counts_without_unclustered.tsv",
"manuscript/figures/sarscov2-single-clade-lineages-and-clusters-counts.png",

rule sarscov2_download_filtered_metadata:
output:
Expand Down Expand Up @@ -960,3 +965,243 @@ rule sarscov2_get_authors_for_accessions:
| csvtk rename -l -t -f "Isolate Lineage" -n "Virus Name" \
| csvtk mutate2 -l -t -n "dataset" -e "'sars-cov-2-2020-2022'" > {output.authors}
"""

rule sarscov2_subsample_single_clade:
input:
metadata="sars-cov-2-nextstrain/data/filtered_metadata.tsv.zst",
outliers="sars-cov-2-nextstrain/config/exclude.txt",
output:
strains="sars-cov-2-nextstrain/data/single_clade/strains.txt",
params:
random_seed=RANDOM_SEED,
single_clade="21J (Delta)",
conda: "../cartography.yml"
shell:
"""
augur filter \
--metadata {input.metadata} \
--exclude {input.outliers} \
--min-date 2020-01-01 \
--max-date 2022-01-01 \
--query "Nextstrain_clade == '{params.single_clade}'" \
--group-by region year month \
--subsample-max-sequences 2000 \
--subsample-seed {params.random_seed} \
--output-strains {output.strains}
"""

rule sarscov2_extract_subsampled_data_for_single_clade:
input:
metadata="sars-cov-2-nextstrain/data/filtered_metadata.tsv.zst",
alignment="sars-cov-2-nextstrain/data/all_aligned.fasta.zst",
strains="sars-cov-2-nextstrain/data/single_clade/strains.txt",
output:
metadata="sars-cov-2-nextstrain/data/single_clade/metadata.tsv",
alignment="sars-cov-2-nextstrain/data/single_clade/aligned.fasta",
conda: "../cartography.yml"
shell:
"""
augur filter \
--metadata {input.metadata} \
--sequences {input.alignment} \
--exclude-all \
--include {input.strains} \
--output-metadata {output.metadata} \
--output-sequences {output.alignment}
"""

rule sarscov2_collapse_pango_lineages_for_single_clade:
input:
metadata="sars-cov-2-nextstrain/data/single_clade/metadata.tsv",
output:
metadata="sars-cov-2-nextstrain/data/single_clade/metadata_Pango_collapsed.tsv",
conda: "../cartography.yml"
log:
"logs/sars-cov-2-nextstrain/collapse_pango_lineages_for_single_clade.txt",
params:
min_samples=10,
pango_column="Nextclade_pango",
new_column="Nextclade_pango_collapsed",
shell:
"""
python3 notebooks/scripts/collapse_pango_lineages.py \
--metadata {input.metadata} \
--min-samples {params.min_samples} \
--pango-column {params.pango_column} \
--new-column {params.new_column} \
--output {output.metadata} 2>&1 | tee {log}
"""

rule sarscov2_create_distance_matrix_for_single_clade:
input:
alignment="sars-cov-2-nextstrain/data/single_clade/aligned.fasta",
output:
output="sars-cov-2-nextstrain/results/single_clade/distance_matrix.csv",
conda: "../cartography.yml"
shell:
"""
pathogen-distance \
--alignment {input.alignment} \
--indel-distance \
--output {output.output}
"""

rule sarscov2_embed_tsne_for_single_clade:
input:
alignment="sars-cov-2-nextstrain/data/single_clade/aligned.fasta",
distance_matrix="sars-cov-2-nextstrain/results/single_clade/distance_matrix.csv",
parameters="simulations/coronavirus-like/moderate-recombination-rate/t-sne_parameters.csv",
output:
dataframe="sars-cov-2-nextstrain/results/single_clade/embed_t-sne.csv",
params:
random_seed=RANDOM_SEED,
conda: "../cartography.yml"
shell:
"""
pathogen-embed \
--alignment {input.alignment} \
--distance-matrix {input.distance_matrix} \
--embedding-parameters {input.parameters} \
--random-seed {params.random_seed} \
--output-dataframe {output.dataframe} \
t-sne \
--pca-encoding simplex
"""

rule sarscov2_cluster_for_single_clade:
input:
embedding="sars-cov-2-nextstrain/results/single_clade/embed_t-sne.csv",
parameters="sars-cov-2-nextstrain/results/optimal_cluster_accuracy_and_parameters_for_Nextclade_pango_collapsed.csv",
output:
clustered_embedding="sars-cov-2-nextstrain/results/single_clade/cluster_embed_t-sne_for_Nextclade_pango_collapsed.csv",
params:
min_size=CLUSTER_MIN_SIZE,
min_samples=CLUSTER_MIN_SAMPLES,
conda: "../cartography.yml"
shell:
"""
pathogen-cluster \
--embedding {input.embedding} \
--min-size {params.min_size} \
--min-samples {params.min_samples} \
--distance-threshold "$(csvtk filter2 -f '$method=="t-sne"' {input.parameters} | csvtk cut -f distance_threshold | csvtk del-header)" \
--label-attribute "t-sne_label_for_Nextclade_pango_collapsed" \
--output-dataframe {output.clustered_embedding}
"""

rule sarscov2_cluster_accuracy_for_single_clade:
input:
metadata="sars-cov-2-nextstrain/data/single_clade/metadata_Pango_collapsed.tsv",
clusters="sars-cov-2-nextstrain/results/single_clade/cluster_embed_t-sne_for_Nextclade_pango_collapsed.csv",
output:
dataframe="sars-cov-2-nextstrain/results/single_clade/cluster_accuracy_t-sne_for_Nextclade_pango_collapsed.csv",
params:
ignored_clusters="unassigned",
conda: "../cartography.yml"
shell:
"""
python3 notebooks/scripts/metadata_HDBSCAN.py \
--method "t-sne" \
--true-clusters {input.metadata} \
--true-clusters-column "Nextclade_pango_collapsed" \
--predicted-clusters {input.clusters} \
--predicted-clusters-column "t-sne_label_for_Nextclade_pango_collapsed" \
--ignored-clusters {params.ignored_clusters:q} \
--output {output.dataframe}
"""

rule sarscov2_cluster_accuracy_without_unclustered_for_single_clade:
input:
metadata="sars-cov-2-nextstrain/data/single_clade/metadata_Pango_collapsed.tsv",
clusters="sars-cov-2-nextstrain/results/single_clade/cluster_embed_t-sne_for_Nextclade_pango_collapsed.csv",
output:
dataframe="sars-cov-2-nextstrain/results/single_clade/cluster_accuracy_without_unclustered_t-sne_for_Nextclade_pango_collapsed.csv",
params:
ignored_clusters=["unassigned", "-1"],
conda: "../cartography.yml"
shell:
"""
python3 notebooks/scripts/metadata_HDBSCAN.py \
--method "t-sne" \
--true-clusters {input.metadata} \
--true-clusters-column "Nextclade_pango_collapsed" \
--predicted-clusters {input.clusters} \
--predicted-clusters-column "t-sne_label_for_Nextclade_pango_collapsed" \
--ignored-clusters {params.ignored_clusters:q} \
--output {output.dataframe}
"""

rule sarscov2_select_lineages_for_single_clade:
input:
metadata="sars-cov-2-nextstrain/data/single_clade/metadata_Pango_collapsed.tsv",
output:
lineages="sars-cov-2-nextstrain/results/single_clade/lineages.tsv",
conda: "../cartography.yml"
shell:
"""
tsv-select -H -f strain,Nextclade_pango_collapsed {input.metadata} > {output.lineages}
"""

rule sarscov2_select_clusters_for_single_clade:
input:
clusters="sars-cov-2-nextstrain/results/single_clade/cluster_embed_t-sne_for_Nextclade_pango_collapsed.csv",
output:
clusters="sars-cov-2-nextstrain/results/single_clade/clusters.tsv",
conda: "../cartography.yml"
shell:
"""
tsv-select -d "," -H -f 'strain','t\-sne_label_for_Nextclade_pango_collapsed' {input.clusters} \
| csv2tsv > {output.clusters}
"""

rule sarscov2_merge_lineages_and_clusters_for_single_clade:
input:
lineages="sars-cov-2-nextstrain/results/single_clade/lineages.tsv",
clusters="sars-cov-2-nextstrain/results/single_clade/clusters.tsv",
output:
table="sars-cov-2-nextstrain/results/single_clade/lineages_and_clusters.tsv",
conda: "../cartography.yml"
shell:
"""
tsv-join -H -f {input.lineages} -k strain -a Nextclade_pango_collapsed {input.clusters} > {output.table}
"""

rule sarscov2_summarize_lineages_and_clusters_for_single_clade:
input:
table="sars-cov-2-nextstrain/results/single_clade/lineages_and_clusters.tsv",
output:
counts="sars-cov-2-nextstrain/results/single_clade/lineages_and_clusters_counts.tsv",
conda: "../cartography.yml"
shell:
"""
tsv-summarize -H -g 'Nextclade_pango_collapsed','t\-sne_label_for_Nextclade_pango_collapsed' --count {input.table} > {output.counts}
"""

rule sarscov2_filter_lineages_and_clusters_for_single_clade:
input:
counts="sars-cov-2-nextstrain/results/single_clade/lineages_and_clusters_counts.tsv",
output:
counts="sars-cov-2-nextstrain/results/single_clade/lineages_and_clusters_counts_without_unclustered.tsv",
conda: "../cartography.yml"
shell:
"""
tsv-filter -H --str-ne "t\-sne_label_for_Nextclade_pango_collapsed:-1" {input.counts} > {output.counts}
"""

rule sarscov2_plot_lineages_and_clusters_for_single_clade:
input:
counts="sars-cov-2-nextstrain/results/single_clade/lineages_and_clusters_counts.tsv",
output:
counts_plot="manuscript/figures/sarscov2-single-clade-lineages-and-clusters-counts.png",
conda: "../cartography.yml"
params:
lineage_column="Nextclade_pango_collapsed",
cluster_column="t-sne_label_for_Nextclade_pango_collapsed",
shell:
"""
python3 sars-cov-2-nextstrain/scripts/plot-single-clade-lineage-and-cluster-counts.py \
--counts {input.counts} \
--lineage-column {params.lineage_column:q} \
--cluster-column {params.cluster_column:q} \
--output {output.counts_plot}
"""
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
method,predicted_clusters_column,normalized_vi,n_predicted_clusters,n_predicted_cluster_samples,n_ignored_predicted_clusters,n_true_cluster_samples,n_ignored_true_clusters,n_vi_cluster_samples
t-sne,t-sne_label_for_Nextclade_pango_collapsed,0.17,28,1992,0,1992,0,1992
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
method,predicted_clusters_column,normalized_vi,n_predicted_clusters,n_predicted_cluster_samples,n_ignored_predicted_clusters,n_true_cluster_samples,n_ignored_true_clusters,n_vi_cluster_samples
t-sne,t-sne_label_for_Nextclade_pango_collapsed,0.13,28,1806,186,1992,0,1806
Loading

0 comments on commit 04b9eab

Please sign in to comment.