As we work on expanding and updating the P2RA preprint for publication, I’m working on running the relevant datasets through my pipeline to compare the results to those from the original Kraken-based approach. I began by processing Yang et al. (2020), a study that carried out RNA viral metagenomics on raw sewage samples from Xinjiang, China.
+
The study collected 7 1L grab samples of raw sewage from the inlets of three WWTPs in Xinjiang, and processed them using an “anion filter membrane absorption” method quite distinct from the other studies I’ve looked at so far:
+
+
1. Initially, samples were centrifuged to pellet cells and debris.
2. The supernatant was treated with magnesium chloride and HCl, then filtered through a mixed cellulose ester membrane filter, discarding the filtrate.
3. Material was then eluted from the filters by ultrasonication with 3% beef extract, before undergoing a second filtration step using a 0.22 µm filter, this time retaining the filtrate.
4. Next, the samples underwent RNA extraction using the QIAamp Viral RNA Mini Kit.
5. Finally, sequencing libraries were generated using the KAPA Hyper Prep Kit. No mention is made of rRNA depletion or panel enrichment.
+
+
This study is of particular interest because previous analysis found that the MGS data it generated contains a particularly high fraction of both total viruses and human-infecting viruses. It would be good to know whether this is attributable to the fact that the authors used an unusual method, or if it’s the result of genuine differences on the ground (e.g. an elevated prevalence of enteric disease compared to the USA).
+
The raw data
+
The Yang data is unusual in that it comprises only a small number of samples (7 samples from 3 different locations), but each sample was sequenced quite deeply: the number of reads per sample ranged from 162M to 283M, with an average of 203M. Taken together, the samples totaled roughly 1.4B read pairs (425 gigabases of sequence).
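As a rough consistency check: 7 samples at an average of ~203M read pairs gives \(7 \times 2.03 \times 10^8 \approx 1.4 \times 10^9\) read pairs; assuming standard 2 × 150bp Illumina reads (an assumption on my part, though consistent with these totals), that corresponds to \(1.4 \times 10^9 \times 300 \approx 4.3 \times 10^{11}\) bases, i.e. roughly the 425 gigabases reported above.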
+
Read qualities were consistently very high, but adapter levels were also elevated. Sequence duplication levels were very high according to FASTQC, with a minimum of 89% across all samples, suggesting a high degree of oversequencing; this is perhaps unsurprising given the sheer depth of these samples.
On average, cleaning & deduplication together removed about 86% of total reads from each sample, with the largest losses seen during deduplication. Only a few reads (low single-digit percentages at most) were lost during subsequent ribodepletion – a surprising finding given the apparent lack of rRNA depletion in the study methods. This makes me suspect that the methods did include rRNA depletion without mentioning it; if this isn’t the case, it suggests that the methods used by the authors might be particularly effective at removing bacteria from the sample.
Data cleaning with FASTP was very successful at removing adapters, with very few adapter sequences found by FASTQC at any stage after the raw data. Improvements in quality were unsurprisingly minimal, given the high initial quality observed.
According to FASTQC, deduplication was only moderately effective at reducing measured duplicate levels, despite the high read losses observed. After deduplication, FASTQC-measured duplicate levels fell from an average of 92% to an average of 58%:
As before, to assess the high-level composition of the reads, I ran the ribodepleted files through Kraken (using the Standard 16 database) and summarized the results with Bracken. Combining these results with the read counts above gives us a breakdown of the inferred composition of the samples:
Overall, in these unenriched samples, between 89% and 95% of reads (mean 93%) were low-quality, duplicates, or unassigned. Of the remainder, 0.4-3.9% (mean 1.6%) were identified as ribosomal, and 0.06-0.22% (mean 0.13%) were otherwise classified as bacterial. The fraction of human reads ranged from 0.006% to 0.198% (mean 0.049%). Strikingly, the fraction of viral reads ranged from 0.6% all the way up to 10.2% (mean 5.5%).
+
The total fraction of viral reads is strikingly high, even higher on average than for Rothman unenriched samples. As in Rothman, this is primarily accounted for by very high levels of tobamoviruses (viral family Virgaviridae), which make up 74-98% (mean 84%) of viral reads. Most of the remainder is accounted for by Tombusviridae and Solemoviridae, two other families of plant viruses. Only one family of vertebrate viruses, Astroviridae, accounted for more than 1% of viral reads in any sample.
Almost as striking as the high fraction of viral reads in these data is the extremely low prevalence of bacterial reads. For comparison, the fraction of non-ribosomal bacterial reads ranged from 16-20% in unenriched Crits-Christoph samples and from 2-12% in unenriched Rothman samples. Yang et al. are thus somehow achieving a greater-than-10-fold reduction in the fraction of bacterial reads compared to other studies. Unlike the increased prevalence of viruses, this seems harder to explain away via differences in conditions on the ground. Together with the elevated viral fraction, these results suggest to me that Yang’s idiosyncratic methods are worth investigating further.
+
Human-infecting virus reads: validation
+
Next, I investigated the human-infecting virus read content of these unenriched samples, using a pipeline identical to that described in my entry on unenriched Rothman samples. This process identified a total of 532,583 read pairs across all samples (0.3% of surviving reads):
This is a far larger number of putative viral reads than I previously needed to analyze, and is far too many to realistically process with BLASTN. To address this problem, I randomly selected 1% of putative viral sequences (5,326 read pairs) and ran them through BLASTN in a similar manner to past datasets:
High-scoring true-positives massively outnumber false-positives, while low-scoring true-positives are fairly rare. My usual score cutoff, a disjunctive threshold at S=20, achieves a sensitivity of ~97% and an F1 score of ~98%. As such, I think this looks pretty good, and feel comfortable moving on to analyzing the entire HV dataset.
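(For reference: here sensitivity is \(TP/(TP+FN)\), precision is \(TP/(TP+FP)\), and the F1 score is their harmonic mean, \(2 \cdot \mathrm{precision} \cdot \mathrm{sensitivity} / (\mathrm{precision} + \mathrm{sensitivity})\), all computed over the BLAST-validated subsample.)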
Applying a disjunctive cutoff at S=20 identifies 513,259 read pairs as human-viral out of 1.42B total read pairs, for a relative HV abundance of \(3.62 \times 10^{-4}\). This compares to \(2.0 \times 10^{-4}\) on the public dashboard, corresponding to the results for Kraken-only identification: an increase of roughly 80%, smaller than the 4-5x increases seen for Crits-Christoph and Rothman. Relative HV abundances for individual samples ranged from \(6.36 \times 10^{-5}\) to \(1.19 \times 10^{-3}\):
+
Digging into individual viruses, we see that fecal-oral viruses predominate, especially Mamastrovirus, Rotavirus, Salivirus, and fecal-oral Enterovirus species (especially Enterovirus C, which includes poliovirus):
+
These results are consistent with what’s been observed before, e.g. in the P2RA preprint and public dashboard.
+
Conclusion
+
I analyzed Yang et al. for two reasons: first, because it’s included in the P2RA dataset that we’re currently reanalyzing, and second, because it has the highest relative abundance of both total and human-infecting viruses of any wastewater study we’ve looked at so far. Having done this analysis, the key question to ask is: is this elevated viral abundance a consequence of the unusual sample-processing methods employed by the authors, or the region in which samples were collected? Both are plausible: the processing methods used in this paper are distinct from any other paper we’ve looked at, which makes it possible that they are significantly superior; on the other hand, it’s also plausible that the methods are comparable in efficacy to more standard approaches, and the difference arises from a genuinely elevated level of viruses in Xinjiang compared to e.g. California (for Rothman and Crits-Christoph).
+
The analyses conducted here aren’t able to give a dispositive answer to that question, but they are suggestive. The fact that human-infecting viruses are strongly dominated by fecal-oral species is consistent with the elevated-infection theory, as these are the kinds of viruses I’d most expect to be elevated in a poorer region of the world compared to a rich one. On the other hand, the fact that total viruses (including plant viruses) are also elevated points in the other direction; I don’t see any obvious reason why we’d expect more tobamoviruses in Xinjiang than California.
+
Most compelling, from my perspective, is the very low fraction of bacterial and ribosomal reads found in the Yang data compared to other datasets I’ve analyzed. The methods make no mention of rRNA depletion prior to sequencing; even if this was conducted without being mentioned, that still doesn’t explain the very low level of non-ribosomal bacterial reads. In general, I’d expect a region with elevated enteric viral pathogens to also show elevated enteric bacterial pathogens, so this absence is difficult to explain with a catchment-based theory. Instead, it points towards the viral enrichment protocols used by this study as potentially achieving genuinely better results than other methods we’ve tried.
+
This certainly isn’t conclusive evidence, or close to it. However, based on these results, I’d be very interested in learning more about the details of Yang et al.’s approach to viral enrichment, and in seeing it tried by another group in a different context to find out whether these impressive results can be replicated.
+
+
+
+
+
Source Code
+
---
+title: "Workflow analysis of Yang et al. (2020)"
+subtitle: "Wastewater from Xinjiang."
+author: "Will Bradshaw"
+date: 2024-03-16
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ code-link: true
+ df-print: paged
+editor: visual
+title-block-banner: black
+---
+
+```{r}
+#| label: load-packages
+#| include: false
+library(tidyverse)
+library(cowplot)
+library(patchwork)
+library(fastqcr)
+library(RColorBrewer)
+source("../scripts/aux_plot-theme.R")
+theme_base <- theme_base +theme(aspect.ratio =NULL)
+theme_kit <- theme_base +theme(
+  axis.text.x =element_text(hjust =1, angle =45),
+  axis.title.x =element_blank()
+)
+tnl <-theme(legend.position ="none")
+```
+
+As we work on expanding and updating the [P2RA preprint](https://doi.org/10.1101/2023.12.22.23300450) for publication, I'm working on running the relevant datasets through my pipeline to compare the results to those from the original Kraken-based approach. I began by processing [Yang et al. (2020)](https://doi.org/10.1016/j.scitotenv.2020.142322), a study that carried out RNA viral metagenomics on raw sewage samples from Xinjiang, China.
+
+The study collected 7 1L grab samples of raw sewage from the inlets of three WWTPs in Xinjiang, and processed them using an "anion filter membrane absorption" method quite distinct from the [other](https://data.securebio.org/wills-public-notebook/notebooks/2024-02-04_crits-christoph-1.html) [studies](https://data.securebio.org/wills-public-notebook/notebooks/2024-02-27_rothman-1.html) I've looked at so far:
+
+1. Initially, samples were centrifuged to pellet cells and debris.
+2. The supernatant was treated with magnesium chloride and HCl, then filtered through a mixed cellulose ester membrane filter, **discarding the filtrate**.
+3. Material was then eluted from the filters by ultrasonication with 3% beef extract, before undergoing a second filtration step using a 0.22 µm filter, this time retaining the filtrate.
+4. Next, the samples underwent RNA extraction using the QIAamp Viral RNA Mini Kit.
+5. Finally, sequencing libraries were generated using the KAPA Hyper Prep Kit. No mention is made of rRNA depletion or panel enrichment.
+
+This study is of particular interest because previous analysis found that the MGS data it generated contains a particularly high fraction of both total viruses and human-infecting viruses. It would be good to know whether this is attributable to the fact that the authors used an unusual method, or if it's the result of genuine differences on the ground (e.g. an elevated prevalence of enteric disease compared to the USA).
+
+# The raw data
+
+The Yang data is unusual in that it comprises only a small number of samples (7 samples from 3 different locations), but each sample was sequenced quite deeply: the number of reads per sample ranged from 162M to 283M, with an average of 203M. Taken together, the samples totaled roughly 1.4B read pairs (425 gigabases of sequence).
+
+Read qualities were consistently very high, but adapter levels were also elevated. Sequence duplication levels were very high according to FASTQC, with a minimum of 89% across all samples, suggesting a high degree of oversequencing; this is perhaps unsurprising given the sheer depth of these samples.
+
+```{r}
+#| fig-height: 5
+#| warning: false
+
+# Data input paths
+data_dir <-"../data/2024-03-15_yang/"
+libraries_path <-file.path(data_dir, "sample-metadata.csv")
+basic_stats_path <-file.path(data_dir, "qc_basic_stats.tsv")
+adapter_stats_path <-file.path(data_dir, "qc_adapter_stats.tsv")
+quality_base_stats_path <-file.path(data_dir, "qc_quality_base_stats.tsv")
+quality_seq_stats_path <-file.path(data_dir, "qc_quality_sequence_stats.tsv")
+
+# Import libraries and extract metadata from sample names
+libraries <-read_csv(libraries_path, show_col_types =FALSE)
+
+# Import QC data
+stages <-c("raw_concat", "cleaned", "dedup", "ribo_initial", "ribo_secondary")
+basic_stats <-read_tsv(basic_stats_path, show_col_types =FALSE) %>%
+# mutate(sample = sub("_2$", "", sample)) %>%
+inner_join(libraries, by="sample") %>%
+mutate(stage =factor(stage, levels = stages),
+sample =fct_inorder(sample))
+adapter_stats <-read_tsv(adapter_stats_path, show_col_types =FALSE) %>%
+mutate(sample =sub("_2$", "", sample)) %>%
+inner_join(libraries, by="sample") %>%
+mutate(stage =factor(stage, levels = stages),
+read_pair =fct_inorder(as.character(read_pair)))
+quality_base_stats <-read_tsv(quality_base_stats_path, show_col_types =FALSE) %>%
+# mutate(sample = sub("_2$", "", sample)) %>%
+inner_join(libraries, by="sample") %>%
+mutate(stage =factor(stage, levels = stages),
+read_pair =fct_inorder(as.character(read_pair)))
+quality_seq_stats <-read_tsv(quality_seq_stats_path, show_col_types =FALSE) %>%
+# mutate(sample = sub("_2$", "", sample)) %>%
+inner_join(libraries, by="sample") %>%
+mutate(stage =factor(stage, levels = stages),
+read_pair =fct_inorder(as.character(read_pair)))
+
+# Filter to raw data
+basic_stats_raw <- basic_stats %>%filter(stage =="raw_concat")
+adapter_stats_raw <- adapter_stats %>%filter(stage =="raw_concat")
+quality_base_stats_raw <- quality_base_stats %>%filter(stage =="raw_concat")
+quality_seq_stats_raw <- quality_seq_stats %>%filter(stage =="raw_concat")
+
+# Visualize basic stats
+g_nreads_raw <-ggplot(basic_stats_raw, aes(x=sample, y=n_read_pairs, fill=location)) +geom_col(position="dodge") +scale_y_continuous(name="# Read pairs", expand=c(0,0)) +scale_fill_brewer(palette="Set1", name="Location") + theme_kit
+legend_location <-get_legend(g_nreads_raw)
+g_nreads_raw_2 <- g_nreads_raw +theme(legend.position ="none")
+g_nbases_raw <-ggplot(basic_stats_raw, aes(x=sample, y=n_bases_approx, fill=location)) +geom_col(position="dodge") +scale_y_continuous(name="Total base pairs (approx)", expand=c(0,0)) +scale_fill_brewer(palette="Set1", name="Location") + theme_kit +theme(legend.position ="none")
+g_ndup_raw <-ggplot(basic_stats_raw, aes(x=sample, y=percent_duplicates, fill=location)) +geom_col(position="dodge") +scale_y_continuous(name="% Duplicates (FASTQC)", expand=c(0,0), limits =c(0,100), breaks =seq(0,100,20)) +scale_fill_brewer(palette="Set1", name="Location") + theme_kit +theme(legend.position ="none")
+g_basic_raw <-plot_grid(g_nreads_raw_2 + g_nbases_raw + g_ndup_raw, legend_location, ncol =1, rel_heights =c(1,0.1))
+g_basic_raw
+
+# Visualize adapters
+g_adapters_raw <-ggplot(adapter_stats_raw, aes(x=position, y=pc_adapters, color=location, linetype = read_pair, group=interaction(sample, read_pair))) +geom_line() +
+scale_color_brewer(palette ="Set1", name ="Location") +
+scale_linetype_discrete(name ="Read Pair") +
+scale_y_continuous(name="% Adapters", limits=c(0,50),
+breaks =seq(0,50,10), expand=c(0,0)) +
+scale_x_continuous(name="Position", limits=c(0,140),
+breaks=seq(0,140,20), expand=c(0,0)) +
+guides(color=guide_legend(nrow=2,byrow=TRUE),
+linetype =guide_legend(nrow=2,byrow=TRUE)) +
+facet_wrap(~adapter) + theme_base
+g_adapters_raw
+
+# Visualize quality
+g_quality_base_raw <-ggplot(quality_base_stats_raw, aes(x=position, y=mean_phred_score, color=location, linetype = read_pair, group=interaction(sample,read_pair))) +
+geom_hline(yintercept=25, linetype="dashed", color="red") +
+geom_hline(yintercept=30, linetype="dashed", color="red") +
+geom_line() +
+scale_color_brewer(palette ="Set1", name ="Location") +
+scale_linetype_discrete(name ="Read Pair") +
+scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
+scale_x_continuous(name="Position", limits=c(0,140),
+breaks=seq(0,140,20), expand=c(0,0)) +
+guides(color=guide_legend(nrow=2,byrow=TRUE),
+linetype =guide_legend(nrow=2,byrow=TRUE)) +
+ theme_base
+g_quality_seq_raw <-ggplot(quality_seq_stats_raw, aes(x=mean_phred_score, y=n_sequences, color=location, linetype = read_pair, group=interaction(sample,read_pair))) +
+geom_vline(xintercept=25, linetype="dashed", color="red") +
+geom_vline(xintercept=30, linetype="dashed", color="red") +
+geom_line() +
+scale_color_brewer(palette ="Set1", name ="Location") +
+scale_linetype_discrete(name ="Read Pair") +
+scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
+scale_y_continuous(name="# Sequences", expand=c(0,0)) +
+guides(color=guide_legend(nrow=2,byrow=TRUE),
+linetype =guide_legend(nrow=2,byrow=TRUE)) +
+ theme_base
+g_quality_base_raw
+g_quality_seq_raw
+```
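+
+As a quick check on the depth figures quoted above, here is a minimal sketch summarizing per-sample depth and total volume from the `basic_stats_raw` table computed in the chunk above:
+
+```{r}
+# Summarize per-sample sequencing depth and total sequence volume
+basic_stats_raw %>%
+  summarize(min_read_pairs = min(n_read_pairs),
+            mean_read_pairs = mean(n_read_pairs),
+            max_read_pairs = max(n_read_pairs),
+            total_read_pairs = sum(n_read_pairs),
+            total_gb = sum(n_bases_approx)/1e9)
+```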
+
+# Preprocessing
+
+```{r}
+n_reads_rel <- basic_stats %>%
+select(sample, location, stage, percent_duplicates, n_read_pairs) %>%
+group_by(sample, location) %>%arrange(sample, location, stage) %>%
+mutate(p_reads_retained = n_read_pairs /lag(n_read_pairs),
+p_reads_lost =1- p_reads_retained,
+p_reads_retained_abs = n_read_pairs / n_read_pairs[1],
+p_reads_lost_abs =1-p_reads_retained_abs,
+p_reads_lost_abs_marginal = p_reads_lost_abs -lag(p_reads_lost_abs))
+
+```
+
+On average, cleaning & deduplication together removed about 86% of total reads from each sample, with the largest losses seen during deduplication. Only a few reads (low single-digit percentages at most) were lost during subsequent ribodepletion – a surprising finding given the apparent lack of rRNA depletion in the study methods. This makes me suspect that the methods did include rRNA depletion without mentioning it; if this isn't the case, it suggests that the methods used by the authors might be particularly effective at removing bacteria from the sample.
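+
+To put numbers on this, a minimal sketch of the mean fraction of reads lost at each stage, using the `n_reads_rel` table from the chunk above:
+
+```{r}
+# Mean marginal read loss per preprocessing stage, averaged across samples
+n_reads_rel %>%
+  group_by(stage) %>%
+  summarize(mean_pc_lost = mean(p_reads_lost_abs_marginal, na.rm = TRUE)*100)
+```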
+
+```{r}
+#| warning: false
+# Plot reads over preprocessing
+g_reads_stages <-ggplot(basic_stats, aes(x=stage, y=n_read_pairs,fill=location,group=sample)) +
+geom_col(position="dodge") +scale_fill_brewer(palette="Set1", name="Location") +
+scale_y_continuous("# Read pairs", expand=c(0,0)) +
+ theme_kit
+g_reads_stages
+
+# Plot relative read losses during preprocessing
+g_reads_rel <-ggplot(n_reads_rel, aes(x=stage, y=p_reads_lost_abs_marginal,fill=location,group=sample)) +
+geom_col(position="dodge") +scale_fill_brewer(palette="Set1", name="Location") +
+scale_y_continuous("% Total Reads Lost", expand=c(0,0), labels =function(x) x*100) +
+ theme_kit
+g_reads_rel
+```
+
+Data cleaning with FASTP was very successful at removing adapters, with very few adapter sequences found by FASTQC at any stage after the raw data. Improvements in quality were unsurprisingly minimal, given the high initial quality observed.
+
+```{r}
+#| warning: false
+# Visualize adapters
+g_adapters <-ggplot(adapter_stats, aes(x=position, y=pc_adapters, color=location, linetype = read_pair, group=interaction(sample, read_pair))) +geom_line() +
+scale_color_brewer(palette ="Set1", name ="Location") +
+scale_linetype_discrete(name ="Read Pair") +
+scale_y_continuous(name="% Adapters", limits=c(0,50),
+breaks =seq(0,50,10), expand=c(0,0)) +
+scale_x_continuous(name="Position", limits=c(0,140),
+breaks=seq(0,140,20), expand=c(0,0)) +
+guides(color=guide_legend(nrow=2,byrow=TRUE),
+linetype =guide_legend(nrow=2,byrow=TRUE)) +
+facet_grid(stage~adapter) + theme_base
+g_adapters
+
+# Visualize quality
+g_quality_base <-ggplot(quality_base_stats, aes(x=position, y=mean_phred_score, color=location, linetype = read_pair, group=interaction(sample,read_pair))) +
+geom_hline(yintercept=25, linetype="dashed", color="red") +
+geom_hline(yintercept=30, linetype="dashed", color="red") +
+geom_line() +
+scale_color_brewer(palette ="Set1", name ="Location") +
+scale_linetype_discrete(name ="Read Pair") +
+scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
+scale_x_continuous(name="Position", limits=c(0,140),
+breaks=seq(0,140,20), expand=c(0,0)) +
+guides(color=guide_legend(nrow=2,byrow=TRUE),
+linetype =guide_legend(nrow=2,byrow=TRUE)) +
+facet_grid(stage~.) + theme_base
+g_quality_seq <-ggplot(quality_seq_stats, aes(x=mean_phred_score, y=n_sequences, color=location, linetype = read_pair, group=interaction(sample,read_pair))) +
+geom_vline(xintercept=25, linetype="dashed", color="red") +
+geom_vline(xintercept=30, linetype="dashed", color="red") +
+geom_line() +
+scale_color_brewer(palette ="Set1", name ="Location") +
+scale_linetype_discrete(name ="Read Pair") +
+scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
+scale_y_continuous(name="# Sequences", expand=c(0,0)) +
+guides(color=guide_legend(nrow=2,byrow=TRUE),
+linetype =guide_legend(nrow=2,byrow=TRUE)) +
+facet_grid(stage~., scales="free_y") + theme_base
+g_quality_base
+g_quality_seq
+```
+
+According to FASTQC, deduplication was only moderately effective at reducing measured duplicate levels, despite the high read losses observed. After deduplication, FASTQC-measured duplicate levels fell from an average of 92% to an average of 58%:
+
+```{r}
+g_dup_stages <-ggplot(basic_stats, aes(x=stage, y=percent_duplicates, fill=location, group=sample)) +
+geom_col(position="dodge") +scale_fill_brewer(palette ="Set1", name="Location") +
+scale_y_continuous("% Duplicates", limits=c(0,100), breaks=seq(0,100,20), expand=c(0,0)) +
+ theme_kit
+g_readlen_stages <-ggplot(basic_stats, aes(x=stage, y=mean_seq_len, fill=location, group=sample)) +
+geom_col(position="dodge") +scale_fill_brewer(palette ="Set1", name="Location") +
+scale_y_continuous("Mean read length (nt)", expand=c(0,0)) +
+ theme_kit
+legend_loc <-get_legend(g_dup_stages)
+g_dup_stages
+g_readlen_stages
+```
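+
+As a quick check on the duplication figures quoted above, a minimal sketch of mean FASTQC-measured duplicate levels by stage, using `basic_stats` from above:
+
+```{r}
+# Mean FASTQC-measured duplicate percentage at each preprocessing stage
+basic_stats %>%
+  group_by(stage) %>%
+  summarize(mean_pc_dup = mean(percent_duplicates))
+```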
+
+# High-level composition
+
+As before, to assess the high-level composition of the reads, I ran the ribodepleted files through Kraken (using the Standard 16 database) and summarized the results with Bracken. Combining these results with the read counts above gives us a breakdown of the inferred composition of the samples:
+
+```{r}
+# Import Bracken data
+bracken_path <-file.path(data_dir, "bracken_counts.tsv")
+bracken <-read_tsv(bracken_path, show_col_types =FALSE)
+total_assigned <- bracken %>%group_by(sample) %>%summarize(
+name ="Total",
+kraken_assigned_reads =sum(kraken_assigned_reads),
+added_reads =sum(added_reads),
+new_est_reads =sum(new_est_reads),
+fraction_total_reads =sum(fraction_total_reads)
+)
+bracken_spread <- bracken %>%select(name, sample, new_est_reads) %>%
+mutate(name =tolower(name)) %>%
+pivot_wider(id_cols ="sample", names_from ="name", values_from ="new_est_reads")
+# Count reads
+read_counts_preproc <- basic_stats %>%
+select(sample, location, stage, n_read_pairs) %>%
+pivot_wider(id_cols =c("sample", "location"), names_from="stage", values_from="n_read_pairs")
+read_counts <- read_counts_preproc %>%
+inner_join(total_assigned %>%select(sample, new_est_reads), by ="sample") %>%
+rename(assigned = new_est_reads) %>%
+inner_join(bracken_spread, by="sample")
+# Assess composition
+read_comp <-transmute(read_counts, sample=sample, location=location,
+n_filtered = raw_concat-cleaned,
+n_duplicate = cleaned-dedup,
+n_ribosomal = (dedup-ribo_initial) + (ribo_initial-ribo_secondary),
+n_unassigned = ribo_secondary-assigned,
+n_bacterial = bacteria,
+n_archaeal = archaea,
+n_viral = viruses,
+n_human = eukaryota)
+read_comp_long <-pivot_longer(read_comp, -(sample:location), names_to ="classification",
+names_prefix ="n_", values_to ="n_reads") %>%
+mutate(classification =fct_inorder(str_to_sentence(classification))) %>%
+group_by(sample) %>%mutate(p_reads = n_reads/sum(n_reads))
+read_comp_summ <- read_comp_long %>%
+group_by(location, classification) %>%
+summarize(n_reads =sum(n_reads), .groups ="drop_last") %>%
+mutate(n_reads =replace_na(n_reads,0),
+p_reads = n_reads/sum(n_reads),
+pc_reads = p_reads*100)
+
+# Plot overall composition
+g_comp <-ggplot(read_comp_long, aes(x=sample, y=p_reads, fill=classification)) +
+geom_col(position ="stack") +
+scale_y_continuous(name ="% Reads", limits =c(0,1.01), breaks =seq(0,1,0.2),
+expand =c(0,0), labels =function(x) x*100) +
+scale_fill_brewer(palette ="Set1", name ="Classification") +
+facet_wrap(.~location, scales="free") +
+ theme_kit
+g_comp
+
+# Plot composition of minor components
+read_comp_minor <- read_comp_long %>%filter(classification %in%c("Archaeal", "Viral", "Human", "Other"))
+palette_minor <-brewer.pal(9, "Set1")[6:9]
+g_comp_minor <-ggplot(read_comp_minor, aes(x=sample, y=p_reads, fill=classification)) +
+geom_col(position ="stack") +
+scale_y_continuous(name ="% Reads", breaks =seq(0,0.1,0.01),
+expand =c(0,0), labels =function(x) x*100) +
+scale_fill_manual(values=palette_minor, name ="Classification") +
+facet_wrap(.~location, scales ="free_x") +
+ theme_kit
+g_comp_minor
+```
+
+```{r}
+p_reads_summ <- read_comp_long %>%
+mutate(classification =ifelse(classification %in%c("Filtered", "Duplicate", "Unassigned"), "Excluded", as.character(classification))) %>%
+group_by(classification, sample, location) %>%
+summarize(p_reads =sum(p_reads), .groups ="drop") %>%
+group_by(classification) %>%
+summarize(pc_min =min(p_reads)*100, pc_max =max(p_reads)*100, pc_mean =mean(p_reads)*100)
+```
+
+Overall, in these unenriched samples, between 89% and 95% of reads (mean 93%) were low-quality, duplicates, or unassigned. Of the remainder, 0.4-3.9% (mean 1.6%) were identified as ribosomal, and 0.06-0.22% (mean 0.13%) were otherwise classified as bacterial. The fraction of human reads ranged from 0.006% to 0.198% (mean 0.049%). Strikingly, the fraction of viral reads ranged from 0.6% all the way up to 10.2% (mean 5.5%).
+
+The total fraction of viral reads is strikingly high, even higher on average than for Rothman unenriched samples. As in Rothman, this is primarily accounted for by very high levels of tobamoviruses (viral family Virgaviridae), which make up 74-98% (mean 84%) of viral reads. Most of the remainder is accounted for by [*Tombusviridae*](https://en.wikipedia.org/wiki/Tombusviridae) and [*Solemoviridae*](https://en.wikipedia.org/wiki/Solemoviridae), two other families of plant viruses. Only one family of vertebrate viruses, *Astroviridae*, accounted for more than 1% of viral reads in any sample.
+
+```{r}
+viral_taxa_path <-file.path(data_dir, "viral-taxids.tsv.gz")
+viral_taxa <-read_tsv(viral_taxa_path, show_col_types =FALSE)
+samples <-as.character(basic_stats_raw$sample)
+col_names <-c("pc_reads_total", "n_reads_clade", "n_reads_direct", "rank", "taxid", "name")
+report_paths <-paste0(data_dir, "kraken/", samples, ".report.gz")
+kraken_reports <-lapply(1:length(samples),
+function(n) read_tsv(report_paths[n], col_names = col_names,
+show_col_types =FALSE) %>%
+mutate(sample = samples[n])) %>% bind_rows
+kraken_reports_viral <-filter(kraken_reports, taxid %in% viral_taxa$taxid) %>%
+group_by(sample) %>%
+mutate(p_reads_viral = n_reads_clade/n_reads_clade[1])
+viral_families <- kraken_reports_viral %>%filter(rank =="F")
+viral_families_major_list <- viral_families %>%group_by(name, taxid) %>%
+summarize(p_reads_viral_max =max(p_reads_viral), .groups="drop") %>%
+filter(p_reads_viral_max >=0.01) %>%pull(name)
+viral_families_major <- viral_families %>%filter(name %in% viral_families_major_list) %>%
+select(name, taxid, sample, p_reads_viral)
+viral_families_minor <- viral_families_major %>%group_by(sample) %>%
+summarize(p_reads_viral_major =sum(p_reads_viral)) %>%
+mutate(name ="Other", taxid=NA, p_reads_viral =1-p_reads_viral_major) %>%
+select(name, taxid, sample, p_reads_viral)
+viral_families_display <-bind_rows(viral_families_major, viral_families_minor) %>%
+arrange(desc(p_reads_viral)) %>%mutate(name =factor(name, levels=c(viral_families_major_list, "Other")))
+g_families <-ggplot(viral_families_display, aes(x=sample, y=p_reads_viral, fill=name)) +
+geom_col(position="stack") +
+scale_fill_brewer(palette ="Dark2", name ="Viral family") +
+scale_y_continuous(name="% Viral Reads", limits=c(0,1), breaks =seq(0,1,0.2),
+expand=c(0,0), labels =function(y) y*100) +
+ theme_kit
+g_families
+```
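+
+To pull out the Virgaviridae figure quoted above, a minimal sketch using the `viral_families` table from the chunk above (`str_detect` guards against any leading whitespace in Kraken report names):
+
+```{r}
+# Range of the per-sample share of viral reads assigned to Virgaviridae
+viral_families %>%
+  filter(str_detect(name, "Virgaviridae")) %>%
+  pull(p_reads_viral) %>%
+  range()
+```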
+
+Almost as striking as the high fraction of viral reads in these data is the extremely low prevalence of bacterial reads. For comparison, the fraction of non-ribosomal bacterial reads ranged from 16-20% in unenriched Crits-Christoph samples and from 2-12% in unenriched Rothman samples. Yang et al. are thus somehow achieving a greater-than-10-fold reduction in the fraction of bacterial reads compared to other studies. Unlike the increased prevalence of viruses, this seems harder to explain away via differences in conditions on the ground. Together with the elevated viral fraction, these results suggest to me that Yang's idiosyncratic methods are worth investigating further.
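+
+A corresponding sketch for the bacterial fraction, using `read_comp_long` from the composition chunk above:
+
+```{r}
+# Per-sample fraction of (non-ribosomal) bacterial reads
+read_comp_long %>%
+  filter(classification == "Bacterial") %>%
+  select(sample, location, p_reads)
+```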
+
+# Human-infecting virus reads: validation
+
+Next, I investigated the human-infecting virus read content of these unenriched samples, using a pipeline identical to that described in my [entry on unenriched Rothman samples](https://data.securebio.org/wills-public-notebook/notebooks/2024-02-27_rothman-1.html). This process identified a total of 532,583 read pairs across all samples (0.3% of surviving reads):
+
+```{r}
+hv_reads_filtered_path <-file.path(data_dir, "hv_hits_putative_filtered.tsv.gz")
+hv_reads_filtered <-read_tsv(hv_reads_filtered_path, show_col_types =FALSE) %>%
+inner_join(libraries, by="sample") %>%
+arrange(location) %>%
+mutate(sample =fct_inorder(sample))
+n_hv_filtered <- hv_reads_filtered %>%group_by(sample, location) %>% count %>%inner_join(basic_stats %>%filter(stage =="ribo_initial") %>%select(sample, n_read_pairs), by="sample") %>%rename(n_putative = n, n_total = n_read_pairs) %>%mutate(p_reads = n_putative/n_total, pc_reads = p_reads *100)
+```
+
+```{r}
+#| warning: false
+#| fig-width: 8
+
+mrg <- hv_reads_filtered %>%
+mutate(kraken_label =ifelse(assigned_hv, "Kraken2 HV\nassignment",
+ifelse(hit_hv, "Kraken2 HV\nhit",
+"No hit or\nassignment"))) %>%
+group_by(sample) %>%arrange(desc(adj_score_fwd), desc(adj_score_rev)) %>%
+mutate(seq_num =row_number(),
+adj_score_max =pmax(adj_score_fwd, adj_score_rev))
+# Import Bowtie2/Kraken data and perform filtering steps
+g_mrg <-ggplot(mrg, aes(x=adj_score_fwd, y=adj_score_rev)) +
+geom_point(alpha=0.5, shape=16) +
+scale_x_continuous(name="S(forward read)", limits=c(0,40), breaks=seq(0,100,10), expand =c(0,0)) +
+scale_y_continuous(name="S(reverse read)", limits=c(0,40), breaks=seq(0,100,10), expand =c(0,0)) +
+facet_wrap(~kraken_label, labeller =labeller(kit =label_wrap_gen(20))) +
+ theme_base +theme(aspect.ratio=1)
+g_mrg
+
+g_hist_0 <-ggplot(mrg, aes(x=adj_score_max)) +geom_histogram(binwidth=5,boundary=0,position="dodge") +facet_wrap(~kraken_label) +scale_x_continuous(name ="Maximum adjusted alignment score") +scale_y_continuous(name="# Read pairs") +scale_fill_brewer(palette ="Dark2") + theme_base +geom_vline(xintercept=20, linetype="dashed", color="red")
+g_hist_0
+```
+
+This is a far larger number of putative viral reads than I previously needed to analyze, and is far too many to realistically process with BLASTN. To address this problem, I randomly selected 1% of putative viral sequences (5,326 read pairs) and ran them through BLASTN in a similar manner to past datasets:
+
+```{r}
+set.seed(83745)
+mrg_sample <-sample_frac(mrg, 0.01)
+mrg_num <- mrg_sample %>%group_by(sample) %>%
+arrange(sample, desc(adj_score_max)) %>%
+mutate(seq_num =row_number(), seq_head =paste0(">", sample, "_", seq_num))
+mrg_fasta <- mrg_num %>%
+ ungroup %>%
+select(header1=seq_head, seq1=query_seq_fwd, header2=seq_head, seq2=query_seq_rev) %>%
+mutate(header1=paste0(header1, "_1"), header2=paste0(header2, "_2"))
+mrg_fasta_out <-do.call(paste, c(mrg_fasta, sep="\n")) %>%paste(collapse="\n")
+write(mrg_fasta_out, file.path(data_dir, "yang-putative-viral.fasta"))
+```
+
+```{r}
+#| warning: false
+# Import BLAST results
+blast_results_path <-file.path(data_dir, "yang-putative-viral-all.blast.gz")
+blast_cols <-c("qseqid", "sseqid", "sgi", "staxid", "qlen", "evalue", "bitscore", "qcovs", "length", "pident", "mismatch", "gapopen", "sstrand", "qstart", "qend", "sstart", "send")
+blast_results <-read_tsv(blast_results_path, show_col_types =FALSE,
+col_names = blast_cols, col_types =cols(.default="c"))
+
+# Add viral status
+blast_results_viral <-mutate(blast_results, viral = staxid %in% viral_taxa$taxid)
+
+# Filter for best hit for each query/subject, then rank hits for each query
+blast_results_best <- blast_results_viral %>%group_by(qseqid, staxid) %>%
+filter(bitscore ==max(bitscore)) %>%
+filter(length ==max(length)) %>%filter(row_number() ==1)
+blast_results_ranked <- blast_results_best %>%
+group_by(qseqid) %>%mutate(rank =dense_rank(desc(bitscore)))
+blast_results_highrank <- blast_results_ranked %>%filter(rank <=5) %>%
+mutate(read_pair =str_split(qseqid, "_") %>%sapply(nth, n=-1),
+seq_num =str_split(qseqid, "_") %>%sapply(nth, n=-2),
+sample =str_split(qseqid, "_") %>%lapply(head, n=-2) %>%
+sapply(paste, collapse="_")) %>%
+mutate(bitscore =as.numeric(bitscore), seq_num =as.numeric(seq_num))
+
+# Summarize by read pair and taxid
+blast_results_paired <- blast_results_highrank %>%
+group_by(sample, seq_num, staxid, viral) %>%
+summarize(bitscore_max =max(bitscore), bitscore_min =min(bitscore),
+n_reads =n(), .groups ="drop") %>%
+mutate(viral_full = viral & n_reads ==2)
+
+# Compare to Kraken & Bowtie assignments
+mrg_assign <- mrg_num %>%select(sample, seq_num, taxid, assigned_taxid)
+blast_results_assign <-left_join(blast_results_paired, mrg_assign, by=c("sample", "seq_num")) %>%
+mutate(taxid_match_bowtie = (staxid == taxid),
+taxid_match_kraken = (staxid == assigned_taxid),
+taxid_match_any = taxid_match_bowtie | taxid_match_kraken)
+blast_results_out <- blast_results_assign %>%
+group_by(sample, seq_num) %>%
+summarize(viral_status =ifelse(any(viral_full), 2,
+ifelse(any(taxid_match_any), 2,
+ifelse(any(viral), 1, 0))),
+.groups ="drop")
+```
+
+```{r}
+#| fig-height: 8
+#| warning: false
+
+# Merge BLAST results with unenriched read data
+mrg_blast <-full_join(mrg_num, blast_results_out, by=c("sample", "seq_num")) %>%
+mutate(viral_status =replace_na(viral_status, 0),
+viral_status_out =ifelse(viral_status ==0, FALSE, TRUE))
+
+# Plot
+g_mrg_blast_0 <- mrg_blast %>%
+ggplot(aes(x=adj_score_fwd, y=adj_score_rev, color=viral_status_out)) +
+geom_point(alpha=0.5, shape=16) +
+scale_x_continuous(name="S(forward read)", limits=c(0,40), breaks=seq(0,100,10), expand =c(0,0)) +
+scale_y_continuous(name="S(reverse read)", limits=c(0,40), breaks=seq(0,100,10), expand =c(0,0)) +
+scale_color_brewer(palette ="Set1", name ="Viral status") +
+facet_grid(viral_status_out~kraken_label, labeller =labeller(kit =label_wrap_gen(20))) +
+ theme_base +theme(aspect.ratio=1)
+g_mrg_blast_0
+```
+
+```{r}
+g_hist_1 <-ggplot(mrg_blast, aes(x=adj_score_max)) +geom_histogram(binwidth=5,boundary=0,position="dodge") +facet_wrap(~viral_status_out) +scale_x_continuous(name ="Maximum adjusted alignment score") +scale_y_continuous(name="# Read pairs") +scale_fill_brewer(palette ="Dark2") + theme_base +geom_vline(xintercept=20, linetype="dashed", color="red")
+g_hist_1
+```
+
+High-scoring true-positives massively outnumber false-positives, while low-scoring true-positives are fairly rare. My usual score cutoff, a disjunctive threshold at S=20, achieves a sensitivity of \~97% and an F1 score of \~98%. As such, I think this looks pretty good, and feel comfortable moving on to analyzing the entire HV dataset.
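+
+For reference, the metrics quoted above are computed in the chunk below as
+
+$$
+\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
+\text{precision} = \frac{TP}{TP + FP}, \qquad
+F_1 = \frac{2 \cdot \text{precision} \cdot \text{sensitivity}}{\text{precision} + \text{sensitivity}}
+$$
+
+where positives are read pairs judged human-viral by BLASTN and "retained" reads are those passing the Kraken flags or the score threshold.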
+
+```{r}
+test_sens_spec <-function(tab, score_threshold, conjunctive, include_special){
+if (!include_special) tab <-filter(tab, viral_status_out %in%c("TRUE", "FALSE"))
+ tab_retained <- tab %>%mutate(
+conjunctive = conjunctive,
+retain_score_conjunctive = (adj_score_fwd > score_threshold & adj_score_rev > score_threshold),
+retain_score_disjunctive = (adj_score_fwd > score_threshold | adj_score_rev > score_threshold),
+retain_score =ifelse(conjunctive, retain_score_conjunctive, retain_score_disjunctive),
+retain = assigned_hv | hit_hv | retain_score) %>%
+group_by(viral_status_out, retain) %>% count
+ pos_tru <- tab_retained %>%filter(viral_status_out =="TRUE", retain) %>%pull(n) %>% sum
+ pos_fls <- tab_retained %>%filter(viral_status_out !="TRUE", retain) %>%pull(n) %>% sum
+ neg_tru <- tab_retained %>%filter(viral_status_out !="TRUE", !retain) %>%pull(n) %>% sum
+ neg_fls <- tab_retained %>%filter(viral_status_out =="TRUE", !retain) %>%pull(n) %>% sum
+ sensitivity <- pos_tru / (pos_tru + neg_fls)
+ specificity <- neg_tru / (neg_tru + pos_fls)
+ precision <- pos_tru / (pos_tru + pos_fls)
+ f1 <-2* precision * sensitivity / (precision + sensitivity)
+ out <-tibble(threshold=score_threshold, include_special = include_special,
+conjunctive = conjunctive, sensitivity=sensitivity,
+specificity=specificity, precision=precision, f1=f1)
+return(out)
+}
+range_f1 <-function(intab, inc_special, inrange=15:45){
+ tss <- purrr::partial(test_sens_spec, tab=intab, include_special=inc_special)
+ stats_conj <-lapply(inrange, tss, conjunctive=TRUE) %>% bind_rows
+ stats_disj <-lapply(inrange, tss, conjunctive=FALSE) %>% bind_rows
+ stats_all <-bind_rows(stats_conj, stats_disj) %>%
+pivot_longer(!(threshold:conjunctive), names_to="metric", values_to="value") %>%
+mutate(conj_label =ifelse(conjunctive, "Conjunctive", "Disjunctive"))
+return(stats_all)
+}
+inc_special <-FALSE
+stats_0 <-range_f1(mrg_blast, inc_special) %>%mutate(attempt=0)
+threshold_opt_0 <- stats_0 %>%group_by(conj_label,attempt) %>%filter(metric =="f1") %>%
+filter(value ==max(value)) %>%filter(threshold ==min(threshold))
+g_stats_0 <-ggplot(stats_0, aes(x=threshold, y=value, color=metric)) +
+geom_vline(data = threshold_opt_0, mapping =aes(xintercept=threshold),
+color ="red", linetype ="dashed") +
+geom_line() +
+scale_y_continuous(name ="Value", limits=c(0,1), breaks =seq(0,1,0.2), expand =c(0,0)) +
+scale_x_continuous(name ="Threshold", expand =c(0,0)) +
+scale_color_brewer(palette="Set3") +
+facet_wrap(~conj_label) +
+ theme_base
+g_stats_0
+```
+
+# Human-infecting virus reads: analysis
+
+```{r}
+# Get raw read counts
+read_counts_raw <- basic_stats_raw %>%
+select(sample, location, n_reads_raw = n_read_pairs)
+
+# Get HV read counts & RA
+mrg_hv <- mrg %>%
+mutate(hv_status = assigned_hv | hit_hv | adj_score_max >=20)
+read_counts_hv <- mrg_hv %>%filter(hv_status) %>%group_by(sample) %>%count(name ="n_reads_hv")
+read_counts <- read_counts_raw %>%
+left_join(read_counts_hv, by="sample") %>%
+mutate(n_reads_hv =replace_na(n_reads_hv, 0),
+p_reads_hv = n_reads_hv/n_reads_raw)
+
+# Aggregate
+read_counts_total <- read_counts %>% ungroup %>%
+summarize(n_reads_raw =sum(n_reads_raw),
+n_reads_hv =sum(n_reads_hv)) %>%
+mutate(p_reads_hv = n_reads_hv/n_reads_raw,
+sample="All samples", location ="All locations")
+read_counts_agg <- read_counts %>%
+bind_rows(read_counts_total) %>%
+mutate(sample =fct_inorder(sample),
+location =fct_inorder(location))
+```
+
+Applying a disjunctive cutoff at S=20 identifies 513,259 read pairs as human-viral out of 1.42B total read pairs, for a relative HV abundance of $3.62 \times 10^{-4}$. This compares to $2.0 \times 10^{-4}$ on the public dashboard, corresponding to the results for Kraken-only identification: an increase of roughly 80%, smaller than the 4-5x increases seen for Crits-Christoph and Rothman. Relative HV abundances for individual samples ranged from $6.36 \times 10^{-5}$ to $1.19 \times 10^{-3}$:
+
+```{r}
+# Visualize
+g_phv_agg <-ggplot(read_counts_agg, aes(x=sample, y=p_reads_hv, color=location)) +
+geom_point() +
+scale_y_log10("Relative abundance of human virus reads") +
+scale_color_brewer(palette ="Set1", name ="Location") +
+ theme_kit
+g_phv_agg
+```
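+
+The ~80% figure quoted above is just the ratio of the two abundance estimates; a minimal sketch:
+
+```{r}
+# Fold change in relative HV abundance vs. the Kraken-only dashboard figure
+p_hv_pipeline <- 513259/1.42e9 # ~3.62e-4
+p_hv_dashboard <- 2.0e-4       # public dashboard, Kraken-only
+p_hv_pipeline/p_hv_dashboard   # ~1.8, i.e. an ~80% increase
+```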
+
+Digging into individual viruses, we see that fecal-oral viruses predominate, especially *Mamastrovirus*, *Rotavirus*, *Salivirus*, and fecal-oral *Enterovirus* species (especially *Enterovirus C*, which includes poliovirus):
+
+```{r}
+# Get viral taxon names for putative HV reads
+viral_taxa$name[viral_taxa$taxid ==249588] <-"Mamastrovirus"
+viral_taxa$name[viral_taxa$taxid ==194960] <-"Kobuvirus"
+viral_taxa$name[viral_taxa$taxid ==688449] <-"Salivirus"
+mrg_hv_named <- mrg_hv %>%left_join(viral_taxa, by="taxid")
+
+# Discover viral species & genera for HV reads
+raise_rank <-function(read_db, taxid_db, out_rank ="species", verbose =FALSE){
+# Get higher ranks than search rank
+ ranks <-c("subspecies", "species", "subgenus", "genus", "subfamily", "family", "suborder", "order", "class", "subphylum", "phylum", "kingdom", "superkingdom")
+ rank_match <-which.max(ranks == out_rank)
+ high_ranks <- ranks[rank_match:length(ranks)]
+# Merge read DB and taxid DB
+ reads <- read_db %>%select(-parent_taxid, -rank, -name) %>%
+left_join(taxid_db, by="taxid")
+# Extract sequences that are already at appropriate rank
+ reads_rank <-filter(reads, rank == out_rank)
+# Drop sequences at a higher rank and return unclassified sequences
+ reads_norank <- reads %>%filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))
+while(nrow(reads_norank) >0){ # As long as there are unclassified sequences...
+# Promote read taxids and re-merge with taxid DB, then re-classify and filter
+ reads_remaining <- reads_norank %>%mutate(taxid = parent_taxid) %>%
+select(-parent_taxid, -rank, -name) %>%
+left_join(taxid_db, by="taxid")
+ reads_rank <- reads_remaining %>%filter(rank == out_rank) %>%
+bind_rows(reads_rank)
+ reads_norank <- reads_remaining %>%
+filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))
+ }
+# Finally, extract and append reads that were excluded during the process
+ reads_dropped <- reads %>%filter(!seq_id %in% reads_rank$seq_id)
+ reads_out <- reads_rank %>%bind_rows(reads_dropped) %>%
+select(-parent_taxid, -rank, -name) %>%
+left_join(taxid_db, by="taxid")
+return(reads_out)
+}
+hv_reads_species <-raise_rank(mrg_hv_named, viral_taxa, "species")
+hv_reads_genera <-raise_rank(mrg_hv_named, viral_taxa, "genus")
+```
+
+```{r}
+# Count relative abundance for species
+hv_species_counts_raw <- hv_reads_species %>%group_by(sample, location, name) %>%
+summarize(n_reads_hv =sum(hv_status), .groups ="drop") %>%
+inner_join(read_counts_agg %>%select(sample, n_reads_raw), by="sample")
+hv_species_counts_all <- hv_species_counts_raw %>%group_by(name) %>%
+summarize(n_reads_hv =sum(n_reads_hv),
+n_reads_raw =sum(n_reads_raw)) %>%
+mutate(sample ="All samples", location ="All locations")
+hv_species_counts_agg <-bind_rows(hv_species_counts_raw, hv_species_counts_all) %>%
+mutate(p_reads_hv = n_reads_hv/n_reads_raw)
+
+# Count relative abundance for genera
+hv_genera_counts_raw <- hv_reads_genera %>%group_by(sample, location, name) %>%
+summarize(n_reads_hv =sum(hv_status), .groups ="drop") %>%
+inner_join(read_counts_agg %>%select(sample, n_reads_raw), by="sample")
+hv_genera_counts_all <- hv_genera_counts_raw %>%group_by(name) %>%
+summarize(n_reads_hv =sum(n_reads_hv),
+n_reads_raw =sum(n_reads_raw)) %>%
+mutate(sample ="All samples", location ="All locations")
+hv_genera_counts_agg <-bind_rows(hv_genera_counts_raw, hv_genera_counts_all) %>%
+mutate(p_reads_hv = n_reads_hv/n_reads_raw)
+```
+
+```{r}
+#| fig-height: 6
+# Compute ranks for species and genera and restrict to high-ranking taxa
+max_rank_species <-5
+max_rank_genera <-5
+hv_species_counts_ranked <- hv_species_counts_agg %>%group_by(sample) %>%
+mutate(rank =rank(desc(n_reads_hv), ties.method="max"),
+highrank = rank <= max_rank_species) %>%
+group_by(name) %>%
+mutate(highrank_any =any(highrank),
+name_display =ifelse(highrank_any, name, "Other")) %>%
+group_by(name_display) %>%
+mutate(mean_rank =mean(rank)) %>%
+arrange(mean_rank) %>% ungroup %>%
+mutate(name_display =fct_inorder(name_display))
+hv_genera_counts_ranked <- hv_genera_counts_agg %>%group_by(sample) %>%
+mutate(rank =rank(desc(n_reads_hv), ties.method="max"),
+highrank = rank <= max_rank_genera) %>%
+group_by(name) %>%
+mutate(highrank_any =any(highrank),
+name_display =ifelse(highrank_any, name, "Other")) %>%
+group_by(name_display) %>%
+mutate(mean_rank =mean(rank)) %>%
+arrange(mean_rank) %>% ungroup %>%
+mutate(name_display =fct_inorder(name_display))
+
+# Plot composition
+palette_rank <-c(brewer.pal(8, "Dark2"), brewer.pal(8, "Set2"))
+g_vcomp_genera <-ggplot(hv_genera_counts_ranked, aes(x=sample, y=n_reads_hv, fill=name_display)) +
+geom_col(position ="fill") +
+scale_fill_manual(values = palette_rank, name ="Viral genus") +
+scale_y_continuous(name ="Fraction of HV reads", breaks =seq(0,1,0.2), expand =c(0,0)) +
+guides(fill =guide_legend(nrow=5)) +
+ theme_kit
+g_vcomp_species <-ggplot(hv_species_counts_ranked, aes(x=sample, y=n_reads_hv, fill=name_display)) +
+geom_col(position ="fill") +
+scale_fill_manual(values = palette_rank, name ="Viral species") +
+scale_y_continuous(name ="Fraction of HV reads", breaks =seq(0,1,0.2), expand =c(0,0)) +
+guides(fill =guide_legend(nrow=7)) +
+ theme_kit
+
+g_vcomp_genera
+g_vcomp_species
+```
+
+These results are consistent with what's been observed before, e.g. in the [P2RA preprint](https://doi.org/10.1101/2023.12.22.23300450) and public dashboard.
+
+# Conclusion
+
+I analyzed Yang et al. for two reasons: first, because it's included in the P2RA dataset that we're currently reanalyzing, and second, because it has the highest relative abundance of both total and human-infecting viruses of any wastewater study we've looked at so far. Having done this analysis, the key question to ask is: is this elevated viral abundance a consequence of the unusual sample-processing methods employed by the authors, or the region in which samples were collected? Both are plausible: the processing methods used in this paper are distinct from any other paper we've looked at, which makes it possible that they are significantly superior; on the other hand, it's also plausible that the methods are comparable in efficacy to more standard approaches, and the difference arises from a genuinely elevated level of viruses in Xinjiang compared to e.g. California (for Rothman and Crits-Christoph).
+
+The analyses conducted here aren't able to give a dispositive answer to that question, but they are suggestive. The fact that human-infecting viruses are strongly dominated by fecal-oral species is consistent with the elevated-infection theory, as these are the kinds of viruses I'd most expect to be elevated in a poorer region of the world compared to a rich one. On the other hand, the fact that total viruses (including plant viruses) are also elevated points in the other direction; I don't see any obvious reason why we'd expect more tobamoviruses in Xinjiang than California.
+
+Most compelling, from my perspective, is the very low fraction of bacterial and ribosomal reads found in the Yang data compared to other datasets I've analyzed. The methods make no mention of rRNA depletion prior to sequencing; even if this was conducted without being mentioned, that still doesn't explain the very low level of non-ribosomal bacterial reads. In general, I'd expect a region with elevated enteric viral pathogens to also show elevated enteric bacterial pathogens, so this absence is difficult to explain with a catchment-based theory. Instead, it points towards the viral enrichment protocols used by this study as potentially achieving genuinely better results than other methods we've tried.
+
+This certainly isn't conclusive evidence, or close to it. However, based on these results, I'd be very interested in learning more about the details of Yang et al.'s approach to viral enrichment, and in seeing it tried by another group in a different context to find out whether these impressive results can be replicated.
+
+
+
+
+
+
+
\ No newline at end of file
+ "text": "Workflow analysis of Yang et al. (2020)\n\n\nWastewater from Xinjiang.\n\n\n\n\n\n\nMar 16, 2024\n\n\n\n\n\n\n \n\n\n\n\nImproving read deduplication in the MGS workflow\n\n\nRemoving reverse-complement duplicates of human-viral reads.\n\n\n\n\n\n\nMar 1, 2024\n\n\n\n\n\n\n \n\n\n\n\nWorkflow analysis of Rothman et al. (2021), part 2\n\n\nPanel-enriched samples.\n\n\n\n\n\n\nFeb 29, 2024\n\n\n\n\n\n\n \n\n\n\n\nWorkflow analysis of Rothman et al. (2021), part 1\n\n\nUnenriched samples.\n\n\n\n\n\n\nFeb 27, 2024\n\n\n\n\n\n\n \n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 3\n\n\nFixing the virus pipeline.\n\n\n\n\n\n\nFeb 15, 2024\n\n\n\n\n\n\n \n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 2\n\n\nAbundance and composition of human-infecting viruses.\n\n\n\n\n\n\nFeb 8, 2024\n\n\n\n\n\n\n \n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 1\n\n\nPreprocessing and composition.\n\n\n\n\n\n\nFeb 4, 2024\n\n\n\n\n\n\n \n\n\n\n\nAutomating BLAST validation of human viral read assignment\n\n\nExperiments with BLASTN remote mode\n\n\n\n\n\n\nJan 30, 2024\n\n\n\n\n\n\n \n\n\n\n\nProject Runway RNA-seq testing data: removing livestock reads\n\n\n\n\n\n\n\n\n\nDec 22, 2023\n\n\n\n\n\n\n \n\n\n\n\nWorkflow analysis of Project Runway RNA-seq testing data\n\n\nApplying a new workflow to some oldish data.\n\n\n\n\n\n\nDec 19, 2023\n\n\n\n\n\n\n \n\n\n\n\nEstimating the effect of read depth on duplication rate for Project Runway DNA data\n\n\nHow deep can we go?\n\n\n\n\n\n\nNov 8, 2023\n\n\n\n\n\n\n \n\n\n\n\nComparing viral read assignments between pipelines on Project Runway data\n\n\n\n\n\n\n\n\n\nNov 2, 2023\n\n\n\n\n\n\n \n\n\n\n\nInitial analysis of Project Runway protocol testing data\n\n\n\n\n\n\n\n\n\nOct 31, 2023\n\n\n\n\n\n\n \n\n\n\n\nComparing options for read deduplication\n\n\nClumpify vs fastp\n\n\n\n\n\n\nOct 19, 2023\n\n\n\n\n\n\n \n\n\n\n\nComparing Ribodetector and bbduk for rRNA detection\n\n\nIn search of quick rRNA filtering.\n\n\n\n\n\n\nOct 16, 2023\n\n\n\n\n\n\n \n\n\n\n\nComparing FASTP and AdapterRemoval for MGS pre-processing\n\n\nTwo tools – how do they perform?\n\n\n\n\n\n\nOct 12, 2023\n\n\n\n\n\n\n \n\n\n\n\nHow does Element AVITI sequencing work?\n\n\nFindings of a shallow investigation\n\n\n\n\n\n\nOct 11, 2023\n\n\n\n\n\n\n \n\n\n\n\nExtraction experiment 2: high-level results & interpretation\n\n\nComparing RNA yields and quality across extraction kits for settled solids\n\n\n\n\n\n\nSep 21, 2023\n\n\n\n\n\n\nNo matching items"
},
{
"objectID": "notebooks/2023-10-12_fastp-vs-adapterremoval.html",
@@ -250,5 +250,12 @@
"title": "Improving read deduplication in the MGS workflow",
"section": "",
"text": "After investigating several options in previous entries, I settled on Clumpify for deduplication of reads for my MGS pipeline. Unfortunately, Clumpify as I’m currently running it has a key flaw, which it shares with most other deduplication tools that run on paired reads: it’s unable to detect “reverse-complement” duplicates in which the forward read of one pair matches the reverse read of another & vice-versa. Since the orientation of reads is random in many cases, this will leave a random subset of duplicates undetected.\nThis is tolerable in many instances, but is a problem for cases where we care a lot about getting accurate read counts, such as when estimating the relative abundance of human-infecting viruses. As such, it would be great to find a solution that lets us remove these extra duplicates prior to estimating viral relative abundance.\nClumpify does have a configuration where it unpairs reads prior to deduplication, allowing them to be deduplicated as individual reads. This solves the problem above, but (a) can lead to over-removal in cases where only one read in a pair is a duplicate, and (b) typically breaks on large datasets, I think due to memory issues. Despite looking at quite a number of possible options, I was unable to find a deduplication tool that met my desiderata of (i) running on paired reads, (ii) identifying duplicates in a sensible, error-tolerant way, and (iii) handling reverse-complement duplicates.\nAs a result, I turned to an alternative approach, which was to restrict deduplication of the whole read set to RC-insensitive Clumpify, then apply additional, more stringent deduplication to putative human-viral reads. During the process of HV read identification, downstream of Bowtie2 but upstream of Kraken2, putative viral read pairs are merged & concatenated to produce a single sequence per read pair. This makes deduplication much easier as I can run deduplication tools on the result without needing them to specifically handle paired reads. In particular, Clumpify and its sister program Dedupe should both work well on these sequences, removing duplicates in either orientation while correctly handling errors and “containments” (partial duplicates in which one sequence is completely contained within another).\nTo investigate this, I downloaded the putative human-viral reads for each sample in Crits-Christoph 2021, after identification with Bowtie2 and screening against potential contaminants but before taxonomic assignment with Kraken. 
I ran three deduplication tools on these sequences:\n\nClumpify in single-end mode\n\nclumpify.sh in=reads/${s}_bowtie2_mjc.fastq.gz out=clumpify/${s}.fastq.gz dedupe containment\n\nDedupe in single-end mode\n\ndedupe.sh in=reads/${s}_bowtie2_mjc.fastq.gz out=dedupe/${s}.fastq.gz\n\nrmdup from the seqkit package\n\nseqkit rmdup -so seqkit/${s}.fastq.gz reads/${s}_bowtie2_mjc.fastq.gz\nI then quantified the number of surviving sequences in each case with FASTQC and MultiQC.\n\nCode# Paths to data\ndata_dir <- \"../data/2024-02-29_dedup/\"\nraw_path <- file.path(data_dir, \"multiqc/multiqc_raw_fastqc.txt\")\nclumpify_path <- file.path(data_dir, \"multiqc/multiqc_clumpify_fastqc.txt\")\ndedupe_path <- file.path(data_dir, \"multiqc/multiqc_dedupe_fastqc.txt\")\nseqkit_path <- file.path(data_dir, \"multiqc/multiqc_seqkit_fastqc.txt\")\n\n# Import data\nraw <- read_tsv(raw_path, show_col_types = FALSE) %>%\n mutate(Sample = sub(\"_bowtie2_mjc\", \"\", Sample),\n Method = \"none\")\nclumpify <- read_tsv(clumpify_path, show_col_types = FALSE) %>% mutate(Method = \"clumpify\")\ndedupe <- read_tsv(dedupe_path, show_col_types = FALSE) %>% mutate(Method = \"dedupe\")\nseqkit <- read_tsv(seqkit_path, show_col_types = FALSE) %>% mutate(Method = \"seqkit\")\nprocessed <- bind_rows(raw, clumpify, dedupe, seqkit) %>%\n mutate(Method = fct_inorder(Method),\n seqs_abs = `Total Sequences`) %>%\n group_by(Sample) %>%\n mutate(seqs_rel = seqs_abs/max(seqs_abs))\n\n# Visualize\ng_dedup_abs <- ggplot(processed, aes(x=Sample, y=seqs_abs, fill=Method)) +\n geom_col(position = \"dodge\") +\n scale_fill_brewer(palette = \"Dark2\") +\n scale_y_continuous(name=\"# Surviving Reads\", expand = c(0,0), limits = c(0,140000), breaks = seq(0,200000,40000)) +\n theme_kit\ng_dedup_abs\n\n\n\nCodeg_dedup_rel <- ggplot(processed, aes(x=Sample, y=seqs_rel, fill=Method)) +\n geom_col(position = \"dodge\") +\n scale_fill_brewer(palette = \"Dark2\") +\n scale_y_continuous(name=\"% Surviving Reads\", breaks = seq(0,1,0.2), labels = function(x) x*100, expand = c(0,0)) +\n theme_kit\ng_dedup_rel\n\n\n\n\nClumpify consistently removed the most reads, with Dedupe close behind and seqkit’s rmdup performing much less well. 
As such, I decided to implement Clumpify to remove duplicates at this point in the pipeline, then look at the effect on predicted relative abundance of viruses in one previously-analyzed dataset.\nBrief re-analysis of Crits-Christoph 2021\nOn average, adding in RC-sensitive deduplication reduced overall HV relative abundance measurements by about 4% in unenriched samples and about 10% in panel-enriched samples in Crits-Christoph 2021:\n\nCodecc_dir_new <- file.path(data_dir, \"cc-rerun\")\ncc_dir_old <- file.path(data_dir, \"cc-prev\")\nbasic_stats_new_path <- file.path(cc_dir_new, \"qc_basic_stats.tsv\")\nbasic_stats_old_path <- file.path(cc_dir_old, \"qc_basic_stats.tsv\")\nhv_reads_new_path <- file.path(cc_dir_new, \"hv_hits_putative_filtered.tsv.gz\")\nhv_reads_old_path <- file.path(cc_dir_old, \"hv_hits_putative_filtered.tsv\")\nlibraries_path <- file.path(cc_dir_new, \"cc-libraries.txt\")\n\n# Get raw read counts\nbasic_stats_new <- read_tsv(basic_stats_new_path, show_col_types = FALSE)\nbasic_stats_old <- read_tsv(basic_stats_old_path, show_col_types = FALSE)\nread_counts_raw_new <- basic_stats_new %>% filter(stage == \"raw_concat\") %>% \n select(sample, n_reads_raw = n_read_pairs) %>%\n mutate(dedup_rc = TRUE)\nread_counts_raw_old <- basic_stats_old %>% filter(stage == \"raw_concat\") %>% \n select(sample, n_reads_raw = n_read_pairs) %>%\n mutate(dedup_rc = FALSE)\nread_counts_raw <- bind_rows(read_counts_raw_new, read_counts_raw_old)\n\n# Get HV read counts\nlibraries <- read_tsv(libraries_path, show_col_types = FALSE) %>%\n mutate(enrichment = str_to_title(enrichment))\nhv_reads_new <- read_tsv(hv_reads_new_path, show_col_types = FALSE) %>%\n inner_join(libraries, by=\"sample\") %>% \n arrange(enrichment, location, collection_date) %>%\n mutate(sample = fct_inorder(sample),\n adj_score_max = pmax(adj_score_fwd, adj_score_rev),\n dedup_rc = TRUE)\nhv_reads_old <- read_tsv(hv_reads_old_path, show_col_types = FALSE) %>%\n inner_join(libraries, by=\"sample\") %>% \n arrange(enrichment, location, collection_date) %>%\n mutate(sample = fct_inorder(sample),\n adj_score_max = pmax(adj_score_fwd, adj_score_rev),\n dedup_rc = FALSE)\nhv_reads <- bind_rows(hv_reads_new, hv_reads_old)\nhv_reads_cut <- hv_reads %>%\n mutate(hv_status = assigned_hv | hit_hv | adj_score_max >= 20)\nhv_reads_counts <- hv_reads_cut %>% group_by(sample, enrichment, dedup_rc) %>% count(name=\"n_reads_hv\")\n\n# Merge\nhv_reads_ra <- inner_join(hv_reads_counts, read_counts_raw, by=c(\"sample\", \"dedup_rc\")) %>%\n mutate(p_reads_hv = n_reads_hv/n_reads_raw)\nhv_reads_total <- hv_reads_ra %>% group_by(enrichment, dedup_rc) %>%\n summarize(n_reads_hv = sum(n_reads_hv), n_reads_raw = sum(n_reads_raw), .groups=\"drop\") %>%\n mutate(sample = paste(\"All\", str_to_lower(enrichment)), p_reads_hv = n_reads_hv/n_reads_raw)\nhv_reads_bound <- bind_rows(hv_reads_ra, hv_reads_total) %>% arrange(enrichment, dedup_rc)\nhv_reads_bound$sample <- fct_inorder(hv_reads_bound$sample)\n\n# Calculate effect of dedup on RA\nra_rel <- hv_reads_bound %>% ungroup %>% select(sample, enrichment, dedup_rc, p_reads_hv) %>%\n pivot_wider(names_from=\"dedup_rc\", values_from = \"p_reads_hv\", names_prefix = \"dedup_\") %>%\n mutate(rel_ra = dedup_TRUE/dedup_FALSE)\n\n\n\nCodeg_phv <- ggplot(hv_reads_bound, aes(x=sample, y=p_reads_hv, \n color=enrichment, shape=dedup_rc)) +\n geom_point() +\n scale_y_log10(\"Relative abundance of human virus reads\") +\n scale_color_brewer(palette=\"Dark2\", name=\"Panel enrichment\") +\n 
scale_shape_discrete(name=\"RC-sensitive deduplication\") +\n guides(color=\"none\") +\n facet_wrap(~enrichment, scales = \"free\") + \n theme_base + theme(axis.text.x = element_text(angle=45, hjust=1))\ng_phv\n\n\n\n\nTurning to specific viruses, just plotting out the relative abundances with and without deduplication isn’t terribly informative due to the wide log scales involved:\n\nCodeviral_taxa_path <- file.path(data_dir, \"viral-taxa.tsv.gz\")\nviral_taxa <- read_tsv(viral_taxa_path, show_col_types = FALSE)\n\n# Get viral taxon names for putative HV reads\nhv_named <- hv_reads_cut %>% left_join(viral_taxa, by=\"taxid\")\n\n# Discover viral species & genera for HV reads\nraise_rank <- function(read_db, taxid_db, out_rank = \"species\", verbose = FALSE){\n # Get higher ranks than search rank\n ranks <- c(\"subspecies\", \"species\", \"subgenus\", \"genus\", \"subfamily\", \"family\", \"suborder\", \"order\", \"class\", \"subphylum\", \"phylum\", \"kingdom\", \"superkingdom\")\n rank_match <- which.max(ranks == out_rank)\n high_ranks <- ranks[rank_match:length(ranks)]\n # Merge read DB and taxid DB\n reads <- read_db %>% select(-parent_taxid, -rank, -name) %>%\n left_join(taxid_db, by=\"taxid\")\n # Extract sequences that are already at appropriate rank\n reads_rank <- filter(reads, rank == out_rank)\n # Drop sequences at a higher rank and return unclassified sequences\n reads_norank <- reads %>% filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))\n while(nrow(reads_norank) > 0){ # As long as there are unclassified sequences...\n # Promote read taxids and re-merge with taxid DB, then re-classify and filter\n reads_remaining <- reads_norank %>% mutate(taxid = parent_taxid) %>%\n select(-parent_taxid, -rank, -name) %>%\n left_join(taxid_db, by=\"taxid\")\n reads_rank <- reads_remaining %>% filter(rank == out_rank) %>%\n bind_rows(reads_rank)\n reads_norank <- reads_remaining %>%\n filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))\n }\n # Finally, extract and append reads that were excluded during the process\n reads_dropped <- reads %>% filter(!seq_id %in% reads_rank$seq_id)\n reads_out <- reads_rank %>% bind_rows(reads_dropped) %>%\n select(-parent_taxid, -rank, -name) %>%\n left_join(taxid_db, by=\"taxid\")\n return(reads_out)\n}\n\nhv_reads_genera <- raise_rank(hv_named, viral_taxa, \"genus\")\n\n# Count relative abundance for genera\nhv_genera_counts_raw <- hv_reads_genera %>% \n group_by(sample, enrichment, name, dedup_rc) %>%\n count(name=\"n_reads_hv\") %>%\n inner_join(read_counts_raw, by=c(\"sample\", \"dedup_rc\")) %>%\n mutate(p_reads_hv = n_reads_hv/n_reads_raw)\nhv_genera_counts_all <- hv_genera_counts_raw %>% group_by(name, enrichment, dedup_rc) %>%\n summarize(n_reads_hv = sum(n_reads_hv),\n n_reads_raw = sum(n_reads_raw), .groups = \"drop\") %>%\n mutate(sample = \"All samples\")\nhv_genera_counts_agg <- bind_rows(hv_genera_counts_raw, hv_genera_counts_all) %>%\n mutate(p_reads_hv = n_reads_hv/n_reads_raw)\n\n\n\nCodehv_genera_counts_plot <- hv_genera_counts_agg %>%\n filter(sample == \"All samples\", !is.na(name))\ng_genera <- ggplot(hv_genera_counts_plot,\n aes(x=name, y=p_reads_hv, shape=dedup_rc, color=enrichment)) +\n geom_point() +\n scale_y_log10(\"Relative abundance of human virus reads\") +\n scale_color_brewer(palette=\"Dark2\", name=\"Panel enrichment\") +\n scale_shape_discrete(name=\"RC-sensitive deduplication\") +\n theme_kit + theme(plot.margin = margin(l=1, t=0.5, unit=\"cm\"))\ng_genera\n\n\n\n\n\nCodehv_genera_counts_wide <- 
hv_genera_counts_agg %>%\n select(sample, enrichment, name, dedup_rc, p_reads_hv) %>%\n pivot_wider(names_from=\"enrichment\", values_from=\"p_reads_hv\") %>%\n mutate(Enriched = replace_na(Enriched, 0),\n Unenriched = replace_na(Unenriched, 0),\n rel_enrichment = Enriched/Unenriched,\n log_enrichment = log10(rel_enrichment))\nhv_genera_counts_rel_plot <- hv_genera_counts_wide %>%\n filter(sample == \"All samples\", log_enrichment < \"Inf\", log_enrichment > \"-Inf\")\ng_rel_ra <- ggplot(hv_genera_counts_rel_plot,\n aes(x=name, y=log_enrichment, shape=dedup_rc)) +\n geom_point() +\n scale_y_continuous(name=\"Log10 enrichment in panel-enriched samples\",\n limits = c(0,5), breaks = seq(0,10,1), expand=c(0,0)) +\n scale_shape_discrete(name=\"RC-sensitive deduplication\") +\n theme_kit + theme(plot.margin = margin(l=1, t=0.5, unit=\"cm\"))\ng_rel_ra\n\n\n\n\nHowever, if we specifically look at the change in relative abundance with vs without deduplication, we do see some variation, with many viruses showing no change and others showing reductions in measured RA of up to 20%:\n\nCodehv_genera_counts_rel_dedup <- hv_genera_counts_agg %>%\n filter(sample == \"All samples\", !is.na(name)) %>%\n select(sample, enrichment, name, dedup_rc, p_reads_hv) %>%\n pivot_wider(names_from=\"dedup_rc\", names_prefix=\"dedup_\", \n values_from=\"p_reads_hv\", values_fill=0) %>%\n mutate(dedup_rel = dedup_TRUE/dedup_FALSE,\n dedup_rel_log = log10(dedup_rel))\ng_genera_rel_dedup <- ggplot(hv_genera_counts_rel_dedup,\n aes(x=name, y=dedup_rel, color=enrichment)) +\n geom_point(alpha=0.5) +\n scale_y_continuous(\"Viral relative abundance in deduplicated vs\\nnon-deduplicated samples\") +\n scale_color_brewer(palette=\"Dark2\", name=\"Panel enrichment\") +\n theme_kit + theme(plot.margin = margin(l=1, t=0.5, unit=\"cm\"))\ng_genera_rel_dedup\n\n\n\n\nGoing forward, I’ll include this additional deduplication step in analysis of future data."
+ },
+ {
+ "objectID": "notebooks/2024-03-16_yang.html",
+ "href": "notebooks/2024-03-16_yang.html",
+ "title": "Workflow analysis of Yang et al. (2020)",
+ "section": "",
+ "text": "As we work on expanding and updating the P2RA preprint for publication, I’m working on running the relevant datasets through my pipeline to compare the results to those from the original Kraken-based approach. I began by processing Yang et al. (2020), a study that carried out RNA viral metagenomics on raw sewage samples from Xinjiang, China.\nThe study collected 7 1L grab samples of raw sewage from the inlets of three WWTPs in Xinjiang, and processed them using an “anion filter membrane absorption” method quite distinct from the other studies I’ve looked at so far:\n\nInitially, samples were centrifuged to pellet cells and debris.\nThe supernatant was treated with magnesium chloride and HCl, then filtered through a mixed cellulose ester membrane filter, discarding the filtrate.\nMaterial was then eluted from the filters by ultrasonication with 3% beef extract, before going a second filtering step using a 0.22um filter, this time retaining the filtrate.\nNext, the samples underwent RNA extraction using the QIAamp viral RNA Mini Kit.\nFinally, sequencing libraries were generated using KAPA Hyper Prep Kit. No mention is made of rRNA depletion or panel enrichment.\n\nThis study is of particular interest because previous analysis found that the MGS data it generated contains a particularly high fraction of both total viruses and human-infecting viruses. It would be good to know whether this is attributable to the fact that the authors used an unusual method, or if it’s the result of genuine differences on the ground (e.g. an elevated prevalence of enteric disease compared to the USA).\nThe raw data\nThe Yang data is unusual in that it contains only a small number of samples (7 samples from 3 different locations) but each of these samples is sequenced quite deeply: the number of reads per sample varied from 162M to 283M, with an average of 203M. Taken together, the samples totaled roughly 1.4B read pairs (425 gigabases of sequence).\nRead qualities were consistently very high, but adapter levels were also elevated. 
The level of sequence duplication levels was very high according to FASTQC, with a minimum of 89% across all samples, suggesting a high degree of oversequencing; perhaps unsurprising given the sheer depth of these samples.\n\nCode# Data input paths\ndata_dir <- \"../data/2024-03-15_yang/\"\nlibraries_path <- file.path(data_dir, \"sample-metadata.csv\")\nbasic_stats_path <- file.path(data_dir, \"qc_basic_stats.tsv\")\nadapter_stats_path <- file.path(data_dir, \"qc_adapter_stats.tsv\")\nquality_base_stats_path <- file.path(data_dir, \"qc_quality_base_stats.tsv\")\nquality_seq_stats_path <- file.path(data_dir, \"qc_quality_sequence_stats.tsv\")\n\n# Import libraries and extract metadata from sample names\nlibraries <- read_csv(libraries_path, show_col_types = FALSE)\n\n# Import QC data\nstages <- c(\"raw_concat\", \"cleaned\", \"dedup\", \"ribo_initial\", \"ribo_secondary\")\nbasic_stats <- read_tsv(basic_stats_path, show_col_types = FALSE) %>%\n # mutate(sample = sub(\"_2$\", \"\", sample)) %>%\n inner_join(libraries, by=\"sample\") %>% \n mutate(stage = factor(stage, levels = stages),\n sample = fct_inorder(sample))\nadapter_stats <- read_tsv(adapter_stats_path, show_col_types = FALSE) %>%\n mutate(sample = sub(\"_2$\", \"\", sample)) %>%\n inner_join(libraries, by=\"sample\") %>%\n mutate(stage = factor(stage, levels = stages),\n read_pair = fct_inorder(as.character(read_pair)))\nquality_base_stats <- read_tsv(quality_base_stats_path, show_col_types = FALSE) %>%\n # mutate(sample = sub(\"_2$\", \"\", sample)) %>%\n inner_join(libraries, by=\"sample\") %>%\n mutate(stage = factor(stage, levels = stages),\n read_pair = fct_inorder(as.character(read_pair)))\nquality_seq_stats <- read_tsv(quality_seq_stats_path, show_col_types = FALSE) %>%\n # mutate(sample = sub(\"_2$\", \"\", sample)) %>%\n inner_join(libraries, by=\"sample\") %>%\n mutate(stage = factor(stage, levels = stages),\n read_pair = fct_inorder(as.character(read_pair)))\n\n# Filter to raw data\nbasic_stats_raw <- basic_stats %>% filter(stage == \"raw_concat\")\nadapter_stats_raw <- adapter_stats %>% filter(stage == \"raw_concat\")\nquality_base_stats_raw <- quality_base_stats %>% filter(stage == \"raw_concat\")\nquality_seq_stats_raw <- quality_seq_stats %>% filter(stage == \"raw_concat\")\n\n# Visualize basic stats\ng_nreads_raw <- ggplot(basic_stats_raw, aes(x=sample, y=n_read_pairs, fill=location)) + geom_col(position=\"dodge\") + scale_y_continuous(name=\"# Read pairs\", expand=c(0,0)) + scale_fill_brewer(palette=\"Set1\", name=\"Location\") + theme_kit\nlegend_location <- get_legend(g_nreads_raw)\ng_nreads_raw_2 <- g_nreads_raw + theme(legend.position = \"none\")\ng_nbases_raw <- ggplot(basic_stats_raw, aes(x=sample, y=n_bases_approx, fill=location)) + geom_col(position=\"dodge\") + scale_y_continuous(name=\"Total base pairs (approx)\", expand=c(0,0)) + scale_fill_brewer(palette=\"Set1\", name=\"Location\") + theme_kit + theme(legend.position = \"none\")\ng_ndup_raw <- ggplot(basic_stats_raw, aes(x=sample, y=percent_duplicates, fill=location)) + geom_col(position=\"dodge\") + scale_y_continuous(name=\"% Duplicates (FASTQC)\", expand=c(0,0), limits = c(0,100), breaks = seq(0,100,20)) + scale_fill_brewer(palette=\"Set1\", name=\"Location\") + theme_kit + theme(legend.position = \"none\")\ng_basic_raw <- plot_grid(g_nreads_raw_2 + g_nbases_raw + g_ndup_raw, legend_location, ncol = 1, rel_heights = c(1,0.1))\ng_basic_raw\n\n\n\nCode# Visualize adapters\ng_adapters_raw <- ggplot(adapter_stats_raw, aes(x=position, y=pc_adapters, 
color=location, linetype = read_pair, group=interaction(sample, read_pair))) + geom_line() +\n scale_color_brewer(palette = \"Set1\", name = \"Location\") +\n scale_linetype_discrete(name = \"Read Pair\") +\n scale_y_continuous(name=\"% Adapters\", limits=c(0,50),\n breaks = seq(0,50,10), expand=c(0,0)) +\n scale_x_continuous(name=\"Position\", limits=c(0,140),\n breaks=seq(0,140,20), expand=c(0,0)) +\n guides(color=guide_legend(nrow=2,byrow=TRUE),\n linetype = guide_legend(nrow=2,byrow=TRUE)) +\n facet_wrap(~adapter) + theme_base\ng_adapters_raw\n\n\n\nCode# Visualize quality\ng_quality_base_raw <- ggplot(quality_base_stats_raw, aes(x=position, y=mean_phred_score, color=location, linetype = read_pair, group=interaction(sample,read_pair))) +\n geom_hline(yintercept=25, linetype=\"dashed\", color=\"red\") +\n geom_hline(yintercept=30, linetype=\"dashed\", color=\"red\") +\n geom_line() +\n scale_color_brewer(palette = \"Set1\", name = \"Location\") +\n scale_linetype_discrete(name = \"Read Pair\") +\n scale_y_continuous(name=\"Mean Phred score\", expand=c(0,0), limits=c(10,45)) +\n scale_x_continuous(name=\"Position\", limits=c(0,140),\n breaks=seq(0,140,20), expand=c(0,0)) +\n guides(color=guide_legend(nrow=2,byrow=TRUE),\n linetype = guide_legend(nrow=2,byrow=TRUE)) +\n theme_base\ng_quality_seq_raw <- ggplot(quality_seq_stats_raw, aes(x=mean_phred_score, y=n_sequences, color=location, linetype = read_pair, group=interaction(sample,read_pair))) +\n geom_vline(xintercept=25, linetype=\"dashed\", color=\"red\") +\n geom_vline(xintercept=30, linetype=\"dashed\", color=\"red\") +\n geom_line() +\n scale_color_brewer(palette = \"Set1\", name = \"Location\") +\n scale_linetype_discrete(name = \"Read Pair\") +\n scale_x_continuous(name=\"Mean Phred score\", expand=c(0,0)) +\n scale_y_continuous(name=\"# Sequences\", expand=c(0,0)) +\n guides(color=guide_legend(nrow=2,byrow=TRUE),\n linetype = guide_legend(nrow=2,byrow=TRUE)) +\n theme_base\ng_quality_base_raw\n\n\n\nCodeg_quality_seq_raw\n\n\n\n\nPreprocessing\n\nCoden_reads_rel <- basic_stats %>% \n select(sample, location, stage, percent_duplicates, n_read_pairs) %>%\n group_by(sample, location) %>% arrange(sample, location, stage) %>%\n mutate(p_reads_retained = n_read_pairs / lag(n_read_pairs),\n p_reads_lost = 1 - p_reads_retained,\n p_reads_retained_abs = n_read_pairs / n_read_pairs[1],\n p_reads_lost_abs = 1-p_reads_retained_abs,\n p_reads_lost_abs_marginal = p_reads_lost_abs - lag(p_reads_lost_abs))\n\n\nOn average, cleaning & deduplication together removed about 86% of total reads from each sample, with the largest losses seen during deduplication. Only a few reads (low single-digit percentages at most) were lost during subsequent ribodepletion – a surprising finding given the apparent lack of rRNA depletion in the study methods. 
This makes me suspect that the methods did include rRNA depletion without mentioning it; if this isn’t the case, it suggests that the methods used by the authors might be particularly effective at removing bacteria from the sample.\n\nCode# Plot reads over preprocessing\ng_reads_stages <- ggplot(basic_stats, aes(x=stage, y=n_read_pairs,fill=location,group=sample)) +\n geom_col(position=\"dodge\") + scale_fill_brewer(palette=\"Set1\", name=\"Location\") +\n scale_y_continuous(\"# Read pairs\", expand=c(0,0)) +\n theme_kit\ng_reads_stages\n\n\n\nCode# Plot relative read losses during preprocessing\ng_reads_rel <- ggplot(n_reads_rel, aes(x=stage, y=p_reads_lost_abs_marginal,fill=location,group=sample)) +\n geom_col(position=\"dodge\") + scale_fill_brewer(palette=\"Set1\", name=\"Location\") +\n scale_y_continuous(\"% Total Reads Lost\", expand=c(0,0), labels = function(x) x*100) +\n theme_kit\ng_reads_rel\n\n\n\n\nData cleaning with FASTP was very successful at removing adapters, with very few adapter sequences found by FASTQC at any stage after the raw data. Improvements in quality were unsurprisingly minimal, given the high initial quality observed.\n\nCode# Visualize adapters\ng_adapters <- ggplot(adapter_stats, aes(x=position, y=pc_adapters, color=location, linetype = read_pair, group=interaction(sample, read_pair))) + geom_line() +\n scale_color_brewer(palette = \"Set1\", name = \"Location\") +\n scale_linetype_discrete(name = \"Read Pair\") +\n scale_y_continuous(name=\"% Adapters\", limits=c(0,50),\n breaks = seq(0,50,10), expand=c(0,0)) +\n scale_x_continuous(name=\"Position\", limits=c(0,140),\n breaks=seq(0,140,20), expand=c(0,0)) +\n guides(color=guide_legend(nrow=2,byrow=TRUE),\n linetype = guide_legend(nrow=2,byrow=TRUE)) +\n facet_grid(stage~adapter) + theme_base\ng_adapters\n\n\n\nCode# Visualize quality\ng_quality_base <- ggplot(quality_base_stats, aes(x=position, y=mean_phred_score, color=location, linetype = read_pair, group=interaction(sample,read_pair))) +\n geom_hline(yintercept=25, linetype=\"dashed\", color=\"red\") +\n geom_hline(yintercept=30, linetype=\"dashed\", color=\"red\") +\n geom_line() +\n scale_color_brewer(palette = \"Set1\", name = \"Location\") +\n scale_linetype_discrete(name = \"Read Pair\") +\n scale_y_continuous(name=\"Mean Phred score\", expand=c(0,0), limits=c(10,45)) +\n scale_x_continuous(name=\"Position\", limits=c(0,140),\n breaks=seq(0,140,20), expand=c(0,0)) +\n guides(color=guide_legend(nrow=2,byrow=TRUE),\n linetype = guide_legend(nrow=2,byrow=TRUE)) +\n facet_grid(stage~.) + theme_base\ng_quality_seq <- ggplot(quality_seq_stats, aes(x=mean_phred_score, y=n_sequences, color=location, linetype = read_pair, group=interaction(sample,read_pair))) +\n geom_vline(xintercept=25, linetype=\"dashed\", color=\"red\") +\n geom_vline(xintercept=30, linetype=\"dashed\", color=\"red\") +\n geom_line() +\n scale_color_brewer(palette = \"Set1\", name = \"Location\") +\n scale_linetype_discrete(name = \"Read Pair\") +\n scale_x_continuous(name=\"Mean Phred score\", expand=c(0,0)) +\n scale_y_continuous(name=\"# Sequences\", expand=c(0,0)) +\n guides(color=guide_legend(nrow=2,byrow=TRUE),\n linetype = guide_legend(nrow=2,byrow=TRUE)) +\n facet_grid(stage~., scales=\"free_y\") + theme_base\ng_quality_base\n\n\n\nCodeg_quality_seq\n\n\n\n\nAccording to FASTQC, deduplication was only moderately effective at reducing measured duplicate levels, despite the high read losses observed. 
After deduplication, FASTQC-measured duplicate levels fell from an average of 92% to one of 58%:\n\nCodeg_dup_stages <- ggplot(basic_stats, aes(x=stage, y=percent_duplicates, fill=location, group=sample)) +\n geom_col(position=\"dodge\") + scale_fill_brewer(palette = \"Set1\", name=\"Location\") +\n scale_y_continuous(\"% Duplicates\", limits=c(0,100), breaks=seq(0,100,20), expand=c(0,0)) +\n theme_kit\ng_readlen_stages <- ggplot(basic_stats, aes(x=stage, y=mean_seq_len, fill=location, group=sample)) +\n geom_col(position=\"dodge\") + scale_fill_brewer(palette = \"Set1\", name=\"Location\") +\n scale_y_continuous(\"Mean read length (nt)\", expand=c(0,0)) +\n theme_kit\nlegend_loc <- get_legend(g_dup_stages)\ng_dup_stages\n\n\n\nCodeg_readlen_stages\n\n\n\n\nHigh-level composition\nAs before, to assess the high-level composition of the reads, I ran the ribodepleted files through Kraken (using the Standard 16 database) and summarized the results with Bracken. Combining these results with the read counts above gives us a breakdown of the inferred composition of the samples:\n\nCode# Import Bracken data\nbracken_path <- file.path(data_dir, \"bracken_counts.tsv\")\nbracken <- read_tsv(bracken_path, show_col_types = FALSE)\ntotal_assigned <- bracken %>% group_by(sample) %>% summarize(\n name = \"Total\",\n kraken_assigned_reads = sum(kraken_assigned_reads),\n added_reads = sum(added_reads),\n new_est_reads = sum(new_est_reads),\n fraction_total_reads = sum(fraction_total_reads)\n)\nbracken_spread <- bracken %>% select(name, sample, new_est_reads) %>%\n mutate(name = tolower(name)) %>%\n pivot_wider(id_cols = \"sample\", names_from = \"name\", values_from = \"new_est_reads\")\n# Count reads\nread_counts_preproc <- basic_stats %>% \n select(sample, location, stage, n_read_pairs) %>%\n pivot_wider(id_cols = c(\"sample\", \"location\"), names_from=\"stage\", values_from=\"n_read_pairs\")\nread_counts <- read_counts_preproc %>%\n inner_join(total_assigned %>% select(sample, new_est_reads), by = \"sample\") %>%\n rename(assigned = new_est_reads) %>%\n inner_join(bracken_spread, by=\"sample\")\n# Assess composition\nread_comp <- transmute(read_counts, sample=sample, location=location,\n n_filtered = raw_concat-cleaned,\n n_duplicate = cleaned-dedup,\n n_ribosomal = (dedup-ribo_initial) + (ribo_initial-ribo_secondary),\n n_unassigned = ribo_secondary-assigned,\n n_bacterial = bacteria,\n n_archaeal = archaea,\n n_viral = viruses,\n n_human = eukaryota)\nread_comp_long <- pivot_longer(read_comp, -(sample:location), names_to = \"classification\",\n names_prefix = \"n_\", values_to = \"n_reads\") %>%\n mutate(classification = fct_inorder(str_to_sentence(classification))) %>%\n group_by(sample) %>% mutate(p_reads = n_reads/sum(n_reads))\nread_comp_summ <- read_comp_long %>% \n group_by(location, classification) %>%\n summarize(n_reads = sum(n_reads), .groups = \"drop_last\") %>%\n mutate(n_reads = replace_na(n_reads,0),\n p_reads = n_reads/sum(n_reads),\n pc_reads = p_reads*100)\n \n# Plot overall composition\ng_comp <- ggplot(read_comp_long, aes(x=sample, y=p_reads, fill=classification)) +\n geom_col(position = \"stack\") +\n scale_y_continuous(name = \"% Reads\", limits = c(0,1.01), breaks = seq(0,1,0.2),\n expand = c(0,0), labels = function(x) x*100) +\n scale_fill_brewer(palette = \"Set1\", name = \"Classification\") +\n facet_wrap(.~location, scales=\"free\") +\n theme_kit\ng_comp\n\n\n\nCode# Plot composition of minor components\nread_comp_minor <- read_comp_long %>% filter(classification %in% 
c(\"Archaeal\", \"Viral\", \"Human\", \"Other\"))\npalette_minor <- brewer.pal(9, \"Set1\")[6:9]\ng_comp_minor <- ggplot(read_comp_minor, aes(x=sample, y=p_reads, fill=classification)) +\n geom_col(position = \"stack\") +\n scale_y_continuous(name = \"% Reads\", breaks = seq(0,0.1,0.01),\n expand = c(0,0), labels = function(x) x*100) +\n scale_fill_manual(values=palette_minor, name = \"Classification\") +\n facet_wrap(.~location, scales = \"free_x\") +\n theme_kit\ng_comp_minor\n\n\n\n\n\nCodep_reads_summ <- read_comp_long %>%\n mutate(classification = ifelse(classification %in% c(\"Filtered\", \"Duplicate\", \"Unassigned\"), \"Excluded\", as.character(classification))) %>%\n group_by(classification, sample, location) %>%\n summarize(p_reads = sum(p_reads), .groups = \"drop\") %>%\n group_by(classification) %>%\n summarize(pc_min = min(p_reads)*100, pc_max = max(p_reads)*100, pc_mean = mean(p_reads)*100)\n\n\nOverall, in these unenriched samples, between 89% and 95% of reads (mean 93%) are low-quality, duplicates, or unassigned. Of the remainder, 0.4-3.9% (mean 1.6%) were identified as ribosomal, and 0.06-0.22% (mean 0.13%) were otherwise classified as bacterial. The fraction of human reads ranged from 0.006-0.198% (mean 0.049%). Strikingly, those of viral reads ranged from 0.6% all the way up to 10.2% (mean 5.5%).\nThe total fraction of viral reads is strikingly high, even higher on average than for Rothman unenriched samples. As in Rothman, this is primarily accounted for by very high levels of tobamoviruses (viral family Virgaviridae), which make up 74-98% (mean 84%) of viral reads. Most of the remainder is accounted for by Tombusviridae and Solemoviridae, two other families of plant viruses. Only one family of vertebrate viruses, Astroviridae, accounted for more than 1% of viral reads in any sample.\n\nCodeviral_taxa_path <- file.path(data_dir, \"viral-taxids.tsv.gz\")\nviral_taxa <- read_tsv(viral_taxa_path, show_col_types = FALSE)\nsamples <- as.character(basic_stats_raw$sample)\ncol_names <- c(\"pc_reads_total\", \"n_reads_clade\", \"n_reads_direct\", \"rank\", \"taxid\", \"name\")\nreport_paths <- paste0(data_dir, \"kraken/\", samples, \".report.gz\")\nkraken_reports <- lapply(1:length(samples), \n function(n) read_tsv(report_paths[n], col_names = col_names,\n show_col_types = FALSE) %>%\n mutate(sample = samples[n])) %>% bind_rows\nkraken_reports_viral <- filter(kraken_reports, taxid %in% viral_taxa$taxid) %>%\n group_by(sample) %>%\n mutate(p_reads_viral = n_reads_clade/n_reads_clade[1])\nviral_families <- kraken_reports_viral %>% filter(rank == \"F\")\nviral_families_major_list <- viral_families %>% group_by(name, taxid) %>%\n summarize(p_reads_viral_max = max(p_reads_viral), .groups=\"drop\") %>%\n filter(p_reads_viral_max >= 0.01) %>% pull(name)\nviral_families_major <- viral_families %>% filter(name %in% viral_families_major_list) %>%\n select(name, taxid, sample, p_reads_viral)\nviral_families_minor <- viral_families_major %>% group_by(sample) %>%\n summarize(p_reads_viral_major = sum(p_reads_viral)) %>%\n mutate(name = \"Other\", taxid=NA, p_reads_viral = 1-p_reads_viral_major) %>%\n select(name, taxid, sample, p_reads_viral)\nviral_families_display <- bind_rows(viral_families_major, viral_families_minor) %>%\n arrange(desc(p_reads_viral)) %>% mutate(name = factor(name, levels=c(viral_families_major_list, \"Other\")))\ng_families <- ggplot(viral_families_display, aes(x=sample, y=p_reads_viral, fill=name)) +\n geom_col(position=\"stack\") +\n scale_fill_brewer(palette = 
\"Dark2\", name = \"Viral family\") +\n scale_y_continuous(name=\"% Viral Reads\", limits=c(0,1), breaks = seq(0,1,0.2),\n expand=c(0,0), labels = function(y) y*100) +\n theme_kit\ng_families\n\n\n\n\nAlmost as striking as the high fraction of viral reads in these data is the extremely low prevalence of bacterial reads. For comparison, non-ribosomal bacterial reads in unenriched Crits-Christoph samples ranged from 16-20%, and in unenriched Rothman samples it ranged from 2-12%. Yang et al. are thus somehow achieving a greater than 10-fold reduction in the fraction of bacterial reads compared to other studies. Unlike the increased prevalence of viruses, this seems harder to explain away via differences in conditions on the ground. Together with the elevated viral fraction, these results suggest to me that Yang’s ideosyncratic methods are worth investigating further.\nHuman-infecting virus reads: validation\nNext, I investigated the human-infecting virus read content of these unenriched samples, using a pipeline identical to that described in my entry on unenriched Rothman samples. This process identified a total of 532,583 read pairs across all samples (0.3% of surviving reads):\n\nCodehv_reads_filtered_path <- file.path(data_dir, \"hv_hits_putative_filtered.tsv.gz\")\nhv_reads_filtered <- read_tsv(hv_reads_filtered_path, show_col_types = FALSE) %>%\n inner_join(libraries, by=\"sample\") %>% \n arrange(location) %>%\n mutate(sample = fct_inorder(sample))\nn_hv_filtered <- hv_reads_filtered %>% group_by(sample, location) %>% count %>% inner_join(basic_stats %>% filter(stage == \"ribo_initial\") %>% select(sample, n_read_pairs), by=\"sample\") %>% rename(n_putative = n, n_total = n_read_pairs) %>% mutate(p_reads = n_putative/n_total, pc_reads = p_reads * 100)\n\n\n\nCodemrg <- hv_reads_filtered %>%\n mutate(kraken_label = ifelse(assigned_hv, \"Kraken2 HV\\nassignment\",\n ifelse(hit_hv, \"Kraken2 HV\\nhit\",\n \"No hit or\\nassignment\"))) %>%\n group_by(sample) %>% arrange(desc(adj_score_fwd), desc(adj_score_rev)) %>%\n mutate(seq_num = row_number(),\n adj_score_max = pmax(adj_score_fwd, adj_score_rev))\n# Import Bowtie2/Kraken data and perform filtering steps\ng_mrg <- ggplot(mrg, aes(x=adj_score_fwd, y=adj_score_rev)) +\n geom_point(alpha=0.5, shape=16) + \n scale_x_continuous(name=\"S(forward read)\", limits=c(0,40), breaks=seq(0,100,10), expand = c(0,0)) +\n scale_y_continuous(name=\"S(reverse read)\", limits=c(0,40), breaks=seq(0,100,10), expand = c(0,0)) +\n facet_wrap(~kraken_label, labeller = labeller(kit = label_wrap_gen(20))) +\n theme_base + theme(aspect.ratio=1)\ng_mrg\n\n\n\nCodeg_hist_0 <- ggplot(mrg, aes(x=adj_score_max)) + geom_histogram(binwidth=5,boundary=0,position=\"dodge\") + facet_wrap(~kraken_label) + scale_x_continuous(name = \"Maximum adjusted alignment score\") + scale_y_continuous(name=\"# Read pairs\") + scale_fill_brewer(palette = \"Dark2\") + theme_base + geom_vline(xintercept=20, linetype=\"dashed\", color=\"red\")\ng_hist_0\n\n\n\n\nThis is a far larger number of putative viral reads than I previously needed to analyze, and is far too many to realistically process with BLASTN. 
To address this problem, I selected 1% of putative viral sequences (5,326) and ran them through BLASTN in a similar manner to past datasets:\n\nCodeset.seed(83745)\nmrg_sample <- sample_frac(mrg, 0.01)\nmrg_num <- mrg_sample %>% group_by(sample) %>%\n arrange(sample, desc(adj_score_max)) %>%\n mutate(seq_num = row_number(), seq_head = paste0(\">\", sample, \"_\", seq_num))\nmrg_fasta <- mrg_num %>% \n ungroup %>%\n select(header1=seq_head, seq1=query_seq_fwd, header2=seq_head, seq2=query_seq_rev) %>%\n mutate(header1=paste0(header1, \"_1\"), header2=paste0(header2, \"_2\"))\nmrg_fasta_out <- do.call(paste, c(mrg_fasta, sep=\"\\n\")) %>% paste(collapse=\"\\n\")\nwrite(mrg_fasta_out, file.path(data_dir, \"yang-putative-viral.fasta\"))\n\n\n\nCode# Import BLAST results\nblast_results_path <- file.path(data_dir, \"yang-putative-viral-all.blast.gz\")\nblast_cols <- c(\"qseqid\", \"sseqid\", \"sgi\", \"staxid\", \"qlen\", \"evalue\", \"bitscore\", \"qcovs\", \"length\", \"pident\", \"mismatch\", \"gapopen\", \"sstrand\", \"qstart\", \"qend\", \"sstart\", \"send\")\nblast_results <- read_tsv(blast_results_path, show_col_types = FALSE, \n col_names = blast_cols, col_types = cols(.default=\"c\"))\n\n# Add viral status\nblast_results_viral <- mutate(blast_results, viral = staxid %in% viral_taxa$taxid)\n\n# Filter for best hit for each query/subject, then rank hits for each query\nblast_results_best <- blast_results_viral %>% group_by(qseqid, staxid) %>% \n filter(bitscore == max(bitscore)) %>%\n filter(length == max(length)) %>% filter(row_number() == 1)\nblast_results_ranked <- blast_results_best %>% \n group_by(qseqid) %>% mutate(rank = dense_rank(desc(bitscore)))\nblast_results_highrank <- blast_results_ranked %>% filter(rank <= 5) %>%\n mutate(read_pair = str_split(qseqid, \"_\") %>% sapply(nth, n=-1), \n seq_num = str_split(qseqid, \"_\") %>% sapply(nth, n=-2), \n sample = str_split(qseqid, \"_\") %>% lapply(head, n=-2) %>% \n sapply(paste, collapse=\"_\")) %>%\n mutate(bitscore = as.numeric(bitscore), seq_num = as.numeric(seq_num))\n\n# Summarize by read pair and taxid\nblast_results_paired <- blast_results_highrank %>%\n group_by(sample, seq_num, staxid, viral) %>%\n summarize(bitscore_max = max(bitscore), bitscore_min = min(bitscore),\n n_reads = n(), .groups = \"drop\") %>%\n mutate(viral_full = viral & n_reads == 2)\n\n# Compare to Kraken & Bowtie assignments\nmrg_assign <- mrg_num %>% select(sample, seq_num, taxid, assigned_taxid)\nblast_results_assign <- left_join(blast_results_paired, mrg_assign, by=c(\"sample\", \"seq_num\")) %>%\n mutate(taxid_match_bowtie = (staxid == taxid),\n taxid_match_kraken = (staxid == assigned_taxid),\n taxid_match_any = taxid_match_bowtie | taxid_match_kraken)\nblast_results_out <- blast_results_assign %>%\n group_by(sample, seq_num) %>%\n summarize(viral_status = ifelse(any(viral_full), 2,\n ifelse(any(taxid_match_any), 2,\n ifelse(any(viral), 1, 0))),\n .groups = \"drop\")\n\n\n\nCode# Merge BLAST results with unenriched read data\nmrg_blast <- full_join(mrg_num, blast_results_out, by=c(\"sample\", \"seq_num\")) %>%\n mutate(viral_status = replace_na(viral_status, 0),\n viral_status_out = ifelse(viral_status == 0, FALSE, TRUE))\n\n# Plot\ng_mrg_blast_0 <- mrg_blast %>%\n ggplot(aes(x=adj_score_fwd, y=adj_score_rev, color=viral_status_out)) +\n geom_point(alpha=0.5, shape=16) +\n scale_x_continuous(name=\"S(forward read)\", limits=c(0,40), breaks=seq(0,100,10), expand = c(0,0)) +\n scale_y_continuous(name=\"S(reverse read)\", limits=c(0,40), 
breaks=seq(0,100,10), expand = c(0,0)) +\n scale_color_brewer(palette = \"Set1\", name = \"Viral status\") +\n facet_grid(viral_status_out~kraken_label, labeller = labeller(kit = label_wrap_gen(20))) +\n theme_base + theme(aspect.ratio=1)\ng_mrg_blast_0\n\n\n\n\n\nCodeg_hist_1 <- ggplot(mrg_blast, aes(x=adj_score_max)) + geom_histogram(binwidth=5,boundary=0,position=\"dodge\") + facet_wrap(~viral_status_out) + scale_x_continuous(name = \"Maximum adjusted alignment score\") + scale_y_continuous(name=\"# Read pairs\") + scale_fill_brewer(palette = \"Dark2\") + theme_base + geom_vline(xintercept=20, linetype=\"dashed\", color=\"red\")\ng_hist_1\n\n\n\n\nHigh-scoring true-positives massively outnumber false-positives, while low-scoring true-positives are fairly rare. My usual score cutoff, a disjunctive threshold at S=20, achieves a sensitivity of ~97% and an F1 score of ~98%. As such, I think this looks pretty good, and feel comfortable moving on to analyzing the entire HV dataset.\n\nCodetest_sens_spec <- function(tab, score_threshold, conjunctive, include_special){\n if (!include_special) tab <- filter(tab, viral_status_out %in% c(\"TRUE\", \"FALSE\"))\n tab_retained <- tab %>% mutate(\n conjunctive = conjunctive,\n retain_score_conjunctive = (adj_score_fwd > score_threshold & adj_score_rev > score_threshold), \n retain_score_disjunctive = (adj_score_fwd > score_threshold | adj_score_rev > score_threshold),\n retain_score = ifelse(conjunctive, retain_score_conjunctive, retain_score_disjunctive),\n retain = assigned_hv | hit_hv | retain_score) %>%\n group_by(viral_status_out, retain) %>% count\n pos_tru <- tab_retained %>% filter(viral_status_out == \"TRUE\", retain) %>% pull(n) %>% sum\n pos_fls <- tab_retained %>% filter(viral_status_out != \"TRUE\", retain) %>% pull(n) %>% sum\n neg_tru <- tab_retained %>% filter(viral_status_out != \"TRUE\", !retain) %>% pull(n) %>% sum\n neg_fls <- tab_retained %>% filter(viral_status_out == \"TRUE\", !retain) %>% pull(n) %>% sum\n sensitivity <- pos_tru / (pos_tru + neg_fls)\n specificity <- neg_tru / (neg_tru + pos_fls)\n precision <- pos_tru / (pos_tru + pos_fls)\n f1 <- 2 * precision * sensitivity / (precision + sensitivity)\n out <- tibble(threshold=score_threshold, include_special = include_special, \n conjunctive = conjunctive, sensitivity=sensitivity, \n specificity=specificity, precision=precision, f1=f1)\n return(out)\n}\nrange_f1 <- function(intab, inc_special, inrange=15:45){\n tss <- purrr::partial(test_sens_spec, tab=intab, include_special=inc_special)\n stats_conj <- lapply(inrange, tss, conjunctive=TRUE) %>% bind_rows\n stats_disj <- lapply(inrange, tss, conjunctive=FALSE) %>% bind_rows\n stats_all <- bind_rows(stats_conj, stats_disj) %>%\n pivot_longer(!(threshold:conjunctive), names_to=\"metric\", values_to=\"value\") %>%\n mutate(conj_label = ifelse(conjunctive, \"Conjunctive\", \"Disjunctive\"))\n return(stats_all)\n}\ninc_special <- FALSE\nstats_0 <- range_f1(mrg_blast, inc_special) %>% mutate(attempt=0)\nthreshold_opt_0 <- stats_0 %>% group_by(conj_label,attempt) %>% filter(metric == \"f1\") %>%\n filter(value == max(value)) %>% filter(threshold == min(threshold))\ng_stats_0 <- ggplot(stats_0, aes(x=threshold, y=value, color=metric)) +\n geom_vline(data = threshold_opt_0, mapping = aes(xintercept=threshold),\n color = \"red\", linetype = \"dashed\") +\n geom_line() +\n scale_y_continuous(name = \"Value\", limits=c(0,1), breaks = seq(0,1,0.2), expand = c(0,0)) +\n scale_x_continuous(name = \"Threshold\", expand = c(0,0)) +\n 
scale_color_brewer(palette=\"Set3\") +\n facet_wrap(~conj_label) +\n theme_base\ng_stats_0\n\n\n\n\nHuman-infecting virus reads: analysis\n\nCode# Get raw read counts\nread_counts_raw <- basic_stats_raw %>%\n select(sample, location, n_reads_raw = n_read_pairs)\n\n# Get HV read counts & RA\nmrg_hv <- mrg %>%\n mutate(hv_status = assigned_hv | hit_hv | adj_score_max >= 20)\nread_counts_hv <- mrg_hv %>% filter(hv_status) %>% group_by(sample) %>% count(name = \"n_reads_hv\")\nread_counts <- read_counts_raw %>%\n left_join(read_counts_hv, by=\"sample\") %>%\n mutate(n_reads_hv = replace_na(n_reads_hv, 0),\n p_reads_hv = n_reads_hv/n_reads_raw)\n\n# Aggregate\nread_counts_total <- read_counts %>% ungroup %>%\n summarize(n_reads_raw = sum(n_reads_raw),\n n_reads_hv = sum(n_reads_hv)) %>%\n mutate(p_reads_hv = n_reads_hv/n_reads_raw,\n sample= \"All samples\", location = \"All locations\")\nread_counts_agg <- read_counts %>%\n bind_rows(read_counts_total) %>%\n mutate(sample = fct_inorder(sample),\n location = fct_inorder(location))\n\n\nApplying a disjunctive cutoff at S=20 identifies 513,259 reads as human-viral out of 1.42B total reads, for a relative HV abundance of \\(3.62 \\times 10^{-4}\\). This compares to \\(2.0 \\times 10^{-4}\\) on the public dashboard, corresponding to the results for Kraken-only identification: an 80% increase, smaller than the 4-5x increases seen for Crits-Christoph and Rothman. Relative HV abundances for individual samples ranged from \\(6.36 \\times 10^{-5}\\) to \\(1.19 \\times 10^{-3}\\):\n\nCode# Visualize\ng_phv_agg <- ggplot(read_counts_agg, aes(x=sample, y=p_reads_hv, color=location)) +\n geom_point() +\n scale_y_log10(\"Relative abundance of human virus reads\") +\n scale_color_brewer(palette = \"Set1\", name = \"Location\") +\n theme_kit\ng_phv_agg\n\n\n\n\nDigging into individual viruses, we see that fecal-oral viruses predominate, especially Mamastrovirus, Rotavirus, Salivirus, and fecal-oral Enterovirus species (especially Enterovirus C, which includes poliovirus):\n\nCode# Get viral taxon names for putative HV reads\nviral_taxa$name[viral_taxa$taxid == 249588] <- \"Mamastrovirus\"\nviral_taxa$name[viral_taxa$taxid == 194960] <- \"Kobuvirus\"\nviral_taxa$name[viral_taxa$taxid == 688449] <- \"Salivirus\"\nmrg_hv_named <- mrg_hv %>% left_join(viral_taxa, by=\"taxid\")\n\n# Discover viral species & genera for HV reads\nraise_rank <- function(read_db, taxid_db, out_rank = \"species\", verbose = FALSE){\n # Get higher ranks than search rank\n ranks <- c(\"subspecies\", \"species\", \"subgenus\", \"genus\", \"subfamily\", \"family\", \"suborder\", \"order\", \"class\", \"subphylum\", \"phylum\", \"kingdom\", \"superkingdom\")\n rank_match <- which.max(ranks == out_rank)\n high_ranks <- ranks[rank_match:length(ranks)]\n # Merge read DB and taxid DB\n reads <- read_db %>% select(-parent_taxid, -rank, -name) %>%\n left_join(taxid_db, by=\"taxid\")\n # Extract sequences that are already at appropriate rank\n reads_rank <- filter(reads, rank == out_rank)\n # Drop sequences at a higher rank and return unclassified sequences\n reads_norank <- reads %>% filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))\n while(nrow(reads_norank) > 0){ # As long as there are unclassified sequences...\n # Promote read taxids and re-merge with taxid DB, then re-classify and filter\n reads_remaining <- reads_norank %>% mutate(taxid = parent_taxid) %>%\n select(-parent_taxid, -rank, -name) %>%\n left_join(taxid_db, by=\"taxid\")\n reads_rank <- reads_remaining %>% filter(rank 
== out_rank) %>%\n bind_rows(reads_rank)\n reads_norank <- reads_remaining %>%\n filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))\n }\n # Finally, extract and append reads that were excluded during the process\n reads_dropped <- reads %>% filter(!seq_id %in% reads_rank$seq_id)\n reads_out <- reads_rank %>% bind_rows(reads_dropped) %>%\n select(-parent_taxid, -rank, -name) %>%\n left_join(taxid_db, by=\"taxid\")\n return(reads_out)\n}\nhv_reads_species <- raise_rank(mrg_hv_named, viral_taxa, \"species\")\nhv_reads_genera <- raise_rank(mrg_hv_named, viral_taxa, \"genus\")\n\n\n\nCode# Count relative abundance for species\nhv_species_counts_raw <- hv_reads_species %>% group_by(sample, location, name) %>%\n summarize(n_reads_hv = sum(hv_status), .groups = \"drop\") %>%\n inner_join(read_counts_agg %>% select(sample, n_reads_raw), by=\"sample\")\nhv_species_counts_all <- hv_species_counts_raw %>% group_by(name) %>%\n summarize(n_reads_hv = sum(n_reads_hv),\n n_reads_raw = sum(n_reads_raw)) %>%\n mutate(sample = \"All samples\", location = \"All locations\")\nhv_species_counts_agg <- bind_rows(hv_species_counts_raw, hv_species_counts_all) %>%\n mutate(p_reads_hv = n_reads_hv/n_reads_raw)\n\n# Count relative abundance for genera\nhv_genera_counts_raw <- hv_reads_genera %>% group_by(sample, location, name) %>%\n summarize(n_reads_hv = sum(hv_status), .groups = \"drop\") %>%\n inner_join(read_counts_agg %>% select(sample, n_reads_raw), by=\"sample\")\nhv_genera_counts_all <- hv_genera_counts_raw %>% group_by(name) %>%\n summarize(n_reads_hv = sum(n_reads_hv),\n n_reads_raw = sum(n_reads_raw)) %>%\n mutate(sample = \"All samples\", location = \"All locations\")\nhv_genera_counts_agg <- bind_rows(hv_genera_counts_raw, hv_genera_counts_all) %>%\n mutate(p_reads_hv = n_reads_hv/n_reads_raw)\n\n\n\nCode# Compute ranks for species and genera and restrict to high-ranking taxa\nmax_rank_species <- 5\nmax_rank_genera <- 5\nhv_species_counts_ranked <- hv_species_counts_agg %>% group_by(sample) %>%\n mutate(rank = rank(desc(n_reads_hv), ties.method=\"max\"),\n highrank = rank <= max_rank_species) %>%\n group_by(name) %>%\n mutate(highrank_any = any(highrank),\n name_display = ifelse(highrank_any, name, \"Other\")) %>%\n group_by(name_display) %>%\n mutate(mean_rank = mean(rank)) %>%\n arrange(mean_rank) %>% ungroup %>%\n mutate(name_display = fct_inorder(name_display))\nhv_genera_counts_ranked <- hv_genera_counts_agg %>% group_by(sample) %>%\n mutate(rank = rank(desc(n_reads_hv), ties.method=\"max\"),\n highrank = rank <= max_rank_genera) %>%\n group_by(name) %>%\n mutate(highrank_any = any(highrank),\n name_display = ifelse(highrank_any, name, \"Other\")) %>%\n group_by(name_display) %>%\n mutate(mean_rank = mean(rank)) %>%\n arrange(mean_rank) %>% ungroup %>%\n mutate(name_display = fct_inorder(name_display))\n\n# Plot composition\npalette_rank <- c(brewer.pal(8, \"Dark2\"), brewer.pal(8, \"Set2\"))\ng_vcomp_genera <- ggplot(hv_genera_counts_ranked, aes(x=sample, y=n_reads_hv, fill=name_display)) +\n geom_col(position = \"fill\") +\n scale_fill_manual(values = palette_rank, name = \"Viral genus\") +\n scale_y_continuous(name = \"Fraction of HV reads\", breaks = seq(0,1,0.2), expand = c(0,0)) +\n guides(fill = guide_legend(nrow=5)) +\n theme_kit\ng_vcomp_species <- ggplot(hv_species_counts_ranked, aes(x=sample, y=n_reads_hv, fill=name_display)) +\n geom_col(position = \"fill\") +\n scale_fill_manual(values = palette_rank, name = \"Viral species\") +\n scale_y_continuous(name = \"Fraction of HV 
reads\", breaks = seq(0,1,0.2), expand = c(0,0)) +\n guides(fill = guide_legend(nrow=7)) +\n theme_kit\n\ng_vcomp_genera\n\n\n\nCodeg_vcomp_species\n\n\n\n\nThese results are consistent with what’s been observed before, e.g. in the P2RA preprint and public dashboard.\nConclusion\nI analyzed Yang et al. for two reasons: first, because it’s included in the P2RA dataset that we’re currently reanalyzing, and second, because it has the highest relative abundance of both total and human-infecting viruses of any wastewater study we’ve looked at so far. Having done this analysis, the key question to ask is: is this elevated viral abundance a consequence of the unusual sample-processing methods employed by the authors, or the region in which samples were collected? Both are plausible: the processing methods used in this paper are distinct from any other paper we’ve looked at, which makes it possible that they are significantly superior; on the other hand, it’s also plausible that the methods are comparable in efficacy to more standard approaches, and the difference arises from a genuinely elevated level of viruses in Xinjiang compared to e.g. California (for Rothman and Crits-Christoph).\nThe analyses conducted here aren’t able to give a dispositive answer to that question, but they are suggestive. The fact that human-infecting viruses are strongly dominated by fecal-oral species is consistent with the elevated-infection theory, as these are the kinds of viruses I’d most expect to be elevated in a poorer region of the world compared to a rich one. On the other hand, the fact that total viruses (including plant viruses) are also elevated points in the other direction; I don’t see any obvious reason why we’d expect more tobamoviruses in Xinjiang than California.\nMost compellingly, from my perspective, is the very low fraction of bacterial and ribosomal reads found in the Yang data compared to other datasets I’ve analyzed. The methods make no mention of rRNA depletion prior to sequencing; even if this was conducted without being mentioned, that still doesn’t explain the very low level of non-ribosomal bacterial reads. In general, I’d expect a region with elevated enteric viral pathogens to also show elevated enteric bacterial pathogens, so this absence is difficult to explain with a catchment-based theory. Instead, it points towards the viral enrichment protocols used by this study as potentially achieving genuinely better results than other methods we’ve tried.\nThis certainly isn’t conclusive evidence, or close to it. However, based on these results, I’d be very interested in learning more about the details of Yang et al’s approach to viral enrichment, and to see it tried by another group in a different context to see if these impressive results can be replicated."
}
]
\ No newline at end of file
diff --git a/notebooks/2024-03-16_yang.qmd b/notebooks/2024-03-16_yang.qmd
new file mode 100644
index 0000000..53f5fd9
--- /dev/null
+++ b/notebooks/2024-03-16_yang.qmd
@@ -0,0 +1,694 @@
+---
+title: "Workflow analysis of Yang et al. (2020)"
+subtitle: "Wastewater from Xinjiang."
+author: "Will Bradshaw"
+date: 2024-03-16
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ code-link: true
+ df-print: paged
+editor: visual
+title-block-banner: black
+---
+
+```{r}
+#| label: load-packages
+#| include: false
+library(tidyverse)
+library(cowplot)
+library(patchwork)
+library(fastqcr)
+library(RColorBrewer)
+source("../scripts/aux_plot-theme.R")
+theme_base <- theme_base + theme(aspect.ratio = NULL)
+theme_kit <- theme_base + theme(
+ axis.text.x = element_text(hjust = 1, angle = 45),
+ axis.title.x = element_blank(),
+)
+tnl <- theme(legend.position = "none")
+```
+
+As we work on expanding and updating the [P2RA preprint](https://doi.org/10.1101/2023.12.22.23300450) for publication, I'm working on running the relevant datasets through my pipeline to compare the results to those from the original Kraken-based approach. I began by processing [Yang et al. (2020)](https://doi.org/10.1016/j.scitotenv.2020.142322), a study that carried out RNA viral metagenomics on raw sewage samples from Xinjiang, China.
+
+The study collected 7 1L grab samples of raw sewage from the inlets of three WWTPs in Xinjiang, and processed them using an "anion filter membrane absorption" method quite distinct from the [other](https://data.securebio.org/wills-public-notebook/notebooks/2024-02-04_crits-christoph-1.html) [studies](https://data.securebio.org/wills-public-notebook/notebooks/2024-02-27_rothman-1.html) I've looked at so far:
+
+1. Initially, samples were centrifuged to pellet cells and debris.
+2. The supernatant was treated with magnesium chloride and HCl, then filtered through a mixed cellulose ester membrane filter, **discarding the filtrate**.
+3. Material was then eluted from the filters by ultrasonication with 3% beef extract, before undergoing a second filtration step using a 0.22 µm filter, this time retaining the filtrate.
+4. Next, the samples underwent RNA extraction using the QIAamp viral RNA Mini Kit.
+5. Finally, sequencing libraries were generated using the KAPA Hyper Prep Kit. No mention is made of rRNA depletion or panel enrichment.
+
+This study is of particular interest because previous analysis found that the MGS data it generated contains a particularly high fraction of both total viruses and human-infecting viruses. It would be good to know whether this is attributable to the fact that the authors used an unusual method, or if it's the result of genuine differences on the ground (e.g. an elevated prevalence of enteric disease compared to the USA).
+
+# The raw data
+
+The Yang data is unusual in that it contains only a small number of samples (7, from 3 different locations), but each of these is sequenced quite deeply: the number of reads per sample varied from 162M to 283M, with an average of 203M. Taken together, the samples totaled roughly 1.4B read pairs (425 gigabases of sequence).
+
+Read qualities were consistently very high, but adapter levels were also elevated. Sequence duplication levels were very high according to FASTQC, with a minimum of 89% across all samples, suggesting a high degree of oversequencing – perhaps unsurprising given the sheer depth of these samples.
+
+```{r}
+#| fig-height: 5
+#| warning: false
+
+# Data input paths
+data_dir <- "../data/2024-03-15_yang/"
+libraries_path <- file.path(data_dir, "sample-metadata.csv")
+basic_stats_path <- file.path(data_dir, "qc_basic_stats.tsv")
+adapter_stats_path <- file.path(data_dir, "qc_adapter_stats.tsv")
+quality_base_stats_path <- file.path(data_dir, "qc_quality_base_stats.tsv")
+quality_seq_stats_path <- file.path(data_dir, "qc_quality_sequence_stats.tsv")
+
+# Import libraries and extract metadata from sample names
+libraries <- read_csv(libraries_path, show_col_types = FALSE)
+
+# Import QC data
+stages <- c("raw_concat", "cleaned", "dedup", "ribo_initial", "ribo_secondary")
+basic_stats <- read_tsv(basic_stats_path, show_col_types = FALSE) %>%
+ # mutate(sample = sub("_2$", "", sample)) %>%
+ inner_join(libraries, by="sample") %>%
+ mutate(stage = factor(stage, levels = stages),
+ sample = fct_inorder(sample))
+adapter_stats <- read_tsv(adapter_stats_path, show_col_types = FALSE) %>%
+ mutate(sample = sub("_2$", "", sample)) %>%
+ inner_join(libraries, by="sample") %>%
+ mutate(stage = factor(stage, levels = stages),
+ read_pair = fct_inorder(as.character(read_pair)))
+quality_base_stats <- read_tsv(quality_base_stats_path, show_col_types = FALSE) %>%
+ # mutate(sample = sub("_2$", "", sample)) %>%
+ inner_join(libraries, by="sample") %>%
+ mutate(stage = factor(stage, levels = stages),
+ read_pair = fct_inorder(as.character(read_pair)))
+quality_seq_stats <- read_tsv(quality_seq_stats_path, show_col_types = FALSE) %>%
+ # mutate(sample = sub("_2$", "", sample)) %>%
+ inner_join(libraries, by="sample") %>%
+ mutate(stage = factor(stage, levels = stages),
+ read_pair = fct_inorder(as.character(read_pair)))
+
+# Filter to raw data
+basic_stats_raw <- basic_stats %>% filter(stage == "raw_concat")
+adapter_stats_raw <- adapter_stats %>% filter(stage == "raw_concat")
+quality_base_stats_raw <- quality_base_stats %>% filter(stage == "raw_concat")
+quality_seq_stats_raw <- quality_seq_stats %>% filter(stage == "raw_concat")
+
+# Visualize basic stats
+g_nreads_raw <- ggplot(basic_stats_raw, aes(x=sample, y=n_read_pairs, fill=location)) +
+ geom_col(position="dodge") +
+ scale_y_continuous(name="# Read pairs", expand=c(0,0)) +
+ scale_fill_brewer(palette="Set1", name="Location") +
+ theme_kit
+legend_location <- get_legend(g_nreads_raw)
+g_nreads_raw_2 <- g_nreads_raw + theme(legend.position = "none")
+g_nbases_raw <- ggplot(basic_stats_raw, aes(x=sample, y=n_bases_approx, fill=location)) +
+ geom_col(position="dodge") +
+ scale_y_continuous(name="Total base pairs (approx)", expand=c(0,0)) +
+ scale_fill_brewer(palette="Set1", name="Location") +
+ theme_kit + theme(legend.position = "none")
+g_ndup_raw <- ggplot(basic_stats_raw, aes(x=sample, y=percent_duplicates, fill=location)) +
+ geom_col(position="dodge") +
+ scale_y_continuous(name="% Duplicates (FASTQC)", expand=c(0,0), limits = c(0,100), breaks = seq(0,100,20)) +
+ scale_fill_brewer(palette="Set1", name="Location") +
+ theme_kit + theme(legend.position = "none")
+g_basic_raw <- plot_grid(g_nreads_raw_2 + g_nbases_raw + g_ndup_raw, legend_location, ncol = 1, rel_heights = c(1,0.1))
+g_basic_raw
+
+# Visualize adapters
+g_adapters_raw <- ggplot(adapter_stats_raw, aes(x=position, y=pc_adapters, color=location, linetype = read_pair, group=interaction(sample, read_pair))) + geom_line() +
+ scale_color_brewer(palette = "Set1", name = "Location") +
+ scale_linetype_discrete(name = "Read Pair") +
+ scale_y_continuous(name="% Adapters", limits=c(0,50),
+ breaks = seq(0,50,10), expand=c(0,0)) +
+ scale_x_continuous(name="Position", limits=c(0,140),
+ breaks=seq(0,140,20), expand=c(0,0)) +
+ guides(color=guide_legend(nrow=2,byrow=TRUE),
+ linetype = guide_legend(nrow=2,byrow=TRUE)) +
+ facet_wrap(~adapter) + theme_base
+g_adapters_raw
+
+# Visualize quality
+g_quality_base_raw <- ggplot(quality_base_stats_raw, aes(x=position, y=mean_phred_score, color=location, linetype = read_pair, group=interaction(sample,read_pair))) +
+ geom_hline(yintercept=25, linetype="dashed", color="red") +
+ geom_hline(yintercept=30, linetype="dashed", color="red") +
+ geom_line() +
+ scale_color_brewer(palette = "Set1", name = "Location") +
+ scale_linetype_discrete(name = "Read Pair") +
+ scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
+ scale_x_continuous(name="Position", limits=c(0,140),
+ breaks=seq(0,140,20), expand=c(0,0)) +
+ guides(color=guide_legend(nrow=2,byrow=TRUE),
+ linetype = guide_legend(nrow=2,byrow=TRUE)) +
+ theme_base
+g_quality_seq_raw <- ggplot(quality_seq_stats_raw, aes(x=mean_phred_score, y=n_sequences, color=location, linetype = read_pair, group=interaction(sample,read_pair))) +
+ geom_vline(xintercept=25, linetype="dashed", color="red") +
+ geom_vline(xintercept=30, linetype="dashed", color="red") +
+ geom_line() +
+ scale_color_brewer(palette = "Set1", name = "Location") +
+ scale_linetype_discrete(name = "Read Pair") +
+ scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
+ scale_y_continuous(name="# Sequences", expand=c(0,0)) +
+ guides(color=guide_legend(nrow=2,byrow=TRUE),
+ linetype = guide_legend(nrow=2,byrow=TRUE)) +
+ theme_base
+g_quality_base_raw
+g_quality_seq_raw
+```
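+
+As a cross-check on the headline figures above, the per-sample totals can be recomputed directly from `basic_stats_raw`. A minimal sketch, assuming (as elsewhere in this notebook) that `n_bases_approx` is measured in bases:
+
+```{r}
+# Recompute headline totals from the raw-stage QC table
+basic_stats_raw %>% summarize(
+ n_samples = n(),
+ min_reads = min(n_read_pairs),
+ mean_reads = mean(n_read_pairs),
+ max_reads = max(n_read_pairs),
+ total_read_pairs = sum(n_read_pairs),
+ total_gigabases = sum(n_bases_approx)/1e9,
+ min_pc_dup = min(percent_duplicates)
+)
+```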
+
+# Preprocessing
+
+```{r}
+n_reads_rel <- basic_stats %>%
+ select(sample, location, stage, percent_duplicates, n_read_pairs) %>%
+ group_by(sample, location) %>% arrange(sample, location, stage) %>%
+ mutate(p_reads_retained = n_read_pairs / lag(n_read_pairs),
+ p_reads_lost = 1 - p_reads_retained,
+ p_reads_retained_abs = n_read_pairs / n_read_pairs[1],
+ p_reads_lost_abs = 1-p_reads_retained_abs,
+ p_reads_lost_abs_marginal = p_reads_lost_abs - lag(p_reads_lost_abs))
+
+```
+
+On average, cleaning & deduplication together removed about 86% of total reads from each sample, with the largest losses seen during deduplication. Only a few reads (low single-digit percentages at most) were lost during subsequent ribodepletion – a surprising finding given the apparent lack of rRNA depletion in the study methods. This makes me suspect that the methods did include rRNA depletion without mentioning it; if this isn't the case, it suggests that the methods used by the authors might be particularly effective at removing bacteria from the sample.
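+
+To put numbers on this, the marginal losses can be summarized per stage from the `n_reads_rel` table computed above (a quick check; the deduplication figure should dominate):
+
+```{r}
+# Mean fraction of total reads lost at each preprocessing stage
+n_reads_rel %>% group_by(stage) %>%
+ summarize(mean_pc_lost = mean(p_reads_lost_abs_marginal, na.rm=TRUE)*100,
+ .groups = "drop")
+```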
+
+```{r}
+#| warning: false
+# Plot reads over preprocessing
+g_reads_stages <- ggplot(basic_stats, aes(x=stage, y=n_read_pairs,fill=location,group=sample)) +
+ geom_col(position="dodge") + scale_fill_brewer(palette="Set1", name="Location") +
+ scale_y_continuous("# Read pairs", expand=c(0,0)) +
+ theme_kit
+g_reads_stages
+
+# Plot relative read losses during preprocessing
+g_reads_rel <- ggplot(n_reads_rel, aes(x=stage, y=p_reads_lost_abs_marginal,fill=location,group=sample)) +
+ geom_col(position="dodge") + scale_fill_brewer(palette="Set1", name="Location") +
+ scale_y_continuous("% Total Reads Lost", expand=c(0,0), labels = function(x) x*100) +
+ theme_kit
+g_reads_rel
+```
+
+Data cleaning with FASTP was very successful at removing adapters, with very few adapter sequences found by FASTQC at any stage after the raw data. Improvements in quality were unsurprisingly minimal, given the high initial quality observed.
+
+```{r}
+#| warning: false
+# Visualize adapters
+g_adapters <- ggplot(adapter_stats, aes(x=position, y=pc_adapters, color=location, linetype = read_pair, group=interaction(sample, read_pair))) + geom_line() +
+ scale_color_brewer(palette = "Set1", name = "Location") +
+ scale_linetype_discrete(name = "Read Pair") +
+ scale_y_continuous(name="% Adapters", limits=c(0,50),
+ breaks = seq(0,50,10), expand=c(0,0)) +
+ scale_x_continuous(name="Position", limits=c(0,140),
+ breaks=seq(0,140,20), expand=c(0,0)) +
+ guides(color=guide_legend(nrow=2,byrow=TRUE),
+ linetype = guide_legend(nrow=2,byrow=TRUE)) +
+ facet_grid(stage~adapter) + theme_base
+g_adapters
+
+# Visualize quality
+g_quality_base <- ggplot(quality_base_stats, aes(x=position, y=mean_phred_score, color=location, linetype = read_pair, group=interaction(sample,read_pair))) +
+ geom_hline(yintercept=25, linetype="dashed", color="red") +
+ geom_hline(yintercept=30, linetype="dashed", color="red") +
+ geom_line() +
+ scale_color_brewer(palette = "Set1", name = "Location") +
+ scale_linetype_discrete(name = "Read Pair") +
+ scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
+ scale_x_continuous(name="Position", limits=c(0,140),
+ breaks=seq(0,140,20), expand=c(0,0)) +
+ guides(color=guide_legend(nrow=2,byrow=TRUE),
+ linetype = guide_legend(nrow=2,byrow=TRUE)) +
+ facet_grid(stage~.) + theme_base
+g_quality_seq <- ggplot(quality_seq_stats, aes(x=mean_phred_score, y=n_sequences, color=location, linetype = read_pair, group=interaction(sample,read_pair))) +
+ geom_vline(xintercept=25, linetype="dashed", color="red") +
+ geom_vline(xintercept=30, linetype="dashed", color="red") +
+ geom_line() +
+ scale_color_brewer(palette = "Set1", name = "Location") +
+ scale_linetype_discrete(name = "Read Pair") +
+ scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
+ scale_y_continuous(name="# Sequences", expand=c(0,0)) +
+ guides(color=guide_legend(nrow=2,byrow=TRUE),
+ linetype = guide_legend(nrow=2,byrow=TRUE)) +
+ facet_grid(stage~., scales="free_y") + theme_base
+g_quality_base
+g_quality_seq
+```
+
+According to FASTQC, deduplication was only moderately effective at reducing measured duplicate levels, despite the high read losses observed. After deduplication, FASTQC-measured duplicate levels fell from an average of 92% to one of 58%:
+
+```{r}
+g_dup_stages <- ggplot(basic_stats, aes(x=stage, y=percent_duplicates, fill=location, group=sample)) +
+ geom_col(position="dodge") + scale_fill_brewer(palette = "Set1", name="Location") +
+ scale_y_continuous("% Duplicates", limits=c(0,100), breaks=seq(0,100,20), expand=c(0,0)) +
+ theme_kit
+g_readlen_stages <- ggplot(basic_stats, aes(x=stage, y=mean_seq_len, fill=location, group=sample)) +
+ geom_col(position="dodge") + scale_fill_brewer(palette = "Set1", name="Location") +
+ scale_y_continuous("Mean read length (nt)", expand=c(0,0)) +
+ theme_kit
+legend_loc <- get_legend(g_dup_stages)
+g_dup_stages
+g_readlen_stages
+```
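+
+As a numerical check on those duplicate averages, we can compute them per stage from `basic_stats` (the "cleaned" and "dedup" stages should correspond to the before/after figures quoted above):
+
+```{r}
+# Mean FASTQC duplicate level before and after deduplication
+basic_stats %>%
+ filter(stage %in% c("cleaned", "dedup")) %>%
+ group_by(stage) %>%
+ summarize(mean_pc_duplicates = mean(percent_duplicates), .groups = "drop")
+```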
+
+# High-level composition
+
+As before, to assess the high-level composition of the reads, I ran the ribodepleted files through Kraken (using the Standard 16 database) and summarized the results with Bracken. Combining these results with the read counts above gives us a breakdown of the inferred composition of the samples:
+
+```{r}
+# Import Bracken data
+bracken_path <- file.path(data_dir, "bracken_counts.tsv")
+bracken <- read_tsv(bracken_path, show_col_types = FALSE)
+total_assigned <- bracken %>% group_by(sample) %>% summarize(
+ name = "Total",
+ kraken_assigned_reads = sum(kraken_assigned_reads),
+ added_reads = sum(added_reads),
+ new_est_reads = sum(new_est_reads),
+ fraction_total_reads = sum(fraction_total_reads)
+)
+bracken_spread <- bracken %>% select(name, sample, new_est_reads) %>%
+ mutate(name = tolower(name)) %>%
+ pivot_wider(id_cols = "sample", names_from = "name", values_from = "new_est_reads")
+# Count reads
+read_counts_preproc <- basic_stats %>%
+ select(sample, location, stage, n_read_pairs) %>%
+ pivot_wider(id_cols = c("sample", "location"), names_from="stage", values_from="n_read_pairs")
+read_counts <- read_counts_preproc %>%
+ inner_join(total_assigned %>% select(sample, new_est_reads), by = "sample") %>%
+ rename(assigned = new_est_reads) %>%
+ inner_join(bracken_spread, by="sample")
+# Assess composition
+read_comp <- transmute(read_counts, sample=sample, location=location,
+ n_filtered = raw_concat-cleaned,
+ n_duplicate = cleaned-dedup,
+ n_ribosomal = (dedup-ribo_initial) + (ribo_initial-ribo_secondary),
+ n_unassigned = ribo_secondary-assigned,
+ n_bacterial = bacteria,
+ n_archaeal = archaea,
+ n_viral = viruses,
+ n_human = eukaryota) # eukaryotic reads assumed to be predominantly human
+read_comp_long <- pivot_longer(read_comp, -(sample:location), names_to = "classification",
+ names_prefix = "n_", values_to = "n_reads") %>%
+ mutate(classification = fct_inorder(str_to_sentence(classification))) %>%
+ group_by(sample) %>% mutate(p_reads = n_reads/sum(n_reads))
+read_comp_summ <- read_comp_long %>%
+ group_by(location, classification) %>%
+ summarize(n_reads = sum(n_reads), .groups = "drop_last") %>%
+ mutate(n_reads = replace_na(n_reads,0),
+ p_reads = n_reads/sum(n_reads),
+ pc_reads = p_reads*100)
+
+# Plot overall composition
+g_comp <- ggplot(read_comp_long, aes(x=sample, y=p_reads, fill=classification)) +
+ geom_col(position = "stack") +
+ scale_y_continuous(name = "% Reads", limits = c(0,1.01), breaks = seq(0,1,0.2),
+ expand = c(0,0), labels = function(x) x*100) +
+ scale_fill_brewer(palette = "Set1", name = "Classification") +
+ facet_wrap(.~location, scales="free") +
+ theme_kit
+g_comp
+
+# Plot composition of minor components
+read_comp_minor <- read_comp_long %>% filter(classification %in% c("Archaeal", "Viral", "Human", "Other"))
+palette_minor <- brewer.pal(9, "Set1")[6:9]
+g_comp_minor <- ggplot(read_comp_minor, aes(x=sample, y=p_reads, fill=classification)) +
+ geom_col(position = "stack") +
+ scale_y_continuous(name = "% Reads", breaks = seq(0,0.1,0.01),
+ expand = c(0,0), labels = function(x) x*100) +
+ scale_fill_manual(values=palette_minor, name = "Classification") +
+ facet_wrap(.~location, scales = "free_x") +
+ theme_kit
+g_comp_minor
+```
+
+```{r}
+p_reads_summ <- read_comp_long %>%
+ mutate(classification = ifelse(classification %in% c("Filtered", "Duplicate", "Unassigned"), "Excluded", as.character(classification))) %>%
+ group_by(classification, sample, location) %>%
+ summarize(p_reads = sum(p_reads), .groups = "drop") %>%
+ group_by(classification) %>%
+ summarize(pc_min = min(p_reads)*100, pc_max = max(p_reads)*100, pc_mean = mean(p_reads)*100)
+```
+
+Overall, in these unenriched samples, between 89% and 95% of reads (mean 93%) were low-quality, duplicates, or unassigned. Of the remainder, 0.4-3.9% (mean 1.6%) were identified as ribosomal, and 0.06-0.22% (mean 0.13%) were otherwise classified as bacterial. The fraction of human reads ranged from 0.006-0.198% (mean 0.049%). Strikingly, the fraction of viral reads ranged from 0.6% all the way up to 10.2% (mean 5.5%).
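+
+These ranges come from the `p_reads_summ` table computed in the previous chunk; printing it directly makes the per-category minima, maxima, and means explicit:
+
+```{r}
+# Display the composition summary underlying the figures quoted above
+p_reads_summ %>% arrange(desc(pc_mean))
+```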
+
+The total fraction of viral reads is strikingly high, even higher on average than for Rothman unenriched samples. As in Rothman, this is primarily accounted for by very high levels of tobamoviruses (viral family *Virgaviridae*), which make up 74-98% (mean 84%) of viral reads. Most of the remainder is accounted for by [*Tombusviridae*](https://en.wikipedia.org/wiki/Tombusviridae) and [*Solemoviridae*](https://en.wikipedia.org/wiki/Solemoviridae), two other families of plant viruses. Only one family of vertebrate viruses, *Astroviridae*, accounted for more than 1% of viral reads in any sample.
+
+```{r}
+viral_taxa_path <- file.path(data_dir, "viral-taxids.tsv.gz")
+viral_taxa <- read_tsv(viral_taxa_path, show_col_types = FALSE)
+samples <- as.character(basic_stats_raw$sample)
+col_names <- c("pc_reads_total", "n_reads_clade", "n_reads_direct", "rank", "taxid", "name")
+report_paths <- paste0(data_dir, "kraken/", samples, ".report.gz")
+kraken_reports <- lapply(1:length(samples),
+ function(n) read_tsv(report_paths[n], col_names = col_names,
+ show_col_types = FALSE) %>%
+ mutate(sample = samples[n])) %>% bind_rows
+kraken_reports_viral <- filter(kraken_reports, taxid %in% viral_taxa$taxid) %>%
+ group_by(sample) %>%
+ mutate(p_reads_viral = n_reads_clade/n_reads_clade[1])
+viral_families <- kraken_reports_viral %>% filter(rank == "F")
+viral_families_major_list <- viral_families %>% group_by(name, taxid) %>%
+ summarize(p_reads_viral_max = max(p_reads_viral), .groups="drop") %>%
+ filter(p_reads_viral_max >= 0.01) %>% pull(name)
+viral_families_major <- viral_families %>% filter(name %in% viral_families_major_list) %>%
+ select(name, taxid, sample, p_reads_viral)
+viral_families_minor <- viral_families_major %>% group_by(sample) %>%
+ summarize(p_reads_viral_major = sum(p_reads_viral)) %>%
+ mutate(name = "Other", taxid=NA, p_reads_viral = 1-p_reads_viral_major) %>%
+ select(name, taxid, sample, p_reads_viral)
+viral_families_display <- bind_rows(viral_families_major, viral_families_minor) %>%
+ arrange(desc(p_reads_viral)) %>% mutate(name = factor(name, levels=c(viral_families_major_list, "Other")))
+g_families <- ggplot(viral_families_display, aes(x=sample, y=p_reads_viral, fill=name)) +
+ geom_col(position="stack") +
+ scale_fill_brewer(palette = "Dark2", name = "Viral family") +
+ scale_y_continuous(name="% Viral Reads", limits=c(0,1), breaks = seq(0,1,0.2),
+ expand=c(0,0), labels = function(y) y*100) +
+ theme_kit
+g_families
+```
+
+Almost as striking as the high fraction of viral reads in these data is the extremely low prevalence of bacterial reads. For comparison, non-ribosomal bacterial reads made up 16-20% of reads in unenriched Crits-Christoph samples and 2-12% in unenriched Rothman samples. Yang et al. are thus somehow achieving a greater than 10-fold reduction in the fraction of bacterial reads compared to other studies. Unlike the increased prevalence of viruses, this seems harder to explain away via differences in conditions on the ground. Together with the elevated viral fraction, these results suggest to me that Yang et al.'s idiosyncratic methods are worth investigating further.
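+
+For comparison with those studies, the corresponding range for the Yang samples can be pulled from `read_comp_long` (a quick sketch; "Bacterial" here means non-ribosomal reads Bracken assigned to bacteria, as defined above):
+
+```{r}
+# Range of non-ribosomal bacterial read fractions across samples
+read_comp_long %>% ungroup() %>%
+ filter(classification == "Bacterial") %>%
+ summarize(pc_min = min(p_reads)*100,
+ pc_max = max(p_reads)*100,
+ pc_mean = mean(p_reads)*100)
+```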
+
+# Human-infecting virus reads: validation
+
+Next, I investigated the human-infecting virus read content of these unenriched samples, using a pipeline identical to that described in my [entry on unenriched Rothman samples](https://data.securebio.org/wills-public-notebook/notebooks/2024-02-27_rothman-1.html). This process identified a total of 532,583 read pairs across all samples (0.3% of surviving reads):
+
+```{r}
+hv_reads_filtered_path <- file.path(data_dir, "hv_hits_putative_filtered.tsv.gz")
+hv_reads_filtered <- read_tsv(hv_reads_filtered_path, show_col_types = FALSE) %>%
+ inner_join(libraries, by="sample") %>%
+ arrange(location) %>%
+ mutate(sample = fct_inorder(sample))
+n_hv_filtered <- hv_reads_filtered %>%
+ group_by(sample, location) %>% count %>%
+ inner_join(basic_stats %>% filter(stage == "ribo_initial") %>%
+ select(sample, n_read_pairs), by="sample") %>%
+ rename(n_putative = n, n_total = n_read_pairs) %>%
+ mutate(p_reads = n_putative/n_total, pc_reads = p_reads * 100)
+```
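+
+Printing the totals from `n_hv_filtered` confirms the figures quoted above (assuming, as in the chunk, that read counts after initial ribodepletion are the relevant denominator):
+
+```{r}
+# Total putative HV read pairs and fraction of surviving reads
+n_hv_filtered %>% ungroup() %>%
+ summarize(n_putative = sum(n_putative), n_total = sum(n_total)) %>%
+ mutate(pc_reads = n_putative/n_total * 100)
+```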
+
+```{r}
+#| warning: false
+#| fig-width: 8
+
+mrg <- hv_reads_filtered %>%
+ mutate(kraken_label = ifelse(assigned_hv, "Kraken2 HV\nassignment",
+ ifelse(hit_hv, "Kraken2 HV\nhit",
+ "No hit or\nassignment"))) %>%
+ group_by(sample) %>% arrange(desc(adj_score_fwd), desc(adj_score_rev)) %>%
+ mutate(seq_num = row_number(),
+ adj_score_max = pmax(adj_score_fwd, adj_score_rev))
+# Import Bowtie2/Kraken data and perform filtering steps
+g_mrg <- ggplot(mrg, aes(x=adj_score_fwd, y=adj_score_rev)) +
+ geom_point(alpha=0.5, shape=16) +
+ scale_x_continuous(name="S(forward read)", limits=c(0,40), breaks=seq(0,100,10), expand = c(0,0)) +
+ scale_y_continuous(name="S(reverse read)", limits=c(0,40), breaks=seq(0,100,10), expand = c(0,0)) +
+ facet_wrap(~kraken_label, labeller = labeller(kraken_label = label_wrap_gen(20))) +
+ theme_base + theme(aspect.ratio=1)
+g_mrg
+
+g_hist_0 <- ggplot(mrg, aes(x=adj_score_max)) +
+ geom_histogram(binwidth=5, boundary=0, position="dodge") +
+ geom_vline(xintercept=20, linetype="dashed", color="red") +
+ facet_wrap(~kraken_label) +
+ scale_x_continuous(name = "Maximum adjusted alignment score") +
+ scale_y_continuous(name="# Read pairs") +
+ scale_fill_brewer(palette = "Dark2") +
+ theme_base
+g_hist_0
+```
+
+This is a far larger number of putative viral reads than I've previously needed to analyze, and far too many to realistically process with BLASTN. To address this problem, I randomly sampled 1% of putative viral sequences (5,326 read pairs) and ran them through BLASTN in a similar manner to past datasets:
+
+```{r}
+set.seed(83745)
+mrg_sample <- sample_frac(mrg, 0.01)
+mrg_num <- mrg_sample %>% group_by(sample) %>%
+ arrange(sample, desc(adj_score_max)) %>%
+ mutate(seq_num = row_number(), seq_head = paste0(">", sample, "_", seq_num))
+mrg_fasta <- mrg_num %>%
+ ungroup %>%
+ select(header1=seq_head, seq1=query_seq_fwd, header2=seq_head, seq2=query_seq_rev) %>%
+ mutate(header1=paste0(header1, "_1"), header2=paste0(header2, "_2"))
+mrg_fasta_out <- do.call(paste, c(mrg_fasta, sep="\n")) %>% paste(collapse="\n")
+write(mrg_fasta_out, file.path(data_dir, "yang-putative-viral.fasta"))
+```
+
+```{r}
+#| warning: false
+# Import BLAST results
+blast_results_path <- file.path(data_dir, "yang-putative-viral-all.blast.gz")
+blast_cols <- c("qseqid", "sseqid", "sgi", "staxid", "qlen", "evalue", "bitscore", "qcovs", "length", "pident", "mismatch", "gapopen", "sstrand", "qstart", "qend", "sstart", "send")
+blast_results <- read_tsv(blast_results_path, show_col_types = FALSE,
+ col_names = blast_cols, col_types = cols(.default="c"))
+
+# Add viral status
+blast_results_viral <- mutate(blast_results, viral = staxid %in% viral_taxa$taxid)
+
+# Filter for best hit for each query/subject, then rank hits for each query
+blast_results_best <- blast_results_viral %>% group_by(qseqid, staxid) %>%
+ filter(bitscore == max(bitscore)) %>%
+ filter(length == max(length)) %>% filter(row_number() == 1)
+blast_results_ranked <- blast_results_best %>%
+ group_by(qseqid) %>% mutate(rank = dense_rank(desc(bitscore)))
+blast_results_highrank <- blast_results_ranked %>% filter(rank <= 5) %>%
+ mutate(read_pair = str_split(qseqid, "_") %>% sapply(nth, n=-1),
+ seq_num = str_split(qseqid, "_") %>% sapply(nth, n=-2),
+ sample = str_split(qseqid, "_") %>% lapply(head, n=-2) %>%
+ sapply(paste, collapse="_")) %>%
+ mutate(bitscore = as.numeric(bitscore), seq_num = as.numeric(seq_num))
+
+# Summarize by read pair and taxid
+blast_results_paired <- blast_results_highrank %>%
+ group_by(sample, seq_num, staxid, viral) %>%
+ summarize(bitscore_max = max(bitscore), bitscore_min = min(bitscore),
+ n_reads = n(), .groups = "drop") %>%
+ mutate(viral_full = viral & n_reads == 2)
+
+# Compare to Kraken & Bowtie assignments
+mrg_assign <- mrg_num %>% select(sample, seq_num, taxid, assigned_taxid)
+blast_results_assign <- left_join(blast_results_paired, mrg_assign, by=c("sample", "seq_num")) %>%
+ mutate(taxid_match_bowtie = (staxid == taxid),
+ taxid_match_kraken = (staxid == assigned_taxid),
+ taxid_match_any = taxid_match_bowtie | taxid_match_kraken)
+blast_results_out <- blast_results_assign %>%
+ group_by(sample, seq_num) %>%
+ summarize(viral_status = ifelse(any(viral_full), 2,
+ ifelse(any(taxid_match_any), 2,
+ ifelse(any(viral), 1, 0))),
+ .groups = "drop")
+```
+
+```{r}
+#| fig-height: 8
+#| warning: false
+
+# Merge BLAST results with unenriched read data
+mrg_blast <- full_join(mrg_num, blast_results_out, by=c("sample", "seq_num")) %>%
+ mutate(viral_status = replace_na(viral_status, 0),
+ viral_status_out = ifelse(viral_status == 0, FALSE, TRUE))
+
+# Plot
+g_mrg_blast_0 <- mrg_blast %>%
+ ggplot(aes(x=adj_score_fwd, y=adj_score_rev, color=viral_status_out)) +
+ geom_point(alpha=0.5, shape=16) +
+ scale_x_continuous(name="S(forward read)", limits=c(0,40), breaks=seq(0,100,10), expand = c(0,0)) +
+ scale_y_continuous(name="S(reverse read)", limits=c(0,40), breaks=seq(0,100,10), expand = c(0,0)) +
+ scale_color_brewer(palette = "Set1", name = "Viral status") +
+ facet_grid(viral_status_out~kraken_label, labeller = labeller(kraken_label = label_wrap_gen(20))) +
+ theme_base + theme(aspect.ratio=1)
+g_mrg_blast_0
+```
+
+```{r}
+g_hist_1 <- ggplot(mrg_blast, aes(x=adj_score_max)) +
+ geom_histogram(binwidth=5, boundary=0, position="dodge") +
+ geom_vline(xintercept=20, linetype="dashed", color="red") +
+ facet_wrap(~viral_status_out) +
+ scale_x_continuous(name = "Maximum adjusted alignment score") +
+ scale_y_continuous(name="# Read pairs") +
+ scale_fill_brewer(palette = "Dark2") +
+ theme_base
+g_hist_1
+```
+
+High-scoring true-positives massively outnumber false-positives, while low-scoring true-positives are fairly rare. My usual score cutoff, a disjunctive threshold at S=20, achieves a sensitivity of \~97% and an F1 score of \~98%. As such, I think this looks pretty good, and feel comfortable moving on to analyzing the entire HV dataset.
+
+```{r}
+test_sens_spec <- function(tab, score_threshold, conjunctive, include_special){
+ if (!include_special) tab <- filter(tab, viral_status_out %in% c("TRUE", "FALSE"))
+ tab_retained <- tab %>% mutate(
+ conjunctive = conjunctive,
+ retain_score_conjunctive = (adj_score_fwd > score_threshold & adj_score_rev > score_threshold),
+ retain_score_disjunctive = (adj_score_fwd > score_threshold | adj_score_rev > score_threshold),
+ retain_score = ifelse(conjunctive, retain_score_conjunctive, retain_score_disjunctive),
+ retain = assigned_hv | hit_hv | retain_score) %>%
+ group_by(viral_status_out, retain) %>% count
+ pos_tru <- tab_retained %>% filter(viral_status_out == "TRUE", retain) %>% pull(n) %>% sum
+ pos_fls <- tab_retained %>% filter(viral_status_out != "TRUE", retain) %>% pull(n) %>% sum
+ neg_tru <- tab_retained %>% filter(viral_status_out != "TRUE", !retain) %>% pull(n) %>% sum
+ neg_fls <- tab_retained %>% filter(viral_status_out == "TRUE", !retain) %>% pull(n) %>% sum
+ sensitivity <- pos_tru / (pos_tru + neg_fls)
+ specificity <- neg_tru / (neg_tru + pos_fls)
+ precision <- pos_tru / (pos_tru + pos_fls)
+ f1 <- 2 * precision * sensitivity / (precision + sensitivity)
+ out <- tibble(threshold=score_threshold, include_special = include_special,
+ conjunctive = conjunctive, sensitivity=sensitivity,
+ specificity=specificity, precision=precision, f1=f1)
+ return(out)
+}
+range_f1 <- function(intab, inc_special, inrange=15:45){
+ tss <- purrr::partial(test_sens_spec, tab=intab, include_special=inc_special)
+ stats_conj <- lapply(inrange, tss, conjunctive=TRUE) %>% bind_rows
+ stats_disj <- lapply(inrange, tss, conjunctive=FALSE) %>% bind_rows
+ stats_all <- bind_rows(stats_conj, stats_disj) %>%
+ pivot_longer(!(threshold:conjunctive), names_to="metric", values_to="value") %>%
+ mutate(conj_label = ifelse(conjunctive, "Conjunctive", "Disjunctive"))
+ return(stats_all)
+}
+inc_special <- FALSE
+stats_0 <- range_f1(mrg_blast, inc_special) %>% mutate(attempt=0)
+threshold_opt_0 <- stats_0 %>% group_by(conj_label,attempt) %>% filter(metric == "f1") %>%
+ filter(value == max(value)) %>% filter(threshold == min(threshold))
+g_stats_0 <- ggplot(stats_0, aes(x=threshold, y=value, color=metric)) +
+ geom_vline(data = threshold_opt_0, mapping = aes(xintercept=threshold),
+ color = "red", linetype = "dashed") +
+ geom_line() +
+ scale_y_continuous(name = "Value", limits=c(0,1), breaks = seq(0,1,0.2), expand = c(0,0)) +
+ scale_x_continuous(name = "Threshold", expand = c(0,0)) +
+ scale_color_brewer(palette="Set3") +
+ facet_wrap(~conj_label) +
+ theme_base
+g_stats_0
+```
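+
+For reference, the headline stats for the chosen cutoff can be reproduced with a single call to the helper defined above (disjunctive threshold at S=20; the sensitivity and F1 values should match the ~97% and ~98% figures quoted earlier):
+
+```{r}
+# Performance metrics at the disjunctive S=20 cutoff
+test_sens_spec(mrg_blast, score_threshold = 20,
+ conjunctive = FALSE, include_special = FALSE)
+```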
+
+# Human-infecting virus reads: analysis
+
+```{r}
+# Get raw read counts
+read_counts_raw <- basic_stats_raw %>%
+ select(sample, location, n_reads_raw = n_read_pairs)
+
+# Get HV read counts & RA
+mrg_hv <- mrg %>%
+ mutate(hv_status = assigned_hv | hit_hv | adj_score_max >= 20)
+read_counts_hv <- mrg_hv %>% filter(hv_status) %>% group_by(sample) %>% count(name = "n_reads_hv")
+read_counts <- read_counts_raw %>%
+ left_join(read_counts_hv, by="sample") %>%
+ mutate(n_reads_hv = replace_na(n_reads_hv, 0),
+ p_reads_hv = n_reads_hv/n_reads_raw)
+
+# Aggregate
+read_counts_total <- read_counts %>% ungroup %>%
+ summarize(n_reads_raw = sum(n_reads_raw),
+ n_reads_hv = sum(n_reads_hv)) %>%
+ mutate(p_reads_hv = n_reads_hv/n_reads_raw,
+ sample= "All samples", location = "All locations")
+read_counts_agg <- read_counts %>%
+ bind_rows(read_counts_total) %>%
+ mutate(sample = fct_inorder(sample),
+ location = fct_inorder(location))
+```
+
+Applying a disjunctive cutoff at S=20 identifies 513,259 reads as human-viral out of 1.42B total reads, for a relative HV abundance of $3.62 \times 10^{-4}$. This compares to $2.0 \times 10^{-4}$ on the public dashboard, corresponding to the results for Kraken-only identification: an 80% increase, smaller than the 4-5x increases seen for Crits-Christoph and Rothman. Relative HV abundances for individual samples ranged from $6.36 \times 10^{-5}$ to $1.19 \times 10^{-3}$:
+
+```{r}
+# Visualize
+g_phv_agg <- ggplot(read_counts_agg, aes(x=sample, y=p_reads_hv, color=location)) +
+ geom_point() +
+ scale_y_log10("Relative abundance of human virus reads") +
+ scale_color_brewer(palette = "Set1", name = "Location") +
+ theme_kit
+g_phv_agg
+```
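+
+As a quick arithmetic check on the comparison above, the fold change versus the dashboard can be computed from `read_counts_total` (hardcoding the dashboard's Kraken-only figure of $2.0 \times 10^{-4}$ quoted in the text):
+
+```{r}
+# Fold change of overall HV relative abundance vs. the Kraken-only estimate
+read_counts_total %>%
+ transmute(p_reads_hv, fold_change_vs_dashboard = p_reads_hv / 2.0e-4)
+```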
+
+Digging into individual viruses, we see that fecal-oral viruses predominate, especially *Mamastrovirus*, *Rotavirus*, *Salivirus*, and fecal-oral *Enterovirus* species (especially *Enterovirus C*, which includes poliovirus):
+
+```{r}
+# Get viral taxon names for putative HV reads
+viral_taxa$name[viral_taxa$taxid == 249588] <- "Mamastrovirus"
+viral_taxa$name[viral_taxa$taxid == 194960] <- "Kobuvirus"
+viral_taxa$name[viral_taxa$taxid == 688449] <- "Salivirus"
+mrg_hv_named <- mrg_hv %>% left_join(viral_taxa, by="taxid")
+
+# Discover viral species & genera for HV reads
+raise_rank <- function(read_db, taxid_db, out_rank = "species", verbose = FALSE){
+ # Get higher ranks than search rank
+ ranks <- c("subspecies", "species", "subgenus", "genus", "subfamily", "family", "suborder", "order", "class", "subphylum", "phylum", "kingdom", "superkingdom")
+ rank_match <- which.max(ranks == out_rank)
+ high_ranks <- ranks[rank_match:length(ranks)]
+ # Merge read DB and taxid DB
+ reads <- read_db %>% select(-parent_taxid, -rank, -name) %>%
+ left_join(taxid_db, by="taxid")
+ # Extract sequences that are already at appropriate rank
+ reads_rank <- filter(reads, rank == out_rank)
+ # Drop sequences at a higher rank and return unclassified sequences
+ reads_norank <- reads %>% filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))
+ while(nrow(reads_norank) > 0){ # As long as there are unclassified sequences...
+ # Promote read taxids and re-merge with taxid DB, then re-classify and filter
+ reads_remaining <- reads_norank %>% mutate(taxid = parent_taxid) %>%
+ select(-parent_taxid, -rank, -name) %>%
+ left_join(taxid_db, by="taxid")
+ reads_rank <- reads_remaining %>% filter(rank == out_rank) %>%
+ bind_rows(reads_rank)
+ reads_norank <- reads_remaining %>%
+ filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))
+ }
+ # Finally, extract and append reads that were excluded during the process
+ reads_dropped <- reads %>% filter(!seq_id %in% reads_rank$seq_id)
+ reads_out <- reads_rank %>% bind_rows(reads_dropped) %>%
+ select(-parent_taxid, -rank, -name) %>%
+ left_join(taxid_db, by="taxid")
+ return(reads_out)
+}
+hv_reads_species <- raise_rank(mrg_hv_named, viral_taxa, "species")
+hv_reads_genera <- raise_rank(mrg_hv_named, viral_taxa, "genus")
+```
+
+```{r}
+# Count relative abundance for species
+hv_species_counts_raw <- hv_reads_species %>% group_by(sample, location, name) %>%
+ summarize(n_reads_hv = sum(hv_status), .groups = "drop") %>%
+ inner_join(read_counts_agg %>% select(sample, n_reads_raw), by="sample")
+hv_species_counts_all <- hv_species_counts_raw %>% group_by(name) %>%
+ summarize(n_reads_hv = sum(n_reads_hv),
+ n_reads_raw = sum(n_reads_raw)) %>%
+ mutate(sample = "All samples", location = "All locations")
+hv_species_counts_agg <- bind_rows(hv_species_counts_raw, hv_species_counts_all) %>%
+ mutate(p_reads_hv = n_reads_hv/n_reads_raw)
+
+# Count relative abundance for genera
+hv_genera_counts_raw <- hv_reads_genera %>% group_by(sample, location, name) %>%
+ summarize(n_reads_hv = sum(hv_status), .groups = "drop") %>%
+ inner_join(read_counts_agg %>% select(sample, n_reads_raw), by="sample")
+hv_genera_counts_all <- hv_genera_counts_raw %>% group_by(name) %>%
+ summarize(n_reads_hv = sum(n_reads_hv),
+ n_reads_raw = sum(n_reads_raw)) %>%
+ mutate(sample = "All samples", location = "All locations")
+hv_genera_counts_agg <- bind_rows(hv_genera_counts_raw, hv_genera_counts_all) %>%
+ mutate(p_reads_hv = n_reads_hv/n_reads_raw)
+```
+
+```{r}
+#| fig-height: 6
+# Compute ranks for species and genera and restrict to high-ranking taxa
+max_rank_species <- 5
+max_rank_genera <- 5
+hv_species_counts_ranked <- hv_species_counts_agg %>% group_by(sample) %>%
+ mutate(rank = rank(desc(n_reads_hv), ties.method="max"),
+ highrank = rank <= max_rank_species) %>%
+ group_by(name) %>%
+ mutate(highrank_any = any(highrank),
+ name_display = ifelse(highrank_any, name, "Other")) %>%
+ group_by(name_display) %>%
+ mutate(mean_rank = mean(rank)) %>%
+ arrange(mean_rank) %>% ungroup %>%
+ mutate(name_display = fct_inorder(name_display))
+hv_genera_counts_ranked <- hv_genera_counts_agg %>% group_by(sample) %>%
+ mutate(rank = rank(desc(n_reads_hv), ties.method="max"),
+ highrank = rank <= max_rank_genera) %>%
+ group_by(name) %>%
+ mutate(highrank_any = any(highrank),
+ name_display = ifelse(highrank_any, name, "Other")) %>%
+ group_by(name_display) %>%
+ mutate(mean_rank = mean(rank)) %>%
+ arrange(mean_rank) %>% ungroup %>%
+ mutate(name_display = fct_inorder(name_display))
+
+# Plot composition
+palette_rank <- c(brewer.pal(8, "Dark2"), brewer.pal(8, "Set2"))
+g_vcomp_genera <- ggplot(hv_genera_counts_ranked, aes(x=sample, y=n_reads_hv, fill=name_display)) +
+ geom_col(position = "fill") +
+ scale_fill_manual(values = palette_rank, name = "Viral genus") +
+ scale_y_continuous(name = "Fraction of HV reads", breaks = seq(0,1,0.2), expand = c(0,0)) +
+ guides(fill = guide_legend(nrow=5)) +
+ theme_kit
+g_vcomp_species <- ggplot(hv_species_counts_ranked, aes(x=sample, y=n_reads_hv, fill=name_display)) +
+ geom_col(position = "fill") +
+ scale_fill_manual(values = palette_rank, name = "Viral species") +
+ scale_y_continuous(name = "Fraction of HV reads", breaks = seq(0,1,0.2), expand = c(0,0)) +
+ guides(fill = guide_legend(nrow=7)) +
+ theme_kit
+
+g_vcomp_genera
+g_vcomp_species
+```
+
+These results are consistent with what's been observed before, e.g. in the [P2RA preprint](https://doi.org/10.1101/2023.12.22.23300450) and public dashboard.
+
+# Conclusion
+
+I analyzed Yang et al. for two reasons: first, because it's included in the P2RA dataset that we're currently reanalyzing, and second, because it has the highest relative abundance of both total and human-infecting viruses of any wastewater study we've looked at so far. Having done this analysis, the key question to ask is: is this elevated viral abundance a consequence of the unusual sample-processing methods employed by the authors, or the region in which samples were collected? Both are plausible: the processing methods used in this paper are distinct from any other paper we've looked at, which makes it possible that they are significantly superior; on the other hand, it's also plausible that the methods are comparable in efficacy to more standard approaches, and the difference arises from a genuinely elevated level of viruses in Xinjiang compared to e.g. California (for Rothman and Crits-Christoph).
+
+The analyses conducted here aren't able to give a dispositive answer to that question, but they are suggestive. The fact that human-infecting viruses are strongly dominated by fecal-oral species is consistent with the elevated-infection theory, as these are the kinds of viruses I'd most expect to be elevated in a poorer region of the world compared to a rich one. On the other hand, the fact that total viruses (including plant viruses) are also elevated points in the other direction; I don't see any obvious reason why we'd expect more tobamoviruses in Xinjiang than California.
+
+Most compelling, from my perspective, is the very low fraction of bacterial and ribosomal reads found in the Yang data compared to other datasets I've analyzed. The methods make no mention of rRNA depletion prior to sequencing; even if this was conducted without being mentioned, that still doesn't explain the very low level of non-ribosomal bacterial reads. In general, I'd expect a region with elevated enteric viral pathogens to also show elevated enteric bacterial pathogens, so this absence is difficult to explain with a catchment-based theory. Instead, it points towards the viral enrichment protocols used by this study as potentially achieving genuinely better results than other methods we've tried.
+
+This certainly isn't conclusive evidence, or close to it. However, based on these results, I'd be very interested in learning more about the details of Yang et al.'s approach to viral enrichment, and in seeing it tried by another group in a different context to find out whether these impressive results can be replicated.