Continuing my analysis of datasets from the P2RA preprint, I analyzed the data from Maritz et al. (2019), a study that used DNA sequencing of wastewater samples to characterize protist diversity and temporal diversity in New York City. Samples for this study underwent direct DNA extraction without a dedicated concentration step, then underwent library prep and Illumina sequencing.
The Maritz data comprised 16 samples, with 8.6M to 18.3M read pairs per samples (mean 10.8M). Taken together, the samples totaled roughly 170M read pairs (84 gigabases of sequence). Read quality was high, adapter content was low (but present), duplicate levels were mostly minimal:
The average fraction of reads lost at each stage in the preprocessing pipeline is shown in the following table. On average, cleaning & deduplication together removed about 8% of total reads, with almost none being removed during subsequent ribodepletion.
# Plot relative read losses during preprocessing
+g_reads_rel<-ggplot(n_reads_rel, aes(x=stage, y=p_reads_lost_abs_marginal,group=sample))+
+geom_col(position="dodge")+scale_fill_na()+
+scale_y_continuous("% Total Reads Lost", expand=c(0,0), labels =function(x)x*100)+
+theme_kit
+g_reads_rel
+
+
+
+
+
+
+
Data cleaning with FASTP was very successful at removing adapters, with no adapter sequences found by FASTQC at any stage after the raw data. FASTP also slightly increased (already high) read quality.
According to FASTQC, those few duplicates that were present were mostly removed during deduplication, with mean and maximum FASTQC-measured duplicate levels falling from 4.2%/16.9% to 2.4%/6.9%:
As before, to assess the high-level composition of the reads, I ran the ribodepleted files through Kraken (using the Standard 16 database) and summarized the results with Bracken. Combining these results with the read counts above gives us a breakdown of the inferred composition of the samples:
Overall, between 44% and 87% of reads (mean 57%) were low-quality, duplicates, or unassigned. 0.1-0.5% (mean 0.4%) were identified as ribosomal, and 11-55% (mean 42%) were otherwise classified as bacterial. The fraction of human reads ranged from 0.1-0.7% (mean 0.3%); that of viral reads ranged from 0.05%-1% (mean 0.15%).
+
This viral fraction is much lower than that seen for the RNA-seq studies I analyzed recently, I assume primarily due to the absence of highly abundant plant viruses. Looking more closely at the viral composition, we see that it is dominated by Suoliviridae and Intestiviridae, two families of CrAss-like phage, as well as Straboviridae, another group of DNA phages:
Next, I investigated the human-infecting virus read content of these unenriched samples, using a pipeline identical to that described in my entry on unenriched Rothman samples. This process identified a total of 318 read pairs across all samples (0.0002% of surviving reads):
I ran these reads through BLASTN in a similar manner to past datasets. The results were surprisingly poor with a significant fraction of high-scoring false-positives:
Taking a break from working on P2RA datasets, we’re also working on a review of air sampling for viral pathogen detection. For that study, we’re collecting and analyzing air MGS data that could give us a high-level idea of the likely viral composition of such samples.
+
The first dataset I’m looking at for this work is Prussin et al. (2019), a study of HVAC filter samples in a Virginia daycare center between 2014 and 2015. Samples were eluted from MERV-14 air filters collected every two weeks, with pairs of successive samples combined into four-week sampling periods. Like Brumfield et al, this study conducted both RNA and DNA sequencing; all samples were sequenced on an Illumina NextSeq500 with 2x150bp reads.
+
The raw data
+
The Prussin dataset comprised sequencing data from 14 timepoints spread across the year, from 20th January 2014 to 2nd February 2015. Each sample represents a four-week sampling period. In addition to the 14 on-site samples, there were also two control samples, a negative control (NC) and an “unexposed filter” control (UFC), which were collected on December 23rd 2014.
The 14 positive samples from the dataset yielded 5M-20M (mean 11.3M) RNA-sequencing reads and 10M-42M (mean 21.0M) DNA-sequencing reads per sample, for a total of 159M RNA read pairs and 294M DNA read pairs (46.3 and 87.0 gigabases of sequence, respectively). Controls contributed an additional 1M RNA read pairs and 17.5M DNA read pairs.
+
In positive samples, read qualities were mostly high but tailed off slightly at the 3’ end in some samples, suggesting the need for trimming. Adapter levels were high. Inferred duplication levels were low (2-4%) in DNA reads and moderate (21-44%) in RNA reads. Control sample reads were more problematic, with higher duplication and adapter levels and lower quality.
The average fraction of reads lost at each stage in the preprocessing pipeline is shown in the following table. For positive samples, on average, cleaning and deduplication removed about 19% of total read pairs from RNA libraries and about 7% from DNA libraries. Subsequent ribodepletion removed a further ~32% of total read pairs on average from RNA libraries but <0.5% of total read pairs from DNA libraries.
+
Control samples, meanwhile, lost an average of 63% of RNA read pairs and 21% of DNA read pairs during cleaning and deduplication, consistent with the lower qualities observed above. Subsequent ribodepletion removed an additional 9% of total RNA read pairs and 0.3% of total DNA read pairs.
In both positive and control samples, data cleaning with FASTP was very successful at removing adapters, with very few adapter sequences found by FASTQC at any stage after the raw data. FASTP was also successful at improving read quality.
According to FASTQC, deduplication was moderately effective at reducing measured duplicate levels in on-site samples, with FASTQC-measured levels falling from an average of 34% to 23% for RNA reads and from 2.7% to 2.2% for DNA reads.
As before, to assess the high-level composition of the reads, I ran the ribodepleted files through Kraken (using the Standard 16 database) and summarized the results with Bracken. Combining these results with the read counts above gives us a breakdown of the inferred composition of the samples:
# Plot composition of minor components
+read_comp_minor<-read_comp_long%>%
+filter(classification%in%c("Archaeal", "Viral", "Other"))
+palette_minor<-brewer.pal(9, "Set1")[c(6,7,9)]
+g_comp_minor<-g_comp_base+geom_col(data=read_comp_minor, position ="stack")+
+scale_y_pc_reads()+
+scale_fill_manual(values=palette_minor, name ="Classification")
+g_comp_minor
What we see is a very different picture from the wastewater samples I’ve been analyzing so far. Most notably, the fraction of (non-ribosomal) human reads is much higher. Even the 2015-01-20 samples, which were taken when the daycare center was closed for the winter holidays, showed human read fractions (for both nucleic-acid types) of >12%; non-control samples as a whole averaged 27% for RNA reads and 34% for DNA reads. Compare Brumfield (average 0.08% for RNA and 0.02% for DNA), Yang (mean 0.05% for RNA) or even Rothman (mean 1.8% for non-panel-enriched RNA samples).
+
Conversely, total viral reads are very low: mean 0.033% for RNA reads and 0.019% for DNA reads. Wastewater RNA datasets have typically had much higher total viruses: mean 0.5% for Brumfield, about the same for Crits-Christoph, 5.5% for Yang, 4.5% for Rothman. Brumfield’s DNA data contained substantially fewer viruses than their RNA data, but still more than Prussin: about 0.08% on average.
+
Looking at viral families…was less informative than usual, especially for DNA reads. It turns out that these samples contain a lot of viral reads that Kraken2 was only able to classify to the class level. In DNA reads, samples were dominated by Caudoviricetes phages, though Herviviricetes (which includes herpesviruses) and Naldaviricetes (a class of arthropod-infecting viruses) also put in a respectable showing. In RNA reads, Caudoviricetes was again a major presence, but Alsuviricetes (a family of primarily plant pathogens) was often as or more prevalent, and Revtraviricetes (a class that includes Hepatitis B virus and retroviruses) was also significant.
Of these, the most interesting are the strong presence of Herviviricetes in the DNA reads and Revtraviricetes in the RNA reads, as both of these are families that contain important human pathogens.
+
Digging into the former, it turns out these reads are composed almost exclusively of Herpesviridae at the family level. Within non-control samples, these arise overwhelmingly from Cytomegalovirus. Digging in at the species level, these in turn are primarily attributed to a single CMV, Human betaherpesvirus 5, a.k.a. human cytomegalovirus (HCMV). I was excited to see this: this is the first time a single human pathogen, or even all human pathogens combined, have constituted a significant fraction of all viral reads in a sample I’ve analyzed with this pipeline.
Revtraviricetes reads are similarly dominated by a single viral genus, Alpharetrovirus. Digging in at the species level, we see contributions from a variety of avian oncoviruses. To my (admittedly non-expert) knowledge, none of these infect humans, and I think they are probably primarily arising from local birds (or possibly rats).
Next, I investigated the human-infecting virus read content of these unenriched samples. Using the same workflow I used for Brumfield et al, I identified 2811 RNA read pairs and 12792 DNA read pairs as putatively human viral: 0.003% and 0.005% of surviving reads, respectively.
These results look good on visual inspection, and indeed precision and sensitivity are both very high. For a disjunctive score threshold of 20, my updated workflow achieves an F1 score of 96.7% for RNA sequences and 98.2% for DNA sequences.
Looking into the composition of different read groups, nothing stands out except the predominance of HCMV (human betaherpesvirus 5), which is consistent with the Kraken results above and borne out by the BLASTN alignments:
In non-control samples, applying a disjunctive cutoff at S=20 identifies 2591 RNA reads and 12236 DNA reads as human-viral. This gives an overall relative HV abundance of \(1.63 \times 10^{-5}\) for RNA reads and \(4.16 \times 10^{-5}\) for DNA reads. Reassuringly, relative abundances in control samples are at least 10x lower. We also see a sharp drop in relative abundance for both nucleic-acid types for the period when the daycare was closed (2015-01-20):
+
+Code
# Visualize
+g_phv_agg<-ggplot(read_counts_agg, aes(x=date, color=na_type))+
+geom_point(aes(y=p_reads_hv))+
+scale_y_log10("Relative abundance of human virus reads")+
+scale_x_discrete(name="Collection Date")+
+facet_grid(.~ctrl, scales ="free", space ="free_x")+
+scale_color_na()+theme_rotate
+g_phv_agg
+
+
+
+
+
+
+
These overall RA values are similar to those we’ve seen previously for non-panel-enriched wastewater RNA data. That said, it’s notable that the DNA read RA seen here is much higher than that seen in the only DNA wastewater dataset I’ve analyzed so far (Brumfield):
At the family level, we see that Papillomaviridae, Herpesviridae, and Polyomaviridae are the most abundant families in both DNA and RNA reads, with Adenoviridae and (in RNA reads) Picornaviridae also making a respectable showing. The Herpesviridae reads are, predictably, overwhelmingly from HCMV, while the Papillomaviridae and Polyomaviridae reads are split up among a larger number of related viruses:
+
+Code
# Get viral taxon names for putative HV reads
+viral_taxa$name[viral_taxa$taxid==249588]<-"Mamastrovirus"
+viral_taxa$name[viral_taxa$taxid==194960]<-"Kobuvirus"
+viral_taxa$name[viral_taxa$taxid==688449]<-"Salivirus"
+viral_taxa$name[viral_taxa$taxid==585893]<-"Picobirnaviridae"
+viral_taxa$name[viral_taxa$taxid==333922]<-"Betapapillomavirus"
+viral_taxa$name[viral_taxa$taxid==334207]<-"Betapapillomavirus 3"
+viral_taxa$name[viral_taxa$taxid==369960]<-"Porcine type-C oncovirus"
+viral_taxa$name[viral_taxa$taxid==333924]<-"Betapapillomavirus 2"
+viral_taxa$name[viral_taxa$taxid==687329]<-"Anelloviridae"
+viral_taxa$name[viral_taxa$taxid==325455]<-"Gammapapillomavirus"
+viral_taxa$name[viral_taxa$taxid==333750]<-"Alphapapillomavirus"
+viral_taxa$name[viral_taxa$taxid==694002]<-"Betacoronavirus"
+viral_taxa$name[viral_taxa$taxid==334202]<-"Mupapillomavirus"
+viral_taxa$name[viral_taxa$taxid==197911]<-"Alphainfluenzavirus"
+viral_taxa$name[viral_taxa$taxid==186938]<-"Respirovirus"
+viral_taxa$name[viral_taxa$taxid==333926]<-"Gammapapillomavirus 1"
+viral_taxa$name[viral_taxa$taxid==337051]<-"Betapapillomavirus 1"
+viral_taxa$name[viral_taxa$taxid==337043]<-"Alphapapillomavirus 4"
+viral_taxa$name[viral_taxa$taxid==694003]<-"Betacoronavirus 1"
+viral_taxa$name[viral_taxa$taxid==334204]<-"Mupapillomavirus 2"
+viral_taxa$name[viral_taxa$taxid==334208]<-"Betapapillomavirus 4"
+
+viral_taxa$name[viral_taxa$taxid==334208]<-"Betapapillomavirus 4"
+viral_taxa$name[viral_taxa$taxid==334208]<-"Betapapillomavirus 4"
+viral_taxa$name[viral_taxa$taxid==334208]<-"Betapapillomavirus 4"
+
+
+mrg_hv_named<-mrg_hv%>%left_join(viral_taxa, by="taxid")
+
+# Discover viral species & genera for HV reads
+raise_rank<-function(read_db, taxid_db, out_rank="species", verbose=FALSE){
+# Get higher ranks than search rank
+ranks<-c("subspecies", "species", "subgenus", "genus", "subfamily", "family", "suborder", "order", "class", "subphylum", "phylum", "kingdom", "superkingdom")
+rank_match<-which.max(ranks==out_rank)
+high_ranks<-ranks[rank_match:length(ranks)]
+# Merge read DB and taxid DB
+reads<-read_db%>%select(-parent_taxid, -rank, -name)%>%
+left_join(taxid_db, by="taxid")
+# Extract sequences that are already at appropriate rank
+reads_rank<-filter(reads, rank==out_rank)
+# Drop sequences at a higher rank and return unclassified sequences
+reads_norank<-reads%>%filter(rank!=out_rank, !rank%in%high_ranks, !is.na(taxid))
+while(nrow(reads_norank)>0){# As long as there are unclassified sequences...
+# Promote read taxids and re-merge with taxid DB, then re-classify and filter
+reads_remaining<-reads_norank%>%mutate(taxid =parent_taxid)%>%
+select(-parent_taxid, -rank, -name)%>%
+left_join(taxid_db, by="taxid")
+reads_rank<-reads_remaining%>%filter(rank==out_rank)%>%
+bind_rows(reads_rank)
+reads_norank<-reads_remaining%>%
+filter(rank!=out_rank, !rank%in%high_ranks, !is.na(taxid))
+}
+# Finally, extract and append reads that were excluded during the process
+reads_dropped<-reads%>%filter(!seq_id%in%reads_rank$seq_id)
+reads_out<-reads_rank%>%bind_rows(reads_dropped)%>%
+select(-parent_taxid, -rank, -name)%>%
+left_join(taxid_db, by="taxid")
+return(reads_out)
+}
+hv_reads_species<-raise_rank(mrg_hv_named, viral_taxa, "species")
+hv_reads_genus<-raise_rank(mrg_hv_named, viral_taxa, "genus")
+hv_reads_family<-raise_rank(mrg_hv_named, viral_taxa, "family")
Compared to the previous datasets I’ve analyzed, the most notable difference is the absence of enteric viruses: Norovirus, Rotavirus, and Enterovirus are all absent from the list of abundant human-viral taxa, as are ~all other gastrointestinal pathogens.
+
Finally, here are the overall relative abundances of a set of specific viral genera picked out manually on the basis of scientific or medical interest. In the future, I’ll quantify the RA of these genera across all datasets analyzed with this pipeline to date, which should give us a better sense of how these data compare to others’ under this pipeline.
Warning: Transformation introduced infinite values in continuous x-axis
+
+
+
+
+
+
+
+
Conclusion
+
This is the first air-sampling dataset I’ve analyzed using this pipeline, and it was interesting to see the differences from the wastewater datasets I’ve been analyzing so far. Among the more striking differences were:
+
+
A much higher total fraction of human reads;
+
A lower total fraction of viral reads;
+
Near-total absence of enteric viruses and Tobamovirus;
+
Much higher relative abundance of herpesviruses, particularly HCMV, and papillomaviruses among human-infecting virus reads.
+
+
Conversely, one thing that was not notably different, at least in RNA viruses, was the total relative abundance of human-infecting viruses as a whole. Given the lower fraction of reads made up of all viruses, this means that the fraction of total viruses arising from human-infecting viruses is much higher here than we’ve historically seen with wastewater data. In particular, HCMV represents a nontrivial fraction of total viruses for many DNA libraries, the first time I’ve seen a human pathogen show up significantly in the total virus composition data.
+
Going forward, I have two more air sampling datasets to analyze, Rosario et al. 2018 and Leung et al. 2021. It will be interesting to see whether the findings from this dataset generalize to other air sampling contexts.
+
+
+
+
+
Source Code
+
---
+title: "Workflow analysis of Prussin et al. (2019)"
+subtitle: "Wastewater from a manhole in Maryland."
+author: "Will Bradshaw"
+date: 2024-04-12
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ code-link: true
+ df-print: paged
+editor: visual
+title-block-banner: black
+---
+
+```{r}
+#| label: load-packages
+#| include: false
+library(tidyverse)
+library(cowplot)
+library(patchwork)
+library(fastqcr)
+library(RColorBrewer)
+source("../scripts/aux_plot-theme.R")
+theme_base <- theme_base +theme(aspect.ratio =NULL)
+theme_rotate <- theme_base +theme(
+axis.text.x =element_text(hjust =1, angle =45),
+)
+theme_kit <- theme_rotate +theme(
+axis.title.x =element_blank(),
+)
+tnl <-theme(legend.position ="none")
+```
+
+Taking a break from working on [P2RA datasets](https://doi.org/10.1101/2023.12.22.23300450), we're also working on a review of air sampling for viral pathogen detection. For that study, we're collecting and analyzing air MGS data that could give us a high-level idea of the likely viral composition of such samples.
+
+The first dataset I'm looking at for this work is [Prussin et al. (2019)](https://doi.org/10.1186/s40168-019-0672-z), a study of HVAC filter samples in a Virginia daycare center between 2014 and 2015. Samples were eluted from MERV-14 air filters collected every two weeks, with pairs of successive samples combined into four-week sampling periods. Like Brumfield et al, this study conducted both RNA and DNA sequencing; all samples were sequenced on an Illumina NextSeq500 with 2x150bp reads.
+
+# The raw data
+
+The Prussin dataset comprised sequencing data from 14 timepoints spread across the year, from 20th January 2014 to 2nd February 2015. Each sample represents a four-week sampling period. In addition to the 14 on-site samples, there were also two control samples, a negative control (NC) and an "unexposed filter" control (UFC), which were collected on December 23rd 2014.
+
+```{r}
+#| warning: false
+#| label: read-qc-data
+
+# Data input paths
+data_dir <-"../data/2024-04-11_prussin/"
+libraries_path <-file.path(data_dir, "sample-metadata.csv")
+basic_stats_path <-file.path(data_dir, "qc_basic_stats.tsv")
+adapter_stats_path <-file.path(data_dir, "qc_adapter_stats.tsv")
+quality_base_stats_path <-file.path(data_dir, "qc_quality_base_stats.tsv")
+quality_seq_stats_path <-file.path(data_dir, "qc_quality_sequence_stats.tsv")
+
+# Import libraries and extract metadata from sample names
+libraries_raw <-read_csv(libraries_path, show_col_types =FALSE)
+libraries <- libraries_raw %>%
+arrange(desc(na_type)) %>%mutate(na_type =fct_inorder(na_type)) %>%
+arrange(date) %>%rename(start_date = date) %>%
+mutate(end_date = start_date +28) %>%
+mutate(date =fct_inorder(as.character(end_date)),
+ctrl =ifelse(grepl("Negative_Control", sample_alias), "NC",
+ifelse(grepl("Unexposed_Filter", sample_alias),
+"UFC", "Non-Control")),
+ctrl =factor(ctrl, levels =c("Non-Control",
+"NC",
+"UFC")),
+open = (season !="Closed") & (ctrl =="On-Site"))
+
+# Import QC data
+stages <-c("raw_concat", "cleaned", "dedup", "ribo_initial", "ribo_secondary")
+basic_stats <-read_tsv(basic_stats_path, show_col_types =FALSE) %>%
+inner_join(libraries, by="sample") %>%
+mutate(stage =factor(stage, levels = stages),
+sample =fct_inorder(sample))
+adapter_stats <-read_tsv(adapter_stats_path, show_col_types =FALSE) %>%
+mutate(sample =sub("_2$", "", sample)) %>%
+inner_join(libraries, by="sample") %>%
+mutate(stage =factor(stage, levels = stages),
+read_pair =fct_inorder(as.character(read_pair)))
+quality_base_stats <-read_tsv(quality_base_stats_path, show_col_types =FALSE) %>%
+inner_join(libraries, by="sample") %>%
+mutate(stage =factor(stage, levels = stages),
+read_pair =fct_inorder(as.character(read_pair)))
+quality_seq_stats <-read_tsv(quality_seq_stats_path, show_col_types =FALSE) %>%
+inner_join(libraries, by="sample") %>%
+mutate(stage =factor(stage, levels = stages),
+read_pair =fct_inorder(as.character(read_pair)))
+
+# Filter to raw data
+basic_stats_raw <- basic_stats %>%filter(stage =="raw_concat")
+adapter_stats_raw <- adapter_stats %>%filter(stage =="raw_concat")
+quality_base_stats_raw <- quality_base_stats %>%filter(stage =="raw_concat")
+quality_seq_stats_raw <- quality_seq_stats %>%filter(stage =="raw_concat")
+
+# Get key values for readout
+raw_read_counts <- basic_stats_raw %>%group_by(na_type, ctrl) %>%
+summarize(rmin =min(n_read_pairs), rmax=max(n_read_pairs),
+rmean=mean(n_read_pairs), .groups ="drop")
+raw_read_totals <- basic_stats_raw %>%group_by(na_type, ctrl) %>%
+summarize(n_read_pairs =sum(n_read_pairs),
+n_bases_approx =sum(n_bases_approx), .groups ="drop")
+raw_dup <- basic_stats_raw %>%group_by(na_type, ctrl) %>%
+summarize(dmin =min(percent_duplicates), dmax=max(percent_duplicates),
+dmean=mean(percent_duplicates), .groups ="drop")
+
+```
+
+The 14 positive samples from the dataset yielded 5M-20M (mean 11.3M) RNA-sequencing reads and 10M-42M (mean 21.0M) DNA-sequencing reads per sample, for a total of 159M RNA read pairs and 294M DNA read pairs (46.3 and 87.0 gigabases of sequence, respectively). Controls contributed an additional 1M RNA read pairs and 17.5M DNA read pairs.
+
+In positive samples, read qualities were mostly high but tailed off slightly at the 3' end in some samples, suggesting the need for trimming. Adapter levels were high. Inferred duplication levels were low (2-4%) in DNA reads and moderate (21-44%) in RNA reads. Control sample reads were more problematic, with higher duplication and adapter levels and lower quality.
+
+```{r}
+#| fig-width: 9
+#| warning: false
+#| label: plot-basic-stats
+
+# Prepare data
+basic_stats_raw_metrics <- basic_stats_raw %>%
+select(date, na_type, ctrl,
+`# Read pairs`= n_read_pairs,
+`Total base pairs\n(approx)`= n_bases_approx,
+`% Duplicates\n(FASTQC)`= percent_duplicates) %>%
+pivot_longer(-(date:ctrl), names_to ="metric", values_to ="value") %>%
+mutate(metric =fct_inorder(metric))
+
+# Set up plot templates
+scale_fill_na <- purrr::partial(scale_fill_brewer, palette="Set1",
+name="Nucleic acid type")
+g_basic <-ggplot(basic_stats_raw_metrics,
+aes(x=date, y=value, fill=na_type)) +
+geom_col(position ="dodge") +
+scale_x_discrete(name="Collection Date") +
+scale_y_continuous(expand=c(0,0)) +
+expand_limits(y=c(0,100)) +
+scale_fill_na() +
+facet_grid(metric~ctrl, scales ="free", space="free_x", switch="y") +
+ theme_rotate +theme(
+axis.title.y =element_blank(),
+strip.text.y =element_text(face="plain")
+ )
+g_basic
+```
+
+```{r}
+#| label: plot-raw-quality
+
+# Set up plotting templates
+scale_color_na <- purrr::partial(scale_color_brewer, palette="Set1",
+name="Nucleic acid type")
+g_qual_raw <-ggplot(mapping=aes(color=na_type, linetype=read_pair,
+group=interaction(sample,read_pair))) +
+scale_color_na() +scale_linetype_discrete(name ="Read Pair") +
+guides(color=guide_legend(nrow=2,byrow=TRUE),
+linetype =guide_legend(nrow=2,byrow=TRUE)) +
+ theme_base
+
+# Visualize adapters
+g_adapters_raw <- g_qual_raw +
+geom_line(aes(x=position, y=pc_adapters), data=adapter_stats_raw) +
+scale_y_continuous(name="% Adapters", limits=c(0,NA),
+breaks =seq(0,100,10), expand=c(0,0)) +
+scale_x_continuous(name="Position", limits=c(0,NA),
+breaks=seq(0,140,20), expand=c(0,0)) +
+facet_grid(ctrl~adapter)
+g_adapters_raw
+
+# Visualize quality
+g_quality_base_raw <- g_qual_raw +
+geom_hline(yintercept=25, linetype="dashed", color="red") +
+geom_hline(yintercept=30, linetype="dashed", color="red") +
+geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats_raw) +
+scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
+scale_x_continuous(name="Position", limits=c(0,NA),
+breaks=seq(0,140,20), expand=c(0,0)) +
+facet_grid(ctrl~.)
+g_quality_base_raw
+
+g_quality_seq_raw <- g_qual_raw +
+geom_vline(xintercept=25, linetype="dashed", color="red") +
+geom_vline(xintercept=30, linetype="dashed", color="red") +
+geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats_raw) +
+scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
+scale_y_continuous(name="# Sequences", expand=c(0,0)) +
+facet_grid(ctrl~., scales ="free_y")
+g_quality_seq_raw
+```
+
+# Preprocessing
+
+The average fraction of reads lost at each stage in the preprocessing pipeline is shown in the following table. For positive samples, on average, cleaning and deduplication removed about 19% of total read pairs from RNA libraries and about 7% from DNA libraries. Subsequent ribodepletion removed a further \~32% of total read pairs on average from RNA libraries but \<0.5% of total read pairs from DNA libraries.
+
+Control samples, meanwhile, lost an average of 63% of RNA read pairs and 21% of DNA read pairs during cleaning and deduplication, consistent with the lower qualities observed above. Subsequent ribodepletion removed an additional 9% of total RNA read pairs and 0.3% of total DNA read pairs.
+
+```{r}
+#| label: preproc-table
+n_reads_rel <- basic_stats %>%
+select(sample, date, na_type, ctrl, stage,
+ percent_duplicates, n_read_pairs) %>%
+group_by(sample, na_type) %>%arrange(sample, na_type, stage) %>%
+mutate(p_reads_retained =replace_na(n_read_pairs /lag(n_read_pairs), 0),
+p_reads_lost =1- p_reads_retained,
+p_reads_retained_abs = n_read_pairs / n_read_pairs[1],
+p_reads_lost_abs =1-p_reads_retained_abs,
+p_reads_lost_abs_marginal =replace_na(p_reads_lost_abs -lag(p_reads_lost_abs), 0))
+n_reads_rel_display <- n_reads_rel %>%
+rename(Stage=stage, `NA Type`=na_type, `Control?`=ctrl) %>%
+group_by(`Control?`, `NA Type`, Stage) %>%
+summarize(`% Total Reads Lost (Cumulative)`=paste0(round(min(p_reads_lost_abs*100),1), "-", round(max(p_reads_lost_abs*100),1), " (mean ", round(mean(p_reads_lost_abs*100),1), ")"),
+`% Total Reads Lost (Marginal)`=paste0(round(min(p_reads_lost_abs_marginal*100),1), "-", round(max(p_reads_lost_abs_marginal*100),1), " (mean ", round(mean(p_reads_lost_abs_marginal*100),1), ")"), .groups="drop") %>%
+filter(Stage !="raw_concat") %>%
+mutate(Stage = Stage %>% as.numeric %>%factor(labels=c("Trimming & filtering", "Deduplication", "Initial ribodepletion", "Secondary ribodepletion")))
+n_reads_rel_display
+```
+
+```{r}
+#| label: preproc-figures
+#| warning: false
+#| fig-height: 8
+#| fig-width: 6
+
+g_stage_trace <-ggplot(basic_stats, aes(x=stage, color=na_type, group=sample)) +
+scale_color_na() +
+facet_wrap(ctrl~na_type, scales="free", ncol=2,
+labeller =label_wrap_gen(multi_line=FALSE)) +
+ theme_kit
+
+# Plot reads over preprocessing
+g_reads_stages <- g_stage_trace +
+geom_line(aes(y=n_read_pairs)) +
+scale_y_continuous("# Read pairs", expand=c(0,0), limits=c(0,NA))
+g_reads_stages
+
+# Plot relative read losses during preprocessing
+g_reads_rel <-ggplot(n_reads_rel, aes(x=stage, color=na_type, group=sample)) +
+geom_line(aes(y=p_reads_lost_abs_marginal)) +
+scale_y_continuous("% Total Reads Lost", expand=c(0,0),
+labels =function(x) x*100) +
+scale_color_na() +
+facet_wrap(ctrl~na_type, scales="free", ncol=2,
+labeller =label_wrap_gen(multi_line=FALSE)) +
+ theme_kit
+g_reads_rel
+```
+
+In both positive and control samples, data cleaning with FASTP was very successful at removing adapters, with very few adapter sequences found by FASTQC at any stage after the raw data. FASTP was also successful at improving read quality.
+
+```{r}
+#| warning: false
+#| label: plot-quality
+#| fig-height: 7
+
+g_qual <-ggplot(mapping=aes(color=na_type, linetype=read_pair,
+group=interaction(sample,read_pair))) +
+scale_color_na() +scale_linetype_discrete(name ="Read Pair") +
+guides(color=guide_legend(nrow=2,byrow=TRUE),
+linetype =guide_legend(nrow=2,byrow=TRUE)) +
+ theme_base
+
+# Visualize adapters
+g_adapters <- g_qual +
+geom_line(aes(x=position, y=pc_adapters), data=adapter_stats) +
+scale_y_continuous(name="% Adapters", limits=c(0,20),
+breaks =seq(0,50,10), expand=c(0,0)) +
+scale_x_continuous(name="Position", limits=c(0,NA),
+breaks=seq(0,140,20), expand=c(0,0)) +
+facet_grid(stage~adapter)
+g_adapters
+
+# Visualize quality
+g_quality_base <- g_qual +
+geom_hline(yintercept=25, linetype="dashed", color="red") +
+geom_hline(yintercept=30, linetype="dashed", color="red") +
+geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats) +
+scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
+scale_x_continuous(name="Position", limits=c(0,NA),
+breaks=seq(0,140,20), expand=c(0,0)) +
+facet_grid(stage~.)
+g_quality_base
+
+g_quality_seq <- g_qual +
+geom_vline(xintercept=25, linetype="dashed", color="red") +
+geom_vline(xintercept=30, linetype="dashed", color="red") +
+geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats) +
+scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
+scale_y_continuous(name="# Sequences", expand=c(0,0)) +
+facet_grid(stage~.)
+g_quality_seq
+```
+
+According to FASTQC, deduplication was moderately effective at reducing measured duplicate levels in on-site samples, with FASTQC-measured levels falling from an average of 34% to 23% for RNA reads and from 2.7% to 2.2% for DNA reads.
+
+```{r}
+#| label: preproc-dedup
+#| fig-height: 8
+#| fig-width: 6
+
+stage_dup <- basic_stats %>%group_by(na_type, ctrl, stage) %>%
+summarize(dmin =min(percent_duplicates), dmax=max(percent_duplicates),
+dmean=mean(percent_duplicates), .groups ="drop")
+
+g_dup_stages <- g_stage_trace +
+geom_line(aes(y=percent_duplicates)) +
+scale_y_continuous("% Duplicates", limits=c(0,NA), expand=c(0,0))
+g_dup_stages
+
+g_readlen_stages <- g_stage_trace +geom_line(aes(y=mean_seq_len)) +
+scale_y_continuous("Mean read length (nt)", expand=c(0,0), limits=c(0,NA))
+g_readlen_stages
+```
+
+# High-level composition
+
+As before, to assess the high-level composition of the reads, I ran the ribodepleted files through Kraken (using the Standard 16 database) and summarized the results with Bracken. Combining these results with the read counts above gives us a breakdown of the inferred composition of the samples:
+
+```{r}
+#| label: prepare-composition
+
+# Import Bracken data
+bracken_path <-file.path(data_dir, "bracken_counts.tsv")
+bracken <-read_tsv(bracken_path, show_col_types =FALSE)
+total_assigned <- bracken %>%group_by(sample) %>%summarize(
+name ="Total",
+kraken_assigned_reads =sum(kraken_assigned_reads),
+added_reads =sum(added_reads),
+new_est_reads =sum(new_est_reads),
+fraction_total_reads =sum(fraction_total_reads)
+)
+bracken_spread <- bracken %>%select(name, sample, new_est_reads) %>%
+mutate(name =tolower(name)) %>%
+pivot_wider(id_cols ="sample", names_from ="name",
+values_from ="new_est_reads")
+
+# Count reads
+read_counts_preproc <- basic_stats %>%
+select(sample, na_type, ctrl, date, stage, n_read_pairs) %>%
+pivot_wider(id_cols =c("sample", "na_type", "ctrl", "date"),
+names_from="stage", values_from="n_read_pairs")
+read_counts <- read_counts_preproc %>%
+inner_join(total_assigned %>%select(sample, new_est_reads), by ="sample") %>%
+rename(assigned = new_est_reads) %>%
+inner_join(bracken_spread, by="sample")
+
+# Assess composition
+read_comp <-transmute(read_counts, sample=sample, na_type=na_type,
+ctrl = ctrl, date = date,
+n_filtered = raw_concat-cleaned,
+n_duplicate = cleaned-dedup,
+n_ribosomal = (dedup-ribo_initial) + (ribo_initial-ribo_secondary),
+n_unassigned = ribo_secondary-assigned,
+n_bacterial = bacteria,
+n_archaeal = archaea,
+n_viral = viruses,
+n_human = eukaryota)
+read_comp_long <-pivot_longer(read_comp, -(sample:date),
+names_to ="classification",
+names_prefix ="n_", values_to ="n_reads") %>%
+mutate(classification =fct_inorder(str_to_sentence(classification))) %>%
+group_by(sample) %>%mutate(p_reads = n_reads/sum(n_reads))
+
+# Summarize composition
+read_comp_summ <- read_comp_long %>%
+group_by(na_type, ctrl, classification) %>%
+summarize(n_reads =sum(n_reads), .groups ="drop_last") %>%
+mutate(n_reads =replace_na(n_reads,0),
+p_reads = n_reads/sum(n_reads),
+pc_reads = p_reads*100)
+```
+
+```{r}
+#| label: plot-composition-all
+
+# Prepare plotting templates
+g_comp_base <-ggplot(mapping=aes(x=date, y=p_reads, fill=classification)) +
+scale_x_discrete(name="Collection Date") +
+facet_grid(na_type~ctrl, scales ="free", space ="free_x") +
+ theme_rotate
+scale_y_pc_reads <- purrr::partial(scale_y_continuous, name ="% Reads",
+expand =c(0,0), labels =function(y) y*100)
+
+# Plot overall composition
+g_comp <- g_comp_base +geom_col(data = read_comp_long, position ="stack") +
+scale_y_pc_reads(limits =c(0,1.01), breaks =seq(0,1,0.2)) +
+scale_fill_brewer(palette ="Set1", name ="Classification")
+g_comp
+
+# Plot composition of minor components
+read_comp_minor <- read_comp_long %>%
+filter(classification %in%c("Archaeal", "Viral", "Other"))
+palette_minor <-brewer.pal(9, "Set1")[c(6,7,9)]
+g_comp_minor <- g_comp_base +geom_col(data=read_comp_minor, position ="stack") +
+scale_y_pc_reads() +
+scale_fill_manual(values=palette_minor, name ="Classification")
+g_comp_minor
+
+```
+
+```{r}
+#| label: composition-summary
+
+p_reads_summ_group <- read_comp_long %>%
+mutate(classification =ifelse(classification %in%c("Filtered", "Duplicate", "Unassigned"), "Excluded", as.character(classification)),
+classification =fct_inorder(classification)) %>%
+group_by(classification, sample, na_type, ctrl) %>%
+summarize(p_reads =sum(p_reads), .groups ="drop") %>%
+group_by(classification, na_type, ctrl) %>%
+summarize(pc_min =min(p_reads)*100, pc_max =max(p_reads)*100,
+pc_mean =mean(p_reads)*100, .groups ="drop")
+p_reads_summ_prep <- p_reads_summ_group %>%
+mutate(classification =fct_inorder(classification),
+pc_min = pc_min %>%signif(digits=2) %>%sapply(format, scientific=FALSE, trim=TRUE, digits=2),
+pc_max = pc_max %>%signif(digits=2) %>%sapply(format, scientific=FALSE, trim=TRUE, digits=2),
+pc_mean = pc_mean %>%signif(digits=2) %>%sapply(format, scientific=FALSE, trim=TRUE, digits=2),
+display =paste0(pc_min, "-", pc_max, "% (mean ", pc_mean, "%)"))
+p_reads_summ <- p_reads_summ_prep %>%
+select(ctrl, classification, na_type, display) %>%
+pivot_wider(names_from=na_type, values_from = display) %>%
+arrange(ctrl, classification)
+p_reads_summ
+```
+
+What we see is a very different picture from the wastewater samples I've been analyzing so far. Most notably, the fraction of (non-ribosomal) human reads is *much* higher. Even the 2015-01-20 samples, which were taken when the daycare center was closed for the winter holidays, showed human read fractions (for both nucleic-acid types) of \>12%; non-control samples as a whole averaged 27% for RNA reads and 34% for DNA reads. Compare [Brumfield](https://data.securebio.org/wills-public-notebook/notebooks/2024-04-08_brumfield.html) (average 0.08% for RNA and 0.02% for DNA), [Yang](https://data.securebio.org/wills-public-notebook/notebooks/2024-03-16_yang.html) (mean 0.05% for RNA) or even [Rothman](https://data.securebio.org/wills-public-notebook/notebooks/2024-02-27_rothman-1.html) (mean 1.8% for non-panel-enriched RNA samples).
+
+Conversely, total viral reads are very low: mean 0.033% for RNA reads and 0.019% for DNA reads. Wastewater RNA datasets have typically had much higher total viruses: mean 0.5% for Brumfield, about the same for Crits-Christoph, 5.5% for Yang, 4.5% for Rothman. Brumfield's DNA data contained substantially fewer viruses than their RNA data, but still more than Prussin: about 0.08% on average.
+
+Looking at viral families...was less informative than usual, especially for DNA reads. It turns out that these samples contain a lot of viral reads that Kraken2 was only able to classify to the class level. In DNA reads, samples were dominated by *Caudoviricetes* phages, though *Herviviricetes* (which includes herpesviruses) and *Naldaviricetes* (a class of arthropod-infecting viruses) also put in a respectable showing. In RNA reads, *Caudoviricetes* was again a major presence, but *Alsuviricetes* (a family of primarily plant pathogens) was often as or more prevalent, and *Revtraviricetes* (a class that includes Hepatitis B virus and retroviruses) was also significant.
+
+```{r}
+#| label: extract-viral-taxa
+
+# Get viral taxonomy
+viral_taxa_path <-file.path(data_dir, "viral-taxids.tsv.gz")
+viral_taxa <-read_tsv(viral_taxa_path, show_col_types =FALSE)
+
+# Import Kraken reports & extract viral taxa
+samples <-as.character(basic_stats_raw$sample)
+col_names <-c("pc_reads_total", "n_reads_clade", "n_reads_direct",
+"rank", "taxid", "name")
+report_paths <-paste0(data_dir, "kraken/", samples, ".report.gz")
+kraken_reports <-lapply(1:length(samples), function(n)
+read_tsv(report_paths[n], col_names = col_names, show_col_types =FALSE) %>%
+mutate(sample = samples[n])) %>% bind_rows
+kraken_reports_viral <-filter(kraken_reports, taxid %in% viral_taxa$taxid) %>%
+group_by(sample) %>%
+mutate(p_reads_viral = n_reads_clade/n_reads_clade[1])
+kraken_reports_viral_cleaned <- kraken_reports_viral %>%
+inner_join(libraries, by="sample") %>%
+select(-pc_reads_total, -n_reads_direct) %>%
+select(name, taxid, p_reads_viral, n_reads_clade, everything())
+
+viral_classes <- kraken_reports_viral_cleaned %>%filter(rank =="C")
+viral_families <- kraken_reports_viral_cleaned %>%filter(rank =="F")
+
+```
+
+```{r}
+#| label: viral-class-composition
+
+# Identify major viral classes
+viral_classes_major_tab <- viral_classes %>%
+group_by(name, taxid) %>%
+summarize(p_reads_viral_max =max(p_reads_viral), .groups="drop") %>%
+filter(p_reads_viral_max >=0.04)
+viral_classes_major_list <- viral_classes_major_tab %>%pull(name)
+viral_classes_major <- viral_classes %>%
+filter(name %in% viral_classes_major_list) %>%
+select(name, taxid, sample, na_type, ctrl, date, p_reads_viral)
+viral_classes_minor <- viral_classes_major %>%
+group_by(sample, na_type, ctrl, date) %>%
+summarize(p_reads_viral_major =sum(p_reads_viral), .groups ="drop") %>%
+mutate(name ="Other", taxid=NA, p_reads_viral =1-p_reads_viral_major) %>%
+select(name, taxid, sample, na_type, ctrl, date, p_reads_viral)
+viral_classes_display <-bind_rows(viral_classes_major, viral_classes_minor) %>%
+arrange(desc(p_reads_viral)) %>%
+mutate(name =factor(name, levels=c(viral_classes_major_list, "Other"))) %>%
+rename(p_reads = p_reads_viral, classification=name)
+
+palette_viral <-c(brewer.pal(12, "Set3"), brewer.pal(8, "Set2"))
+g_classes <- g_comp_base +
+geom_col(data=viral_classes_display, position ="stack") +
+scale_y_continuous(name="% Viral Reads", limits=c(0,1.01), breaks =seq(0,1,0.2),
+expand=c(0,0), labels =function(y) y*100) +
+scale_fill_manual(values=palette_viral, name ="Viral class")
+g_classes
+
+```
+
+Of these, the most interesting are the strong presence of *Herviviricetes* in the DNA reads and *Revtraviricetes* in the RNA reads, as both of these are families that contain important human pathogens.
+
+Digging into the former, it turns out these reads are composed almost exclusively of *Herpesviridae* at the family level. Within non-control samples, these arise overwhelmingly from *Cytomegalovirus.* Digging in at the species level, these in turn are primarily attributed to a single CMV, Human betaherpesvirus 5, a.k.a. human cytomegalovirus (HCMV). I was excited to see this: this is the first time a single human pathogen, or even all human pathogens combined, have constituted a significant fraction of all viral reads in a sample I've analyzed with this pipeline.
+
+```{r}
+#| label: herviviricetes
+
+# Get all read counts in class
+hervi_taxid <-2731363
+hervi_desc_taxids_old <- hervi_taxid
+hervi_desc_taxids_new <-unique(c(hervi_desc_taxids_old, viral_taxa %>%filter(parent_taxid %in% hervi_desc_taxids_old) %>%pull(taxid)))
+while (length(hervi_desc_taxids_new) >length(hervi_desc_taxids_old)){
+ hervi_desc_taxids_old <- hervi_desc_taxids_new
+ hervi_desc_taxids_new <-unique(c(hervi_desc_taxids_old, viral_taxa %>%filter(parent_taxid %in% hervi_desc_taxids_old) %>%pull(taxid)))
+}
+hervi_counts <- kraken_reports_viral_cleaned %>%
+filter(taxid %in% hervi_desc_taxids_new) %>%
+mutate(p_reads_hervi = n_reads_clade/n_reads_clade[1])
+
+# Get genus composition
+hervi_genera <- hervi_counts %>%filter(rank =="S", na_type =="DNA")
+hervi_genera_major_tab <- hervi_genera %>%
+group_by(name, taxid) %>%
+summarize(p_reads_hervi_max =max(p_reads_hervi), .groups="drop") %>%
+filter(p_reads_hervi_max >=0.04)
+hervi_genera_major_list <- hervi_genera_major_tab %>%pull(name)
+hervi_genera_major <- hervi_genera %>%
+filter(name %in% hervi_genera_major_list) %>%
+select(name, taxid, sample, na_type, ctrl, date, p_reads_hervi)
+hervi_genera_minor <- hervi_genera_major %>%
+group_by(sample, na_type, ctrl, date) %>%
+summarize(p_reads_hervi_major =sum(p_reads_hervi), .groups ="drop") %>%
+mutate(name ="Other", taxid=NA, p_reads_hervi =1-p_reads_hervi_major) %>%
+select(name, taxid, sample, na_type, ctrl, date, p_reads_hervi)
+hervi_genera_display <-bind_rows(hervi_genera_major, hervi_genera_minor) %>%
+arrange(desc(p_reads_hervi)) %>%
+mutate(name =factor(name, levels=c(hervi_genera_major_list, "Other"))) %>%
+rename(p_reads = p_reads_hervi, classification=name)
+
+# Plot
+g_hervi_genera <- g_comp_base +
+geom_col(data=hervi_genera_display, position ="stack") +
+scale_y_continuous(name="% Herviviricetes Reads", limits=c(0,1.01),
+breaks =seq(0,1,0.2),
+expand=c(0,0), labels =function(y) y*100) +
+scale_fill_manual(values=palette_viral, name ="Viral species")
+g_hervi_genera
+
+```
+
+*Revtraviricetes* reads are similarly dominated by a single viral genus, *Alpharetrovirus.* Digging in at the species level, we see contributions from a variety of avian oncoviruses. To my (admittedly non-expert) knowledge, none of these infect humans, and I think they are probably primarily arising from local birds (or possibly rats).
+
+```{r}
+#| label: revtraviricetes
+
+# Get all read counts in class
+revtra_taxid <-2732514
+revtra_desc_taxids_old <- revtra_taxid
+revtra_desc_taxids_new <-unique(c(revtra_desc_taxids_old, viral_taxa %>%filter(parent_taxid %in% revtra_desc_taxids_old) %>%pull(taxid)))
+while (length(revtra_desc_taxids_new) >length(revtra_desc_taxids_old)){
+ revtra_desc_taxids_old <- revtra_desc_taxids_new
+ revtra_desc_taxids_new <-unique(c(revtra_desc_taxids_old, viral_taxa %>%filter(parent_taxid %in% revtra_desc_taxids_old) %>%pull(taxid)))
+}
+revtra_counts <- kraken_reports_viral_cleaned %>%
+filter(taxid %in% revtra_desc_taxids_new) %>%
+mutate(p_reads_revtra = n_reads_clade/n_reads_clade[1])
+
+# Get genus composition
+revtra_species <- revtra_counts %>%filter(rank =="S", na_type =="RNA")
+revtra_species_major_tab <- revtra_species %>%
+group_by(name, taxid) %>%
+summarize(p_reads_revtra_max =max(p_reads_revtra), .groups="drop") %>%
+filter(p_reads_revtra_max >=0.04)
+revtra_species_major_list <- revtra_species_major_tab %>%pull(name)
+revtra_species_major <- revtra_species %>%
+filter(name %in% revtra_species_major_list) %>%
+select(name, taxid, sample, na_type, ctrl, date, p_reads_revtra)
+revtra_species_minor <- revtra_species_major %>%
+group_by(sample, na_type, ctrl, date) %>%
+summarize(p_reads_revtra_major =sum(p_reads_revtra), .groups ="drop") %>%
+mutate(name ="Other", taxid=NA, p_reads_revtra =1-p_reads_revtra_major) %>%
+select(name, taxid, sample, na_type, ctrl, date, p_reads_revtra)
+revtra_species_display <-bind_rows(revtra_species_major, revtra_species_minor) %>%
+arrange(desc(p_reads_revtra)) %>%
+mutate(name =factor(name, levels=c(revtra_species_major_list, "Other"))) %>%
+rename(p_reads = p_reads_revtra, classification=name)
+
+# Plot
+g_revtra_species <- g_comp_base +
+geom_col(data=revtra_species_display, position ="stack") +
+scale_y_continuous(name="% Revtraviricetes Reads", limits=c(0,1.01),
+breaks =seq(0,1,0.2),
+expand=c(0,0), labels =function(y) y*100) +
+scale_fill_manual(values=palette_viral, name ="Viral species")
+g_revtra_species
+
+```
+
+# Human-infecting virus reads: validation
+
+Next, I investigated the human-infecting virus read content of these unenriched samples. Using the same workflow I used for Brumfield et al, I identified 2811 RNA read pairs and 12792 DNA read pairs as putatively human viral: 0.003% and 0.005% of surviving reads, respectively.
+
+```{r}
+#| label: hv-read-counts
+hv_reads_filtered_path <-file.path(data_dir, "hv_hits_putative_filtered.tsv.gz")
+hv_reads_filtered <-read_tsv(hv_reads_filtered_path, show_col_types =FALSE) %>%
+inner_join(libraries, by="sample") %>%
+arrange(date, na_type)
+n_hv_filtered <- hv_reads_filtered %>%group_by(sample, date, na_type, ctrl) %>% count %>%inner_join(basic_stats %>%filter(stage =="ribo_initial") %>%select(sample, n_read_pairs), by="sample") %>%rename(n_putative = n, n_total = n_read_pairs) %>%mutate(p_reads = n_putative/n_total, pc_reads = p_reads *100)
+n_hv_filtered_summ <- n_hv_filtered %>%group_by(na_type, ctrl) %>%summarize(n_putative =sum(n_putative), n_total =sum(n_total), .groups="drop") %>%mutate(p_reads = n_putative/n_total, pc_reads = p_reads*100)
+```
+
+```{r}
+#| label: plot-hv-scores
+#| warning: false
+#| fig-width: 8
+
+mrg <- hv_reads_filtered %>%
+mutate(kraken_label =ifelse(assigned_hv, "Kraken2 HV\nassignment",
+ifelse(hit_hv, "Kraken2 HV\nhit",
+"No hit or\nassignment"))) %>%
+group_by(sample, na_type) %>%arrange(desc(adj_score_fwd), desc(adj_score_rev)) %>%
+mutate(seq_num =row_number(),
+adj_score_max =pmax(adj_score_fwd, adj_score_rev))
+# Import Bowtie2/Kraken data and perform filtering steps
+g_mrg <-ggplot(mrg, aes(x=adj_score_fwd, y=adj_score_rev)) +
+geom_point(alpha=0.5, shape=16) +
+scale_x_continuous(name="S(forward read)", limits=c(0,65), breaks=seq(0,100,10), expand =c(0,0)) +
+scale_y_continuous(name="S(reverse read)", limits=c(0,65), breaks=seq(0,100,10), expand =c(0,0)) +
+facet_grid(na_type~kraken_label, labeller =labeller(kit =label_wrap_gen(20))) +
+ theme_base +theme(aspect.ratio=1)
+g_mrg
+
+g_hist_0 <-ggplot(mrg, aes(x=adj_score_max)) +geom_histogram(binwidth=5,boundary=0,position="dodge") +facet_grid(na_type~kraken_label, scales="free_y") +scale_x_continuous(name ="Maximum adjusted alignment score") +scale_y_continuous(name="# Read pairs") +scale_fill_brewer(palette ="Dark2") + theme_base +geom_vline(xintercept=20, linetype="dashed", color="red")
+g_hist_0
+```
+
+As previously described, I ran BLASTN on these reads via a dedicated EC2 instance, using the same parameters I've used for previous datasets.
+
+```{r}
+#| label: make-blast-fasta
+mrg_fasta <- mrg %>%
+mutate(seq_head =paste0(">", seq_id)) %>%
+ ungroup %>%
+select(header1=seq_head, seq1=query_seq_fwd,
+header2=seq_head, seq2=query_seq_rev) %>%
+mutate(header1=paste0(header1, "_1"), header2=paste0(header2, "_2"))
+mrg_fasta_out <-do.call(paste, c(mrg_fasta, sep="\n")) %>%paste(collapse="\n")
+write(mrg_fasta_out, file.path(data_dir, "blast/putative-viral.fasta"))
+```
+
+```{r}
+#| label: process-blast-data
+#| warning: false
+
+# Import BLAST results
+# Pre-filtering BLAST results to save space
+# blast_results_path <- file.path(data_dir, "blast/putative-viral.blast.gz")
+# blast_cols <- c("qseqid", "sseqid", "sgi", "staxid", "qlen", "evalue", "bitscore", "qcovs", "length", "pident", "mismatch", "gapopen", "sstrand", "qstart", "qend", "sstart", "send")
+# blast_results <- read_tsv(blast_results_path, show_col_types = FALSE,
+# col_names = blast_cols, col_types = cols(.default="c"))
+blast_results_path <-file.path(data_dir, "blast/putative-viral-best.blast.gz")
+blast_results <-read_tsv(blast_results_path, show_col_types =FALSE)
+
+# Filter for best hit for each query/subject combination
+blast_results_best <- blast_results %>%group_by(qseqid, staxid) %>%
+filter(bitscore ==max(bitscore)) %>%
+filter(length ==max(length)) %>%filter(row_number() ==1)
+
+# Rank hits for each query and filter for high-ranking hits
+blast_results_ranked <- blast_results_best %>%
+group_by(qseqid) %>%mutate(rank =dense_rank(desc(bitscore)))
+blast_results_highrank <- blast_results_ranked %>%filter(rank <=5) %>%
+mutate(read_pair =str_split(qseqid, "_") %>%sapply(nth, n=-1),
+seq_id =str_split(qseqid, "_") %>%sapply(nth, n=1)) %>%
+mutate(bitscore =as.numeric(bitscore))
+
+# Summarize by read pair and taxid
+blast_results_paired <- blast_results_highrank %>%
+group_by(seq_id, staxid) %>%
+summarize(bitscore_max =max(bitscore), bitscore_min =min(bitscore),
+n_reads =n(), .groups ="drop")
+
+# Add viral status
+blast_results_viral <-mutate(blast_results_paired, viral = staxid %in% viral_taxa$taxid) %>%
+mutate(viral_full = viral & n_reads ==2)
+
+# Compare to Kraken & Bowtie assignments
+mrg_assign <- mrg %>%select(sample, seq_id, taxid, assigned_taxid, adj_score_max, na_type)
+blast_results_assign <-left_join(blast_results_viral, mrg_assign, by="seq_id") %>%
+mutate(taxid_match_bowtie = (staxid == taxid),
+taxid_match_kraken = (staxid == assigned_taxid),
+taxid_match_any = taxid_match_bowtie | taxid_match_kraken)
+blast_results_out <- blast_results_assign %>%
+group_by(seq_id) %>%
+summarize(viral_status =ifelse(any(viral_full), 2,
+ifelse(any(taxid_match_any), 2,
+ifelse(any(viral), 1, 0))),
+.groups ="drop")
+```
+
+```{r}
+#| label: plot-blast-results
+#| fig-height: 6
+#| warning: false
+
+# Merge BLAST results with unenriched read data
+mrg_blast <-full_join(mrg, blast_results_out, by="seq_id") %>%
+mutate(viral_status =replace_na(viral_status, 0),
+viral_status_out =ifelse(viral_status ==0, FALSE, TRUE))
+mrg_blast_rna <- mrg_blast %>%filter(na_type =="RNA")
+mrg_blast_dna <- mrg_blast %>%filter(na_type =="DNA")
+
+# Plot RNA
+g_mrg_blast_rna <- mrg_blast_rna %>%
+ggplot(aes(x=adj_score_fwd, y=adj_score_rev, color=viral_status_out)) +
+geom_point(alpha=0.5, shape=16) +
+scale_x_continuous(name="S(forward read)", limits=c(0,65), breaks=seq(0,100,10), expand =c(0,0)) +
+scale_y_continuous(name="S(reverse read)", limits=c(0,65), breaks=seq(0,100,10), expand =c(0,0)) +
+scale_color_brewer(palette ="Set1", name ="Viral status") +
+facet_grid(viral_status_out~kraken_label, labeller =labeller(kit =label_wrap_gen(20))) +
+ theme_base +labs(title="RNA") +
+theme(aspect.ratio=1, plot.title =element_text(size=rel(2), hjust=0))
+g_mrg_blast_rna
+
+# Plot DNA
+g_mrg_blast_dna <- mrg_blast_dna %>%
+ggplot(aes(x=adj_score_fwd, y=adj_score_rev, color=viral_status_out)) +
+geom_point(alpha=0.5, shape=16) +
+scale_x_continuous(name="S(forward read)", limits=c(0,65), breaks=seq(0,100,10), expand =c(0,0)) +
+scale_y_continuous(name="S(reverse read)", limits=c(0,65), breaks=seq(0,100,10), expand =c(0,0)) +
+scale_color_brewer(palette ="Set1", name ="Viral status") +
+facet_grid(viral_status_out~kraken_label, labeller =labeller(kit =label_wrap_gen(20))) +
+ theme_base +labs(title="DNA") +
+theme(aspect.ratio=1, plot.title =element_text(size=rel(2), hjust=0))
+g_mrg_blast_dna
+```
+
+```{r}
+#| label: plot-blast-histogram
+g_hist_1 <-ggplot(mrg_blast, aes(x=adj_score_max)) +geom_histogram(binwidth=5,boundary=0,position="dodge") +facet_grid(na_type~viral_status_out, scales ="free_y") +scale_x_continuous(name ="Maximum adjusted alignment score") +scale_y_continuous(name="# Read pairs") +scale_fill_brewer(palette ="Dark2") + theme_base +geom_vline(xintercept=20, linetype="dashed", color="red")
+g_hist_1
+```
+
+These results look good on visual inspection, and indeed precision and sensitivity are both very high. For a disjunctive score threshold of 20, my updated workflow achieves an F1 score of 96.7% for RNA sequences and 98.2% for DNA sequences.
+
+```{r}
+#| label: plot-f1
+test_sens_spec <-function(tab, score_threshold, conjunctive, include_special){
+if (!include_special) tab <-filter(tab, viral_status_out %in%c("TRUE", "FALSE"))
+ tab_retained <- tab %>%mutate(
+conjunctive = conjunctive,
+retain_score_conjunctive = (adj_score_fwd > score_threshold & adj_score_rev > score_threshold),
+retain_score_disjunctive = (adj_score_fwd > score_threshold | adj_score_rev > score_threshold),
+retain_score =ifelse(conjunctive, retain_score_conjunctive, retain_score_disjunctive),
+retain = assigned_hv | hit_hv | retain_score) %>%
+group_by(viral_status_out, retain) %>% count
+ pos_tru <- tab_retained %>%filter(viral_status_out =="TRUE", retain) %>%pull(n) %>% sum
+ pos_fls <- tab_retained %>%filter(viral_status_out !="TRUE", retain) %>%pull(n) %>% sum
+ neg_tru <- tab_retained %>%filter(viral_status_out !="TRUE", !retain) %>%pull(n) %>% sum
+ neg_fls <- tab_retained %>%filter(viral_status_out =="TRUE", !retain) %>%pull(n) %>% sum
+ sensitivity <- pos_tru / (pos_tru + neg_fls)
+ specificity <- neg_tru / (neg_tru + pos_fls)
+ precision <- pos_tru / (pos_tru + pos_fls)
+ f1 <-2* precision * sensitivity / (precision + sensitivity)
+ out <-tibble(threshold=score_threshold, include_special = include_special,
+conjunctive = conjunctive, sensitivity=sensitivity,
+specificity=specificity, precision=precision, f1=f1)
+return(out)
+}
+range_f1 <-function(intab, inc_special, inrange=15:45){
+ tss <- purrr::partial(test_sens_spec, tab=intab, include_special=inc_special)
+ stats_conj <-lapply(inrange, tss, conjunctive=TRUE) %>% bind_rows
+ stats_disj <-lapply(inrange, tss, conjunctive=FALSE) %>% bind_rows
+ stats_all <-bind_rows(stats_conj, stats_disj) %>%
+pivot_longer(!(threshold:conjunctive), names_to="metric", values_to="value") %>%
+mutate(conj_label =ifelse(conjunctive, "Conjunctive", "Disjunctive"))
+return(stats_all)
+}
+inc_special <-FALSE
+stats_rna <-range_f1(mrg_blast_rna, inc_special) %>%mutate(na_type ="RNA")
+stats_dna <-range_f1(mrg_blast_dna, inc_special) %>%mutate(na_type ="DNA")
+stats_0 <-bind_rows(stats_rna, stats_dna) %>%mutate(attempt=0)
+threshold_opt_0 <- stats_0 %>%group_by(conj_label,attempt,na_type) %>%
+filter(metric =="f1") %>%
+filter(value ==max(value)) %>%filter(threshold ==min(threshold))
+g_stats_0 <-ggplot(stats_0, aes(x=threshold, y=value, color=metric)) +
+geom_vline(data = threshold_opt_0, mapping =aes(xintercept=threshold),
+color ="red", linetype ="dashed") +
+geom_line() +
+scale_y_continuous(name ="Value", limits=c(0,1), breaks =seq(0,1,0.2), expand =c(0,0)) +
+scale_x_continuous(name ="Threshold", expand =c(0,0)) +
+scale_color_brewer(palette="Set3") +
+facet_grid(na_type~conj_label) +
+ theme_base
+g_stats_0
+```
+
+Looking into the composition of different read groups, nothing stands out except the predominance of HCMV (human betaherpesvirus 5), which is consistent with the Kraken results above and borne out by the BLASTN alignments:
+
+```{r}
+#| label: fp
+#| fig-height: 6
+
+viral_taxa$name[viral_taxa$taxid ==211787] <-"Human papillomavirus type 92"
+viral_taxa$name[viral_taxa$taxid ==509154] <-"Porcine endogenous retrovirus C"
+
+major_threshold <-0.05
+fp <- mrg_blast %>%
+group_by(na_type, viral_status_out,
+highscore = adj_score_max >=20, taxid) %>% count %>%
+group_by(na_type, viral_status_out, highscore) %>%mutate(p=n/sum(n)) %>%
+left_join(viral_taxa, by="taxid") %>%
+arrange(desc(p)) %>%
+mutate(name =ifelse(taxid ==194958, "Porcine endogenous retrovirus A", name))
+fp_major_tab <- fp %>%filter(p > major_threshold) %>%arrange(desc(p))
+fp_major_list <- fp_major_tab %>%pull(name) %>% sort %>% unique %>%c(., "Other")
+fp_major <- fp %>%mutate(major = p > major_threshold) %>%
+mutate(name_display =ifelse(major, name, "Other")) %>%
+group_by(na_type, viral_status_out, highscore, name_display) %>%
+summarize(n=sum(n), p=sum(p), .groups ="drop") %>%
+mutate(name_display =factor(name_display, levels = fp_major_list),
+score_display =ifelse(highscore, "S >= 20", "S < 20"),
+status_display =ifelse(viral_status_out, "True positive", "False positive"))
+g_fp <-ggplot(fp_major, aes(x=score_display, y=p, fill=name_display)) +
+geom_col(position="stack") +
+scale_x_discrete(name ="True positive?") +
+scale_y_continuous(name ="% reads", limits =c(0,1.01),
+breaks =seq(0,1,0.2), expand =c(0,0)) +
+scale_fill_manual(values = palette_viral, name ="Viral\ntaxon") +
+facet_grid(na_type~status_display) +
+guides(fill=guide_legend(ncol=3)) +
+ theme_kit
+g_fp
+
+```
+
+```{r}
+#| label: hcmv-blast-hits
+
+# Configure
+ref_taxid <-10359
+
+# Get taxon names
+tax_names_path <-file.path(data_dir, "taxid-names.tsv.gz")
+tax_names <-read_tsv(tax_names_path, show_col_types =FALSE)
+
+# Add missing names
+tax_names_new <-tribble(~staxid, ~name,
+3050295, "Cytomegalovirus humanbeta5",
+459231, "FLAG-tagging vector pFLAG97-TSR")
+tax_names <-bind_rows(tax_names, tax_names_new)
+ref_name <- tax_names %>%filter(staxid == ref_taxid) %>%pull(name)
+
+# Get major matches
+fp_staxid <- mrg_blast %>%filter(taxid == ref_taxid) %>%
+group_by(na_type, highscore = adj_score_max >=20) %>%mutate(n_seq =n()) %>%
+left_join(blast_results_paired, by="seq_id") %>%
+mutate(staxid =as.integer(staxid)) %>%
+left_join(tax_names, by="staxid") %>%rename(sname=name) %>%
+left_join(tax_names %>%rename(taxid=staxid), by="taxid")
+fp_staxid_count <- fp_staxid %>%
+group_by(viral_status_out, highscore, na_type,
+ taxid, name, staxid, sname, n_seq) %>%
+ count %>%
+group_by(viral_status_out, highscore, na_type, taxid, name) %>%
+mutate(p=n/n_seq)
+fp_staxid_count_major <- fp_staxid_count %>%
+filter(n>1, p>0.1, !is.na(staxid)) %>%
+mutate(score_display =ifelse(highscore, "S >= 20", "S < 20"),
+status_display =ifelse(viral_status_out,
+"True positive", "False positive"))
+
+# Plot
+g <-ggplot(fp_staxid_count_major, aes(x=p, y=sname)) +
+geom_col() +
+facet_grid(na_type~status_display+score_display, scales="free",
+labeller =label_wrap_gen(multi_line =FALSE)) +
+scale_x_continuous(name="% mapped reads", limits=c(0,1), breaks=seq(0,1,0.2),
+expand=c(0,0)) +
+labs(title=paste0(ref_name, " (taxid ", ref_taxid, ")")) +
+ theme_base +theme(
+axis.title.y =element_blank(),
+plot.title =element_text(size=rel(1.5), hjust=0, face="plain"))
+g
+```
+
+# Human-infecting viruses: overall relative abundance
+
+```{r}
+#| label: count-hv-reads
+
+# Get raw read counts
+read_counts_raw <- basic_stats_raw %>%
+select(sample, date, na_type, ctrl, n_reads_raw = n_read_pairs)
+
+# Get HV read counts
+mrg_hv <- mrg %>%mutate(hv_status = assigned_hv | hit_hv | adj_score_max >=20)
+read_counts_hv <- mrg_hv %>%filter(hv_status) %>%group_by(sample) %>%
+count(name="n_reads_hv")
+read_counts <- read_counts_raw %>%left_join(read_counts_hv, by="sample") %>%
+mutate(n_reads_hv =replace_na(n_reads_hv, 0))
+
+# Aggregate
+read_counts_total <- read_counts %>%group_by(na_type, ctrl) %>%
+summarize(n_reads_raw =sum(n_reads_raw),
+n_reads_hv =sum(n_reads_hv), .groups="drop") %>%
+mutate(sample="All samples", date ="All dates")
+read_counts_agg <- read_counts %>%arrange(date) %>%
+arrange(sample) %>%
+bind_rows(read_counts_total) %>%
+mutate(sample =fct_inorder(sample),
+p_reads_hv = n_reads_hv/n_reads_raw)
+```
+
+In non-control samples, applying a disjunctive cutoff at S=20 identifies 2591 RNA reads and 12236 DNA reads as human-viral. This gives an overall relative HV abundance of $1.63 \times 10^{-5}$ for RNA reads and $4.16 \times 10^{-5}$ for DNA reads. Reassuringly, relative abundances in control samples are at least 10x lower. We also see a sharp drop in relative abundance for both nucleic-acid types for the period when the daycare was closed (2015-01-20):
+
+```{r}
+#| label: plot-hv-ra
+#| warning: false
+# Visualize
+g_phv_agg <-ggplot(read_counts_agg, aes(x=date, color=na_type)) +
+geom_point(aes(y=p_reads_hv)) +
+scale_y_log10("Relative abundance of human virus reads") +
+scale_x_discrete(name="Collection Date") +
+facet_grid(.~ctrl, scales ="free", space ="free_x") +
+scale_color_na() + theme_rotate
+g_phv_agg
+```
+
+These overall RA values are similar to those we've seen previously for non-panel-enriched wastewater RNA data. That said, it's notable that the DNA read RA seen here is much higher than that seen in the only DNA wastewater dataset I've analyzed so far (Brumfield):
+
+```{r}
+#| label: ra-hv-past
+
+# Collate past RA values
+ra_past <-tribble(~dataset, ~ra, ~na_type, ~panel_enriched,
+"Brumfield", 5e-5, "RNA", FALSE,
+"Brumfield", 3.66e-7, "DNA", FALSE,
+"Spurbeck", 5.44e-6, "RNA", FALSE,
+"Yang", 3.62e-4, "RNA", FALSE,
+"Rothman (unenriched)", 1.87e-5, "RNA", FALSE,
+"Rothman (panel-enriched)", 3.3e-5, "RNA", TRUE,
+"Crits-Christoph (unenriched)", 1.37e-5, "RNA", FALSE,
+"Crits-Christoph (panel-enriched)", 1.26e-2, "RNA", TRUE)
+
+# Collate new RA values
+ra_new <-tribble(~dataset, ~ra, ~na_type, ~panel_enriched,
+"Prussin (non-control)", 1.63e-5, "RNA", FALSE,
+"Prussin (non-control)", 4.16e-5, "DNA", FALSE)
+
+# Plot
+ra_comp <-bind_rows(ra_past, ra_new) %>%mutate(dataset =fct_inorder(dataset))
+g_ra_comp <-ggplot(ra_comp, aes(y=dataset, x=ra, color=na_type)) +
+geom_point() +
+scale_color_na() +
+scale_x_log10(name="Relative abundance of human virus reads") +
+ theme_base +theme(axis.title.y =element_blank())
+g_ra_comp
+```
+
+# Human-infecting viruses: taxonomy and composition
+
+At the family level, we see that *Papillomaviridae*, *Herpesviridae*, and *Polyomaviridae* are the most abundant families in both DNA and RNA reads, with *Adenoviridae* and (in RNA reads) *Picornaviridae* also making a respectable showing. The *Herpesviridae* reads are, predictably, overwhelmingly from HCMV, while the *Papillomaviridae* and *Polyomaviridae* reads are split up among a larger number of related viruses:
+
+```{r}
+#| label: raise-hv-taxa
+
+# Get viral taxon names for putative HV reads
+viral_taxa$name[viral_taxa$taxid ==249588] <-"Mamastrovirus"
+viral_taxa$name[viral_taxa$taxid ==194960] <-"Kobuvirus"
+viral_taxa$name[viral_taxa$taxid ==688449] <-"Salivirus"
+viral_taxa$name[viral_taxa$taxid ==585893] <-"Picobirnaviridae"
+viral_taxa$name[viral_taxa$taxid ==333922] <-"Betapapillomavirus"
+viral_taxa$name[viral_taxa$taxid ==334207] <-"Betapapillomavirus 3"
+viral_taxa$name[viral_taxa$taxid ==369960] <-"Porcine type-C oncovirus"
+viral_taxa$name[viral_taxa$taxid ==333924] <-"Betapapillomavirus 2"
+viral_taxa$name[viral_taxa$taxid ==687329] <-"Anelloviridae"
+viral_taxa$name[viral_taxa$taxid ==325455] <-"Gammapapillomavirus"
+viral_taxa$name[viral_taxa$taxid ==333750] <-"Alphapapillomavirus"
+viral_taxa$name[viral_taxa$taxid ==694002] <-"Betacoronavirus"
+viral_taxa$name[viral_taxa$taxid ==334202] <-"Mupapillomavirus"
+viral_taxa$name[viral_taxa$taxid ==197911] <-"Alphainfluenzavirus"
+viral_taxa$name[viral_taxa$taxid ==186938] <-"Respirovirus"
+viral_taxa$name[viral_taxa$taxid ==333926] <-"Gammapapillomavirus 1"
+viral_taxa$name[viral_taxa$taxid ==337051] <-"Betapapillomavirus 1"
+viral_taxa$name[viral_taxa$taxid ==337043] <-"Alphapapillomavirus 4"
+viral_taxa$name[viral_taxa$taxid ==694003] <-"Betacoronavirus 1"
+viral_taxa$name[viral_taxa$taxid ==334204] <-"Mupapillomavirus 2"
+viral_taxa$name[viral_taxa$taxid ==334208] <-"Betapapillomavirus 4"
+
+viral_taxa$name[viral_taxa$taxid ==334208] <-"Betapapillomavirus 4"
+viral_taxa$name[viral_taxa$taxid ==334208] <-"Betapapillomavirus 4"
+viral_taxa$name[viral_taxa$taxid ==334208] <-"Betapapillomavirus 4"
+
+
+mrg_hv_named <- mrg_hv %>%left_join(viral_taxa, by="taxid")
+
+# Discover viral species & genera for HV reads
+raise_rank <-function(read_db, taxid_db, out_rank ="species", verbose =FALSE){
+# Get higher ranks than search rank
+ ranks <-c("subspecies", "species", "subgenus", "genus", "subfamily", "family", "suborder", "order", "class", "subphylum", "phylum", "kingdom", "superkingdom")
+ rank_match <-which.max(ranks == out_rank)
+ high_ranks <- ranks[rank_match:length(ranks)]
+# Merge read DB and taxid DB
+ reads <- read_db %>%select(-parent_taxid, -rank, -name) %>%
+left_join(taxid_db, by="taxid")
+# Extract sequences that are already at appropriate rank
+ reads_rank <-filter(reads, rank == out_rank)
+# Drop sequences at a higher rank and return unclassified sequences
+ reads_norank <- reads %>%filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))
+while(nrow(reads_norank) >0){ # As long as there are unclassified sequences...
+# Promote read taxids and re-merge with taxid DB, then re-classify and filter
+ reads_remaining <- reads_norank %>%mutate(taxid = parent_taxid) %>%
+select(-parent_taxid, -rank, -name) %>%
+left_join(taxid_db, by="taxid")
+ reads_rank <- reads_remaining %>%filter(rank == out_rank) %>%
+bind_rows(reads_rank)
+ reads_norank <- reads_remaining %>%
+filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))
+ }
+# Finally, extract and append reads that were excluded during the process
+ reads_dropped <- reads %>%filter(!seq_id %in% reads_rank$seq_id)
+ reads_out <- reads_rank %>%bind_rows(reads_dropped) %>%
+select(-parent_taxid, -rank, -name) %>%
+left_join(taxid_db, by="taxid")
+return(reads_out)
+}
+hv_reads_species <-raise_rank(mrg_hv_named, viral_taxa, "species")
+hv_reads_genus <-raise_rank(mrg_hv_named, viral_taxa, "genus")
+hv_reads_family <-raise_rank(mrg_hv_named, viral_taxa, "family")
+```
+
+```{r}
+#| label: hv-family
+#| fig-height: 6
+
+threshold_major_family <-0.06
+
+# Count reads for each human-viral family
+hv_family_counts <- hv_reads_family %>%
+group_by(sample, date, ctrl, na_type, name, taxid) %>%
+count(name ="n_reads_hv") %>%
+group_by(sample, date, ctrl, na_type) %>%
+mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))
+
+# Identify high-ranking families and group others
+hv_family_major_tab <- hv_family_counts %>%group_by(name) %>%
+filter(p_reads_hv ==max(p_reads_hv)) %>%filter(row_number() ==1) %>%
+arrange(desc(p_reads_hv)) %>%filter(p_reads_hv > threshold_major_family)
+hv_family_counts_major <- hv_family_counts %>%
+mutate(name_display =ifelse(name %in% hv_family_major_tab$name, name, "Other")) %>%
+group_by(sample, date, ctrl, na_type, name_display) %>%
+summarize(n_reads_hv =sum(n_reads_hv), p_reads_hv =sum(p_reads_hv),
+.groups="drop") %>%
+mutate(name_display =factor(name_display,
+levels =c(hv_family_major_tab$name, "Other")))
+hv_family_counts_display <- hv_family_counts_major %>%
+rename(p_reads = p_reads_hv, classification = name_display)
+
+# Plot
+g_hv_family <- g_comp_base +
+geom_col(data=hv_family_counts_display, position ="stack") +
+scale_y_continuous(name="% HV Reads", limits=c(0,1.01),
+breaks =seq(0,1,0.2),
+expand=c(0,0), labels =function(y) y*100) +
+scale_fill_manual(values=palette_viral, name ="Viral family")
+g_hv_family
+```
+
+```{r}
+#| label: hv-genus
+#| fig-height: 6
+
+threshold_major_genus <-0.06
+
+# Count reads for each human-viral genus
+hv_genus_counts <- hv_reads_genus %>%
+group_by(sample, date, ctrl, na_type, name, taxid) %>%
+count(name ="n_reads_hv") %>%
+group_by(sample, date, ctrl, na_type) %>%
+mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))
+
+# Identify high-ranking families and group others
+hv_genus_major_tab <- hv_genus_counts %>%group_by(name) %>%
+filter(p_reads_hv ==max(p_reads_hv)) %>%filter(row_number() ==1) %>%
+arrange(desc(p_reads_hv)) %>%filter(p_reads_hv > threshold_major_genus)
+hv_genus_counts_major <- hv_genus_counts %>%
+mutate(name_display =ifelse(name %in% hv_genus_major_tab$name, name, "Other")) %>%
+group_by(sample, date, ctrl, na_type, name_display) %>%
+summarize(n_reads_hv =sum(n_reads_hv), p_reads_hv =sum(p_reads_hv),
+.groups="drop") %>%
+mutate(name_display =factor(name_display,
+levels =c(hv_genus_major_tab$name, "Other")))
+hv_genus_counts_display <- hv_genus_counts_major %>%
+rename(p_reads = p_reads_hv, classification = name_display)
+
+# Plot
+g_hv_genus <- g_comp_base +
+geom_col(data=hv_genus_counts_display, position ="stack") +
+scale_y_continuous(name="% HV Reads", limits=c(0,1.01),
+breaks =seq(0,1,0.2),
+expand=c(0,0), labels =function(y) y*100) +
+scale_fill_manual(values=palette_viral, name ="Viral genus")
+g_hv_genus
+```
+
+```{r}
+#| label: hv-species
+#| fig-height: 6
+
+threshold_major_species <-0.10
+
+# Count reads for each human-viral species
+hv_species_counts <- hv_reads_species %>%
+group_by(sample, date, ctrl, na_type, name, taxid) %>%
+count(name ="n_reads_hv") %>%
+group_by(sample, date, ctrl, na_type) %>%
+mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))
+
+# Identify high-ranking families and group others
+hv_species_major_tab <- hv_species_counts %>%group_by(name) %>%
+filter(p_reads_hv ==max(p_reads_hv)) %>%filter(row_number() ==1) %>%
+arrange(desc(p_reads_hv)) %>%filter(p_reads_hv > threshold_major_species)
+hv_species_counts_major <- hv_species_counts %>%
+mutate(name_display =ifelse(name %in% hv_species_major_tab$name, name, "Other")) %>%
+group_by(sample, date, ctrl, na_type, name_display) %>%
+summarize(n_reads_hv =sum(n_reads_hv), p_reads_hv =sum(p_reads_hv),
+.groups="drop") %>%
+mutate(name_display =factor(name_display,
+levels =c(hv_species_major_tab$name, "Other")))
+hv_species_counts_display <- hv_species_counts_major %>%
+rename(p_reads = p_reads_hv, classification = name_display)
+
+# Plot
+g_hv_species <- g_comp_base +
+geom_col(data=hv_species_counts_display, position ="stack") +
+scale_y_continuous(name="% HV Reads", limits=c(0,1.01),
+breaks =seq(0,1,0.2),
+expand=c(0,0), labels =function(y) y*100) +
+scale_fill_manual(values=palette_viral, name ="Viral species")
+g_hv_species
+```
+
+Compared to the previous datasets I've analyzed, the most notable difference is the absence of enteric viruses: *Norovirus*, *Rotavirus*, and *Enterovirus* are all absent from the list of abundant human-viral taxa, as are \~all other gastrointestinal pathogens.
+
+Finally, here are the overall relative abundances of a set of specific viral genera picked out manually on the basis of scientific or medical interest. In the future, I'll quantify the RA of these genera across all datasets analyzed with this pipeline to date, which should give us a better sense of how these data compare to others' under this pipeline.
+
+```{r}
+#| fig-height: 5
+#| label: ra-genera
+
+# Define reference genera
+path_genera_rna <-c("Mamastrovirus", "Enterovirus", "Salivirus", "Kobuvirus", "Norovirus", "Sapovirus", "Rotavirus", "Alphacoronavirus", "Betacoronavirus",
+"Alphainfluenzavirus", "Betainfluenzavirus", "Lentivirus")
+path_genera_dna <-c("Mastadenovirus", "Alphapolyomavirus", "Betapolyomavirus", "Alphapapillomavirus", "Betapapillomavirus", "Orthopoxvirus", "Simplexvirus",
+"Lymphocryptovirus", "Cytomegalovirus", "Dependoparvovirus")
+path_genera <-bind_rows(tibble(name=path_genera_rna, genome_type="RNA genome"),
+tibble(name=path_genera_dna, genome_type="DNA genome")) %>%
+left_join(viral_taxa, by="name")
+
+# Count in each sample
+n_path_genera <- hv_reads_genus %>%
+group_by(sample, date, na_type, ctrl, name, taxid) %>%
+count(name="n_reads_viral") %>%
+inner_join(path_genera, by=c("name", "taxid")) %>%
+left_join(read_counts_raw, by=c("sample", "date", "na_type", "ctrl")) %>%
+mutate(p_reads_viral = n_reads_viral/n_reads_raw)
+
+# Pivot out and back to add zero lines
+n_path_genera_out <- n_path_genera %>% ungroup %>%select(sample, name, n_reads_viral) %>%
+pivot_wider(names_from="name", values_from="n_reads_viral", values_fill=0) %>%
+pivot_longer(-sample, names_to="name", values_to="n_reads_viral") %>%
+left_join(read_counts_raw, by="sample") %>%
+left_join(path_genera, by="name") %>%
+mutate(p_reads_viral = n_reads_viral/n_reads_raw)
+
+## Aggregate across dates
+n_path_genera_stype <- n_path_genera_out %>%
+group_by(na_type, ctrl,
+ name, taxid, genome_type) %>%
+summarize(n_reads_raw =sum(n_reads_raw),
+n_reads_viral =sum(n_reads_viral), .groups ="drop") %>%
+mutate(sample="All samples", date="All dates",
+p_reads_viral = n_reads_viral/n_reads_raw)
+
+# Plot
+g_path_genera <-ggplot(n_path_genera_stype,
+aes(y=name, x=p_reads_viral, color=na_type)) +
+geom_point() +
+scale_x_log10(name="Relative abundance") +
+scale_color_na() +
+facet_grid(genome_type~ctrl, scales="free_y") +
+ theme_base +theme(axis.title.y =element_blank())
+g_path_genera
+```
+
+# Conclusion
+
+This is the first air-sampling dataset I've analyzed using this pipeline, and it was interesting to see the differences from the wastewater datasets I've been analyzing so far. Among the more striking differences were:
+
+- A much higher total fraction of human reads;
+
+- A lower total fraction of viral reads;
+
+- Near-total absence of enteric viruses and *Tobamovirus*;
+
+- Much higher relative abundance of herpesviruses, particularly HCMV, and papillomaviruses among human-infecting virus reads.
+
+Conversely, one thing that was not notably different, at least in RNA viruses, was the total relative abundance of human-infecting viruses as a whole. Given the lower fraction of reads made up of all viruses, this means that the fraction of total viruses arising from human-infecting viruses is much higher here than we've historically seen with wastewater data. In particular, HCMV represents a nontrivial fraction of total viruses for many DNA libraries, the first time I've seen a human pathogen show up significantly in the total virus composition data.
+
+Going forward, I have two more air sampling datasets to analyze, [Rosario et al. 2018](https://doi.org/10.1186/s40168-021-01044-7) and [Leung et al. 2021](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-021-01044-7). It will be interesting to see whether the findings from this dataset generalize to other air sampling contexts.
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/fp-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/fp-1.png
new file mode 100644
index 0000000..d4da2b2
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/fp-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/hcmv-blast-hits-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/hcmv-blast-hits-1.png
new file mode 100644
index 0000000..ae3420d
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/hcmv-blast-hits-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/herviviricetes-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/herviviricetes-1.png
new file mode 100644
index 0000000..08bf492
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/herviviricetes-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/hv-family-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/hv-family-1.png
new file mode 100644
index 0000000..86dc214
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/hv-family-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/hv-genus-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/hv-genus-1.png
new file mode 100644
index 0000000..1b6a602
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/hv-genus-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/hv-species-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/hv-species-1.png
new file mode 100644
index 0000000..71e50ee
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/hv-species-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-basic-stats-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-basic-stats-1.png
new file mode 100644
index 0000000..099c819
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-basic-stats-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-blast-histogram-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-blast-histogram-1.png
new file mode 100644
index 0000000..e0f337a
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-blast-histogram-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-blast-results-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-blast-results-1.png
new file mode 100644
index 0000000..3ddb057
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-blast-results-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-blast-results-2.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-blast-results-2.png
new file mode 100644
index 0000000..c5b12cc
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-blast-results-2.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-composition-all-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-composition-all-1.png
new file mode 100644
index 0000000..3a58309
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-composition-all-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-composition-all-2.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-composition-all-2.png
new file mode 100644
index 0000000..c637f32
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-composition-all-2.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-f1-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-f1-1.png
new file mode 100644
index 0000000..b333551
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-f1-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-hv-ra-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-hv-ra-1.png
new file mode 100644
index 0000000..00218b5
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-hv-ra-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-hv-scores-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-hv-scores-1.png
new file mode 100644
index 0000000..5656dc2
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-hv-scores-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-hv-scores-2.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-hv-scores-2.png
new file mode 100644
index 0000000..7dfec04
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-hv-scores-2.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-quality-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-quality-1.png
new file mode 100644
index 0000000..c180038
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-quality-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-quality-2.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-quality-2.png
new file mode 100644
index 0000000..5fe1533
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-quality-2.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-quality-3.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-quality-3.png
new file mode 100644
index 0000000..494e6df
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-quality-3.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-raw-quality-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-raw-quality-1.png
new file mode 100644
index 0000000..edb8967
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-raw-quality-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-raw-quality-2.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-raw-quality-2.png
new file mode 100644
index 0000000..aeea424
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-raw-quality-2.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-raw-quality-3.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-raw-quality-3.png
new file mode 100644
index 0000000..9684e76
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/plot-raw-quality-3.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/preproc-dedup-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/preproc-dedup-1.png
new file mode 100644
index 0000000..f2c263b
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/preproc-dedup-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/preproc-dedup-2.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/preproc-dedup-2.png
new file mode 100644
index 0000000..d4ab835
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/preproc-dedup-2.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/preproc-figures-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/preproc-figures-1.png
new file mode 100644
index 0000000..f9fa1d5
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/preproc-figures-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/preproc-figures-2.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/preproc-figures-2.png
new file mode 100644
index 0000000..3d1c679
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/preproc-figures-2.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/ra-genera-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/ra-genera-1.png
new file mode 100644
index 0000000..d762ae5
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/ra-genera-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/ra-hv-past-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/ra-hv-past-1.png
new file mode 100644
index 0000000..b4c0531
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/ra-hv-past-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/revtraviricetes-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/revtraviricetes-1.png
new file mode 100644
index 0000000..882664c
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/revtraviricetes-1.png differ
diff --git a/docs/notebooks/2024-04-12_prussin_files/figure-html/viral-class-composition-1.png b/docs/notebooks/2024-04-12_prussin_files/figure-html/viral-class-composition-1.png
new file mode 100644
index 0000000..a4c44c7
Binary files /dev/null and b/docs/notebooks/2024-04-12_prussin_files/figure-html/viral-class-composition-1.png differ
diff --git a/docs/search.json b/docs/search.json
index e2639fc..c48b1c3 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -32,7 +32,7 @@
"href": "index.html",
"title": "Will's Public NAO Notebook",
"section": "",
- "text": "Workflow analysis of Brumfield et al. (2022)\n\n\nWastewater from a manhole in Maryland.\n\n\n\n\n\nApr 8, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Spurbeck et al. (2023)\n\n\nCave carpa.\n\n\n\n\n\nApr 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nFollowup analysis of Yang et al. (2020)\n\n\nDigging into deduplication.\n\n\n\n\n\nMar 19, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Yang et al. (2020)\n\n\nWastewater from Xinjiang.\n\n\n\n\n\nMar 16, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nImproving read deduplication in the MGS workflow\n\n\nRemoving reverse-complement duplicates of human-viral reads.\n\n\n\n\n\nMar 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Rothman et al. (2021), part 2\n\n\nPanel-enriched samples.\n\n\n\n\n\nFeb 29, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Rothman et al. (2021), part 1\n\n\nUnenriched samples.\n\n\n\n\n\nFeb 27, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 3\n\n\nFixing the virus pipeline.\n\n\n\n\n\nFeb 15, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 2\n\n\nAbundance and composition of human-infecting viruses.\n\n\n\n\n\nFeb 8, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 1\n\n\nPreprocessing and composition.\n\n\n\n\n\nFeb 4, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nAutomating BLAST validation of human viral read assignment\n\n\nExperiments with BLASTN remote mode\n\n\n\n\n\nJan 30, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nProject Runway RNA-seq testing data: removing livestock reads\n\n\n\n\n\n\n\n\nDec 22, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Project Runway RNA-seq testing data\n\n\nApplying a new workflow to some oldish data.\n\n\n\n\n\nDec 19, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nEstimating the effect of read depth on duplication rate for Project Runway DNA data\n\n\nHow deep can we go?\n\n\n\n\n\nNov 8, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing viral read assignments between pipelines on Project Runway data\n\n\n\n\n\n\n\n\nNov 2, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nInitial analysis of Project Runway protocol testing data\n\n\n\n\n\n\n\n\nOct 31, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing options for read deduplication\n\n\nClumpify vs fastp\n\n\n\n\n\nOct 19, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing Ribodetector and bbduk for rRNA detection\n\n\nIn search of quick rRNA filtering.\n\n\n\n\n\nOct 16, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing FASTP and AdapterRemoval for MGS pre-processing\n\n\nTwo tools – how do they perform?\n\n\n\n\n\nOct 12, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nHow does Element AVITI sequencing work?\n\n\nFindings of a shallow investigation\n\n\n\n\n\nOct 11, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nExtraction experiment 2: high-level results & interpretation\n\n\nComparing RNA yields and quality across extraction kits for settled solids\n\n\n\n\n\nSep 21, 2023\n\n\n\n\n\n\nNo matching items"
+ "text": "Workflow analysis of Prussin et al. (2019)\n\n\nWastewater from a manhole in Maryland.\n\n\n\n\n\nApr 12, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Brumfield et al. (2022)\n\n\nWastewater from a manhole in Maryland.\n\n\n\n\n\nApr 8, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Spurbeck et al. (2023)\n\n\nCave carpa.\n\n\n\n\n\nApr 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nFollowup analysis of Yang et al. (2020)\n\n\nDigging into deduplication.\n\n\n\n\n\nMar 19, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Yang et al. (2020)\n\n\nWastewater from Xinjiang.\n\n\n\n\n\nMar 16, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nImproving read deduplication in the MGS workflow\n\n\nRemoving reverse-complement duplicates of human-viral reads.\n\n\n\n\n\nMar 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Rothman et al. (2021), part 2\n\n\nPanel-enriched samples.\n\n\n\n\n\nFeb 29, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Rothman et al. (2021), part 1\n\n\nUnenriched samples.\n\n\n\n\n\nFeb 27, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 3\n\n\nFixing the virus pipeline.\n\n\n\n\n\nFeb 15, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 2\n\n\nAbundance and composition of human-infecting viruses.\n\n\n\n\n\nFeb 8, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 1\n\n\nPreprocessing and composition.\n\n\n\n\n\nFeb 4, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nAutomating BLAST validation of human viral read assignment\n\n\nExperiments with BLASTN remote mode\n\n\n\n\n\nJan 30, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nProject Runway RNA-seq testing data: removing livestock reads\n\n\n\n\n\n\n\n\nDec 22, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Project Runway RNA-seq testing data\n\n\nApplying a new workflow to some oldish data.\n\n\n\n\n\nDec 19, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nEstimating the effect of read depth on duplication rate for Project Runway DNA data\n\n\nHow deep can we go?\n\n\n\n\n\nNov 8, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing viral read assignments between pipelines on Project Runway data\n\n\n\n\n\n\n\n\nNov 2, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nInitial analysis of Project Runway protocol testing data\n\n\n\n\n\n\n\n\nOct 31, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing options for read deduplication\n\n\nClumpify vs fastp\n\n\n\n\n\nOct 19, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing Ribodetector and bbduk for rRNA detection\n\n\nIn search of quick rRNA filtering.\n\n\n\n\n\nOct 16, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing FASTP and AdapterRemoval for MGS pre-processing\n\n\nTwo tools – how do they perform?\n\n\n\n\n\nOct 12, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nHow does Element AVITI sequencing work?\n\n\nFindings of a shallow investigation\n\n\n\n\n\nOct 11, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nExtraction experiment 2: high-level results & interpretation\n\n\nComparing RNA yields and quality across extraction kits for settled solids\n\n\n\n\n\nSep 21, 2023\n\n\n\n\n\n\nNo matching items"
},
{
"objectID": "notebooks/2023-10-12_fastp-vs-adapterremoval.html",
@@ -278,5 +278,12 @@
"title": "Workflow analysis of Brumfield et al. (2022)",
"section": "",
"text": "Continuing my analysis of datasets from the P2RA preprint, I analyzed the data from Brumfield et al. (2022). This study conducted longitudinal monitoring of SARS-CoV-2 via qPCR from a single manhole in Maryland, combined with comprehensive microbiome analysis using metagenomics during a major COVID spike in early 2021. In total, they sequenced six samples from February to April 2021.\nOne unusual feature that makes this study interesting is that they conducted both RNA and DNA sequencing on each study. Prior to sequencing, samples underwent concentration with the InnovaPrep Concentrating Pipette, followed by separate DNA & RNA extraction using different kits. RNA samples underwent rRNA depletion prior to library prep. All samples were sequenced on an Illumina HiSeq4000 with 2x150bp reads.\nThe raw data\nThe 6 samples from the Brumfield dataset yielded 24M-45M (mean 33M) DNA-sequencing reads and 29M-64M (mean 46M) RNA-sequencing reads per sample, for a total of 196M DNA read pairs and 276M RNA read pairs (59 and 83 gigabases of sequence, respectively). Read qualities were mostly high but tailed off at the 3’ end in some samples, suggesting the need for trimming. Adapter levels were very high. Inferred duplication levels were 44-58% in DNA reads and 90-96% in RNA reads, i.e. very high.\n\nCode# Data input paths\ndata_dir <- \"../data/2024-03-19_brumfield/\"\nlibraries_path <- file.path(data_dir, \"sample-metadata.csv\")\nbasic_stats_path <- file.path(data_dir, \"qc_basic_stats.tsv\")\nadapter_stats_path <- file.path(data_dir, \"qc_adapter_stats.tsv\")\nquality_base_stats_path <- file.path(data_dir, \"qc_quality_base_stats.tsv\")\nquality_seq_stats_path <- file.path(data_dir, \"qc_quality_sequence_stats.tsv\")\n\n# Import libraries and extract metadata from sample names\nlibraries <- read_csv(libraries_path, show_col_types = FALSE) %>%\n arrange(date) %>% mutate(date = fct_inorder(as.character(date))) %>%\n arrange(desc(na_type)) %>% mutate(na_type = fct_inorder(na_type))\n\n# Import QC data\nstages <- c(\"raw_concat\", \"cleaned\", \"dedup\", \"ribo_initial\", \"ribo_secondary\")\nbasic_stats <- read_tsv(basic_stats_path, show_col_types = FALSE) %>%\n inner_join(libraries, by=\"sample\") %>% \n mutate(stage = factor(stage, levels = stages),\n sample = fct_inorder(sample))\nadapter_stats <- read_tsv(adapter_stats_path, show_col_types = FALSE) %>%\n mutate(sample = sub(\"_2$\", \"\", sample)) %>%\n inner_join(libraries, by=\"sample\") %>%\n mutate(stage = factor(stage, levels = stages),\n read_pair = fct_inorder(as.character(read_pair)))\nquality_base_stats <- read_tsv(quality_base_stats_path, show_col_types = FALSE) %>%\n inner_join(libraries, by=\"sample\") %>%\n mutate(stage = factor(stage, levels = stages),\n read_pair = fct_inorder(as.character(read_pair)))\nquality_seq_stats <- read_tsv(quality_seq_stats_path, show_col_types = FALSE) %>%\n inner_join(libraries, by=\"sample\") %>%\n mutate(stage = factor(stage, levels = stages),\n read_pair = fct_inorder(as.character(read_pair)))\n\n# Filter to raw data\nbasic_stats_raw <- basic_stats %>% filter(stage == \"raw_concat\")\nadapter_stats_raw <- adapter_stats %>% filter(stage == \"raw_concat\")\nquality_base_stats_raw <- quality_base_stats %>% filter(stage == \"raw_concat\")\nquality_seq_stats_raw <- quality_seq_stats %>% filter(stage == \"raw_concat\")\n\n\n\nCode# Visualize basic stats\nscale_fill_na <- purrr::partial(scale_fill_brewer, palette=\"Set1\", name=\"Nucleic acid type\")\ng_basic <- ggplot(basic_stats_raw, aes(x=date, fill=na_type, group=sample)) +\n scale_fill_na() + theme_kit\ng_nreads_raw <- g_basic + geom_col(aes(y=n_read_pairs), position = \"dodge\") +\n scale_y_continuous(name=\"# Read pairs\", expand=c(0,0))\ng_nreads_raw_2 <- g_nreads_raw + theme(legend.position = \"none\")\nlegend_group <- get_legend(g_nreads_raw)\n\ng_nbases_raw <- g_basic + geom_col(aes(y=n_bases_approx), position = \"dodge\") +\n scale_y_continuous(name=\"Total base pairs (approx)\", expand=c(0,0)) + \n theme(legend.position = \"none\")\ng_ndup_raw <- g_basic + geom_col(aes(y=percent_duplicates), position = \"dodge\") +\n scale_y_continuous(name=\"% Duplicates (FASTQC)\", expand=c(0,0), limits = c(0,100), breaks = seq(0,100,20)) +\n theme(legend.position = \"none\")\n\ng_basic_raw <- plot_grid(g_nreads_raw_2 + g_nbases_raw + g_ndup_raw, legend_group, ncol = 1, rel_heights = c(1,0.1))\ng_basic_raw\n\n\n\n\n\n\n\n\nCodescale_color_na <- purrr::partial(scale_color_brewer,palette=\"Set1\",name=\"Nucleic acid type\")\ng_qual_raw <- ggplot(mapping=aes(color=na_type, linetype=read_pair, \n group=interaction(sample,read_pair))) + \n scale_color_na() + scale_linetype_discrete(name = \"Read Pair\") +\n guides(color=guide_legend(nrow=2,byrow=TRUE),\n linetype = guide_legend(nrow=2,byrow=TRUE)) +\n theme_base\n\n# Visualize adapters\ng_adapters_raw <- g_qual_raw + \n geom_line(aes(x=position, y=pc_adapters), data=adapter_stats_raw) +\n scale_y_continuous(name=\"% Adapters\", limits=c(0,NA),\n breaks = seq(0,100,10), expand=c(0,0)) +\n scale_x_continuous(name=\"Position\", limits=c(0,NA),\n breaks=seq(0,140,20), expand=c(0,0)) +\n facet_wrap(~adapter)\ng_adapters_raw\n\n\n\n\n\n\nCode# Visualize quality\ng_quality_base_raw <- g_qual_raw +\n geom_hline(yintercept=25, linetype=\"dashed\", color=\"red\") +\n geom_hline(yintercept=30, linetype=\"dashed\", color=\"red\") +\n geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats_raw) +\n scale_y_continuous(name=\"Mean Phred score\", expand=c(0,0), limits=c(10,45)) +\n scale_x_continuous(name=\"Position\", limits=c(0,NA),\n breaks=seq(0,140,20), expand=c(0,0))\ng_quality_base_raw\n\n\n\n\n\n\nCodeg_quality_seq_raw <- g_qual_raw +\n geom_vline(xintercept=25, linetype=\"dashed\", color=\"red\") +\n geom_vline(xintercept=30, linetype=\"dashed\", color=\"red\") +\n geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats_raw) +\n scale_x_continuous(name=\"Mean Phred score\", expand=c(0,0)) +\n scale_y_continuous(name=\"# Sequences\", expand=c(0,0))\ng_quality_seq_raw\n\n\n\n\n\n\n\nPreprocessing\nThe average fraction of reads lost at each stage in the preprocessing pipeline is shown in the following table. On average, cleaning & deduplication together removed about 50% of total reads from DNA libraries and about 92% from RNA libraries, primarily during deduplication. Only a few reads (low single-digit percentages at most) were lost during subsequent ribodepletion, even in RNA reads, consistent with the samples having undergone rRNA depletion prior to sequencing.\n\nCoden_reads_rel <- basic_stats %>% \n select(sample, na_type, stage, percent_duplicates, n_read_pairs) %>%\n group_by(sample, na_type) %>% arrange(sample, na_type, stage) %>%\n mutate(p_reads_retained = n_read_pairs / lag(n_read_pairs),\n p_reads_lost = 1 - p_reads_retained,\n p_reads_retained_abs = n_read_pairs / n_read_pairs[1],\n p_reads_lost_abs = 1-p_reads_retained_abs,\n p_reads_lost_abs_marginal = p_reads_lost_abs - lag(p_reads_lost_abs))\nn_reads_rel_display <- n_reads_rel %>% rename(Stage=stage, `NA Type`=na_type) %>% \n group_by(`NA Type`, Stage) %>% \n summarize(`% Total Reads Lost (Cumulative)` = paste0(round(min(p_reads_lost_abs*100),1), \"-\", round(max(p_reads_lost_abs*100),1), \" (mean \", round(mean(p_reads_lost_abs*100),1), \")\"),\n `% Total Reads Lost (Marginal)` = paste0(round(min(p_reads_lost_abs_marginal*100),1), \"-\", round(max(p_reads_lost_abs_marginal*100),1), \" (mean \", round(mean(p_reads_lost_abs_marginal*100),1), \")\"), .groups=\"drop\") %>% \n filter(Stage != \"raw_concat\") %>%\n mutate(Stage = Stage %>% as.numeric %>% factor(labels=c(\"Trimming & filtering\", \"Deduplication\", \"Initial ribodepletion\", \"Secondary ribodepletion\")))\nn_reads_rel_display\n\n\n \n\n\n\n\nCode# Plot reads over preprocessing\ng_reads_stages <- ggplot(basic_stats, aes(x=stage, y=n_read_pairs,fill=na_type,group=sample)) +\n geom_col(position=\"dodge\") + scale_fill_na() +\n scale_y_continuous(\"# Read pairs\", expand=c(0,0)) +\n theme_kit\ng_reads_stages\n\n\n\n\n\n\nCode# Plot relative read losses during preprocessing\ng_reads_rel <- ggplot(n_reads_rel, aes(x=stage, y=p_reads_lost_abs_marginal,fill=na_type,group=sample)) +\n geom_col(position=\"dodge\") + scale_fill_na() +\n scale_y_continuous(\"% Total Reads Lost\", expand=c(0,0), labels = function(x) x*100) +\n theme_kit\ng_reads_rel\n\n\n\n\n\n\n\nData cleaning with FASTP was very successful at removing adapters, with very few adapter sequences found by FASTQC at any stage after the raw data. FASTP was also successful at improving read quality.\n\nCodeg_qual <- ggplot(mapping=aes(color=na_type, linetype=read_pair, \n group=interaction(sample,read_pair))) + \n scale_color_na() + scale_linetype_discrete(name = \"Read Pair\") +\n guides(color=guide_legend(nrow=2,byrow=TRUE),\n linetype = guide_legend(nrow=2,byrow=TRUE)) +\n theme_base\n\n# Visualize adapters\ng_adapters <- g_qual + \n geom_line(aes(x=position, y=pc_adapters), data=adapter_stats) +\n scale_y_continuous(name=\"% Adapters\", limits=c(0,20),\n breaks = seq(0,50,10), expand=c(0,0)) +\n scale_x_continuous(name=\"Position\", limits=c(0,NA),\n breaks=seq(0,140,20), expand=c(0,0)) +\n facet_grid(stage~adapter)\ng_adapters\n\n\n\n\n\n\nCode# Visualize quality\ng_quality_base <- g_qual +\n geom_hline(yintercept=25, linetype=\"dashed\", color=\"red\") +\n geom_hline(yintercept=30, linetype=\"dashed\", color=\"red\") +\n geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats) +\n scale_y_continuous(name=\"Mean Phred score\", expand=c(0,0), limits=c(10,45)) +\n scale_x_continuous(name=\"Position\", limits=c(0,NA),\n breaks=seq(0,140,20), expand=c(0,0)) +\n facet_grid(stage~.)\ng_quality_base\n\n\n\n\n\n\nCodeg_quality_seq <- g_qual +\n geom_vline(xintercept=25, linetype=\"dashed\", color=\"red\") +\n geom_vline(xintercept=30, linetype=\"dashed\", color=\"red\") +\n geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats) +\n scale_x_continuous(name=\"Mean Phred score\", expand=c(0,0)) +\n scale_y_continuous(name=\"# Sequences\", expand=c(0,0)) +\n facet_grid(stage~.)\ng_quality_seq\n\n\n\n\n\n\n\nAccording to FASTQC, deduplication was moderately effective at reducing measured duplicate levels, with FASTQC-measured levels falling from an average of 50% to one of 16% for DNA reads, and from 93% to 42% for RNA reads:\n\nCodeg_dup_stages <- ggplot(basic_stats, aes(x=stage, y=percent_duplicates, fill=na_type, group=sample)) +\n geom_col(position=\"dodge\") + scale_fill_na() +\n scale_y_continuous(\"% Duplicates\", limits=c(0,100), breaks=seq(0,100,20), expand=c(0,0)) +\n theme_kit\ng_readlen_stages <- ggplot(basic_stats, aes(x=stage, y=mean_seq_len, fill=na_type, group=sample)) +\n geom_col(position=\"dodge\") + scale_fill_na() +\n scale_y_continuous(\"Mean read length (nt)\", expand=c(0,0)) +\n theme_kit\nlegend_loc <- get_legend(g_dup_stages)\ng_dup_stages\n\n\n\n\n\n\nCodeg_readlen_stages\n\n\n\n\n\n\n\nHigh-level composition\nAs before, to assess the high-level composition of the reads, I ran the ribodepleted files through Kraken (using the Standard 16 database) and summarized the results with Bracken. Combining these results with the read counts above gives us a breakdown of the inferred composition of the samples:\n\nCode# Import Bracken data\nbracken_path <- file.path(data_dir, \"bracken_counts.tsv\")\nbracken <- read_tsv(bracken_path, show_col_types = FALSE)\ntotal_assigned <- bracken %>% group_by(sample) %>% summarize(\n name = \"Total\",\n kraken_assigned_reads = sum(kraken_assigned_reads),\n added_reads = sum(added_reads),\n new_est_reads = sum(new_est_reads),\n fraction_total_reads = sum(fraction_total_reads)\n)\nbracken_spread <- bracken %>% select(name, sample, new_est_reads) %>%\n mutate(name = tolower(name)) %>%\n pivot_wider(id_cols = \"sample\", names_from = \"name\", values_from = \"new_est_reads\")\n# Count reads\nread_counts_preproc <- basic_stats %>% \n select(sample, na_type, date, stage, n_read_pairs) %>%\n pivot_wider(id_cols = c(\"sample\", \"na_type\", \"date\"), names_from=\"stage\", values_from=\"n_read_pairs\")\nread_counts <- read_counts_preproc %>%\n inner_join(total_assigned %>% select(sample, new_est_reads), by = \"sample\") %>%\n rename(assigned = new_est_reads) %>%\n inner_join(bracken_spread, by=\"sample\")\n# Assess composition\nread_comp <- transmute(read_counts, sample=sample, na_type=na_type, date=date,\n n_filtered = raw_concat-cleaned,\n n_duplicate = cleaned-dedup,\n n_ribosomal = (dedup-ribo_initial) + (ribo_initial-ribo_secondary),\n n_unassigned = ribo_secondary-assigned,\n n_bacterial = bacteria,\n n_archaeal = archaea,\n n_viral = viruses,\n n_human = eukaryota)\nread_comp_long <- pivot_longer(read_comp, -(sample:date), names_to = \"classification\",\n names_prefix = \"n_\", values_to = \"n_reads\") %>%\n mutate(classification = fct_inorder(str_to_sentence(classification))) %>%\n group_by(sample) %>% mutate(p_reads = n_reads/sum(n_reads))\nread_comp_summ <- read_comp_long %>% \n group_by(na_type, classification) %>%\n summarize(n_reads = sum(n_reads), .groups = \"drop_last\") %>%\n mutate(n_reads = replace_na(n_reads,0),\n p_reads = n_reads/sum(n_reads),\n pc_reads = p_reads*100)\n \n# Plot overall composition\ng_comp <- ggplot(read_comp_long, aes(x=sample, y=p_reads, fill=classification)) +\n geom_col(position = \"stack\") +\n scale_y_continuous(name = \"% Reads\", limits = c(0,1.01), breaks = seq(0,1,0.2),\n expand = c(0,0), labels = function(x) x*100) +\n scale_fill_brewer(palette = \"Set1\", name = \"Classification\") +\n facet_wrap(.~na_type, scales=\"free\", ncol=5) +\n theme_kit\ng_comp\n\n\n\n\n\n\nCode# Plot composition of minor components\nread_comp_minor <- read_comp_long %>% filter(classification %in% c(\"Archaeal\", \"Viral\", \"Human\", \"Other\"))\npalette_minor <- brewer.pal(9, \"Set1\")[6:9]\ng_comp_minor <- ggplot(read_comp_minor, aes(x=sample, y=p_reads, fill=classification)) +\n geom_col(position = \"stack\") +\n scale_y_continuous(name = \"% Reads\",\n expand = c(0,0), labels = function(x) x*100) +\n scale_fill_manual(values=palette_minor, name = \"Classification\") +\n facet_wrap(.~na_type, scales = \"free_x\", ncol=5) +\n theme_kit\ng_comp_minor\n\n\n\n\n\n\n\n\nCodep_reads_summ_group <- read_comp_long %>%\n mutate(classification = ifelse(classification %in% c(\"Filtered\", \"Duplicate\", \"Unassigned\"), \"Excluded\", as.character(classification)),\n classification = fct_inorder(classification)) %>%\n group_by(classification, sample, na_type) %>%\n summarize(p_reads = sum(p_reads), .groups = \"drop\") %>%\n group_by(classification, na_type) %>%\n summarize(pc_min = min(p_reads)*100, pc_max = max(p_reads)*100, \n pc_mean = mean(p_reads)*100, .groups = \"drop\")\np_reads_summ_prep <- p_reads_summ_group %>%\n mutate(classification = fct_inorder(classification),\n pc_min = pc_min %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),\n pc_max = pc_max %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),\n pc_mean = pc_mean %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),\n display = paste0(pc_min, \"-\", pc_max, \"% (mean \", pc_mean, \"%)\"))\np_reads_summ <- p_reads_summ_prep %>%\n select(classification, na_type, display) %>%\n pivot_wider(names_from=na_type, values_from = display)\np_reads_summ\n\n\n \n\n\n\nAs previously noted, RNA samples were overwhelmingly composed of duplicates. Despite this, the RNA samples achieved a decently high level of total viral reads, with an average of 0.55% of reads mapping to viruses (higher than Crits-Christoph). Levels of total viral reads were much lower in DNA libraries, which were dominated by unassigned and (non-ribosomal) bacterial sequences.\nWithin viral reads, RNA libraries were (as usual) dominated by plant viruses, particularly Virgaviridae – though one sample, unusually, has a substantial minority of reads from Fiersviridae, a family of RNA phages. Detected DNA viruses come from a wider range of families, but the most abundant families (Suoliviridae, Intestiviridae, Autographiviridae, Myoviridae) are all DNA phages.\n\nCode# Get viral taxonomy\nviral_taxa_path <- file.path(data_dir, \"viral-taxids.tsv.gz\")\nviral_taxa <- read_tsv(viral_taxa_path, show_col_types = FALSE)\n\n# Import Kraken reports & extract & count viral families\nsamples <- as.character(basic_stats_raw$sample)\ncol_names <- c(\"pc_reads_total\", \"n_reads_clade\", \"n_reads_direct\", \"rank\", \"taxid\", \"name\")\nreport_paths <- paste0(data_dir, \"kraken/\", samples, \".report.gz\")\nkraken_reports <- lapply(1:length(samples), \n function(n) read_tsv(report_paths[n], col_names = col_names,\n show_col_types = FALSE) %>%\n mutate(sample = samples[n])) %>% bind_rows\nkraken_reports_viral <- filter(kraken_reports, taxid %in% viral_taxa$taxid) %>%\n group_by(sample) %>%\n mutate(p_reads_viral = n_reads_clade/n_reads_clade[1])\nviral_families <- kraken_reports_viral %>% filter(rank == \"F\") %>%\n inner_join(libraries, by=\"sample\")\n\n# Identify major viral families\nviral_families_major_tab <- viral_families %>% group_by(name, taxid, na_type) %>%\n summarize(p_reads_viral_max = max(p_reads_viral), .groups=\"drop\") %>%\n filter(p_reads_viral_max >= 0.04)\nviral_families_major_list <- viral_families_major_tab %>% pull(name)\nviral_families_major <- viral_families %>% filter(name %in% viral_families_major_list) %>%\n select(name, taxid, sample, na_type, date, p_reads_viral)\nviral_families_minor <- viral_families_major %>% group_by(sample, date, na_type) %>%\n summarize(p_reads_viral_major = sum(p_reads_viral), .groups = \"drop\") %>%\n mutate(name = \"Other\", taxid=NA, p_reads_viral = 1-p_reads_viral_major) %>%\n select(name, taxid, sample, na_type, date, p_reads_viral)\nviral_families_display <- bind_rows(viral_families_major, viral_families_minor) %>%\n arrange(desc(p_reads_viral)) %>% mutate(name = factor(name, levels=c(viral_families_major_list, \"Other\")))\ng_families <- ggplot(viral_families_display, aes(x=date, y=p_reads_viral, fill=name)) +\n geom_col(position=\"stack\") +\n scale_fill_brewer(palette = \"Set3\", name = \"Viral family\") +\n scale_y_continuous(name=\"% Viral Reads\", limits=c(0,1), breaks = seq(0,1,0.2),\n expand=c(0,0), labels = function(y) y*100) +\n facet_wrap(~na_type) +\n theme_kit\ng_families\n\n\n\n\n\n\n\nGiven the very high level of duplicates in the RNA data, I also repeated the analysis from my second Yang et al. entry, re-running taxonomic composition analysis on pre-deduplication data and comparing the effects of deduplication on composition:\n\nCodeclass_levels <- c(\"Unassigned\", \"Bacterial\", \"Archaeal\", \"Viral\", \"Human\")\nsubset_factor <- 0.05\n\n# 1. Remove filtered & duplicate reads from original Bracken output & renormalize\nread_comp_renorm <- read_comp_long %>%\n filter(! classification %in% c(\"Filtered\", \"Duplicate\")) %>%\n group_by(sample) %>%\n mutate(p_reads = n_reads/sum(n_reads),\n classification = classification %>% as.character %>%\n ifelse(. == \"Ribosomal\", \"Bacterial\", .)) %>%\n group_by(sample, na_type, date, classification) %>%\n summarize(n_reads = sum(n_reads), p_reads = sum(p_reads), .groups = \"drop\") %>%\n mutate(classification = factor(classification, levels = class_levels))\n \n# 2. Import pre-deduplicated \nbracken_path_predup <- file.path(data_dir, \"bracken_counts_subset.tsv\")\nbracken_predup <- read_tsv(bracken_path_predup, show_col_types = FALSE)\ntotal_assigned_predup <- bracken_predup %>% group_by(sample) %>% summarize(\n name = \"Total\",\n kraken_assigned_reads = sum(kraken_assigned_reads),\n added_reads = sum(added_reads),\n new_est_reads = sum(new_est_reads),\n fraction_total_reads = sum(fraction_total_reads)\n)\nbracken_spread_predup <- bracken_predup %>% select(name, sample, new_est_reads) %>%\n mutate(name = tolower(name)) %>%\n pivot_wider(id_cols = \"sample\", names_from = \"name\", values_from = \"new_est_reads\")\n# Count reads\nread_counts_predup <- read_counts_preproc %>%\n select(sample, na_type, date, raw_concat, cleaned) %>%\n mutate(raw_concat = raw_concat * subset_factor, cleaned = cleaned * subset_factor) %>%\n inner_join(total_assigned_predup %>% select(sample, new_est_reads), by = \"sample\") %>%\n rename(assigned = new_est_reads) %>%\n inner_join(bracken_spread_predup, by=\"sample\")\n# Assess composition\nread_comp_predup <- transmute(read_counts_predup, sample=sample, na_type = na_type,\n date=date,\n n_filtered = raw_concat-cleaned,\n n_unassigned = cleaned-assigned,\n n_bacterial = bacteria,\n n_archaeal = archaea,\n n_viral = viruses,\n n_human = eukaryota)\nread_comp_predup_long <- pivot_longer(read_comp_predup, -(sample:date), \n names_to = \"classification\",\n names_prefix = \"n_\", values_to = \"n_reads\") %>%\n mutate(classification = fct_inorder(str_to_sentence(classification))) %>%\n group_by(sample) %>% mutate(p_reads = n_reads/sum(n_reads)) %>%\n filter(! classification %in% c(\"Filtered\", \"Duplicate\")) %>%\n group_by(sample) %>%\n mutate(p_reads = n_reads/sum(n_reads))\n\n# 3. Combine\nread_comp_comb <- bind_rows(read_comp_predup_long %>% mutate(deduplicated = FALSE),\n read_comp_renorm %>% mutate(deduplicated = TRUE)) %>%\n mutate(label = ifelse(deduplicated, \"Post-dedup\", \"Pre-dedup\") %>% fct_inorder)\n\n# Plot overall composition\ng_comp_predup <- ggplot(read_comp_comb, aes(x=label, y=p_reads, fill=classification)) +\n geom_col(position = \"stack\") +\n scale_y_continuous(name = \"% Reads\", limits = c(0,1.01), breaks = seq(0,1,0.2),\n expand = c(0,0), labels = function(x) x*100) +\n scale_fill_brewer(palette = \"Set1\", name = \"Classification\") +\n facet_grid(na_type~date) +\n theme_kit\ng_comp_predup\n\n\n\n\n\n\n\nIn general, deduplication has little effect on the composition of DNA samples, which remain primarily bacterial and unassigned. A few RNA samples show a reduction in the bacterial (more precisely, bacterial + rRNA) fraction after deduplication, consistent with the presence of some true biological duplicate sequences (most likely rRNA) that are being collapsed. Surprisingly, however, the largest effect is in the opposite direction, with several samples showing large increases in bacterial sequences following deduplication. This suggests that some highly abundant non-bacterial sequence is being collapsed in these samples, increasing the apparent abundance of bacteria.\nThe increase in bacterial reads in these samples comes at the expense of (1) human reads and (2) the unassigned fraction. The former suggests the presence of human rRNA sequences that are being erroneously incorporated into the post-deduplication “bacterial” fraction; however, this effect is smaller than the decrease in unassigned reads. One possibility might be other non-human eukaryotic rRNA sequences, which Kraken is unable to assign using our current database.\nHuman-infecting virus reads: validation\nNext, I investigated the human-infecting virus read content of these unenriched samples. To get good results here, I needed to make some changes to the pipeline, including adding Trimmomatic as an additional primer-trimming step to prevent further primer-contamination-based false positives. However, for the sake of time I’m not describing them in detail here; if you want to see more I encourage you to check the commit log at the workflow repo.\nHaving made these changes, the workflow identified a total of 14,073 RNA read pairs and 70 DNA read pairs across all samples (0.09% and 0.00009% of surviving reads, respectively).\n\nCodehv_reads_filtered_path <- file.path(data_dir, \"hv_hits_putative_filtered.tsv.gz\")\nhv_reads_filtered <- read_tsv(hv_reads_filtered_path, show_col_types = FALSE) %>%\n inner_join(libraries, by=\"sample\") %>% \n arrange(date, na_type)\nn_hv_filtered <- hv_reads_filtered %>% group_by(sample, date, na_type) %>% count %>% inner_join(basic_stats %>% filter(stage == \"ribo_initial\") %>% select(sample, n_read_pairs), by=\"sample\") %>% rename(n_putative = n, n_total = n_read_pairs) %>% mutate(p_reads = n_putative/n_total, pc_reads = p_reads * 100)\nn_hv_filtered_summ <- n_hv_filtered %>% group_by(na_type) %>% summarize(n_putative = sum(n_putative), n_total = sum(n_total)) %>% mutate(p_reads = n_putative/n_total, pc_reads = p_reads*100)\n\n\n\nCodemrg <- hv_reads_filtered %>%\n mutate(kraken_label = ifelse(assigned_hv, \"Kraken2 HV\\nassignment\",\n ifelse(hit_hv, \"Kraken2 HV\\nhit\",\n \"No hit or\\nassignment\"))) %>%\n group_by(sample, na_type) %>% arrange(desc(adj_score_fwd), desc(adj_score_rev)) %>%\n mutate(seq_num = row_number(),\n adj_score_max = pmax(adj_score_fwd, adj_score_rev))\n# Import Bowtie2/Kraken data and perform filtering steps\ng_mrg <- ggplot(mrg, aes(x=adj_score_fwd, y=adj_score_rev)) +\n geom_point(alpha=0.5, shape=16) + \n scale_x_continuous(name=\"S(forward read)\", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +\n scale_y_continuous(name=\"S(reverse read)\", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +\n facet_grid(na_type~kraken_label, labeller = labeller(kit = label_wrap_gen(20))) +\n theme_base + theme(aspect.ratio=1)\ng_mrg\n\n\n\n\n\n\nCodeg_hist_0 <- ggplot(mrg, aes(x=adj_score_max)) + geom_histogram(binwidth=5,boundary=0,position=\"dodge\") + facet_grid(na_type~kraken_label, scales=\"free_y\") + scale_x_continuous(name = \"Maximum adjusted alignment score\") + scale_y_continuous(name=\"# Read pairs\") + scale_fill_brewer(palette = \"Dark2\") + theme_base + geom_vline(xintercept=20, linetype=\"dashed\", color=\"red\")\ng_hist_0\n\n\n\n\n\n\n\nAs previously described, I ran BLASTN on these reads via a dedicated EC2 instance, using the same parameters I’ve used for previous datasets.\n\nCodemrg_fasta <- mrg %>%\n mutate(seq_head = paste0(\">\", seq_id)) %>%\n ungroup %>%\n select(header1=seq_head, seq1=query_seq_fwd, \n header2=seq_head, seq2=query_seq_rev) %>%\n mutate(header1=paste0(header1, \"_1\"), header2=paste0(header2, \"_2\"))\nmrg_fasta_out <- do.call(paste, c(mrg_fasta, sep=\"\\n\")) %>% paste(collapse=\"\\n\")\nwrite(mrg_fasta_out, file.path(data_dir, \"brumfield-putative-viral.fasta\"))\n\n\n\nCode# Import BLAST results\nblast_results_path <- file.path(data_dir, \"blast/brumfield-putative-viral.blast.gz\")\nblast_cols <- c(\"qseqid\", \"sseqid\", \"sgi\", \"staxid\", \"qlen\", \"evalue\", \"bitscore\", \"qcovs\", \"length\", \"pident\", \"mismatch\", \"gapopen\", \"sstrand\", \"qstart\", \"qend\", \"sstart\", \"send\")\nblast_results <- read_tsv(blast_results_path, show_col_types = FALSE, \n col_names = blast_cols, col_types = cols(.default=\"c\"))\n\n# Add viral status\nblast_results_viral <- mutate(blast_results, viral = staxid %in% viral_taxa$taxid)\n\n# Filter for best hit for each query/subject, then rank hits for each query\nblast_results_best <- blast_results_viral %>% group_by(qseqid, staxid) %>% \n filter(bitscore == max(bitscore)) %>%\n filter(length == max(length)) %>% filter(row_number() == 1)\nblast_results_ranked <- blast_results_best %>% \n group_by(qseqid) %>% mutate(rank = dense_rank(desc(bitscore)))\nblast_results_highrank <- blast_results_ranked %>% filter(rank <= 5) %>%\n mutate(read_pair = str_split(qseqid, \"_\") %>% sapply(nth, n=-1), \n seq_id = str_split(qseqid, \"_\") %>% sapply(nth, n=1)) %>%\n mutate(bitscore = as.numeric(bitscore))\n\n# Summarize by read pair and taxid\nblast_results_paired <- blast_results_highrank %>%\n group_by(seq_id, staxid, viral) %>%\n summarize(bitscore_max = max(bitscore), bitscore_min = min(bitscore),\n n_reads = n(), .groups = \"drop\") %>%\n mutate(viral_full = viral & n_reads == 2)\n\n# Compare to Kraken & Bowtie assignments\nmrg_assign <- mrg %>% select(sample, seq_id, taxid, assigned_taxid, adj_score_max, na_type)\nblast_results_assign <- left_join(blast_results_paired, mrg_assign, by=\"seq_id\") %>%\n mutate(taxid_match_bowtie = (staxid == taxid),\n taxid_match_kraken = (staxid == assigned_taxid),\n taxid_match_any = taxid_match_bowtie | taxid_match_kraken)\nblast_results_out <- blast_results_assign %>%\n group_by(seq_id) %>%\n summarize(viral_status = ifelse(any(viral_full), 2,\n ifelse(any(taxid_match_any), 2,\n ifelse(any(viral), 1, 0))),\n .groups = \"drop\")\n\n\n\nCode# Merge BLAST results with unenriched read data\nmrg_blast <- full_join(mrg, blast_results_out, by=\"seq_id\") %>%\n mutate(viral_status = replace_na(viral_status, 0),\n viral_status_out = ifelse(viral_status == 0, FALSE, TRUE))\nmrg_blast_rna <- mrg_blast %>% filter(na_type == \"RNA\")\nmrg_blast_dna <- mrg_blast %>% filter(na_type == \"DNA\")\n\n# Plot RNA\ng_mrg_blast_rna <- mrg_blast_rna %>%\n ggplot(aes(x=adj_score_fwd, y=adj_score_rev, color=viral_status_out)) +\n geom_point(alpha=0.5, shape=16) +\n scale_x_continuous(name=\"S(forward read)\", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +\n scale_y_continuous(name=\"S(reverse read)\", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +\n scale_color_brewer(palette = \"Set1\", name = \"Viral status\") +\n facet_grid(viral_status_out~kraken_label, labeller = labeller(kit = label_wrap_gen(20))) +\n theme_base + labs(title=\"RNA\") + \n theme(aspect.ratio=1, plot.title = element_text(size=rel(2), hjust=0))\ng_mrg_blast_rna\n\n\n\n\n\n\nCode# Plot DNA\ng_mrg_blast_dna <- mrg_blast_dna %>%\n ggplot(aes(x=adj_score_fwd, y=adj_score_rev, color=viral_status_out)) +\n geom_point(alpha=0.5, shape=16) +\n scale_x_continuous(name=\"S(forward read)\", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +\n scale_y_continuous(name=\"S(reverse read)\", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +\n scale_color_brewer(palette = \"Set1\", name = \"Viral status\") +\n facet_grid(viral_status_out~kraken_label, labeller = labeller(kit = label_wrap_gen(20))) +\n theme_base + labs(title=\"DNA\") + \n theme(aspect.ratio=1, plot.title = element_text(size=rel(2), hjust=0))\ng_mrg_blast_dna\n\n\n\n\n\n\n\n\nCodeg_hist_1 <- ggplot(mrg_blast, aes(x=adj_score_max)) + geom_histogram(binwidth=5,boundary=0,position=\"dodge\") + facet_grid(na_type~viral_status_out, scales = \"free_y\") + scale_x_continuous(name = \"Maximum adjusted alignment score\") + scale_y_continuous(name=\"# Read pairs\") + scale_fill_brewer(palette = \"Dark2\") + theme_base + geom_vline(xintercept=20, linetype=\"dashed\", color=\"red\")\ng_hist_1\n\n\n\n\n\n\n\nThese results look very good on visual inspection, and indeed precision and sensitivity are both very high. For a disjunctive score threshold of 20, my updated workflow achieves an F1 score of 99.0% for RNA sequences and 99.2% for DNA sequences; more than high enough to continue on to human viral analysis.\n\nCodetest_sens_spec <- function(tab, score_threshold, conjunctive, include_special){\n if (!include_special) tab <- filter(tab, viral_status_out %in% c(\"TRUE\", \"FALSE\"))\n tab_retained <- tab %>% mutate(\n conjunctive = conjunctive,\n retain_score_conjunctive = (adj_score_fwd > score_threshold & adj_score_rev > score_threshold), \n retain_score_disjunctive = (adj_score_fwd > score_threshold | adj_score_rev > score_threshold),\n retain_score = ifelse(conjunctive, retain_score_conjunctive, retain_score_disjunctive),\n retain = assigned_hv | hit_hv | retain_score) %>%\n group_by(viral_status_out, retain) %>% count\n pos_tru <- tab_retained %>% filter(viral_status_out == \"TRUE\", retain) %>% pull(n) %>% sum\n pos_fls <- tab_retained %>% filter(viral_status_out != \"TRUE\", retain) %>% pull(n) %>% sum\n neg_tru <- tab_retained %>% filter(viral_status_out != \"TRUE\", !retain) %>% pull(n) %>% sum\n neg_fls <- tab_retained %>% filter(viral_status_out == \"TRUE\", !retain) %>% pull(n) %>% sum\n sensitivity <- pos_tru / (pos_tru + neg_fls)\n specificity <- neg_tru / (neg_tru + pos_fls)\n precision <- pos_tru / (pos_tru + pos_fls)\n f1 <- 2 * precision * sensitivity / (precision + sensitivity)\n out <- tibble(threshold=score_threshold, include_special = include_special, \n conjunctive = conjunctive, sensitivity=sensitivity, \n specificity=specificity, precision=precision, f1=f1)\n return(out)\n}\nrange_f1 <- function(intab, inc_special, inrange=15:45){\n tss <- purrr::partial(test_sens_spec, tab=intab, include_special=inc_special)\n stats_conj <- lapply(inrange, tss, conjunctive=TRUE) %>% bind_rows\n stats_disj <- lapply(inrange, tss, conjunctive=FALSE) %>% bind_rows\n stats_all <- bind_rows(stats_conj, stats_disj) %>%\n pivot_longer(!(threshold:conjunctive), names_to=\"metric\", values_to=\"value\") %>%\n mutate(conj_label = ifelse(conjunctive, \"Conjunctive\", \"Disjunctive\"))\n return(stats_all)\n}\ninc_special <- FALSE\nstats_rna <- range_f1(mrg_blast_rna, inc_special) %>% mutate(na_type = \"RNA\")\nstats_dna <- range_f1(mrg_blast_dna, inc_special) %>% mutate(na_type = \"DNA\")\nstats_0 <- bind_rows(stats_rna, stats_dna) %>% mutate(attempt=0)\nthreshold_opt_0 <- stats_0 %>% group_by(conj_label,attempt,na_type) %>% \n filter(metric == \"f1\") %>%\n filter(value == max(value)) %>% filter(threshold == min(threshold))\ng_stats_0 <- ggplot(stats_0, aes(x=threshold, y=value, color=metric)) +\n geom_vline(data = threshold_opt_0, mapping = aes(xintercept=threshold),\n color = \"red\", linetype = \"dashed\") +\n geom_line() +\n scale_y_continuous(name = \"Value\", limits=c(0,1), breaks = seq(0,1,0.2), expand = c(0,0)) +\n scale_x_continuous(name = \"Threshold\", expand = c(0,0)) +\n scale_color_brewer(palette=\"Set3\") +\n facet_grid(na_type~conj_label) +\n theme_base\ng_stats_0\n\n\n\n\n\n\n\nHuman-infecting virus reads: analysis\n\nCode# Get raw read counts\nread_counts_raw <- basic_stats_raw %>%\n select(sample, date, na_type, n_reads_raw = n_read_pairs)\n\n# Get HV read counts\nmrg_hv <- mrg %>% mutate(hv_status = assigned_hv | hit_hv | adj_score_max >= 20)\nread_counts_hv <- mrg_hv %>% filter(hv_status) %>% group_by(sample) %>% count(name=\"n_reads_hv\")\nread_counts <- read_counts_raw %>% left_join(read_counts_hv, by=\"sample\") %>%\n mutate(n_reads_hv = replace_na(n_reads_hv, 0))\n\n# Aggregate\nread_counts_total <- read_counts %>% group_by(na_type) %>%\n summarize(n_reads_raw = sum(n_reads_raw),\n n_reads_hv = sum(n_reads_hv)) %>%\n mutate(sample= \"All samples\", date = \"All dates\")\nread_counts_agg <- read_counts %>% arrange(date) %>% \n mutate(date = as.character(date)) %>% arrange(sample) %>%\n bind_rows(read_counts_total) %>% \n mutate(sample = fct_inorder(sample),\n date = fct_inorder(date),\n p_reads_hv = n_reads_hv/n_reads_raw)\n\n# Get old counts\nn_hv_reads_old <- c(691, 121, 18, 224, 2, 26, 6, 21, 4, 29, 12, 22)\nn_hv_reads_old[13] <- sum(n_hv_reads_old[which(read_counts$na_type==\"RNA\")])\nn_hv_reads_old[14] <- sum(n_hv_reads_old[which(read_counts$na_type==\"DNA\")])\n\nread_counts_comp <- read_counts_agg %>%\n mutate(n_reads_hv_old = n_hv_reads_old,\n p_reads_hv_old = n_reads_hv_old/n_reads_raw)\n\n\nApplying a disjunctive cutoff at S=20 identifies 13,809 RNA reads and 66 DNA reads as human-viral. This gives an overall relative HV abundance of \\(5.00 \\times 10^{-5}\\) for RNA reads and \\(3.66 \\times 10^{-7}\\) for DNA reads. In contrast, the overall HV in the public dashboard is \\(3.93 \\times 10^{-6}\\) for RNA reads and \\(4.54 \\times 10^{-7}\\); the measured HV has thus increased 12.7x for RNA reads, but decreased slightly for DNA reads.\n\nCode# Visualize\ng_phv_agg <- ggplot(read_counts_agg, aes(x=date, color=na_type)) +\n geom_point(aes(y=p_reads_hv)) +\n scale_y_log10(\"Relative abundance of human virus reads\") +\n scale_color_na() + theme_kit\ng_phv_agg\n\n\n\n\n\n\n\nDigging into individual viruses, we see that fecal-oral viruses predominate, especially Mamastrovirus, Rotavirus, Salivirus, and fecal-oral Enterovirus species (especially Enterovirus C, which includes poliovirus):\n\nCode# Get viral taxon names for putative HV reads\nviral_taxa$name[viral_taxa$taxid == 249588] <- \"Mamastrovirus\"\nviral_taxa$name[viral_taxa$taxid == 194960] <- \"Kobuvirus\"\nviral_taxa$name[viral_taxa$taxid == 688449] <- \"Salivirus\"\nviral_taxa$name[viral_taxa$taxid == 585893] <- \"Picobirnaviridae\"\nviral_taxa$name[viral_taxa$taxid == 333922] <- \"Betapapillomavirus\"\n\nviral_taxa$name[viral_taxa$taxid == 334207] <- \"Betapapillomavirus 3\"\nviral_taxa$name[viral_taxa$taxid == 369960] <- \"Porcine type-C oncovirus\"\nviral_taxa$name[viral_taxa$taxid == 333924] <- \"Betapapillomavirus 2\"\n\nmrg_hv_named <- mrg_hv %>% left_join(viral_taxa, by=\"taxid\")\n\n# Discover viral species & genera for HV reads\nraise_rank <- function(read_db, taxid_db, out_rank = \"species\", verbose = FALSE){\n # Get higher ranks than search rank\n ranks <- c(\"subspecies\", \"species\", \"subgenus\", \"genus\", \"subfamily\", \"family\", \"suborder\", \"order\", \"class\", \"subphylum\", \"phylum\", \"kingdom\", \"superkingdom\")\n rank_match <- which.max(ranks == out_rank)\n high_ranks <- ranks[rank_match:length(ranks)]\n # Merge read DB and taxid DB\n reads <- read_db %>% select(-parent_taxid, -rank, -name) %>%\n left_join(taxid_db, by=\"taxid\")\n # Extract sequences that are already at appropriate rank\n reads_rank <- filter(reads, rank == out_rank)\n # Drop sequences at a higher rank and return unclassified sequences\n reads_norank <- reads %>% filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))\n while(nrow(reads_norank) > 0){ # As long as there are unclassified sequences...\n # Promote read taxids and re-merge with taxid DB, then re-classify and filter\n reads_remaining <- reads_norank %>% mutate(taxid = parent_taxid) %>%\n select(-parent_taxid, -rank, -name) %>%\n left_join(taxid_db, by=\"taxid\")\n reads_rank <- reads_remaining %>% filter(rank == out_rank) %>%\n bind_rows(reads_rank)\n reads_norank <- reads_remaining %>%\n filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))\n }\n # Finally, extract and append reads that were excluded during the process\n reads_dropped <- reads %>% filter(!seq_id %in% reads_rank$seq_id)\n reads_out <- reads_rank %>% bind_rows(reads_dropped) %>%\n select(-parent_taxid, -rank, -name) %>%\n left_join(taxid_db, by=\"taxid\")\n return(reads_out)\n}\nhv_reads_species <- raise_rank(mrg_hv_named, viral_taxa, \"species\")\nhv_reads_genus <- raise_rank(mrg_hv_named, viral_taxa, \"genus\")\nhv_reads_family <- raise_rank(mrg_hv_named, viral_taxa, \"family\")\n\n\n\nCode# Count reads for each human-viral family\nhv_family_counts <- hv_reads_family %>% \n group_by(sample, date, na_type, name, taxid) %>%\n count(name = \"n_reads_hv\") %>%\n group_by(sample, date, na_type) %>%\n mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking families and group others\nhv_family_major_tab <- hv_family_counts %>% group_by(name) %>% \n filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%\n arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > 0.01)\nhv_family_counts_major <- hv_family_counts %>%\n mutate(name_display = ifelse(name %in% hv_family_major_tab$name, name, \"Other\")) %>%\n group_by(sample, date, na_type, name_display) %>%\n summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), \n .groups=\"drop\") %>%\n mutate(name_display = factor(name_display, \n levels = c(hv_family_major_tab$name, \"Other\")))\n\n# Plot family composition\npalette <- c(brewer.pal(12, \"Set3\"), \"#AAAAAA\")\ng_hv_families <- ggplot(hv_family_counts_major, \n aes(x=date, y=p_reads_hv, fill=name_display)) +\n geom_col(position=\"stack\") +\n scale_fill_manual(name=\"Viral\\nfamily\", values=palette) +\n scale_y_continuous(name=\"% HV Reads\", limits=c(0,1), breaks=seq(0,1,0.2)) +\n facet_grid(.~na_type) +\n guides(fill=guide_legend(ncol=3)) +\n theme_kit\ng_hv_families\n\n\n\n\n\n\n\n\nCode# Count reads for each human-viral genus\nhv_genus_counts <- hv_reads_genus %>% \n group_by(sample, date, na_type, name, taxid) %>%\n count(name = \"n_reads_hv\") %>%\n group_by(sample, date, na_type) %>%\n mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking genera and group others\nhv_genus_major_tab <- hv_genus_counts %>% group_by(name) %>% \n filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%\n arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > 0.02)\nhv_genus_counts_major <- hv_genus_counts %>%\n mutate(name_display = ifelse(name %in% hv_genus_major_tab$name, name, \"Other\")) %>%\n group_by(sample, date, na_type, name_display) %>%\n summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), \n .groups=\"drop\") %>%\n mutate(name_display = factor(name_display, \n levels = c(hv_genus_major_tab$name, \"Other\")))\n\n# Plot genus composition\npalette <- c(brewer.pal(12, \"Set3\"), \"#AAAAAA\")\ng_hv_genera <- ggplot(hv_genus_counts_major, \n aes(x=date, y=p_reads_hv, fill=name_display)) +\n geom_col(position=\"stack\") +\n scale_fill_manual(name=\"Viral\\ngenus\", values=palette) +\n scale_y_continuous(name=\"% HV Reads\", limits=c(0,1), breaks=seq(0,1,0.2)) +\n facet_grid(.~na_type) +\n guides(fill=guide_legend(ncol=3)) +\n theme_kit\ng_hv_genera\n\n\n\n\n\n\n\n\nCode# Count reads for each human-viral species\nhv_species_counts <- hv_reads_species %>% \n group_by(sample, date, na_type, name, taxid) %>%\n count(name = \"n_reads_hv\") %>%\n group_by(sample, date, na_type) %>%\n mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking species and group others\nhv_species_major_tab <- hv_species_counts %>% group_by(name, taxid) %>% \n filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%\n arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > 0.05)\nhv_species_counts_major <- hv_species_counts %>%\n mutate(name_display = ifelse(name %in% hv_species_major_tab$name, name, \"Other\")) %>%\n group_by(sample, date, na_type, name_display) %>%\n summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), \n .groups=\"drop\") %>%\n mutate(name_display = factor(name_display, \n levels = c(hv_species_major_tab$name, \"Other\")))\n\n# Plot species composition\npalette <- c(brewer.pal(12, \"Set3\"), \"#AAAAAA\")\ng_hv_species <- ggplot(hv_species_counts_major, \n aes(x=date, y=p_reads_hv, fill=name_display)) +\n geom_col(position=\"stack\") +\n scale_fill_manual(name=\"Viral\\nspecies\", values=palette) +\n scale_y_continuous(name=\"% HV Reads\", limits=c(0,1), breaks=seq(0,1,0.2)) +\n facet_grid(.~na_type) +\n guides(fill=guide_legend(ncol=3)) +\n theme_kit\ng_hv_species\n\n\n\n\n\n\n\nBy far the most common virus according to my pipeline (with over 91% of human-viral reads) is Rotavirus A; second (with 8%) is Orthopicobirnavirus hominis. Together, these two enteric RNA viruses dominate HV RNA reads in all samples. Both appear to be real; at least, BLASTN primarily maps these reads to the same or closely-related viruses to that identified by Bowtie2:\n\nCode# Import names db\ntax_names_path <- file.path(data_dir, \"taxid-names.tsv.gz\")\ntax_names <- read_tsv(tax_names_path, show_col_types = FALSE)\n\n# Add missing names\ntax_names_new <- tibble(\n staxid = c(557241, 557245, 557232, 557247, 557242, 557245),\n name = c(\"Canine rotavirus A79-10/G3P[3]\", \"Human rotavirus Ro1845\", \"Canine rotavirus K9\",\n \"Human rotavirus HCR3A\", \"Feline rotavirus Cat97/G3P[3]\", \"Feline rotavirus Cat2/G3P[9]\")\n)\ntax_names <- bind_rows(tax_names, tax_names_new)\n\nrota_ids <- hv_reads_species %>% filter(taxid==28875) %>% pull(seq_id)\nrota_hits <- blast_results_paired %>%\n filter(seq_id %in% rota_ids) %>%\n mutate(staxid = as.integer(staxid))\nrota_hits_sorted <- rota_hits %>% group_by(staxid) %>% count %>% arrange(desc(n)) %>%\n left_join(tax_names, by=\"staxid\") %>%\n mutate(name = fct_inorder(name))\nrota_hits_sorted_head <- rota_hits_sorted %>% head(10) %>%\n mutate(name=fct_inorder(as.character(name)))\ng_rota <- ggplot(rota_hits_sorted_head,\n aes(x=n, y=fct_inorder(name))) + geom_col() +\n scale_x_continuous(name = \"# Mapped read pairs\") + theme_base +\n theme(axis.title.y = element_blank())\ng_rota\n\n\n\n\n\n\n\n\nCode# Add missing names\ntax_names_new <- tibble(\n staxid = c(442302, 3003954, 585895),\n name = c(\"Porcine picobirnavirus\", \"ticpantry virus 7\", \"uncultured picobirnavirus\")\n)\ntax_names <- bind_rows(tax_names, tax_names_new)\n\npico_ids <- hv_reads_species %>% filter(taxid==2956252) %>% pull(seq_id)\npico_hits <- blast_results_paired %>%\n filter(seq_id %in% pico_ids) %>%\n mutate(staxid = as.integer(staxid))\npico_hits_sorted <- pico_hits %>% group_by(staxid) %>% count %>% arrange(desc(n)) %>%\n left_join(tax_names, by=\"staxid\") %>%\n mutate(name = fct_inorder(name))\npico_hits_sorted_head <- pico_hits_sorted %>% head(10) %>%\n mutate(name=fct_inorder(as.character(name)))\ng_pico <- ggplot(pico_hits_sorted_head,\n aes(x=n, y=fct_inorder(name))) + geom_col() +\n scale_x_continuous(name = \"# Mapped read pairs\") + theme_base +\n theme(axis.title.y = element_blank())\ng_pico\n\n\n\n\n\n\n\nConclusion\nI’m quite happy with the quality of the human-viral selection process for this dataset, which ended up achieving very high precision and sensitivity. The most striking result coming out of this analysis is probably the drastic difference in total HV abundance between RNA and DNA reads, with the former >100x higher despite very similar processing methods. In the past we’ve attributed low HV abundance in the DNA datasets we’ve analyzed to the lack of viral enrichment in available DNA datasets; in this case, however, there is very little difference between the DNA and RNA processing methods, so this explanation doesn’t really hold. This makes me more pessimistic about achieving good HV relative abundance with DNA sequencing, even with improved viral enrichment methods."
+ },
+ {
+ "objectID": "notebooks/2024-04-12_prussin.html",
+ "href": "notebooks/2024-04-12_prussin.html",
+ "title": "Workflow analysis of Prussin et al. (2019)",
+ "section": "",
+ "text": "Taking a break from working on P2RA datasets, we’re also working on a review of air sampling for viral pathogen detection. For that study, we’re collecting and analyzing air MGS data that could give us a high-level idea of the likely viral composition of such samples.\nThe first dataset I’m looking at for this work is Prussin et al. (2019), a study of HVAC filter samples in a Virginia daycare center between 2014 and 2015. Samples were eluted from MERV-14 air filters collected every two weeks, with pairs of successive samples combined into four-week sampling periods. Like Brumfield et al, this study conducted both RNA and DNA sequencing; all samples were sequenced on an Illumina NextSeq500 with 2x150bp reads.\nThe raw data\nThe Prussin dataset comprised sequencing data from 14 timepoints spread across the year, from 20th January 2014 to 2nd February 2015. Each sample represents a four-week sampling period. In addition to the 14 on-site samples, there were also two control samples, a negative control (NC) and an “unexposed filter” control (UFC), which were collected on December 23rd 2014.\n\nCode# Data input paths\ndata_dir <- \"../data/2024-04-11_prussin/\"\nlibraries_path <- file.path(data_dir, \"sample-metadata.csv\")\nbasic_stats_path <- file.path(data_dir, \"qc_basic_stats.tsv\")\nadapter_stats_path <- file.path(data_dir, \"qc_adapter_stats.tsv\")\nquality_base_stats_path <- file.path(data_dir, \"qc_quality_base_stats.tsv\")\nquality_seq_stats_path <- file.path(data_dir, \"qc_quality_sequence_stats.tsv\")\n\n# Import libraries and extract metadata from sample names\nlibraries_raw <- read_csv(libraries_path, show_col_types = FALSE)\nlibraries <- libraries_raw %>%\n arrange(desc(na_type)) %>% mutate(na_type = fct_inorder(na_type)) %>%\n arrange(date) %>% rename(start_date = date) %>%\n mutate(end_date = start_date + 28) %>%\n mutate(date = fct_inorder(as.character(end_date)),\n ctrl = ifelse(grepl(\"Negative_Control\", sample_alias), \"NC\",\n ifelse(grepl(\"Unexposed_Filter\", sample_alias),\n \"UFC\", \"Non-Control\")),\n ctrl = factor(ctrl, levels = c(\"Non-Control\",\n \"NC\",\n \"UFC\")),\n open = (season != \"Closed\") & (ctrl == \"On-Site\"))\n\n# Import QC data\nstages <- c(\"raw_concat\", \"cleaned\", \"dedup\", \"ribo_initial\", \"ribo_secondary\")\nbasic_stats <- read_tsv(basic_stats_path, show_col_types = FALSE) %>%\n inner_join(libraries, by=\"sample\") %>% \n mutate(stage = factor(stage, levels = stages),\n sample = fct_inorder(sample))\nadapter_stats <- read_tsv(adapter_stats_path, show_col_types = FALSE) %>%\n mutate(sample = sub(\"_2$\", \"\", sample)) %>%\n inner_join(libraries, by=\"sample\") %>%\n mutate(stage = factor(stage, levels = stages),\n read_pair = fct_inorder(as.character(read_pair)))\nquality_base_stats <- read_tsv(quality_base_stats_path, show_col_types = FALSE) %>%\n inner_join(libraries, by=\"sample\") %>%\n mutate(stage = factor(stage, levels = stages),\n read_pair = fct_inorder(as.character(read_pair)))\nquality_seq_stats <- read_tsv(quality_seq_stats_path, show_col_types = FALSE) %>%\n inner_join(libraries, by=\"sample\") %>%\n mutate(stage = factor(stage, levels = stages),\n read_pair = fct_inorder(as.character(read_pair)))\n\n# Filter to raw data\nbasic_stats_raw <- basic_stats %>% filter(stage == \"raw_concat\")\nadapter_stats_raw <- adapter_stats %>% filter(stage == \"raw_concat\")\nquality_base_stats_raw <- quality_base_stats %>% filter(stage == \"raw_concat\")\nquality_seq_stats_raw <- quality_seq_stats %>% filter(stage == \"raw_concat\")\n\n# Get key values for readout\nraw_read_counts <- basic_stats_raw %>% group_by(na_type, ctrl) %>% \n summarize(rmin = min(n_read_pairs), rmax=max(n_read_pairs),\n rmean=mean(n_read_pairs), .groups = \"drop\")\nraw_read_totals <- basic_stats_raw %>% group_by(na_type, ctrl) %>% \n summarize(n_read_pairs = sum(n_read_pairs), \n n_bases_approx = sum(n_bases_approx), .groups = \"drop\")\nraw_dup <- basic_stats_raw %>% group_by(na_type, ctrl) %>% \n summarize(dmin = min(percent_duplicates), dmax=max(percent_duplicates),\n dmean=mean(percent_duplicates), .groups = \"drop\")\n\n\nThe 14 positive samples from the dataset yielded 5M-20M (mean 11.3M) RNA-sequencing reads and 10M-42M (mean 21.0M) DNA-sequencing reads per sample, for a total of 159M RNA read pairs and 294M DNA read pairs (46.3 and 87.0 gigabases of sequence, respectively). Controls contributed an additional 1M RNA read pairs and 17.5M DNA read pairs.\nIn positive samples, read qualities were mostly high but tailed off slightly at the 3’ end in some samples, suggesting the need for trimming. Adapter levels were high. Inferred duplication levels were low (2-4%) in DNA reads and moderate (21-44%) in RNA reads. Control sample reads were more problematic, with higher duplication and adapter levels and lower quality.\n\nCode# Prepare data\nbasic_stats_raw_metrics <- basic_stats_raw %>%\n select(date, na_type, ctrl,\n `# Read pairs` = n_read_pairs,\n `Total base pairs\\n(approx)` = n_bases_approx,\n `% Duplicates\\n(FASTQC)` = percent_duplicates) %>%\n pivot_longer(-(date:ctrl), names_to = \"metric\", values_to = \"value\") %>%\n mutate(metric = fct_inorder(metric))\n\n# Set up plot templates\nscale_fill_na <- purrr::partial(scale_fill_brewer, palette=\"Set1\", \n name=\"Nucleic acid type\")\ng_basic <- ggplot(basic_stats_raw_metrics, \n aes(x=date, y=value, fill=na_type)) +\n geom_col(position = \"dodge\") +\n scale_x_discrete(name=\"Collection Date\") +\n scale_y_continuous(expand=c(0,0)) +\n expand_limits(y=c(0,100)) +\n scale_fill_na() + \n facet_grid(metric~ctrl, scales = \"free\", space=\"free_x\", switch=\"y\") +\n theme_rotate + theme(\n axis.title.y = element_blank(),\n strip.text.y = element_text(face=\"plain\")\n )\ng_basic\n\n\n\n\n\n\n\n\nCode# Set up plotting templates\nscale_color_na <- purrr::partial(scale_color_brewer, palette=\"Set1\",\n name=\"Nucleic acid type\")\ng_qual_raw <- ggplot(mapping=aes(color=na_type, linetype=read_pair, \n group=interaction(sample,read_pair))) + \n scale_color_na() + scale_linetype_discrete(name = \"Read Pair\") +\n guides(color=guide_legend(nrow=2,byrow=TRUE),\n linetype = guide_legend(nrow=2,byrow=TRUE)) +\n theme_base\n\n# Visualize adapters\ng_adapters_raw <- g_qual_raw + \n geom_line(aes(x=position, y=pc_adapters), data=adapter_stats_raw) +\n scale_y_continuous(name=\"% Adapters\", limits=c(0,NA),\n breaks = seq(0,100,10), expand=c(0,0)) +\n scale_x_continuous(name=\"Position\", limits=c(0,NA),\n breaks=seq(0,140,20), expand=c(0,0)) +\n facet_grid(ctrl~adapter)\ng_adapters_raw\n\n\n\n\n\n\nCode# Visualize quality\ng_quality_base_raw <- g_qual_raw +\n geom_hline(yintercept=25, linetype=\"dashed\", color=\"red\") +\n geom_hline(yintercept=30, linetype=\"dashed\", color=\"red\") +\n geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats_raw) +\n scale_y_continuous(name=\"Mean Phred score\", expand=c(0,0), limits=c(10,45)) +\n scale_x_continuous(name=\"Position\", limits=c(0,NA),\n breaks=seq(0,140,20), expand=c(0,0)) +\n facet_grid(ctrl~.)\ng_quality_base_raw\n\n\n\n\n\n\nCodeg_quality_seq_raw <- g_qual_raw +\n geom_vline(xintercept=25, linetype=\"dashed\", color=\"red\") +\n geom_vline(xintercept=30, linetype=\"dashed\", color=\"red\") +\n geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats_raw) +\n scale_x_continuous(name=\"Mean Phred score\", expand=c(0,0)) +\n scale_y_continuous(name=\"# Sequences\", expand=c(0,0)) +\n facet_grid(ctrl~., scales = \"free_y\")\ng_quality_seq_raw\n\n\n\n\n\n\n\nPreprocessing\nThe average fraction of reads lost at each stage in the preprocessing pipeline is shown in the following table. For positive samples, on average, cleaning and deduplication removed about 19% of total read pairs from RNA libraries and about 7% from DNA libraries. Subsequent ribodepletion removed a further ~32% of total read pairs on average from RNA libraries but <0.5% of total read pairs from DNA libraries.\nControl samples, meanwhile, lost an average of 63% of RNA read pairs and 21% of DNA read pairs during cleaning and deduplication, consistent with the lower qualities observed above. Subsequent ribodepletion removed an additional 9% of total RNA read pairs and 0.3% of total DNA read pairs.\n\nCoden_reads_rel <- basic_stats %>% \n select(sample, date, na_type, ctrl, stage, \n percent_duplicates, n_read_pairs) %>%\n group_by(sample, na_type) %>% arrange(sample, na_type, stage) %>%\n mutate(p_reads_retained = replace_na(n_read_pairs / lag(n_read_pairs), 0),\n p_reads_lost = 1 - p_reads_retained,\n p_reads_retained_abs = n_read_pairs / n_read_pairs[1],\n p_reads_lost_abs = 1-p_reads_retained_abs,\n p_reads_lost_abs_marginal = replace_na(p_reads_lost_abs - lag(p_reads_lost_abs), 0))\nn_reads_rel_display <- n_reads_rel %>% \n rename(Stage=stage, `NA Type`=na_type, `Control?`=ctrl) %>% \n group_by(`Control?`, `NA Type`, Stage) %>% \n summarize(`% Total Reads Lost (Cumulative)` = paste0(round(min(p_reads_lost_abs*100),1), \"-\", round(max(p_reads_lost_abs*100),1), \" (mean \", round(mean(p_reads_lost_abs*100),1), \")\"),\n `% Total Reads Lost (Marginal)` = paste0(round(min(p_reads_lost_abs_marginal*100),1), \"-\", round(max(p_reads_lost_abs_marginal*100),1), \" (mean \", round(mean(p_reads_lost_abs_marginal*100),1), \")\"), .groups=\"drop\") %>% \n filter(Stage != \"raw_concat\") %>%\n mutate(Stage = Stage %>% as.numeric %>% factor(labels=c(\"Trimming & filtering\", \"Deduplication\", \"Initial ribodepletion\", \"Secondary ribodepletion\")))\nn_reads_rel_display\n\n\n \n\n\n\n\nCodeg_stage_trace <- ggplot(basic_stats, aes(x=stage, color=na_type, group=sample)) +\n scale_color_na() +\n facet_wrap(ctrl~na_type, scales=\"free\", ncol=2,\n labeller = label_wrap_gen(multi_line=FALSE)) +\n theme_kit\n\n# Plot reads over preprocessing\ng_reads_stages <- g_stage_trace +\n geom_line(aes(y=n_read_pairs)) +\n scale_y_continuous(\"# Read pairs\", expand=c(0,0), limits=c(0,NA))\ng_reads_stages\n\n\n\n\n\n\nCode# Plot relative read losses during preprocessing\ng_reads_rel <- ggplot(n_reads_rel, aes(x=stage, color=na_type, group=sample)) +\n geom_line(aes(y=p_reads_lost_abs_marginal)) +\n scale_y_continuous(\"% Total Reads Lost\", expand=c(0,0), \n labels = function(x) x*100) +\n scale_color_na() +\n facet_wrap(ctrl~na_type, scales=\"free\", ncol=2,\n labeller = label_wrap_gen(multi_line=FALSE)) +\n theme_kit\ng_reads_rel\n\n\n\n\n\n\n\nIn both positive and control samples, data cleaning with FASTP was very successful at removing adapters, with very few adapter sequences found by FASTQC at any stage after the raw data. FASTP was also successful at improving read quality.\n\nCodeg_qual <- ggplot(mapping=aes(color=na_type, linetype=read_pair, \n group=interaction(sample,read_pair))) + \n scale_color_na() + scale_linetype_discrete(name = \"Read Pair\") +\n guides(color=guide_legend(nrow=2,byrow=TRUE),\n linetype = guide_legend(nrow=2,byrow=TRUE)) +\n theme_base\n\n# Visualize adapters\ng_adapters <- g_qual + \n geom_line(aes(x=position, y=pc_adapters), data=adapter_stats) +\n scale_y_continuous(name=\"% Adapters\", limits=c(0,20),\n breaks = seq(0,50,10), expand=c(0,0)) +\n scale_x_continuous(name=\"Position\", limits=c(0,NA),\n breaks=seq(0,140,20), expand=c(0,0)) +\n facet_grid(stage~adapter)\ng_adapters\n\n\n\n\n\n\nCode# Visualize quality\ng_quality_base <- g_qual +\n geom_hline(yintercept=25, linetype=\"dashed\", color=\"red\") +\n geom_hline(yintercept=30, linetype=\"dashed\", color=\"red\") +\n geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats) +\n scale_y_continuous(name=\"Mean Phred score\", expand=c(0,0), limits=c(10,45)) +\n scale_x_continuous(name=\"Position\", limits=c(0,NA),\n breaks=seq(0,140,20), expand=c(0,0)) +\n facet_grid(stage~.)\ng_quality_base\n\n\n\n\n\n\nCodeg_quality_seq <- g_qual +\n geom_vline(xintercept=25, linetype=\"dashed\", color=\"red\") +\n geom_vline(xintercept=30, linetype=\"dashed\", color=\"red\") +\n geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats) +\n scale_x_continuous(name=\"Mean Phred score\", expand=c(0,0)) +\n scale_y_continuous(name=\"# Sequences\", expand=c(0,0)) +\n facet_grid(stage~.)\ng_quality_seq\n\n\n\n\n\n\n\nAccording to FASTQC, deduplication was moderately effective at reducing measured duplicate levels in on-site samples, with FASTQC-measured levels falling from an average of 34% to 23% for RNA reads and from 2.7% to 2.2% for DNA reads.\n\nCodestage_dup <- basic_stats %>% group_by(na_type, ctrl, stage) %>% \n summarize(dmin = min(percent_duplicates), dmax=max(percent_duplicates),\n dmean=mean(percent_duplicates), .groups = \"drop\")\n\ng_dup_stages <- g_stage_trace +\n geom_line(aes(y=percent_duplicates)) +\n scale_y_continuous(\"% Duplicates\", limits=c(0,NA), expand=c(0,0))\ng_dup_stages\n\n\n\n\n\n\nCodeg_readlen_stages <- g_stage_trace + geom_line(aes(y=mean_seq_len)) +\n scale_y_continuous(\"Mean read length (nt)\", expand=c(0,0), limits=c(0,NA))\ng_readlen_stages\n\n\n\n\n\n\n\nHigh-level composition\nAs before, to assess the high-level composition of the reads, I ran the ribodepleted files through Kraken (using the Standard 16 database) and summarized the results with Bracken. Combining these results with the read counts above gives us a breakdown of the inferred composition of the samples:\n\nCode# Import Bracken data\nbracken_path <- file.path(data_dir, \"bracken_counts.tsv\")\nbracken <- read_tsv(bracken_path, show_col_types = FALSE)\ntotal_assigned <- bracken %>% group_by(sample) %>% summarize(\n name = \"Total\",\n kraken_assigned_reads = sum(kraken_assigned_reads),\n added_reads = sum(added_reads),\n new_est_reads = sum(new_est_reads),\n fraction_total_reads = sum(fraction_total_reads)\n)\nbracken_spread <- bracken %>% select(name, sample, new_est_reads) %>%\n mutate(name = tolower(name)) %>%\n pivot_wider(id_cols = \"sample\", names_from = \"name\", \n values_from = \"new_est_reads\")\n\n# Count reads\nread_counts_preproc <- basic_stats %>% \n select(sample, na_type, ctrl, date, stage, n_read_pairs) %>%\n pivot_wider(id_cols = c(\"sample\", \"na_type\", \"ctrl\", \"date\"),\n names_from=\"stage\", values_from=\"n_read_pairs\")\nread_counts <- read_counts_preproc %>%\n inner_join(total_assigned %>% select(sample, new_est_reads), by = \"sample\") %>%\n rename(assigned = new_est_reads) %>%\n inner_join(bracken_spread, by=\"sample\")\n\n# Assess composition\nread_comp <- transmute(read_counts, sample=sample, na_type=na_type,\n ctrl = ctrl, date = date,\n n_filtered = raw_concat-cleaned,\n n_duplicate = cleaned-dedup,\n n_ribosomal = (dedup-ribo_initial) + (ribo_initial-ribo_secondary),\n n_unassigned = ribo_secondary-assigned,\n n_bacterial = bacteria,\n n_archaeal = archaea,\n n_viral = viruses,\n n_human = eukaryota)\nread_comp_long <- pivot_longer(read_comp, -(sample:date), \n names_to = \"classification\",\n names_prefix = \"n_\", values_to = \"n_reads\") %>%\n mutate(classification = fct_inorder(str_to_sentence(classification))) %>%\n group_by(sample) %>% mutate(p_reads = n_reads/sum(n_reads))\n\n# Summarize composition\nread_comp_summ <- read_comp_long %>% \n group_by(na_type, ctrl, classification) %>%\n summarize(n_reads = sum(n_reads), .groups = \"drop_last\") %>%\n mutate(n_reads = replace_na(n_reads,0),\n p_reads = n_reads/sum(n_reads),\n pc_reads = p_reads*100)\n\n\n\nCode# Prepare plotting templates\ng_comp_base <- ggplot(mapping=aes(x=date, y=p_reads, fill=classification)) +\n scale_x_discrete(name=\"Collection Date\") +\n facet_grid(na_type~ctrl, scales = \"free\", space = \"free_x\") +\n theme_rotate\nscale_y_pc_reads <- purrr::partial(scale_y_continuous, name = \"% Reads\",\n expand = c(0,0), labels = function(y) y*100)\n\n# Plot overall composition\ng_comp <- g_comp_base + geom_col(data = read_comp_long, position = \"stack\") +\n scale_y_pc_reads(limits = c(0,1.01), breaks = seq(0,1,0.2)) +\n scale_fill_brewer(palette = \"Set1\", name = \"Classification\")\ng_comp\n\n\n\n\n\n\nCode# Plot composition of minor components\nread_comp_minor <- read_comp_long %>% \n filter(classification %in% c(\"Archaeal\", \"Viral\", \"Other\"))\npalette_minor <- brewer.pal(9, \"Set1\")[c(6,7,9)]\ng_comp_minor <- g_comp_base + geom_col(data=read_comp_minor, position = \"stack\") +\n scale_y_pc_reads() +\n scale_fill_manual(values=palette_minor, name = \"Classification\")\ng_comp_minor\n\n\n\n\n\n\n\n\nCodep_reads_summ_group <- read_comp_long %>%\n mutate(classification = ifelse(classification %in% c(\"Filtered\", \"Duplicate\", \"Unassigned\"), \"Excluded\", as.character(classification)),\n classification = fct_inorder(classification)) %>%\n group_by(classification, sample, na_type, ctrl) %>%\n summarize(p_reads = sum(p_reads), .groups = \"drop\") %>%\n group_by(classification, na_type, ctrl) %>%\n summarize(pc_min = min(p_reads)*100, pc_max = max(p_reads)*100, \n pc_mean = mean(p_reads)*100, .groups = \"drop\")\np_reads_summ_prep <- p_reads_summ_group %>%\n mutate(classification = fct_inorder(classification),\n pc_min = pc_min %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),\n pc_max = pc_max %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),\n pc_mean = pc_mean %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),\n display = paste0(pc_min, \"-\", pc_max, \"% (mean \", pc_mean, \"%)\"))\np_reads_summ <- p_reads_summ_prep %>%\n select(ctrl, classification, na_type, display) %>%\n pivot_wider(names_from=na_type, values_from = display) %>% \n arrange(ctrl, classification)\np_reads_summ\n\n\n \n\n\n\nWhat we see is a very different picture from the wastewater samples I’ve been analyzing so far. Most notably, the fraction of (non-ribosomal) human reads is much higher. Even the 2015-01-20 samples, which were taken when the daycare center was closed for the winter holidays, showed human read fractions (for both nucleic-acid types) of >12%; non-control samples as a whole averaged 27% for RNA reads and 34% for DNA reads. Compare Brumfield (average 0.08% for RNA and 0.02% for DNA), Yang (mean 0.05% for RNA) or even Rothman (mean 1.8% for non-panel-enriched RNA samples).\nConversely, total viral reads are very low: mean 0.033% for RNA reads and 0.019% for DNA reads. Wastewater RNA datasets have typically had much higher total viruses: mean 0.5% for Brumfield, about the same for Crits-Christoph, 5.5% for Yang, 4.5% for Rothman. Brumfield’s DNA data contained substantially fewer viruses than their RNA data, but still more than Prussin: about 0.08% on average.\nLooking at viral families…was less informative than usual, especially for DNA reads. It turns out that these samples contain a lot of viral reads that Kraken2 was only able to classify to the class level. In DNA reads, samples were dominated by Caudoviricetes phages, though Herviviricetes (which includes herpesviruses) and Naldaviricetes (a class of arthropod-infecting viruses) also put in a respectable showing. In RNA reads, Caudoviricetes was again a major presence, but Alsuviricetes (a family of primarily plant pathogens) was often as or more prevalent, and Revtraviricetes (a class that includes Hepatitis B virus and retroviruses) was also significant.\n\nCode# Get viral taxonomy\nviral_taxa_path <- file.path(data_dir, \"viral-taxids.tsv.gz\")\nviral_taxa <- read_tsv(viral_taxa_path, show_col_types = FALSE)\n\n# Import Kraken reports & extract viral taxa\nsamples <- as.character(basic_stats_raw$sample)\ncol_names <- c(\"pc_reads_total\", \"n_reads_clade\", \"n_reads_direct\",\n \"rank\", \"taxid\", \"name\")\nreport_paths <- paste0(data_dir, \"kraken/\", samples, \".report.gz\")\nkraken_reports <- lapply(1:length(samples), function(n) \n read_tsv(report_paths[n], col_names = col_names, show_col_types = FALSE) %>%\n mutate(sample = samples[n])) %>% bind_rows\nkraken_reports_viral <- filter(kraken_reports, taxid %in% viral_taxa$taxid) %>%\n group_by(sample) %>%\n mutate(p_reads_viral = n_reads_clade/n_reads_clade[1])\nkraken_reports_viral_cleaned <- kraken_reports_viral %>%\n inner_join(libraries, by=\"sample\") %>%\n select(-pc_reads_total, -n_reads_direct) %>%\n select(name, taxid, p_reads_viral, n_reads_clade, everything())\n\nviral_classes <- kraken_reports_viral_cleaned %>% filter(rank == \"C\")\nviral_families <- kraken_reports_viral_cleaned %>% filter(rank == \"F\")\n\n\n\nCode# Identify major viral classes\nviral_classes_major_tab <- viral_classes %>% \n group_by(name, taxid) %>%\n summarize(p_reads_viral_max = max(p_reads_viral), .groups=\"drop\") %>%\n filter(p_reads_viral_max >= 0.04)\nviral_classes_major_list <- viral_classes_major_tab %>% pull(name)\nviral_classes_major <- viral_classes %>% \n filter(name %in% viral_classes_major_list) %>%\n select(name, taxid, sample, na_type, ctrl, date, p_reads_viral)\nviral_classes_minor <- viral_classes_major %>% \n group_by(sample, na_type, ctrl, date) %>%\n summarize(p_reads_viral_major = sum(p_reads_viral), .groups = \"drop\") %>%\n mutate(name = \"Other\", taxid=NA, p_reads_viral = 1-p_reads_viral_major) %>%\n select(name, taxid, sample, na_type, ctrl, date, p_reads_viral)\nviral_classes_display <- bind_rows(viral_classes_major, viral_classes_minor) %>%\n arrange(desc(p_reads_viral)) %>% \n mutate(name = factor(name, levels=c(viral_classes_major_list, \"Other\"))) %>%\n rename(p_reads = p_reads_viral, classification=name)\n\npalette_viral <- c(brewer.pal(12, \"Set3\"), brewer.pal(8, \"Set2\"))\ng_classes <- g_comp_base + \n geom_col(data=viral_classes_display, position = \"stack\") +\n scale_y_continuous(name=\"% Viral Reads\", limits=c(0,1.01), breaks = seq(0,1,0.2),\n expand=c(0,0), labels = function(y) y*100) +\n scale_fill_manual(values=palette_viral, name = \"Viral class\")\ng_classes\n\n\n\n\n\n\n\nOf these, the most interesting are the strong presence of Herviviricetes in the DNA reads and Revtraviricetes in the RNA reads, as both of these are families that contain important human pathogens.\nDigging into the former, it turns out these reads are composed almost exclusively of Herpesviridae at the family level. Within non-control samples, these arise overwhelmingly from Cytomegalovirus. Digging in at the species level, these in turn are primarily attributed to a single CMV, Human betaherpesvirus 5, a.k.a. human cytomegalovirus (HCMV). I was excited to see this: this is the first time a single human pathogen, or even all human pathogens combined, have constituted a significant fraction of all viral reads in a sample I’ve analyzed with this pipeline.\n\nCode# Get all read counts in class\nhervi_taxid <- 2731363\nhervi_desc_taxids_old <- hervi_taxid\nhervi_desc_taxids_new <- unique(c(hervi_desc_taxids_old, viral_taxa %>% filter(parent_taxid %in% hervi_desc_taxids_old) %>% pull(taxid)))\nwhile (length(hervi_desc_taxids_new) > length(hervi_desc_taxids_old)){\n hervi_desc_taxids_old <- hervi_desc_taxids_new\n hervi_desc_taxids_new <- unique(c(hervi_desc_taxids_old, viral_taxa %>% filter(parent_taxid %in% hervi_desc_taxids_old) %>% pull(taxid)))\n}\nhervi_counts <- kraken_reports_viral_cleaned %>%\n filter(taxid %in% hervi_desc_taxids_new) %>%\n mutate(p_reads_hervi = n_reads_clade/n_reads_clade[1])\n\n# Get genus composition\nhervi_genera <- hervi_counts %>% filter(rank == \"S\", na_type == \"DNA\")\nhervi_genera_major_tab <- hervi_genera %>% \n group_by(name, taxid) %>%\n summarize(p_reads_hervi_max = max(p_reads_hervi), .groups=\"drop\") %>%\n filter(p_reads_hervi_max >= 0.04)\nhervi_genera_major_list <- hervi_genera_major_tab %>% pull(name)\nhervi_genera_major <- hervi_genera %>% \n filter(name %in% hervi_genera_major_list) %>%\n select(name, taxid, sample, na_type, ctrl, date, p_reads_hervi)\nhervi_genera_minor <- hervi_genera_major %>% \n group_by(sample, na_type, ctrl, date) %>%\n summarize(p_reads_hervi_major = sum(p_reads_hervi), .groups = \"drop\") %>%\n mutate(name = \"Other\", taxid=NA, p_reads_hervi = 1-p_reads_hervi_major) %>%\n select(name, taxid, sample, na_type, ctrl, date, p_reads_hervi)\nhervi_genera_display <- bind_rows(hervi_genera_major, hervi_genera_minor) %>%\n arrange(desc(p_reads_hervi)) %>% \n mutate(name = factor(name, levels=c(hervi_genera_major_list, \"Other\"))) %>%\n rename(p_reads = p_reads_hervi, classification=name)\n\n# Plot\ng_hervi_genera <- g_comp_base + \n geom_col(data=hervi_genera_display, position = \"stack\") +\n scale_y_continuous(name=\"% Herviviricetes Reads\", limits=c(0,1.01), \n breaks = seq(0,1,0.2),\n expand=c(0,0), labels = function(y) y*100) +\n scale_fill_manual(values=palette_viral, name = \"Viral species\")\ng_hervi_genera\n\n\n\n\n\n\n\nRevtraviricetes reads are similarly dominated by a single viral genus, Alpharetrovirus. Digging in at the species level, we see contributions from a variety of avian oncoviruses. To my (admittedly non-expert) knowledge, none of these infect humans, and I think they are probably primarily arising from local birds (or possibly rats).\n\nCode# Get all read counts in class\nrevtra_taxid <- 2732514\nrevtra_desc_taxids_old <- revtra_taxid\nrevtra_desc_taxids_new <- unique(c(revtra_desc_taxids_old, viral_taxa %>% filter(parent_taxid %in% revtra_desc_taxids_old) %>% pull(taxid)))\nwhile (length(revtra_desc_taxids_new) > length(revtra_desc_taxids_old)){\n revtra_desc_taxids_old <- revtra_desc_taxids_new\n revtra_desc_taxids_new <- unique(c(revtra_desc_taxids_old, viral_taxa %>% filter(parent_taxid %in% revtra_desc_taxids_old) %>% pull(taxid)))\n}\nrevtra_counts <- kraken_reports_viral_cleaned %>%\n filter(taxid %in% revtra_desc_taxids_new) %>%\n mutate(p_reads_revtra = n_reads_clade/n_reads_clade[1])\n\n# Get genus composition\nrevtra_species <- revtra_counts %>% filter(rank == \"S\", na_type == \"RNA\")\nrevtra_species_major_tab <- revtra_species %>% \n group_by(name, taxid) %>%\n summarize(p_reads_revtra_max = max(p_reads_revtra), .groups=\"drop\") %>%\n filter(p_reads_revtra_max >= 0.04)\nrevtra_species_major_list <- revtra_species_major_tab %>% pull(name)\nrevtra_species_major <- revtra_species %>% \n filter(name %in% revtra_species_major_list) %>%\n select(name, taxid, sample, na_type, ctrl, date, p_reads_revtra)\nrevtra_species_minor <- revtra_species_major %>% \n group_by(sample, na_type, ctrl, date) %>%\n summarize(p_reads_revtra_major = sum(p_reads_revtra), .groups = \"drop\") %>%\n mutate(name = \"Other\", taxid=NA, p_reads_revtra = 1-p_reads_revtra_major) %>%\n select(name, taxid, sample, na_type, ctrl, date, p_reads_revtra)\nrevtra_species_display <- bind_rows(revtra_species_major, revtra_species_minor) %>%\n arrange(desc(p_reads_revtra)) %>% \n mutate(name = factor(name, levels=c(revtra_species_major_list, \"Other\"))) %>%\n rename(p_reads = p_reads_revtra, classification=name)\n\n# Plot\ng_revtra_species <- g_comp_base + \n geom_col(data=revtra_species_display, position = \"stack\") +\n scale_y_continuous(name=\"% Revtraviricetes Reads\", limits=c(0,1.01), \n breaks = seq(0,1,0.2),\n expand=c(0,0), labels = function(y) y*100) +\n scale_fill_manual(values=palette_viral, name = \"Viral species\")\ng_revtra_species\n\n\n\n\n\n\n\nHuman-infecting virus reads: validation\nNext, I investigated the human-infecting virus read content of these unenriched samples. Using the same workflow I used for Brumfield et al, I identified 2811 RNA read pairs and 12792 DNA read pairs as putatively human viral: 0.003% and 0.005% of surviving reads, respectively.\n\nCodehv_reads_filtered_path <- file.path(data_dir, \"hv_hits_putative_filtered.tsv.gz\")\nhv_reads_filtered <- read_tsv(hv_reads_filtered_path, show_col_types = FALSE) %>%\n inner_join(libraries, by=\"sample\") %>% \n arrange(date, na_type)\nn_hv_filtered <- hv_reads_filtered %>% group_by(sample, date, na_type, ctrl) %>% count %>% inner_join(basic_stats %>% filter(stage == \"ribo_initial\") %>% select(sample, n_read_pairs), by=\"sample\") %>% rename(n_putative = n, n_total = n_read_pairs) %>% mutate(p_reads = n_putative/n_total, pc_reads = p_reads * 100)\nn_hv_filtered_summ <- n_hv_filtered %>% group_by(na_type, ctrl) %>% summarize(n_putative = sum(n_putative), n_total = sum(n_total), .groups=\"drop\") %>% mutate(p_reads = n_putative/n_total, pc_reads = p_reads*100)\n\n\n\nCodemrg <- hv_reads_filtered %>%\n mutate(kraken_label = ifelse(assigned_hv, \"Kraken2 HV\\nassignment\",\n ifelse(hit_hv, \"Kraken2 HV\\nhit\",\n \"No hit or\\nassignment\"))) %>%\n group_by(sample, na_type) %>% arrange(desc(adj_score_fwd), desc(adj_score_rev)) %>%\n mutate(seq_num = row_number(),\n adj_score_max = pmax(adj_score_fwd, adj_score_rev))\n# Import Bowtie2/Kraken data and perform filtering steps\ng_mrg <- ggplot(mrg, aes(x=adj_score_fwd, y=adj_score_rev)) +\n geom_point(alpha=0.5, shape=16) + \n scale_x_continuous(name=\"S(forward read)\", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +\n scale_y_continuous(name=\"S(reverse read)\", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +\n facet_grid(na_type~kraken_label, labeller = labeller(kit = label_wrap_gen(20))) +\n theme_base + theme(aspect.ratio=1)\ng_mrg\n\n\n\n\n\n\nCodeg_hist_0 <- ggplot(mrg, aes(x=adj_score_max)) + geom_histogram(binwidth=5,boundary=0,position=\"dodge\") + facet_grid(na_type~kraken_label, scales=\"free_y\") + scale_x_continuous(name = \"Maximum adjusted alignment score\") + scale_y_continuous(name=\"# Read pairs\") + scale_fill_brewer(palette = \"Dark2\") + theme_base + geom_vline(xintercept=20, linetype=\"dashed\", color=\"red\")\ng_hist_0\n\n\n\n\n\n\n\nAs previously described, I ran BLASTN on these reads via a dedicated EC2 instance, using the same parameters I’ve used for previous datasets.\n\nCodemrg_fasta <- mrg %>%\n mutate(seq_head = paste0(\">\", seq_id)) %>%\n ungroup %>%\n select(header1=seq_head, seq1=query_seq_fwd, \n header2=seq_head, seq2=query_seq_rev) %>%\n mutate(header1=paste0(header1, \"_1\"), header2=paste0(header2, \"_2\"))\nmrg_fasta_out <- do.call(paste, c(mrg_fasta, sep=\"\\n\")) %>% paste(collapse=\"\\n\")\nwrite(mrg_fasta_out, file.path(data_dir, \"blast/putative-viral.fasta\"))\n\n\n\nCode# Import BLAST results\n# Pre-filtering BLAST results to save space\n# blast_results_path <- file.path(data_dir, \"blast/putative-viral.blast.gz\")\n# blast_cols <- c(\"qseqid\", \"sseqid\", \"sgi\", \"staxid\", \"qlen\", \"evalue\", \"bitscore\", \"qcovs\", \"length\", \"pident\", \"mismatch\", \"gapopen\", \"sstrand\", \"qstart\", \"qend\", \"sstart\", \"send\")\n# blast_results <- read_tsv(blast_results_path, show_col_types = FALSE,\n# col_names = blast_cols, col_types = cols(.default=\"c\"))\nblast_results_path <- file.path(data_dir, \"blast/putative-viral-best.blast.gz\")\nblast_results <- read_tsv(blast_results_path, show_col_types = FALSE)\n\n# Filter for best hit for each query/subject combination\nblast_results_best <- blast_results %>% group_by(qseqid, staxid) %>% \n filter(bitscore == max(bitscore)) %>%\n filter(length == max(length)) %>% filter(row_number() == 1)\n\n# Rank hits for each query and filter for high-ranking hits\nblast_results_ranked <- blast_results_best %>% \n group_by(qseqid) %>% mutate(rank = dense_rank(desc(bitscore)))\nblast_results_highrank <- blast_results_ranked %>% filter(rank <= 5) %>%\n mutate(read_pair = str_split(qseqid, \"_\") %>% sapply(nth, n=-1), \n seq_id = str_split(qseqid, \"_\") %>% sapply(nth, n=1)) %>%\n mutate(bitscore = as.numeric(bitscore))\n\n# Summarize by read pair and taxid\nblast_results_paired <- blast_results_highrank %>%\n group_by(seq_id, staxid) %>%\n summarize(bitscore_max = max(bitscore), bitscore_min = min(bitscore),\n n_reads = n(), .groups = \"drop\")\n\n# Add viral status\nblast_results_viral <- mutate(blast_results_paired, viral = staxid %in% viral_taxa$taxid) %>%\n mutate(viral_full = viral & n_reads == 2)\n\n# Compare to Kraken & Bowtie assignments\nmrg_assign <- mrg %>% select(sample, seq_id, taxid, assigned_taxid, adj_score_max, na_type)\nblast_results_assign <- left_join(blast_results_viral, mrg_assign, by=\"seq_id\") %>%\n mutate(taxid_match_bowtie = (staxid == taxid),\n taxid_match_kraken = (staxid == assigned_taxid),\n taxid_match_any = taxid_match_bowtie | taxid_match_kraken)\nblast_results_out <- blast_results_assign %>%\n group_by(seq_id) %>%\n summarize(viral_status = ifelse(any(viral_full), 2,\n ifelse(any(taxid_match_any), 2,\n ifelse(any(viral), 1, 0))),\n .groups = \"drop\")\n\n\n\nCode# Merge BLAST results with unenriched read data\nmrg_blast <- full_join(mrg, blast_results_out, by=\"seq_id\") %>%\n mutate(viral_status = replace_na(viral_status, 0),\n viral_status_out = ifelse(viral_status == 0, FALSE, TRUE))\nmrg_blast_rna <- mrg_blast %>% filter(na_type == \"RNA\")\nmrg_blast_dna <- mrg_blast %>% filter(na_type == \"DNA\")\n\n# Plot RNA\ng_mrg_blast_rna <- mrg_blast_rna %>%\n ggplot(aes(x=adj_score_fwd, y=adj_score_rev, color=viral_status_out)) +\n geom_point(alpha=0.5, shape=16) +\n scale_x_continuous(name=\"S(forward read)\", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +\n scale_y_continuous(name=\"S(reverse read)\", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +\n scale_color_brewer(palette = \"Set1\", name = \"Viral status\") +\n facet_grid(viral_status_out~kraken_label, labeller = labeller(kit = label_wrap_gen(20))) +\n theme_base + labs(title=\"RNA\") + \n theme(aspect.ratio=1, plot.title = element_text(size=rel(2), hjust=0))\ng_mrg_blast_rna\n\n\n\n\n\n\nCode# Plot DNA\ng_mrg_blast_dna <- mrg_blast_dna %>%\n ggplot(aes(x=adj_score_fwd, y=adj_score_rev, color=viral_status_out)) +\n geom_point(alpha=0.5, shape=16) +\n scale_x_continuous(name=\"S(forward read)\", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +\n scale_y_continuous(name=\"S(reverse read)\", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +\n scale_color_brewer(palette = \"Set1\", name = \"Viral status\") +\n facet_grid(viral_status_out~kraken_label, labeller = labeller(kit = label_wrap_gen(20))) +\n theme_base + labs(title=\"DNA\") + \n theme(aspect.ratio=1, plot.title = element_text(size=rel(2), hjust=0))\ng_mrg_blast_dna\n\n\n\n\n\n\n\n\nCodeg_hist_1 <- ggplot(mrg_blast, aes(x=adj_score_max)) + geom_histogram(binwidth=5,boundary=0,position=\"dodge\") + facet_grid(na_type~viral_status_out, scales = \"free_y\") + scale_x_continuous(name = \"Maximum adjusted alignment score\") + scale_y_continuous(name=\"# Read pairs\") + scale_fill_brewer(palette = \"Dark2\") + theme_base + geom_vline(xintercept=20, linetype=\"dashed\", color=\"red\")\ng_hist_1\n\n\n\n\n\n\n\nThese results look good on visual inspection, and indeed precision and sensitivity are both very high. For a disjunctive score threshold of 20, my updated workflow achieves an F1 score of 96.7% for RNA sequences and 98.2% for DNA sequences.\n\nCodetest_sens_spec <- function(tab, score_threshold, conjunctive, include_special){\n if (!include_special) tab <- filter(tab, viral_status_out %in% c(\"TRUE\", \"FALSE\"))\n tab_retained <- tab %>% mutate(\n conjunctive = conjunctive,\n retain_score_conjunctive = (adj_score_fwd > score_threshold & adj_score_rev > score_threshold), \n retain_score_disjunctive = (adj_score_fwd > score_threshold | adj_score_rev > score_threshold),\n retain_score = ifelse(conjunctive, retain_score_conjunctive, retain_score_disjunctive),\n retain = assigned_hv | hit_hv | retain_score) %>%\n group_by(viral_status_out, retain) %>% count\n pos_tru <- tab_retained %>% filter(viral_status_out == \"TRUE\", retain) %>% pull(n) %>% sum\n pos_fls <- tab_retained %>% filter(viral_status_out != \"TRUE\", retain) %>% pull(n) %>% sum\n neg_tru <- tab_retained %>% filter(viral_status_out != \"TRUE\", !retain) %>% pull(n) %>% sum\n neg_fls <- tab_retained %>% filter(viral_status_out == \"TRUE\", !retain) %>% pull(n) %>% sum\n sensitivity <- pos_tru / (pos_tru + neg_fls)\n specificity <- neg_tru / (neg_tru + pos_fls)\n precision <- pos_tru / (pos_tru + pos_fls)\n f1 <- 2 * precision * sensitivity / (precision + sensitivity)\n out <- tibble(threshold=score_threshold, include_special = include_special, \n conjunctive = conjunctive, sensitivity=sensitivity, \n specificity=specificity, precision=precision, f1=f1)\n return(out)\n}\nrange_f1 <- function(intab, inc_special, inrange=15:45){\n tss <- purrr::partial(test_sens_spec, tab=intab, include_special=inc_special)\n stats_conj <- lapply(inrange, tss, conjunctive=TRUE) %>% bind_rows\n stats_disj <- lapply(inrange, tss, conjunctive=FALSE) %>% bind_rows\n stats_all <- bind_rows(stats_conj, stats_disj) %>%\n pivot_longer(!(threshold:conjunctive), names_to=\"metric\", values_to=\"value\") %>%\n mutate(conj_label = ifelse(conjunctive, \"Conjunctive\", \"Disjunctive\"))\n return(stats_all)\n}\ninc_special <- FALSE\nstats_rna <- range_f1(mrg_blast_rna, inc_special) %>% mutate(na_type = \"RNA\")\nstats_dna <- range_f1(mrg_blast_dna, inc_special) %>% mutate(na_type = \"DNA\")\nstats_0 <- bind_rows(stats_rna, stats_dna) %>% mutate(attempt=0)\nthreshold_opt_0 <- stats_0 %>% group_by(conj_label,attempt,na_type) %>% \n filter(metric == \"f1\") %>%\n filter(value == max(value)) %>% filter(threshold == min(threshold))\ng_stats_0 <- ggplot(stats_0, aes(x=threshold, y=value, color=metric)) +\n geom_vline(data = threshold_opt_0, mapping = aes(xintercept=threshold),\n color = \"red\", linetype = \"dashed\") +\n geom_line() +\n scale_y_continuous(name = \"Value\", limits=c(0,1), breaks = seq(0,1,0.2), expand = c(0,0)) +\n scale_x_continuous(name = \"Threshold\", expand = c(0,0)) +\n scale_color_brewer(palette=\"Set3\") +\n facet_grid(na_type~conj_label) +\n theme_base\ng_stats_0\n\n\n\n\n\n\n\nLooking into the composition of different read groups, nothing stands out except the predominance of HCMV (human betaherpesvirus 5), which is consistent with the Kraken results above and borne out by the BLASTN alignments:\n\nCodeviral_taxa$name[viral_taxa$taxid == 211787] <- \"Human papillomavirus type 92\"\nviral_taxa$name[viral_taxa$taxid == 509154] <- \"Porcine endogenous retrovirus C\"\n\nmajor_threshold <- 0.05\nfp <- mrg_blast %>% \n group_by(na_type, viral_status_out, \n highscore = adj_score_max >= 20, taxid) %>% count %>% \n group_by(na_type, viral_status_out, highscore) %>% mutate(p=n/sum(n)) %>% \n left_join(viral_taxa, by=\"taxid\") %>%\n arrange(desc(p)) %>%\n mutate(name = ifelse(taxid == 194958, \"Porcine endogenous retrovirus A\", name))\nfp_major_tab <- fp %>% filter(p > major_threshold) %>% arrange(desc(p))\nfp_major_list <- fp_major_tab %>% pull(name) %>% sort %>% unique %>% c(., \"Other\")\nfp_major <- fp %>% mutate(major = p > major_threshold) %>% \n mutate(name_display = ifelse(major, name, \"Other\")) %>%\n group_by(na_type, viral_status_out, highscore, name_display) %>% \n summarize(n=sum(n), p=sum(p), .groups = \"drop\") %>%\n mutate(name_display = factor(name_display, levels = fp_major_list),\n score_display = ifelse(highscore, \"S >= 20\", \"S < 20\"),\n status_display = ifelse(viral_status_out, \"True positive\", \"False positive\"))\ng_fp <- ggplot(fp_major, aes(x=score_display, y=p, fill=name_display)) +\n geom_col(position=\"stack\") +\n scale_x_discrete(name = \"True positive?\") +\n scale_y_continuous(name = \"% reads\", limits = c(0,1.01), \n breaks = seq(0,1,0.2), expand = c(0,0)) +\n scale_fill_manual(values = palette_viral, name = \"Viral\\ntaxon\") +\n facet_grid(na_type~status_display) +\n guides(fill=guide_legend(ncol=3)) +\n theme_kit\ng_fp\n\n\n\n\n\n\n\n\nCode# Configure\nref_taxid <- 10359\n\n# Get taxon names\ntax_names_path <- file.path(data_dir, \"taxid-names.tsv.gz\")\ntax_names <- read_tsv(tax_names_path, show_col_types = FALSE)\n\n# Add missing names\ntax_names_new <- tribble(~staxid, ~name,\n 3050295, \"Cytomegalovirus humanbeta5\",\n 459231, \"FLAG-tagging vector pFLAG97-TSR\")\ntax_names <- bind_rows(tax_names, tax_names_new)\nref_name <- tax_names %>% filter(staxid == ref_taxid) %>% pull(name)\n\n# Get major matches\nfp_staxid <- mrg_blast %>% filter(taxid == ref_taxid) %>%\n group_by(na_type, highscore = adj_score_max >= 20) %>% mutate(n_seq = n()) %>%\n left_join(blast_results_paired, by=\"seq_id\") %>%\n mutate(staxid = as.integer(staxid)) %>%\n left_join(tax_names, by=\"staxid\") %>% rename(sname=name) %>%\n left_join(tax_names %>% rename(taxid=staxid), by=\"taxid\")\nfp_staxid_count <- fp_staxid %>%\n group_by(viral_status_out, highscore, na_type, \n taxid, name, staxid, sname, n_seq) %>%\n count %>%\n group_by(viral_status_out, highscore, na_type, taxid, name) %>%\n mutate(p=n/n_seq)\nfp_staxid_count_major <- fp_staxid_count %>%\n filter(n>1, p>0.1, !is.na(staxid)) %>%\n mutate(score_display = ifelse(highscore, \"S >= 20\", \"S < 20\"),\n status_display = ifelse(viral_status_out, \n \"True positive\", \"False positive\"))\n\n# Plot\ng <- ggplot(fp_staxid_count_major, aes(x=p, y=sname)) + \n geom_col() + \n facet_grid(na_type~status_display+score_display, scales=\"free\",\n labeller = label_wrap_gen(multi_line = FALSE)) +\n scale_x_continuous(name=\"% mapped reads\", limits=c(0,1), breaks=seq(0,1,0.2),\n expand=c(0,0)) +\n labs(title=paste0(ref_name, \" (taxid \", ref_taxid, \")\")) +\n theme_base + theme(\n axis.title.y = element_blank(),\n plot.title = element_text(size=rel(1.5), hjust=0, face=\"plain\"))\ng\n\n\n\n\n\n\n\nHuman-infecting viruses: overall relative abundance\n\nCode# Get raw read counts\nread_counts_raw <- basic_stats_raw %>%\n select(sample, date, na_type, ctrl, n_reads_raw = n_read_pairs)\n\n# Get HV read counts\nmrg_hv <- mrg %>% mutate(hv_status = assigned_hv | hit_hv | adj_score_max >= 20)\nread_counts_hv <- mrg_hv %>% filter(hv_status) %>% group_by(sample) %>% \n count(name=\"n_reads_hv\")\nread_counts <- read_counts_raw %>% left_join(read_counts_hv, by=\"sample\") %>%\n mutate(n_reads_hv = replace_na(n_reads_hv, 0))\n\n# Aggregate\nread_counts_total <- read_counts %>% group_by(na_type, ctrl) %>%\n summarize(n_reads_raw = sum(n_reads_raw),\n n_reads_hv = sum(n_reads_hv), .groups=\"drop\") %>%\n mutate(sample= \"All samples\", date = \"All dates\")\nread_counts_agg <- read_counts %>% arrange(date) %>%\n arrange(sample) %>%\n bind_rows(read_counts_total) %>% \n mutate(sample = fct_inorder(sample),\n p_reads_hv = n_reads_hv/n_reads_raw)\n\n\nIn non-control samples, applying a disjunctive cutoff at S=20 identifies 2591 RNA reads and 12236 DNA reads as human-viral. This gives an overall relative HV abundance of \\(1.63 \\times 10^{-5}\\) for RNA reads and \\(4.16 \\times 10^{-5}\\) for DNA reads. Reassuringly, relative abundances in control samples are at least 10x lower. We also see a sharp drop in relative abundance for both nucleic-acid types for the period when the daycare was closed (2015-01-20):\n\nCode# Visualize\ng_phv_agg <- ggplot(read_counts_agg, aes(x=date, color=na_type)) +\n geom_point(aes(y=p_reads_hv)) +\n scale_y_log10(\"Relative abundance of human virus reads\") +\n scale_x_discrete(name=\"Collection Date\") +\n facet_grid(.~ctrl, scales = \"free\", space = \"free_x\") +\n scale_color_na() + theme_rotate\ng_phv_agg\n\n\n\n\n\n\n\nThese overall RA values are similar to those we’ve seen previously for non-panel-enriched wastewater RNA data. That said, it’s notable that the DNA read RA seen here is much higher than that seen in the only DNA wastewater dataset I’ve analyzed so far (Brumfield):\n\nCode# Collate past RA values\nra_past <- tribble(~dataset, ~ra, ~na_type, ~panel_enriched,\n \"Brumfield\", 5e-5, \"RNA\", FALSE,\n \"Brumfield\", 3.66e-7, \"DNA\", FALSE,\n \"Spurbeck\", 5.44e-6, \"RNA\", FALSE,\n \"Yang\", 3.62e-4, \"RNA\", FALSE,\n \"Rothman (unenriched)\", 1.87e-5, \"RNA\", FALSE,\n \"Rothman (panel-enriched)\", 3.3e-5, \"RNA\", TRUE,\n \"Crits-Christoph (unenriched)\", 1.37e-5, \"RNA\", FALSE,\n \"Crits-Christoph (panel-enriched)\", 1.26e-2, \"RNA\", TRUE)\n\n# Collate new RA values\nra_new <- tribble(~dataset, ~ra, ~na_type, ~panel_enriched,\n \"Prussin (non-control)\", 1.63e-5, \"RNA\", FALSE,\n \"Prussin (non-control)\", 4.16e-5, \"DNA\", FALSE)\n\n# Plot\nra_comp <- bind_rows(ra_past, ra_new) %>% mutate(dataset = fct_inorder(dataset))\ng_ra_comp <- ggplot(ra_comp, aes(y=dataset, x=ra, color=na_type)) +\n geom_point() +\n scale_color_na() +\n scale_x_log10(name=\"Relative abundance of human virus reads\") +\n theme_base + theme(axis.title.y = element_blank())\ng_ra_comp\n\n\n\n\n\n\n\nHuman-infecting viruses: taxonomy and composition\nAt the family level, we see that Papillomaviridae, Herpesviridae, and Polyomaviridae are the most abundant families in both DNA and RNA reads, with Adenoviridae and (in RNA reads) Picornaviridae also making a respectable showing. The Herpesviridae reads are, predictably, overwhelmingly from HCMV, while the Papillomaviridae and Polyomaviridae reads are split up among a larger number of related viruses:\n\nCode# Get viral taxon names for putative HV reads\nviral_taxa$name[viral_taxa$taxid == 249588] <- \"Mamastrovirus\"\nviral_taxa$name[viral_taxa$taxid == 194960] <- \"Kobuvirus\"\nviral_taxa$name[viral_taxa$taxid == 688449] <- \"Salivirus\"\nviral_taxa$name[viral_taxa$taxid == 585893] <- \"Picobirnaviridae\"\nviral_taxa$name[viral_taxa$taxid == 333922] <- \"Betapapillomavirus\"\nviral_taxa$name[viral_taxa$taxid == 334207] <- \"Betapapillomavirus 3\"\nviral_taxa$name[viral_taxa$taxid == 369960] <- \"Porcine type-C oncovirus\"\nviral_taxa$name[viral_taxa$taxid == 333924] <- \"Betapapillomavirus 2\"\nviral_taxa$name[viral_taxa$taxid == 687329] <- \"Anelloviridae\"\nviral_taxa$name[viral_taxa$taxid == 325455] <- \"Gammapapillomavirus\"\nviral_taxa$name[viral_taxa$taxid == 333750] <- \"Alphapapillomavirus\"\nviral_taxa$name[viral_taxa$taxid == 694002] <- \"Betacoronavirus\"\nviral_taxa$name[viral_taxa$taxid == 334202] <- \"Mupapillomavirus\"\nviral_taxa$name[viral_taxa$taxid == 197911] <- \"Alphainfluenzavirus\"\nviral_taxa$name[viral_taxa$taxid == 186938] <- \"Respirovirus\"\nviral_taxa$name[viral_taxa$taxid == 333926] <- \"Gammapapillomavirus 1\"\nviral_taxa$name[viral_taxa$taxid == 337051] <- \"Betapapillomavirus 1\"\nviral_taxa$name[viral_taxa$taxid == 337043] <- \"Alphapapillomavirus 4\"\nviral_taxa$name[viral_taxa$taxid == 694003] <- \"Betacoronavirus 1\"\nviral_taxa$name[viral_taxa$taxid == 334204] <- \"Mupapillomavirus 2\"\nviral_taxa$name[viral_taxa$taxid == 334208] <- \"Betapapillomavirus 4\"\n\nviral_taxa$name[viral_taxa$taxid == 334208] <- \"Betapapillomavirus 4\"\nviral_taxa$name[viral_taxa$taxid == 334208] <- \"Betapapillomavirus 4\"\nviral_taxa$name[viral_taxa$taxid == 334208] <- \"Betapapillomavirus 4\"\n\n\nmrg_hv_named <- mrg_hv %>% left_join(viral_taxa, by=\"taxid\")\n\n# Discover viral species & genera for HV reads\nraise_rank <- function(read_db, taxid_db, out_rank = \"species\", verbose = FALSE){\n # Get higher ranks than search rank\n ranks <- c(\"subspecies\", \"species\", \"subgenus\", \"genus\", \"subfamily\", \"family\", \"suborder\", \"order\", \"class\", \"subphylum\", \"phylum\", \"kingdom\", \"superkingdom\")\n rank_match <- which.max(ranks == out_rank)\n high_ranks <- ranks[rank_match:length(ranks)]\n # Merge read DB and taxid DB\n reads <- read_db %>% select(-parent_taxid, -rank, -name) %>%\n left_join(taxid_db, by=\"taxid\")\n # Extract sequences that are already at appropriate rank\n reads_rank <- filter(reads, rank == out_rank)\n # Drop sequences at a higher rank and return unclassified sequences\n reads_norank <- reads %>% filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))\n while(nrow(reads_norank) > 0){ # As long as there are unclassified sequences...\n # Promote read taxids and re-merge with taxid DB, then re-classify and filter\n reads_remaining <- reads_norank %>% mutate(taxid = parent_taxid) %>%\n select(-parent_taxid, -rank, -name) %>%\n left_join(taxid_db, by=\"taxid\")\n reads_rank <- reads_remaining %>% filter(rank == out_rank) %>%\n bind_rows(reads_rank)\n reads_norank <- reads_remaining %>%\n filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))\n }\n # Finally, extract and append reads that were excluded during the process\n reads_dropped <- reads %>% filter(!seq_id %in% reads_rank$seq_id)\n reads_out <- reads_rank %>% bind_rows(reads_dropped) %>%\n select(-parent_taxid, -rank, -name) %>%\n left_join(taxid_db, by=\"taxid\")\n return(reads_out)\n}\nhv_reads_species <- raise_rank(mrg_hv_named, viral_taxa, \"species\")\nhv_reads_genus <- raise_rank(mrg_hv_named, viral_taxa, \"genus\")\nhv_reads_family <- raise_rank(mrg_hv_named, viral_taxa, \"family\")\n\n\n\nCodethreshold_major_family <- 0.06\n\n# Count reads for each human-viral family\nhv_family_counts <- hv_reads_family %>% \n group_by(sample, date, ctrl, na_type, name, taxid) %>%\n count(name = \"n_reads_hv\") %>%\n group_by(sample, date, ctrl, na_type) %>%\n mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking families and group others\nhv_family_major_tab <- hv_family_counts %>% group_by(name) %>% \n filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%\n arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > threshold_major_family)\nhv_family_counts_major <- hv_family_counts %>%\n mutate(name_display = ifelse(name %in% hv_family_major_tab$name, name, \"Other\")) %>%\n group_by(sample, date, ctrl, na_type, name_display) %>%\n summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), \n .groups=\"drop\") %>%\n mutate(name_display = factor(name_display, \n levels = c(hv_family_major_tab$name, \"Other\")))\nhv_family_counts_display <- hv_family_counts_major %>%\n rename(p_reads = p_reads_hv, classification = name_display)\n\n# Plot\ng_hv_family <- g_comp_base + \n geom_col(data=hv_family_counts_display, position = \"stack\") +\n scale_y_continuous(name=\"% HV Reads\", limits=c(0,1.01), \n breaks = seq(0,1,0.2),\n expand=c(0,0), labels = function(y) y*100) +\n scale_fill_manual(values=palette_viral, name = \"Viral family\")\ng_hv_family\n\n\n\n\n\n\n\n\nCodethreshold_major_genus <- 0.06\n\n# Count reads for each human-viral genus\nhv_genus_counts <- hv_reads_genus %>% \n group_by(sample, date, ctrl, na_type, name, taxid) %>%\n count(name = \"n_reads_hv\") %>%\n group_by(sample, date, ctrl, na_type) %>%\n mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking families and group others\nhv_genus_major_tab <- hv_genus_counts %>% group_by(name) %>% \n filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%\n arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > threshold_major_genus)\nhv_genus_counts_major <- hv_genus_counts %>%\n mutate(name_display = ifelse(name %in% hv_genus_major_tab$name, name, \"Other\")) %>%\n group_by(sample, date, ctrl, na_type, name_display) %>%\n summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), \n .groups=\"drop\") %>%\n mutate(name_display = factor(name_display, \n levels = c(hv_genus_major_tab$name, \"Other\")))\nhv_genus_counts_display <- hv_genus_counts_major %>%\n rename(p_reads = p_reads_hv, classification = name_display)\n\n# Plot\ng_hv_genus <- g_comp_base + \n geom_col(data=hv_genus_counts_display, position = \"stack\") +\n scale_y_continuous(name=\"% HV Reads\", limits=c(0,1.01), \n breaks = seq(0,1,0.2),\n expand=c(0,0), labels = function(y) y*100) +\n scale_fill_manual(values=palette_viral, name = \"Viral genus\")\ng_hv_genus\n\n\n\n\n\n\n\n\nCodethreshold_major_species <- 0.10\n\n# Count reads for each human-viral species\nhv_species_counts <- hv_reads_species %>% \n group_by(sample, date, ctrl, na_type, name, taxid) %>%\n count(name = \"n_reads_hv\") %>%\n group_by(sample, date, ctrl, na_type) %>%\n mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking families and group others\nhv_species_major_tab <- hv_species_counts %>% group_by(name) %>% \n filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%\n arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > threshold_major_species)\nhv_species_counts_major <- hv_species_counts %>%\n mutate(name_display = ifelse(name %in% hv_species_major_tab$name, name, \"Other\")) %>%\n group_by(sample, date, ctrl, na_type, name_display) %>%\n summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), \n .groups=\"drop\") %>%\n mutate(name_display = factor(name_display, \n levels = c(hv_species_major_tab$name, \"Other\")))\nhv_species_counts_display <- hv_species_counts_major %>%\n rename(p_reads = p_reads_hv, classification = name_display)\n\n# Plot\ng_hv_species <- g_comp_base + \n geom_col(data=hv_species_counts_display, position = \"stack\") +\n scale_y_continuous(name=\"% HV Reads\", limits=c(0,1.01), \n breaks = seq(0,1,0.2),\n expand=c(0,0), labels = function(y) y*100) +\n scale_fill_manual(values=palette_viral, name = \"Viral species\")\ng_hv_species\n\n\n\n\n\n\n\nCompared to the previous datasets I’ve analyzed, the most notable difference is the absence of enteric viruses: Norovirus, Rotavirus, and Enterovirus are all absent from the list of abundant human-viral taxa, as are ~all other gastrointestinal pathogens.\nFinally, here are the overall relative abundances of a set of specific viral genera picked out manually on the basis of scientific or medical interest. In the future, I’ll quantify the RA of these genera across all datasets analyzed with this pipeline to date, which should give us a better sense of how these data compare to others’ under this pipeline.\n\nCode# Define reference genera\npath_genera_rna <- c(\"Mamastrovirus\", \"Enterovirus\", \"Salivirus\", \"Kobuvirus\", \"Norovirus\", \"Sapovirus\", \"Rotavirus\", \"Alphacoronavirus\", \"Betacoronavirus\",\n \"Alphainfluenzavirus\", \"Betainfluenzavirus\", \"Lentivirus\")\npath_genera_dna <- c(\"Mastadenovirus\", \"Alphapolyomavirus\", \"Betapolyomavirus\", \"Alphapapillomavirus\", \"Betapapillomavirus\", \"Orthopoxvirus\", \"Simplexvirus\",\n \"Lymphocryptovirus\", \"Cytomegalovirus\", \"Dependoparvovirus\")\npath_genera <- bind_rows(tibble(name=path_genera_rna, genome_type=\"RNA genome\"),\n tibble(name=path_genera_dna, genome_type=\"DNA genome\")) %>%\n left_join(viral_taxa, by=\"name\")\n\n# Count in each sample\nn_path_genera <- hv_reads_genus %>% \n group_by(sample, date, na_type, ctrl, name, taxid) %>% \n count(name=\"n_reads_viral\") %>% \n inner_join(path_genera, by=c(\"name\", \"taxid\")) %>%\n left_join(read_counts_raw, by=c(\"sample\", \"date\", \"na_type\", \"ctrl\")) %>%\n mutate(p_reads_viral = n_reads_viral/n_reads_raw)\n\n# Pivot out and back to add zero lines\nn_path_genera_out <- n_path_genera %>% ungroup %>% select(sample, name, n_reads_viral) %>%\n pivot_wider(names_from=\"name\", values_from=\"n_reads_viral\", values_fill=0) %>%\n pivot_longer(-sample, names_to=\"name\", values_to=\"n_reads_viral\") %>%\n left_join(read_counts_raw, by=\"sample\") %>%\n left_join(path_genera, by=\"name\") %>%\n mutate(p_reads_viral = n_reads_viral/n_reads_raw)\n\n## Aggregate across dates\nn_path_genera_stype <- n_path_genera_out %>% \n group_by(na_type, ctrl,\n name, taxid, genome_type) %>%\n summarize(n_reads_raw = sum(n_reads_raw),\n n_reads_viral = sum(n_reads_viral), .groups = \"drop\") %>%\n mutate(sample=\"All samples\", date=\"All dates\",\n p_reads_viral = n_reads_viral/n_reads_raw)\n\n# Plot\ng_path_genera <- ggplot(n_path_genera_stype,\n aes(y=name, x=p_reads_viral, color=na_type)) +\n geom_point() +\n scale_x_log10(name=\"Relative abundance\") +\n scale_color_na() +\n facet_grid(genome_type~ctrl, scales=\"free_y\") +\n theme_base + theme(axis.title.y = element_blank())\ng_path_genera\n\nWarning: Transformation introduced infinite values in continuous x-axis\n\n\n\n\n\n\n\n\nConclusion\nThis is the first air-sampling dataset I’ve analyzed using this pipeline, and it was interesting to see the differences from the wastewater datasets I’ve been analyzing so far. Among the more striking differences were:\n\nA much higher total fraction of human reads;\nA lower total fraction of viral reads;\nNear-total absence of enteric viruses and Tobamovirus;\nMuch higher relative abundance of herpesviruses, particularly HCMV, and papillomaviruses among human-infecting virus reads.\n\nConversely, one thing that was not notably different, at least in RNA viruses, was the total relative abundance of human-infecting viruses as a whole. Given the lower fraction of reads made up of all viruses, this means that the fraction of total viruses arising from human-infecting viruses is much higher here than we’ve historically seen with wastewater data. In particular, HCMV represents a nontrivial fraction of total viruses for many DNA libraries, the first time I’ve seen a human pathogen show up significantly in the total virus composition data.\nGoing forward, I have two more air sampling datasets to analyze, Rosario et al. 2018 and Leung et al. 2021. It will be interesting to see whether the findings from this dataset generalize to other air sampling contexts."
}
]
\ No newline at end of file
diff --git a/notebooks/2024-04-08_brumfield.qmd b/notebooks/2024-04-08_brumfield.qmd
index 810b062..ee02b89 100644
--- a/notebooks/2024-04-08_brumfield.qmd
+++ b/notebooks/2024-04-08_brumfield.qmd
@@ -480,7 +480,6 @@ Having made these changes, the workflow identified a total of 14,073 RNA read pa
```{r}
#| label: hv-read-counts
-
hv_reads_filtered_path <- file.path(data_dir, "hv_hits_putative_filtered.tsv.gz")
hv_reads_filtered <- read_tsv(hv_reads_filtered_path, show_col_types = FALSE) %>%
inner_join(libraries, by="sample") %>%
diff --git a/notebooks/2024-04-12_prussin.qmd b/notebooks/2024-04-12_prussin.qmd
new file mode 100644
index 0000000..012e66f
--- /dev/null
+++ b/notebooks/2024-04-12_prussin.qmd
@@ -0,0 +1,1199 @@
+---
+title: "Workflow analysis of Prussin et al. (2019)"
+subtitle: "Wastewater from a manhole in Maryland."
+author: "Will Bradshaw"
+date: 2024-04-12
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ code-link: true
+ df-print: paged
+editor: visual
+title-block-banner: black
+---
+
+```{r}
+#| label: load-packages
+#| include: false
+library(tidyverse)
+library(cowplot)
+library(patchwork)
+library(fastqcr)
+library(RColorBrewer)
+source("../scripts/aux_plot-theme.R")
+theme_base <- theme_base + theme(aspect.ratio = NULL)
+theme_rotate <- theme_base + theme(
+ axis.text.x = element_text(hjust = 1, angle = 45),
+)
+theme_kit <- theme_rotate + theme(
+ axis.title.x = element_blank(),
+)
+tnl <- theme(legend.position = "none")
+```
+
+Taking a break from working on [P2RA datasets](https://doi.org/10.1101/2023.12.22.23300450), we're also working on a review of air sampling for viral pathogen detection. For that study, we're collecting and analyzing air MGS data that could give us a high-level idea of the likely viral composition of such samples.
+
+The first dataset I'm looking at for this work is [Prussin et al. (2019)](https://doi.org/10.1186/s40168-019-0672-z), a study of HVAC filter samples in a Virginia daycare center between 2014 and 2015. Samples were eluted from MERV-14 air filters collected every two weeks, with pairs of successive samples combined into four-week sampling periods. Like Brumfield et al, this study conducted both RNA and DNA sequencing; all samples were sequenced on an Illumina NextSeq500 with 2x150bp reads.
+
+# The raw data
+
+The Prussin dataset comprised sequencing data from 14 timepoints spread across the year, from 20th January 2014 to 2nd February 2015. Each sample represents a four-week sampling period. In addition to the 14 on-site samples, there were also two control samples, a negative control (NC) and an "unexposed filter" control (UFC), which were collected on December 23rd 2014.
+
+```{r}
+#| warning: false
+#| label: read-qc-data
+
+# Data input paths
+data_dir <- "../data/2024-04-11_prussin/"
+libraries_path <- file.path(data_dir, "sample-metadata.csv")
+basic_stats_path <- file.path(data_dir, "qc_basic_stats.tsv")
+adapter_stats_path <- file.path(data_dir, "qc_adapter_stats.tsv")
+quality_base_stats_path <- file.path(data_dir, "qc_quality_base_stats.tsv")
+quality_seq_stats_path <- file.path(data_dir, "qc_quality_sequence_stats.tsv")
+
+# Import libraries and extract metadata from sample names
+libraries_raw <- read_csv(libraries_path, show_col_types = FALSE)
+libraries <- libraries_raw %>%
+ arrange(desc(na_type)) %>% mutate(na_type = fct_inorder(na_type)) %>%
+ arrange(date) %>% rename(start_date = date) %>%
+ mutate(end_date = start_date + 28) %>%
+ mutate(date = fct_inorder(as.character(end_date)),
+ ctrl = ifelse(grepl("Negative_Control", sample_alias), "NC",
+ ifelse(grepl("Unexposed_Filter", sample_alias),
+ "UFC", "Non-Control")),
+ ctrl = factor(ctrl, levels = c("Non-Control",
+ "NC",
+ "UFC")),
+ open = (season != "Closed") & (ctrl == "On-Site"))
+
+# Import QC data
+stages <- c("raw_concat", "cleaned", "dedup", "ribo_initial", "ribo_secondary")
+basic_stats <- read_tsv(basic_stats_path, show_col_types = FALSE) %>%
+ inner_join(libraries, by="sample") %>%
+ mutate(stage = factor(stage, levels = stages),
+ sample = fct_inorder(sample))
+adapter_stats <- read_tsv(adapter_stats_path, show_col_types = FALSE) %>%
+ mutate(sample = sub("_2$", "", sample)) %>%
+ inner_join(libraries, by="sample") %>%
+ mutate(stage = factor(stage, levels = stages),
+ read_pair = fct_inorder(as.character(read_pair)))
+quality_base_stats <- read_tsv(quality_base_stats_path, show_col_types = FALSE) %>%
+ inner_join(libraries, by="sample") %>%
+ mutate(stage = factor(stage, levels = stages),
+ read_pair = fct_inorder(as.character(read_pair)))
+quality_seq_stats <- read_tsv(quality_seq_stats_path, show_col_types = FALSE) %>%
+ inner_join(libraries, by="sample") %>%
+ mutate(stage = factor(stage, levels = stages),
+ read_pair = fct_inorder(as.character(read_pair)))
+
+# Filter to raw data
+basic_stats_raw <- basic_stats %>% filter(stage == "raw_concat")
+adapter_stats_raw <- adapter_stats %>% filter(stage == "raw_concat")
+quality_base_stats_raw <- quality_base_stats %>% filter(stage == "raw_concat")
+quality_seq_stats_raw <- quality_seq_stats %>% filter(stage == "raw_concat")
+
+# Get key values for readout
+raw_read_counts <- basic_stats_raw %>% group_by(na_type, ctrl) %>%
+ summarize(rmin = min(n_read_pairs), rmax=max(n_read_pairs),
+ rmean=mean(n_read_pairs), .groups = "drop")
+raw_read_totals <- basic_stats_raw %>% group_by(na_type, ctrl) %>%
+ summarize(n_read_pairs = sum(n_read_pairs),
+ n_bases_approx = sum(n_bases_approx), .groups = "drop")
+raw_dup <- basic_stats_raw %>% group_by(na_type, ctrl) %>%
+ summarize(dmin = min(percent_duplicates), dmax=max(percent_duplicates),
+ dmean=mean(percent_duplicates), .groups = "drop")
+
+```
+
+The 14 positive samples from the dataset yielded 5M-20M (mean 11.3M) RNA-sequencing reads and 10M-42M (mean 21.0M) DNA-sequencing reads per sample, for a total of 159M RNA read pairs and 294M DNA read pairs (46.3 and 87.0 gigabases of sequence, respectively). Controls contributed an additional 1M RNA read pairs and 17.5M DNA read pairs.
+
+In positive samples, read qualities were mostly high but tailed off slightly at the 3' end in some samples, suggesting the need for trimming. Adapter levels were high. Inferred duplication levels were low (2-4%) in DNA reads and moderate (21-44%) in RNA reads. Control sample reads were more problematic, with higher duplication and adapter levels and lower quality.
+
+```{r}
+#| fig-width: 9
+#| warning: false
+#| label: plot-basic-stats
+
+# Prepare data
+basic_stats_raw_metrics <- basic_stats_raw %>%
+ select(date, na_type, ctrl,
+ `# Read pairs` = n_read_pairs,
+ `Total base pairs\n(approx)` = n_bases_approx,
+ `% Duplicates\n(FASTQC)` = percent_duplicates) %>%
+ pivot_longer(-(date:ctrl), names_to = "metric", values_to = "value") %>%
+ mutate(metric = fct_inorder(metric))
+
+# Set up plot templates
+scale_fill_na <- purrr::partial(scale_fill_brewer, palette="Set1",
+ name="Nucleic acid type")
+g_basic <- ggplot(basic_stats_raw_metrics,
+ aes(x=date, y=value, fill=na_type)) +
+ geom_col(position = "dodge") +
+ scale_x_discrete(name="Collection Date") +
+ scale_y_continuous(expand=c(0,0)) +
+ expand_limits(y=c(0,100)) +
+ scale_fill_na() +
+ facet_grid(metric~ctrl, scales = "free", space="free_x", switch="y") +
+ theme_rotate + theme(
+ axis.title.y = element_blank(),
+ strip.text.y = element_text(face="plain")
+ )
+g_basic
+```
+
+```{r}
+#| label: plot-raw-quality
+
+# Set up plotting templates
+scale_color_na <- purrr::partial(scale_color_brewer, palette="Set1",
+ name="Nucleic acid type")
+g_qual_raw <- ggplot(mapping=aes(color=na_type, linetype=read_pair,
+ group=interaction(sample,read_pair))) +
+ scale_color_na() + scale_linetype_discrete(name = "Read Pair") +
+ guides(color=guide_legend(nrow=2,byrow=TRUE),
+ linetype = guide_legend(nrow=2,byrow=TRUE)) +
+ theme_base
+
+# Visualize adapters
+g_adapters_raw <- g_qual_raw +
+ geom_line(aes(x=position, y=pc_adapters), data=adapter_stats_raw) +
+ scale_y_continuous(name="% Adapters", limits=c(0,NA),
+ breaks = seq(0,100,10), expand=c(0,0)) +
+ scale_x_continuous(name="Position", limits=c(0,NA),
+ breaks=seq(0,140,20), expand=c(0,0)) +
+ facet_grid(ctrl~adapter)
+g_adapters_raw
+
+# Visualize quality
+g_quality_base_raw <- g_qual_raw +
+ geom_hline(yintercept=25, linetype="dashed", color="red") +
+ geom_hline(yintercept=30, linetype="dashed", color="red") +
+ geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats_raw) +
+ scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
+ scale_x_continuous(name="Position", limits=c(0,NA),
+ breaks=seq(0,140,20), expand=c(0,0)) +
+ facet_grid(ctrl~.)
+g_quality_base_raw
+
+g_quality_seq_raw <- g_qual_raw +
+ geom_vline(xintercept=25, linetype="dashed", color="red") +
+ geom_vline(xintercept=30, linetype="dashed", color="red") +
+ geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats_raw) +
+ scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
+ scale_y_continuous(name="# Sequences", expand=c(0,0)) +
+ facet_grid(ctrl~., scales = "free_y")
+g_quality_seq_raw
+```
+
+# Preprocessing
+
+The average fraction of reads lost at each stage in the preprocessing pipeline is shown in the following table. For positive samples, on average, cleaning and deduplication removed about 19% of total read pairs from RNA libraries and about 7% from DNA libraries. Subsequent ribodepletion removed a further \~32% of total read pairs on average from RNA libraries but \<0.5% of total read pairs from DNA libraries.
+
+Control samples, meanwhile, lost an average of 63% of RNA read pairs and 21% of DNA read pairs during cleaning and deduplication, consistent with the lower qualities observed above. Subsequent ribodepletion removed an additional 9% of total RNA read pairs and 0.3% of total DNA read pairs.
+
+```{r}
+#| label: preproc-table
+n_reads_rel <- basic_stats %>%
+ select(sample, date, na_type, ctrl, stage,
+ percent_duplicates, n_read_pairs) %>%
+ group_by(sample, na_type) %>% arrange(sample, na_type, stage) %>%
+ mutate(p_reads_retained = replace_na(n_read_pairs / lag(n_read_pairs), 0),
+ p_reads_lost = 1 - p_reads_retained,
+ p_reads_retained_abs = n_read_pairs / n_read_pairs[1],
+ p_reads_lost_abs = 1-p_reads_retained_abs,
+ p_reads_lost_abs_marginal = replace_na(p_reads_lost_abs - lag(p_reads_lost_abs), 0))
+n_reads_rel_display <- n_reads_rel %>%
+ rename(Stage=stage, `NA Type`=na_type, `Control?`=ctrl) %>%
+ group_by(`Control?`, `NA Type`, Stage) %>%
+ summarize(`% Total Reads Lost (Cumulative)` = paste0(round(min(p_reads_lost_abs*100),1), "-", round(max(p_reads_lost_abs*100),1), " (mean ", round(mean(p_reads_lost_abs*100),1), ")"),
+ `% Total Reads Lost (Marginal)` = paste0(round(min(p_reads_lost_abs_marginal*100),1), "-", round(max(p_reads_lost_abs_marginal*100),1), " (mean ", round(mean(p_reads_lost_abs_marginal*100),1), ")"), .groups="drop") %>%
+ filter(Stage != "raw_concat") %>%
+ mutate(Stage = Stage %>% as.numeric %>% factor(labels=c("Trimming & filtering", "Deduplication", "Initial ribodepletion", "Secondary ribodepletion")))
+n_reads_rel_display
+```
+
+```{r}
+#| label: preproc-figures
+#| warning: false
+#| fig-height: 8
+#| fig-width: 6
+
+g_stage_trace <- ggplot(basic_stats, aes(x=stage, color=na_type, group=sample)) +
+ scale_color_na() +
+ facet_wrap(ctrl~na_type, scales="free", ncol=2,
+ labeller = label_wrap_gen(multi_line=FALSE)) +
+ theme_kit
+
+# Plot reads over preprocessing
+g_reads_stages <- g_stage_trace +
+ geom_line(aes(y=n_read_pairs)) +
+ scale_y_continuous("# Read pairs", expand=c(0,0), limits=c(0,NA))
+g_reads_stages
+
+# Plot relative read losses during preprocessing
+g_reads_rel <- ggplot(n_reads_rel, aes(x=stage, color=na_type, group=sample)) +
+ geom_line(aes(y=p_reads_lost_abs_marginal)) +
+ scale_y_continuous("% Total Reads Lost", expand=c(0,0),
+ labels = function(x) x*100) +
+ scale_color_na() +
+ facet_wrap(ctrl~na_type, scales="free", ncol=2,
+ labeller = label_wrap_gen(multi_line=FALSE)) +
+ theme_kit
+g_reads_rel
+```
+
+In both positive and control samples, data cleaning with FASTP was very successful at removing adapters, with very few adapter sequences found by FASTQC at any stage after the raw data. FASTP was also successful at improving read quality.
+
+```{r}
+#| warning: false
+#| label: plot-quality
+#| fig-height: 7
+
+g_qual <- ggplot(mapping=aes(color=na_type, linetype=read_pair,
+ group=interaction(sample,read_pair))) +
+ scale_color_na() + scale_linetype_discrete(name = "Read Pair") +
+ guides(color=guide_legend(nrow=2,byrow=TRUE),
+ linetype = guide_legend(nrow=2,byrow=TRUE)) +
+ theme_base
+
+# Visualize adapters
+g_adapters <- g_qual +
+ geom_line(aes(x=position, y=pc_adapters), data=adapter_stats) +
+ scale_y_continuous(name="% Adapters", limits=c(0,20),
+ breaks = seq(0,50,10), expand=c(0,0)) +
+ scale_x_continuous(name="Position", limits=c(0,NA),
+ breaks=seq(0,140,20), expand=c(0,0)) +
+ facet_grid(stage~adapter)
+g_adapters
+
+# Visualize quality
+g_quality_base <- g_qual +
+ geom_hline(yintercept=25, linetype="dashed", color="red") +
+ geom_hline(yintercept=30, linetype="dashed", color="red") +
+ geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats) +
+ scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
+ scale_x_continuous(name="Position", limits=c(0,NA),
+ breaks=seq(0,140,20), expand=c(0,0)) +
+ facet_grid(stage~.)
+g_quality_base
+
+g_quality_seq <- g_qual +
+ geom_vline(xintercept=25, linetype="dashed", color="red") +
+ geom_vline(xintercept=30, linetype="dashed", color="red") +
+ geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats) +
+ scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
+ scale_y_continuous(name="# Sequences", expand=c(0,0)) +
+ facet_grid(stage~.)
+g_quality_seq
+```
+
+According to FASTQC, deduplication was moderately effective at reducing measured duplicate levels in on-site samples, with FASTQC-measured levels falling from an average of 34% to 23% for RNA reads and from 2.7% to 2.2% for DNA reads.
+
+```{r}
+#| label: preproc-dedup
+#| fig-height: 8
+#| fig-width: 6
+
+stage_dup <- basic_stats %>% group_by(na_type, ctrl, stage) %>%
+ summarize(dmin = min(percent_duplicates), dmax=max(percent_duplicates),
+ dmean=mean(percent_duplicates), .groups = "drop")
+
+g_dup_stages <- g_stage_trace +
+ geom_line(aes(y=percent_duplicates)) +
+ scale_y_continuous("% Duplicates", limits=c(0,NA), expand=c(0,0))
+g_dup_stages
+
+g_readlen_stages <- g_stage_trace + geom_line(aes(y=mean_seq_len)) +
+ scale_y_continuous("Mean read length (nt)", expand=c(0,0), limits=c(0,NA))
+g_readlen_stages
+```
+
+# High-level composition
+
+As before, to assess the high-level composition of the reads, I ran the ribodepleted files through Kraken (using the Standard 16 database) and summarized the results with Bracken. Combining these results with the read counts above gives us a breakdown of the inferred composition of the samples:
+
+```{r}
+#| label: prepare-composition
+
+# Import Bracken data
+bracken_path <- file.path(data_dir, "bracken_counts.tsv")
+bracken <- read_tsv(bracken_path, show_col_types = FALSE)
+total_assigned <- bracken %>% group_by(sample) %>% summarize(
+ name = "Total",
+ kraken_assigned_reads = sum(kraken_assigned_reads),
+ added_reads = sum(added_reads),
+ new_est_reads = sum(new_est_reads),
+ fraction_total_reads = sum(fraction_total_reads)
+)
+bracken_spread <- bracken %>% select(name, sample, new_est_reads) %>%
+ mutate(name = tolower(name)) %>%
+ pivot_wider(id_cols = "sample", names_from = "name",
+ values_from = "new_est_reads")
+
+# Count reads
+read_counts_preproc <- basic_stats %>%
+ select(sample, na_type, ctrl, date, stage, n_read_pairs) %>%
+ pivot_wider(id_cols = c("sample", "na_type", "ctrl", "date"),
+ names_from="stage", values_from="n_read_pairs")
+read_counts <- read_counts_preproc %>%
+ inner_join(total_assigned %>% select(sample, new_est_reads), by = "sample") %>%
+ rename(assigned = new_est_reads) %>%
+ inner_join(bracken_spread, by="sample")
+
+# Assess composition
+read_comp <- transmute(read_counts, sample=sample, na_type=na_type,
+ ctrl = ctrl, date = date,
+ n_filtered = raw_concat-cleaned,
+ n_duplicate = cleaned-dedup,
+ n_ribosomal = (dedup-ribo_initial) + (ribo_initial-ribo_secondary),
+ n_unassigned = ribo_secondary-assigned,
+ n_bacterial = bacteria,
+ n_archaeal = archaea,
+ n_viral = viruses,
+ n_human = eukaryota)
+read_comp_long <- pivot_longer(read_comp, -(sample:date),
+ names_to = "classification",
+ names_prefix = "n_", values_to = "n_reads") %>%
+ mutate(classification = fct_inorder(str_to_sentence(classification))) %>%
+ group_by(sample) %>% mutate(p_reads = n_reads/sum(n_reads))
+
+# Summarize composition
+read_comp_summ <- read_comp_long %>%
+ group_by(na_type, ctrl, classification) %>%
+ summarize(n_reads = sum(n_reads), .groups = "drop_last") %>%
+ mutate(n_reads = replace_na(n_reads,0),
+ p_reads = n_reads/sum(n_reads),
+ pc_reads = p_reads*100)
+```
+
+```{r}
+#| label: plot-composition-all
+
+# Prepare plotting templates
+g_comp_base <- ggplot(mapping=aes(x=date, y=p_reads, fill=classification)) +
+ scale_x_discrete(name="Collection Date") +
+ facet_grid(na_type~ctrl, scales = "free", space = "free_x") +
+ theme_rotate
+scale_y_pc_reads <- purrr::partial(scale_y_continuous, name = "% Reads",
+ expand = c(0,0), labels = function(y) y*100)
+
+# Plot overall composition
+g_comp <- g_comp_base + geom_col(data = read_comp_long, position = "stack") +
+ scale_y_pc_reads(limits = c(0,1.01), breaks = seq(0,1,0.2)) +
+ scale_fill_brewer(palette = "Set1", name = "Classification")
+g_comp
+
+# Plot composition of minor components
+read_comp_minor <- read_comp_long %>%
+ filter(classification %in% c("Archaeal", "Viral", "Other"))
+palette_minor <- brewer.pal(9, "Set1")[c(6,7,9)]
+g_comp_minor <- g_comp_base + geom_col(data=read_comp_minor, position = "stack") +
+ scale_y_pc_reads() +
+ scale_fill_manual(values=palette_minor, name = "Classification")
+g_comp_minor
+
+```
+
+```{r}
+#| label: composition-summary
+
+p_reads_summ_group <- read_comp_long %>%
+ mutate(classification = ifelse(classification %in% c("Filtered", "Duplicate", "Unassigned"), "Excluded", as.character(classification)),
+ classification = fct_inorder(classification)) %>%
+ group_by(classification, sample, na_type, ctrl) %>%
+ summarize(p_reads = sum(p_reads), .groups = "drop") %>%
+ group_by(classification, na_type, ctrl) %>%
+ summarize(pc_min = min(p_reads)*100, pc_max = max(p_reads)*100,
+ pc_mean = mean(p_reads)*100, .groups = "drop")
+p_reads_summ_prep <- p_reads_summ_group %>%
+ mutate(classification = fct_inorder(classification),
+ pc_min = pc_min %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),
+ pc_max = pc_max %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),
+ pc_mean = pc_mean %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),
+ display = paste0(pc_min, "-", pc_max, "% (mean ", pc_mean, "%)"))
+p_reads_summ <- p_reads_summ_prep %>%
+ select(ctrl, classification, na_type, display) %>%
+ pivot_wider(names_from=na_type, values_from = display) %>%
+ arrange(ctrl, classification)
+p_reads_summ
+```
+
+What we see is a very different picture from the wastewater samples I've been analyzing so far. Most notably, the fraction of (non-ribosomal) human reads is *much* higher. Even the 2015-01-20 samples, which were taken when the daycare center was closed for the winter holidays, showed human read fractions (for both nucleic-acid types) of \>12%; non-control samples as a whole averaged 27% for RNA reads and 34% for DNA reads. Compare [Brumfield](https://data.securebio.org/wills-public-notebook/notebooks/2024-04-08_brumfield.html) (average 0.08% for RNA and 0.02% for DNA), [Yang](https://data.securebio.org/wills-public-notebook/notebooks/2024-03-16_yang.html) (mean 0.05% for RNA) or even [Rothman](https://data.securebio.org/wills-public-notebook/notebooks/2024-02-27_rothman-1.html) (mean 1.8% for non-panel-enriched RNA samples).
+
+Conversely, total viral reads are very low: mean 0.033% for RNA reads and 0.019% for DNA reads. Wastewater RNA datasets have typically had much higher total viruses: mean 0.5% for Brumfield, about the same for Crits-Christoph, 5.5% for Yang, 4.5% for Rothman. Brumfield's DNA data contained substantially fewer viruses than their RNA data, but still more than Prussin: about 0.08% on average.
+
+Looking at viral families...was less informative than usual, especially for DNA reads. It turns out that these samples contain a lot of viral reads that Kraken2 was only able to classify to the class level. In DNA reads, samples were dominated by *Caudoviricetes* phages, though *Herviviricetes* (which includes herpesviruses) and *Naldaviricetes* (a class of arthropod-infecting viruses) also put in a respectable showing. In RNA reads, *Caudoviricetes* was again a major presence, but *Alsuviricetes* (a family of primarily plant pathogens) was often as or more prevalent, and *Revtraviricetes* (a class that includes Hepatitis B virus and retroviruses) was also significant.
+
+```{r}
+#| label: extract-viral-taxa
+
+# Get viral taxonomy
+viral_taxa_path <- file.path(data_dir, "viral-taxids.tsv.gz")
+viral_taxa <- read_tsv(viral_taxa_path, show_col_types = FALSE)
+
+# Import Kraken reports & extract viral taxa
+samples <- as.character(basic_stats_raw$sample)
+col_names <- c("pc_reads_total", "n_reads_clade", "n_reads_direct",
+ "rank", "taxid", "name")
+report_paths <- paste0(data_dir, "kraken/", samples, ".report.gz")
+kraken_reports <- lapply(1:length(samples), function(n)
+ read_tsv(report_paths[n], col_names = col_names, show_col_types = FALSE) %>%
+ mutate(sample = samples[n])) %>% bind_rows
+kraken_reports_viral <- filter(kraken_reports, taxid %in% viral_taxa$taxid) %>%
+ group_by(sample) %>%
+ mutate(p_reads_viral = n_reads_clade/n_reads_clade[1])
+kraken_reports_viral_cleaned <- kraken_reports_viral %>%
+ inner_join(libraries, by="sample") %>%
+ select(-pc_reads_total, -n_reads_direct) %>%
+ select(name, taxid, p_reads_viral, n_reads_clade, everything())
+
+viral_classes <- kraken_reports_viral_cleaned %>% filter(rank == "C")
+viral_families <- kraken_reports_viral_cleaned %>% filter(rank == "F")
+
+```
+
+```{r}
+#| label: viral-class-composition
+
+# Identify major viral classes
+viral_classes_major_tab <- viral_classes %>%
+ group_by(name, taxid) %>%
+ summarize(p_reads_viral_max = max(p_reads_viral), .groups="drop") %>%
+ filter(p_reads_viral_max >= 0.04)
+viral_classes_major_list <- viral_classes_major_tab %>% pull(name)
+viral_classes_major <- viral_classes %>%
+ filter(name %in% viral_classes_major_list) %>%
+ select(name, taxid, sample, na_type, ctrl, date, p_reads_viral)
+viral_classes_minor <- viral_classes_major %>%
+ group_by(sample, na_type, ctrl, date) %>%
+ summarize(p_reads_viral_major = sum(p_reads_viral), .groups = "drop") %>%
+ mutate(name = "Other", taxid=NA, p_reads_viral = 1-p_reads_viral_major) %>%
+ select(name, taxid, sample, na_type, ctrl, date, p_reads_viral)
+viral_classes_display <- bind_rows(viral_classes_major, viral_classes_minor) %>%
+ arrange(desc(p_reads_viral)) %>%
+ mutate(name = factor(name, levels=c(viral_classes_major_list, "Other"))) %>%
+ rename(p_reads = p_reads_viral, classification=name)
+
+palette_viral <- c(brewer.pal(12, "Set3"), brewer.pal(8, "Set2"))
+g_classes <- g_comp_base +
+ geom_col(data=viral_classes_display, position = "stack") +
+ scale_y_continuous(name="% Viral Reads", limits=c(0,1.01), breaks = seq(0,1,0.2),
+ expand=c(0,0), labels = function(y) y*100) +
+ scale_fill_manual(values=palette_viral, name = "Viral class")
+g_classes
+
+```
+
+Of these, the most interesting are the strong presence of *Herviviricetes* in the DNA reads and *Revtraviricetes* in the RNA reads, as both of these are families that contain important human pathogens.
+
+Digging into the former, it turns out these reads are composed almost exclusively of *Herpesviridae* at the family level. Within non-control samples, these arise overwhelmingly from *Cytomegalovirus.* Digging in at the species level, these in turn are primarily attributed to a single CMV, Human betaherpesvirus 5, a.k.a. human cytomegalovirus (HCMV). I was excited to see this: this is the first time a single human pathogen, or even all human pathogens combined, have constituted a significant fraction of all viral reads in a sample I've analyzed with this pipeline.
+
+```{r}
+#| label: herviviricetes
+
+# Get all read counts in class
+hervi_taxid <- 2731363
+hervi_desc_taxids_old <- hervi_taxid
+hervi_desc_taxids_new <- unique(c(hervi_desc_taxids_old, viral_taxa %>% filter(parent_taxid %in% hervi_desc_taxids_old) %>% pull(taxid)))
+while (length(hervi_desc_taxids_new) > length(hervi_desc_taxids_old)){
+ hervi_desc_taxids_old <- hervi_desc_taxids_new
+ hervi_desc_taxids_new <- unique(c(hervi_desc_taxids_old, viral_taxa %>% filter(parent_taxid %in% hervi_desc_taxids_old) %>% pull(taxid)))
+}
+hervi_counts <- kraken_reports_viral_cleaned %>%
+ filter(taxid %in% hervi_desc_taxids_new) %>%
+ mutate(p_reads_hervi = n_reads_clade/n_reads_clade[1])
+
+# Get genus composition
+hervi_genera <- hervi_counts %>% filter(rank == "S", na_type == "DNA")
+hervi_genera_major_tab <- hervi_genera %>%
+ group_by(name, taxid) %>%
+ summarize(p_reads_hervi_max = max(p_reads_hervi), .groups="drop") %>%
+ filter(p_reads_hervi_max >= 0.04)
+hervi_genera_major_list <- hervi_genera_major_tab %>% pull(name)
+hervi_genera_major <- hervi_genera %>%
+ filter(name %in% hervi_genera_major_list) %>%
+ select(name, taxid, sample, na_type, ctrl, date, p_reads_hervi)
+hervi_genera_minor <- hervi_genera_major %>%
+ group_by(sample, na_type, ctrl, date) %>%
+ summarize(p_reads_hervi_major = sum(p_reads_hervi), .groups = "drop") %>%
+ mutate(name = "Other", taxid=NA, p_reads_hervi = 1-p_reads_hervi_major) %>%
+ select(name, taxid, sample, na_type, ctrl, date, p_reads_hervi)
+hervi_genera_display <- bind_rows(hervi_genera_major, hervi_genera_minor) %>%
+ arrange(desc(p_reads_hervi)) %>%
+ mutate(name = factor(name, levels=c(hervi_genera_major_list, "Other"))) %>%
+ rename(p_reads = p_reads_hervi, classification=name)
+
+# Plot
+g_hervi_genera <- g_comp_base +
+ geom_col(data=hervi_genera_display, position = "stack") +
+ scale_y_continuous(name="% Herviviricetes Reads", limits=c(0,1.01),
+ breaks = seq(0,1,0.2),
+ expand=c(0,0), labels = function(y) y*100) +
+ scale_fill_manual(values=palette_viral, name = "Viral species")
+g_hervi_genera
+
+```
+
+*Revtraviricetes* reads are similarly dominated by a single viral genus, *Alpharetrovirus.* Digging in at the species level, we see contributions from a variety of avian oncoviruses. To my (admittedly non-expert) knowledge, none of these infect humans, and I think they are probably primarily arising from local birds (or possibly rats).
+
+```{r}
+#| label: revtraviricetes
+
+# Get all read counts in class
+revtra_taxid <- 2732514
+revtra_desc_taxids_old <- revtra_taxid
+revtra_desc_taxids_new <- unique(c(revtra_desc_taxids_old, viral_taxa %>% filter(parent_taxid %in% revtra_desc_taxids_old) %>% pull(taxid)))
+while (length(revtra_desc_taxids_new) > length(revtra_desc_taxids_old)){
+ revtra_desc_taxids_old <- revtra_desc_taxids_new
+ revtra_desc_taxids_new <- unique(c(revtra_desc_taxids_old, viral_taxa %>% filter(parent_taxid %in% revtra_desc_taxids_old) %>% pull(taxid)))
+}
+revtra_counts <- kraken_reports_viral_cleaned %>%
+ filter(taxid %in% revtra_desc_taxids_new) %>%
+ mutate(p_reads_revtra = n_reads_clade/n_reads_clade[1])
+
+# Get genus composition
+revtra_species <- revtra_counts %>% filter(rank == "S", na_type == "RNA")
+revtra_species_major_tab <- revtra_species %>%
+ group_by(name, taxid) %>%
+ summarize(p_reads_revtra_max = max(p_reads_revtra), .groups="drop") %>%
+ filter(p_reads_revtra_max >= 0.04)
+revtra_species_major_list <- revtra_species_major_tab %>% pull(name)
+revtra_species_major <- revtra_species %>%
+ filter(name %in% revtra_species_major_list) %>%
+ select(name, taxid, sample, na_type, ctrl, date, p_reads_revtra)
+revtra_species_minor <- revtra_species_major %>%
+ group_by(sample, na_type, ctrl, date) %>%
+ summarize(p_reads_revtra_major = sum(p_reads_revtra), .groups = "drop") %>%
+ mutate(name = "Other", taxid=NA, p_reads_revtra = 1-p_reads_revtra_major) %>%
+ select(name, taxid, sample, na_type, ctrl, date, p_reads_revtra)
+revtra_species_display <- bind_rows(revtra_species_major, revtra_species_minor) %>%
+ arrange(desc(p_reads_revtra)) %>%
+ mutate(name = factor(name, levels=c(revtra_species_major_list, "Other"))) %>%
+ rename(p_reads = p_reads_revtra, classification=name)
+
+# Plot
+g_revtra_species <- g_comp_base +
+ geom_col(data=revtra_species_display, position = "stack") +
+ scale_y_continuous(name="% Revtraviricetes Reads", limits=c(0,1.01),
+ breaks = seq(0,1,0.2),
+ expand=c(0,0), labels = function(y) y*100) +
+ scale_fill_manual(values=palette_viral, name = "Viral species")
+g_revtra_species
+
+```
+
+# Human-infecting virus reads: validation
+
+Next, I investigated the human-infecting virus read content of these unenriched samples. Using the same workflow I used for Brumfield et al, I identified 2811 RNA read pairs and 12792 DNA read pairs as putatively human viral: 0.003% and 0.005% of surviving reads, respectively.
+
+```{r}
+#| label: hv-read-counts
+hv_reads_filtered_path <- file.path(data_dir, "hv_hits_putative_filtered.tsv.gz")
+hv_reads_filtered <- read_tsv(hv_reads_filtered_path, show_col_types = FALSE) %>%
+ inner_join(libraries, by="sample") %>%
+ arrange(date, na_type)
+n_hv_filtered <- hv_reads_filtered %>% group_by(sample, date, na_type, ctrl) %>% count %>% inner_join(basic_stats %>% filter(stage == "ribo_initial") %>% select(sample, n_read_pairs), by="sample") %>% rename(n_putative = n, n_total = n_read_pairs) %>% mutate(p_reads = n_putative/n_total, pc_reads = p_reads * 100)
+n_hv_filtered_summ <- n_hv_filtered %>% group_by(na_type, ctrl) %>% summarize(n_putative = sum(n_putative), n_total = sum(n_total), .groups="drop") %>% mutate(p_reads = n_putative/n_total, pc_reads = p_reads*100)
+```
+
+```{r}
+#| label: plot-hv-scores
+#| warning: false
+#| fig-width: 8
+
+mrg <- hv_reads_filtered %>%
+ mutate(kraken_label = ifelse(assigned_hv, "Kraken2 HV\nassignment",
+ ifelse(hit_hv, "Kraken2 HV\nhit",
+ "No hit or\nassignment"))) %>%
+ group_by(sample, na_type) %>% arrange(desc(adj_score_fwd), desc(adj_score_rev)) %>%
+ mutate(seq_num = row_number(),
+ adj_score_max = pmax(adj_score_fwd, adj_score_rev))
+# Import Bowtie2/Kraken data and perform filtering steps
+g_mrg <- ggplot(mrg, aes(x=adj_score_fwd, y=adj_score_rev)) +
+ geom_point(alpha=0.5, shape=16) +
+ scale_x_continuous(name="S(forward read)", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +
+ scale_y_continuous(name="S(reverse read)", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +
+ facet_grid(na_type~kraken_label, labeller = labeller(kit = label_wrap_gen(20))) +
+ theme_base + theme(aspect.ratio=1)
+g_mrg
+
+g_hist_0 <- ggplot(mrg, aes(x=adj_score_max)) + geom_histogram(binwidth=5,boundary=0,position="dodge") + facet_grid(na_type~kraken_label, scales="free_y") + scale_x_continuous(name = "Maximum adjusted alignment score") + scale_y_continuous(name="# Read pairs") + scale_fill_brewer(palette = "Dark2") + theme_base + geom_vline(xintercept=20, linetype="dashed", color="red")
+g_hist_0
+```
+
+As previously described, I ran BLASTN on these reads via a dedicated EC2 instance, using the same parameters I've used for previous datasets.
+
+```{r}
+#| label: make-blast-fasta
+mrg_fasta <- mrg %>%
+ mutate(seq_head = paste0(">", seq_id)) %>%
+ ungroup %>%
+ select(header1=seq_head, seq1=query_seq_fwd,
+ header2=seq_head, seq2=query_seq_rev) %>%
+ mutate(header1=paste0(header1, "_1"), header2=paste0(header2, "_2"))
+mrg_fasta_out <- do.call(paste, c(mrg_fasta, sep="\n")) %>% paste(collapse="\n")
+write(mrg_fasta_out, file.path(data_dir, "blast/putative-viral.fasta"))
+```
+
+```{r}
+#| label: process-blast-data
+#| warning: false
+
+# Import BLAST results
+# Pre-filtering BLAST results to save space
+# blast_results_path <- file.path(data_dir, "blast/putative-viral.blast.gz")
+# blast_cols <- c("qseqid", "sseqid", "sgi", "staxid", "qlen", "evalue", "bitscore", "qcovs", "length", "pident", "mismatch", "gapopen", "sstrand", "qstart", "qend", "sstart", "send")
+# blast_results <- read_tsv(blast_results_path, show_col_types = FALSE,
+# col_names = blast_cols, col_types = cols(.default="c"))
+blast_results_path <- file.path(data_dir, "blast/putative-viral-best.blast.gz")
+blast_results <- read_tsv(blast_results_path, show_col_types = FALSE)
+
+# Filter for best hit for each query/subject combination
+blast_results_best <- blast_results %>% group_by(qseqid, staxid) %>%
+ filter(bitscore == max(bitscore)) %>%
+ filter(length == max(length)) %>% filter(row_number() == 1)
+
+# Rank hits for each query and filter for high-ranking hits
+blast_results_ranked <- blast_results_best %>%
+ group_by(qseqid) %>% mutate(rank = dense_rank(desc(bitscore)))
+blast_results_highrank <- blast_results_ranked %>% filter(rank <= 5) %>%
+ mutate(read_pair = str_split(qseqid, "_") %>% sapply(nth, n=-1),
+ seq_id = str_split(qseqid, "_") %>% sapply(nth, n=1)) %>%
+ mutate(bitscore = as.numeric(bitscore))
+
+# Summarize by read pair and taxid
+blast_results_paired <- blast_results_highrank %>%
+ group_by(seq_id, staxid) %>%
+ summarize(bitscore_max = max(bitscore), bitscore_min = min(bitscore),
+ n_reads = n(), .groups = "drop")
+
+# Add viral status
+blast_results_viral <- mutate(blast_results_paired, viral = staxid %in% viral_taxa$taxid) %>%
+ mutate(viral_full = viral & n_reads == 2)
+
+# Compare to Kraken & Bowtie assignments
+mrg_assign <- mrg %>% select(sample, seq_id, taxid, assigned_taxid, adj_score_max, na_type)
+blast_results_assign <- left_join(blast_results_viral, mrg_assign, by="seq_id") %>%
+ mutate(taxid_match_bowtie = (staxid == taxid),
+ taxid_match_kraken = (staxid == assigned_taxid),
+ taxid_match_any = taxid_match_bowtie | taxid_match_kraken)
+blast_results_out <- blast_results_assign %>%
+ group_by(seq_id) %>%
+ summarize(viral_status = ifelse(any(viral_full), 2,
+ ifelse(any(taxid_match_any), 2,
+ ifelse(any(viral), 1, 0))),
+ .groups = "drop")
+```
+
+```{r}
+#| label: plot-blast-results
+#| fig-height: 6
+#| warning: false
+
+# Merge BLAST results with unenriched read data
+mrg_blast <- full_join(mrg, blast_results_out, by="seq_id") %>%
+ mutate(viral_status = replace_na(viral_status, 0),
+ viral_status_out = ifelse(viral_status == 0, FALSE, TRUE))
+mrg_blast_rna <- mrg_blast %>% filter(na_type == "RNA")
+mrg_blast_dna <- mrg_blast %>% filter(na_type == "DNA")
+
+# Plot RNA
+g_mrg_blast_rna <- mrg_blast_rna %>%
+ ggplot(aes(x=adj_score_fwd, y=adj_score_rev, color=viral_status_out)) +
+ geom_point(alpha=0.5, shape=16) +
+ scale_x_continuous(name="S(forward read)", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +
+ scale_y_continuous(name="S(reverse read)", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +
+ scale_color_brewer(palette = "Set1", name = "Viral status") +
+ facet_grid(viral_status_out~kraken_label, labeller = labeller(kit = label_wrap_gen(20))) +
+ theme_base + labs(title="RNA") +
+ theme(aspect.ratio=1, plot.title = element_text(size=rel(2), hjust=0))
+g_mrg_blast_rna
+
+# Plot DNA
+g_mrg_blast_dna <- mrg_blast_dna %>%
+ ggplot(aes(x=adj_score_fwd, y=adj_score_rev, color=viral_status_out)) +
+ geom_point(alpha=0.5, shape=16) +
+ scale_x_continuous(name="S(forward read)", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +
+ scale_y_continuous(name="S(reverse read)", limits=c(0,65), breaks=seq(0,100,10), expand = c(0,0)) +
+ scale_color_brewer(palette = "Set1", name = "Viral status") +
+ facet_grid(viral_status_out~kraken_label, labeller = labeller(kit = label_wrap_gen(20))) +
+ theme_base + labs(title="DNA") +
+ theme(aspect.ratio=1, plot.title = element_text(size=rel(2), hjust=0))
+g_mrg_blast_dna
+```
+
+```{r}
+#| label: plot-blast-histogram
+g_hist_1 <- ggplot(mrg_blast, aes(x=adj_score_max)) + geom_histogram(binwidth=5,boundary=0,position="dodge") + facet_grid(na_type~viral_status_out, scales = "free_y") + scale_x_continuous(name = "Maximum adjusted alignment score") + scale_y_continuous(name="# Read pairs") + scale_fill_brewer(palette = "Dark2") + theme_base + geom_vline(xintercept=20, linetype="dashed", color="red")
+g_hist_1
+```
+
+These results look good on visual inspection, and indeed precision and sensitivity are both very high. For a disjunctive score threshold of 20, my updated workflow achieves an F1 score of 96.7% for RNA sequences and 98.2% for DNA sequences.
+
+```{r}
+#| label: plot-f1
+test_sens_spec <- function(tab, score_threshold, conjunctive, include_special){
+ if (!include_special) tab <- filter(tab, viral_status_out %in% c("TRUE", "FALSE"))
+ tab_retained <- tab %>% mutate(
+ conjunctive = conjunctive,
+ retain_score_conjunctive = (adj_score_fwd > score_threshold & adj_score_rev > score_threshold),
+ retain_score_disjunctive = (adj_score_fwd > score_threshold | adj_score_rev > score_threshold),
+ retain_score = ifelse(conjunctive, retain_score_conjunctive, retain_score_disjunctive),
+ retain = assigned_hv | hit_hv | retain_score) %>%
+ group_by(viral_status_out, retain) %>% count
+ pos_tru <- tab_retained %>% filter(viral_status_out == "TRUE", retain) %>% pull(n) %>% sum
+ pos_fls <- tab_retained %>% filter(viral_status_out != "TRUE", retain) %>% pull(n) %>% sum
+ neg_tru <- tab_retained %>% filter(viral_status_out != "TRUE", !retain) %>% pull(n) %>% sum
+ neg_fls <- tab_retained %>% filter(viral_status_out == "TRUE", !retain) %>% pull(n) %>% sum
+ sensitivity <- pos_tru / (pos_tru + neg_fls)
+ specificity <- neg_tru / (neg_tru + pos_fls)
+ precision <- pos_tru / (pos_tru + pos_fls)
+ f1 <- 2 * precision * sensitivity / (precision + sensitivity)
+ out <- tibble(threshold=score_threshold, include_special = include_special,
+ conjunctive = conjunctive, sensitivity=sensitivity,
+ specificity=specificity, precision=precision, f1=f1)
+ return(out)
+}
+range_f1 <- function(intab, inc_special, inrange=15:45){
+ tss <- purrr::partial(test_sens_spec, tab=intab, include_special=inc_special)
+ stats_conj <- lapply(inrange, tss, conjunctive=TRUE) %>% bind_rows
+ stats_disj <- lapply(inrange, tss, conjunctive=FALSE) %>% bind_rows
+ stats_all <- bind_rows(stats_conj, stats_disj) %>%
+ pivot_longer(!(threshold:conjunctive), names_to="metric", values_to="value") %>%
+ mutate(conj_label = ifelse(conjunctive, "Conjunctive", "Disjunctive"))
+ return(stats_all)
+}
+inc_special <- FALSE
+stats_rna <- range_f1(mrg_blast_rna, inc_special) %>% mutate(na_type = "RNA")
+stats_dna <- range_f1(mrg_blast_dna, inc_special) %>% mutate(na_type = "DNA")
+stats_0 <- bind_rows(stats_rna, stats_dna) %>% mutate(attempt=0)
+threshold_opt_0 <- stats_0 %>% group_by(conj_label,attempt,na_type) %>%
+ filter(metric == "f1") %>%
+ filter(value == max(value)) %>% filter(threshold == min(threshold))
+g_stats_0 <- ggplot(stats_0, aes(x=threshold, y=value, color=metric)) +
+ geom_vline(data = threshold_opt_0, mapping = aes(xintercept=threshold),
+ color = "red", linetype = "dashed") +
+ geom_line() +
+ scale_y_continuous(name = "Value", limits=c(0,1), breaks = seq(0,1,0.2), expand = c(0,0)) +
+ scale_x_continuous(name = "Threshold", expand = c(0,0)) +
+ scale_color_brewer(palette="Set3") +
+ facet_grid(na_type~conj_label) +
+ theme_base
+g_stats_0
+```
+
+Looking into the composition of different read groups, nothing stands out except the predominance of HCMV (human betaherpesvirus 5), which is consistent with the Kraken results above and borne out by the BLASTN alignments:
+
+```{r}
+#| label: fp
+#| fig-height: 6
+
+viral_taxa$name[viral_taxa$taxid == 211787] <- "Human papillomavirus type 92"
+viral_taxa$name[viral_taxa$taxid == 509154] <- "Porcine endogenous retrovirus C"
+
+major_threshold <- 0.05
+fp <- mrg_blast %>%
+ group_by(na_type, viral_status_out,
+ highscore = adj_score_max >= 20, taxid) %>% count %>%
+ group_by(na_type, viral_status_out, highscore) %>% mutate(p=n/sum(n)) %>%
+ left_join(viral_taxa, by="taxid") %>%
+ arrange(desc(p)) %>%
+ mutate(name = ifelse(taxid == 194958, "Porcine endogenous retrovirus A", name))
+fp_major_tab <- fp %>% filter(p > major_threshold) %>% arrange(desc(p))
+fp_major_list <- fp_major_tab %>% pull(name) %>% sort %>% unique %>% c(., "Other")
+fp_major <- fp %>% mutate(major = p > major_threshold) %>%
+ mutate(name_display = ifelse(major, name, "Other")) %>%
+ group_by(na_type, viral_status_out, highscore, name_display) %>%
+ summarize(n=sum(n), p=sum(p), .groups = "drop") %>%
+ mutate(name_display = factor(name_display, levels = fp_major_list),
+ score_display = ifelse(highscore, "S >= 20", "S < 20"),
+ status_display = ifelse(viral_status_out, "True positive", "False positive"))
+g_fp <- ggplot(fp_major, aes(x=score_display, y=p, fill=name_display)) +
+ geom_col(position="stack") +
+ scale_x_discrete(name = "True positive?") +
+ scale_y_continuous(name = "% reads", limits = c(0,1.01),
+ breaks = seq(0,1,0.2), expand = c(0,0)) +
+ scale_fill_manual(values = palette_viral, name = "Viral\ntaxon") +
+ facet_grid(na_type~status_display) +
+ guides(fill=guide_legend(ncol=3)) +
+ theme_kit
+g_fp
+
+```
+
+```{r}
+#| label: hcmv-blast-hits
+
+# Configure
+ref_taxid <- 10359
+
+# Get taxon names
+tax_names_path <- file.path(data_dir, "taxid-names.tsv.gz")
+tax_names <- read_tsv(tax_names_path, show_col_types = FALSE)
+
+# Add missing names
+tax_names_new <- tribble(~staxid, ~name,
+ 3050295, "Cytomegalovirus humanbeta5",
+ 459231, "FLAG-tagging vector pFLAG97-TSR")
+tax_names <- bind_rows(tax_names, tax_names_new)
+ref_name <- tax_names %>% filter(staxid == ref_taxid) %>% pull(name)
+
+# Get major matches
+fp_staxid <- mrg_blast %>% filter(taxid == ref_taxid) %>%
+ group_by(na_type, highscore = adj_score_max >= 20) %>% mutate(n_seq = n()) %>%
+ left_join(blast_results_paired, by="seq_id") %>%
+ mutate(staxid = as.integer(staxid)) %>%
+ left_join(tax_names, by="staxid") %>% rename(sname=name) %>%
+ left_join(tax_names %>% rename(taxid=staxid), by="taxid")
+fp_staxid_count <- fp_staxid %>%
+ group_by(viral_status_out, highscore, na_type,
+ taxid, name, staxid, sname, n_seq) %>%
+ count %>%
+ group_by(viral_status_out, highscore, na_type, taxid, name) %>%
+ mutate(p=n/n_seq)
+fp_staxid_count_major <- fp_staxid_count %>%
+ filter(n>1, p>0.1, !is.na(staxid)) %>%
+ mutate(score_display = ifelse(highscore, "S >= 20", "S < 20"),
+ status_display = ifelse(viral_status_out,
+ "True positive", "False positive"))
+
+# Plot
+g <- ggplot(fp_staxid_count_major, aes(x=p, y=sname)) +
+ geom_col() +
+ facet_grid(na_type~status_display+score_display, scales="free",
+ labeller = label_wrap_gen(multi_line = FALSE)) +
+ scale_x_continuous(name="% mapped reads", limits=c(0,1), breaks=seq(0,1,0.2),
+ expand=c(0,0)) +
+ labs(title=paste0(ref_name, " (taxid ", ref_taxid, ")")) +
+ theme_base + theme(
+ axis.title.y = element_blank(),
+ plot.title = element_text(size=rel(1.5), hjust=0, face="plain"))
+g
+```
+
+# Human-infecting viruses: overall relative abundance
+
+```{r}
+#| label: count-hv-reads
+
+# Get raw read counts
+read_counts_raw <- basic_stats_raw %>%
+ select(sample, date, na_type, ctrl, n_reads_raw = n_read_pairs)
+
+# Get HV read counts
+mrg_hv <- mrg %>% mutate(hv_status = assigned_hv | hit_hv | adj_score_max >= 20)
+read_counts_hv <- mrg_hv %>% filter(hv_status) %>% group_by(sample) %>%
+ count(name="n_reads_hv")
+read_counts <- read_counts_raw %>% left_join(read_counts_hv, by="sample") %>%
+ mutate(n_reads_hv = replace_na(n_reads_hv, 0))
+
+# Aggregate
+read_counts_total <- read_counts %>% group_by(na_type, ctrl) %>%
+ summarize(n_reads_raw = sum(n_reads_raw),
+ n_reads_hv = sum(n_reads_hv), .groups="drop") %>%
+ mutate(sample= "All samples", date = "All dates")
+read_counts_agg <- read_counts %>% arrange(date) %>%
+ arrange(sample) %>%
+ bind_rows(read_counts_total) %>%
+ mutate(sample = fct_inorder(sample),
+ p_reads_hv = n_reads_hv/n_reads_raw)
+```
+
+In non-control samples, applying a disjunctive cutoff at S=20 identifies 2591 RNA reads and 12236 DNA reads as human-viral. This gives an overall relative HV abundance of $1.63 \times 10^{-5}$ for RNA reads and $4.16 \times 10^{-5}$ for DNA reads. Reassuringly, relative abundances in control samples are at least 10x lower. We also see a sharp drop in relative abundance for both nucleic-acid types for the period when the daycare was closed (2015-01-20):
+
+```{r}
+#| label: plot-hv-ra
+#| warning: false
+# Visualize
+g_phv_agg <- ggplot(read_counts_agg, aes(x=date, color=na_type)) +
+ geom_point(aes(y=p_reads_hv)) +
+ scale_y_log10("Relative abundance of human virus reads") +
+ scale_x_discrete(name="Collection Date") +
+ facet_grid(.~ctrl, scales = "free", space = "free_x") +
+ scale_color_na() + theme_rotate
+g_phv_agg
+```
+
+These overall RA values are similar to those we've seen previously for non-panel-enriched wastewater RNA data. That said, it's notable that the DNA read RA seen here is much higher than that seen in the only DNA wastewater dataset I've analyzed so far (Brumfield):
+
+```{r}
+#| label: ra-hv-past
+
+# Collate past RA values
+ra_past <- tribble(~dataset, ~ra, ~na_type, ~panel_enriched,
+ "Brumfield", 5e-5, "RNA", FALSE,
+ "Brumfield", 3.66e-7, "DNA", FALSE,
+ "Spurbeck", 5.44e-6, "RNA", FALSE,
+ "Yang", 3.62e-4, "RNA", FALSE,
+ "Rothman (unenriched)", 1.87e-5, "RNA", FALSE,
+ "Rothman (panel-enriched)", 3.3e-5, "RNA", TRUE,
+ "Crits-Christoph (unenriched)", 1.37e-5, "RNA", FALSE,
+ "Crits-Christoph (panel-enriched)", 1.26e-2, "RNA", TRUE)
+
+# Collate new RA values
+ra_new <- tribble(~dataset, ~ra, ~na_type, ~panel_enriched,
+ "Prussin (non-control)", 1.63e-5, "RNA", FALSE,
+ "Prussin (non-control)", 4.16e-5, "DNA", FALSE)
+
+# Plot
+ra_comp <- bind_rows(ra_past, ra_new) %>% mutate(dataset = fct_inorder(dataset))
+g_ra_comp <- ggplot(ra_comp, aes(y=dataset, x=ra, color=na_type)) +
+ geom_point() +
+ scale_color_na() +
+ scale_x_log10(name="Relative abundance of human virus reads") +
+ theme_base + theme(axis.title.y = element_blank())
+g_ra_comp
+```
+
+# Human-infecting viruses: taxonomy and composition
+
+At the family level, we see that *Papillomaviridae*, *Herpesviridae*, and *Polyomaviridae* are the most abundant families in both DNA and RNA reads, with *Adenoviridae* and (in RNA reads) *Picornaviridae* also making a respectable showing. The *Herpesviridae* reads are, predictably, overwhelmingly from HCMV, while the *Papillomaviridae* and *Polyomaviridae* reads are split up among a larger number of related viruses:
+
+```{r}
+#| label: raise-hv-taxa
+
+# Get viral taxon names for putative HV reads
+viral_taxa$name[viral_taxa$taxid == 249588] <- "Mamastrovirus"
+viral_taxa$name[viral_taxa$taxid == 194960] <- "Kobuvirus"
+viral_taxa$name[viral_taxa$taxid == 688449] <- "Salivirus"
+viral_taxa$name[viral_taxa$taxid == 585893] <- "Picobirnaviridae"
+viral_taxa$name[viral_taxa$taxid == 333922] <- "Betapapillomavirus"
+viral_taxa$name[viral_taxa$taxid == 334207] <- "Betapapillomavirus 3"
+viral_taxa$name[viral_taxa$taxid == 369960] <- "Porcine type-C oncovirus"
+viral_taxa$name[viral_taxa$taxid == 333924] <- "Betapapillomavirus 2"
+viral_taxa$name[viral_taxa$taxid == 687329] <- "Anelloviridae"
+viral_taxa$name[viral_taxa$taxid == 325455] <- "Gammapapillomavirus"
+viral_taxa$name[viral_taxa$taxid == 333750] <- "Alphapapillomavirus"
+viral_taxa$name[viral_taxa$taxid == 694002] <- "Betacoronavirus"
+viral_taxa$name[viral_taxa$taxid == 334202] <- "Mupapillomavirus"
+viral_taxa$name[viral_taxa$taxid == 197911] <- "Alphainfluenzavirus"
+viral_taxa$name[viral_taxa$taxid == 186938] <- "Respirovirus"
+viral_taxa$name[viral_taxa$taxid == 333926] <- "Gammapapillomavirus 1"
+viral_taxa$name[viral_taxa$taxid == 337051] <- "Betapapillomavirus 1"
+viral_taxa$name[viral_taxa$taxid == 337043] <- "Alphapapillomavirus 4"
+viral_taxa$name[viral_taxa$taxid == 694003] <- "Betacoronavirus 1"
+viral_taxa$name[viral_taxa$taxid == 334204] <- "Mupapillomavirus 2"
+viral_taxa$name[viral_taxa$taxid == 334208] <- "Betapapillomavirus 4"
+
+viral_taxa$name[viral_taxa$taxid == 334208] <- "Betapapillomavirus 4"
+viral_taxa$name[viral_taxa$taxid == 334208] <- "Betapapillomavirus 4"
+viral_taxa$name[viral_taxa$taxid == 334208] <- "Betapapillomavirus 4"
+
+
+mrg_hv_named <- mrg_hv %>% left_join(viral_taxa, by="taxid")
+
+# Discover viral species & genera for HV reads
+raise_rank <- function(read_db, taxid_db, out_rank = "species", verbose = FALSE){
+ # Get higher ranks than search rank
+ ranks <- c("subspecies", "species", "subgenus", "genus", "subfamily", "family", "suborder", "order", "class", "subphylum", "phylum", "kingdom", "superkingdom")
+ rank_match <- which.max(ranks == out_rank)
+ high_ranks <- ranks[rank_match:length(ranks)]
+ # Merge read DB and taxid DB
+ reads <- read_db %>% select(-parent_taxid, -rank, -name) %>%
+ left_join(taxid_db, by="taxid")
+ # Extract sequences that are already at appropriate rank
+ reads_rank <- filter(reads, rank == out_rank)
+ # Drop sequences at a higher rank and return unclassified sequences
+ reads_norank <- reads %>% filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))
+ while(nrow(reads_norank) > 0){ # As long as there are unclassified sequences...
+ # Promote read taxids and re-merge with taxid DB, then re-classify and filter
+ reads_remaining <- reads_norank %>% mutate(taxid = parent_taxid) %>%
+ select(-parent_taxid, -rank, -name) %>%
+ left_join(taxid_db, by="taxid")
+ reads_rank <- reads_remaining %>% filter(rank == out_rank) %>%
+ bind_rows(reads_rank)
+ reads_norank <- reads_remaining %>%
+ filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))
+ }
+ # Finally, extract and append reads that were excluded during the process
+ reads_dropped <- reads %>% filter(!seq_id %in% reads_rank$seq_id)
+ reads_out <- reads_rank %>% bind_rows(reads_dropped) %>%
+ select(-parent_taxid, -rank, -name) %>%
+ left_join(taxid_db, by="taxid")
+ return(reads_out)
+}
+hv_reads_species <- raise_rank(mrg_hv_named, viral_taxa, "species")
+hv_reads_genus <- raise_rank(mrg_hv_named, viral_taxa, "genus")
+hv_reads_family <- raise_rank(mrg_hv_named, viral_taxa, "family")
+```
+
+```{r}
+#| label: hv-family
+#| fig-height: 6
+
+threshold_major_family <- 0.06
+
+# Count reads for each human-viral family
+hv_family_counts <- hv_reads_family %>%
+ group_by(sample, date, ctrl, na_type, name, taxid) %>%
+ count(name = "n_reads_hv") %>%
+ group_by(sample, date, ctrl, na_type) %>%
+ mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))
+
+# Identify high-ranking families and group others
+hv_family_major_tab <- hv_family_counts %>% group_by(name) %>%
+ filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%
+ arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > threshold_major_family)
+hv_family_counts_major <- hv_family_counts %>%
+ mutate(name_display = ifelse(name %in% hv_family_major_tab$name, name, "Other")) %>%
+ group_by(sample, date, ctrl, na_type, name_display) %>%
+ summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv),
+ .groups="drop") %>%
+ mutate(name_display = factor(name_display,
+ levels = c(hv_family_major_tab$name, "Other")))
+hv_family_counts_display <- hv_family_counts_major %>%
+ rename(p_reads = p_reads_hv, classification = name_display)
+
+# Plot
+g_hv_family <- g_comp_base +
+ geom_col(data=hv_family_counts_display, position = "stack") +
+ scale_y_continuous(name="% HV Reads", limits=c(0,1.01),
+ breaks = seq(0,1,0.2),
+ expand=c(0,0), labels = function(y) y*100) +
+ scale_fill_manual(values=palette_viral, name = "Viral family")
+g_hv_family
+```
+
+```{r}
+#| label: hv-genus
+#| fig-height: 6
+
+threshold_major_genus <- 0.06
+
+# Count reads for each human-viral genus
+hv_genus_counts <- hv_reads_genus %>%
+ group_by(sample, date, ctrl, na_type, name, taxid) %>%
+ count(name = "n_reads_hv") %>%
+ group_by(sample, date, ctrl, na_type) %>%
+ mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))
+
+# Identify high-ranking families and group others
+hv_genus_major_tab <- hv_genus_counts %>% group_by(name) %>%
+ filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%
+ arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > threshold_major_genus)
+hv_genus_counts_major <- hv_genus_counts %>%
+ mutate(name_display = ifelse(name %in% hv_genus_major_tab$name, name, "Other")) %>%
+ group_by(sample, date, ctrl, na_type, name_display) %>%
+ summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv),
+ .groups="drop") %>%
+ mutate(name_display = factor(name_display,
+ levels = c(hv_genus_major_tab$name, "Other")))
+hv_genus_counts_display <- hv_genus_counts_major %>%
+ rename(p_reads = p_reads_hv, classification = name_display)
+
+# Plot
+g_hv_genus <- g_comp_base +
+ geom_col(data=hv_genus_counts_display, position = "stack") +
+ scale_y_continuous(name="% HV Reads", limits=c(0,1.01),
+ breaks = seq(0,1,0.2),
+ expand=c(0,0), labels = function(y) y*100) +
+ scale_fill_manual(values=palette_viral, name = "Viral genus")
+g_hv_genus
+```
+
+```{r}
+#| label: hv-species
+#| fig-height: 6
+
+threshold_major_species <- 0.10
+
+# Count reads for each human-viral species
+hv_species_counts <- hv_reads_species %>%
+ group_by(sample, date, ctrl, na_type, name, taxid) %>%
+ count(name = "n_reads_hv") %>%
+ group_by(sample, date, ctrl, na_type) %>%
+ mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))
+
+# Identify high-ranking families and group others
+hv_species_major_tab <- hv_species_counts %>% group_by(name) %>%
+ filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%
+ arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > threshold_major_species)
+hv_species_counts_major <- hv_species_counts %>%
+ mutate(name_display = ifelse(name %in% hv_species_major_tab$name, name, "Other")) %>%
+ group_by(sample, date, ctrl, na_type, name_display) %>%
+ summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv),
+ .groups="drop") %>%
+ mutate(name_display = factor(name_display,
+ levels = c(hv_species_major_tab$name, "Other")))
+hv_species_counts_display <- hv_species_counts_major %>%
+ rename(p_reads = p_reads_hv, classification = name_display)
+
+# Plot
+g_hv_species <- g_comp_base +
+ geom_col(data=hv_species_counts_display, position = "stack") +
+ scale_y_continuous(name="% HV Reads", limits=c(0,1.01),
+ breaks = seq(0,1,0.2),
+ expand=c(0,0), labels = function(y) y*100) +
+ scale_fill_manual(values=palette_viral, name = "Viral species")
+g_hv_species
+```
+
+Compared to the previous datasets I've analyzed, the most notable difference is the absence of enteric viruses: *Norovirus*, *Rotavirus*, and *Enterovirus* are all absent from the list of abundant human-viral taxa, as are \~all other gastrointestinal pathogens.
+
+Finally, here are the overall relative abundances of a set of specific viral genera picked out manually on the basis of scientific or medical interest. In the future, I'll quantify the RA of these genera across all datasets analyzed with this pipeline to date, which should give us a better sense of how these data compare to others' under this pipeline.
+
+```{r}
+#| fig-height: 5
+#| label: ra-genera
+
+# Define reference genera
+path_genera_rna <- c("Mamastrovirus", "Enterovirus", "Salivirus", "Kobuvirus", "Norovirus", "Sapovirus", "Rotavirus", "Alphacoronavirus", "Betacoronavirus",
+ "Alphainfluenzavirus", "Betainfluenzavirus", "Lentivirus")
+path_genera_dna <- c("Mastadenovirus", "Alphapolyomavirus", "Betapolyomavirus", "Alphapapillomavirus", "Betapapillomavirus", "Orthopoxvirus", "Simplexvirus",
+ "Lymphocryptovirus", "Cytomegalovirus", "Dependoparvovirus")
+path_genera <- bind_rows(tibble(name=path_genera_rna, genome_type="RNA genome"),
+ tibble(name=path_genera_dna, genome_type="DNA genome")) %>%
+ left_join(viral_taxa, by="name")
+
+# Count in each sample
+n_path_genera <- hv_reads_genus %>%
+ group_by(sample, date, na_type, ctrl, name, taxid) %>%
+ count(name="n_reads_viral") %>%
+ inner_join(path_genera, by=c("name", "taxid")) %>%
+ left_join(read_counts_raw, by=c("sample", "date", "na_type", "ctrl")) %>%
+ mutate(p_reads_viral = n_reads_viral/n_reads_raw)
+
+# Pivot out and back to add zero lines
+n_path_genera_out <- n_path_genera %>% ungroup %>% select(sample, name, n_reads_viral) %>%
+ pivot_wider(names_from="name", values_from="n_reads_viral", values_fill=0) %>%
+ pivot_longer(-sample, names_to="name", values_to="n_reads_viral") %>%
+ left_join(read_counts_raw, by="sample") %>%
+ left_join(path_genera, by="name") %>%
+ mutate(p_reads_viral = n_reads_viral/n_reads_raw)
+
+## Aggregate across dates
+n_path_genera_stype <- n_path_genera_out %>%
+ group_by(na_type, ctrl,
+ name, taxid, genome_type) %>%
+ summarize(n_reads_raw = sum(n_reads_raw),
+ n_reads_viral = sum(n_reads_viral), .groups = "drop") %>%
+ mutate(sample="All samples", date="All dates",
+ p_reads_viral = n_reads_viral/n_reads_raw)
+
+# Plot
+g_path_genera <- ggplot(n_path_genera_stype,
+ aes(y=name, x=p_reads_viral, color=na_type)) +
+ geom_point() +
+ scale_x_log10(name="Relative abundance") +
+ scale_color_na() +
+ facet_grid(genome_type~ctrl, scales="free_y") +
+ theme_base + theme(axis.title.y = element_blank())
+g_path_genera
+```
+
+# Conclusion
+
+This is the first air-sampling dataset I've analyzed using this pipeline, and it was interesting to see the differences from the wastewater datasets I've been analyzing so far. Among the more striking differences were:
+
+- A much higher total fraction of human reads;
+
+- A lower total fraction of viral reads;
+
+- Near-total absence of enteric viruses and *Tobamovirus*;
+
+- Much higher relative abundance of herpesviruses, particularly HCMV, and papillomaviruses among human-infecting virus reads.
+
+Conversely, one thing that was not notably different, at least in RNA viruses, was the total relative abundance of human-infecting viruses as a whole. Given the lower fraction of reads made up of all viruses, this means that the fraction of total viruses arising from human-infecting viruses is much higher here than we've historically seen with wastewater data. In particular, HCMV represents a nontrivial fraction of total viruses for many DNA libraries, the first time I've seen a human pathogen show up significantly in the total virus composition data.
+
+Going forward, I have two more air sampling datasets to analyze, [Rosario et al. 2018](https://doi.org/10.1186/s40168-021-01044-7) and [Leung et al. 2021](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-021-01044-7). It will be interesting to see whether the findings from this dataset generalize to other air sampling contexts.