diff --git a/data/2024-05-01_maritz/adapters.fasta b/data/2024-05-01_maritz/adapters.fasta new file mode 100644 index 0000000..5037b6c --- /dev/null +++ b/data/2024-05-01_maritz/adapters.fasta @@ -0,0 +1,41 @@ +>0 +heifigepsna +>1 +ACACTCTTTCCCTACACGACGCTCTTCCGATCT +>2 +AGATCGGAAGAGCACACGTCTGAACTCCAGTCA +>3 +GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT +>4 +CAAGCAGAAGACGGCATACGAGAT +>5 +GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG +>6 +GATCGGAAGAGCACACGTCTGAACTCCAGTCAC +>7 +CTGTCTCTTATACACATCTGACGCTGCCGACGA +>8 +GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG +>9 +GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC +>10 +GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG +>11 +AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT +>12 +CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC +T +>13 +TGACTGGAGTTCAGACGTGTGCTCTTCCGATCT +>14 +unspecified +>15 +TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG +>16 +CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT +>17 +AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT +>18 +CTGTCTCTTATACACATCTCCGAGCCCACGAGAC +>19 +CGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT diff --git a/data/2024-05-01_maritz/hv_clade_counts.tsv.gz b/data/2024-05-01_maritz/hv_clade_counts.tsv.gz new file mode 100644 index 0000000..2fd015d Binary files /dev/null and b/data/2024-05-01_maritz/hv_clade_counts.tsv.gz differ diff --git a/data/2024-05-01_maritz/hv_hits_blast_paired.tsv.gz b/data/2024-05-01_maritz/hv_hits_blast_paired.tsv.gz new file mode 100644 index 0000000..6d141dd Binary files /dev/null and b/data/2024-05-01_maritz/hv_hits_blast_paired.tsv.gz differ diff --git a/data/2024-05-01_maritz/hv_hits_putative_filtered.tsv.gz b/data/2024-05-01_maritz/hv_hits_putative_filtered.tsv.gz new file mode 100644 index 0000000..4263ac0 Binary files /dev/null and b/data/2024-05-01_maritz/hv_hits_putative_filtered.tsv.gz differ diff --git a/data/2024-05-01_maritz/kraken_reports.tsv.gz b/data/2024-05-01_maritz/kraken_reports.tsv.gz new file mode 100644 index 0000000..1a4b270 Binary files /dev/null and b/data/2024-05-01_maritz/kraken_reports.tsv.gz differ diff --git a/data/2024-05-01_maritz/qc_adapter_stats.tsv.gz b/data/2024-05-01_maritz/qc_adapter_stats.tsv.gz new file mode 100644 index 0000000..6ab073e Binary files /dev/null and b/data/2024-05-01_maritz/qc_adapter_stats.tsv.gz differ diff --git a/data/2024-05-01_maritz/qc_basic_stats.tsv.gz b/data/2024-05-01_maritz/qc_basic_stats.tsv.gz new file mode 100644 index 0000000..b78f2dc Binary files /dev/null and b/data/2024-05-01_maritz/qc_basic_stats.tsv.gz differ diff --git a/data/2024-05-01_maritz/qc_quality_base_stats.tsv.gz b/data/2024-05-01_maritz/qc_quality_base_stats.tsv.gz new file mode 100644 index 0000000..a01c702 Binary files /dev/null and b/data/2024-05-01_maritz/qc_quality_base_stats.tsv.gz differ diff --git a/data/2024-05-01_maritz/qc_quality_sequence_stats.tsv.gz b/data/2024-05-01_maritz/qc_quality_sequence_stats.tsv.gz new file mode 100644 index 0000000..6eb7ae6 Binary files /dev/null and b/data/2024-05-01_maritz/qc_quality_sequence_stats.tsv.gz differ diff --git a/data/2024-05-01_maritz/sample-metadata.csv b/data/2024-05-01_maritz/sample-metadata.csv new file mode 100644 index 0000000..b3588c2 --- /dev/null +++ b/data/2024-05-01_maritz/sample-metadata.csv @@ -0,0 +1,17 @@ +library,sample,dataset,bioproject +ERR2729796,NYC-01,Maritz 2019,PRJEB28033 +ERR2729797,NYC-02,Maritz 2019,PRJEB28033 +ERR2729798,NYC-03,Maritz 2019,PRJEB28033 +ERR2729799,NYC-04,Maritz 2019,PRJEB28033 +ERR2729800,NYC-05,Maritz 2019,PRJEB28033 +ERR2729801,NYC-06,Maritz 2019,PRJEB28033 +ERR2729802,NYC-07,Maritz 2019,PRJEB28033 +ERR2729803,NYC-08,Maritz 
2019,PRJEB28033 +ERR2729804,NYC-09,Maritz 2019,PRJEB28033 +ERR2729805,NYC-10,Maritz 2019,PRJEB28033 +ERR2729806,NYC-11,Maritz 2019,PRJEB28033 +ERR2729807,NYC-12,Maritz 2019,PRJEB28033 +ERR2729808,NYC-13,Maritz 2019,PRJEB28033 +ERR2729809,NYC-14,Maritz 2019,PRJEB28033 +ERR2729810,NYC-15,Maritz 2019,PRJEB28033 +ERR2729811,NYC-16,Maritz 2019,PRJEB28033 \ No newline at end of file diff --git a/data/2024-05-01_maritz/taxid-names.tsv.gz b/data/2024-05-01_maritz/taxid-names.tsv.gz new file mode 120000 index 0000000..626546b --- /dev/null +++ b/data/2024-05-01_maritz/taxid-names.tsv.gz @@ -0,0 +1 @@ +../2024-04-01_spurbeck/taxid-names.tsv.gz \ No newline at end of file diff --git a/data/2024-05-01_maritz/taxonomic_composition.tsv.gz b/data/2024-05-01_maritz/taxonomic_composition.tsv.gz new file mode 100644 index 0000000..f73bb84 Binary files /dev/null and b/data/2024-05-01_maritz/taxonomic_composition.tsv.gz differ diff --git a/data/2024-05-01_maritz/viral-taxids.tsv.gz b/data/2024-05-01_maritz/viral-taxids.tsv.gz new file mode 120000 index 0000000..349083e --- /dev/null +++ b/data/2024-05-01_maritz/viral-taxids.tsv.gz @@ -0,0 +1 @@ +../2024-03-19_brumfield/viral-taxids.tsv.gz \ No newline at end of file diff --git a/docs/index.html b/docs/index.html index e315eeb..bae01c6 100644 --- a/docs/index.html +++ b/docs/index.html @@ -165,7 +165,7 @@
-
+
-
+
+
+

+

+

+
+ + +
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
@@ -715,7 +737,7 @@

-
+
-
+

diff --git a/docs/listings.json b/docs/listings.json index b0f90b2..df42b8d 100644 --- a/docs/listings.json +++ b/docs/listings.json @@ -4,6 +4,7 @@ "items": [ "/notebooks/2024-05-01_ng.html", "/notebooks/2024-05-01_bengtsson-palme.html", + "/notebooks/2024-05-01_maritz.html", "/notebooks/2024-04-30_brinch.html", "/notebooks/2024-04-19_leung.html", "/notebooks/2024-04-12_rosario.html", diff --git a/docs/notebooks/2024-05-01_maritz.html b/docs/notebooks/2024-05-01_maritz.html new file mode 100644 index 0000000..2759c97 --- /dev/null +++ b/docs/notebooks/2024-05-01_maritz.html @@ -0,0 +1,3057 @@ + + + + + + + + +Will’s Public NAO Notebook - Workflow analysis of Maritz et al. (2019) + + + + + + + + + + + + + + + + + + + + + + + +
+
+
+

Workflow analysis of Maritz et al. (2019)

+

Wastewater from NYC.

+
+
+ + +
+ +
+
Author
+
+

Will Bradshaw

+
+
+ +
+
Published
+
+

May 1, 2024

+
+
+ + +
+ + +
+ + + + +

Continuing my analysis of datasets from the P2RA preprint, I analyzed the data from Maritz et al. (2019), a study that used DNA sequencing of wastewater samples to characterize protist diversity and its temporal variation in New York City. Samples for this study underwent direct DNA extraction without a dedicated concentration step, followed by library prep and Illumina sequencing on a HiSeq Rapid Run (2x250bp).

+

The raw data

+

16 samples were collected from 14 treatment plants in NYC in November 2014. These samples yielded 8.6M-18.3M (mean 10.8M) read pairs per sample, for a total of 172M read pairs (84 gigabases of sequence). Read qualities were mostly high; adapter levels were moderate; inferred duplication levels were low.
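As a rough sanity check on those totals (a back-of-envelope calculation, not a pipeline output), 172M read pairs at the nominal 2x250bp read length corresponds to roughly 86 gigabases before trimming, in line with the ~84 Gb of approximate bases reported above:

Code
# Back-of-envelope check, assuming the nominal 2x250bp read length
n_read_pairs_total <- 172e6
read_length <- 250
n_read_pairs_total * 2 * read_length  # ~8.6e10 bases, i.e. ~86 Gb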

+
+
Code
# Importing the data is a bit more complicated this time as the samples are split across three pipeline runs
+data_dir <- "../data/2024-05-01_maritz"
+
+# Data input paths
+libraries_path <- file.path(data_dir, "sample-metadata.csv")
+basic_stats_path <- file.path(data_dir, "qc_basic_stats.tsv.gz")
+adapter_stats_path <- file.path(data_dir, "qc_adapter_stats.tsv.gz")
+quality_base_stats_path <- file.path(data_dir, "qc_quality_base_stats.tsv.gz")
+quality_seq_stats_path <- file.path(data_dir, "qc_quality_sequence_stats.tsv.gz")
+
+# Import library/sample metadata
+libraries_raw <- lapply(libraries_path, read_csv, show_col_types = FALSE) %>%
+  bind_rows
+libraries <- libraries_raw %>%
+  mutate(sample = fct_inorder(sample))
+
+
+
+
Code
# Import QC data
+stages <- c("raw_concat", "cleaned", "dedup", "ribo_initial", "ribo_secondary")
+import_basic <- function(paths){
+  lapply(paths, read_tsv, show_col_types = FALSE) %>% bind_rows %>%
+    inner_join(libraries, by="sample") %>%
+    arrange(sample) %>%
+    mutate(stage = factor(stage, levels = stages),
+           sample = fct_inorder(sample))
+}
+import_basic_paired <- function(paths){
+  import_basic(paths) %>% arrange(read_pair) %>% 
+    mutate(read_pair = fct_inorder(as.character(read_pair)))
+}
+basic_stats <- import_basic(basic_stats_path)
+adapter_stats <- import_basic_paired(adapter_stats_path)
+quality_base_stats <- import_basic_paired(quality_base_stats_path)
+quality_seq_stats <- import_basic_paired(quality_seq_stats_path)
+
+# Filter to raw data
+basic_stats_raw <- basic_stats %>% filter(stage == "raw_concat")
+adapter_stats_raw <- adapter_stats %>% filter(stage == "raw_concat")
+quality_base_stats_raw <- quality_base_stats %>% filter(stage == "raw_concat")
+quality_seq_stats_raw <- quality_seq_stats %>% filter(stage == "raw_concat")
+
+# Get key values for readout
+raw_read_counts <- basic_stats_raw %>% ungroup %>% 
+  summarize(rmin = min(n_read_pairs), rmax=max(n_read_pairs),
+            rmean=mean(n_read_pairs), 
+            rtot = sum(n_read_pairs),
+            btot = sum(n_bases_approx),
+            dmin = min(percent_duplicates), dmax=max(percent_duplicates),
+            dmean=mean(percent_duplicates), .groups = "drop")
+
+
+
+
Code
# Prepare data
+basic_stats_raw_metrics <- basic_stats_raw %>%
+  select(sample,
+         `# Read pairs` = n_read_pairs,
+         `Total base pairs\n(approx)` = n_bases_approx,
+         `% Duplicates\n(FASTQC)` = percent_duplicates) %>%
+  pivot_longer(-(sample), names_to = "metric", values_to = "value") %>%
+  mutate(metric = fct_inorder(metric))
+
+# Set up plot templates
+g_basic <- ggplot(basic_stats_raw_metrics, aes(x=sample, y=value)) +
+  geom_col(position = "dodge") +
+  scale_y_continuous(expand=c(0,0)) +
+  expand_limits(y=c(0,100)) +
+  facet_grid(metric~., scales = "free", space="free_x", switch="y") +
+  theme_kit + theme(
+    axis.title.y = element_blank(),
+    strip.text.y = element_text(face="plain")
+  )
+g_basic
+
+
+

+
+
+
+
+
+
Code
# Set up plotting templates
+g_qual_raw <- ggplot(mapping=aes(linetype=read_pair, 
+                         group=interaction(sample,read_pair))) + 
+  scale_linetype_discrete(name = "Read Pair") +
+  guides(color=guide_legend(nrow=2,byrow=TRUE),
+         linetype = guide_legend(nrow=2,byrow=TRUE)) +
+  theme_base
+
+# Visualize adapters
+g_adapters_raw <- g_qual_raw + 
+  geom_line(aes(x=position, y=pc_adapters), data=adapter_stats_raw) +
+  scale_y_continuous(name="% Adapters", limits=c(0,NA),
+                     breaks = seq(0,100,1), expand=c(0,0)) +
+  scale_x_continuous(name="Position", limits=c(0,NA),
+                     breaks=seq(0,500,20), expand=c(0,0)) +
+  facet_grid(.~adapter)
+g_adapters_raw
+
+
+

+
+
+
+
Code
# Visualize quality
+g_quality_base_raw <- g_qual_raw +
+  geom_hline(yintercept=25, linetype="dashed", color="red") +
+  geom_hline(yintercept=30, linetype="dashed", color="red") +
+  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats_raw) +
+  scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
+  scale_x_continuous(name="Position", limits=c(0,NA),
+                     breaks=seq(0,500,20), expand=c(0,0))
+g_quality_base_raw
+
+
+

+
+
+
+
Code
g_quality_seq_raw <- g_qual_raw +
+  geom_vline(xintercept=25, linetype="dashed", color="red") +
+  geom_vline(xintercept=30, linetype="dashed", color="red") +
+  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats_raw) +
+  scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
+  scale_y_continuous(name="# Sequences", expand=c(0,0))
+g_quality_seq_raw
+
+
+

+
+
+
+
+

Preprocessing

+

About 6% of reads on average were lost during cleaning, and a further 2% during deduplication. Very few reads were lost during ribodepletion, as expected for DNA sequencing libraries.
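The table below reports both cumulative and marginal losses; as a toy illustration with made-up numbers, a sample going from 10M raw read pairs to 9.4M after cleaning and 9.2M after deduplication would show cumulative losses of 6% and 8% of raw reads, and marginal losses of 6% and 2%:

Code
# Toy example of cumulative vs marginal read loss (hypothetical numbers,
# not from this dataset), mirroring how the table below is computed
n <- c(raw_concat = 10e6, cleaned = 9.4e6, dedup = 9.2e6)
p_lost_cumulative <- 1 - n/n[1]                     # 0.00, 0.06, 0.08
p_lost_marginal   <- c(0, diff(p_lost_cumulative))  # 0.00, 0.06, 0.02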

+
+
Code
n_reads_rel <- basic_stats %>% 
+  select(sample, stage, 
+         percent_duplicates, n_read_pairs) %>%
+  group_by(sample) %>% arrange(sample, stage) %>%
+  mutate(p_reads_retained = replace_na(n_read_pairs / lag(n_read_pairs), 0),
+         p_reads_lost = 1 - p_reads_retained,
+         p_reads_retained_abs = n_read_pairs / n_read_pairs[1],
+         p_reads_lost_abs = 1-p_reads_retained_abs,
+         p_reads_lost_abs_marginal = replace_na(p_reads_lost_abs - lag(p_reads_lost_abs), 0))
+n_reads_rel_display <- n_reads_rel %>% 
+  group_by(Stage=stage) %>% 
+  summarize(`% Total Reads Lost (Cumulative)` = paste0(round(min(p_reads_lost_abs*100),1), "-", round(max(p_reads_lost_abs*100),1), " (mean ", round(mean(p_reads_lost_abs*100),1), ")"),
+            `% Total Reads Lost (Marginal)` = paste0(round(min(p_reads_lost_abs_marginal*100),1), "-", round(max(p_reads_lost_abs_marginal*100),1), " (mean ", round(mean(p_reads_lost_abs_marginal*100),1), ")"), .groups="drop") %>% 
+  filter(Stage != "raw_concat") %>%
+  mutate(Stage = Stage %>% as.numeric %>% factor(labels=c("Trimming & filtering", "Deduplication", "Initial ribodepletion", "Secondary ribodepletion")))
+n_reads_rel_display
+
+
+ +
+
+
+
+
Code
g_stage_base <- ggplot(mapping=aes(x=stage, group=sample)) +
+  theme_kit
+
+# Plot reads over preprocessing
+g_reads_stages <- g_stage_base +
+  geom_line(aes(y=n_read_pairs), data=basic_stats) +
+  scale_y_continuous("# Read pairs", expand=c(0,0), limits=c(0,NA))
+g_reads_stages
+
+
+

+
+
+
+
Code
# Plot relative read losses during preprocessing
+g_reads_rel <- g_stage_base +
+  geom_line(aes(y=p_reads_lost_abs_marginal), data=n_reads_rel) +
+  scale_y_continuous("% Total Reads Lost", expand=c(0,0), 
+                     labels = function(x) x*100)
+g_reads_rel
+
+
+

+
+
+
+
+

Data cleaning was very successful at removing adapters and improving read qualities:

+
+
Code
g_qual <- ggplot(mapping=aes(linetype=read_pair, 
+                         group=interaction(sample,read_pair))) + 
+  scale_linetype_discrete(name = "Read Pair") +
+  guides(color=guide_legend(nrow=2,byrow=TRUE),
+         linetype = guide_legend(nrow=2,byrow=TRUE)) +
+  theme_base
+
+# Visualize adapters
+g_adapters <- g_qual + 
+  geom_line(aes(x=position, y=pc_adapters), data=adapter_stats) +
+  scale_y_continuous(name="% Adapters", limits=c(0,20),
+                     breaks = seq(0,50,10), expand=c(0,0)) +
+  scale_x_continuous(name="Position", limits=c(0,NA),
+                     breaks=seq(0,140,20), expand=c(0,0)) +
+  facet_grid(stage~adapter)
+g_adapters
+
+
+

+
+
+
+
Code
# Visualize quality
+g_quality_base <- g_qual +
+  geom_hline(yintercept=25, linetype="dashed", color="red") +
+  geom_hline(yintercept=30, linetype="dashed", color="red") +
+  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats) +
+  scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
+  scale_x_continuous(name="Position", limits=c(0,NA),
+                     breaks=seq(0,140,20), expand=c(0,0)) +
+  facet_grid(stage~.)
+g_quality_base
+
+
+

+
+
+
+
Code
g_quality_seq <- g_qual +
+  geom_vline(xintercept=25, linetype="dashed", color="red") +
+  geom_vline(xintercept=30, linetype="dashed", color="red") +
+  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats) +
+  scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
+  scale_y_continuous(name="# Sequences", expand=c(0,0)) +
+  facet_grid(stage~.)
+g_quality_seq
+
+
+

+
+
+
+
+

According to FASTQC, cleaning + deduplication was very effective at reducing measured duplicate levels in the few samples that required it:

+
+
Code
stage_dup <- basic_stats %>% group_by(stage) %>% 
+  summarize(dmin = min(percent_duplicates), dmax=max(percent_duplicates),
+            dmean=mean(percent_duplicates), .groups = "drop")
+
+g_dup_stages <- g_stage_base +
+  geom_line(aes(y=percent_duplicates), data=basic_stats) +
+  scale_y_continuous("% Duplicates", limits=c(0,NA), expand=c(0,0))
+g_dup_stages
+
+
+

+
+
+
+
Code
g_readlen_stages <- g_stage_base + 
+  geom_line(aes(y=mean_seq_len), data=basic_stats) +
+  scale_y_continuous("Mean read length (nt)", expand=c(0,0), limits=c(0,NA))
+g_readlen_stages
+
+
+

+
+
+
+
+

High-level composition

+

As before, to assess the high-level composition of the reads, I ran the ribodepleted files through Kraken (using the Standard 16 database) and summarized the results with Bracken. Combining these results with the read counts above gives us a breakdown of the inferred composition of the samples:
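The taxonomic_composition table imported below is generated upstream by the pipeline. Assuming the same conventions as my earlier notebooks, the Filtered, Duplicate and Ribosomal categories correspond to reads removed between successive preprocessing stages, while Bacterial, Archaeal, Viral, Human and Unassigned reflect the Kraken/Bracken results on the ribodepleted reads. A minimal reconstruction of the first part from the QC stage counts might look like the sketch below, though the actual pipeline code may differ:

Code
# Rough reconstruction of the preprocessing-derived categories from stage
# counts (an illustrative sketch only; the pipeline computes these upstream)
comp_sketch <- basic_stats %>%
  select(sample, stage, n_read_pairs) %>%
  pivot_wider(names_from = stage, values_from = n_read_pairs) %>%
  transmute(sample,
            Filtered  = raw_concat - cleaned,    # lost to trimming/filtering
            Duplicate = cleaned - dedup,         # lost to deduplication
            Ribosomal = dedup - ribo_secondary)  # lost to ribodepletion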

+
+
Code
classifications <- c("Filtered", "Duplicate", "Ribosomal", "Unassigned",
+                     "Bacterial", "Archaeal", "Viral", "Human")
+
+# Import composition data
+comp_path <- file.path(data_dir, "taxonomic_composition.tsv.gz")
+comp <- read_tsv(comp_path, show_col_types = FALSE) %>%
+  left_join(libraries, by="sample") %>%
+  mutate(classification = factor(classification, levels = classifications))
+  
+
+# Summarize composition
+read_comp_summ <- comp %>% 
+  group_by(classification) %>%
+  summarize(n_reads = sum(n_reads), .groups = "drop_last") %>%
+  mutate(n_reads = replace_na(n_reads,0),
+    p_reads = n_reads/sum(n_reads),
+    pc_reads = p_reads*100)
+
+
+
+
Code
# Prepare plotting templates
+g_comp_base <- ggplot(mapping=aes(x=sample, y=p_reads, fill=classification)) +
+  theme_kit
+scale_y_pc_reads <- purrr::partial(scale_y_continuous, name = "% Reads",
+                                   expand = c(0,0), labels = function(y) y*100)
+
+# Plot overall composition
+g_comp <- g_comp_base + geom_col(data = comp, position = "stack", width=1) +
+  scale_y_pc_reads(limits = c(0,1.01), breaks = seq(0,1,0.2)) +
+  scale_fill_brewer(palette = "Set1", name = "Classification")
+g_comp
+
+
+

+
+
+
+
Code
# Plot composition of minor components
+comp_minor <- comp %>% 
+  filter(classification %in% c("Archaeal", "Viral", "Human", "Other"))
+palette_minor <- brewer.pal(9, "Set1")[6:9]
+g_comp_minor <- g_comp_base + 
+  geom_col(data=comp_minor, position = "stack", width=1) +
+  scale_y_pc_reads() +
+  scale_fill_manual(values=palette_minor, name = "Classification")
+g_comp_minor
+
+
+

+
+
+
+
+
+
Code
p_reads_summ_group <- comp %>%
+  mutate(classification = ifelse(classification %in% c("Filtered", "Duplicate", "Unassigned"), "Excluded", as.character(classification)),
+         classification = fct_inorder(classification)) %>%
+  group_by(classification, sample) %>%
+  summarize(p_reads = sum(p_reads), .groups = "drop") %>%
+  group_by(classification) %>%
+  summarize(pc_min = min(p_reads)*100, pc_max = max(p_reads)*100, 
+            pc_mean = mean(p_reads)*100, .groups = "drop")
+p_reads_summ_prep <- p_reads_summ_group %>%
+  mutate(classification = fct_inorder(classification),
+         pc_min = pc_min %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),
+         pc_max = pc_max %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),
+         pc_mean = pc_mean %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),
+         display = paste0(pc_min, "-", pc_max, "% (mean ", pc_mean, "%)"))
+p_reads_summ <- p_reads_summ_prep %>%
+  select(Classification=classification, 
+         `Read Fraction`=display) %>%
+  arrange(Classification)
+p_reads_summ
+
+
+ +
+
+
+

As in previous DNA datasets, the vast majority of classified reads were bacterial in origin. The viral fraction averaged 0.13%, though one sample (NYC-08) reached almost 1%. As is common for DNA data, viral reads were overwhelmingly dominated by Caudoviricetes phages:
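The viral-fraction figures quoted above come from the composition table; a minimal way to extract them, using the same comp columns as the plots above, is:

Code
# Per-sample viral fraction of total reads, as quoted in the text
viral_fraction <- comp %>%
  filter(classification == "Viral") %>%
  group_by(sample) %>%
  summarize(p_viral = sum(p_reads), .groups = "drop")
viral_fraction %>%
  summarize(mean_pc = mean(p_viral) * 100, max_pc = max(p_viral) * 100)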

+
+
Code
# Get Kraken reports
+reports_path <- file.path(data_dir, "kraken_reports.tsv.gz")
+reports <- read_tsv(reports_path, show_col_types = FALSE)
+
+# Get viral taxonomy
+viral_taxa_path <- file.path(data_dir, "viral-taxids.tsv.gz")
+viral_taxa <- read_tsv(viral_taxa_path, show_col_types = FALSE)
+
+# Filter to viral taxa
+kraken_reports_viral <- filter(reports, taxid %in% viral_taxa$taxid) %>%
+  group_by(sample) %>%
+  mutate(p_reads_viral = n_reads_clade/n_reads_clade[1])
+kraken_reports_viral_cleaned <- kraken_reports_viral %>%
+  inner_join(libraries, by="sample") %>%
+  select(-pc_reads_total, -n_reads_direct, -contains("minimizers")) %>%
+  select(name, taxid, p_reads_viral, n_reads_clade, everything())
+
+viral_classes <- kraken_reports_viral_cleaned %>% filter(rank == "C")
+viral_families <- kraken_reports_viral_cleaned %>% filter(rank == "F")
+
+
+
+
Code
major_threshold <- 0.02
+
+# Identify major viral classes
+viral_classes_major_tab <- viral_classes %>% 
+  group_by(name, taxid) %>%
+  summarize(p_reads_viral_max = max(p_reads_viral), .groups="drop") %>%
+  filter(p_reads_viral_max >= major_threshold)
+viral_classes_major_list <- viral_classes_major_tab %>% pull(name)
+viral_classes_major <- viral_classes %>% 
+  filter(name %in% viral_classes_major_list) %>%
+  select(name, taxid, sample, p_reads_viral)
+viral_classes_minor <- viral_classes_major %>% 
+  group_by(sample) %>%
+  summarize(p_reads_viral_major = sum(p_reads_viral), .groups = "drop") %>%
+  mutate(name = "Other", taxid=NA, p_reads_viral = 1-p_reads_viral_major) %>%
+  select(name, taxid, sample, p_reads_viral)
+viral_classes_display <- bind_rows(viral_classes_major, viral_classes_minor) %>%
+  arrange(desc(p_reads_viral)) %>% 
+  mutate(name = factor(name, levels=c(viral_classes_major_list, "Other")),
+         p_reads_viral = pmax(p_reads_viral, 0)) %>%
+  rename(p_reads = p_reads_viral, classification=name)
+
+palette_viral <- c(brewer.pal(12, "Set3"), brewer.pal(8, "Dark2"))
+g_classes <- g_comp_base + 
+  geom_col(data=viral_classes_display, position = "stack", width=1) +
+  scale_y_continuous(name="% Viral Reads", limits=c(0,1.01), breaks = seq(0,1,0.2),
+                     expand=c(0,0), labels = function(y) y*100) +
+  scale_fill_manual(values=palette_viral, name = "Viral class")
+  
+g_classes
+
+
+

+
+
+
+
+

Human-infecting virus reads: validation

+

Next, I investigated the human-infecting virus read content of these unenriched samples. A grand total of 199 read pairs were identified as putatively human-viral:

+
+
Code
# Import HV read data
+hv_reads_filtered_path <- file.path(data_dir, "hv_hits_putative_filtered.tsv.gz")
+hv_reads_filtered <- lapply(hv_reads_filtered_path, read_tsv,
+                            show_col_types = FALSE) %>%
+  bind_rows() %>%
+  left_join(libraries, by="sample")
+
+# Count reads
+n_hv_filtered <- hv_reads_filtered %>%
+  group_by(sample, seq_id) %>% count %>%
+  group_by(sample) %>% count %>% 
+  inner_join(basic_stats %>% filter(stage == "ribo_initial") %>% 
+               select(sample, n_read_pairs), by="sample") %>% 
+  rename(n_putative = n, n_total = n_read_pairs) %>% 
+  mutate(p_reads = n_putative/n_total, pc_reads = p_reads * 100)
+n_hv_filtered_summ <- n_hv_filtered %>% ungroup %>%
+  summarize(n_putative = sum(n_putative), n_total = sum(n_total), 
+            .groups="drop") %>% 
+  mutate(p_reads = n_putative/n_total, pc_reads = p_reads*100)
+
+
+
+
Code
# Collapse multi-entry sequences
+rmax <- purrr::partial(max, na.rm = TRUE)
+collapse <- function(x) ifelse(all(x == x[1]), x[1], paste(x, collapse="/"))
+mrg <- hv_reads_filtered %>% 
+  mutate(adj_score_max = pmax(adj_score_fwd, adj_score_rev, na.rm = TRUE)) %>%
+  arrange(desc(adj_score_max)) %>%
+  group_by(seq_id) %>%
+  summarize(sample = collapse(sample),
+            genome_id = collapse(genome_id),
+            taxid_best = taxid[1],
+            taxid = collapse(as.character(taxid)),
+            best_alignment_score_fwd = rmax(best_alignment_score_fwd),
+            best_alignment_score_rev = rmax(best_alignment_score_rev),
+            query_len_fwd = rmax(query_len_fwd),
+            query_len_rev = rmax(query_len_rev),
+            query_seq_fwd = query_seq_fwd[!is.na(query_seq_fwd)][1],
+            query_seq_rev = query_seq_rev[!is.na(query_seq_rev)][1],
+            classified = rmax(classified),
+            assigned_name = collapse(assigned_name),
+            assigned_taxid_best = assigned_taxid[1],
+            assigned_taxid = collapse(as.character(assigned_taxid)),
+            assigned_hv = rmax(assigned_hv),
+            hit_hv = rmax(hit_hv),
+            encoded_hits = collapse(encoded_hits),
+            adj_score_fwd = rmax(adj_score_fwd),
+            adj_score_rev = rmax(adj_score_rev)
+            ) %>%
+  inner_join(libraries, by="sample") %>%
+  mutate(kraken_label = ifelse(assigned_hv, "Kraken2 HV\nassignment",
+                               ifelse(hit_hv, "Kraken2 HV\nhit",
+                                      "No hit or\nassignment"))) %>%
+  mutate(adj_score_max = pmax(adj_score_fwd, adj_score_rev),
+         highscore = adj_score_max >= 20)
+
+# Plot results
+geom_vhist <- purrr::partial(geom_histogram, binwidth=5, boundary=0)
+g_vhist_base <- ggplot(mapping=aes(x=adj_score_max)) +
+  geom_vline(xintercept=20, linetype="dashed", color="red") +
+  facet_wrap(~kraken_label, labeller = labeller(kit = label_wrap_gen(20)), scales = "free_y") +
+  scale_x_continuous(name = "Maximum adjusted alignment score") + 
+  scale_y_continuous(name="# Read pairs") + 
+  theme_base 
+g_vhist_0 <- g_vhist_base + geom_vhist(data=mrg)
+g_vhist_0
+
+
+

+
+
+
+
+

BLASTing these reads against nt, we find that the pipeline performs well, with only a single high-scoring false-positive read:

+
+
Code
# Import paired BLAST results
+blast_paired_path <- file.path(data_dir, "hv_hits_blast_paired.tsv.gz")
+blast_paired <- read_tsv(blast_paired_path, show_col_types = FALSE)
+
+# Add viral status
+blast_viral <- mutate(blast_paired, viral = staxid %in% viral_taxa$taxid) %>%
+  mutate(viral_full = viral & n_reads == 2)
+
+# Compare to Kraken & Bowtie assignments
+match_taxid <- function(taxid_1, taxid_2){
+  p1 <- mapply(grepl, paste0("/", taxid_1, "$"), taxid_2)
+  p2 <- mapply(grepl, paste0("^", taxid_1, "/"), taxid_2)
+  p3 <- mapply(grepl, paste0("^", taxid_1, "$"), taxid_2)
+  out <- setNames(p1|p2|p3, NULL)
+  return(out)
+}
+mrg_assign <- mrg %>% select(sample, seq_id, taxid, assigned_taxid, adj_score_max)
+blast_assign <- inner_join(blast_viral, mrg_assign, by="seq_id") %>%
+    mutate(taxid_match_bowtie = match_taxid(staxid, taxid),
+           taxid_match_kraken = match_taxid(staxid, assigned_taxid),
+           taxid_match_any = taxid_match_bowtie | taxid_match_kraken)
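+# Note: viral_status summarizes BLAST support per read pair:
+#   2 = a viral taxon was hit by both reads of the pair, or the BLAST taxid
+#       matches the Bowtie2/Kraken2 assignment; 1 = some viral hit; 0 = none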
+blast_out <- blast_assign %>%
+  group_by(seq_id) %>%
+  summarize(viral_status = ifelse(any(viral_full), 2,
+                                  ifelse(any(taxid_match_any), 2,
+                                             ifelse(any(viral), 1, 0))),
+            .groups = "drop")
+
+
+
+
Code
# Merge BLAST results with unenriched read data
+mrg_blast <- full_join(mrg, blast_out, by="seq_id") %>%
+  mutate(viral_status = replace_na(viral_status, 0),
+         viral_status_out = ifelse(viral_status == 0, FALSE, TRUE))
+
+# Plot
+g_vhist_1 <- g_vhist_base + geom_vhist(data=mrg_blast, mapping=aes(fill=viral_status_out)) +
+  scale_fill_brewer(palette = "Set1", name = "Viral status")
+g_vhist_1
+
+
+

+
+
+
+
+

My usual disjunctive score threshold of 20 gave precision, sensitivity, and F1 scores all >96%:
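For reference, these metrics are computed in the usual way, treating the BLAST-validated calls as ground truth:

\[ \text{precision} = \frac{TP}{TP + FP}, \qquad \text{sensitivity} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{precision} \cdot \text{sensitivity}}{\text{precision} + \text{sensitivity}} \]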

+
+
Code
test_sens_spec <- function(tab, score_threshold){
+  tab_retained <- tab %>% 
+    mutate(retain_score = (adj_score_fwd > score_threshold | adj_score_rev > score_threshold),
+           retain = assigned_hv | retain_score) %>%
+    group_by(viral_status_out, retain) %>% count
+  pos_tru <- tab_retained %>% filter(viral_status_out == "TRUE", retain) %>% pull(n) %>% sum
+  pos_fls <- tab_retained %>% filter(viral_status_out != "TRUE", retain) %>% pull(n) %>% sum
+  neg_tru <- tab_retained %>% filter(viral_status_out != "TRUE", !retain) %>% pull(n) %>% sum
+  neg_fls <- tab_retained %>% filter(viral_status_out == "TRUE", !retain) %>% pull(n) %>% sum
+  sensitivity <- pos_tru / (pos_tru + neg_fls)
+  specificity <- neg_tru / (neg_tru + pos_fls)
+  precision   <- pos_tru / (pos_tru + pos_fls)
+  f1 <- 2 * precision * sensitivity / (precision + sensitivity)
+  out <- tibble(threshold=score_threshold, sensitivity=sensitivity, 
+                specificity=specificity, precision=precision, f1=f1)
+  return(out)
+}
+range_f1 <- function(intab, inrange=15:45){
+  tss <- purrr::partial(test_sens_spec, tab=intab)
+  stats <- lapply(inrange, tss) %>% bind_rows %>%
+    pivot_longer(!threshold, names_to="metric", values_to="value")
+  return(stats)
+}
+stats_0 <- range_f1(mrg_blast)
+g_stats_0 <- ggplot(stats_0, aes(x=threshold, y=value, color=metric)) +
+  geom_vline(xintercept=20, color = "red", linetype = "dashed") +
+  geom_line() +
+  scale_y_continuous(name = "Value", limits=c(0,1), breaks = seq(0,1,0.2), expand = c(0,0)) +
+  scale_x_continuous(name = "Adjusted Score Threshold", expand = c(0,0)) +
+  scale_color_brewer(palette="Dark2") +
+  theme_base
+g_stats_0
+
+
+

+
+
+
+
Code
stats_0 %>% filter(threshold == 20) %>% 
+  select(Threshold=threshold, Metric=metric, Value=value)
+
+
+ +
+
+
+

Human-infecting viruses: overall relative abundance

+
+
Code
# Get raw read counts
+read_counts_raw <- basic_stats_raw %>%
+  select(sample, n_reads_raw = n_read_pairs)
+
+# Get HV read counts
+mrg_hv <- mrg %>% mutate(hv_status = assigned_hv | highscore) %>%
+  rename(taxid_all = taxid, taxid = taxid_best)
+read_counts_hv <- mrg_hv %>% filter(hv_status) %>% group_by(sample) %>% 
+  count(name="n_reads_hv")
+read_counts <- read_counts_raw %>% left_join(read_counts_hv, by="sample") %>%
+  mutate(n_reads_hv = replace_na(n_reads_hv, 0))
+
+# Aggregate
+read_counts_grp <- read_counts %>%
+  summarize(n_reads_raw = sum(n_reads_raw),
+            n_reads_hv = sum(n_reads_hv), .groups="drop") %>%
+  mutate(sample= "All samples")
+read_counts_agg <- bind_rows(read_counts, read_counts_grp) %>%
+  mutate(p_reads_hv = n_reads_hv/n_reads_raw,
+         sample = factor(sample, levels=c(levels(libraries$sample), "All samples")))
+
+
+

Applying a disjunctive cutoff at S=20 identifies 162 read pairs as human-viral. This gives an overall relative HV abundance of \(9.42 \times 10^{-7}\); higher than Ng and Bengtsson-Palme but lower than most other datasets I’ve analyzed with this pipeline:
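As a quick check of that headline figure, 162 human-viral read pairs out of roughly 172M raw read pairs works out to about \(9.4 \times 10^{-7}\):

Code
# Back-of-envelope check of the overall relative abundance
162 / 172e6  # ~9.4e-7, matching the value computed from read_counts_agg above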

+
+
Code
# Visualize
+g_phv_agg <- ggplot(read_counts_agg, aes(x=sample)) +
+  geom_point(aes(y=p_reads_hv)) +
+  scale_y_log10("Relative abundance of human virus reads") +
+  theme_kit
+g_phv_agg
+
+
+

+
+
+
+
+
+
Code
# Collate past RA values
+ra_past <- tribble(~dataset, ~ra, ~na_type, ~panel_enriched,
+                   "Brumfield", 5e-5, "RNA", FALSE,
+                   "Brumfield", 3.66e-7, "DNA", FALSE,
+                   "Spurbeck", 5.44e-6, "RNA", FALSE,
+                   "Yang", 3.62e-4, "RNA", FALSE,
+                   "Rothman (unenriched)", 1.87e-5, "RNA", FALSE,
+                   "Rothman (panel-enriched)", 3.3e-5, "RNA", TRUE,
+                   "Crits-Christoph (unenriched)", 1.37e-5, "RNA", FALSE,
+                   "Crits-Christoph (panel-enriched)", 1.26e-2, "RNA", TRUE,
+                   "Prussin (non-control)", 1.63e-5, "RNA", FALSE,
+                   "Prussin (non-control)", 4.16e-5, "DNA", FALSE,
+                   "Rosario (non-control)", 1.21e-5, "RNA", FALSE,
+                   "Rosario (non-control)", 1.50e-4, "DNA", FALSE,
+                   "Leung", 1.73e-5, "DNA", FALSE,
+                   "Brinch", 3.88e-6, "DNA", FALSE,
+                   "Bengtsson-Palme", 8.86e-8, "DNA", FALSE,
+                   "Ng", 2.90e-7, "DNA", FALSE
+)
+
+# Collate new RA values
+ra_new <- tribble(~dataset, ~ra, ~na_type, ~panel_enriched,
+                  "Maritz", 9.42e-7, "DNA", FALSE)
+
+
+# Plot
+scale_color_na <- purrr::partial(scale_color_brewer, palette="Set1",
+                                 name="Nucleic acid type")
+ra_comp <- bind_rows(ra_past, ra_new) %>% mutate(dataset = fct_inorder(dataset))
+g_ra_comp <- ggplot(ra_comp, aes(y=dataset, x=ra, color=na_type)) +
+  geom_point() +
+  scale_color_na() +
+  scale_x_log10(name="Relative abundance of human virus reads") +
+  theme_base + theme(axis.title.y = element_blank())
+g_ra_comp
+
+
+

+
+
+
+
+

Human-infecting viruses: taxonomy and composition

+

In investigating the taxonomy of human-infecting virus reads, I restricted my analysis to samples with more than 5 HV read pairs total across all viruses, to reduce noise arising from extremely low HV read counts in some samples. 10 samples met this criterion.

+

At the family level, most samples were dominated by Adenoviridae, Polyomaviridae and Papillomaviridae. However, one sample, NYC-03, was overwhelmingly dominated by Herpesviridae:

+
+
Code
# Get viral taxon names for putative HV reads
+viral_taxa$name[viral_taxa$taxid == 249588] <- "Mamastrovirus"
+viral_taxa$name[viral_taxa$taxid == 194960] <- "Kobuvirus"
+viral_taxa$name[viral_taxa$taxid == 688449] <- "Salivirus"
+viral_taxa$name[viral_taxa$taxid == 585893] <- "Picobirnaviridae"
+viral_taxa$name[viral_taxa$taxid == 333922] <- "Betapapillomavirus"
+viral_taxa$name[viral_taxa$taxid == 334207] <- "Betapapillomavirus 3"
+viral_taxa$name[viral_taxa$taxid == 369960] <- "Porcine type-C oncovirus"
+viral_taxa$name[viral_taxa$taxid == 333924] <- "Betapapillomavirus 2"
+viral_taxa$name[viral_taxa$taxid == 687329] <- "Anelloviridae"
+viral_taxa$name[viral_taxa$taxid == 325455] <- "Gammapapillomavirus"
+viral_taxa$name[viral_taxa$taxid == 333750] <- "Alphapapillomavirus"
+viral_taxa$name[viral_taxa$taxid == 694002] <- "Betacoronavirus"
+viral_taxa$name[viral_taxa$taxid == 334202] <- "Mupapillomavirus"
+viral_taxa$name[viral_taxa$taxid == 197911] <- "Alphainfluenzavirus"
+viral_taxa$name[viral_taxa$taxid == 186938] <- "Respirovirus"
+viral_taxa$name[viral_taxa$taxid == 333926] <- "Gammapapillomavirus 1"
+viral_taxa$name[viral_taxa$taxid == 337051] <- "Betapapillomavirus 1"
+viral_taxa$name[viral_taxa$taxid == 337043] <- "Alphapapillomavirus 4"
+viral_taxa$name[viral_taxa$taxid == 694003] <- "Betacoronavirus 1"
+viral_taxa$name[viral_taxa$taxid == 334204] <- "Mupapillomavirus 2"
+viral_taxa$name[viral_taxa$taxid == 334208] <- "Betapapillomavirus 4"
+viral_taxa$name[viral_taxa$taxid == 333928] <- "Gammapapillomavirus 2"
+viral_taxa$name[viral_taxa$taxid == 337039] <- "Alphapapillomavirus 2"
+viral_taxa$name[viral_taxa$taxid == 333929] <- "Gammapapillomavirus 3"
+viral_taxa$name[viral_taxa$taxid == 337042] <- "Alphapapillomavirus 7"
+viral_taxa$name[viral_taxa$taxid == 334203] <- "Mupapillomavirus 1"
+viral_taxa$name[viral_taxa$taxid == 333757] <- "Alphapapillomavirus 8"
+viral_taxa$name[viral_taxa$taxid == 337050] <- "Alphapapillomavirus 6"
+viral_taxa$name[viral_taxa$taxid == 333767] <- "Alphapapillomavirus 3"
+viral_taxa$name[viral_taxa$taxid == 333754] <- "Alphapapillomavirus 10"
+viral_taxa$name[viral_taxa$taxid == 687363] <- "Torque teno virus 24"
+viral_taxa$name[viral_taxa$taxid == 687342] <- "Torque teno virus 3"
+viral_taxa$name[viral_taxa$taxid == 687359] <- "Torque teno virus 20"
+viral_taxa$name[viral_taxa$taxid == 194441] <- "Primate T-lymphotropic virus 2"
+viral_taxa$name[viral_taxa$taxid == 334209] <- "Betapapillomavirus 5"
+viral_taxa$name[viral_taxa$taxid == 194965] <- "Aichivirus B"
+viral_taxa$name[viral_taxa$taxid == 333930] <- "Gammapapillomavirus 4"
+viral_taxa$name[viral_taxa$taxid == 337048] <- "Alphapapillomavirus 1"
+viral_taxa$name[viral_taxa$taxid == 337041] <- "Alphapapillomavirus 9"
+viral_taxa$name[viral_taxa$taxid == 337049] <- "Alphapapillomavirus 11"
+viral_taxa$name[viral_taxa$taxid == 337044] <- "Alphapapillomavirus 5"
+
+# Filter samples and add viral taxa information
+samples_keep <- read_counts %>% filter(n_reads_hv > 5) %>% pull(sample)
+mrg_hv_named <- mrg_hv %>% filter(sample %in% samples_keep, hv_status) %>% left_join(viral_taxa, by="taxid") 
+
+# Discover viral species & genera for HV reads
+raise_rank <- function(read_db, taxid_db, out_rank = "species", verbose = FALSE){
+  # Get higher ranks than search rank
+  ranks <- c("subspecies", "species", "subgenus", "genus", "subfamily", "family", "suborder", "order", "class", "subphylum", "phylum", "kingdom", "superkingdom")
+  rank_match <- which.max(ranks == out_rank)
+  high_ranks <- ranks[rank_match:length(ranks)]
+  # Merge read DB and taxid DB
+  reads <- read_db %>% select(-parent_taxid, -rank, -name) %>%
+    left_join(taxid_db, by="taxid")
+  # Extract sequences that are already at appropriate rank
+  reads_rank <- filter(reads, rank == out_rank)
+  # Drop sequences at a higher rank and return unclassified sequences
+  reads_norank <- reads %>% filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))
+  while(nrow(reads_norank) > 0){ # As long as there are unclassified sequences...
+    # Promote read taxids and re-merge with taxid DB, then re-classify and filter
+    reads_remaining <- reads_norank %>% mutate(taxid = parent_taxid) %>%
+      select(-parent_taxid, -rank, -name) %>%
+      left_join(taxid_db, by="taxid")
+    reads_rank <- reads_remaining %>% filter(rank == out_rank) %>%
+      bind_rows(reads_rank)
+    reads_norank <- reads_remaining %>%
+      filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))
+  }
+  # Finally, extract and append reads that were excluded during the process
+  reads_dropped <- reads %>% filter(!seq_id %in% reads_rank$seq_id)
+  reads_out <- reads_rank %>% bind_rows(reads_dropped) %>%
+    select(-parent_taxid, -rank, -name) %>%
+    left_join(taxid_db, by="taxid")
+  return(reads_out)
+}
+hv_reads_species <- raise_rank(mrg_hv_named, viral_taxa, "species")
+hv_reads_genus <- raise_rank(mrg_hv_named, viral_taxa, "genus")
+hv_reads_family <- raise_rank(mrg_hv_named, viral_taxa, "family")
+
+
+
+
Code
threshold_major_family <- 0.02
+
+# Count reads for each human-viral family
+hv_family_counts <- hv_reads_family %>% 
+  group_by(sample, name, taxid) %>%
+  count(name = "n_reads_hv") %>%
+  group_by(sample) %>%
+  mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))
+
+# Identify high-ranking families and group others
+hv_family_major_tab <- hv_family_counts %>% group_by(name) %>% 
+  filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%
+  arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > threshold_major_family)
+hv_family_counts_major <- hv_family_counts %>%
+  mutate(name_display = ifelse(name %in% hv_family_major_tab$name, name, "Other")) %>%
+  group_by(sample, name_display) %>%
+  summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), 
+            .groups="drop") %>%
+  mutate(name_display = factor(name_display, 
+                               levels = c(hv_family_major_tab$name, "Other")))
+hv_family_counts_display <- hv_family_counts_major %>%
+  rename(p_reads = p_reads_hv, classification = name_display)
+
+# Plot
+g_hv_family <- g_comp_base + 
+  geom_col(data=hv_family_counts_display, position = "stack", width=1) +
+  scale_y_continuous(name="% HV Reads", limits=c(0,1.01), 
+                     breaks = seq(0,1,0.2),
+                     expand=c(0,0), labels = function(y) y*100) +
+  scale_fill_manual(values=palette_viral, name = "Viral family") +
+  labs(title="Family composition of human-viral reads") +
+  guides(fill=guide_legend(ncol=4)) +
+  theme(plot.title = element_text(size=rel(1.4), hjust=0, face="plain"))
+g_hv_family
+
+
+

+
+
+
+
Code
# Get most prominent families for text
+hv_family_collate <- hv_family_counts %>% group_by(name, taxid) %>% 
+  summarize(n_reads_tot = sum(n_reads_hv),
+            p_reads_max = max(p_reads_hv), .groups="drop") %>% 
+  arrange(desc(n_reads_tot))
+
+
+

In investigating individual viral families, to avoid distortions from a few rare reads, I restricted myself to samples where that family made up at least 10% of human-viral reads:

+
+
Code
threshold_major_species <- 0.05
+taxid_adeno <- 10508
+
+# Get set of adenoviridae reads
+adeno_samples <- hv_family_counts %>% filter(taxid == taxid_adeno) %>%
+  filter(p_reads_hv >= 0.1) %>%
+  pull(sample)
+adeno_ids <- hv_reads_family %>% 
+  filter(taxid == taxid_adeno, sample %in% adeno_samples) %>%
+  pull(seq_id)
+
+# Count reads for each adenoviridae species
+adeno_species_counts <- hv_reads_species %>%
+  filter(seq_id %in% adeno_ids) %>%
+  group_by(sample, name, taxid) %>%
+  count(name = "n_reads_hv") %>%
+  group_by(sample) %>%
+  mutate(p_reads_adeno = n_reads_hv/sum(n_reads_hv))
+
+# Identify high-ranking families and group others
+adeno_species_major_tab <- adeno_species_counts %>% group_by(name) %>% 
+  filter(p_reads_adeno == max(p_reads_adeno)) %>% 
+  filter(row_number() == 1) %>%
+  arrange(desc(p_reads_adeno)) %>% 
+  filter(p_reads_adeno > threshold_major_species)
+adeno_species_counts_major <- adeno_species_counts %>%
+  mutate(name_display = ifelse(name %in% adeno_species_major_tab$name, 
+                               name, "Other")) %>%
+  group_by(sample, name_display) %>%
+  summarize(n_reads_adeno = sum(n_reads_hv),
+            p_reads_adeno = sum(p_reads_adeno), 
+            .groups="drop") %>%
+  mutate(name_display = factor(name_display, 
+                               levels = c(adeno_species_major_tab$name, "Other")))
+adeno_species_counts_display <- adeno_species_counts_major %>%
+  rename(p_reads = p_reads_adeno, classification = name_display)
+
+# Plot
+g_adeno_species <- g_comp_base + 
+  geom_col(data=adeno_species_counts_display, position = "stack", width=1) +
+  scale_y_continuous(name="% Adenoviridae Reads", limits=c(0,1.01), 
+                     breaks = seq(0,1,0.2),
+                     expand=c(0,0), labels = function(y) y*100) +
+  scale_fill_manual(values=palette_viral, name = "Viral species") +
+  labs(title="Species composition of Adenoviridae reads") +
+  guides(fill=guide_legend(ncol=3)) +
+  theme(plot.title = element_text(size=rel(1.4), hjust=0, face="plain"))
+
+g_adeno_species
+
+
+

+
+
+
+
Code
# Get most prominent species for text
+adeno_species_collate <- adeno_species_counts %>% group_by(name, taxid) %>% 
+  summarize(n_reads_tot = sum(n_reads_hv), p_reads_mean = mean(p_reads_adeno), .groups="drop") %>% 
+  arrange(desc(n_reads_tot))
+
+
+
+
Code
threshold_major_species <- 0.1
+taxid_polyoma <- 151341
+
+# Get set of polyomaviridae reads
+polyoma_samples <- hv_family_counts %>% filter(taxid == taxid_polyoma) %>%
+  filter(p_reads_hv >= 0.1) %>%
+  pull(sample)
+polyoma_ids <- hv_reads_family %>% 
+  filter(taxid == taxid_polyoma, sample %in% polyoma_samples) %>%
+  pull(seq_id)
+
+# Count reads for each polyomaviridae species
+polyoma_species_counts <- hv_reads_species %>%
+  filter(seq_id %in% polyoma_ids) %>%
+  group_by(sample, name, taxid) %>%
+  count(name = "n_reads_hv") %>%
+  group_by(sample) %>%
+  mutate(p_reads_polyoma = n_reads_hv/sum(n_reads_hv))
+
+# Identify high-ranking families and group others
+polyoma_species_major_tab <- polyoma_species_counts %>% group_by(name) %>% 
+  filter(p_reads_polyoma == max(p_reads_polyoma)) %>% 
+  filter(row_number() == 1) %>%
+  arrange(desc(p_reads_polyoma)) %>% 
+  filter(p_reads_polyoma > threshold_major_species)
+polyoma_species_counts_major <- polyoma_species_counts %>%
+  mutate(name_display = ifelse(name %in% polyoma_species_major_tab$name, 
+                               name, "Other")) %>%
+  group_by(sample, name_display) %>%
+  summarize(n_reads_polyoma = sum(n_reads_hv),
+            p_reads_polyoma = sum(p_reads_polyoma), 
+            .groups="drop") %>%
+  mutate(name_display = factor(name_display, 
+                               levels = c(polyoma_species_major_tab$name, "Other")))
+polyoma_species_counts_display <- polyoma_species_counts_major %>%
+  rename(p_reads = p_reads_polyoma, classification = name_display)
+
+# Plot
+g_polyoma_species <- g_comp_base + 
+  geom_col(data=polyoma_species_counts_display, position = "stack", width=1) +
+  scale_y_continuous(name="% Polyomaviridae Reads", limits=c(0,1.01), 
+                     breaks = seq(0,1,0.2),
+                     expand=c(0,0), labels = function(y) y*100) +
+  scale_fill_manual(values=palette_viral, name = "Viral species") +
+  labs(title="Species composition of Polyomaviridae reads") +
+  guides(fill=guide_legend(ncol=3)) +
+  theme(plot.title = element_text(size=rel(1.4), hjust=0, face="plain"))
+
+g_polyoma_species
+
+
+

+
+
+
+
Code
# Get most prominent species for text
+polyoma_species_collate <- polyoma_species_counts %>% group_by(name, taxid) %>% 
+  summarize(n_reads_tot = sum(n_reads_hv), p_reads_mean = mean(p_reads_polyoma), .groups="drop") %>% 
+  arrange(desc(n_reads_tot))
+
+
+
+
Code
threshold_major_species <- 0.1
+taxid_papilloma <- 151340
+
+# Get set of papillomaviridae reads
+papilloma_samples <- hv_family_counts %>% filter(taxid == taxid_papilloma) %>%
+  filter(p_reads_hv >= 0.1) %>%
+  pull(sample)
+papilloma_ids <- hv_reads_family %>% 
+  filter(taxid == taxid_papilloma, sample %in% papilloma_samples) %>%
+  pull(seq_id)
+
+# Count reads for each papillomaviridae species
+papilloma_species_counts <- hv_reads_species %>%
+  filter(seq_id %in% papilloma_ids) %>%
+  group_by(sample, name, taxid) %>%
+  count(name = "n_reads_hv") %>%
+  group_by(sample) %>%
+  mutate(p_reads_papilloma = n_reads_hv/sum(n_reads_hv))
+
+# Identify high-ranking families and group others
+papilloma_species_major_tab <- papilloma_species_counts %>% group_by(name) %>% 
+  filter(p_reads_papilloma == max(p_reads_papilloma)) %>% 
+  filter(row_number() == 1) %>%
+  arrange(desc(p_reads_papilloma)) %>% 
+  filter(p_reads_papilloma > threshold_major_species)
+papilloma_species_counts_major <- papilloma_species_counts %>%
+  mutate(name_display = ifelse(name %in% papilloma_species_major_tab$name, 
+                               name, "Other")) %>%
+  group_by(sample, name_display) %>%
+  summarize(n_reads_papilloma = sum(n_reads_hv),
+            p_reads_papilloma = sum(p_reads_papilloma), 
+            .groups="drop") %>%
+  mutate(name_display = factor(name_display, 
+                               levels = c(papilloma_species_major_tab$name, "Other")))
+papilloma_species_counts_display <- papilloma_species_counts_major %>%
+  rename(p_reads = p_reads_papilloma, classification = name_display)
+
+# Plot
+g_papilloma_species <- g_comp_base + 
+  geom_col(data=papilloma_species_counts_display, position = "stack", width=1) +
+  scale_y_continuous(name="% Papillomaviridae Reads", limits=c(0,1.01), 
+                     breaks = seq(0,1,0.2),
+                     expand=c(0,0), labels = function(y) y*100) +
+  scale_fill_manual(values=palette_viral, name = "Viral species") +
+  labs(title="Species composition of Papillomaviridae reads") +
+  guides(fill=guide_legend(ncol=3)) +
+  theme(plot.title = element_text(size=rel(1.4), hjust=0, face="plain"))
+
+g_papilloma_species
+
+
+

+
+
+
+
Code
# Get most prominent species for text
+papilloma_species_collate <- papilloma_species_counts %>% group_by(name, taxid) %>% 
+  summarize(n_reads_tot = sum(n_reads_hv), p_reads_mean = mean(p_reads_papilloma), .groups="drop") %>% 
+  arrange(desc(n_reads_tot))
+
+
+
+
Code
threshold_major_species <- 0.1
+taxid_herpes <- 10292
+
+# Get set of herpesviridae reads
+herpes_samples <- hv_family_counts %>% filter(taxid == taxid_herpes) %>%
+  filter(p_reads_hv >= 0.1) %>%
+  pull(sample)
+herpes_ids <- hv_reads_family %>% 
+  filter(taxid == taxid_herpes, sample %in% herpes_samples) %>%
+  pull(seq_id)
+
+# Count reads for each herpesviridae species
+herpes_species_counts <- hv_reads_species %>%
+  filter(seq_id %in% herpes_ids) %>%
+  group_by(sample, name, taxid) %>%
+  count(name = "n_reads_hv") %>%
+  group_by(sample) %>%
+  mutate(p_reads_herpes = n_reads_hv/sum(n_reads_hv))
+
+# Identify high-ranking families and group others
+herpes_species_major_tab <- herpes_species_counts %>% group_by(name) %>% 
+  filter(p_reads_herpes == max(p_reads_herpes)) %>% 
+  filter(row_number() == 1) %>%
+  arrange(desc(p_reads_herpes)) %>% 
+  filter(p_reads_herpes > threshold_major_species)
+herpes_species_counts_major <- herpes_species_counts %>%
+  mutate(name_display = ifelse(name %in% herpes_species_major_tab$name, 
+                               name, "Other")) %>%
+  group_by(sample, name_display) %>%
+  summarize(n_reads_herpes = sum(n_reads_hv),
+            p_reads_herpes = sum(p_reads_herpes), 
+            .groups="drop") %>%
+  mutate(name_display = factor(name_display, 
+                               levels = c(herpes_species_major_tab$name, "Other")))
+herpes_species_counts_display <- herpes_species_counts_major %>%
+  rename(p_reads = p_reads_herpes, classification = name_display)
+
+# Plot
+g_herpes_species <- g_comp_base + 
+  geom_col(data=herpes_species_counts_display, position = "stack", width=1) +
+  scale_y_continuous(name="% Herpesviridae Reads", limits=c(0,1.01), 
+                     breaks = seq(0,1,0.2),
+                     expand=c(0,0), labels = function(y) y*100) +
+  scale_fill_manual(values=palette_viral, name = "Viral species") +
+  labs(title="Species composition of Herpesviridae reads") +
+  guides(fill=guide_legend(ncol=3)) +
+  theme(plot.title = element_text(size=rel(1.4), hjust=0, face="plain"))
+
+g_herpes_species
+
+
+

+
+
+
+
Code
# Get most prominent species for text
+herpes_species_collate <- herpes_species_counts %>% group_by(name, taxid) %>% 
+  summarize(n_reads_tot = sum(n_reads_hv), p_reads_mean = mean(p_reads_herpes), .groups="drop") %>% 
+  arrange(desc(n_reads_tot))
+
+
+

I was a bit suspicious of this last result, given that it only occurred in one sample, but according to BLASTN, at least, these human gammaherpesvirus 4 (a.k.a. EBV) matches are real:

+
+
Code
# Configure
+ref_taxids_hv <- c(10376)
+ref_names_hv <- sapply(ref_taxids_hv, function(x) viral_taxa %>% filter(taxid == x) %>% pull(name) %>% first)
+p_threshold <- 0.1
+
+# Get taxon names
+tax_names_path <- file.path(data_dir, "taxid-names.tsv.gz")
+tax_names <- read_tsv(tax_names_path, show_col_types = FALSE)
+
+# Add missing names
+tax_names_new <- tribble(~staxid, ~name,
+                         3050295, "Cytomegalovirus humanbeta5",
+                         459231, "FLAG-tagging vector pFLAG97-TSR",
+                         257877, "Macaca thibetana thibetana",
+                         256321, "Lentiviral transfer vector pHsCXW",
+                         419242, "Shuttle vector pLvCmvMYOCDHA",
+                         419243, "Shuttle vector pLvCmvLacZ",
+                         421868, "Cloning vector pLvCmvLacZ.Gfp",
+                         421869, "Cloning vector pLvCmvMyocardin.Gfp",
+                         426303, "Lentiviral vector pNL-GFP-RRE(SA)",
+                         436015, "Lentiviral transfer vector pFTMGW",
+                         454257, "Shuttle vector pLvCmvMYOCD2aHA",
+                         476184, "Shuttle vector pLV.mMyoD::ERT2.eGFP",
+                         476185, "Shuttle vector pLV.hMyoD.eGFP",
+                         591936, "Piliocolobus tephrosceles",
+                         627481, "Lentiviral transfer vector pFTM3GW",
+                         680261, "Self-inactivating lentivirus vector pLV.C-EF1a.cyt-bGal.dCpG",
+                         2952778, "Expression vector pLV[Exp]-EGFP:T2A:Puro-EF1A",
+                         3022699, "Vector PAS_122122",
+                         3025913, "Vector pSIN-WP-mPGK-GDNF",
+                         3105863, "Vector pLKO.1-ZsGreen1",
+                         3105864, "Vector pLKO.1-ZsGreen1 mouse Wfs1 shRNA",
+                         3108001, "Cloning vector pLVSIN-CMV_Neo_v4.0",
+                         3109234, "Vector pTwist+Kan+High",
+                         3117662, "Cloning vector pLV[Exp]-CBA>P301L",
+                         3117663, "Cloning vector pLV[Exp]-CBA>P301L:T2A:mRuby3",
+                         3117664, "Cloning vector pLV[Exp]-CBA>hMAPT[NM_005910.6](ns):T2A:mRuby3",
+                         3117665, "Cloning vector pLV[Exp]-CBA>mRuby3",
+                         3117666, "Cloning vector pLV[Exp]-CBA>mRuby3/NFAT3 fusion protein",
+                         3117667, "Cloning vector pLV[Exp]-Neo-mPGK>{EGFP-hSEPT6}",
+                         438045, "Xenotropic MuLV-related virus",
+                         447135, "Myodes glareolus",
+                         590745, "Mus musculus mobilized endogenous polytropic provirus",
+                         181858, "Murine AIDS virus-related provirus",
+                         356663, "Xenotropic MuLV-related virus VP35",
+                         356664, "Xenotropic MuLV-related virus VP42",
+                         373193, "Xenotropic MuLV-related virus VP62",
+                         286419, "Canis lupus dingo",
+                         415978, "Sus scrofa scrofa",
+                         494514, "Vulpes lagopus",
+                         3082113, "Rangifer tarandus platyrhynchus",
+                         3119969, "Bubalus kerabau")
+tax_names <- bind_rows(tax_names, tax_names_new)
+
+# Get matches
+hv_blast_staxids <- hv_reads_species %>% filter(taxid %in% ref_taxids_hv) %>%
+  group_by(taxid) %>% mutate(n_seq = n()) %>%
+  left_join(blast_paired, by="seq_id") %>%
+  mutate(staxid = as.integer(staxid)) %>%
+  left_join(tax_names %>% rename(sname=name), by="staxid")
+
+# Count matches
+hv_blast_counts <- hv_blast_staxids %>%
+  group_by(taxid, name, staxid, sname, n_seq) %>%
+  count %>% mutate(p=n/n_seq)
+
+# Subset to major matches
+hv_blast_counts_major <- hv_blast_counts %>% 
+  filter(n>1, p>p_threshold, !is.na(staxid)) %>%
+  arrange(desc(p)) %>% group_by(taxid) %>%
+  filter(row_number() <= 25) %>%
+  mutate(name_display = ifelse(name == ref_names_hv[1], "EBV", name))
+
+# Plot
+g_hv_blast <- ggplot(hv_blast_counts_major, mapping=aes(x=p, y=sname)) +
+  geom_col() +
+  facet_grid(name_display~., scales="free_y", space="free_y") +
+  scale_x_continuous(name="% mapped reads", limits=c(0,1), 
+                     breaks=seq(0,1,0.2), expand=c(0,0)) +
+  theme_base + theme(axis.title.y = element_blank())
+g_hv_blast
+
+
+

+
+
+
+
+

Finally, here again are the overall relative abundances of the specific viral genera I picked out manually in my last entry:

+
+
Code
# Define reference genera
+path_genera_rna <- c("Mamastrovirus", "Enterovirus", "Salivirus", "Kobuvirus", "Norovirus", "Sapovirus", "Rotavirus", "Alphacoronavirus", "Betacoronavirus", "Alphainfluenzavirus", "Betainfluenzavirus", "Lentivirus")
+path_genera_dna <- c("Mastadenovirus", "Alphapolyomavirus", "Betapolyomavirus", "Alphapapillomavirus", "Betapapillomavirus", "Gammapapillomavirus", "Orthopoxvirus", "Simplexvirus",
+                     "Lymphocryptovirus", "Cytomegalovirus", "Dependoparvovirus")
+path_genera <- bind_rows(tibble(name=path_genera_rna, genome_type="RNA genome"),
+                         tibble(name=path_genera_dna, genome_type="DNA genome")) %>%
+  left_join(viral_taxa, by="name")
+
+# Count in each sample
+mrg_hv_named_all <- mrg_hv %>% left_join(viral_taxa, by="taxid")
+hv_reads_genus_all <- raise_rank(mrg_hv_named_all, viral_taxa, "genus")
+n_path_genera <- hv_reads_genus_all %>% 
+  group_by(sample, name, taxid) %>% 
+  count(name="n_reads_viral") %>% 
+  inner_join(path_genera, by=c("name", "taxid")) %>%
+  left_join(read_counts_raw, by=c("sample")) %>%
+  mutate(p_reads_viral = n_reads_viral/n_reads_raw)
+
+# Pivot out and back to add zero lines
+n_path_genera_out <- n_path_genera %>% ungroup %>% select(sample, name, n_reads_viral) %>%
+  pivot_wider(names_from="name", values_from="n_reads_viral", values_fill=0) %>%
+  pivot_longer(-sample, names_to="name", values_to="n_reads_viral") %>%
+  left_join(read_counts_raw, by="sample") %>%
+  left_join(path_genera, by="name") %>%
+  mutate(p_reads_viral = n_reads_viral/n_reads_raw)
+
+## Aggregate across dates
+n_path_genera_stype <- n_path_genera_out %>% 
+  group_by(name, taxid, genome_type) %>%
+  summarize(n_reads_raw = sum(n_reads_raw),
+            n_reads_viral = sum(n_reads_viral), .groups = "drop") %>%
+  mutate(sample="All samples", location="All locations",
+         p_reads_viral = n_reads_viral/n_reads_raw,
+         na_type = "DNA")
+
+# Plot
+g_path_genera <- ggplot(n_path_genera_stype,
+                        aes(y=name, x=p_reads_viral)) +
+  geom_point() +
+  scale_x_log10(name="Relative abundance") +
+  facet_grid(genome_type~., scales="free_y") +
+  theme_base + theme(axis.title.y = element_blank())
+g_path_genera
+
+
+

+
+
+
+
+

Conclusion

+

I’ve had trouble with this dataset previously, so I was surprised at how well this analysis went. It seems the improvements I’ve made to the pipeline over the last couple of months have really had an effect. Like other DNA wastewater datasets I’ve looked at recently, this one (a) has very low HV relative abundance overall, and (b) shows a very high preponderance of human mastadenovirus F. I have one more DNA dataset from the P2RA study to analyze with this pipeline, so we’ll see if this pattern persists there.

+ + + + +
+
+ + + + \ No newline at end of file diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/hv-blast-hits-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/hv-blast-hits-1.png new file mode 100644 index 0000000..4ba60b7 Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/hv-blast-hits-1.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/hv-family-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/hv-family-1.png new file mode 100644 index 0000000..94a5e33 Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/hv-family-1.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/hv-species-adeno-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/hv-species-adeno-1.png new file mode 100644 index 0000000..2218d3d Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/hv-species-adeno-1.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/hv-species-herpes-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/hv-species-herpes-1.png new file mode 100644 index 0000000..760c92d Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/hv-species-herpes-1.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/hv-species-papilloma-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/hv-species-papilloma-1.png new file mode 100644 index 0000000..8579a61 Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/hv-species-papilloma-1.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/hv-species-polyoma-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/hv-species-polyoma-1.png new file mode 100644 index 0000000..f8cb8c7 Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/hv-species-polyoma-1.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-basic-stats-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-basic-stats-1.png new file mode 100644 index 0000000..90b6c4d Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-basic-stats-1.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-blast-results-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-blast-results-1.png new file mode 100644 index 0000000..c597a22 Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-blast-results-1.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-composition-all-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-composition-all-1.png new file mode 100644 index 0000000..3ba0855 Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-composition-all-1.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-composition-all-2.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-composition-all-2.png new file mode 100644 index 0000000..151c409 Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-composition-all-2.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-f1-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-f1-1.png new file mode 100644 index 0000000..aa07682 Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-f1-1.png differ diff --git 
a/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-hv-ra-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-hv-ra-1.png new file mode 100644 index 0000000..9a5016f Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-hv-ra-1.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-hv-scores-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-hv-scores-1.png new file mode 100644 index 0000000..7136762 Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-hv-scores-1.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-quality-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-quality-1.png new file mode 100644 index 0000000..bcfd838 Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-quality-1.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-quality-2.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-quality-2.png new file mode 100644 index 0000000..fde7bbd Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-quality-2.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-quality-3.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-quality-3.png new file mode 100644 index 0000000..0f045f7 Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-quality-3.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-raw-quality-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-raw-quality-1.png new file mode 100644 index 0000000..98a24e7 Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-raw-quality-1.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-raw-quality-2.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-raw-quality-2.png new file mode 100644 index 0000000..b1cd56e Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-raw-quality-2.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-raw-quality-3.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-raw-quality-3.png new file mode 100644 index 0000000..45a6f63 Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/plot-raw-quality-3.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/preproc-dedup-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/preproc-dedup-1.png new file mode 100644 index 0000000..221ef34 Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/preproc-dedup-1.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/preproc-dedup-2.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/preproc-dedup-2.png new file mode 100644 index 0000000..c2d166d Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/preproc-dedup-2.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/preproc-figures-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/preproc-figures-1.png new file mode 100644 index 0000000..b6c4bd2 Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/preproc-figures-1.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/preproc-figures-2.png 
b/docs/notebooks/2024-05-01_maritz_files/figure-html/preproc-figures-2.png new file mode 100644 index 0000000..c1b37fd Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/preproc-figures-2.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/ra-genera-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/ra-genera-1.png new file mode 100644 index 0000000..9c4f6f3 Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/ra-genera-1.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/ra-hv-past-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/ra-hv-past-1.png new file mode 100644 index 0000000..93b237c Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/ra-hv-past-1.png differ diff --git a/docs/notebooks/2024-05-01_maritz_files/figure-html/viral-class-composition-1.png b/docs/notebooks/2024-05-01_maritz_files/figure-html/viral-class-composition-1.png new file mode 100644 index 0000000..71c4320 Binary files /dev/null and b/docs/notebooks/2024-05-01_maritz_files/figure-html/viral-class-composition-1.png differ diff --git a/docs/notebooks/2024-05-01_ng.html b/docs/notebooks/2024-05-01_ng.html index 18360a9..e00f724 100644 --- a/docs/notebooks/2024-05-01_ng.html +++ b/docs/notebooks/2024-05-01_ng.html @@ -574,7 +574,7 @@
-As in previous DNA datasets, the vast majority of classified reads were bacterial in origin. The fraction of virus reads varied substantially between sample types, averaging <0.01% in influent and final effluent but closer to 0.05% in other sample types. Interestingly (though not particularly relevantly for this analysis), the fraction of archaeal reads was much higher in influent than other sample types, in contrast to Bengtsson-Palme where it was highest in slidge.
+As in previous DNA datasets, the vast majority of classified reads were bacterial in origin. The fraction of virus reads varied substantially between sample types, averaging <0.01% in influent and final effluent but closer to 0.05% in other sample types. Interestingly (though not particularly relevantly for this analysis), the fraction of archaeal reads was much higher in influent than other sample types, in contrast to Bengtsson-Palme where it was highest in sludge.

As is common for DNA data, viral reads were overwhelmingly dominated by Caudoviricetes phages, though one wet-well sample contained a substantial fraction of Alsuviricetes (a class of mainly plant pathogens that includes Virgaviridae):

Code
# Get Kraken reports
@@ -2151,7 +2151,7 @@
 p_reads_summ
 ```
 
-As in previous DNA datasets, the vast majority of classified reads were bacterial in origin. The fraction of virus reads varied substantially between sample types, averaging \<0.01% in influent and final effluent but closer to 0.05% in other sample types. Interestingly (though not particularly relevantly for this analysis), the fraction of archaeal reads was much higher in influent than other sample types, in contrast to [Bengtsson-Palme](https://data.securebio.org/wills-public-notebook/notebooks/2024-05-01_bengtsson-palme.html) where it was highest in slidge.
+As in previous DNA datasets, the vast majority of classified reads were bacterial in origin. The fraction of virus reads varied substantially between sample types, averaging \<0.01% in influent and final effluent but closer to 0.05% in other sample types. Interestingly (though not particularly relevantly for this analysis), the fraction of archaeal reads was much higher in influent than other sample types, in contrast to [Bengtsson-Palme](https://data.securebio.org/wills-public-notebook/notebooks/2024-05-01_bengtsson-palme.html) where it was highest in sludge.
 
 As is common for DNA data, viral reads were overwhelmingly dominated by *Caudoviricetes* phages, though one wet-well sample contained a substantial fraction of *Alsuviricetes* (a class of mainly plant pathogens that includes *Virgaviridae*):
 
diff --git a/docs/search.json b/docs/search.json
index 04cb3d0..0e41c6e 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -32,7 +32,7 @@
     "href": "index.html",
     "title": "Will's Public NAO Notebook",
     "section": "",
-    "text": "Workflow analysis of Ng et al. (2019)\n\n\nWastewater from Singapore.\n\n\n\n\n\nMay 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Bengtsson-Palme et al. (2016)\n\n\nWastewater grab samples from Sweden.\n\n\n\n\n\nMay 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Brinch et al. (2020)\n\n\nWastewater from Copenhagen.\n\n\n\n\n\nApr 30, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Leung et al. (2021)\n\n\nAir sampling from urban public transit systems.\n\n\n\n\n\nApr 19, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Rosario et al. (2018)\n\n\nAir sampling from a student dorm in Colorado.\n\n\n\n\n\nApr 12, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Prussin et al. (2019)\n\n\nAir filters from a daycare in Virginia.\n\n\n\n\n\nApr 12, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Brumfield et al. (2022)\n\n\nWastewater from a manhole in Maryland.\n\n\n\n\n\nApr 8, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Spurbeck et al. (2023)\n\n\nCave carpa.\n\n\n\n\n\nApr 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nFollowup analysis of Yang et al. (2020)\n\n\nDigging into deduplication.\n\n\n\n\n\nMar 19, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Yang et al. (2020)\n\n\nWastewater from Xinjiang.\n\n\n\n\n\nMar 16, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nImproving read deduplication in the MGS workflow\n\n\nRemoving reverse-complement duplicates of human-viral reads.\n\n\n\n\n\nMar 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Rothman et al. (2021), part 2\n\n\nPanel-enriched samples.\n\n\n\n\n\nFeb 29, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Rothman et al. (2021), part 1\n\n\nUnenriched samples.\n\n\n\n\n\nFeb 27, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 3\n\n\nFixing the virus pipeline.\n\n\n\n\n\nFeb 15, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 2\n\n\nAbundance and composition of human-infecting viruses.\n\n\n\n\n\nFeb 8, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Crits-Christoph et al. 
(2021), part 1\n\n\nPreprocessing and composition.\n\n\n\n\n\nFeb 4, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nAutomating BLAST validation of human viral read assignment\n\n\nExperiments with BLASTN remote mode\n\n\n\n\n\nJan 30, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nProject Runway RNA-seq testing data: removing livestock reads\n\n\n\n\n\n\n\n\nDec 22, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Project Runway RNA-seq testing data\n\n\nApplying a new workflow to some oldish data.\n\n\n\n\n\nDec 19, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nEstimating the effect of read depth on duplication rate for Project Runway DNA data\n\n\nHow deep can we go?\n\n\n\n\n\nNov 8, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing viral read assignments between pipelines on Project Runway data\n\n\n\n\n\n\n\n\nNov 2, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nInitial analysis of Project Runway protocol testing data\n\n\n\n\n\n\n\n\nOct 31, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing options for read deduplication\n\n\nClumpify vs fastp\n\n\n\n\n\nOct 19, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing Ribodetector and bbduk for rRNA detection\n\n\nIn search of quick rRNA filtering.\n\n\n\n\n\nOct 16, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing FASTP and AdapterRemoval for MGS pre-processing\n\n\nTwo tools – how do they perform?\n\n\n\n\n\nOct 12, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nHow does Element AVITI sequencing work?\n\n\nFindings of a shallow investigation\n\n\n\n\n\nOct 11, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nExtraction experiment 2: high-level results & interpretation\n\n\nComparing RNA yields and quality across extraction kits for settled solids\n\n\n\n\n\nSep 21, 2023\n\n\n\n\n\n\nNo matching items"
+    "text": "Workflow analysis of Ng et al. (2019)\n\n\nWastewater from Singapore.\n\n\n\n\n\nMay 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Bengtsson-Palme et al. (2016)\n\n\nWastewater grab samples from Sweden.\n\n\n\n\n\nMay 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Maritz et al. (2019)\n\n\nWastewater from NYC.\n\n\n\n\n\nMay 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Brinch et al. (2020)\n\n\nWastewater from Copenhagen.\n\n\n\n\n\nApr 30, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Leung et al. (2021)\n\n\nAir sampling from urban public transit systems.\n\n\n\n\n\nApr 19, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Rosario et al. (2018)\n\n\nAir sampling from a student dorm in Colorado.\n\n\n\n\n\nApr 12, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Prussin et al. (2019)\n\n\nAir filters from a daycare in Virginia.\n\n\n\n\n\nApr 12, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Brumfield et al. (2022)\n\n\nWastewater from a manhole in Maryland.\n\n\n\n\n\nApr 8, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Spurbeck et al. (2023)\n\n\nCave carpa.\n\n\n\n\n\nApr 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nFollowup analysis of Yang et al. (2020)\n\n\nDigging into deduplication.\n\n\n\n\n\nMar 19, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Yang et al. (2020)\n\n\nWastewater from Xinjiang.\n\n\n\n\n\nMar 16, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nImproving read deduplication in the MGS workflow\n\n\nRemoving reverse-complement duplicates of human-viral reads.\n\n\n\n\n\nMar 1, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Rothman et al. (2021), part 2\n\n\nPanel-enriched samples.\n\n\n\n\n\nFeb 29, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Rothman et al. (2021), part 1\n\n\nUnenriched samples.\n\n\n\n\n\nFeb 27, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 3\n\n\nFixing the virus pipeline.\n\n\n\n\n\nFeb 15, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Crits-Christoph et al. (2021), part 2\n\n\nAbundance and composition of human-infecting viruses.\n\n\n\n\n\nFeb 8, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Crits-Christoph et al. 
(2021), part 1\n\n\nPreprocessing and composition.\n\n\n\n\n\nFeb 4, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nAutomating BLAST validation of human viral read assignment\n\n\nExperiments with BLASTN remote mode\n\n\n\n\n\nJan 30, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nProject Runway RNA-seq testing data: removing livestock reads\n\n\n\n\n\n\n\n\nDec 22, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nWorkflow analysis of Project Runway RNA-seq testing data\n\n\nApplying a new workflow to some oldish data.\n\n\n\n\n\nDec 19, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nEstimating the effect of read depth on duplication rate for Project Runway DNA data\n\n\nHow deep can we go?\n\n\n\n\n\nNov 8, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing viral read assignments between pipelines on Project Runway data\n\n\n\n\n\n\n\n\nNov 2, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nInitial analysis of Project Runway protocol testing data\n\n\n\n\n\n\n\n\nOct 31, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing options for read deduplication\n\n\nClumpify vs fastp\n\n\n\n\n\nOct 19, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing Ribodetector and bbduk for rRNA detection\n\n\nIn search of quick rRNA filtering.\n\n\n\n\n\nOct 16, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nComparing FASTP and AdapterRemoval for MGS pre-processing\n\n\nTwo tools – how do they perform?\n\n\n\n\n\nOct 12, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nHow does Element AVITI sequencing work?\n\n\nFindings of a shallow investigation\n\n\n\n\n\nOct 11, 2023\n\n\n\n\n\n\n\n\n\n\n\n\nExtraction experiment 2: high-level results & interpretation\n\n\nComparing RNA yields and quality across extraction kits for settled solids\n\n\n\n\n\nSep 21, 2023\n\n\n\n\n\n\nNo matching items"
   },
   {
     "objectID": "notebooks/2023-10-12_fastp-vs-adapterremoval.html",
@@ -326,6 +326,13 @@
     "href": "notebooks/2024-05-01_ng.html",
     "title": "Workflow analysis of Ng et al. (2019)",
     "section": "",
-    "text": "Continuing my analysis of datasets from the P2RA preprint, I analyzed the data from Ng et al. (2019), a study that used DNA sequencing of wastewater samples to characterize the bacterial microbiota and resistome in Singapore. This study used processing methods I haven’t seen before:\n\nAll samples passed through “a filter” on-site at the WWTP prior to further processing in lab.\nSamples concentrated to 400ml using a Hemoflow dialyzer “via standard bloodline tubing”.\nEluted concentrates then further concentrated by passing through a 0.22um filter and retaining the retentate (NB: this is anti-selecting for viruses).\nSludge samples were instead centrifuged and the pellet kept for further analysis.\nAfter concentration, samples underwent DNA extraction with the PowerSoil DNA Isolation Kit, then underwent library prep and Illumina sequencing with an Illumina HiSeq2500 (2x250bp).\n\nSince this was a bacteria-focused study that used processing methods we expect to select against viruses, we wouldn’t expect to see high viral relative abundances here. Nevertheless, it’s worth seeing what we can see.\nThe raw data\nSamples were collected from six different locations in the treatment plant on six different dates (from October 2016 to August 2017) for a total of 36 samples:\n\n\nCode# Importing the data is a bit more complicated this time as the samples are split across three pipeline runs\ndata_dir <- \"../data/2024-05-01_ng\"\n\n# Data input paths\nlibraries_path <- file.path(data_dir, \"sample-metadata.csv\")\nbasic_stats_path <- file.path(data_dir, \"qc_basic_stats.tsv.gz\")\nadapter_stats_path <- file.path(data_dir, \"qc_adapter_stats.tsv.gz\")\nquality_base_stats_path <- file.path(data_dir, \"qc_quality_base_stats.tsv.gz\")\nquality_seq_stats_path <- file.path(data_dir, \"qc_quality_sequence_stats.tsv.gz\")\n\n# Import libraries and extract metadata from sample names\nlocs <- c(\"INF\", \"PST\", \"SLUDGE\", \"SST\", \"MBR\", \"WW\")\nlibraries_raw <- lapply(libraries_path, read_csv, show_col_types = FALSE) %>%\n  bind_rows\nlibraries <- libraries_raw %>%\n  mutate(sample_type_long = gsub(\" \\\\(.*\", \"\", sample_type),\n         sample_type_short = ifelse(sample_type_long == \"Influent\", \"INF\",\n                                    sub(\".*\\\\((.*)\\\\)\", \"\\\\1\", sample_type)),\n         sample_type_short = factor(sample_type_short, levels=locs)) %>%\n  arrange(sample_type_short, date) %>%\n  mutate(sample_type_long = fct_inorder(sample_type_long),\n         sample = fct_inorder(sample)) %>%\n  arrange(date) %>%\n  mutate(date = fct_inorder(date))\n\n# Make table\ncount_samples <- libraries %>% group_by(sample_type_long, sample_type_short) %>%\n  count %>%\n  rename(`Sample Type`=sample_type_long, Abbreviation=sample_type_short)\ncount_samples\n\n\n  \n\n\n\n\nCode# Import QC data\nstages <- c(\"raw_concat\", \"cleaned\", \"dedup\", \"ribo_initial\", \"ribo_secondary\")\nimport_basic <- function(paths){\n  lapply(paths, read_tsv, show_col_types = FALSE) %>% bind_rows %>%\n    inner_join(libraries, by=\"sample\") %>%\n      arrange(sample_type_short, date, sample) %>%\n    mutate(stage = factor(stage, levels = stages),\n           sample = fct_inorder(sample))\n}\nimport_basic_paired <- function(paths){\n  import_basic(paths) %>% arrange(read_pair) %>% \n    mutate(read_pair = fct_inorder(as.character(read_pair)))\n}\nbasic_stats <- import_basic(basic_stats_path)\nadapter_stats <- import_basic_paired(adapter_stats_path)\nquality_base_stats <- 
import_basic_paired(quality_base_stats_path)\nquality_seq_stats <- import_basic_paired(quality_seq_stats_path)\n\n# Filter to raw data\nbasic_stats_raw <- basic_stats %>% filter(stage == \"raw_concat\")\nadapter_stats_raw <- adapter_stats %>% filter(stage == \"raw_concat\")\nquality_base_stats_raw <- quality_base_stats %>% filter(stage == \"raw_concat\")\nquality_seq_stats_raw <- quality_seq_stats %>% filter(stage == \"raw_concat\")\n\n# Get key values for readout\nraw_read_counts <- basic_stats_raw %>% ungroup %>% \n  summarize(rmin = min(n_read_pairs), rmax=max(n_read_pairs),\n            rmean=mean(n_read_pairs), \n            rtot = sum(n_read_pairs),\n            btot = sum(n_bases_approx),\n            dmin = min(percent_duplicates), dmax=max(percent_duplicates),\n            dmean=mean(percent_duplicates), .groups = \"drop\")\n\n\nThese 36 samples yielded 26.6M-74.1M (mean 46.1M) reads per sample, for a total of 1.7B read pairs (830 gigabases of sequence). Read qualities were mostly high but tailed off towards the 3’ end, requiring some trimming. Adapter levels were fairly low but still in need of some trimming. Inferred duplication levels were variable (1-64%, mean 31%), with libraries with lower read depth showing much lower duplicate levels.\n\nCode# Prepare data\nbasic_stats_raw_metrics <- basic_stats_raw %>%\n  select(sample, sample_type_short, date,\n         `# Read pairs` = n_read_pairs,\n         `Total base pairs\\n(approx)` = n_bases_approx,\n         `% Duplicates\\n(FASTQC)` = percent_duplicates) %>%\n  pivot_longer(-(sample:date), names_to = \"metric\", values_to = \"value\") %>%\n  mutate(metric = fct_inorder(metric))\n\n# Set up plot templates\nscale_fill_st <- purrr::partial(scale_fill_brewer, palette=\"Set1\", name=\"Sample Type\")\ng_basic <- ggplot(basic_stats_raw_metrics, \n                  aes(x=sample, y=value, fill=sample_type_short, \n                      group=interaction(sample_type_short,sample))) +\n  geom_col(position = \"dodge\") +\n  scale_y_continuous(expand=c(0,0)) +\n  expand_limits(y=c(0,100)) +\n  scale_fill_st() + \n  facet_grid(metric~., scales = \"free\", space=\"free_x\", switch=\"y\") +\n  theme_xblank + theme(\n    axis.title.y = element_blank(),\n    strip.text.y = element_text(face=\"plain\")\n  )\ng_basic\n\n\n\n\n\n\n\n\nCode# Set up plotting templates\nscale_color_st <- purrr::partial(scale_color_brewer, palette=\"Set1\",\n                                   name=\"Sample Type\")\ng_qual_raw <- ggplot(mapping=aes(color=sample_type_short, linetype=read_pair, \n                         group=interaction(sample,read_pair))) + \n  scale_color_st() + scale_linetype_discrete(name = \"Read Pair\") +\n  guides(color=guide_legend(nrow=2,byrow=TRUE),\n         linetype = guide_legend(nrow=2,byrow=TRUE)) +\n  theme_base\n\n# Visualize adapters\ng_adapters_raw <- g_qual_raw + \n  geom_line(aes(x=position, y=pc_adapters), data=adapter_stats_raw) +\n  scale_y_continuous(name=\"% Adapters\", limits=c(0,NA),\n                     breaks = seq(0,100,1), expand=c(0,0)) +\n  scale_x_continuous(name=\"Position\", limits=c(0,NA),\n                     breaks=seq(0,500,20), expand=c(0,0)) +\n  facet_grid(.~adapter)\ng_adapters_raw\n\n\n\n\n\n\nCode# Visualize quality\ng_quality_base_raw <- g_qual_raw +\n  geom_hline(yintercept=25, linetype=\"dashed\", color=\"red\") +\n  geom_hline(yintercept=30, linetype=\"dashed\", color=\"red\") +\n  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats_raw) +\n  scale_y_continuous(name=\"Mean Phred 
score\", expand=c(0,0), limits=c(10,45)) +\n  scale_x_continuous(name=\"Position\", limits=c(0,NA),\n                     breaks=seq(0,500,20), expand=c(0,0))\ng_quality_base_raw\n\n\n\n\n\n\nCodeg_quality_seq_raw <- g_qual_raw +\n  geom_vline(xintercept=25, linetype=\"dashed\", color=\"red\") +\n  geom_vline(xintercept=30, linetype=\"dashed\", color=\"red\") +\n  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats_raw) +\n  scale_x_continuous(name=\"Mean Phred score\", expand=c(0,0)) +\n  scale_y_continuous(name=\"# Sequences\", expand=c(0,0))\ng_quality_seq_raw\n\n\n\n\n\n\n\nPreprocessing\nThe average fraction of reads lost at each stage in the preprocessing pipeline is shown in the following table. As expected given the observed difference in duplication levels, many more reads were lost during deduplication in liquid samples than sludge samples. Conversely, trimming and filtering consistently removed more reads in sludge than in liquid samples, though the effect was less dramatic than for deduplication. Very few reads were lost during ribodepletion, as expected for DNA sequencing libraries.\n\nCoden_reads_rel <- basic_stats %>% \n  select(sample, sample_type_short, date, stage, \n         percent_duplicates, n_read_pairs) %>%\n  group_by(sample) %>% arrange(sample, stage) %>%\n  mutate(p_reads_retained = replace_na(n_read_pairs / lag(n_read_pairs), 0),\n         p_reads_lost = 1 - p_reads_retained,\n         p_reads_retained_abs = n_read_pairs / n_read_pairs[1],\n         p_reads_lost_abs = 1-p_reads_retained_abs,\n         p_reads_lost_abs_marginal = replace_na(p_reads_lost_abs - lag(p_reads_lost_abs), 0))\nn_reads_rel_display <- n_reads_rel %>% \n  group_by(`Sample Type`=sample_type_short, Stage=stage) %>% \n  summarize(`% Total Reads Lost (Cumulative)` = paste0(round(min(p_reads_lost_abs*100),1), \"-\", round(max(p_reads_lost_abs*100),1), \" (mean \", round(mean(p_reads_lost_abs*100),1), \")\"),\n            `% Total Reads Lost (Marginal)` = paste0(round(min(p_reads_lost_abs_marginal*100),1), \"-\", round(max(p_reads_lost_abs_marginal*100),1), \" (mean \", round(mean(p_reads_lost_abs_marginal*100),1), \")\"), .groups=\"drop\") %>% \n  filter(Stage != \"raw_concat\") %>%\n  mutate(Stage = Stage %>% as.numeric %>% factor(labels=c(\"Trimming & filtering\", \"Deduplication\", \"Initial ribodepletion\", \"Secondary ribodepletion\")))\nn_reads_rel_display\n\n\n  \n\n\n\n\nCodeg_stage_base <- ggplot(mapping=aes(x=stage, color=sample_type_short, group=sample)) +\n  scale_color_st() +\n  theme_kit\n\n# Plot reads over preprocessing\ng_reads_stages <- g_stage_base +\n  geom_line(aes(y=n_read_pairs), data=basic_stats) +\n  scale_y_continuous(\"# Read pairs\", expand=c(0,0), limits=c(0,NA))\ng_reads_stages\n\n\n\n\n\n\nCode# Plot relative read losses during preprocessing\ng_reads_rel <- g_stage_base +\n  geom_line(aes(y=p_reads_lost_abs_marginal), data=n_reads_rel) +\n  scale_y_continuous(\"% Total Reads Lost\", expand=c(0,0), \n                     labels = function(x) x*100)\ng_reads_rel\n\n\n\n\n\n\n\nData cleaning was very successful at removing adapters and improving read qualities:\n\nCodeg_qual <- ggplot(mapping=aes(color=sample_type_short, linetype=read_pair, \n                         group=interaction(sample,read_pair))) + \n  scale_color_st() + scale_linetype_discrete(name = \"Read Pair\") +\n  guides(color=guide_legend(nrow=2,byrow=TRUE),\n         linetype = guide_legend(nrow=2,byrow=TRUE)) +\n  theme_base\n\n# Visualize adapters\ng_adapters <- g_qual + \n  
geom_line(aes(x=position, y=pc_adapters), data=adapter_stats) +\n  scale_y_continuous(name=\"% Adapters\", limits=c(0,20),\n                     breaks = seq(0,50,10), expand=c(0,0)) +\n  scale_x_continuous(name=\"Position\", limits=c(0,NA),\n                     breaks=seq(0,140,20), expand=c(0,0)) +\n  facet_grid(stage~adapter)\ng_adapters\n\n\n\n\n\n\nCode# Visualize quality\ng_quality_base <- g_qual +\n  geom_hline(yintercept=25, linetype=\"dashed\", color=\"red\") +\n  geom_hline(yintercept=30, linetype=\"dashed\", color=\"red\") +\n  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats) +\n  scale_y_continuous(name=\"Mean Phred score\", expand=c(0,0), limits=c(10,45)) +\n  scale_x_continuous(name=\"Position\", limits=c(0,NA),\n                     breaks=seq(0,140,20), expand=c(0,0)) +\n  facet_grid(stage~.)\ng_quality_base\n\n\n\n\n\n\nCodeg_quality_seq <- g_qual +\n  geom_vline(xintercept=25, linetype=\"dashed\", color=\"red\") +\n  geom_vline(xintercept=30, linetype=\"dashed\", color=\"red\") +\n  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats) +\n  scale_x_continuous(name=\"Mean Phred score\", expand=c(0,0)) +\n  scale_y_continuous(name=\"# Sequences\", expand=c(0,0)) +\n  facet_grid(stage~.)\ng_quality_seq\n\n\n\n\n\n\n\nAccording to FASTQC, cleaning + deduplication was very effective at reducing measured duplicate levels, which fell from an average of 31% to 6.5%:\n\nCodestage_dup <- basic_stats %>% group_by(stage) %>% \n  summarize(dmin = min(percent_duplicates), dmax=max(percent_duplicates),\n            dmean=mean(percent_duplicates), .groups = \"drop\")\n\ng_dup_stages <- g_stage_base +\n  geom_line(aes(y=percent_duplicates), data=basic_stats) +\n  scale_y_continuous(\"% Duplicates\", limits=c(0,NA), expand=c(0,0))\ng_dup_stages\n\n\n\n\n\n\nCodeg_readlen_stages <- g_stage_base + \n  geom_line(aes(y=mean_seq_len), data=basic_stats) +\n  scale_y_continuous(\"Mean read length (nt)\", expand=c(0,0), limits=c(0,NA))\ng_readlen_stages\n\n\n\n\n\n\n\nHigh-level composition\nAs before, to assess the high-level composition of the reads, I ran the ribodepleted files through Kraken (using the Standard 16 database) and summarized the results with Bracken. 
Combining these results with the read counts above gives us a breakdown of the inferred composition of the samples:\n\nCodeclassifications <- c(\"Filtered\", \"Duplicate\", \"Ribosomal\", \"Unassigned\",\n                     \"Bacterial\", \"Archaeal\", \"Viral\", \"Human\")\n\n# Import composition data\ncomp_path <- file.path(data_dir, \"taxonomic_composition.tsv.gz\")\ncomp <- read_tsv(comp_path, show_col_types = FALSE) %>%\n  left_join(libraries, by=\"sample\") %>%\n  mutate(classification = factor(classification, levels = classifications))\n  \n\n# Summarize composition\nread_comp_summ <- comp %>% \n  group_by(sample_type_short, classification) %>%\n  summarize(n_reads = sum(n_reads), .groups = \"drop_last\") %>%\n  mutate(n_reads = replace_na(n_reads,0),\n    p_reads = n_reads/sum(n_reads),\n    pc_reads = p_reads*100)\n\n\n\nCode# Prepare plotting templates\ng_comp_base <- ggplot(mapping=aes(x=sample, y=p_reads, fill=classification)) +\n  facet_wrap(~sample_type_short, scales = \"free_x\", ncol=3,\n             labeller = label_wrap_gen(multi_line=FALSE, width=20)) +\n  theme_xblank\nscale_y_pc_reads <- purrr::partial(scale_y_continuous, name = \"% Reads\",\n                                   expand = c(0,0), labels = function(y) y*100)\n\n# Plot overall composition\ng_comp <- g_comp_base + geom_col(data = comp, position = \"stack\", width=1) +\n  scale_y_pc_reads(limits = c(0,1.01), breaks = seq(0,1,0.2)) +\n  scale_fill_brewer(palette = \"Set1\", name = \"Classification\")\ng_comp\n\n\n\n\n\n\nCode# Plot composition of minor components\ncomp_minor <- comp %>% \n  filter(classification %in% c(\"Archaeal\", \"Viral\", \"Human\", \"Other\"))\npalette_minor <- brewer.pal(9, \"Set1\")[6:9]\ng_comp_minor <- g_comp_base + \n  geom_col(data=comp_minor, position = \"stack\", width=1) +\n  scale_y_pc_reads() +\n  scale_fill_manual(values=palette_minor, name = \"Classification\")\ng_comp_minor\n\n\n\n\n\n\n\n\nCodep_reads_summ_group <- comp %>%\n  mutate(classification = ifelse(classification %in% c(\"Filtered\", \"Duplicate\", \"Unassigned\"), \"Excluded\", as.character(classification)),\n         classification = fct_inorder(classification)) %>%\n  group_by(classification, sample, sample_type_short) %>%\n  summarize(p_reads = sum(p_reads), .groups = \"drop\") %>%\n  group_by(classification, sample_type_short) %>%\n  summarize(pc_min = min(p_reads)*100, pc_max = max(p_reads)*100, \n            pc_mean = mean(p_reads)*100, .groups = \"drop\")\np_reads_summ_prep <- p_reads_summ_group %>%\n  mutate(classification = fct_inorder(classification),\n         pc_min = pc_min %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),\n         pc_max = pc_max %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),\n         pc_mean = pc_mean %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),\n         display = paste0(pc_min, \"-\", pc_max, \"% (mean \", pc_mean, \"%)\"))\np_reads_summ <- p_reads_summ_prep %>%\n  select(`Sample Type`=sample_type_short, Classification=classification, \n         `Read Fraction`=display) %>%\n  arrange(`Sample Type`, Classification)\np_reads_summ\n\n\n  \n\n\n\nAs in previous DNA datasets, the vast majority of classified reads were bacterial in origin. The fraction of virus reads varied substantially between sample types, averaging <0.01% in influent and final effluent but closer to 0.05% in other sample types. 
Interestingly (though not particularly relevantly for this analysis), the fraction of archaeal reads was much higher in influent than other sample types, in contrast to Bengtsson-Palme where it was highest in slidge.\nAs is common for DNA data, viral reads were overwhelmingly dominated by Caudoviricetes phages, though one wet-well sample contained a substantial fraction of Alsuviricetes (a class of mainly plant pathogens that includes Virgaviridae):\n\nCode# Get Kraken reports\nreports_path <- file.path(data_dir, \"kraken_reports.tsv.gz\")\nreports <- read_tsv(reports_path, show_col_types = FALSE)\n\n# Get viral taxonomy\nviral_taxa_path <- file.path(data_dir, \"viral-taxids.tsv.gz\")\nviral_taxa <- read_tsv(viral_taxa_path, show_col_types = FALSE)\n\n# Filter to viral taxa\nkraken_reports_viral <- filter(reports, taxid %in% viral_taxa$taxid) %>%\n  group_by(sample) %>%\n  mutate(p_reads_viral = n_reads_clade/n_reads_clade[1])\nkraken_reports_viral_cleaned <- kraken_reports_viral %>%\n  inner_join(libraries, by=\"sample\") %>%\n  select(-pc_reads_total, -n_reads_direct, -contains(\"minimizers\")) %>%\n  select(name, taxid, p_reads_viral, n_reads_clade, everything())\n\nviral_classes <- kraken_reports_viral_cleaned %>% filter(rank == \"C\")\nviral_families <- kraken_reports_viral_cleaned %>% filter(rank == \"F\")\n\n\n\nCodemajor_threshold <- 0.02\n\n# Identify major viral classes\nviral_classes_major_tab <- viral_classes %>% \n  group_by(name, taxid) %>%\n  summarize(p_reads_viral_max = max(p_reads_viral), .groups=\"drop\") %>%\n  filter(p_reads_viral_max >= major_threshold)\nviral_classes_major_list <- viral_classes_major_tab %>% pull(name)\nviral_classes_major <- viral_classes %>% \n  filter(name %in% viral_classes_major_list) %>%\n  select(name, taxid, sample, sample_type_short, date, p_reads_viral)\nviral_classes_minor <- viral_classes_major %>% \n  group_by(sample, sample_type_short, date) %>%\n  summarize(p_reads_viral_major = sum(p_reads_viral), .groups = \"drop\") %>%\n  mutate(name = \"Other\", taxid=NA, p_reads_viral = 1-p_reads_viral_major) %>%\n  select(name, taxid, sample, sample_type_short, date, p_reads_viral)\nviral_classes_display <- bind_rows(viral_classes_major, viral_classes_minor) %>%\n  arrange(desc(p_reads_viral)) %>% \n  mutate(name = factor(name, levels=c(viral_classes_major_list, \"Other\")),\n         p_reads_viral = pmax(p_reads_viral, 0)) %>%\n  rename(p_reads = p_reads_viral, classification=name)\n\npalette_viral <- c(brewer.pal(12, \"Set3\"), brewer.pal(8, \"Dark2\"))\ng_classes <- g_comp_base + \n  geom_col(data=viral_classes_display, position = \"stack\", width=1) +\n  scale_y_continuous(name=\"% Viral Reads\", limits=c(0,1.01), breaks = seq(0,1,0.2),\n                     expand=c(0,0), labels = function(y) y*100) +\n  scale_fill_manual(values=palette_viral, name = \"Viral class\")\n  \ng_classes\n\n\n\n\n\n\n\nHuman-infecting virus reads: validation\nNext, I investigated the human-infecting virus read content of these unenriched samples. 
A grand total of 527 reads were identified as putatively human-viral, with half of samples showing 5 or fewer total HV read pairs.\n\nCode# Import HV read data\nhv_reads_filtered_path <- file.path(data_dir, \"hv_hits_putative_filtered.tsv.gz\")\nhv_reads_filtered <- lapply(hv_reads_filtered_path, read_tsv,\n                            show_col_types = FALSE) %>%\n  bind_rows() %>%\n  left_join(libraries, by=\"sample\")\n\n# Count reads\nn_hv_filtered <- hv_reads_filtered %>%\n  group_by(sample, date, sample_type_short, seq_id) %>% count %>%\n  group_by(sample, date, sample_type_short) %>% count %>% \n  inner_join(basic_stats %>% filter(stage == \"ribo_initial\") %>% \n               select(sample, n_read_pairs), by=\"sample\") %>% \n  rename(n_putative = n, n_total = n_read_pairs) %>% \n  mutate(p_reads = n_putative/n_total, pc_reads = p_reads * 100)\nn_hv_filtered_summ <- n_hv_filtered %>% ungroup %>%\n  summarize(n_putative = sum(n_putative), n_total = sum(n_total), \n            .groups=\"drop\") %>% \n  mutate(p_reads = n_putative/n_total, pc_reads = p_reads*100)\n\n\n\nCode# Collapse multi-entry sequences\nrmax <- purrr::partial(max, na.rm = TRUE)\ncollapse <- function(x) ifelse(all(x == x[1]), x[1], paste(x, collapse=\"/\"))\nmrg <- hv_reads_filtered %>% \n  mutate(adj_score_max = pmax(adj_score_fwd, adj_score_rev, na.rm = TRUE)) %>%\n  arrange(desc(adj_score_max)) %>%\n  group_by(seq_id) %>%\n  summarize(sample = collapse(sample),\n            genome_id = collapse(genome_id),\n            taxid_best = taxid[1],\n            taxid = collapse(as.character(taxid)),\n            best_alignment_score_fwd = rmax(best_alignment_score_fwd),\n            best_alignment_score_rev = rmax(best_alignment_score_rev),\n            query_len_fwd = rmax(query_len_fwd),\n            query_len_rev = rmax(query_len_rev),\n            query_seq_fwd = query_seq_fwd[!is.na(query_seq_fwd)][1],\n            query_seq_rev = query_seq_rev[!is.na(query_seq_rev)][1],\n            classified = rmax(classified),\n            assigned_name = collapse(assigned_name),\n            assigned_taxid_best = assigned_taxid[1],\n            assigned_taxid = collapse(as.character(assigned_taxid)),\n            assigned_hv = rmax(assigned_hv),\n            hit_hv = rmax(hit_hv),\n            encoded_hits = collapse(encoded_hits),\n            adj_score_fwd = rmax(adj_score_fwd),\n            adj_score_rev = rmax(adj_score_rev)\n            ) %>%\n  inner_join(libraries, by=\"sample\") %>%\n  mutate(kraken_label = ifelse(assigned_hv, \"Kraken2 HV\\nassignment\",\n                               ifelse(hit_hv, \"Kraken2 HV\\nhit\",\n                                      \"No hit or\\nassignment\"))) %>%\n  mutate(adj_score_max = pmax(adj_score_fwd, adj_score_rev),\n         highscore = adj_score_max >= 20)\n\n# Plot results\ngeom_vhist <- purrr::partial(geom_histogram, binwidth=5, boundary=0)\ng_vhist_base <- ggplot(mapping=aes(x=adj_score_max)) +\n  geom_vline(xintercept=20, linetype=\"dashed\", color=\"red\") +\n  facet_wrap(~kraken_label, labeller = labeller(kit = label_wrap_gen(20)), scales = \"free_y\") +\n  scale_x_continuous(name = \"Maximum adjusted alignment score\") + \n  scale_y_continuous(name=\"# Read pairs\") + \n  theme_base \ng_vhist_0 <- g_vhist_base + geom_vhist(data=mrg)\ng_vhist_0\n\n\n\n\n\n\n\nBLASTing these reads against nt, we find that the pipeline performs well, with only a single high-scoring false-positive read:\n\nCode# Import paired BLAST results\nblast_paired_path <- file.path(data_dir, 
\"hv_hits_blast_paired.tsv.gz\")\nblast_paired <- read_tsv(blast_paired_path, show_col_types = FALSE)\n\n# Add viral status\nblast_viral <- mutate(blast_paired, viral = staxid %in% viral_taxa$taxid) %>%\n  mutate(viral_full = viral & n_reads == 2)\n\n# Compare to Kraken & Bowtie assignments\nmatch_taxid <- function(taxid_1, taxid_2){\n  p1 <- mapply(grepl, paste0(\"/\", taxid_1, \"$\"), taxid_2)\n  p2 <- mapply(grepl, paste0(\"^\", taxid_1, \"/\"), taxid_2)\n  p3 <- mapply(grepl, paste0(\"^\", taxid_1, \"$\"), taxid_2)\n  out <- setNames(p1|p2|p3, NULL)\n  return(out)\n}\nmrg_assign <- mrg %>% select(sample, seq_id, taxid, assigned_taxid, adj_score_max)\nblast_assign <- inner_join(blast_viral, mrg_assign, by=\"seq_id\") %>%\n    mutate(taxid_match_bowtie = match_taxid(staxid, taxid),\n           taxid_match_kraken = match_taxid(staxid, assigned_taxid),\n           taxid_match_any = taxid_match_bowtie | taxid_match_kraken)\nblast_out <- blast_assign %>%\n  group_by(seq_id) %>%\n  summarize(viral_status = ifelse(any(viral_full), 2,\n                                  ifelse(any(taxid_match_any), 2,\n                                             ifelse(any(viral), 1, 0))),\n            .groups = \"drop\")\n\n\n\nCode# Merge BLAST results with unenriched read data\nmrg_blast <- full_join(mrg, blast_out, by=\"seq_id\") %>%\n  mutate(viral_status = replace_na(viral_status, 0),\n         viral_status_out = ifelse(viral_status == 0, FALSE, TRUE))\n\n# Plot\ng_vhist_1 <- g_vhist_base + geom_vhist(data=mrg_blast, mapping=aes(fill=viral_status_out)) +\n  scale_fill_brewer(palette = \"Set1\", name = \"Viral status\")\ng_vhist_1\n\n\n\n\n\n\n\nMy usual disjunctive score threshold of 20 gave precision, sensitivity, and F1 scores all >97%:\n\nCodetest_sens_spec <- function(tab, score_threshold){\n  tab_retained <- tab %>% \n    mutate(retain_score = (adj_score_fwd > score_threshold | adj_score_rev > score_threshold),\n           retain = assigned_hv | retain_score) %>%\n    group_by(viral_status_out, retain) %>% count\n  pos_tru <- tab_retained %>% filter(viral_status_out == \"TRUE\", retain) %>% pull(n) %>% sum\n  pos_fls <- tab_retained %>% filter(viral_status_out != \"TRUE\", retain) %>% pull(n) %>% sum\n  neg_tru <- tab_retained %>% filter(viral_status_out != \"TRUE\", !retain) %>% pull(n) %>% sum\n  neg_fls <- tab_retained %>% filter(viral_status_out == \"TRUE\", !retain) %>% pull(n) %>% sum\n  sensitivity <- pos_tru / (pos_tru + neg_fls)\n  specificity <- neg_tru / (neg_tru + pos_fls)\n  precision   <- pos_tru / (pos_tru + pos_fls)\n  f1 <- 2 * precision * sensitivity / (precision + sensitivity)\n  out <- tibble(threshold=score_threshold, sensitivity=sensitivity, \n                specificity=specificity, precision=precision, f1=f1)\n  return(out)\n}\nrange_f1 <- function(intab, inrange=15:45){\n  tss <- purrr::partial(test_sens_spec, tab=intab)\n  stats <- lapply(inrange, tss) %>% bind_rows %>%\n    pivot_longer(!threshold, names_to=\"metric\", values_to=\"value\")\n  return(stats)\n}\nstats_0 <- range_f1(mrg_blast)\ng_stats_0 <- ggplot(stats_0, aes(x=threshold, y=value, color=metric)) +\n  geom_vline(xintercept=20, color = \"red\", linetype = \"dashed\") +\n  geom_line() +\n  scale_y_continuous(name = \"Value\", limits=c(0,1), breaks = seq(0,1,0.2), expand = c(0,0)) +\n  scale_x_continuous(name = \"Adjusted Score Threshold\", expand = c(0,0)) +\n  scale_color_brewer(palette=\"Dark2\") +\n  theme_base\ng_stats_0\n\n\n\n\n\n\nCodestats_0 %>% filter(threshold == 20) %>% \n  select(Threshold=threshold, 
Metric=metric, Value=value)\n\n\n  \n\n\n\nHuman-infecting viruses: overall relative abundance\n\nCode# Get raw read counts\nread_counts_raw <- basic_stats_raw %>%\n  select(sample, sample_type_short, date, n_reads_raw = n_read_pairs)\n\n# Get HV read counts\nmrg_hv <- mrg %>% mutate(hv_status = assigned_hv | highscore) %>%\n  rename(taxid_all = taxid, taxid = taxid_best)\nread_counts_hv <- mrg_hv %>% filter(hv_status) %>% group_by(sample) %>% \n  count(name=\"n_reads_hv\")\nread_counts <- read_counts_raw %>% left_join(read_counts_hv, by=\"sample\") %>%\n  mutate(n_reads_hv = replace_na(n_reads_hv, 0))\n\n# Aggregate\nread_counts_grp <- read_counts %>% group_by(date, sample_type_short) %>%\n  summarize(n_reads_raw = sum(n_reads_raw),\n            n_reads_hv = sum(n_reads_hv), .groups=\"drop\") %>%\n  mutate(sample= \"All samples\")\nread_counts_st <- read_counts_grp %>% group_by(sample, sample_type_short) %>%\n  summarize(n_reads_raw = sum(n_reads_raw),\n            n_reads_hv = sum(n_reads_hv), .groups=\"drop\") %>%\n  mutate(date = \"All dates\")\nread_counts_date <- read_counts_grp %>%\n  group_by(sample, date) %>%\n  summarize(n_reads_raw = sum(n_reads_raw),\n            n_reads_hv = sum(n_reads_hv), .groups=\"drop\") %>%\n  mutate(sample_type_short = \"All sample types\")\nread_counts_tot <- read_counts_date %>% group_by(sample, sample_type_short) %>%\n  summarize(n_reads_raw = sum(n_reads_raw),\n            n_reads_hv = sum(n_reads_hv), .groups=\"drop\") %>%\n  mutate(date = \"All dates\")\nread_counts_agg <- bind_rows(read_counts_grp, read_counts_st,\n                             read_counts_date, read_counts_tot) %>%\n  mutate(p_reads_hv = n_reads_hv/n_reads_raw,\n         date = factor(date, levels = c(levels(libraries$date), \"All dates\")),\n         sample_type_short = factor(sample_type_short, levels = c(levels(libraries$sample_type_short), \"All sample types\")))\n\n\nApplying a disjunctive cutoff at S=20 identifies 482 read pairs as human-viral. 
This gives an overall relative HV abundance of \\(2.90 \\times 10^{-7}\\); on the low end across all datasets I’ve analyzed, though higher than for Bengtsson-Palme:\n\nCode# Visualize\ng_phv_agg <- ggplot(read_counts_agg, aes(x=date, color=sample_type_short)) +\n  geom_point(aes(y=p_reads_hv)) +\n  scale_y_log10(\"Relative abundance of human virus reads\") +\n  scale_color_st() + theme_kit\ng_phv_agg\n\n\n\n\n\n\n\n\nCode# Collate past RA values\nra_past <- tribble(~dataset, ~ra, ~na_type, ~panel_enriched,\n                   \"Brumfield\", 5e-5, \"RNA\", FALSE,\n                   \"Brumfield\", 3.66e-7, \"DNA\", FALSE,\n                   \"Spurbeck\", 5.44e-6, \"RNA\", FALSE,\n                   \"Yang\", 3.62e-4, \"RNA\", FALSE,\n                   \"Rothman (unenriched)\", 1.87e-5, \"RNA\", FALSE,\n                   \"Rothman (panel-enriched)\", 3.3e-5, \"RNA\", TRUE,\n                   \"Crits-Christoph (unenriched)\", 1.37e-5, \"RNA\", FALSE,\n                   \"Crits-Christoph (panel-enriched)\", 1.26e-2, \"RNA\", TRUE,\n                   \"Prussin (non-control)\", 1.63e-5, \"RNA\", FALSE,\n                   \"Prussin (non-control)\", 4.16e-5, \"DNA\", FALSE,\n                   \"Rosario (non-control)\", 1.21e-5, \"RNA\", FALSE,\n                   \"Rosario (non-control)\", 1.50e-4, \"DNA\", FALSE,\n                   \"Leung\", 1.73e-5, \"DNA\", FALSE,\n                   \"Brinch\", 3.88e-6, \"DNA\", FALSE,\n                   \"Bengtsson-Palme\", 8.86e-8, \"DNA\", FALSE\n)\n\n# Collate new RA values\nra_new <- tribble(~dataset, ~ra, ~na_type, ~panel_enriched,\n                  \"Ng\", 2.90e-7, \"DNA\", FALSE)\n\n\n# Plot\nscale_color_na <- purrr::partial(scale_color_brewer, palette=\"Set1\",\n                                 name=\"Nucleic acid type\")\nra_comp <- bind_rows(ra_past, ra_new) %>% mutate(dataset = fct_inorder(dataset))\ng_ra_comp <- ggplot(ra_comp, aes(y=dataset, x=ra, color=na_type)) +\n  geom_point() +\n  scale_color_na() +\n  scale_x_log10(name=\"Relative abundance of human virus reads\") +\n  theme_base + theme(axis.title.y = element_blank())\ng_ra_comp\n\n\n\n\n\n\n\nHuman-infecting viruses: taxonomy and composition\nIn investigating the taxonomy of human-infecting virus reads, I restricted my analysis to samples with more than 5 HV read pairs total across all viruses, to reduce noise arising from extremely low HV read counts in some samples. 
13 samples met this criterion.\nAt the family level, most samples were overwhelmingly dominated by Adenoviridae, with Picornaviridae, Polyomaviridae and Papillomaviridae making up most of the rest:\n\nCode# Get viral taxon names for putative HV reads\nviral_taxa$name[viral_taxa$taxid == 249588] <- \"Mamastrovirus\"\nviral_taxa$name[viral_taxa$taxid == 194960] <- \"Kobuvirus\"\nviral_taxa$name[viral_taxa$taxid == 688449] <- \"Salivirus\"\nviral_taxa$name[viral_taxa$taxid == 585893] <- \"Picobirnaviridae\"\nviral_taxa$name[viral_taxa$taxid == 333922] <- \"Betapapillomavirus\"\nviral_taxa$name[viral_taxa$taxid == 334207] <- \"Betapapillomavirus 3\"\nviral_taxa$name[viral_taxa$taxid == 369960] <- \"Porcine type-C oncovirus\"\nviral_taxa$name[viral_taxa$taxid == 333924] <- \"Betapapillomavirus 2\"\nviral_taxa$name[viral_taxa$taxid == 687329] <- \"Anelloviridae\"\nviral_taxa$name[viral_taxa$taxid == 325455] <- \"Gammapapillomavirus\"\nviral_taxa$name[viral_taxa$taxid == 333750] <- \"Alphapapillomavirus\"\nviral_taxa$name[viral_taxa$taxid == 694002] <- \"Betacoronavirus\"\nviral_taxa$name[viral_taxa$taxid == 334202] <- \"Mupapillomavirus\"\nviral_taxa$name[viral_taxa$taxid == 197911] <- \"Alphainfluenzavirus\"\nviral_taxa$name[viral_taxa$taxid == 186938] <- \"Respirovirus\"\nviral_taxa$name[viral_taxa$taxid == 333926] <- \"Gammapapillomavirus 1\"\nviral_taxa$name[viral_taxa$taxid == 337051] <- \"Betapapillomavirus 1\"\nviral_taxa$name[viral_taxa$taxid == 337043] <- \"Alphapapillomavirus 4\"\nviral_taxa$name[viral_taxa$taxid == 694003] <- \"Betacoronavirus 1\"\nviral_taxa$name[viral_taxa$taxid == 334204] <- \"Mupapillomavirus 2\"\nviral_taxa$name[viral_taxa$taxid == 334208] <- \"Betapapillomavirus 4\"\nviral_taxa$name[viral_taxa$taxid == 333928] <- \"Gammapapillomavirus 2\"\nviral_taxa$name[viral_taxa$taxid == 337039] <- \"Alphapapillomavirus 2\"\nviral_taxa$name[viral_taxa$taxid == 333929] <- \"Gammapapillomavirus 3\"\nviral_taxa$name[viral_taxa$taxid == 337042] <- \"Alphapapillomavirus 7\"\nviral_taxa$name[viral_taxa$taxid == 334203] <- \"Mupapillomavirus 1\"\nviral_taxa$name[viral_taxa$taxid == 333757] <- \"Alphapapillomavirus 8\"\nviral_taxa$name[viral_taxa$taxid == 337050] <- \"Alphapapillomavirus 6\"\nviral_taxa$name[viral_taxa$taxid == 333767] <- \"Alphapapillomavirus 3\"\nviral_taxa$name[viral_taxa$taxid == 333754] <- \"Alphapapillomavirus 10\"\nviral_taxa$name[viral_taxa$taxid == 687363] <- \"Torque teno virus 24\"\nviral_taxa$name[viral_taxa$taxid == 687342] <- \"Torque teno virus 3\"\nviral_taxa$name[viral_taxa$taxid == 687359] <- \"Torque teno virus 20\"\nviral_taxa$name[viral_taxa$taxid == 194441] <- \"Primate T-lymphotropic virus 2\"\nviral_taxa$name[viral_taxa$taxid == 334209] <- \"Betapapillomavirus 5\"\nviral_taxa$name[viral_taxa$taxid == 194965] <- \"Aichivirus B\"\nviral_taxa$name[viral_taxa$taxid == 333930] <- \"Gammapapillomavirus 4\"\nviral_taxa$name[viral_taxa$taxid == 337048] <- \"Alphapapillomavirus 1\"\nviral_taxa$name[viral_taxa$taxid == 337041] <- \"Alphapapillomavirus 9\"\nviral_taxa$name[viral_taxa$taxid == 337049] <- \"Alphapapillomavirus 11\"\nviral_taxa$name[viral_taxa$taxid == 337044] <- \"Alphapapillomavirus 5\"\n\n# Filter samples and add viral taxa information\nsamples_keep <- read_counts %>% filter(n_reads_hv > 5) %>% pull(sample)\nmrg_hv_named <- mrg_hv %>% filter(sample %in% samples_keep, hv_status) %>% left_join(viral_taxa, by=\"taxid\") \n\n# Discover viral species & genera for HV reads\nraise_rank <- function(read_db, taxid_db, out_rank = \"species\", 
verbose = FALSE){\n  # Get higher ranks than search rank\n  ranks <- c(\"subspecies\", \"species\", \"subgenus\", \"genus\", \"subfamily\", \"family\", \"suborder\", \"order\", \"class\", \"subphylum\", \"phylum\", \"kingdom\", \"superkingdom\")\n  rank_match <- which.max(ranks == out_rank)\n  high_ranks <- ranks[rank_match:length(ranks)]\n  # Merge read DB and taxid DB\n  reads <- read_db %>% select(-parent_taxid, -rank, -name) %>%\n    left_join(taxid_db, by=\"taxid\")\n  # Extract sequences that are already at appropriate rank\n  reads_rank <- filter(reads, rank == out_rank)\n  # Drop sequences at a higher rank and return unclassified sequences\n  reads_norank <- reads %>% filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))\n  while(nrow(reads_norank) > 0){ # As long as there are unclassified sequences...\n    # Promote read taxids and re-merge with taxid DB, then re-classify and filter\n    reads_remaining <- reads_norank %>% mutate(taxid = parent_taxid) %>%\n      select(-parent_taxid, -rank, -name) %>%\n      left_join(taxid_db, by=\"taxid\")\n    reads_rank <- reads_remaining %>% filter(rank == out_rank) %>%\n      bind_rows(reads_rank)\n    reads_norank <- reads_remaining %>%\n      filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))\n  }\n  # Finally, extract and append reads that were excluded during the process\n  reads_dropped <- reads %>% filter(!seq_id %in% reads_rank$seq_id)\n  reads_out <- reads_rank %>% bind_rows(reads_dropped) %>%\n    select(-parent_taxid, -rank, -name) %>%\n    left_join(taxid_db, by=\"taxid\")\n  return(reads_out)\n}\nhv_reads_species <- raise_rank(mrg_hv_named, viral_taxa, \"species\")\nhv_reads_genus <- raise_rank(mrg_hv_named, viral_taxa, \"genus\")\nhv_reads_family <- raise_rank(mrg_hv_named, viral_taxa, \"family\")\n\n\n\nCodethreshold_major_family <- 0.02\n\n# Count reads for each human-viral family\nhv_family_counts <- hv_reads_family %>% \n  group_by(sample, date, sample_type_short, name, taxid) %>%\n  count(name = \"n_reads_hv\") %>%\n  group_by(sample, date, sample_type_short) %>%\n  mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking families and group others\nhv_family_major_tab <- hv_family_counts %>% group_by(name) %>% \n  filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%\n  arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > threshold_major_family)\nhv_family_counts_major <- hv_family_counts %>%\n  mutate(name_display = ifelse(name %in% hv_family_major_tab$name, name, \"Other\")) %>%\n  group_by(sample, date, sample_type_short, name_display) %>%\n  summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), \n            .groups=\"drop\") %>%\n  mutate(name_display = factor(name_display, \n                               levels = c(hv_family_major_tab$name, \"Other\")))\nhv_family_counts_display <- hv_family_counts_major %>%\n  rename(p_reads = p_reads_hv, classification = name_display)\n\n# Plot\ng_hv_family <- g_comp_base + \n  geom_col(data=hv_family_counts_display, position = \"stack\") +\n  scale_y_continuous(name=\"% HV Reads\", limits=c(0,1.01), \n                     breaks = seq(0,1,0.2),\n                     expand=c(0,0), labels = function(y) y*100) +\n  scale_fill_manual(values=palette_viral, name = \"Viral family\") +\n  labs(title=\"Family composition of human-viral reads\") +\n  guides(fill=guide_legend(ncol=4)) +\n  theme(plot.title = element_text(size=rel(1.4), hjust=0, face=\"plain\"))\ng_hv_family\n\n\n\n\n\n\nCode# Get most prominent families 
for text\nhv_family_collate <- hv_family_counts %>% group_by(name, taxid) %>% \n  summarize(n_reads_tot = sum(n_reads_hv),\n            p_reads_max = max(p_reads_hv), .groups=\"drop\") %>% \n  arrange(desc(n_reads_tot))\n\n\nIn investigating individual viral families, to avoid distortions from a few rare reads, I restricted myself to samples where that family made up at least 10% of human-viral reads:\n\nCodethreshold_major_species <- 0.05\ntaxid_adeno <- 10508\n\n# Get set of adenoviridae reads\nadeno_samples <- hv_family_counts %>% filter(taxid == taxid_adeno) %>%\n  filter(p_reads_hv >= 0.1) %>%\n  pull(sample)\nadeno_ids <- hv_reads_family %>% \n  filter(taxid == taxid_adeno, sample %in% adeno_samples) %>%\n  pull(seq_id)\n\n# Count reads for each adenoviridae species\nadeno_species_counts <- hv_reads_species %>%\n  filter(seq_id %in% adeno_ids) %>%\n  group_by(sample, date, sample_type_short, name, taxid) %>%\n  count(name = \"n_reads_hv\") %>%\n  group_by(sample, date, sample_type_short) %>%\n  mutate(p_reads_adeno = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking families and group others\nadeno_species_major_tab <- adeno_species_counts %>% group_by(name) %>% \n  filter(p_reads_adeno == max(p_reads_adeno)) %>% \n  filter(row_number() == 1) %>%\n  arrange(desc(p_reads_adeno)) %>% \n  filter(p_reads_adeno > threshold_major_species)\nadeno_species_counts_major <- adeno_species_counts %>%\n  mutate(name_display = ifelse(name %in% adeno_species_major_tab$name, \n                               name, \"Other\")) %>%\n  group_by(sample, date, sample_type_short, name_display) %>%\n  summarize(n_reads_adeno = sum(n_reads_hv),\n            p_reads_adeno = sum(p_reads_adeno), \n            .groups=\"drop\") %>%\n  mutate(name_display = factor(name_display, \n                               levels = c(adeno_species_major_tab$name, \"Other\")))\nadeno_species_counts_display <- adeno_species_counts_major %>%\n  rename(p_reads = p_reads_adeno, classification = name_display)\n\n# Plot\ng_adeno_species <- g_comp_base + \n  geom_col(data=adeno_species_counts_display, position = \"stack\") +\n  scale_y_continuous(name=\"% Adenoviridae Reads\", limits=c(0,1.01), \n                     breaks = seq(0,1,0.2),\n                     expand=c(0,0), labels = function(y) y*100) +\n  scale_fill_manual(values=palette_viral, name = \"Viral species\") +\n  labs(title=\"Species composition of Adenoviridae reads\") +\n  guides(fill=guide_legend(ncol=3)) +\n  theme(plot.title = element_text(size=rel(1.4), hjust=0, face=\"plain\"))\n\ng_adeno_species\n\n\n\n\n\n\nCode# Get most prominent species for text\nadeno_species_collate <- adeno_species_counts %>% group_by(name, taxid) %>% \n  summarize(n_reads_tot = sum(n_reads_hv), p_reads_mean = mean(p_reads_adeno), .groups=\"drop\") %>% \n  arrange(desc(n_reads_tot))\n\n\n\nCodethreshold_major_species <- 0.1\ntaxid_picorna <- 12058\n\n# Get set of picornaviridae reads\npicorna_samples <- hv_family_counts %>% filter(taxid == taxid_picorna) %>%\n  filter(p_reads_hv >= 0.1) %>%\n  pull(sample)\npicorna_ids <- hv_reads_family %>% \n  filter(taxid == taxid_picorna, sample %in% picorna_samples) %>%\n  pull(seq_id)\n\n# Count reads for each picornaviridae species\npicorna_species_counts <- hv_reads_species %>%\n  filter(seq_id %in% picorna_ids) %>%\n  group_by(sample, date, sample_type_short, name, taxid) %>%\n  count(name = \"n_reads_hv\") %>%\n  group_by(sample, date, sample_type_short) %>%\n  mutate(p_reads_picorna = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking 
families and group others\npicorna_species_major_tab <- picorna_species_counts %>% group_by(name) %>% \n  filter(p_reads_picorna == max(p_reads_picorna)) %>% \n  filter(row_number() == 1) %>%\n  arrange(desc(p_reads_picorna)) %>% \n  filter(p_reads_picorna > threshold_major_species)\npicorna_species_counts_major <- picorna_species_counts %>%\n  mutate(name_display = ifelse(name %in% picorna_species_major_tab$name, \n                               name, \"Other\")) %>%\n  group_by(sample, date, sample_type_short, name_display) %>%\n  summarize(n_reads_picorna = sum(n_reads_hv),\n            p_reads_picorna = sum(p_reads_picorna), \n            .groups=\"drop\") %>%\n  mutate(name_display = factor(name_display, \n                               levels = c(picorna_species_major_tab$name, \"Other\")))\npicorna_species_counts_display <- picorna_species_counts_major %>%\n  rename(p_reads = p_reads_picorna, classification = name_display)\n\n# Plot\ng_picorna_species <- g_comp_base + \n  geom_col(data=picorna_species_counts_display, position = \"stack\") +\n  scale_y_continuous(name=\"% Picornaviridae Reads\", limits=c(0,1.01), \n                     breaks = seq(0,1,0.2),\n                     expand=c(0,0), labels = function(y) y*100) +\n  scale_fill_manual(values=palette_viral, name = \"Viral species\") +\n  labs(title=\"Species composition of Picornaviridae reads\") +\n  guides(fill=guide_legend(ncol=3)) +\n  theme(plot.title = element_text(size=rel(1.4), hjust=0, face=\"plain\"))\n\ng_picorna_species\n\n\n\n\n\n\nCode# Get most prominent species for text\npicorna_species_collate <- picorna_species_counts %>% group_by(name, taxid) %>% \n  summarize(n_reads_tot = sum(n_reads_hv), p_reads_mean = mean(p_reads_picorna), .groups=\"drop\") %>% \n  arrange(desc(n_reads_tot))\n\n\n\nCodethreshold_major_species <- 0.1\ntaxid_polyoma <- 151341\n\n# Get set of polyomaviridae reads\npolyoma_samples <- hv_family_counts %>% filter(taxid == taxid_polyoma) %>%\n  filter(p_reads_hv >= 0.1) %>%\n  pull(sample)\npolyoma_ids <- hv_reads_family %>% \n  filter(taxid == taxid_polyoma, sample %in% polyoma_samples) %>%\n  pull(seq_id)\n\n# Count reads for each polyomaviridae species\npolyoma_species_counts <- hv_reads_species %>%\n  filter(seq_id %in% polyoma_ids) %>%\n  group_by(sample, date, sample_type_short, name, taxid) %>%\n  count(name = \"n_reads_hv\") %>%\n  group_by(sample, date, sample_type_short) %>%\n  mutate(p_reads_polyoma = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking families and group others\npolyoma_species_major_tab <- polyoma_species_counts %>% group_by(name) %>% \n  filter(p_reads_polyoma == max(p_reads_polyoma)) %>% \n  filter(row_number() == 1) %>%\n  arrange(desc(p_reads_polyoma)) %>% \n  filter(p_reads_polyoma > threshold_major_species)\npolyoma_species_counts_major <- polyoma_species_counts %>%\n  mutate(name_display = ifelse(name %in% polyoma_species_major_tab$name, \n                               name, \"Other\")) %>%\n  group_by(sample, date, sample_type_short, name_display) %>%\n  summarize(n_reads_polyoma = sum(n_reads_hv),\n            p_reads_polyoma = sum(p_reads_polyoma), \n            .groups=\"drop\") %>%\n  mutate(name_display = factor(name_display, \n                               levels = c(polyoma_species_major_tab$name, \"Other\")))\npolyoma_species_counts_display <- polyoma_species_counts_major %>%\n  rename(p_reads = p_reads_polyoma, classification = name_display)\n\n# Plot\ng_polyoma_species <- g_comp_base + \n  geom_col(data=polyoma_species_counts_display, 
position = \"stack\") +\n  scale_y_continuous(name=\"% Polyomaviridae Reads\", limits=c(0,1.01), \n                     breaks = seq(0,1,0.2),\n                     expand=c(0,0), labels = function(y) y*100) +\n  scale_fill_manual(values=palette_viral, name = \"Viral species\") +\n  labs(title=\"Species composition of Polyomaviridae reads\") +\n  guides(fill=guide_legend(ncol=3)) +\n  theme(plot.title = element_text(size=rel(1.4), hjust=0, face=\"plain\"))\n\ng_polyoma_species\n\n\n\n\n\n\nCode# Get most prominent species for text\npolyoma_species_collate <- polyoma_species_counts %>% group_by(name, taxid) %>% \n  summarize(n_reads_tot = sum(n_reads_hv), p_reads_mean = mean(p_reads_polyoma), .groups=\"drop\") %>% \n  arrange(desc(n_reads_tot))\n\n\nFinally, here again are the overall relative abundances of the specific viral genera I picked out manually in my last entry:\n\nCode# Define reference genera\npath_genera_rna <- c(\"Mamastrovirus\", \"Enterovirus\", \"Salivirus\", \"Kobuvirus\", \"Norovirus\", \"Sapovirus\", \"Rotavirus\", \"Alphacoronavirus\", \"Betacoronavirus\", \"Alphainfluenzavirus\", \"Betainfluenzavirus\", \"Lentivirus\")\npath_genera_dna <- c(\"Mastadenovirus\", \"Alphapolyomavirus\", \"Betapolyomavirus\", \"Alphapapillomavirus\", \"Betapapillomavirus\", \"Gammapapillomavirus\", \"Orthopoxvirus\", \"Simplexvirus\",\n                     \"Lymphocryptovirus\", \"Cytomegalovirus\", \"Dependoparvovirus\")\npath_genera <- bind_rows(tibble(name=path_genera_rna, genome_type=\"RNA genome\"),\n                         tibble(name=path_genera_dna, genome_type=\"DNA genome\")) %>%\n  left_join(viral_taxa, by=\"name\")\n\n# Count in each sample\nmrg_hv_named_all <- mrg_hv %>% left_join(viral_taxa, by=\"taxid\")\nhv_reads_genus_all <- raise_rank(mrg_hv_named_all, viral_taxa, \"genus\")\nn_path_genera <- hv_reads_genus_all %>% \n  group_by(sample, date, sample_type_short, name, taxid) %>% \n  count(name=\"n_reads_viral\") %>% \n  inner_join(path_genera, by=c(\"name\", \"taxid\")) %>%\n  left_join(read_counts_raw, by=c(\"sample\", \"date\", \"sample_type_short\")) %>%\n  mutate(p_reads_viral = n_reads_viral/n_reads_raw)\n\n# Pivot out and back to add zero lines\nn_path_genera_out <- n_path_genera %>% ungroup %>% select(sample, name, n_reads_viral) %>%\n  pivot_wider(names_from=\"name\", values_from=\"n_reads_viral\", values_fill=0) %>%\n  pivot_longer(-sample, names_to=\"name\", values_to=\"n_reads_viral\") %>%\n  left_join(read_counts_raw, by=\"sample\") %>%\n  left_join(path_genera, by=\"name\") %>%\n  mutate(p_reads_viral = n_reads_viral/n_reads_raw)\n\n## Aggregate across dates\nn_path_genera_stype <- n_path_genera_out %>% \n  group_by(name, taxid, genome_type, sample_type_short) %>%\n  summarize(n_reads_raw = sum(n_reads_raw),\n            n_reads_viral = sum(n_reads_viral), .groups = \"drop\") %>%\n  mutate(sample=\"All samples\", location=\"All locations\",\n         p_reads_viral = n_reads_viral/n_reads_raw,\n         na_type = \"DNA\")\n\n# Plot\ng_path_genera <- ggplot(n_path_genera_stype,\n                        aes(y=name, x=p_reads_viral, color=sample_type_short)) +\n  geom_point() +\n  scale_x_log10(name=\"Relative abundance\") +\n  scale_color_st() +\n  facet_grid(genome_type~., scales=\"free_y\") +\n  theme_base + theme(axis.title.y = element_blank())\ng_path_genera\n\n\n\n\n\n\n\nConclusion\nThis is another dataset with very low HV abundance, arising from lab methods intended to maximize bacterial abundance at the expense of other taxa. 
Nevertheless, this dataset had higher HV relative abundance than the last one. Interestingly, all three wastewater DNA datasets analyzed so far have exhibited a strong predominance of adenoviruses, and especially human mastadenovirus F, among human-infecting viruses. We’ll see if this pattern persists in the other DNA wastewater datasets I have in the queue."
+    "text": "Continuing my analysis of datasets from the P2RA preprint, I analyzed the data from Ng et al. (2019), a study that used DNA sequencing of wastewater samples to characterize the bacterial microbiota and resistome in Singapore. This study used processing methods I haven’t seen before:\n\nAll samples passed through “a filter” on-site at the WWTP prior to further processing in lab.\nSamples concentrated to 400ml using a Hemoflow dialyzer “via standard bloodline tubing”.\nEluted concentrates then further concentrated by passing through a 0.22um filter and retaining the retentate (NB: this is anti-selecting for viruses).\nSludge samples were instead centrifuged and the pellet kept for further analysis.\nAfter concentration, samples underwent DNA extraction with the PowerSoil DNA Isolation Kit, then underwent library prep and Illumina sequencing with an Illumina HiSeq2500 (2x250bp).\n\nSince this was a bacteria-focused study that used processing methods we expect to select against viruses, we wouldn’t expect to see high viral relative abundances here. Nevertheless, it’s worth seeing what we can see.\nThe raw data\nSamples were collected from six different locations in the treatment plant on six different dates (from October 2016 to August 2017) for a total of 36 samples:\n\n\nCode# Importing the data is a bit more complicated this time as the samples are split across three pipeline runs\ndata_dir <- \"../data/2024-05-01_ng\"\n\n# Data input paths\nlibraries_path <- file.path(data_dir, \"sample-metadata.csv\")\nbasic_stats_path <- file.path(data_dir, \"qc_basic_stats.tsv.gz\")\nadapter_stats_path <- file.path(data_dir, \"qc_adapter_stats.tsv.gz\")\nquality_base_stats_path <- file.path(data_dir, \"qc_quality_base_stats.tsv.gz\")\nquality_seq_stats_path <- file.path(data_dir, \"qc_quality_sequence_stats.tsv.gz\")\n\n# Import libraries and extract metadata from sample names\nlocs <- c(\"INF\", \"PST\", \"SLUDGE\", \"SST\", \"MBR\", \"WW\")\nlibraries_raw <- lapply(libraries_path, read_csv, show_col_types = FALSE) %>%\n  bind_rows\nlibraries <- libraries_raw %>%\n  mutate(sample_type_long = gsub(\" \\\\(.*\", \"\", sample_type),\n         sample_type_short = ifelse(sample_type_long == \"Influent\", \"INF\",\n                                    sub(\".*\\\\((.*)\\\\)\", \"\\\\1\", sample_type)),\n         sample_type_short = factor(sample_type_short, levels=locs)) %>%\n  arrange(sample_type_short, date) %>%\n  mutate(sample_type_long = fct_inorder(sample_type_long),\n         sample = fct_inorder(sample)) %>%\n  arrange(date) %>%\n  mutate(date = fct_inorder(date))\n\n# Make table\ncount_samples <- libraries %>% group_by(sample_type_long, sample_type_short) %>%\n  count %>%\n  rename(`Sample Type`=sample_type_long, Abbreviation=sample_type_short)\ncount_samples\n\n\n  \n\n\n\n\nCode# Import QC data\nstages <- c(\"raw_concat\", \"cleaned\", \"dedup\", \"ribo_initial\", \"ribo_secondary\")\nimport_basic <- function(paths){\n  lapply(paths, read_tsv, show_col_types = FALSE) %>% bind_rows %>%\n    inner_join(libraries, by=\"sample\") %>%\n      arrange(sample_type_short, date, sample) %>%\n    mutate(stage = factor(stage, levels = stages),\n           sample = fct_inorder(sample))\n}\nimport_basic_paired <- function(paths){\n  import_basic(paths) %>% arrange(read_pair) %>% \n    mutate(read_pair = fct_inorder(as.character(read_pair)))\n}\nbasic_stats <- import_basic(basic_stats_path)\nadapter_stats <- import_basic_paired(adapter_stats_path)\nquality_base_stats <- 
import_basic_paired(quality_base_stats_path)\nquality_seq_stats <- import_basic_paired(quality_seq_stats_path)\n\n# Filter to raw data\nbasic_stats_raw <- basic_stats %>% filter(stage == \"raw_concat\")\nadapter_stats_raw <- adapter_stats %>% filter(stage == \"raw_concat\")\nquality_base_stats_raw <- quality_base_stats %>% filter(stage == \"raw_concat\")\nquality_seq_stats_raw <- quality_seq_stats %>% filter(stage == \"raw_concat\")\n\n# Get key values for readout\nraw_read_counts <- basic_stats_raw %>% ungroup %>% \n  summarize(rmin = min(n_read_pairs), rmax=max(n_read_pairs),\n            rmean=mean(n_read_pairs), \n            rtot = sum(n_read_pairs),\n            btot = sum(n_bases_approx),\n            dmin = min(percent_duplicates), dmax=max(percent_duplicates),\n            dmean=mean(percent_duplicates), .groups = \"drop\")\n\n\nThese 36 samples yielded 26.6M-74.1M (mean 46.1M) reads per sample, for a total of 1.7B read pairs (830 gigabases of sequence). Read qualities were mostly high but tailed off towards the 3’ end, requiring some trimming. Adapter levels were fairly low but still in need of some trimming. Inferred duplication levels were variable (1-64%, mean 31%), with libraries with lower read depth showing much lower duplicate levels.\n\nCode# Prepare data\nbasic_stats_raw_metrics <- basic_stats_raw %>%\n  select(sample, sample_type_short, date,\n         `# Read pairs` = n_read_pairs,\n         `Total base pairs\\n(approx)` = n_bases_approx,\n         `% Duplicates\\n(FASTQC)` = percent_duplicates) %>%\n  pivot_longer(-(sample:date), names_to = \"metric\", values_to = \"value\") %>%\n  mutate(metric = fct_inorder(metric))\n\n# Set up plot templates\nscale_fill_st <- purrr::partial(scale_fill_brewer, palette=\"Set1\", name=\"Sample Type\")\ng_basic <- ggplot(basic_stats_raw_metrics, \n                  aes(x=sample, y=value, fill=sample_type_short, \n                      group=interaction(sample_type_short,sample))) +\n  geom_col(position = \"dodge\") +\n  scale_y_continuous(expand=c(0,0)) +\n  expand_limits(y=c(0,100)) +\n  scale_fill_st() + \n  facet_grid(metric~., scales = \"free\", space=\"free_x\", switch=\"y\") +\n  theme_xblank + theme(\n    axis.title.y = element_blank(),\n    strip.text.y = element_text(face=\"plain\")\n  )\ng_basic\n\n\n\n\n\n\n\n\nCode# Set up plotting templates\nscale_color_st <- purrr::partial(scale_color_brewer, palette=\"Set1\",\n                                   name=\"Sample Type\")\ng_qual_raw <- ggplot(mapping=aes(color=sample_type_short, linetype=read_pair, \n                         group=interaction(sample,read_pair))) + \n  scale_color_st() + scale_linetype_discrete(name = \"Read Pair\") +\n  guides(color=guide_legend(nrow=2,byrow=TRUE),\n         linetype = guide_legend(nrow=2,byrow=TRUE)) +\n  theme_base\n\n# Visualize adapters\ng_adapters_raw <- g_qual_raw + \n  geom_line(aes(x=position, y=pc_adapters), data=adapter_stats_raw) +\n  scale_y_continuous(name=\"% Adapters\", limits=c(0,NA),\n                     breaks = seq(0,100,1), expand=c(0,0)) +\n  scale_x_continuous(name=\"Position\", limits=c(0,NA),\n                     breaks=seq(0,500,20), expand=c(0,0)) +\n  facet_grid(.~adapter)\ng_adapters_raw\n\n\n\n\n\n\nCode# Visualize quality\ng_quality_base_raw <- g_qual_raw +\n  geom_hline(yintercept=25, linetype=\"dashed\", color=\"red\") +\n  geom_hline(yintercept=30, linetype=\"dashed\", color=\"red\") +\n  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats_raw) +\n  scale_y_continuous(name=\"Mean Phred 
score\", expand=c(0,0), limits=c(10,45)) +\n  scale_x_continuous(name=\"Position\", limits=c(0,NA),\n                     breaks=seq(0,500,20), expand=c(0,0))\ng_quality_base_raw\n\n\n\n\n\n\nCodeg_quality_seq_raw <- g_qual_raw +\n  geom_vline(xintercept=25, linetype=\"dashed\", color=\"red\") +\n  geom_vline(xintercept=30, linetype=\"dashed\", color=\"red\") +\n  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats_raw) +\n  scale_x_continuous(name=\"Mean Phred score\", expand=c(0,0)) +\n  scale_y_continuous(name=\"# Sequences\", expand=c(0,0))\ng_quality_seq_raw\n\n\n\n\n\n\n\nPreprocessing\nThe average fraction of reads lost at each stage in the preprocessing pipeline is shown in the following table. As expected given the observed difference in duplication levels, many more reads were lost during deduplication in liquid samples than sludge samples. Conversely, trimming and filtering consistently removed more reads in sludge than in liquid samples, though the effect was less dramatic than for deduplication. Very few reads were lost during ribodepletion, as expected for DNA sequencing libraries.\n\nCoden_reads_rel <- basic_stats %>% \n  select(sample, sample_type_short, date, stage, \n         percent_duplicates, n_read_pairs) %>%\n  group_by(sample) %>% arrange(sample, stage) %>%\n  mutate(p_reads_retained = replace_na(n_read_pairs / lag(n_read_pairs), 0),\n         p_reads_lost = 1 - p_reads_retained,\n         p_reads_retained_abs = n_read_pairs / n_read_pairs[1],\n         p_reads_lost_abs = 1-p_reads_retained_abs,\n         p_reads_lost_abs_marginal = replace_na(p_reads_lost_abs - lag(p_reads_lost_abs), 0))\nn_reads_rel_display <- n_reads_rel %>% \n  group_by(`Sample Type`=sample_type_short, Stage=stage) %>% \n  summarize(`% Total Reads Lost (Cumulative)` = paste0(round(min(p_reads_lost_abs*100),1), \"-\", round(max(p_reads_lost_abs*100),1), \" (mean \", round(mean(p_reads_lost_abs*100),1), \")\"),\n            `% Total Reads Lost (Marginal)` = paste0(round(min(p_reads_lost_abs_marginal*100),1), \"-\", round(max(p_reads_lost_abs_marginal*100),1), \" (mean \", round(mean(p_reads_lost_abs_marginal*100),1), \")\"), .groups=\"drop\") %>% \n  filter(Stage != \"raw_concat\") %>%\n  mutate(Stage = Stage %>% as.numeric %>% factor(labels=c(\"Trimming & filtering\", \"Deduplication\", \"Initial ribodepletion\", \"Secondary ribodepletion\")))\nn_reads_rel_display\n\n\n  \n\n\n\n\nCodeg_stage_base <- ggplot(mapping=aes(x=stage, color=sample_type_short, group=sample)) +\n  scale_color_st() +\n  theme_kit\n\n# Plot reads over preprocessing\ng_reads_stages <- g_stage_base +\n  geom_line(aes(y=n_read_pairs), data=basic_stats) +\n  scale_y_continuous(\"# Read pairs\", expand=c(0,0), limits=c(0,NA))\ng_reads_stages\n\n\n\n\n\n\nCode# Plot relative read losses during preprocessing\ng_reads_rel <- g_stage_base +\n  geom_line(aes(y=p_reads_lost_abs_marginal), data=n_reads_rel) +\n  scale_y_continuous(\"% Total Reads Lost\", expand=c(0,0), \n                     labels = function(x) x*100)\ng_reads_rel\n\n\n\n\n\n\n\nData cleaning was very successful at removing adapters and improving read qualities:\n\nCodeg_qual <- ggplot(mapping=aes(color=sample_type_short, linetype=read_pair, \n                         group=interaction(sample,read_pair))) + \n  scale_color_st() + scale_linetype_discrete(name = \"Read Pair\") +\n  guides(color=guide_legend(nrow=2,byrow=TRUE),\n         linetype = guide_legend(nrow=2,byrow=TRUE)) +\n  theme_base\n\n# Visualize adapters\ng_adapters <- g_qual + \n  
geom_line(aes(x=position, y=pc_adapters), data=adapter_stats) +\n  scale_y_continuous(name=\"% Adapters\", limits=c(0,20),\n                     breaks = seq(0,50,10), expand=c(0,0)) +\n  scale_x_continuous(name=\"Position\", limits=c(0,NA),\n                     breaks=seq(0,140,20), expand=c(0,0)) +\n  facet_grid(stage~adapter)\ng_adapters\n\n\n\n\n\n\nCode# Visualize quality\ng_quality_base <- g_qual +\n  geom_hline(yintercept=25, linetype=\"dashed\", color=\"red\") +\n  geom_hline(yintercept=30, linetype=\"dashed\", color=\"red\") +\n  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats) +\n  scale_y_continuous(name=\"Mean Phred score\", expand=c(0,0), limits=c(10,45)) +\n  scale_x_continuous(name=\"Position\", limits=c(0,NA),\n                     breaks=seq(0,140,20), expand=c(0,0)) +\n  facet_grid(stage~.)\ng_quality_base\n\n\n\n\n\n\nCodeg_quality_seq <- g_qual +\n  geom_vline(xintercept=25, linetype=\"dashed\", color=\"red\") +\n  geom_vline(xintercept=30, linetype=\"dashed\", color=\"red\") +\n  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats) +\n  scale_x_continuous(name=\"Mean Phred score\", expand=c(0,0)) +\n  scale_y_continuous(name=\"# Sequences\", expand=c(0,0)) +\n  facet_grid(stage~.)\ng_quality_seq\n\n\n\n\n\n\n\nAccording to FASTQC, cleaning + deduplication was very effective at reducing measured duplicate levels, which fell from an average of 31% to 6.5%:\n\nCodestage_dup <- basic_stats %>% group_by(stage) %>% \n  summarize(dmin = min(percent_duplicates), dmax=max(percent_duplicates),\n            dmean=mean(percent_duplicates), .groups = \"drop\")\n\ng_dup_stages <- g_stage_base +\n  geom_line(aes(y=percent_duplicates), data=basic_stats) +\n  scale_y_continuous(\"% Duplicates\", limits=c(0,NA), expand=c(0,0))\ng_dup_stages\n\n\n\n\n\n\nCodeg_readlen_stages <- g_stage_base + \n  geom_line(aes(y=mean_seq_len), data=basic_stats) +\n  scale_y_continuous(\"Mean read length (nt)\", expand=c(0,0), limits=c(0,NA))\ng_readlen_stages\n\n\n\n\n\n\n\nHigh-level composition\nAs before, to assess the high-level composition of the reads, I ran the ribodepleted files through Kraken (using the Standard 16 database) and summarized the results with Bracken. 
Combining these results with the read counts above gives us a breakdown of the inferred composition of the samples:\n\nCodeclassifications <- c(\"Filtered\", \"Duplicate\", \"Ribosomal\", \"Unassigned\",\n                     \"Bacterial\", \"Archaeal\", \"Viral\", \"Human\")\n\n# Import composition data\ncomp_path <- file.path(data_dir, \"taxonomic_composition.tsv.gz\")\ncomp <- read_tsv(comp_path, show_col_types = FALSE) %>%\n  left_join(libraries, by=\"sample\") %>%\n  mutate(classification = factor(classification, levels = classifications))\n  \n\n# Summarize composition\nread_comp_summ <- comp %>% \n  group_by(sample_type_short, classification) %>%\n  summarize(n_reads = sum(n_reads), .groups = \"drop_last\") %>%\n  mutate(n_reads = replace_na(n_reads,0),\n    p_reads = n_reads/sum(n_reads),\n    pc_reads = p_reads*100)\n\n\n\nCode# Prepare plotting templates\ng_comp_base <- ggplot(mapping=aes(x=sample, y=p_reads, fill=classification)) +\n  facet_wrap(~sample_type_short, scales = \"free_x\", ncol=3,\n             labeller = label_wrap_gen(multi_line=FALSE, width=20)) +\n  theme_xblank\nscale_y_pc_reads <- purrr::partial(scale_y_continuous, name = \"% Reads\",\n                                   expand = c(0,0), labels = function(y) y*100)\n\n# Plot overall composition\ng_comp <- g_comp_base + geom_col(data = comp, position = \"stack\", width=1) +\n  scale_y_pc_reads(limits = c(0,1.01), breaks = seq(0,1,0.2)) +\n  scale_fill_brewer(palette = \"Set1\", name = \"Classification\")\ng_comp\n\n\n\n\n\n\nCode# Plot composition of minor components\ncomp_minor <- comp %>% \n  filter(classification %in% c(\"Archaeal\", \"Viral\", \"Human\", \"Other\"))\npalette_minor <- brewer.pal(9, \"Set1\")[6:9]\ng_comp_minor <- g_comp_base + \n  geom_col(data=comp_minor, position = \"stack\", width=1) +\n  scale_y_pc_reads() +\n  scale_fill_manual(values=palette_minor, name = \"Classification\")\ng_comp_minor\n\n\n\n\n\n\n\n\nCodep_reads_summ_group <- comp %>%\n  mutate(classification = ifelse(classification %in% c(\"Filtered\", \"Duplicate\", \"Unassigned\"), \"Excluded\", as.character(classification)),\n         classification = fct_inorder(classification)) %>%\n  group_by(classification, sample, sample_type_short) %>%\n  summarize(p_reads = sum(p_reads), .groups = \"drop\") %>%\n  group_by(classification, sample_type_short) %>%\n  summarize(pc_min = min(p_reads)*100, pc_max = max(p_reads)*100, \n            pc_mean = mean(p_reads)*100, .groups = \"drop\")\np_reads_summ_prep <- p_reads_summ_group %>%\n  mutate(classification = fct_inorder(classification),\n         pc_min = pc_min %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),\n         pc_max = pc_max %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),\n         pc_mean = pc_mean %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),\n         display = paste0(pc_min, \"-\", pc_max, \"% (mean \", pc_mean, \"%)\"))\np_reads_summ <- p_reads_summ_prep %>%\n  select(`Sample Type`=sample_type_short, Classification=classification, \n         `Read Fraction`=display) %>%\n  arrange(`Sample Type`, Classification)\np_reads_summ\n\n\n  \n\n\n\nAs in previous DNA datasets, the vast majority of classified reads were bacterial in origin. The fraction of virus reads varied substantially between sample types, averaging <0.01% in influent and final effluent but closer to 0.05% in other sample types. 
Interestingly (though not particularly relevantly for this analysis), the fraction of archaeal reads was much higher in influent than other sample types, in contrast to Bengtsson-Palme where it was highest in sludge.\nAs is common for DNA data, viral reads were overwhelmingly dominated by Caudoviricetes phages, though one wet-well sample contained a substantial fraction of Alsuviricetes (a class of mainly plant pathogens that includes Virgaviridae):\n\nCode# Get Kraken reports\nreports_path <- file.path(data_dir, \"kraken_reports.tsv.gz\")\nreports <- read_tsv(reports_path, show_col_types = FALSE)\n\n# Get viral taxonomy\nviral_taxa_path <- file.path(data_dir, \"viral-taxids.tsv.gz\")\nviral_taxa <- read_tsv(viral_taxa_path, show_col_types = FALSE)\n\n# Filter to viral taxa\nkraken_reports_viral <- filter(reports, taxid %in% viral_taxa$taxid) %>%\n  group_by(sample) %>%\n  mutate(p_reads_viral = n_reads_clade/n_reads_clade[1])\nkraken_reports_viral_cleaned <- kraken_reports_viral %>%\n  inner_join(libraries, by=\"sample\") %>%\n  select(-pc_reads_total, -n_reads_direct, -contains(\"minimizers\")) %>%\n  select(name, taxid, p_reads_viral, n_reads_clade, everything())\n\nviral_classes <- kraken_reports_viral_cleaned %>% filter(rank == \"C\")\nviral_families <- kraken_reports_viral_cleaned %>% filter(rank == \"F\")\n\n\n\nCodemajor_threshold <- 0.02\n\n# Identify major viral classes\nviral_classes_major_tab <- viral_classes %>% \n  group_by(name, taxid) %>%\n  summarize(p_reads_viral_max = max(p_reads_viral), .groups=\"drop\") %>%\n  filter(p_reads_viral_max >= major_threshold)\nviral_classes_major_list <- viral_classes_major_tab %>% pull(name)\nviral_classes_major <- viral_classes %>% \n  filter(name %in% viral_classes_major_list) %>%\n  select(name, taxid, sample, sample_type_short, date, p_reads_viral)\nviral_classes_minor <- viral_classes_major %>% \n  group_by(sample, sample_type_short, date) %>%\n  summarize(p_reads_viral_major = sum(p_reads_viral), .groups = \"drop\") %>%\n  mutate(name = \"Other\", taxid=NA, p_reads_viral = 1-p_reads_viral_major) %>%\n  select(name, taxid, sample, sample_type_short, date, p_reads_viral)\nviral_classes_display <- bind_rows(viral_classes_major, viral_classes_minor) %>%\n  arrange(desc(p_reads_viral)) %>% \n  mutate(name = factor(name, levels=c(viral_classes_major_list, \"Other\")),\n         p_reads_viral = pmax(p_reads_viral, 0)) %>%\n  rename(p_reads = p_reads_viral, classification=name)\n\npalette_viral <- c(brewer.pal(12, \"Set3\"), brewer.pal(8, \"Dark2\"))\ng_classes <- g_comp_base + \n  geom_col(data=viral_classes_display, position = \"stack\", width=1) +\n  scale_y_continuous(name=\"% Viral Reads\", limits=c(0,1.01), breaks = seq(0,1,0.2),\n                     expand=c(0,0), labels = function(y) y*100) +\n  scale_fill_manual(values=palette_viral, name = \"Viral class\")\n  \ng_classes\n\n\n\n\n\n\n\nHuman-infecting virus reads: validation\nNext, I investigated the human-infecting virus read content of these unenriched samples. 
A grand total of 527 reads were identified as putatively human-viral, with half of samples showing 5 or fewer total HV read pairs.\n\nCode# Import HV read data\nhv_reads_filtered_path <- file.path(data_dir, \"hv_hits_putative_filtered.tsv.gz\")\nhv_reads_filtered <- lapply(hv_reads_filtered_path, read_tsv,\n                            show_col_types = FALSE) %>%\n  bind_rows() %>%\n  left_join(libraries, by=\"sample\")\n\n# Count reads\nn_hv_filtered <- hv_reads_filtered %>%\n  group_by(sample, date, sample_type_short, seq_id) %>% count %>%\n  group_by(sample, date, sample_type_short) %>% count %>% \n  inner_join(basic_stats %>% filter(stage == \"ribo_initial\") %>% \n               select(sample, n_read_pairs), by=\"sample\") %>% \n  rename(n_putative = n, n_total = n_read_pairs) %>% \n  mutate(p_reads = n_putative/n_total, pc_reads = p_reads * 100)\nn_hv_filtered_summ <- n_hv_filtered %>% ungroup %>%\n  summarize(n_putative = sum(n_putative), n_total = sum(n_total), \n            .groups=\"drop\") %>% \n  mutate(p_reads = n_putative/n_total, pc_reads = p_reads*100)\n\n\n\nCode# Collapse multi-entry sequences\nrmax <- purrr::partial(max, na.rm = TRUE)\ncollapse <- function(x) ifelse(all(x == x[1]), x[1], paste(x, collapse=\"/\"))\nmrg <- hv_reads_filtered %>% \n  mutate(adj_score_max = pmax(adj_score_fwd, adj_score_rev, na.rm = TRUE)) %>%\n  arrange(desc(adj_score_max)) %>%\n  group_by(seq_id) %>%\n  summarize(sample = collapse(sample),\n            genome_id = collapse(genome_id),\n            taxid_best = taxid[1],\n            taxid = collapse(as.character(taxid)),\n            best_alignment_score_fwd = rmax(best_alignment_score_fwd),\n            best_alignment_score_rev = rmax(best_alignment_score_rev),\n            query_len_fwd = rmax(query_len_fwd),\n            query_len_rev = rmax(query_len_rev),\n            query_seq_fwd = query_seq_fwd[!is.na(query_seq_fwd)][1],\n            query_seq_rev = query_seq_rev[!is.na(query_seq_rev)][1],\n            classified = rmax(classified),\n            assigned_name = collapse(assigned_name),\n            assigned_taxid_best = assigned_taxid[1],\n            assigned_taxid = collapse(as.character(assigned_taxid)),\n            assigned_hv = rmax(assigned_hv),\n            hit_hv = rmax(hit_hv),\n            encoded_hits = collapse(encoded_hits),\n            adj_score_fwd = rmax(adj_score_fwd),\n            adj_score_rev = rmax(adj_score_rev)\n            ) %>%\n  inner_join(libraries, by=\"sample\") %>%\n  mutate(kraken_label = ifelse(assigned_hv, \"Kraken2 HV\\nassignment\",\n                               ifelse(hit_hv, \"Kraken2 HV\\nhit\",\n                                      \"No hit or\\nassignment\"))) %>%\n  mutate(adj_score_max = pmax(adj_score_fwd, adj_score_rev),\n         highscore = adj_score_max >= 20)\n\n# Plot results\ngeom_vhist <- purrr::partial(geom_histogram, binwidth=5, boundary=0)\ng_vhist_base <- ggplot(mapping=aes(x=adj_score_max)) +\n  geom_vline(xintercept=20, linetype=\"dashed\", color=\"red\") +\n  facet_wrap(~kraken_label, labeller = labeller(kit = label_wrap_gen(20)), scales = \"free_y\") +\n  scale_x_continuous(name = \"Maximum adjusted alignment score\") + \n  scale_y_continuous(name=\"# Read pairs\") + \n  theme_base \ng_vhist_0 <- g_vhist_base + geom_vhist(data=mrg)\ng_vhist_0\n\n\n\n\n\n\n\nBLASTing these reads against nt, we find that the pipeline performs well, with only a single high-scoring false-positive read:\n\nCode# Import paired BLAST results\nblast_paired_path <- file.path(data_dir, 
\"hv_hits_blast_paired.tsv.gz\")\nblast_paired <- read_tsv(blast_paired_path, show_col_types = FALSE)\n\n# Add viral status\nblast_viral <- mutate(blast_paired, viral = staxid %in% viral_taxa$taxid) %>%\n  mutate(viral_full = viral & n_reads == 2)\n\n# Compare to Kraken & Bowtie assignments\nmatch_taxid <- function(taxid_1, taxid_2){\n  p1 <- mapply(grepl, paste0(\"/\", taxid_1, \"$\"), taxid_2)\n  p2 <- mapply(grepl, paste0(\"^\", taxid_1, \"/\"), taxid_2)\n  p3 <- mapply(grepl, paste0(\"^\", taxid_1, \"$\"), taxid_2)\n  out <- setNames(p1|p2|p3, NULL)\n  return(out)\n}\nmrg_assign <- mrg %>% select(sample, seq_id, taxid, assigned_taxid, adj_score_max)\nblast_assign <- inner_join(blast_viral, mrg_assign, by=\"seq_id\") %>%\n    mutate(taxid_match_bowtie = match_taxid(staxid, taxid),\n           taxid_match_kraken = match_taxid(staxid, assigned_taxid),\n           taxid_match_any = taxid_match_bowtie | taxid_match_kraken)\nblast_out <- blast_assign %>%\n  group_by(seq_id) %>%\n  summarize(viral_status = ifelse(any(viral_full), 2,\n                                  ifelse(any(taxid_match_any), 2,\n                                             ifelse(any(viral), 1, 0))),\n            .groups = \"drop\")\n\n\n\nCode# Merge BLAST results with unenriched read data\nmrg_blast <- full_join(mrg, blast_out, by=\"seq_id\") %>%\n  mutate(viral_status = replace_na(viral_status, 0),\n         viral_status_out = ifelse(viral_status == 0, FALSE, TRUE))\n\n# Plot\ng_vhist_1 <- g_vhist_base + geom_vhist(data=mrg_blast, mapping=aes(fill=viral_status_out)) +\n  scale_fill_brewer(palette = \"Set1\", name = \"Viral status\")\ng_vhist_1\n\n\n\n\n\n\n\nMy usual disjunctive score threshold of 20 gave precision, sensitivity, and F1 scores all >97%:\n\nCodetest_sens_spec <- function(tab, score_threshold){\n  tab_retained <- tab %>% \n    mutate(retain_score = (adj_score_fwd > score_threshold | adj_score_rev > score_threshold),\n           retain = assigned_hv | retain_score) %>%\n    group_by(viral_status_out, retain) %>% count\n  pos_tru <- tab_retained %>% filter(viral_status_out == \"TRUE\", retain) %>% pull(n) %>% sum\n  pos_fls <- tab_retained %>% filter(viral_status_out != \"TRUE\", retain) %>% pull(n) %>% sum\n  neg_tru <- tab_retained %>% filter(viral_status_out != \"TRUE\", !retain) %>% pull(n) %>% sum\n  neg_fls <- tab_retained %>% filter(viral_status_out == \"TRUE\", !retain) %>% pull(n) %>% sum\n  sensitivity <- pos_tru / (pos_tru + neg_fls)\n  specificity <- neg_tru / (neg_tru + pos_fls)\n  precision   <- pos_tru / (pos_tru + pos_fls)\n  f1 <- 2 * precision * sensitivity / (precision + sensitivity)\n  out <- tibble(threshold=score_threshold, sensitivity=sensitivity, \n                specificity=specificity, precision=precision, f1=f1)\n  return(out)\n}\nrange_f1 <- function(intab, inrange=15:45){\n  tss <- purrr::partial(test_sens_spec, tab=intab)\n  stats <- lapply(inrange, tss) %>% bind_rows %>%\n    pivot_longer(!threshold, names_to=\"metric\", values_to=\"value\")\n  return(stats)\n}\nstats_0 <- range_f1(mrg_blast)\ng_stats_0 <- ggplot(stats_0, aes(x=threshold, y=value, color=metric)) +\n  geom_vline(xintercept=20, color = \"red\", linetype = \"dashed\") +\n  geom_line() +\n  scale_y_continuous(name = \"Value\", limits=c(0,1), breaks = seq(0,1,0.2), expand = c(0,0)) +\n  scale_x_continuous(name = \"Adjusted Score Threshold\", expand = c(0,0)) +\n  scale_color_brewer(palette=\"Dark2\") +\n  theme_base\ng_stats_0\n\n\n\n\n\n\nCodestats_0 %>% filter(threshold == 20) %>% \n  select(Threshold=threshold, 
Metric=metric, Value=value)\n\n\n  \n\n\n\nHuman-infecting viruses: overall relative abundance\n\nCode# Get raw read counts\nread_counts_raw <- basic_stats_raw %>%\n  select(sample, sample_type_short, date, n_reads_raw = n_read_pairs)\n\n# Get HV read counts\nmrg_hv <- mrg %>% mutate(hv_status = assigned_hv | highscore) %>%\n  rename(taxid_all = taxid, taxid = taxid_best)\nread_counts_hv <- mrg_hv %>% filter(hv_status) %>% group_by(sample) %>% \n  count(name=\"n_reads_hv\")\nread_counts <- read_counts_raw %>% left_join(read_counts_hv, by=\"sample\") %>%\n  mutate(n_reads_hv = replace_na(n_reads_hv, 0))\n\n# Aggregate\nread_counts_grp <- read_counts %>% group_by(date, sample_type_short) %>%\n  summarize(n_reads_raw = sum(n_reads_raw),\n            n_reads_hv = sum(n_reads_hv), .groups=\"drop\") %>%\n  mutate(sample= \"All samples\")\nread_counts_st <- read_counts_grp %>% group_by(sample, sample_type_short) %>%\n  summarize(n_reads_raw = sum(n_reads_raw),\n            n_reads_hv = sum(n_reads_hv), .groups=\"drop\") %>%\n  mutate(date = \"All dates\")\nread_counts_date <- read_counts_grp %>%\n  group_by(sample, date) %>%\n  summarize(n_reads_raw = sum(n_reads_raw),\n            n_reads_hv = sum(n_reads_hv), .groups=\"drop\") %>%\n  mutate(sample_type_short = \"All sample types\")\nread_counts_tot <- read_counts_date %>% group_by(sample, sample_type_short) %>%\n  summarize(n_reads_raw = sum(n_reads_raw),\n            n_reads_hv = sum(n_reads_hv), .groups=\"drop\") %>%\n  mutate(date = \"All dates\")\nread_counts_agg <- bind_rows(read_counts_grp, read_counts_st,\n                             read_counts_date, read_counts_tot) %>%\n  mutate(p_reads_hv = n_reads_hv/n_reads_raw,\n         date = factor(date, levels = c(levels(libraries$date), \"All dates\")),\n         sample_type_short = factor(sample_type_short, levels = c(levels(libraries$sample_type_short), \"All sample types\")))\n\n\nApplying a disjunctive cutoff at S=20 identifies 482 read pairs as human-viral. 
This gives an overall relative HV abundance of \\(2.90 \\times 10^{-7}\\); on the low end across all datasets I’ve analyzed, though higher than for Bengtsson-Palme:\n\nCode# Visualize\ng_phv_agg <- ggplot(read_counts_agg, aes(x=date, color=sample_type_short)) +\n  geom_point(aes(y=p_reads_hv)) +\n  scale_y_log10(\"Relative abundance of human virus reads\") +\n  scale_color_st() + theme_kit\ng_phv_agg\n\n\n\n\n\n\n\n\nCode# Collate past RA values\nra_past <- tribble(~dataset, ~ra, ~na_type, ~panel_enriched,\n                   \"Brumfield\", 5e-5, \"RNA\", FALSE,\n                   \"Brumfield\", 3.66e-7, \"DNA\", FALSE,\n                   \"Spurbeck\", 5.44e-6, \"RNA\", FALSE,\n                   \"Yang\", 3.62e-4, \"RNA\", FALSE,\n                   \"Rothman (unenriched)\", 1.87e-5, \"RNA\", FALSE,\n                   \"Rothman (panel-enriched)\", 3.3e-5, \"RNA\", TRUE,\n                   \"Crits-Christoph (unenriched)\", 1.37e-5, \"RNA\", FALSE,\n                   \"Crits-Christoph (panel-enriched)\", 1.26e-2, \"RNA\", TRUE,\n                   \"Prussin (non-control)\", 1.63e-5, \"RNA\", FALSE,\n                   \"Prussin (non-control)\", 4.16e-5, \"DNA\", FALSE,\n                   \"Rosario (non-control)\", 1.21e-5, \"RNA\", FALSE,\n                   \"Rosario (non-control)\", 1.50e-4, \"DNA\", FALSE,\n                   \"Leung\", 1.73e-5, \"DNA\", FALSE,\n                   \"Brinch\", 3.88e-6, \"DNA\", FALSE,\n                   \"Bengtsson-Palme\", 8.86e-8, \"DNA\", FALSE\n)\n\n# Collate new RA values\nra_new <- tribble(~dataset, ~ra, ~na_type, ~panel_enriched,\n                  \"Ng\", 2.90e-7, \"DNA\", FALSE)\n\n\n# Plot\nscale_color_na <- purrr::partial(scale_color_brewer, palette=\"Set1\",\n                                 name=\"Nucleic acid type\")\nra_comp <- bind_rows(ra_past, ra_new) %>% mutate(dataset = fct_inorder(dataset))\ng_ra_comp <- ggplot(ra_comp, aes(y=dataset, x=ra, color=na_type)) +\n  geom_point() +\n  scale_color_na() +\n  scale_x_log10(name=\"Relative abundance of human virus reads\") +\n  theme_base + theme(axis.title.y = element_blank())\ng_ra_comp\n\n\n\n\n\n\n\nHuman-infecting viruses: taxonomy and composition\nIn investigating the taxonomy of human-infecting virus reads, I restricted my analysis to samples with more than 5 HV read pairs total across all viruses, to reduce noise arising from extremely low HV read counts in some samples. 
13 samples met this criterion.\nAt the family level, most samples were overwhelmingly dominated by Adenoviridae, with Picornaviridae, Polyomaviridae and Papillomaviridae making up most of the rest:\n\nCode# Get viral taxon names for putative HV reads\nviral_taxa$name[viral_taxa$taxid == 249588] <- \"Mamastrovirus\"\nviral_taxa$name[viral_taxa$taxid == 194960] <- \"Kobuvirus\"\nviral_taxa$name[viral_taxa$taxid == 688449] <- \"Salivirus\"\nviral_taxa$name[viral_taxa$taxid == 585893] <- \"Picobirnaviridae\"\nviral_taxa$name[viral_taxa$taxid == 333922] <- \"Betapapillomavirus\"\nviral_taxa$name[viral_taxa$taxid == 334207] <- \"Betapapillomavirus 3\"\nviral_taxa$name[viral_taxa$taxid == 369960] <- \"Porcine type-C oncovirus\"\nviral_taxa$name[viral_taxa$taxid == 333924] <- \"Betapapillomavirus 2\"\nviral_taxa$name[viral_taxa$taxid == 687329] <- \"Anelloviridae\"\nviral_taxa$name[viral_taxa$taxid == 325455] <- \"Gammapapillomavirus\"\nviral_taxa$name[viral_taxa$taxid == 333750] <- \"Alphapapillomavirus\"\nviral_taxa$name[viral_taxa$taxid == 694002] <- \"Betacoronavirus\"\nviral_taxa$name[viral_taxa$taxid == 334202] <- \"Mupapillomavirus\"\nviral_taxa$name[viral_taxa$taxid == 197911] <- \"Alphainfluenzavirus\"\nviral_taxa$name[viral_taxa$taxid == 186938] <- \"Respirovirus\"\nviral_taxa$name[viral_taxa$taxid == 333926] <- \"Gammapapillomavirus 1\"\nviral_taxa$name[viral_taxa$taxid == 337051] <- \"Betapapillomavirus 1\"\nviral_taxa$name[viral_taxa$taxid == 337043] <- \"Alphapapillomavirus 4\"\nviral_taxa$name[viral_taxa$taxid == 694003] <- \"Betacoronavirus 1\"\nviral_taxa$name[viral_taxa$taxid == 334204] <- \"Mupapillomavirus 2\"\nviral_taxa$name[viral_taxa$taxid == 334208] <- \"Betapapillomavirus 4\"\nviral_taxa$name[viral_taxa$taxid == 333928] <- \"Gammapapillomavirus 2\"\nviral_taxa$name[viral_taxa$taxid == 337039] <- \"Alphapapillomavirus 2\"\nviral_taxa$name[viral_taxa$taxid == 333929] <- \"Gammapapillomavirus 3\"\nviral_taxa$name[viral_taxa$taxid == 337042] <- \"Alphapapillomavirus 7\"\nviral_taxa$name[viral_taxa$taxid == 334203] <- \"Mupapillomavirus 1\"\nviral_taxa$name[viral_taxa$taxid == 333757] <- \"Alphapapillomavirus 8\"\nviral_taxa$name[viral_taxa$taxid == 337050] <- \"Alphapapillomavirus 6\"\nviral_taxa$name[viral_taxa$taxid == 333767] <- \"Alphapapillomavirus 3\"\nviral_taxa$name[viral_taxa$taxid == 333754] <- \"Alphapapillomavirus 10\"\nviral_taxa$name[viral_taxa$taxid == 687363] <- \"Torque teno virus 24\"\nviral_taxa$name[viral_taxa$taxid == 687342] <- \"Torque teno virus 3\"\nviral_taxa$name[viral_taxa$taxid == 687359] <- \"Torque teno virus 20\"\nviral_taxa$name[viral_taxa$taxid == 194441] <- \"Primate T-lymphotropic virus 2\"\nviral_taxa$name[viral_taxa$taxid == 334209] <- \"Betapapillomavirus 5\"\nviral_taxa$name[viral_taxa$taxid == 194965] <- \"Aichivirus B\"\nviral_taxa$name[viral_taxa$taxid == 333930] <- \"Gammapapillomavirus 4\"\nviral_taxa$name[viral_taxa$taxid == 337048] <- \"Alphapapillomavirus 1\"\nviral_taxa$name[viral_taxa$taxid == 337041] <- \"Alphapapillomavirus 9\"\nviral_taxa$name[viral_taxa$taxid == 337049] <- \"Alphapapillomavirus 11\"\nviral_taxa$name[viral_taxa$taxid == 337044] <- \"Alphapapillomavirus 5\"\n\n# Filter samples and add viral taxa information\nsamples_keep <- read_counts %>% filter(n_reads_hv > 5) %>% pull(sample)\nmrg_hv_named <- mrg_hv %>% filter(sample %in% samples_keep, hv_status) %>% left_join(viral_taxa, by=\"taxid\") \n\n# Discover viral species & genera for HV reads\nraise_rank <- function(read_db, taxid_db, out_rank = \"species\", 
verbose = FALSE){\n  # Get higher ranks than search rank\n  ranks <- c(\"subspecies\", \"species\", \"subgenus\", \"genus\", \"subfamily\", \"family\", \"suborder\", \"order\", \"class\", \"subphylum\", \"phylum\", \"kingdom\", \"superkingdom\")\n  rank_match <- which.max(ranks == out_rank)\n  high_ranks <- ranks[rank_match:length(ranks)]\n  # Merge read DB and taxid DB\n  reads <- read_db %>% select(-parent_taxid, -rank, -name) %>%\n    left_join(taxid_db, by=\"taxid\")\n  # Extract sequences that are already at appropriate rank\n  reads_rank <- filter(reads, rank == out_rank)\n  # Drop sequences at a higher rank and return unclassified sequences\n  reads_norank <- reads %>% filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))\n  while(nrow(reads_norank) > 0){ # As long as there are unclassified sequences...\n    # Promote read taxids and re-merge with taxid DB, then re-classify and filter\n    reads_remaining <- reads_norank %>% mutate(taxid = parent_taxid) %>%\n      select(-parent_taxid, -rank, -name) %>%\n      left_join(taxid_db, by=\"taxid\")\n    reads_rank <- reads_remaining %>% filter(rank == out_rank) %>%\n      bind_rows(reads_rank)\n    reads_norank <- reads_remaining %>%\n      filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))\n  }\n  # Finally, extract and append reads that were excluded during the process\n  reads_dropped <- reads %>% filter(!seq_id %in% reads_rank$seq_id)\n  reads_out <- reads_rank %>% bind_rows(reads_dropped) %>%\n    select(-parent_taxid, -rank, -name) %>%\n    left_join(taxid_db, by=\"taxid\")\n  return(reads_out)\n}\nhv_reads_species <- raise_rank(mrg_hv_named, viral_taxa, \"species\")\nhv_reads_genus <- raise_rank(mrg_hv_named, viral_taxa, \"genus\")\nhv_reads_family <- raise_rank(mrg_hv_named, viral_taxa, \"family\")\n\n\n\nCodethreshold_major_family <- 0.02\n\n# Count reads for each human-viral family\nhv_family_counts <- hv_reads_family %>% \n  group_by(sample, date, sample_type_short, name, taxid) %>%\n  count(name = \"n_reads_hv\") %>%\n  group_by(sample, date, sample_type_short) %>%\n  mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking families and group others\nhv_family_major_tab <- hv_family_counts %>% group_by(name) %>% \n  filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%\n  arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > threshold_major_family)\nhv_family_counts_major <- hv_family_counts %>%\n  mutate(name_display = ifelse(name %in% hv_family_major_tab$name, name, \"Other\")) %>%\n  group_by(sample, date, sample_type_short, name_display) %>%\n  summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), \n            .groups=\"drop\") %>%\n  mutate(name_display = factor(name_display, \n                               levels = c(hv_family_major_tab$name, \"Other\")))\nhv_family_counts_display <- hv_family_counts_major %>%\n  rename(p_reads = p_reads_hv, classification = name_display)\n\n# Plot\ng_hv_family <- g_comp_base + \n  geom_col(data=hv_family_counts_display, position = \"stack\") +\n  scale_y_continuous(name=\"% HV Reads\", limits=c(0,1.01), \n                     breaks = seq(0,1,0.2),\n                     expand=c(0,0), labels = function(y) y*100) +\n  scale_fill_manual(values=palette_viral, name = \"Viral family\") +\n  labs(title=\"Family composition of human-viral reads\") +\n  guides(fill=guide_legend(ncol=4)) +\n  theme(plot.title = element_text(size=rel(1.4), hjust=0, face=\"plain\"))\ng_hv_family\n\n\n\n\n\n\nCode# Get most prominent families 
for text\nhv_family_collate <- hv_family_counts %>% group_by(name, taxid) %>% \n  summarize(n_reads_tot = sum(n_reads_hv),\n            p_reads_max = max(p_reads_hv), .groups=\"drop\") %>% \n  arrange(desc(n_reads_tot))\n\n\nIn investigating individual viral families, to avoid distortions from a few rare reads, I restricted myself to samples where that family made up at least 10% of human-viral reads:\n\nCodethreshold_major_species <- 0.05\ntaxid_adeno <- 10508\n\n# Get set of adenoviridae reads\nadeno_samples <- hv_family_counts %>% filter(taxid == taxid_adeno) %>%\n  filter(p_reads_hv >= 0.1) %>%\n  pull(sample)\nadeno_ids <- hv_reads_family %>% \n  filter(taxid == taxid_adeno, sample %in% adeno_samples) %>%\n  pull(seq_id)\n\n# Count reads for each adenoviridae species\nadeno_species_counts <- hv_reads_species %>%\n  filter(seq_id %in% adeno_ids) %>%\n  group_by(sample, date, sample_type_short, name, taxid) %>%\n  count(name = \"n_reads_hv\") %>%\n  group_by(sample, date, sample_type_short) %>%\n  mutate(p_reads_adeno = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking families and group others\nadeno_species_major_tab <- adeno_species_counts %>% group_by(name) %>% \n  filter(p_reads_adeno == max(p_reads_adeno)) %>% \n  filter(row_number() == 1) %>%\n  arrange(desc(p_reads_adeno)) %>% \n  filter(p_reads_adeno > threshold_major_species)\nadeno_species_counts_major <- adeno_species_counts %>%\n  mutate(name_display = ifelse(name %in% adeno_species_major_tab$name, \n                               name, \"Other\")) %>%\n  group_by(sample, date, sample_type_short, name_display) %>%\n  summarize(n_reads_adeno = sum(n_reads_hv),\n            p_reads_adeno = sum(p_reads_adeno), \n            .groups=\"drop\") %>%\n  mutate(name_display = factor(name_display, \n                               levels = c(adeno_species_major_tab$name, \"Other\")))\nadeno_species_counts_display <- adeno_species_counts_major %>%\n  rename(p_reads = p_reads_adeno, classification = name_display)\n\n# Plot\ng_adeno_species <- g_comp_base + \n  geom_col(data=adeno_species_counts_display, position = \"stack\") +\n  scale_y_continuous(name=\"% Adenoviridae Reads\", limits=c(0,1.01), \n                     breaks = seq(0,1,0.2),\n                     expand=c(0,0), labels = function(y) y*100) +\n  scale_fill_manual(values=palette_viral, name = \"Viral species\") +\n  labs(title=\"Species composition of Adenoviridae reads\") +\n  guides(fill=guide_legend(ncol=3)) +\n  theme(plot.title = element_text(size=rel(1.4), hjust=0, face=\"plain\"))\n\ng_adeno_species\n\n\n\n\n\n\nCode# Get most prominent species for text\nadeno_species_collate <- adeno_species_counts %>% group_by(name, taxid) %>% \n  summarize(n_reads_tot = sum(n_reads_hv), p_reads_mean = mean(p_reads_adeno), .groups=\"drop\") %>% \n  arrange(desc(n_reads_tot))\n\n\n\nCodethreshold_major_species <- 0.1\ntaxid_picorna <- 12058\n\n# Get set of picornaviridae reads\npicorna_samples <- hv_family_counts %>% filter(taxid == taxid_picorna) %>%\n  filter(p_reads_hv >= 0.1) %>%\n  pull(sample)\npicorna_ids <- hv_reads_family %>% \n  filter(taxid == taxid_picorna, sample %in% picorna_samples) %>%\n  pull(seq_id)\n\n# Count reads for each picornaviridae species\npicorna_species_counts <- hv_reads_species %>%\n  filter(seq_id %in% picorna_ids) %>%\n  group_by(sample, date, sample_type_short, name, taxid) %>%\n  count(name = \"n_reads_hv\") %>%\n  group_by(sample, date, sample_type_short) %>%\n  mutate(p_reads_picorna = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking 
families and group others\npicorna_species_major_tab <- picorna_species_counts %>% group_by(name) %>% \n  filter(p_reads_picorna == max(p_reads_picorna)) %>% \n  filter(row_number() == 1) %>%\n  arrange(desc(p_reads_picorna)) %>% \n  filter(p_reads_picorna > threshold_major_species)\npicorna_species_counts_major <- picorna_species_counts %>%\n  mutate(name_display = ifelse(name %in% picorna_species_major_tab$name, \n                               name, \"Other\")) %>%\n  group_by(sample, date, sample_type_short, name_display) %>%\n  summarize(n_reads_picorna = sum(n_reads_hv),\n            p_reads_picorna = sum(p_reads_picorna), \n            .groups=\"drop\") %>%\n  mutate(name_display = factor(name_display, \n                               levels = c(picorna_species_major_tab$name, \"Other\")))\npicorna_species_counts_display <- picorna_species_counts_major %>%\n  rename(p_reads = p_reads_picorna, classification = name_display)\n\n# Plot\ng_picorna_species <- g_comp_base + \n  geom_col(data=picorna_species_counts_display, position = \"stack\") +\n  scale_y_continuous(name=\"% Picornaviridae Reads\", limits=c(0,1.01), \n                     breaks = seq(0,1,0.2),\n                     expand=c(0,0), labels = function(y) y*100) +\n  scale_fill_manual(values=palette_viral, name = \"Viral species\") +\n  labs(title=\"Species composition of Picornaviridae reads\") +\n  guides(fill=guide_legend(ncol=3)) +\n  theme(plot.title = element_text(size=rel(1.4), hjust=0, face=\"plain\"))\n\ng_picorna_species\n\n\n\n\n\n\nCode# Get most prominent species for text\npicorna_species_collate <- picorna_species_counts %>% group_by(name, taxid) %>% \n  summarize(n_reads_tot = sum(n_reads_hv), p_reads_mean = mean(p_reads_picorna), .groups=\"drop\") %>% \n  arrange(desc(n_reads_tot))\n\n\n\nCodethreshold_major_species <- 0.1\ntaxid_polyoma <- 151341\n\n# Get set of polyomaviridae reads\npolyoma_samples <- hv_family_counts %>% filter(taxid == taxid_polyoma) %>%\n  filter(p_reads_hv >= 0.1) %>%\n  pull(sample)\npolyoma_ids <- hv_reads_family %>% \n  filter(taxid == taxid_polyoma, sample %in% polyoma_samples) %>%\n  pull(seq_id)\n\n# Count reads for each polyomaviridae species\npolyoma_species_counts <- hv_reads_species %>%\n  filter(seq_id %in% polyoma_ids) %>%\n  group_by(sample, date, sample_type_short, name, taxid) %>%\n  count(name = \"n_reads_hv\") %>%\n  group_by(sample, date, sample_type_short) %>%\n  mutate(p_reads_polyoma = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking families and group others\npolyoma_species_major_tab <- polyoma_species_counts %>% group_by(name) %>% \n  filter(p_reads_polyoma == max(p_reads_polyoma)) %>% \n  filter(row_number() == 1) %>%\n  arrange(desc(p_reads_polyoma)) %>% \n  filter(p_reads_polyoma > threshold_major_species)\npolyoma_species_counts_major <- polyoma_species_counts %>%\n  mutate(name_display = ifelse(name %in% polyoma_species_major_tab$name, \n                               name, \"Other\")) %>%\n  group_by(sample, date, sample_type_short, name_display) %>%\n  summarize(n_reads_polyoma = sum(n_reads_hv),\n            p_reads_polyoma = sum(p_reads_polyoma), \n            .groups=\"drop\") %>%\n  mutate(name_display = factor(name_display, \n                               levels = c(polyoma_species_major_tab$name, \"Other\")))\npolyoma_species_counts_display <- polyoma_species_counts_major %>%\n  rename(p_reads = p_reads_polyoma, classification = name_display)\n\n# Plot\ng_polyoma_species <- g_comp_base + \n  geom_col(data=polyoma_species_counts_display, 
position = \"stack\") +\n  scale_y_continuous(name=\"% Polyomaviridae Reads\", limits=c(0,1.01), \n                     breaks = seq(0,1,0.2),\n                     expand=c(0,0), labels = function(y) y*100) +\n  scale_fill_manual(values=palette_viral, name = \"Viral species\") +\n  labs(title=\"Species composition of Polyomaviridae reads\") +\n  guides(fill=guide_legend(ncol=3)) +\n  theme(plot.title = element_text(size=rel(1.4), hjust=0, face=\"plain\"))\n\ng_polyoma_species\n\n\n\n\n\n\nCode# Get most prominent species for text\npolyoma_species_collate <- polyoma_species_counts %>% group_by(name, taxid) %>% \n  summarize(n_reads_tot = sum(n_reads_hv), p_reads_mean = mean(p_reads_polyoma), .groups=\"drop\") %>% \n  arrange(desc(n_reads_tot))\n\n\nFinally, here again are the overall relative abundances of the specific viral genera I picked out manually in my last entry:\n\nCode# Define reference genera\npath_genera_rna <- c(\"Mamastrovirus\", \"Enterovirus\", \"Salivirus\", \"Kobuvirus\", \"Norovirus\", \"Sapovirus\", \"Rotavirus\", \"Alphacoronavirus\", \"Betacoronavirus\", \"Alphainfluenzavirus\", \"Betainfluenzavirus\", \"Lentivirus\")\npath_genera_dna <- c(\"Mastadenovirus\", \"Alphapolyomavirus\", \"Betapolyomavirus\", \"Alphapapillomavirus\", \"Betapapillomavirus\", \"Gammapapillomavirus\", \"Orthopoxvirus\", \"Simplexvirus\",\n                     \"Lymphocryptovirus\", \"Cytomegalovirus\", \"Dependoparvovirus\")\npath_genera <- bind_rows(tibble(name=path_genera_rna, genome_type=\"RNA genome\"),\n                         tibble(name=path_genera_dna, genome_type=\"DNA genome\")) %>%\n  left_join(viral_taxa, by=\"name\")\n\n# Count in each sample\nmrg_hv_named_all <- mrg_hv %>% left_join(viral_taxa, by=\"taxid\")\nhv_reads_genus_all <- raise_rank(mrg_hv_named_all, viral_taxa, \"genus\")\nn_path_genera <- hv_reads_genus_all %>% \n  group_by(sample, date, sample_type_short, name, taxid) %>% \n  count(name=\"n_reads_viral\") %>% \n  inner_join(path_genera, by=c(\"name\", \"taxid\")) %>%\n  left_join(read_counts_raw, by=c(\"sample\", \"date\", \"sample_type_short\")) %>%\n  mutate(p_reads_viral = n_reads_viral/n_reads_raw)\n\n# Pivot out and back to add zero lines\nn_path_genera_out <- n_path_genera %>% ungroup %>% select(sample, name, n_reads_viral) %>%\n  pivot_wider(names_from=\"name\", values_from=\"n_reads_viral\", values_fill=0) %>%\n  pivot_longer(-sample, names_to=\"name\", values_to=\"n_reads_viral\") %>%\n  left_join(read_counts_raw, by=\"sample\") %>%\n  left_join(path_genera, by=\"name\") %>%\n  mutate(p_reads_viral = n_reads_viral/n_reads_raw)\n\n## Aggregate across dates\nn_path_genera_stype <- n_path_genera_out %>% \n  group_by(name, taxid, genome_type, sample_type_short) %>%\n  summarize(n_reads_raw = sum(n_reads_raw),\n            n_reads_viral = sum(n_reads_viral), .groups = \"drop\") %>%\n  mutate(sample=\"All samples\", location=\"All locations\",\n         p_reads_viral = n_reads_viral/n_reads_raw,\n         na_type = \"DNA\")\n\n# Plot\ng_path_genera <- ggplot(n_path_genera_stype,\n                        aes(y=name, x=p_reads_viral, color=sample_type_short)) +\n  geom_point() +\n  scale_x_log10(name=\"Relative abundance\") +\n  scale_color_st() +\n  facet_grid(genome_type~., scales=\"free_y\") +\n  theme_base + theme(axis.title.y = element_blank())\ng_path_genera\n\n\n\n\n\n\n\nConclusion\nThis is another dataset with very low HV abundance, arising from lab methods intended to maximize bacterial abundance at the expense of other taxa. 
Nevertheless, this dataset had higher HV relative abundance than the last one. Interestingly, all three wastewater DNA datasets analyzed so far have exhibited a strong predominance of adenoviruses, and especially human mastadenovirus F, among human-infecting viruses. We’ll see if this pattern persists in the other DNA wastewater datasets I have in the queue."
+  },
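As a rough cross-check of the overall human-viral relative abundance quoted in the Ng entry above, the figure can be reproduced (to rounding) from the two totals stated in the text: 482 HV read pairs passing the disjunctive S=20 cutoff, out of roughly 1.7B raw read pairs. The sketch below is illustrative only; the entry's own read_counts_agg computes the exact value from the unrounded per-sample read counts.

Code# Rough cross-check of the overall HV relative abundance (illustrative only)
n_hv_pairs  <- 482     # HV read pairs passing the disjunctive S >= 20 cutoff (from text)
n_raw_pairs <- 1.7e9   # total raw read pairs across all 36 samples (from text, rounded)
signif(n_hv_pairs / n_raw_pairs, 2)  # ~2.8e-07, consistent with the reported 2.90e-07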
+  {
+    "objectID": "notebooks/2024-05-01_maritz.html",
+    "href": "notebooks/2024-05-01_maritz.html",
+    "title": "Workflow analysis of Maritz et al. (2019)",
+    "section": "",
+    "text": "Continuing my analysis of datasets from the P2RA preprint, I analyzed the data from Maritz et al. (2019), a study that used DNA sequencing of wastewater samples to characterize protist diversity and temporal diversity in New York City. Samples for this study underwent direct DNA extraction without a dedicated concentration step, then underwent library prep and Illumina sequencing on a HiSeq Rapid Run (2x250bp).\nThe raw data\n16 samples were collected from 14 treatment plants in NYC in November 2014. These samples yielded 8.6M-18.3M (mean 10.8M) reads per sample, for a total of 172M read pairs (84 gigabases of sequence). Read qualities were mostly high; adapter levels were moderate; inferred duplication levels were low.\n\nCode# Importing the data is a bit more complicated this time as the samples are split across three pipeline runs\ndata_dir <- \"../data/2024-05-01_maritz\"\n\n# Data input paths\nlibraries_path <- file.path(data_dir, \"sample-metadata.csv\")\nbasic_stats_path <- file.path(data_dir, \"qc_basic_stats.tsv.gz\")\nadapter_stats_path <- file.path(data_dir, \"qc_adapter_stats.tsv.gz\")\nquality_base_stats_path <- file.path(data_dir, \"qc_quality_base_stats.tsv.gz\")\nquality_seq_stats_path <- file.path(data_dir, \"qc_quality_sequence_stats.tsv.gz\")\n\n# Import libraries and extract metadata from sample names\nlibraries_raw <- lapply(libraries_path, read_csv, show_col_types = FALSE) %>%\n  bind_rows\nlibraries <- libraries_raw %>%\n  mutate(sample = fct_inorder(sample))\n\n\n\nCode# Import QC data\nstages <- c(\"raw_concat\", \"cleaned\", \"dedup\", \"ribo_initial\", \"ribo_secondary\")\nimport_basic <- function(paths){\n  lapply(paths, read_tsv, show_col_types = FALSE) %>% bind_rows %>%\n    inner_join(libraries, by=\"sample\") %>%\n    arrange(sample) %>%\n    mutate(stage = factor(stage, levels = stages),\n           sample = fct_inorder(sample))\n}\nimport_basic_paired <- function(paths){\n  import_basic(paths) %>% arrange(read_pair) %>% \n    mutate(read_pair = fct_inorder(as.character(read_pair)))\n}\nbasic_stats <- import_basic(basic_stats_path)\nadapter_stats <- import_basic_paired(adapter_stats_path)\nquality_base_stats <- import_basic_paired(quality_base_stats_path)\nquality_seq_stats <- import_basic_paired(quality_seq_stats_path)\n\n# Filter to raw data\nbasic_stats_raw <- basic_stats %>% filter(stage == \"raw_concat\")\nadapter_stats_raw <- adapter_stats %>% filter(stage == \"raw_concat\")\nquality_base_stats_raw <- quality_base_stats %>% filter(stage == \"raw_concat\")\nquality_seq_stats_raw <- quality_seq_stats %>% filter(stage == \"raw_concat\")\n\n# Get key values for readout\nraw_read_counts <- basic_stats_raw %>% ungroup %>% \n  summarize(rmin = min(n_read_pairs), rmax=max(n_read_pairs),\n            rmean=mean(n_read_pairs), \n            rtot = sum(n_read_pairs),\n            btot = sum(n_bases_approx),\n            dmin = min(percent_duplicates), dmax=max(percent_duplicates),\n            dmean=mean(percent_duplicates), .groups = \"drop\")\n\n\n\nCode# Prepare data\nbasic_stats_raw_metrics <- basic_stats_raw %>%\n  select(sample,\n         `# Read pairs` = n_read_pairs,\n         `Total base pairs\\n(approx)` = n_bases_approx,\n         `% Duplicates\\n(FASTQC)` = percent_duplicates) %>%\n  pivot_longer(-(sample), names_to = \"metric\", values_to = \"value\") %>%\n  mutate(metric = fct_inorder(metric))\n\n# Set up plot templates\ng_basic <- ggplot(basic_stats_raw_metrics, aes(x=sample, y=value)) +\n  geom_col(position = \"dodge\") +\n  
scale_y_continuous(expand=c(0,0)) +\n  expand_limits(y=c(0,100)) +\n  facet_grid(metric~., scales = \"free\", space=\"free_x\", switch=\"y\") +\n  theme_kit + theme(\n    axis.title.y = element_blank(),\n    strip.text.y = element_text(face=\"plain\")\n  )\ng_basic\n\n\n\n\n\n\n\n\nCode# Set up plotting templates\ng_qual_raw <- ggplot(mapping=aes(linetype=read_pair, \n                         group=interaction(sample,read_pair))) + \n  scale_linetype_discrete(name = \"Read Pair\") +\n  guides(color=guide_legend(nrow=2,byrow=TRUE),\n         linetype = guide_legend(nrow=2,byrow=TRUE)) +\n  theme_base\n\n# Visualize adapters\ng_adapters_raw <- g_qual_raw + \n  geom_line(aes(x=position, y=pc_adapters), data=adapter_stats_raw) +\n  scale_y_continuous(name=\"% Adapters\", limits=c(0,NA),\n                     breaks = seq(0,100,1), expand=c(0,0)) +\n  scale_x_continuous(name=\"Position\", limits=c(0,NA),\n                     breaks=seq(0,500,20), expand=c(0,0)) +\n  facet_grid(.~adapter)\ng_adapters_raw\n\n\n\n\n\n\nCode# Visualize quality\ng_quality_base_raw <- g_qual_raw +\n  geom_hline(yintercept=25, linetype=\"dashed\", color=\"red\") +\n  geom_hline(yintercept=30, linetype=\"dashed\", color=\"red\") +\n  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats_raw) +\n  scale_y_continuous(name=\"Mean Phred score\", expand=c(0,0), limits=c(10,45)) +\n  scale_x_continuous(name=\"Position\", limits=c(0,NA),\n                     breaks=seq(0,500,20), expand=c(0,0))\ng_quality_base_raw\n\n\n\n\n\n\nCodeg_quality_seq_raw <- g_qual_raw +\n  geom_vline(xintercept=25, linetype=\"dashed\", color=\"red\") +\n  geom_vline(xintercept=30, linetype=\"dashed\", color=\"red\") +\n  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats_raw) +\n  scale_x_continuous(name=\"Mean Phred score\", expand=c(0,0)) +\n  scale_y_continuous(name=\"# Sequences\", expand=c(0,0))\ng_quality_seq_raw\n\n\n\n\n\n\n\nPreprocessing\nAbout 6% of reads on average were lost during cleaning, and a further 2% during deduplication. 
Very few reads were lost during ribodepletion, as expected for DNA sequencing libraries.\n\nCoden_reads_rel <- basic_stats %>% \n  select(sample, stage, \n         percent_duplicates, n_read_pairs) %>%\n  group_by(sample) %>% arrange(sample, stage) %>%\n  mutate(p_reads_retained = replace_na(n_read_pairs / lag(n_read_pairs), 0),\n         p_reads_lost = 1 - p_reads_retained,\n         p_reads_retained_abs = n_read_pairs / n_read_pairs[1],\n         p_reads_lost_abs = 1-p_reads_retained_abs,\n         p_reads_lost_abs_marginal = replace_na(p_reads_lost_abs - lag(p_reads_lost_abs), 0))\nn_reads_rel_display <- n_reads_rel %>% \n  group_by(Stage=stage) %>% \n  summarize(`% Total Reads Lost (Cumulative)` = paste0(round(min(p_reads_lost_abs*100),1), \"-\", round(max(p_reads_lost_abs*100),1), \" (mean \", round(mean(p_reads_lost_abs*100),1), \")\"),\n            `% Total Reads Lost (Marginal)` = paste0(round(min(p_reads_lost_abs_marginal*100),1), \"-\", round(max(p_reads_lost_abs_marginal*100),1), \" (mean \", round(mean(p_reads_lost_abs_marginal*100),1), \")\"), .groups=\"drop\") %>% \n  filter(Stage != \"raw_concat\") %>%\n  mutate(Stage = Stage %>% as.numeric %>% factor(labels=c(\"Trimming & filtering\", \"Deduplication\", \"Initial ribodepletion\", \"Secondary ribodepletion\")))\nn_reads_rel_display\n\n\n  \n\n\n\n\nCodeg_stage_base <- ggplot(mapping=aes(x=stage, group=sample)) +\n  theme_kit\n\n# Plot reads over preprocessing\ng_reads_stages <- g_stage_base +\n  geom_line(aes(y=n_read_pairs), data=basic_stats) +\n  scale_y_continuous(\"# Read pairs\", expand=c(0,0), limits=c(0,NA))\ng_reads_stages\n\n\n\n\n\n\nCode# Plot relative read losses during preprocessing\ng_reads_rel <- g_stage_base +\n  geom_line(aes(y=p_reads_lost_abs_marginal), data=n_reads_rel) +\n  scale_y_continuous(\"% Total Reads Lost\", expand=c(0,0), \n                     labels = function(x) x*100)\ng_reads_rel\n\n\n\n\n\n\n\nData cleaning was very successful at removing adapters and improving read qualities:\n\nCodeg_qual <- ggplot(mapping=aes(linetype=read_pair, \n                         group=interaction(sample,read_pair))) + \n  scale_linetype_discrete(name = \"Read Pair\") +\n  guides(color=guide_legend(nrow=2,byrow=TRUE),\n         linetype = guide_legend(nrow=2,byrow=TRUE)) +\n  theme_base\n\n# Visualize adapters\ng_adapters <- g_qual + \n  geom_line(aes(x=position, y=pc_adapters), data=adapter_stats) +\n  scale_y_continuous(name=\"% Adapters\", limits=c(0,20),\n                     breaks = seq(0,50,10), expand=c(0,0)) +\n  scale_x_continuous(name=\"Position\", limits=c(0,NA),\n                     breaks=seq(0,140,20), expand=c(0,0)) +\n  facet_grid(stage~adapter)\ng_adapters\n\n\n\n\n\n\nCode# Visualize quality\ng_quality_base <- g_qual +\n  geom_hline(yintercept=25, linetype=\"dashed\", color=\"red\") +\n  geom_hline(yintercept=30, linetype=\"dashed\", color=\"red\") +\n  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats) +\n  scale_y_continuous(name=\"Mean Phred score\", expand=c(0,0), limits=c(10,45)) +\n  scale_x_continuous(name=\"Position\", limits=c(0,NA),\n                     breaks=seq(0,140,20), expand=c(0,0)) +\n  facet_grid(stage~.)\ng_quality_base\n\n\n\n\n\n\nCodeg_quality_seq <- g_qual +\n  geom_vline(xintercept=25, linetype=\"dashed\", color=\"red\") +\n  geom_vline(xintercept=30, linetype=\"dashed\", color=\"red\") +\n  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats) +\n  scale_x_continuous(name=\"Mean Phred score\", expand=c(0,0)) +\n  
scale_y_continuous(name=\"# Sequences\", expand=c(0,0)) +\n  facet_grid(stage~.)\ng_quality_seq\n\n\n\n\n\n\n\nAccording to FASTQC, cleaning + deduplication was very effective at reducing measured duplicate levels in the few samples that required it:\n\nCodestage_dup <- basic_stats %>% group_by(stage) %>% \n  summarize(dmin = min(percent_duplicates), dmax=max(percent_duplicates),\n            dmean=mean(percent_duplicates), .groups = \"drop\")\n\ng_dup_stages <- g_stage_base +\n  geom_line(aes(y=percent_duplicates), data=basic_stats) +\n  scale_y_continuous(\"% Duplicates\", limits=c(0,NA), expand=c(0,0))\ng_dup_stages\n\n\n\n\n\n\nCodeg_readlen_stages <- g_stage_base + \n  geom_line(aes(y=mean_seq_len), data=basic_stats) +\n  scale_y_continuous(\"Mean read length (nt)\", expand=c(0,0), limits=c(0,NA))\ng_readlen_stages\n\n\n\n\n\n\n\nHigh-level composition\nAs before, to assess the high-level composition of the reads, I ran the ribodepleted files through Kraken (using the Standard 16 database) and summarized the results with Bracken. Combining these results with the read counts above gives us a breakdown of the inferred composition of the samples:\n\nCodeclassifications <- c(\"Filtered\", \"Duplicate\", \"Ribosomal\", \"Unassigned\",\n                     \"Bacterial\", \"Archaeal\", \"Viral\", \"Human\")\n\n# Import composition data\ncomp_path <- file.path(data_dir, \"taxonomic_composition.tsv.gz\")\ncomp <- read_tsv(comp_path, show_col_types = FALSE) %>%\n  left_join(libraries, by=\"sample\") %>%\n  mutate(classification = factor(classification, levels = classifications))\n  \n\n# Summarize composition\nread_comp_summ <- comp %>% \n  group_by(classification) %>%\n  summarize(n_reads = sum(n_reads), .groups = \"drop_last\") %>%\n  mutate(n_reads = replace_na(n_reads,0),\n    p_reads = n_reads/sum(n_reads),\n    pc_reads = p_reads*100)\n\n\n\nCode# Prepare plotting templates\ng_comp_base <- ggplot(mapping=aes(x=sample, y=p_reads, fill=classification)) +\n  theme_kit\nscale_y_pc_reads <- purrr::partial(scale_y_continuous, name = \"% Reads\",\n                                   expand = c(0,0), labels = function(y) y*100)\n\n# Plot overall composition\ng_comp <- g_comp_base + geom_col(data = comp, position = \"stack\", width=1) +\n  scale_y_pc_reads(limits = c(0,1.01), breaks = seq(0,1,0.2)) +\n  scale_fill_brewer(palette = \"Set1\", name = \"Classification\")\ng_comp\n\n\n\n\n\n\nCode# Plot composition of minor components\ncomp_minor <- comp %>% \n  filter(classification %in% c(\"Archaeal\", \"Viral\", \"Human\", \"Other\"))\npalette_minor <- brewer.pal(9, \"Set1\")[6:9]\ng_comp_minor <- g_comp_base + \n  geom_col(data=comp_minor, position = \"stack\", width=1) +\n  scale_y_pc_reads() +\n  scale_fill_manual(values=palette_minor, name = \"Classification\")\ng_comp_minor\n\n\n\n\n\n\n\n\nCodep_reads_summ_group <- comp %>%\n  mutate(classification = ifelse(classification %in% c(\"Filtered\", \"Duplicate\", \"Unassigned\"), \"Excluded\", as.character(classification)),\n         classification = fct_inorder(classification)) %>%\n  group_by(classification, sample) %>%\n  summarize(p_reads = sum(p_reads), .groups = \"drop\") %>%\n  group_by(classification) %>%\n  summarize(pc_min = min(p_reads)*100, pc_max = max(p_reads)*100, \n            pc_mean = mean(p_reads)*100, .groups = \"drop\")\np_reads_summ_prep <- p_reads_summ_group %>%\n  mutate(classification = fct_inorder(classification),\n         pc_min = pc_min %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),\n     
    pc_max = pc_max %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),\n         pc_mean = pc_mean %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),\n         display = paste0(pc_min, \"-\", pc_max, \"% (mean \", pc_mean, \"%)\"))\np_reads_summ <- p_reads_summ_prep %>%\n  select(Classification=classification, \n         `Read Fraction`=display) %>%\n  arrange(Classification)\np_reads_summ\n\n\n  \n\n\n\nAs in previous DNA datasets, the vast majority of classified reads were bacterial in origin. Viral fraction averaged 0.13%, though one samples (NYC-08) reached almost 1%. As is common for DNA data, viral reads were overwhelmingly dominated by Caudoviricetes phages:\n\nCode# Get Kraken reports\nreports_path <- file.path(data_dir, \"kraken_reports.tsv.gz\")\nreports <- read_tsv(reports_path, show_col_types = FALSE)\n\n# Get viral taxonomy\nviral_taxa_path <- file.path(data_dir, \"viral-taxids.tsv.gz\")\nviral_taxa <- read_tsv(viral_taxa_path, show_col_types = FALSE)\n\n# Filter to viral taxa\nkraken_reports_viral <- filter(reports, taxid %in% viral_taxa$taxid) %>%\n  group_by(sample) %>%\n  mutate(p_reads_viral = n_reads_clade/n_reads_clade[1])\nkraken_reports_viral_cleaned <- kraken_reports_viral %>%\n  inner_join(libraries, by=\"sample\") %>%\n  select(-pc_reads_total, -n_reads_direct, -contains(\"minimizers\")) %>%\n  select(name, taxid, p_reads_viral, n_reads_clade, everything())\n\nviral_classes <- kraken_reports_viral_cleaned %>% filter(rank == \"C\")\nviral_families <- kraken_reports_viral_cleaned %>% filter(rank == \"F\")\n\n\n\nCodemajor_threshold <- 0.02\n\n# Identify major viral classes\nviral_classes_major_tab <- viral_classes %>% \n  group_by(name, taxid) %>%\n  summarize(p_reads_viral_max = max(p_reads_viral), .groups=\"drop\") %>%\n  filter(p_reads_viral_max >= major_threshold)\nviral_classes_major_list <- viral_classes_major_tab %>% pull(name)\nviral_classes_major <- viral_classes %>% \n  filter(name %in% viral_classes_major_list) %>%\n  select(name, taxid, sample, p_reads_viral)\nviral_classes_minor <- viral_classes_major %>% \n  group_by(sample) %>%\n  summarize(p_reads_viral_major = sum(p_reads_viral), .groups = \"drop\") %>%\n  mutate(name = \"Other\", taxid=NA, p_reads_viral = 1-p_reads_viral_major) %>%\n  select(name, taxid, sample, p_reads_viral)\nviral_classes_display <- bind_rows(viral_classes_major, viral_classes_minor) %>%\n  arrange(desc(p_reads_viral)) %>% \n  mutate(name = factor(name, levels=c(viral_classes_major_list, \"Other\")),\n         p_reads_viral = pmax(p_reads_viral, 0)) %>%\n  rename(p_reads = p_reads_viral, classification=name)\n\npalette_viral <- c(brewer.pal(12, \"Set3\"), brewer.pal(8, \"Dark2\"))\ng_classes <- g_comp_base + \n  geom_col(data=viral_classes_display, position = \"stack\", width=1) +\n  scale_y_continuous(name=\"% Viral Reads\", limits=c(0,1.01), breaks = seq(0,1,0.2),\n                     expand=c(0,0), labels = function(y) y*100) +\n  scale_fill_manual(values=palette_viral, name = \"Viral class\")\n  \ng_classes\n\n\n\n\n\n\n\nHuman-infecting virus reads: validation\nNext, I investigated the human-infecting virus read content of these unenriched samples. 
A grand total of 199 reads were identified as putatively human-viral:\n\nCode# Import HV read data\nhv_reads_filtered_path <- file.path(data_dir, \"hv_hits_putative_filtered.tsv.gz\")\nhv_reads_filtered <- lapply(hv_reads_filtered_path, read_tsv,\n                            show_col_types = FALSE) %>%\n  bind_rows() %>%\n  left_join(libraries, by=\"sample\")\n\n# Count reads\nn_hv_filtered <- hv_reads_filtered %>%\n  group_by(sample, seq_id) %>% count %>%\n  group_by(sample) %>% count %>% \n  inner_join(basic_stats %>% filter(stage == \"ribo_initial\") %>% \n               select(sample, n_read_pairs), by=\"sample\") %>% \n  rename(n_putative = n, n_total = n_read_pairs) %>% \n  mutate(p_reads = n_putative/n_total, pc_reads = p_reads * 100)\nn_hv_filtered_summ <- n_hv_filtered %>% ungroup %>%\n  summarize(n_putative = sum(n_putative), n_total = sum(n_total), \n            .groups=\"drop\") %>% \n  mutate(p_reads = n_putative/n_total, pc_reads = p_reads*100)\n\n\n\nCode# Collapse multi-entry sequences\nrmax <- purrr::partial(max, na.rm = TRUE)\ncollapse <- function(x) ifelse(all(x == x[1]), x[1], paste(x, collapse=\"/\"))\nmrg <- hv_reads_filtered %>% \n  mutate(adj_score_max = pmax(adj_score_fwd, adj_score_rev, na.rm = TRUE)) %>%\n  arrange(desc(adj_score_max)) %>%\n  group_by(seq_id) %>%\n  summarize(sample = collapse(sample),\n            genome_id = collapse(genome_id),\n            taxid_best = taxid[1],\n            taxid = collapse(as.character(taxid)),\n            best_alignment_score_fwd = rmax(best_alignment_score_fwd),\n            best_alignment_score_rev = rmax(best_alignment_score_rev),\n            query_len_fwd = rmax(query_len_fwd),\n            query_len_rev = rmax(query_len_rev),\n            query_seq_fwd = query_seq_fwd[!is.na(query_seq_fwd)][1],\n            query_seq_rev = query_seq_rev[!is.na(query_seq_rev)][1],\n            classified = rmax(classified),\n            assigned_name = collapse(assigned_name),\n            assigned_taxid_best = assigned_taxid[1],\n            assigned_taxid = collapse(as.character(assigned_taxid)),\n            assigned_hv = rmax(assigned_hv),\n            hit_hv = rmax(hit_hv),\n            encoded_hits = collapse(encoded_hits),\n            adj_score_fwd = rmax(adj_score_fwd),\n            adj_score_rev = rmax(adj_score_rev)\n            ) %>%\n  inner_join(libraries, by=\"sample\") %>%\n  mutate(kraken_label = ifelse(assigned_hv, \"Kraken2 HV\\nassignment\",\n                               ifelse(hit_hv, \"Kraken2 HV\\nhit\",\n                                      \"No hit or\\nassignment\"))) %>%\n  mutate(adj_score_max = pmax(adj_score_fwd, adj_score_rev),\n         highscore = adj_score_max >= 20)\n\n# Plot results\ngeom_vhist <- purrr::partial(geom_histogram, binwidth=5, boundary=0)\ng_vhist_base <- ggplot(mapping=aes(x=adj_score_max)) +\n  geom_vline(xintercept=20, linetype=\"dashed\", color=\"red\") +\n  facet_wrap(~kraken_label, labeller = labeller(kit = label_wrap_gen(20)), scales = \"free_y\") +\n  scale_x_continuous(name = \"Maximum adjusted alignment score\") + \n  scale_y_continuous(name=\"# Read pairs\") + \n  theme_base \ng_vhist_0 <- g_vhist_base + geom_vhist(data=mrg)\ng_vhist_0\n\n\n\n\n\n\n\nBLASTing these reads against nt, we find that the pipeline performs well, with only a single high-scoring false-positive read:\n\nCode# Import paired BLAST results\nblast_paired_path <- file.path(data_dir, \"hv_hits_blast_paired.tsv.gz\")\nblast_paired <- read_tsv(blast_paired_path, show_col_types = FALSE)\n\n# Add viral 
status\nblast_viral <- mutate(blast_paired, viral = staxid %in% viral_taxa$taxid) %>%\n  mutate(viral_full = viral & n_reads == 2)\n\n# Compare to Kraken & Bowtie assignments\nmatch_taxid <- function(taxid_1, taxid_2){\n  p1 <- mapply(grepl, paste0(\"/\", taxid_1, \"$\"), taxid_2)\n  p2 <- mapply(grepl, paste0(\"^\", taxid_1, \"/\"), taxid_2)\n  p3 <- mapply(grepl, paste0(\"^\", taxid_1, \"$\"), taxid_2)\n  out <- setNames(p1|p2|p3, NULL)\n  return(out)\n}\nmrg_assign <- mrg %>% select(sample, seq_id, taxid, assigned_taxid, adj_score_max)\nblast_assign <- inner_join(blast_viral, mrg_assign, by=\"seq_id\") %>%\n    mutate(taxid_match_bowtie = match_taxid(staxid, taxid),\n           taxid_match_kraken = match_taxid(staxid, assigned_taxid),\n           taxid_match_any = taxid_match_bowtie | taxid_match_kraken)\nblast_out <- blast_assign %>%\n  group_by(seq_id) %>%\n  summarize(viral_status = ifelse(any(viral_full), 2,\n                                  ifelse(any(taxid_match_any), 2,\n                                             ifelse(any(viral), 1, 0))),\n            .groups = \"drop\")\n\n\n\nCode# Merge BLAST results with unenriched read data\nmrg_blast <- full_join(mrg, blast_out, by=\"seq_id\") %>%\n  mutate(viral_status = replace_na(viral_status, 0),\n         viral_status_out = ifelse(viral_status == 0, FALSE, TRUE))\n\n# Plot\ng_vhist_1 <- g_vhist_base + geom_vhist(data=mrg_blast, mapping=aes(fill=viral_status_out)) +\n  scale_fill_brewer(palette = \"Set1\", name = \"Viral status\")\ng_vhist_1\n\n\n\n\n\n\n\nMy usual disjunctive score threshold of 20 gave precision, sensitivity, and F1 scores all >96%:\n\nCodetest_sens_spec <- function(tab, score_threshold){\n  tab_retained <- tab %>% \n    mutate(retain_score = (adj_score_fwd > score_threshold | adj_score_rev > score_threshold),\n           retain = assigned_hv | retain_score) %>%\n    group_by(viral_status_out, retain) %>% count\n  pos_tru <- tab_retained %>% filter(viral_status_out == \"TRUE\", retain) %>% pull(n) %>% sum\n  pos_fls <- tab_retained %>% filter(viral_status_out != \"TRUE\", retain) %>% pull(n) %>% sum\n  neg_tru <- tab_retained %>% filter(viral_status_out != \"TRUE\", !retain) %>% pull(n) %>% sum\n  neg_fls <- tab_retained %>% filter(viral_status_out == \"TRUE\", !retain) %>% pull(n) %>% sum\n  sensitivity <- pos_tru / (pos_tru + neg_fls)\n  specificity <- neg_tru / (neg_tru + pos_fls)\n  precision   <- pos_tru / (pos_tru + pos_fls)\n  f1 <- 2 * precision * sensitivity / (precision + sensitivity)\n  out <- tibble(threshold=score_threshold, sensitivity=sensitivity, \n                specificity=specificity, precision=precision, f1=f1)\n  return(out)\n}\nrange_f1 <- function(intab, inrange=15:45){\n  tss <- purrr::partial(test_sens_spec, tab=intab)\n  stats <- lapply(inrange, tss) %>% bind_rows %>%\n    pivot_longer(!threshold, names_to=\"metric\", values_to=\"value\")\n  return(stats)\n}\nstats_0 <- range_f1(mrg_blast)\ng_stats_0 <- ggplot(stats_0, aes(x=threshold, y=value, color=metric)) +\n  geom_vline(xintercept=20, color = \"red\", linetype = \"dashed\") +\n  geom_line() +\n  scale_y_continuous(name = \"Value\", limits=c(0,1), breaks = seq(0,1,0.2), expand = c(0,0)) +\n  scale_x_continuous(name = \"Adjusted Score Threshold\", expand = c(0,0)) +\n  scale_color_brewer(palette=\"Dark2\") +\n  theme_base\ng_stats_0\n\n\n\n\n\n\nCodestats_0 %>% filter(threshold == 20) %>% \n  select(Threshold=threshold, Metric=metric, Value=value)\n\n\n  \n\n\n\nHuman-infecting viruses: overall relative abundance\n\nCode# Get raw read 
counts\nread_counts_raw <- basic_stats_raw %>%\n  select(sample, n_reads_raw = n_read_pairs)\n\n# Get HV read counts\nmrg_hv <- mrg %>% mutate(hv_status = assigned_hv | highscore) %>%\n  rename(taxid_all = taxid, taxid = taxid_best)\nread_counts_hv <- mrg_hv %>% filter(hv_status) %>% group_by(sample) %>% \n  count(name=\"n_reads_hv\")\nread_counts <- read_counts_raw %>% left_join(read_counts_hv, by=\"sample\") %>%\n  mutate(n_reads_hv = replace_na(n_reads_hv, 0))\n\n# Aggregate\nread_counts_grp <- read_counts %>%\n  summarize(n_reads_raw = sum(n_reads_raw),\n            n_reads_hv = sum(n_reads_hv), .groups=\"drop\") %>%\n  mutate(sample= \"All samples\")\nread_counts_agg <- bind_rows(read_counts, read_counts_grp) %>%\n  mutate(p_reads_hv = n_reads_hv/n_reads_raw,\n         sample = factor(sample, levels=c(levels(libraries$sample), \"All samples\")))\n\n\nApplying a disjunctive cutoff at S=20 identifies 162 read pairs as human-viral. This gives an overall relative HV abundance of \\(9.42 \\times 10^{-7}\\); higher than Ng and Bengtsson-Palme but lower than most other datasets I’ve analyzed with this pipeline:\n\nCode# Visualize\ng_phv_agg <- ggplot(read_counts_agg, aes(x=sample)) +\n  geom_point(aes(y=p_reads_hv)) +\n  scale_y_log10(\"Relative abundance of human virus reads\") +\n  theme_kit\ng_phv_agg\n\n\n\n\n\n\n\n\nCode# Collate past RA values\nra_past <- tribble(~dataset, ~ra, ~na_type, ~panel_enriched,\n                   \"Brumfield\", 5e-5, \"RNA\", FALSE,\n                   \"Brumfield\", 3.66e-7, \"DNA\", FALSE,\n                   \"Spurbeck\", 5.44e-6, \"RNA\", FALSE,\n                   \"Yang\", 3.62e-4, \"RNA\", FALSE,\n                   \"Rothman (unenriched)\", 1.87e-5, \"RNA\", FALSE,\n                   \"Rothman (panel-enriched)\", 3.3e-5, \"RNA\", TRUE,\n                   \"Crits-Christoph (unenriched)\", 1.37e-5, \"RNA\", FALSE,\n                   \"Crits-Christoph (panel-enriched)\", 1.26e-2, \"RNA\", TRUE,\n                   \"Prussin (non-control)\", 1.63e-5, \"RNA\", FALSE,\n                   \"Prussin (non-control)\", 4.16e-5, \"DNA\", FALSE,\n                   \"Rosario (non-control)\", 1.21e-5, \"RNA\", FALSE,\n                   \"Rosario (non-control)\", 1.50e-4, \"DNA\", FALSE,\n                   \"Leung\", 1.73e-5, \"DNA\", FALSE,\n                   \"Brinch\", 3.88e-6, \"DNA\", FALSE,\n                   \"Bengtsson-Palme\", 8.86e-8, \"DNA\", FALSE,\n                   \"Ng\", 2.90e-7, \"DNA\", FALSE\n)\n\n# Collate new RA values\nra_new <- tribble(~dataset, ~ra, ~na_type, ~panel_enriched,\n                  \"Maritz\", 9.42e-7, \"DNA\", FALSE)\n\n\n# Plot\nscale_color_na <- purrr::partial(scale_color_brewer, palette=\"Set1\",\n                                 name=\"Nucleic acid type\")\nra_comp <- bind_rows(ra_past, ra_new) %>% mutate(dataset = fct_inorder(dataset))\ng_ra_comp <- ggplot(ra_comp, aes(y=dataset, x=ra, color=na_type)) +\n  geom_point() +\n  scale_color_na() +\n  scale_x_log10(name=\"Relative abundance of human virus reads\") +\n  theme_base + theme(axis.title.y = element_blank())\ng_ra_comp\n\n\n\n\n\n\n\nHuman-infecting viruses: taxonomy and composition\nIn investigating the taxonomy of human-infecting virus reads, I restricted my analysis to samples with more than 5 HV read pairs total across all viruses, to reduce noise arising from extremely low HV read counts in some samples. 10 samples met this criterion.\nAt the family level, most samples were dominated by Adenoviridae, Polyomaviridae and Papillomaviridae. 
However, one sample, NYC-03, was overwhelmingly dominated by Herpesviridae:\n\nCode# Get viral taxon names for putative HV reads\nviral_taxa$name[viral_taxa$taxid == 249588] <- \"Mamastrovirus\"\nviral_taxa$name[viral_taxa$taxid == 194960] <- \"Kobuvirus\"\nviral_taxa$name[viral_taxa$taxid == 688449] <- \"Salivirus\"\nviral_taxa$name[viral_taxa$taxid == 585893] <- \"Picobirnaviridae\"\nviral_taxa$name[viral_taxa$taxid == 333922] <- \"Betapapillomavirus\"\nviral_taxa$name[viral_taxa$taxid == 334207] <- \"Betapapillomavirus 3\"\nviral_taxa$name[viral_taxa$taxid == 369960] <- \"Porcine type-C oncovirus\"\nviral_taxa$name[viral_taxa$taxid == 333924] <- \"Betapapillomavirus 2\"\nviral_taxa$name[viral_taxa$taxid == 687329] <- \"Anelloviridae\"\nviral_taxa$name[viral_taxa$taxid == 325455] <- \"Gammapapillomavirus\"\nviral_taxa$name[viral_taxa$taxid == 333750] <- \"Alphapapillomavirus\"\nviral_taxa$name[viral_taxa$taxid == 694002] <- \"Betacoronavirus\"\nviral_taxa$name[viral_taxa$taxid == 334202] <- \"Mupapillomavirus\"\nviral_taxa$name[viral_taxa$taxid == 197911] <- \"Alphainfluenzavirus\"\nviral_taxa$name[viral_taxa$taxid == 186938] <- \"Respirovirus\"\nviral_taxa$name[viral_taxa$taxid == 333926] <- \"Gammapapillomavirus 1\"\nviral_taxa$name[viral_taxa$taxid == 337051] <- \"Betapapillomavirus 1\"\nviral_taxa$name[viral_taxa$taxid == 337043] <- \"Alphapapillomavirus 4\"\nviral_taxa$name[viral_taxa$taxid == 694003] <- \"Betacoronavirus 1\"\nviral_taxa$name[viral_taxa$taxid == 334204] <- \"Mupapillomavirus 2\"\nviral_taxa$name[viral_taxa$taxid == 334208] <- \"Betapapillomavirus 4\"\nviral_taxa$name[viral_taxa$taxid == 333928] <- \"Gammapapillomavirus 2\"\nviral_taxa$name[viral_taxa$taxid == 337039] <- \"Alphapapillomavirus 2\"\nviral_taxa$name[viral_taxa$taxid == 333929] <- \"Gammapapillomavirus 3\"\nviral_taxa$name[viral_taxa$taxid == 337042] <- \"Alphapapillomavirus 7\"\nviral_taxa$name[viral_taxa$taxid == 334203] <- \"Mupapillomavirus 1\"\nviral_taxa$name[viral_taxa$taxid == 333757] <- \"Alphapapillomavirus 8\"\nviral_taxa$name[viral_taxa$taxid == 337050] <- \"Alphapapillomavirus 6\"\nviral_taxa$name[viral_taxa$taxid == 333767] <- \"Alphapapillomavirus 3\"\nviral_taxa$name[viral_taxa$taxid == 333754] <- \"Alphapapillomavirus 10\"\nviral_taxa$name[viral_taxa$taxid == 687363] <- \"Torque teno virus 24\"\nviral_taxa$name[viral_taxa$taxid == 687342] <- \"Torque teno virus 3\"\nviral_taxa$name[viral_taxa$taxid == 687359] <- \"Torque teno virus 20\"\nviral_taxa$name[viral_taxa$taxid == 194441] <- \"Primate T-lymphotropic virus 2\"\nviral_taxa$name[viral_taxa$taxid == 334209] <- \"Betapapillomavirus 5\"\nviral_taxa$name[viral_taxa$taxid == 194965] <- \"Aichivirus B\"\nviral_taxa$name[viral_taxa$taxid == 333930] <- \"Gammapapillomavirus 4\"\nviral_taxa$name[viral_taxa$taxid == 337048] <- \"Alphapapillomavirus 1\"\nviral_taxa$name[viral_taxa$taxid == 337041] <- \"Alphapapillomavirus 9\"\nviral_taxa$name[viral_taxa$taxid == 337049] <- \"Alphapapillomavirus 11\"\nviral_taxa$name[viral_taxa$taxid == 337044] <- \"Alphapapillomavirus 5\"\n\n# Filter samples and add viral taxa information\nsamples_keep <- read_counts %>% filter(n_reads_hv > 5) %>% pull(sample)\nmrg_hv_named <- mrg_hv %>% filter(sample %in% samples_keep, hv_status) %>% left_join(viral_taxa, by=\"taxid\") \n\n# Discover viral species & genera for HV reads\nraise_rank <- function(read_db, taxid_db, out_rank = \"species\", verbose = FALSE){\n  # Get higher ranks than search rank\n  ranks <- c(\"subspecies\", \"species\", \"subgenus\", \"genus\", 
\"subfamily\", \"family\", \"suborder\", \"order\", \"class\", \"subphylum\", \"phylum\", \"kingdom\", \"superkingdom\")\n  rank_match <- which.max(ranks == out_rank)\n  high_ranks <- ranks[rank_match:length(ranks)]\n  # Merge read DB and taxid DB\n  reads <- read_db %>% select(-parent_taxid, -rank, -name) %>%\n    left_join(taxid_db, by=\"taxid\")\n  # Extract sequences that are already at appropriate rank\n  reads_rank <- filter(reads, rank == out_rank)\n  # Drop sequences at a higher rank and return unclassified sequences\n  reads_norank <- reads %>% filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))\n  while(nrow(reads_norank) > 0){ # As long as there are unclassified sequences...\n    # Promote read taxids and re-merge with taxid DB, then re-classify and filter\n    reads_remaining <- reads_norank %>% mutate(taxid = parent_taxid) %>%\n      select(-parent_taxid, -rank, -name) %>%\n      left_join(taxid_db, by=\"taxid\")\n    reads_rank <- reads_remaining %>% filter(rank == out_rank) %>%\n      bind_rows(reads_rank)\n    reads_norank <- reads_remaining %>%\n      filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))\n  }\n  # Finally, extract and append reads that were excluded during the process\n  reads_dropped <- reads %>% filter(!seq_id %in% reads_rank$seq_id)\n  reads_out <- reads_rank %>% bind_rows(reads_dropped) %>%\n    select(-parent_taxid, -rank, -name) %>%\n    left_join(taxid_db, by=\"taxid\")\n  return(reads_out)\n}\nhv_reads_species <- raise_rank(mrg_hv_named, viral_taxa, \"species\")\nhv_reads_genus <- raise_rank(mrg_hv_named, viral_taxa, \"genus\")\nhv_reads_family <- raise_rank(mrg_hv_named, viral_taxa, \"family\")\n\n\n\nCodethreshold_major_family <- 0.02\n\n# Count reads for each human-viral family\nhv_family_counts <- hv_reads_family %>% \n  group_by(sample, name, taxid) %>%\n  count(name = \"n_reads_hv\") %>%\n  group_by(sample) %>%\n  mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking families and group others\nhv_family_major_tab <- hv_family_counts %>% group_by(name) %>% \n  filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%\n  arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > threshold_major_family)\nhv_family_counts_major <- hv_family_counts %>%\n  mutate(name_display = ifelse(name %in% hv_family_major_tab$name, name, \"Other\")) %>%\n  group_by(sample, name_display) %>%\n  summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), \n            .groups=\"drop\") %>%\n  mutate(name_display = factor(name_display, \n                               levels = c(hv_family_major_tab$name, \"Other\")))\nhv_family_counts_display <- hv_family_counts_major %>%\n  rename(p_reads = p_reads_hv, classification = name_display)\n\n# Plot\ng_hv_family <- g_comp_base + \n  geom_col(data=hv_family_counts_display, position = \"stack\", width=1) +\n  scale_y_continuous(name=\"% HV Reads\", limits=c(0,1.01), \n                     breaks = seq(0,1,0.2),\n                     expand=c(0,0), labels = function(y) y*100) +\n  scale_fill_manual(values=palette_viral, name = \"Viral family\") +\n  labs(title=\"Family composition of human-viral reads\") +\n  guides(fill=guide_legend(ncol=4)) +\n  theme(plot.title = element_text(size=rel(1.4), hjust=0, face=\"plain\"))\ng_hv_family\n\n\n\n\n\n\nCode# Get most prominent families for text\nhv_family_collate <- hv_family_counts %>% group_by(name, taxid) %>% \n  summarize(n_reads_tot = sum(n_reads_hv),\n            p_reads_max = max(p_reads_hv), .groups=\"drop\") %>% \n  
arrange(desc(n_reads_tot))\n\n\nIn investigating individual viral families, to avoid distortions from a few rare reads, I restricted myself to samples where that family made up at least 10% of human-viral reads:\n\nCodethreshold_major_species <- 0.05\ntaxid_adeno <- 10508\n\n# Get set of adenoviridae reads\nadeno_samples <- hv_family_counts %>% filter(taxid == taxid_adeno) %>%\n  filter(p_reads_hv >= 0.1) %>%\n  pull(sample)\nadeno_ids <- hv_reads_family %>% \n  filter(taxid == taxid_adeno, sample %in% adeno_samples) %>%\n  pull(seq_id)\n\n# Count reads for each adenoviridae species\nadeno_species_counts <- hv_reads_species %>%\n  filter(seq_id %in% adeno_ids) %>%\n  group_by(sample, name, taxid) %>%\n  count(name = \"n_reads_hv\") %>%\n  group_by(sample) %>%\n  mutate(p_reads_adeno = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking families and group others\nadeno_species_major_tab <- adeno_species_counts %>% group_by(name) %>% \n  filter(p_reads_adeno == max(p_reads_adeno)) %>% \n  filter(row_number() == 1) %>%\n  arrange(desc(p_reads_adeno)) %>% \n  filter(p_reads_adeno > threshold_major_species)\nadeno_species_counts_major <- adeno_species_counts %>%\n  mutate(name_display = ifelse(name %in% adeno_species_major_tab$name, \n                               name, \"Other\")) %>%\n  group_by(sample, name_display) %>%\n  summarize(n_reads_adeno = sum(n_reads_hv),\n            p_reads_adeno = sum(p_reads_adeno), \n            .groups=\"drop\") %>%\n  mutate(name_display = factor(name_display, \n                               levels = c(adeno_species_major_tab$name, \"Other\")))\nadeno_species_counts_display <- adeno_species_counts_major %>%\n  rename(p_reads = p_reads_adeno, classification = name_display)\n\n# Plot\ng_adeno_species <- g_comp_base + \n  geom_col(data=adeno_species_counts_display, position = \"stack\", width=1) +\n  scale_y_continuous(name=\"% Adenoviridae Reads\", limits=c(0,1.01), \n                     breaks = seq(0,1,0.2),\n                     expand=c(0,0), labels = function(y) y*100) +\n  scale_fill_manual(values=palette_viral, name = \"Viral species\") +\n  labs(title=\"Species composition of Adenoviridae reads\") +\n  guides(fill=guide_legend(ncol=3)) +\n  theme(plot.title = element_text(size=rel(1.4), hjust=0, face=\"plain\"))\n\ng_adeno_species\n\n\n\n\n\n\nCode# Get most prominent species for text\nadeno_species_collate <- adeno_species_counts %>% group_by(name, taxid) %>% \n  summarize(n_reads_tot = sum(n_reads_hv), p_reads_mean = mean(p_reads_adeno), .groups=\"drop\") %>% \n  arrange(desc(n_reads_tot))\n\n\n\nCodethreshold_major_species <- 0.1\ntaxid_polyoma <- 151341\n\n# Get set of polyomaviridae reads\npolyoma_samples <- hv_family_counts %>% filter(taxid == taxid_polyoma) %>%\n  filter(p_reads_hv >= 0.1) %>%\n  pull(sample)\npolyoma_ids <- hv_reads_family %>% \n  filter(taxid == taxid_polyoma, sample %in% polyoma_samples) %>%\n  pull(seq_id)\n\n# Count reads for each polyomaviridae species\npolyoma_species_counts <- hv_reads_species %>%\n  filter(seq_id %in% polyoma_ids) %>%\n  group_by(sample, name, taxid) %>%\n  count(name = \"n_reads_hv\") %>%\n  group_by(sample) %>%\n  mutate(p_reads_polyoma = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking families and group others\npolyoma_species_major_tab <- polyoma_species_counts %>% group_by(name) %>% \n  filter(p_reads_polyoma == max(p_reads_polyoma)) %>% \n  filter(row_number() == 1) %>%\n  arrange(desc(p_reads_polyoma)) %>% \n  filter(p_reads_polyoma > 
threshold_major_species)\npolyoma_species_counts_major <- polyoma_species_counts %>%\n  mutate(name_display = ifelse(name %in% polyoma_species_major_tab$name, \n                               name, \"Other\")) %>%\n  group_by(sample, name_display) %>%\n  summarize(n_reads_polyoma = sum(n_reads_hv),\n            p_reads_polyoma = sum(p_reads_polyoma), \n            .groups=\"drop\") %>%\n  mutate(name_display = factor(name_display, \n                               levels = c(polyoma_species_major_tab$name, \"Other\")))\npolyoma_species_counts_display <- polyoma_species_counts_major %>%\n  rename(p_reads = p_reads_polyoma, classification = name_display)\n\n# Plot\ng_polyoma_species <- g_comp_base + \n  geom_col(data=polyoma_species_counts_display, position = \"stack\", width=1) +\n  scale_y_continuous(name=\"% Polyomaviridae Reads\", limits=c(0,1.01), \n                     breaks = seq(0,1,0.2),\n                     expand=c(0,0), labels = function(y) y*100) +\n  scale_fill_manual(values=palette_viral, name = \"Viral species\") +\n  labs(title=\"Species composition of Polyomaviridae reads\") +\n  guides(fill=guide_legend(ncol=3)) +\n  theme(plot.title = element_text(size=rel(1.4), hjust=0, face=\"plain\"))\n\ng_polyoma_species\n\n\n\n\n\n\nCode# Get most prominent species for text\npolyoma_species_collate <- polyoma_species_counts %>% group_by(name, taxid) %>% \n  summarize(n_reads_tot = sum(n_reads_hv), p_reads_mean = mean(p_reads_polyoma), .groups=\"drop\") %>% \n  arrange(desc(n_reads_tot))\n\n\n\nCodethreshold_major_species <- 0.1\ntaxid_papilloma <- 151340\n\n# Get set of papillomaviridae reads\npapilloma_samples <- hv_family_counts %>% filter(taxid == taxid_papilloma) %>%\n  filter(p_reads_hv >= 0.1) %>%\n  pull(sample)\npapilloma_ids <- hv_reads_family %>% \n  filter(taxid == taxid_papilloma, sample %in% papilloma_samples) %>%\n  pull(seq_id)\n\n# Count reads for each papillomaviridae species\npapilloma_species_counts <- hv_reads_species %>%\n  filter(seq_id %in% papilloma_ids) %>%\n  group_by(sample, name, taxid) %>%\n  count(name = \"n_reads_hv\") %>%\n  group_by(sample) %>%\n  mutate(p_reads_papilloma = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking families and group others\npapilloma_species_major_tab <- papilloma_species_counts %>% group_by(name) %>% \n  filter(p_reads_papilloma == max(p_reads_papilloma)) %>% \n  filter(row_number() == 1) %>%\n  arrange(desc(p_reads_papilloma)) %>% \n  filter(p_reads_papilloma > threshold_major_species)\npapilloma_species_counts_major <- papilloma_species_counts %>%\n  mutate(name_display = ifelse(name %in% papilloma_species_major_tab$name, \n                               name, \"Other\")) %>%\n  group_by(sample, name_display) %>%\n  summarize(n_reads_papilloma = sum(n_reads_hv),\n            p_reads_papilloma = sum(p_reads_papilloma), \n            .groups=\"drop\") %>%\n  mutate(name_display = factor(name_display, \n                               levels = c(papilloma_species_major_tab$name, \"Other\")))\npapilloma_species_counts_display <- papilloma_species_counts_major %>%\n  rename(p_reads = p_reads_papilloma, classification = name_display)\n\n# Plot\ng_papilloma_species <- g_comp_base + \n  geom_col(data=papilloma_species_counts_display, position = \"stack\", width=1) +\n  scale_y_continuous(name=\"% Papillomaviridae Reads\", limits=c(0,1.01), \n                     breaks = seq(0,1,0.2),\n                     expand=c(0,0), labels = function(y) y*100) +\n  scale_fill_manual(values=palette_viral, name = \"Viral species\") +\n  
labs(title=\"Species composition of Papillomaviridae reads\") +\n  guides(fill=guide_legend(ncol=3)) +\n  theme(plot.title = element_text(size=rel(1.4), hjust=0, face=\"plain\"))\n\ng_papilloma_species\n\n\n\n\n\n\nCode# Get most prominent species for text\npapilloma_species_collate <- papilloma_species_counts %>% group_by(name, taxid) %>% \n  summarize(n_reads_tot = sum(n_reads_hv), p_reads_mean = mean(p_reads_papilloma), .groups=\"drop\") %>% \n  arrange(desc(n_reads_tot))\n\n\n\nCodethreshold_major_species <- 0.1\ntaxid_herpes <- 10292\n\n# Get set of herpesviridae reads\nherpes_samples <- hv_family_counts %>% filter(taxid == taxid_herpes) %>%\n  filter(p_reads_hv >= 0.1) %>%\n  pull(sample)\nherpes_ids <- hv_reads_family %>% \n  filter(taxid == taxid_herpes, sample %in% herpes_samples) %>%\n  pull(seq_id)\n\n# Count reads for each herpesviridae species\nherpes_species_counts <- hv_reads_species %>%\n  filter(seq_id %in% herpes_ids) %>%\n  group_by(sample, name, taxid) %>%\n  count(name = \"n_reads_hv\") %>%\n  group_by(sample) %>%\n  mutate(p_reads_herpes = n_reads_hv/sum(n_reads_hv))\n\n# Identify high-ranking families and group others\nherpes_species_major_tab <- herpes_species_counts %>% group_by(name) %>% \n  filter(p_reads_herpes == max(p_reads_herpes)) %>% \n  filter(row_number() == 1) %>%\n  arrange(desc(p_reads_herpes)) %>% \n  filter(p_reads_herpes > threshold_major_species)\nherpes_species_counts_major <- herpes_species_counts %>%\n  mutate(name_display = ifelse(name %in% herpes_species_major_tab$name, \n                               name, \"Other\")) %>%\n  group_by(sample, name_display) %>%\n  summarize(n_reads_herpes = sum(n_reads_hv),\n            p_reads_herpes = sum(p_reads_herpes), \n            .groups=\"drop\") %>%\n  mutate(name_display = factor(name_display, \n                               levels = c(herpes_species_major_tab$name, \"Other\")))\nherpes_species_counts_display <- herpes_species_counts_major %>%\n  rename(p_reads = p_reads_herpes, classification = name_display)\n\n# Plot\ng_herpes_species <- g_comp_base + \n  geom_col(data=herpes_species_counts_display, position = \"stack\", width=1) +\n  scale_y_continuous(name=\"% Herpesviridae Reads\", limits=c(0,1.01), \n                     breaks = seq(0,1,0.2),\n                     expand=c(0,0), labels = function(y) y*100) +\n  scale_fill_manual(values=palette_viral, name = \"Viral species\") +\n  labs(title=\"Species composition of Herpesviridae reads\") +\n  guides(fill=guide_legend(ncol=3)) +\n  theme(plot.title = element_text(size=rel(1.4), hjust=0, face=\"plain\"))\n\ng_herpes_species\n\n\n\n\n\n\nCode# Get most prominent species for text\nherpes_species_collate <- herpes_species_counts %>% group_by(name, taxid) %>% \n  summarize(n_reads_tot = sum(n_reads_hv), p_reads_mean = mean(p_reads_herpes), .groups=\"drop\") %>% \n  arrange(desc(n_reads_tot))\n\n\nI was a bit suspicious of this last result, given that it only occurred in one sample, but according to BLASTN, at least, these human gammaherpesvirus 4 (a.k.a. 
EBV) matches are real:\n\nCode# Configure\nref_taxids_hv <- c(10376)\nref_names_hv <- sapply(ref_taxids_hv, function(x) viral_taxa %>% filter(taxid == x) %>% pull(name) %>% first)\np_threshold <- 0.1\n\n# Get taxon names\ntax_names_path <- file.path(data_dir, \"taxid-names.tsv.gz\")\ntax_names <- read_tsv(tax_names_path, show_col_types = FALSE)\n\n# Add missing names\ntax_names_new <- tribble(~staxid, ~name,\n                         3050295, \"Cytomegalovirus humanbeta5\",\n                         459231, \"FLAG-tagging vector pFLAG97-TSR\",\n                         257877, \"Macaca thibetana thibetana\",\n                         256321, \"Lentiviral transfer vector pHsCXW\",\n                         419242, \"Shuttle vector pLvCmvMYOCDHA\",\n                         419243, \"Shuttle vector pLvCmvLacZ\",\n                         421868, \"Cloning vector pLvCmvLacZ.Gfp\",\n                         421869, \"Cloning vector pLvCmvMyocardin.Gfp\",\n                         426303, \"Lentiviral vector pNL-GFP-RRE(SA)\",\n                         436015, \"Lentiviral transfer vector pFTMGW\",\n                         454257, \"Shuttle vector pLvCmvMYOCD2aHA\",\n                         476184, \"Shuttle vector pLV.mMyoD::ERT2.eGFP\",\n                         476185, \"Shuttle vector pLV.hMyoD.eGFP\",\n                         591936, \"Piliocolobus tephrosceles\",\n                         627481, \"Lentiviral transfer vector pFTM3GW\",\n                         680261, \"Self-inactivating lentivirus vector pLV.C-EF1a.cyt-bGal.dCpG\",\n                         2952778, \"Expression vector pLV[Exp]-EGFP:T2A:Puro-EF1A\",\n                         3022699, \"Vector PAS_122122\",\n                         3025913, \"Vector pSIN-WP-mPGK-GDNF\",\n                         3105863, \"Vector pLKO.1-ZsGreen1\",\n                         3105864, \"Vector pLKO.1-ZsGreen1 mouse Wfs1 shRNA\",\n                         3108001, \"Cloning vector pLVSIN-CMV_Neo_v4.0\",\n                         3109234, \"Vector pTwist+Kan+High\",\n                         3117662, \"Cloning vector pLV[Exp]-CBA>P301L\",\n                         3117663, \"Cloning vector pLV[Exp]-CBA>P301L:T2A:mRuby3\",\n                         3117664, \"Cloning vector pLV[Exp]-CBA>hMAPT[NM_005910.6](ns):T2A:mRuby3\",\n                         3117665, \"Cloning vector pLV[Exp]-CBA>mRuby3\",\n                         3117666, \"Cloning vector pLV[Exp]-CBA>mRuby3/NFAT3 fusion protein\",\n                         3117667, \"Cloning vector pLV[Exp]-Neo-mPGK>{EGFP-hSEPT6}\",\n                         438045, \"Xenotropic MuLV-related virus\",\n                         447135, \"Myodes glareolus\",\n                         590745, \"Mus musculus mobilized endogenous polytropic provirus\",\n                         181858, \"Murine AIDS virus-related provirus\",\n                         356663, \"Xenotropic MuLV-related virus VP35\",\n                         356664, \"Xenotropic MuLV-related virus VP42\",\n                         373193, \"Xenotropic MuLV-related virus VP62\",\n                         286419, \"Canis lupus dingo\",\n                         415978, \"Sus scrofa scrofa\",\n                         494514, \"Vulpes lagopus\",\n                         3082113, \"Rangifer tarandus platyrhynchus\",\n                         3119969, \"Bubalus kerabau\")\ntax_names <- bind_rows(tax_names, tax_names_new)\n\n# Get matches\nhv_blast_staxids <- hv_reads_species %>% filter(taxid %in% ref_taxids_hv) %>%\n  group_by(taxid) %>% 
mutate(n_seq = n()) %>%\n  left_join(blast_paired, by=\"seq_id\") %>%\n  mutate(staxid = as.integer(staxid)) %>%\n  left_join(tax_names %>% rename(sname=name), by=\"staxid\")\n\n# Count matches\nhv_blast_counts <- hv_blast_staxids %>%\n  group_by(taxid, name, staxid, sname, n_seq) %>%\n  count %>% mutate(p=n/n_seq)\n\n# Subset to major matches\nhv_blast_counts_major <- hv_blast_counts %>% \n  filter(n>1, p>p_threshold, !is.na(staxid)) %>%\n  arrange(desc(p)) %>% group_by(taxid) %>%\n  filter(row_number() <= 25) %>%\n  mutate(name_display = ifelse(name == ref_names_hv[1], \"EBV\", name))\n\n# Plot\ng_hv_blast <- ggplot(hv_blast_counts_major, mapping=aes(x=p, y=sname)) +\n  geom_col() +\n  facet_grid(name_display~., scales=\"free_y\", space=\"free_y\") +\n  scale_x_continuous(name=\"% mapped reads\", limits=c(0,1), \n                     breaks=seq(0,1,0.2), expand=c(0,0)) +\n  theme_base + theme(axis.title.y = element_blank())\ng_hv_blast\n\n\n\n\n\n\n\nFinally, here again are the overall relative abundances of the specific viral genera I picked out manually in my last entry:\n\nCode# Define reference genera\npath_genera_rna <- c(\"Mamastrovirus\", \"Enterovirus\", \"Salivirus\", \"Kobuvirus\", \"Norovirus\", \"Sapovirus\", \"Rotavirus\", \"Alphacoronavirus\", \"Betacoronavirus\", \"Alphainfluenzavirus\", \"Betainfluenzavirus\", \"Lentivirus\")\npath_genera_dna <- c(\"Mastadenovirus\", \"Alphapolyomavirus\", \"Betapolyomavirus\", \"Alphapapillomavirus\", \"Betapapillomavirus\", \"Gammapapillomavirus\", \"Orthopoxvirus\", \"Simplexvirus\",\n                     \"Lymphocryptovirus\", \"Cytomegalovirus\", \"Dependoparvovirus\")\npath_genera <- bind_rows(tibble(name=path_genera_rna, genome_type=\"RNA genome\"),\n                         tibble(name=path_genera_dna, genome_type=\"DNA genome\")) %>%\n  left_join(viral_taxa, by=\"name\")\n\n# Count in each sample\nmrg_hv_named_all <- mrg_hv %>% left_join(viral_taxa, by=\"taxid\")\nhv_reads_genus_all <- raise_rank(mrg_hv_named_all, viral_taxa, \"genus\")\nn_path_genera <- hv_reads_genus_all %>% \n  group_by(sample, name, taxid) %>% \n  count(name=\"n_reads_viral\") %>% \n  inner_join(path_genera, by=c(\"name\", \"taxid\")) %>%\n  left_join(read_counts_raw, by=c(\"sample\")) %>%\n  mutate(p_reads_viral = n_reads_viral/n_reads_raw)\n\n# Pivot out and back to add zero lines\nn_path_genera_out <- n_path_genera %>% ungroup %>% select(sample, name, n_reads_viral) %>%\n  pivot_wider(names_from=\"name\", values_from=\"n_reads_viral\", values_fill=0) %>%\n  pivot_longer(-sample, names_to=\"name\", values_to=\"n_reads_viral\") %>%\n  left_join(read_counts_raw, by=\"sample\") %>%\n  left_join(path_genera, by=\"name\") %>%\n  mutate(p_reads_viral = n_reads_viral/n_reads_raw)\n\n## Aggregate across dates\nn_path_genera_stype <- n_path_genera_out %>% \n  group_by(name, taxid, genome_type) %>%\n  summarize(n_reads_raw = sum(n_reads_raw),\n            n_reads_viral = sum(n_reads_viral), .groups = \"drop\") %>%\n  mutate(sample=\"All samples\", location=\"All locations\",\n         p_reads_viral = n_reads_viral/n_reads_raw,\n         na_type = \"DNA\")\n\n# Plot\ng_path_genera <- ggplot(n_path_genera_stype,\n                        aes(y=name, x=p_reads_viral)) +\n  geom_point() +\n  scale_x_log10(name=\"Relative abundance\") +\n  facet_grid(genome_type~., scales=\"free_y\") +\n  theme_base + theme(axis.title.y = element_blank())\ng_path_genera\n\n\n\n\n\n\n\nConclusion\nI’ve had trouble with this dataset previously, so I was surprised at how well this analysis 
went. It seems the improvements I’ve made to the pipeline over the last couple of months have really had an effect. Like other DNA wastewater datasets I’ve looked at recently, this one (a) has very low HV relative abundance overall, and (b) shows a very high preponderance of human mastadenovirus F. I have one more DNA dataset from the P2RA study to analyze with this pipeline, so we’ll see if this pattern persists there."
   }
 ]
\ No newline at end of file
diff --git a/notebooks/2024-05-01_maritz.qmd b/notebooks/2024-05-01_maritz.qmd
new file mode 100644
index 0000000..cb1748f
--- /dev/null
+++ b/notebooks/2024-05-01_maritz.qmd
@@ -0,0 +1,1206 @@
+---
+title: "Workflow analysis of Maritz et al. (2019)"
+subtitle: "Wastewater from NYC."
+author: "Will Bradshaw"
+date: 2024-05-01
+format:
+  html:
+    code-fold: true
+    code-tools: true
+    code-link: true
+    df-print: paged
+editor: visual
+title-block-banner: black
+---
+
+```{r}
+#| label: preamble
+#| include: false
+
+# Load packages
+library(tidyverse)
+library(cowplot)
+library(patchwork)
+library(fastqcr)
+library(RColorBrewer)
+source("../scripts/aux_plot-theme.R")
+
+# GGplot themes and scales
+theme_base <- theme_base + theme(aspect.ratio = NULL)
+theme_rotate <- theme_base + theme(
+    axis.text.x = element_text(hjust = 1, angle = 45),
+)
+theme_kit <- theme_rotate + theme(
+  axis.title.x = element_blank(),
+)
+theme_xblank <- theme_kit + theme(
+  axis.text.x = element_blank()
+)
+tnl <- theme(legend.position = "none")
+```
+
+Continuing my analysis of datasets from the [P2RA preprint](https://doi.org/10.1101/2023.12.22.23300450), I analyzed the data from [Maritz et al. (2019)](https://doi.org/10.1038/s41396-019-0467-z), a study that used DNA sequencing of wastewater samples to characterize protist diversity and its temporal variation in New York City. Samples for this study underwent direct DNA extraction without a dedicated concentration step, followed by library prep and Illumina sequencing on a HiSeq Rapid Run (2x250bp).
+
+# The raw data
+
+16 samples were collected from 14 treatment plants in NYC in November 2014. These samples yielded 8.6M-18.3M (mean 10.8M) read pairs per sample, for a total of 172M read pairs (84 gigabases of sequence). Read qualities were mostly high; adapter levels were moderate; inferred duplication levels were low.
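+
+As a rough sanity check, the headline totals follow from the per-sample figures (back-of-the-envelope arithmetic on the rounded numbers quoted above; the exact values come from `raw_read_counts`, computed below):
+
+```{r}
+#| label: readcount-sanity-check
+#| eval: false
+
+# Rough check of the totals quoted above, using the rounded per-sample figures
+n_samples <- 16
+mean_read_pairs <- 10.8e6   # mean read pairs per sample (rounded)
+read_length <- 250          # 2x250bp paired-end
+n_samples * mean_read_pairs                    # ~173M read pairs
+n_samples * mean_read_pairs * 2 * read_length  # ~86 Gb of sequence
+```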
+
+```{r}
+#| warning: false
+#| label: import-qc-data
+
+# Importing the data is a bit more complicated this time as the samples are split across three pipeline runs
+data_dir <- "../data/2024-05-01_maritz"
+
+# Data input paths
+libraries_path <- file.path(data_dir, "sample-metadata.csv")
+basic_stats_path <- file.path(data_dir, "qc_basic_stats.tsv.gz")
+adapter_stats_path <- file.path(data_dir, "qc_adapter_stats.tsv.gz")
+quality_base_stats_path <- file.path(data_dir, "qc_quality_base_stats.tsv.gz")
+quality_seq_stats_path <- file.path(data_dir, "qc_quality_sequence_stats.tsv.gz")
+
+# Import library metadata and fix sample order for plotting
+libraries_raw <- lapply(libraries_path, read_csv, show_col_types = FALSE) %>%
+  bind_rows
+libraries <- libraries_raw %>%
+  mutate(sample = fct_inorder(sample))
+```
+
+```{r}
+#| label: process-qc-data
+
+# Import QC data
+stages <- c("raw_concat", "cleaned", "dedup", "ribo_initial", "ribo_secondary")
+import_basic <- function(paths){
+  lapply(paths, read_tsv, show_col_types = FALSE) %>% bind_rows %>%
+    inner_join(libraries, by="sample") %>%
+    arrange(sample) %>%
+    mutate(stage = factor(stage, levels = stages),
+           sample = fct_inorder(sample))
+}
+import_basic_paired <- function(paths){
+  import_basic(paths) %>% arrange(read_pair) %>% 
+    mutate(read_pair = fct_inorder(as.character(read_pair)))
+}
+basic_stats <- import_basic(basic_stats_path)
+adapter_stats <- import_basic_paired(adapter_stats_path)
+quality_base_stats <- import_basic_paired(quality_base_stats_path)
+quality_seq_stats <- import_basic_paired(quality_seq_stats_path)
+
+# Filter to raw data
+basic_stats_raw <- basic_stats %>% filter(stage == "raw_concat")
+adapter_stats_raw <- adapter_stats %>% filter(stage == "raw_concat")
+quality_base_stats_raw <- quality_base_stats %>% filter(stage == "raw_concat")
+quality_seq_stats_raw <- quality_seq_stats %>% filter(stage == "raw_concat")
+
+# Get key values for readout
+raw_read_counts <- basic_stats_raw %>% ungroup %>% 
+  summarize(rmin = min(n_read_pairs), rmax=max(n_read_pairs),
+            rmean=mean(n_read_pairs), 
+            rtot = sum(n_read_pairs),
+            btot = sum(n_bases_approx),
+            dmin = min(percent_duplicates), dmax=max(percent_duplicates),
+            dmean=mean(percent_duplicates), .groups = "drop")
+```
+
+```{r}
+#| fig-width: 9
+#| warning: false
+#| label: plot-basic-stats
+
+# Prepare data
+basic_stats_raw_metrics <- basic_stats_raw %>%
+  select(sample,
+         `# Read pairs` = n_read_pairs,
+         `Total base pairs\n(approx)` = n_bases_approx,
+         `% Duplicates\n(FASTQC)` = percent_duplicates) %>%
+  pivot_longer(-(sample), names_to = "metric", values_to = "value") %>%
+  mutate(metric = fct_inorder(metric))
+
+# Set up plot templates
+g_basic <- ggplot(basic_stats_raw_metrics, aes(x=sample, y=value)) +
+  geom_col(position = "dodge") +
+  scale_y_continuous(expand=c(0,0)) +
+  expand_limits(y=c(0,100)) +
+  facet_grid(metric~., scales = "free", space="free_x", switch="y") +
+  theme_kit + theme(
+    axis.title.y = element_blank(),
+    strip.text.y = element_text(face="plain")
+  )
+g_basic
+```
+
+```{r}
+#| label: plot-raw-quality
+
+# Set up plotting templates
+g_qual_raw <- ggplot(mapping=aes(linetype=read_pair, 
+                         group=interaction(sample,read_pair))) + 
+  scale_linetype_discrete(name = "Read Pair") +
+  guides(color=guide_legend(nrow=2,byrow=TRUE),
+         linetype = guide_legend(nrow=2,byrow=TRUE)) +
+  theme_base
+
+# Visualize adapters
+g_adapters_raw <- g_qual_raw + 
+  geom_line(aes(x=position, y=pc_adapters), data=adapter_stats_raw) +
+  scale_y_continuous(name="% Adapters", limits=c(0,NA),
+                     breaks = seq(0,100,1), expand=c(0,0)) +
+  scale_x_continuous(name="Position", limits=c(0,NA),
+                     breaks=seq(0,500,20), expand=c(0,0)) +
+  facet_grid(.~adapter)
+g_adapters_raw
+
+# Visualize quality
+g_quality_base_raw <- g_qual_raw +
+  geom_hline(yintercept=25, linetype="dashed", color="red") +
+  geom_hline(yintercept=30, linetype="dashed", color="red") +
+  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats_raw) +
+  scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
+  scale_x_continuous(name="Position", limits=c(0,NA),
+                     breaks=seq(0,500,20), expand=c(0,0))
+g_quality_base_raw
+
+g_quality_seq_raw <- g_qual_raw +
+  geom_vline(xintercept=25, linetype="dashed", color="red") +
+  geom_vline(xintercept=30, linetype="dashed", color="red") +
+  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats_raw) +
+  scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
+  scale_y_continuous(name="# Sequences", expand=c(0,0))
+g_quality_seq_raw
+```
+
+# Preprocessing
+
+About 6% of reads on average were lost during cleaning, and a further 2% during deduplication. Very few reads were lost during ribodepletion, as expected for DNA sequencing libraries.
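+
+The table below reports read losses both cumulatively (relative to the raw input) and marginally (lost at that particular stage). As a toy illustration of the distinction (invented numbers; the real values come from `basic_stats`):
+
+```{r}
+#| label: loss-metrics-toy-example
+#| eval: false
+
+# Toy example of cumulative vs marginal loss: 100 raw read pairs -> 94 after
+# cleaning -> 92 after deduplication (numbers invented for illustration)
+toy <- tibble(stage = c("raw_concat", "cleaned", "dedup"),
+              n_read_pairs = c(100, 94, 92)) %>%
+  mutate(p_lost_cumulative = 1 - n_read_pairs / n_read_pairs[1],
+         p_lost_marginal = replace_na(p_lost_cumulative - lag(p_lost_cumulative), 0))
+toy
+# Cleaning loses 6% of the raw read pairs; deduplication removes a further 2
+# percentage points of the original total
+```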
+
+```{r}
+#| label: preproc-table
+n_reads_rel <- basic_stats %>% 
+  select(sample, stage, 
+         percent_duplicates, n_read_pairs) %>%
+  group_by(sample) %>% arrange(sample, stage) %>%
+  mutate(p_reads_retained = replace_na(n_read_pairs / lag(n_read_pairs), 0),
+         p_reads_lost = 1 - p_reads_retained,
+         p_reads_retained_abs = n_read_pairs / n_read_pairs[1],
+         p_reads_lost_abs = 1-p_reads_retained_abs,
+         p_reads_lost_abs_marginal = replace_na(p_reads_lost_abs - lag(p_reads_lost_abs), 0))
+n_reads_rel_display <- n_reads_rel %>% 
+  group_by(Stage=stage) %>% 
+  summarize(`% Total Reads Lost (Cumulative)` = paste0(round(min(p_reads_lost_abs*100),1), "-", round(max(p_reads_lost_abs*100),1), " (mean ", round(mean(p_reads_lost_abs*100),1), ")"),
+            `% Total Reads Lost (Marginal)` = paste0(round(min(p_reads_lost_abs_marginal*100),1), "-", round(max(p_reads_lost_abs_marginal*100),1), " (mean ", round(mean(p_reads_lost_abs_marginal*100),1), ")"), .groups="drop") %>% 
+  filter(Stage != "raw_concat") %>%
+  mutate(Stage = Stage %>% as.numeric %>% factor(labels=c("Trimming & filtering", "Deduplication", "Initial ribodepletion", "Secondary ribodepletion")))
+n_reads_rel_display
+```
+
+```{r}
+#| label: preproc-figures
+#| warning: false
+#| fig-height: 4
+#| fig-width: 6
+
+g_stage_base <- ggplot(mapping=aes(x=stage, group=sample)) +
+  theme_kit
+
+# Plot reads over preprocessing
+g_reads_stages <- g_stage_base +
+  geom_line(aes(y=n_read_pairs), data=basic_stats) +
+  scale_y_continuous("# Read pairs", expand=c(0,0), limits=c(0,NA))
+g_reads_stages
+
+# Plot relative read losses during preprocessing
+g_reads_rel <- g_stage_base +
+  geom_line(aes(y=p_reads_lost_abs_marginal), data=n_reads_rel) +
+  scale_y_continuous("% Total Reads Lost", expand=c(0,0), 
+                     labels = function(x) x*100)
+g_reads_rel
+```
+
+Data cleaning was very successful at removing adapters and improving read qualities:
+
+```{r}
+#| warning: false
+#| label: plot-quality
+#| fig-height: 7
+
+g_qual <- ggplot(mapping=aes(linetype=read_pair, 
+                         group=interaction(sample,read_pair))) + 
+  scale_linetype_discrete(name = "Read Pair") +
+  guides(color=guide_legend(nrow=2,byrow=TRUE),
+         linetype = guide_legend(nrow=2,byrow=TRUE)) +
+  theme_base
+
+# Visualize adapters
+g_adapters <- g_qual + 
+  geom_line(aes(x=position, y=pc_adapters), data=adapter_stats) +
+  scale_y_continuous(name="% Adapters", limits=c(0,20),
+                     breaks = seq(0,50,10), expand=c(0,0)) +
+  scale_x_continuous(name="Position", limits=c(0,NA),
+                     breaks=seq(0,140,20), expand=c(0,0)) +
+  facet_grid(stage~adapter)
+g_adapters
+
+# Visualize quality
+g_quality_base <- g_qual +
+  geom_hline(yintercept=25, linetype="dashed", color="red") +
+  geom_hline(yintercept=30, linetype="dashed", color="red") +
+  geom_line(aes(x=position, y=mean_phred_score), data=quality_base_stats) +
+  scale_y_continuous(name="Mean Phred score", expand=c(0,0), limits=c(10,45)) +
+  scale_x_continuous(name="Position", limits=c(0,NA),
+                     breaks=seq(0,140,20), expand=c(0,0)) +
+  facet_grid(stage~.)
+g_quality_base
+
+g_quality_seq <- g_qual +
+  geom_vline(xintercept=25, linetype="dashed", color="red") +
+  geom_vline(xintercept=30, linetype="dashed", color="red") +
+  geom_line(aes(x=mean_phred_score, y=n_sequences), data=quality_seq_stats) +
+  scale_x_continuous(name="Mean Phred score", expand=c(0,0)) +
+  scale_y_continuous(name="# Sequences", expand=c(0,0)) +
+  facet_grid(stage~.)
+g_quality_seq
+```
+
+According to FastQC, cleaning + deduplication was very effective at reducing measured duplicate levels in the few samples that required it:
+
+```{r}
+#| label: preproc-dedup
+#| fig-height: 3.5
+#| fig-width: 6
+
+stage_dup <- basic_stats %>% group_by(stage) %>% 
+  summarize(dmin = min(percent_duplicates), dmax=max(percent_duplicates),
+            dmean=mean(percent_duplicates), .groups = "drop")
+
+g_dup_stages <- g_stage_base +
+  geom_line(aes(y=percent_duplicates), data=basic_stats) +
+  scale_y_continuous("% Duplicates", limits=c(0,NA), expand=c(0,0))
+g_dup_stages
+
+g_readlen_stages <- g_stage_base + 
+  geom_line(aes(y=mean_seq_len), data=basic_stats) +
+  scale_y_continuous("Mean read length (nt)", expand=c(0,0), limits=c(0,NA))
+g_readlen_stages
+```
+
+# High-level composition
+
+As before, to assess the high-level composition of the reads, I ran the ribodepleted files through Kraken (using the Standard 16 database) and summarized the results with Bracken. Combining these results with the read counts above gives us a breakdown of the inferred composition of the samples:
+
+```{r}
+#| label: prepare-composition
+
+classifications <- c("Filtered", "Duplicate", "Ribosomal", "Unassigned",
+                     "Bacterial", "Archaeal", "Viral", "Human")
+
+# Import composition data
+comp_path <- file.path(data_dir, "taxonomic_composition.tsv.gz")
+comp <- read_tsv(comp_path, show_col_types = FALSE) %>%
+  left_join(libraries, by="sample") %>%
+  mutate(classification = factor(classification, levels = classifications))
+  
+
+# Summarize composition
+read_comp_summ <- comp %>% 
+  group_by(classification) %>%
+  summarize(n_reads = sum(n_reads), .groups = "drop_last") %>%
+  mutate(n_reads = replace_na(n_reads,0),
+    p_reads = n_reads/sum(n_reads),
+    pc_reads = p_reads*100)
+```
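+
+For reference, here is a rough sketch of what that upstream classification step looks like when run by hand. This is illustrative only and not evaluated here: the database path, file names, read length, and thread count are placeholders, and the actual workflow runs Kraken2 and Bracken inside the automated pipeline rather than from R.
+
+```{r}
+#| label: kraken-bracken-sketch
+#| eval: false
+
+# Classify one ribodepleted library with Kraken2, then re-estimate clade
+# abundances with Bracken (all paths and parameters below are placeholders)
+system2("kraken2", c("--db", "/path/to/k2_standard_16gb",
+                     "--paired", "--gzip-compressed", "--threads", "8",
+                     "--report", "sample.kraken.report",
+                     "--output", "sample.kraken.out",
+                     "sample_ribo_1.fastq.gz", "sample_ribo_2.fastq.gz"))
+system2("bracken", c("-d", "/path/to/k2_standard_16gb",
+                     "-i", "sample.kraken.report",
+                     "-o", "sample.bracken.tsv",
+                     "-r", "150", "-l", "G"))
+```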
+
+```{r}
+#| label: plot-composition-all
+#| fig-height: 7
+#| fig-width: 8
+
+# Prepare plotting templates
+g_comp_base <- ggplot(mapping=aes(x=sample, y=p_reads, fill=classification)) +
+  theme_kit
+scale_y_pc_reads <- purrr::partial(scale_y_continuous, name = "% Reads",
+                                   expand = c(0,0), labels = function(y) y*100)
+
+# Plot overall composition
+g_comp <- g_comp_base + geom_col(data = comp, position = "stack", width=1) +
+  scale_y_pc_reads(limits = c(0,1.01), breaks = seq(0,1,0.2)) +
+  scale_fill_brewer(palette = "Set1", name = "Classification")
+g_comp
+
+# Plot composition of minor components
+comp_minor <- comp %>% 
+  filter(classification %in% c("Archaeal", "Viral", "Human", "Other"))
+palette_minor <- brewer.pal(9, "Set1")[6:9]
+g_comp_minor <- g_comp_base + 
+  geom_col(data=comp_minor, position = "stack", width=1) +
+  scale_y_pc_reads() +
+  scale_fill_manual(values=palette_minor, name = "Classification")
+g_comp_minor
+
+```
+
+```{r}
+#| label: composition-summary
+
+p_reads_summ_group <- comp %>%
+  mutate(classification = ifelse(classification %in% c("Filtered", "Duplicate", "Unassigned"), "Excluded", as.character(classification)),
+         classification = fct_inorder(classification)) %>%
+  group_by(classification, sample) %>%
+  summarize(p_reads = sum(p_reads), .groups = "drop") %>%
+  group_by(classification) %>%
+  summarize(pc_min = min(p_reads)*100, pc_max = max(p_reads)*100, 
+            pc_mean = mean(p_reads)*100, .groups = "drop")
+p_reads_summ_prep <- p_reads_summ_group %>%
+  mutate(classification = fct_inorder(classification),
+         pc_min = pc_min %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),
+         pc_max = pc_max %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),
+         pc_mean = pc_mean %>% signif(digits=2) %>% sapply(format, scientific=FALSE, trim=TRUE, digits=2),
+         display = paste0(pc_min, "-", pc_max, "% (mean ", pc_mean, "%)"))
+p_reads_summ <- p_reads_summ_prep %>%
+  select(Classification=classification, 
+         `Read Fraction`=display) %>%
+  arrange(Classification)
+p_reads_summ
+```
+
+As in previous DNA datasets, the vast majority of classified reads were bacterial in origin. The viral fraction averaged 0.13%, though one sample (NYC-08) reached almost 1%. As is common for DNA data, viral reads were overwhelmingly dominated by *Caudoviricetes* phages:
+
+```{r}
+#| label: extract-viral-taxa
+
+# Get Kraken reports
+reports_path <- file.path(data_dir, "kraken_reports.tsv.gz")
+reports <- read_tsv(reports_path, show_col_types = FALSE)
+
+# Get viral taxonomy
+viral_taxa_path <- file.path(data_dir, "viral-taxids.tsv.gz")
+viral_taxa <- read_tsv(viral_taxa_path, show_col_types = FALSE)
+
+# Filter to viral taxa
+kraken_reports_viral <- filter(reports, taxid %in% viral_taxa$taxid) %>%
+  group_by(sample) %>%
+  mutate(p_reads_viral = n_reads_clade/n_reads_clade[1])
+kraken_reports_viral_cleaned <- kraken_reports_viral %>%
+  inner_join(libraries, by="sample") %>%
+  select(-pc_reads_total, -n_reads_direct, -contains("minimizers")) %>%
+  select(name, taxid, p_reads_viral, n_reads_clade, everything())
+
+viral_classes <- kraken_reports_viral_cleaned %>% filter(rank == "C")
+viral_families <- kraken_reports_viral_cleaned %>% filter(rank == "F")
+
+```
+
+```{r}
+#| label: viral-class-composition
+
+major_threshold <- 0.02
+
+# Identify major viral classes
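+# (a class counts as "major" if its share of viral reads exceeds the threshold in at least one sample)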
+viral_classes_major_tab <- viral_classes %>% 
+  group_by(name, taxid) %>%
+  summarize(p_reads_viral_max = max(p_reads_viral), .groups="drop") %>%
+  filter(p_reads_viral_max >= major_threshold)
+viral_classes_major_list <- viral_classes_major_tab %>% pull(name)
+viral_classes_major <- viral_classes %>% 
+  filter(name %in% viral_classes_major_list) %>%
+  select(name, taxid, sample, p_reads_viral)
+viral_classes_minor <- viral_classes_major %>% 
+  group_by(sample) %>%
+  summarize(p_reads_viral_major = sum(p_reads_viral), .groups = "drop") %>%
+  mutate(name = "Other", taxid=NA, p_reads_viral = 1-p_reads_viral_major) %>%
+  select(name, taxid, sample, p_reads_viral)
+viral_classes_display <- bind_rows(viral_classes_major, viral_classes_minor) %>%
+  arrange(desc(p_reads_viral)) %>% 
+  mutate(name = factor(name, levels=c(viral_classes_major_list, "Other")),
+         p_reads_viral = pmax(p_reads_viral, 0)) %>%
+  rename(p_reads = p_reads_viral, classification=name)
+
+palette_viral <- c(brewer.pal(12, "Set3"), brewer.pal(8, "Dark2"))
+g_classes <- g_comp_base + 
+  geom_col(data=viral_classes_display, position = "stack", width=1) +
+  scale_y_continuous(name="% Viral Reads", limits=c(0,1.01), breaks = seq(0,1,0.2),
+                     expand=c(0,0), labels = function(y) y*100) +
+  scale_fill_manual(values=palette_viral, name = "Viral class")
+  
+g_classes
+
+```
+
+# Human-infecting virus reads: validation
+
+Next, I investigated the human-infecting virus read content of these unenriched samples. A grand total of 199 read pairs were identified as putatively human-viral:
+
+```{r}
+#| label: hv-read-counts
+
+# Import HV read data
+hv_reads_filtered_path <- file.path(data_dir, "hv_hits_putative_filtered.tsv.gz")
+hv_reads_filtered <- lapply(hv_reads_filtered_path, read_tsv,
+                            show_col_types = FALSE) %>%
+  bind_rows() %>%
+  left_join(libraries, by="sample")
+
+# Count reads
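+# (collapse to one row per read pair, then count putative HV read pairs per sample)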
+n_hv_filtered <- hv_reads_filtered %>%
+  group_by(sample, seq_id) %>% count %>%
+  group_by(sample) %>% count %>% 
+  inner_join(basic_stats %>% filter(stage == "ribo_initial") %>% 
+               select(sample, n_read_pairs), by="sample") %>% 
+  rename(n_putative = n, n_total = n_read_pairs) %>% 
+  mutate(p_reads = n_putative/n_total, pc_reads = p_reads * 100)
+n_hv_filtered_summ <- n_hv_filtered %>% ungroup %>%
+  summarize(n_putative = sum(n_putative), n_total = sum(n_total), 
+            .groups="drop") %>% 
+  mutate(p_reads = n_putative/n_total, pc_reads = p_reads*100)
+```
+
+```{r}
+#| label: plot-hv-scores
+#| warning: false
+#| fig-width: 8
+
+# Collapse multi-entry sequences
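+# rmax: max() that ignores NAs (forward/reverse values can be missing for a read pair)
+# collapse: keep a single value if all entries agree, otherwise join them with "/"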
+rmax <- purrr::partial(max, na.rm = TRUE)
+collapse <- function(x) ifelse(all(x == x[1]), x[1], paste(x, collapse="/"))
+mrg <- hv_reads_filtered %>% 
+  mutate(adj_score_max = pmax(adj_score_fwd, adj_score_rev, na.rm = TRUE)) %>%
+  arrange(desc(adj_score_max)) %>%
+  group_by(seq_id) %>%
+  summarize(sample = collapse(sample),
+            genome_id = collapse(genome_id),
+            taxid_best = taxid[1],
+            taxid = collapse(as.character(taxid)),
+            best_alignment_score_fwd = rmax(best_alignment_score_fwd),
+            best_alignment_score_rev = rmax(best_alignment_score_rev),
+            query_len_fwd = rmax(query_len_fwd),
+            query_len_rev = rmax(query_len_rev),
+            query_seq_fwd = query_seq_fwd[!is.na(query_seq_fwd)][1],
+            query_seq_rev = query_seq_rev[!is.na(query_seq_rev)][1],
+            classified = rmax(classified),
+            assigned_name = collapse(assigned_name),
+            assigned_taxid_best = assigned_taxid[1],
+            assigned_taxid = collapse(as.character(assigned_taxid)),
+            assigned_hv = rmax(assigned_hv),
+            hit_hv = rmax(hit_hv),
+            encoded_hits = collapse(encoded_hits),
+            adj_score_fwd = rmax(adj_score_fwd),
+            adj_score_rev = rmax(adj_score_rev)
+            ) %>%
+  inner_join(libraries, by="sample") %>%
+  mutate(kraken_label = ifelse(assigned_hv, "Kraken2 HV\nassignment",
+                               ifelse(hit_hv, "Kraken2 HV\nhit",
+                                      "No hit or\nassignment"))) %>%
+  mutate(adj_score_max = pmax(adj_score_fwd, adj_score_rev),
+         highscore = adj_score_max >= 20)
+
+# Plot results
+geom_vhist <- purrr::partial(geom_histogram, binwidth=5, boundary=0)
+g_vhist_base <- ggplot(mapping=aes(x=adj_score_max)) +
+  geom_vline(xintercept=20, linetype="dashed", color="red") +
+  facet_wrap(~kraken_label, labeller = labeller(kit = label_wrap_gen(20)), scales = "free_y") +
+  scale_x_continuous(name = "Maximum adjusted alignment score") + 
+  scale_y_continuous(name="# Read pairs") + 
+  theme_base 
+g_vhist_0 <- g_vhist_base + geom_vhist(data=mrg)
+g_vhist_0
+```
+
+BLASTing these reads against nt, we find that the pipeline performs well, with only a single high-scoring false-positive read:
+
+```{r}
+#| label: process-blast-data
+#| warning: false
+
+# Import paired BLAST results
+blast_paired_path <- file.path(data_dir, "hv_hits_blast_paired.tsv.gz")
+blast_paired <- read_tsv(blast_paired_path, show_col_types = FALSE)
+
+# Add viral status
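+# (viral_full: viral BLAST hit supported by both reads of the pair)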
+blast_viral <- mutate(blast_paired, viral = staxid %in% viral_taxa$taxid) %>%
+  mutate(viral_full = viral & n_reads == 2)
+
+# Compare to Kraken & Bowtie assignments
+match_taxid <- function(taxid_1, taxid_2){
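+  # taxid_2 may be a "/"-collapsed string of taxids; check whether taxid_1 appears
+  # as the last element (p1), the first element (p2), or the sole element (p3)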
+  p1 <- mapply(grepl, paste0("/", taxid_1, "$"), taxid_2)
+  p2 <- mapply(grepl, paste0("^", taxid_1, "/"), taxid_2)
+  p3 <- mapply(grepl, paste0("^", taxid_1, "$"), taxid_2)
+  out <- setNames(p1|p2|p3, NULL)
+  return(out)
+}
+mrg_assign <- mrg %>% select(sample, seq_id, taxid, assigned_taxid, adj_score_max)
+blast_assign <- inner_join(blast_viral, mrg_assign, by="seq_id") %>%
+    mutate(taxid_match_bowtie = match_taxid(staxid, taxid),
+           taxid_match_kraken = match_taxid(staxid, assigned_taxid),
+           taxid_match_any = taxid_match_bowtie | taxid_match_kraken)
+blast_out <- blast_assign %>%
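+  # viral_status: 2 = strong support (viral hit on both reads, or BLAST taxid matches
+  # the Bowtie/Kraken assignment), 1 = any viral hit, 0 = no viral hit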
+  group_by(seq_id) %>%
+  summarize(viral_status = ifelse(any(viral_full), 2,
+                                  ifelse(any(taxid_match_any), 2,
+                                             ifelse(any(viral), 1, 0))),
+            .groups = "drop")
+```
+
+```{r}
+#| label: plot-blast-results
+#| fig-height: 6
+#| warning: false
+
+# Merge BLAST results with unenriched read data
+mrg_blast <- full_join(mrg, blast_out, by="seq_id") %>%
+  mutate(viral_status = replace_na(viral_status, 0),
+         viral_status_out = ifelse(viral_status == 0, FALSE, TRUE))
+
+# Plot
+g_vhist_1 <- g_vhist_base + geom_vhist(data=mrg_blast, mapping=aes(fill=viral_status_out)) +
+  scale_fill_brewer(palette = "Set1", name = "Viral status")
+g_vhist_1
+```
+
+My usual disjunctive score threshold of 20 gave precision, sensitivity, and F1 scores all \>96%:
+
+```{r}
+#| label: plot-f1
+test_sens_spec <- function(tab, score_threshold){
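+  # A read pair counts as "retained" (called human-viral) if Kraken assigned it as HV
+  # or either adjusted alignment score exceeds the threshold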
+  tab_retained <- tab %>% 
+    mutate(retain_score = (adj_score_fwd > score_threshold | adj_score_rev > score_threshold),
+           retain = assigned_hv | retain_score) %>%
+    group_by(viral_status_out, retain) %>% count
+  pos_tru <- tab_retained %>% filter(viral_status_out == "TRUE", retain) %>% pull(n) %>% sum
+  pos_fls <- tab_retained %>% filter(viral_status_out != "TRUE", retain) %>% pull(n) %>% sum
+  neg_tru <- tab_retained %>% filter(viral_status_out != "TRUE", !retain) %>% pull(n) %>% sum
+  neg_fls <- tab_retained %>% filter(viral_status_out == "TRUE", !retain) %>% pull(n) %>% sum
+  sensitivity <- pos_tru / (pos_tru + neg_fls)
+  specificity <- neg_tru / (neg_tru + pos_fls)
+  precision   <- pos_tru / (pos_tru + pos_fls)
+  f1 <- 2 * precision * sensitivity / (precision + sensitivity)
+  out <- tibble(threshold=score_threshold, sensitivity=sensitivity, 
+                specificity=specificity, precision=precision, f1=f1)
+  return(out)
+}
+range_f1 <- function(intab, inrange=15:45){
+  tss <- purrr::partial(test_sens_spec, tab=intab)
+  stats <- lapply(inrange, tss) %>% bind_rows %>%
+    pivot_longer(!threshold, names_to="metric", values_to="value")
+  return(stats)
+}
+stats_0 <- range_f1(mrg_blast)
+g_stats_0 <- ggplot(stats_0, aes(x=threshold, y=value, color=metric)) +
+  geom_vline(xintercept=20, color = "red", linetype = "dashed") +
+  geom_line() +
+  scale_y_continuous(name = "Value", limits=c(0,1), breaks = seq(0,1,0.2), expand = c(0,0)) +
+  scale_x_continuous(name = "Adjusted Score Threshold", expand = c(0,0)) +
+  scale_color_brewer(palette="Dark2") +
+  theme_base
+g_stats_0
+stats_0 %>% filter(threshold == 20) %>% 
+  select(Threshold=threshold, Metric=metric, Value=value)
+```
+
+# Human-infecting viruses: overall relative abundance
+
+```{r}
+#| label: count-hv-reads
+
+# Get raw read counts
+read_counts_raw <- basic_stats_raw %>%
+  select(sample, n_reads_raw = n_read_pairs)
+
+# Get HV read counts
+mrg_hv <- mrg %>% mutate(hv_status = assigned_hv | highscore) %>%
+  rename(taxid_all = taxid, taxid = taxid_best)
+read_counts_hv <- mrg_hv %>% filter(hv_status) %>% group_by(sample) %>% 
+  count(name="n_reads_hv")
+read_counts <- read_counts_raw %>% left_join(read_counts_hv, by="sample") %>%
+  mutate(n_reads_hv = replace_na(n_reads_hv, 0))
+
+# Aggregate
+read_counts_grp <- read_counts %>%
+  summarize(n_reads_raw = sum(n_reads_raw),
+            n_reads_hv = sum(n_reads_hv), .groups="drop") %>%
+  mutate(sample= "All samples")
+read_counts_agg <- bind_rows(read_counts, read_counts_grp) %>%
+  mutate(p_reads_hv = n_reads_hv/n_reads_raw,
+         sample = factor(sample, levels=c(levels(libraries$sample), "All samples")))
+```
+
+Applying a disjunctive cutoff at S=20 identifies 162 read pairs as human-viral. This gives an overall relative HV abundance of $9.42 \times 10^{-7}$: higher than [Ng](https://data.securebio.org/wills-public-notebook/notebooks/2024-05-01_ng.html) and [Bengtsson-Palme](https://data.securebio.org/wills-public-notebook/notebooks/2024-05-01_bengtsson-palme.html), but lower than most other datasets I've analyzed with this pipeline:
+
+```{r}
+#| label: plot-hv-ra
+#| warning: false
+# Visualize
+g_phv_agg <- ggplot(read_counts_agg, aes(x=sample)) +
+  geom_point(aes(y=p_reads_hv)) +
+  scale_y_log10("Relative abundance of human virus reads") +
+  theme_kit
+g_phv_agg
+```
+
+```{r}
+#| label: ra-hv-past
+
+# Collate past RA values
+ra_past <- tribble(~dataset, ~ra, ~na_type, ~panel_enriched,
+                   "Brumfield", 5e-5, "RNA", FALSE,
+                   "Brumfield", 3.66e-7, "DNA", FALSE,
+                   "Spurbeck", 5.44e-6, "RNA", FALSE,
+                   "Yang", 3.62e-4, "RNA", FALSE,
+                   "Rothman (unenriched)", 1.87e-5, "RNA", FALSE,
+                   "Rothman (panel-enriched)", 3.3e-5, "RNA", TRUE,
+                   "Crits-Christoph (unenriched)", 1.37e-5, "RNA", FALSE,
+                   "Crits-Christoph (panel-enriched)", 1.26e-2, "RNA", TRUE,
+                   "Prussin (non-control)", 1.63e-5, "RNA", FALSE,
+                   "Prussin (non-control)", 4.16e-5, "DNA", FALSE,
+                   "Rosario (non-control)", 1.21e-5, "RNA", FALSE,
+                   "Rosario (non-control)", 1.50e-4, "DNA", FALSE,
+                   "Leung", 1.73e-5, "DNA", FALSE,
+                   "Brinch", 3.88e-6, "DNA", FALSE,
+                   "Bengtsson-Palme", 8.86e-8, "DNA", FALSE,
+                   "Ng", 2.90e-7, "DNA", FALSE
+)
+
+# Collate new RA values
+ra_new <- tribble(~dataset, ~ra, ~na_type, ~panel_enriched,
+                  "Maritz", 9.42e-7, "DNA", FALSE)
+
+
+# Plot
+scale_color_na <- purrr::partial(scale_color_brewer, palette="Set1",
+                                 name="Nucleic acid type")
+ra_comp <- bind_rows(ra_past, ra_new) %>% mutate(dataset = fct_inorder(dataset))
+g_ra_comp <- ggplot(ra_comp, aes(y=dataset, x=ra, color=na_type)) +
+  geom_point() +
+  scale_color_na() +
+  scale_x_log10(name="Relative abundance of human virus reads") +
+  theme_base + theme(axis.title.y = element_blank())
+g_ra_comp
+```
+
+# Human-infecting viruses: taxonomy and composition
+
+To investigate the taxonomy of human-infecting virus reads, I restricted my analysis to samples with more than 5 HV read pairs in total across all viruses, to reduce noise arising from extremely low HV read counts in some samples. Ten samples met this criterion.
+
+At the family level, most samples were dominated by *Adenoviridae*, *Polyomaviridae*, and *Papillomaviridae*. However, one sample, NYC-03, was overwhelmingly dominated by *Herpesviridae*:
+
+```{r}
+#| label: raise-hv-taxa
+
+# Get viral taxon names for putative HV reads
+viral_taxa$name[viral_taxa$taxid == 249588] <- "Mamastrovirus"
+viral_taxa$name[viral_taxa$taxid == 194960] <- "Kobuvirus"
+viral_taxa$name[viral_taxa$taxid == 688449] <- "Salivirus"
+viral_taxa$name[viral_taxa$taxid == 585893] <- "Picobirnaviridae"
+viral_taxa$name[viral_taxa$taxid == 333922] <- "Betapapillomavirus"
+viral_taxa$name[viral_taxa$taxid == 334207] <- "Betapapillomavirus 3"
+viral_taxa$name[viral_taxa$taxid == 369960] <- "Porcine type-C oncovirus"
+viral_taxa$name[viral_taxa$taxid == 333924] <- "Betapapillomavirus 2"
+viral_taxa$name[viral_taxa$taxid == 687329] <- "Anelloviridae"
+viral_taxa$name[viral_taxa$taxid == 325455] <- "Gammapapillomavirus"
+viral_taxa$name[viral_taxa$taxid == 333750] <- "Alphapapillomavirus"
+viral_taxa$name[viral_taxa$taxid == 694002] <- "Betacoronavirus"
+viral_taxa$name[viral_taxa$taxid == 334202] <- "Mupapillomavirus"
+viral_taxa$name[viral_taxa$taxid == 197911] <- "Alphainfluenzavirus"
+viral_taxa$name[viral_taxa$taxid == 186938] <- "Respirovirus"
+viral_taxa$name[viral_taxa$taxid == 333926] <- "Gammapapillomavirus 1"
+viral_taxa$name[viral_taxa$taxid == 337051] <- "Betapapillomavirus 1"
+viral_taxa$name[viral_taxa$taxid == 337043] <- "Alphapapillomavirus 4"
+viral_taxa$name[viral_taxa$taxid == 694003] <- "Betacoronavirus 1"
+viral_taxa$name[viral_taxa$taxid == 334204] <- "Mupapillomavirus 2"
+viral_taxa$name[viral_taxa$taxid == 334208] <- "Betapapillomavirus 4"
+viral_taxa$name[viral_taxa$taxid == 333928] <- "Gammapapillomavirus 2"
+viral_taxa$name[viral_taxa$taxid == 337039] <- "Alphapapillomavirus 2"
+viral_taxa$name[viral_taxa$taxid == 333929] <- "Gammapapillomavirus 3"
+viral_taxa$name[viral_taxa$taxid == 337042] <- "Alphapapillomavirus 7"
+viral_taxa$name[viral_taxa$taxid == 334203] <- "Mupapillomavirus 1"
+viral_taxa$name[viral_taxa$taxid == 333757] <- "Alphapapillomavirus 8"
+viral_taxa$name[viral_taxa$taxid == 337050] <- "Alphapapillomavirus 6"
+viral_taxa$name[viral_taxa$taxid == 333767] <- "Alphapapillomavirus 3"
+viral_taxa$name[viral_taxa$taxid == 333754] <- "Alphapapillomavirus 10"
+viral_taxa$name[viral_taxa$taxid == 687363] <- "Torque teno virus 24"
+viral_taxa$name[viral_taxa$taxid == 687342] <- "Torque teno virus 3"
+viral_taxa$name[viral_taxa$taxid == 687359] <- "Torque teno virus 20"
+viral_taxa$name[viral_taxa$taxid == 194441] <- "Primate T-lymphotropic virus 2"
+viral_taxa$name[viral_taxa$taxid == 334209] <- "Betapapillomavirus 5"
+viral_taxa$name[viral_taxa$taxid == 194965] <- "Aichivirus B"
+viral_taxa$name[viral_taxa$taxid == 333930] <- "Gammapapillomavirus 4"
+viral_taxa$name[viral_taxa$taxid == 337048] <- "Alphapapillomavirus 1"
+viral_taxa$name[viral_taxa$taxid == 337041] <- "Alphapapillomavirus 9"
+viral_taxa$name[viral_taxa$taxid == 337049] <- "Alphapapillomavirus 11"
+viral_taxa$name[viral_taxa$taxid == 337044] <- "Alphapapillomavirus 5"
+
+# Filter samples and add viral taxa information
+samples_keep <- read_counts %>% filter(n_reads_hv > 5) %>% pull(sample)
+mrg_hv_named <- mrg_hv %>% filter(sample %in% samples_keep, hv_status) %>% left_join(viral_taxa, by="taxid") 
+
+# Discover viral species & genera for HV reads
+raise_rank <- function(read_db, taxid_db, out_rank = "species", verbose = FALSE){
+  # Get higher ranks than search rank
+  ranks <- c("subspecies", "species", "subgenus", "genus", "subfamily", "family", "suborder", "order", "class", "subphylum", "phylum", "kingdom", "superkingdom")
+  rank_match <- which.max(ranks == out_rank)
+  high_ranks <- ranks[rank_match:length(ranks)]
+  # Merge read DB and taxid DB
+  reads <- read_db %>% select(-parent_taxid, -rank, -name) %>%
+    left_join(taxid_db, by="taxid")
+  # Extract sequences that are already at appropriate rank
+  reads_rank <- filter(reads, rank == out_rank)
+  # Drop sequences at a higher rank and return unclassified sequences
+  reads_norank <- reads %>% filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))
+  while(nrow(reads_norank) > 0){ # As long as there are unclassified sequences...
+    # Promote read taxids and re-merge with taxid DB, then re-classify and filter
+    reads_remaining <- reads_norank %>% mutate(taxid = parent_taxid) %>%
+      select(-parent_taxid, -rank, -name) %>%
+      left_join(taxid_db, by="taxid")
+    reads_rank <- reads_remaining %>% filter(rank == out_rank) %>%
+      bind_rows(reads_rank)
+    reads_norank <- reads_remaining %>%
+      filter(rank != out_rank, !rank %in% high_ranks, !is.na(taxid))
+  }
+  # Finally, extract and append reads that were excluded during the process
+  reads_dropped <- reads %>% filter(!seq_id %in% reads_rank$seq_id)
+  reads_out <- reads_rank %>% bind_rows(reads_dropped) %>%
+    select(-parent_taxid, -rank, -name) %>%
+    left_join(taxid_db, by="taxid")
+  return(reads_out)
+}
+hv_reads_species <- raise_rank(mrg_hv_named, viral_taxa, "species")
+hv_reads_genus <- raise_rank(mrg_hv_named, viral_taxa, "genus")
+hv_reads_family <- raise_rank(mrg_hv_named, viral_taxa, "family")
+```
+
+```{r}
+#| label: hv-family
+#| fig-height: 5
+#| fig-width: 7
+
+threshold_major_family <- 0.02
+
+# Count reads for each human-viral family
+hv_family_counts <- hv_reads_family %>% 
+  group_by(sample, name, taxid) %>%
+  count(name = "n_reads_hv") %>%
+  group_by(sample) %>%
+  mutate(p_reads_hv = n_reads_hv/sum(n_reads_hv))
+
+# Identify high-ranking families and group others
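+# (a family counts as "major" if its share of HV reads exceeds the threshold in at least one sample)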
+hv_family_major_tab <- hv_family_counts %>% group_by(name) %>% 
+  filter(p_reads_hv == max(p_reads_hv)) %>% filter(row_number() == 1) %>%
+  arrange(desc(p_reads_hv)) %>% filter(p_reads_hv > threshold_major_family)
+hv_family_counts_major <- hv_family_counts %>%
+  mutate(name_display = ifelse(name %in% hv_family_major_tab$name, name, "Other")) %>%
+  group_by(sample, name_display) %>%
+  summarize(n_reads_hv = sum(n_reads_hv), p_reads_hv = sum(p_reads_hv), 
+            .groups="drop") %>%
+  mutate(name_display = factor(name_display, 
+                               levels = c(hv_family_major_tab$name, "Other")))
+hv_family_counts_display <- hv_family_counts_major %>%
+  rename(p_reads = p_reads_hv, classification = name_display)
+
+# Plot
+g_hv_family <- g_comp_base + 
+  geom_col(data=hv_family_counts_display, position = "stack", width=1) +
+  scale_y_continuous(name="% HV Reads", limits=c(0,1.01), 
+                     breaks = seq(0,1,0.2),
+                     expand=c(0,0), labels = function(y) y*100) +
+  scale_fill_manual(values=palette_viral, name = "Viral family") +
+  labs(title="Family composition of human-viral reads") +
+  guides(fill=guide_legend(ncol=4)) +
+  theme(plot.title = element_text(size=rel(1.4), hjust=0, face="plain"))
+g_hv_family
+
+# Get most prominent families for text
+hv_family_collate <- hv_family_counts %>% group_by(name, taxid) %>% 
+  summarize(n_reads_tot = sum(n_reads_hv),
+            p_reads_max = max(p_reads_hv), .groups="drop") %>% 
+  arrange(desc(n_reads_tot))
+```
+
+When investigating individual viral families, I restricted myself to samples where that family made up at least 10% of human-viral reads, to avoid distortions arising from a small number of rare reads:
+
+```{r}
+#| label: hv-species-adeno
+#| fig-height: 5
+#| fig-width: 7
+
+threshold_major_species <- 0.05
+taxid_adeno <- 10508
+
+# Get set of adenoviridae reads
+adeno_samples <- hv_family_counts %>% filter(taxid == taxid_adeno) %>%
+  filter(p_reads_hv >= 0.1) %>%
+  pull(sample)
+adeno_ids <- hv_reads_family %>% 
+  filter(taxid == taxid_adeno, sample %in% adeno_samples) %>%
+  pull(seq_id)
+
+# Count reads for each adenoviridae species
+adeno_species_counts <- hv_reads_species %>%
+  filter(seq_id %in% adeno_ids) %>%
+  group_by(sample, name, taxid) %>%
+  count(name = "n_reads_hv") %>%
+  group_by(sample) %>%
+  mutate(p_reads_adeno = n_reads_hv/sum(n_reads_hv))
+
+# Identify high-ranking families and group others
+adeno_species_major_tab <- adeno_species_counts %>% group_by(name) %>% 
+  filter(p_reads_adeno == max(p_reads_adeno)) %>% 
+  filter(row_number() == 1) %>%
+  arrange(desc(p_reads_adeno)) %>% 
+  filter(p_reads_adeno > threshold_major_species)
+adeno_species_counts_major <- adeno_species_counts %>%
+  mutate(name_display = ifelse(name %in% adeno_species_major_tab$name, 
+                               name, "Other")) %>%
+  group_by(sample, name_display) %>%
+  summarize(n_reads_adeno = sum(n_reads_hv),
+            p_reads_adeno = sum(p_reads_adeno), 
+            .groups="drop") %>%
+  mutate(name_display = factor(name_display, 
+                               levels = c(adeno_species_major_tab$name, "Other")))
+adeno_species_counts_display <- adeno_species_counts_major %>%
+  rename(p_reads = p_reads_adeno, classification = name_display)
+
+# Plot
+g_adeno_species <- g_comp_base + 
+  geom_col(data=adeno_species_counts_display, position = "stack", width=1) +
+  scale_y_continuous(name="% Adenoviridae Reads", limits=c(0,1.01), 
+                     breaks = seq(0,1,0.2),
+                     expand=c(0,0), labels = function(y) y*100) +
+  scale_fill_manual(values=palette_viral, name = "Viral species") +
+  labs(title="Species composition of Adenoviridae reads") +
+  guides(fill=guide_legend(ncol=3)) +
+  theme(plot.title = element_text(size=rel(1.4), hjust=0, face="plain"))
+
+g_adeno_species
+
+# Get most prominent species for text
+adeno_species_collate <- adeno_species_counts %>% group_by(name, taxid) %>% 
+  summarize(n_reads_tot = sum(n_reads_hv), p_reads_mean = mean(p_reads_adeno), .groups="drop") %>% 
+  arrange(desc(n_reads_tot))
+```
+
+```{r}
+#| label: hv-species-polyoma
+#| fig-height: 5
+#| fig-width: 7
+
+threshold_major_species <- 0.1
+taxid_polyoma <- 151341
+
+# Get set of polyomaviridae reads
+polyoma_samples <- hv_family_counts %>% filter(taxid == taxid_polyoma) %>%
+  filter(p_reads_hv >= 0.1) %>%
+  pull(sample)
+polyoma_ids <- hv_reads_family %>% 
+  filter(taxid == taxid_polyoma, sample %in% polyoma_samples) %>%
+  pull(seq_id)
+
+# Count reads for each polyomaviridae species
+polyoma_species_counts <- hv_reads_species %>%
+  filter(seq_id %in% polyoma_ids) %>%
+  group_by(sample, name, taxid) %>%
+  count(name = "n_reads_hv") %>%
+  group_by(sample) %>%
+  mutate(p_reads_polyoma = n_reads_hv/sum(n_reads_hv))
+
+# Identify high-ranking families and group others
+polyoma_species_major_tab <- polyoma_species_counts %>% group_by(name) %>% 
+  filter(p_reads_polyoma == max(p_reads_polyoma)) %>% 
+  filter(row_number() == 1) %>%
+  arrange(desc(p_reads_polyoma)) %>% 
+  filter(p_reads_polyoma > threshold_major_species)
+polyoma_species_counts_major <- polyoma_species_counts %>%
+  mutate(name_display = ifelse(name %in% polyoma_species_major_tab$name, 
+                               name, "Other")) %>%
+  group_by(sample, name_display) %>%
+  summarize(n_reads_polyoma = sum(n_reads_hv),
+            p_reads_polyoma = sum(p_reads_polyoma), 
+            .groups="drop") %>%
+  mutate(name_display = factor(name_display, 
+                               levels = c(polyoma_species_major_tab$name, "Other")))
+polyoma_species_counts_display <- polyoma_species_counts_major %>%
+  rename(p_reads = p_reads_polyoma, classification = name_display)
+
+# Plot
+g_polyoma_species <- g_comp_base + 
+  geom_col(data=polyoma_species_counts_display, position = "stack", width=1) +
+  scale_y_continuous(name="% Polyomaviridae Reads", limits=c(0,1.01), 
+                     breaks = seq(0,1,0.2),
+                     expand=c(0,0), labels = function(y) y*100) +
+  scale_fill_manual(values=palette_viral, name = "Viral species") +
+  labs(title="Species composition of Polyomaviridae reads") +
+  guides(fill=guide_legend(ncol=3)) +
+  theme(plot.title = element_text(size=rel(1.4), hjust=0, face="plain"))
+
+g_polyoma_species
+
+# Get most prominent species for text
+polyoma_species_collate <- polyoma_species_counts %>% group_by(name, taxid) %>% 
+  summarize(n_reads_tot = sum(n_reads_hv), p_reads_mean = mean(p_reads_polyoma), .groups="drop") %>% 
+  arrange(desc(n_reads_tot))
+```
+
+```{r}
+#| label: hv-species-papilloma
+#| fig-height: 5
+#| fig-width: 7
+
+threshold_major_species <- 0.1
+taxid_papilloma <- 151340
+
+# Get set of papillomaviridae reads
+papilloma_samples <- hv_family_counts %>% filter(taxid == taxid_papilloma) %>%
+  filter(p_reads_hv >= 0.1) %>%
+  pull(sample)
+papilloma_ids <- hv_reads_family %>% 
+  filter(taxid == taxid_papilloma, sample %in% papilloma_samples) %>%
+  pull(seq_id)
+
+# Count reads for each papillomaviridae species
+papilloma_species_counts <- hv_reads_species %>%
+  filter(seq_id %in% papilloma_ids) %>%
+  group_by(sample, name, taxid) %>%
+  count(name = "n_reads_hv") %>%
+  group_by(sample) %>%
+  mutate(p_reads_papilloma = n_reads_hv/sum(n_reads_hv))
+
+# Identify high-ranking families and group others
+papilloma_species_major_tab <- papilloma_species_counts %>% group_by(name) %>% 
+  filter(p_reads_papilloma == max(p_reads_papilloma)) %>% 
+  filter(row_number() == 1) %>%
+  arrange(desc(p_reads_papilloma)) %>% 
+  filter(p_reads_papilloma > threshold_major_species)
+papilloma_species_counts_major <- papilloma_species_counts %>%
+  mutate(name_display = ifelse(name %in% papilloma_species_major_tab$name, 
+                               name, "Other")) %>%
+  group_by(sample, name_display) %>%
+  summarize(n_reads_papilloma = sum(n_reads_hv),
+            p_reads_papilloma = sum(p_reads_papilloma), 
+            .groups="drop") %>%
+  mutate(name_display = factor(name_display, 
+                               levels = c(papilloma_species_major_tab$name, "Other")))
+papilloma_species_counts_display <- papilloma_species_counts_major %>%
+  rename(p_reads = p_reads_papilloma, classification = name_display)
+
+# Plot
+g_papilloma_species <- g_comp_base + 
+  geom_col(data=papilloma_species_counts_display, position = "stack", width=1) +
+  scale_y_continuous(name="% Papillomaviridae Reads", limits=c(0,1.01), 
+                     breaks = seq(0,1,0.2),
+                     expand=c(0,0), labels = function(y) y*100) +
+  scale_fill_manual(values=palette_viral, name = "Viral species") +
+  labs(title="Species composition of Papillomaviridae reads") +
+  guides(fill=guide_legend(ncol=3)) +
+  theme(plot.title = element_text(size=rel(1.4), hjust=0, face="plain"))
+
+g_papilloma_species
+
+# Get most prominent species for text
+papilloma_species_collate <- papilloma_species_counts %>% group_by(name, taxid) %>% 
+  summarize(n_reads_tot = sum(n_reads_hv), p_reads_mean = mean(p_reads_papilloma), .groups="drop") %>% 
+  arrange(desc(n_reads_tot))
+```
+
+```{r}
+#| label: hv-species-herpes
+#| fig-height: 5
+#| fig-width: 7
+
+threshold_major_species <- 0.1
+taxid_herpes <- 10292
+
+# Get set of herpesviridae reads
+herpes_samples <- hv_family_counts %>% filter(taxid == taxid_herpes) %>%
+  filter(p_reads_hv >= 0.1) %>%
+  pull(sample)
+herpes_ids <- hv_reads_family %>% 
+  filter(taxid == taxid_herpes, sample %in% herpes_samples) %>%
+  pull(seq_id)
+
+# Count reads for each herpesviridae species
+herpes_species_counts <- hv_reads_species %>%
+  filter(seq_id %in% herpes_ids) %>%
+  group_by(sample, name, taxid) %>%
+  count(name = "n_reads_hv") %>%
+  group_by(sample) %>%
+  mutate(p_reads_herpes = n_reads_hv/sum(n_reads_hv))
+
+# Identify high-ranking families and group others
+herpes_species_major_tab <- herpes_species_counts %>% group_by(name) %>% 
+  filter(p_reads_herpes == max(p_reads_herpes)) %>% 
+  filter(row_number() == 1) %>%
+  arrange(desc(p_reads_herpes)) %>% 
+  filter(p_reads_herpes > threshold_major_species)
+herpes_species_counts_major <- herpes_species_counts %>%
+  mutate(name_display = ifelse(name %in% herpes_species_major_tab$name, 
+                               name, "Other")) %>%
+  group_by(sample, name_display) %>%
+  summarize(n_reads_herpes = sum(n_reads_hv),
+            p_reads_herpes = sum(p_reads_herpes), 
+            .groups="drop") %>%
+  mutate(name_display = factor(name_display, 
+                               levels = c(herpes_species_major_tab$name, "Other")))
+herpes_species_counts_display <- herpes_species_counts_major %>%
+  rename(p_reads = p_reads_herpes, classification = name_display)
+
+# Plot
+g_herpes_species <- g_comp_base + 
+  geom_col(data=herpes_species_counts_display, position = "stack", width=1) +
+  scale_y_continuous(name="% Herpesviridae Reads", limits=c(0,1.01), 
+                     breaks = seq(0,1,0.2),
+                     expand=c(0,0), labels = function(y) y*100) +
+  scale_fill_manual(values=palette_viral, name = "Viral species") +
+  labs(title="Species composition of Herpesviridae reads") +
+  guides(fill=guide_legend(ncol=3)) +
+  theme(plot.title = element_text(size=rel(1.4), hjust=0, face="plain"))
+
+g_herpes_species
+
+# Get most prominent species for text
+herpes_species_collate <- herpes_species_counts %>% group_by(name, taxid) %>% 
+  summarize(n_reads_tot = sum(n_reads_hv), p_reads_mean = mean(p_reads_herpes), .groups="drop") %>% 
+  arrange(desc(n_reads_tot))
+```
+
+I was a bit suspicious of this last result, given that it occurred in only one sample, but according to BLASTN, at least, these human gammaherpesvirus 4 (a.k.a. EBV) matches are real:
+
+```{r}
+#| label: hv-blast-hits
+#| fig-width: 6
+
+# Configure
+ref_taxids_hv <- c(10376)
+ref_names_hv <- sapply(ref_taxids_hv, function(x) viral_taxa %>% filter(taxid == x) %>% pull(name) %>% first)
+p_threshold <- 0.1
+
+# Get taxon names
+tax_names_path <- file.path(data_dir, "taxid-names.tsv.gz")
+tax_names <- read_tsv(tax_names_path, show_col_types = FALSE)
+
+# Add missing names
+tax_names_new <- tribble(~staxid, ~name,
+                         3050295, "Cytomegalovirus humanbeta5",
+                         459231, "FLAG-tagging vector pFLAG97-TSR",
+                         257877, "Macaca thibetana thibetana",
+                         256321, "Lentiviral transfer vector pHsCXW",
+                         419242, "Shuttle vector pLvCmvMYOCDHA",
+                         419243, "Shuttle vector pLvCmvLacZ",
+                         421868, "Cloning vector pLvCmvLacZ.Gfp",
+                         421869, "Cloning vector pLvCmvMyocardin.Gfp",
+                         426303, "Lentiviral vector pNL-GFP-RRE(SA)",
+                         436015, "Lentiviral transfer vector pFTMGW",
+                         454257, "Shuttle vector pLvCmvMYOCD2aHA",
+                         476184, "Shuttle vector pLV.mMyoD::ERT2.eGFP",
+                         476185, "Shuttle vector pLV.hMyoD.eGFP",
+                         591936, "Piliocolobus tephrosceles",
+                         627481, "Lentiviral transfer vector pFTM3GW",
+                         680261, "Self-inactivating lentivirus vector pLV.C-EF1a.cyt-bGal.dCpG",
+                         2952778, "Expression vector pLV[Exp]-EGFP:T2A:Puro-EF1A",
+                         3022699, "Vector PAS_122122",
+                         3025913, "Vector pSIN-WP-mPGK-GDNF",
+                         3105863, "Vector pLKO.1-ZsGreen1",
+                         3105864, "Vector pLKO.1-ZsGreen1 mouse Wfs1 shRNA",
+                         3108001, "Cloning vector pLVSIN-CMV_Neo_v4.0",
+                         3109234, "Vector pTwist+Kan+High",
+                         3117662, "Cloning vector pLV[Exp]-CBA>P301L",
+                         3117663, "Cloning vector pLV[Exp]-CBA>P301L:T2A:mRuby3",
+                         3117664, "Cloning vector pLV[Exp]-CBA>hMAPT[NM_005910.6](ns):T2A:mRuby3",
+                         3117665, "Cloning vector pLV[Exp]-CBA>mRuby3",
+                         3117666, "Cloning vector pLV[Exp]-CBA>mRuby3/NFAT3 fusion protein",
+                         3117667, "Cloning vector pLV[Exp]-Neo-mPGK>{EGFP-hSEPT6}",
+                         438045, "Xenotropic MuLV-related virus",
+                         447135, "Myodes glareolus",
+                         590745, "Mus musculus mobilized endogenous polytropic provirus",
+                         181858, "Murine AIDS virus-related provirus",
+                         356663, "Xenotropic MuLV-related virus VP35",
+                         356664, "Xenotropic MuLV-related virus VP42",
+                         373193, "Xenotropic MuLV-related virus VP62",
+                         286419, "Canis lupus dingo",
+                         415978, "Sus scrofa scrofa",
+                         494514, "Vulpes lagopus",
+                         3082113, "Rangifer tarandus platyrhynchus",
+                         3119969, "Bubalus kerabau")
+tax_names <- bind_rows(tax_names, tax_names_new)
+
+# Get matches
+hv_blast_staxids <- hv_reads_species %>% filter(taxid %in% ref_taxids_hv) %>%
+  group_by(taxid) %>% mutate(n_seq = n()) %>%
+  left_join(blast_paired, by="seq_id") %>%
+  mutate(staxid = as.integer(staxid)) %>%
+  left_join(tax_names %>% rename(sname=name), by="staxid")
+
+# Count matches
+hv_blast_counts <- hv_blast_staxids %>%
+  group_by(taxid, name, staxid, sname, n_seq) %>%
+  count %>% mutate(p=n/n_seq)
+
+# Subset to major matches
+hv_blast_counts_major <- hv_blast_counts %>% 
+  filter(n>1, p>p_threshold, !is.na(staxid)) %>%
+  arrange(desc(p)) %>% group_by(taxid) %>%
+  filter(row_number() <= 25) %>%
+  mutate(name_display = ifelse(name == ref_names_hv[1], "EBV", name))
+
+# Plot
+g_hv_blast <- ggplot(hv_blast_counts_major, mapping=aes(x=p, y=sname)) +
+  geom_col() +
+  facet_grid(name_display~., scales="free_y", space="free_y") +
+  scale_x_continuous(name="% mapped reads", limits=c(0,1), 
+                     breaks=seq(0,1,0.2), expand=c(0,0)) +
+  theme_base + theme(axis.title.y = element_blank())
+g_hv_blast
+```
+
+Finally, here again are the overall relative abundances of the specific viral genera I picked out manually in my last entry:
+
+```{r}
+#| fig-height: 5
+#| label: ra-genera
+#| warning: false
+
+# Define reference genera
+path_genera_rna <- c("Mamastrovirus", "Enterovirus", "Salivirus", "Kobuvirus", "Norovirus", "Sapovirus", "Rotavirus", "Alphacoronavirus", "Betacoronavirus", "Alphainfluenzavirus", "Betainfluenzavirus", "Lentivirus")
+path_genera_dna <- c("Mastadenovirus", "Alphapolyomavirus", "Betapolyomavirus", "Alphapapillomavirus", "Betapapillomavirus", "Gammapapillomavirus", "Orthopoxvirus", "Simplexvirus",
+                     "Lymphocryptovirus", "Cytomegalovirus", "Dependoparvovirus")
+path_genera <- bind_rows(tibble(name=path_genera_rna, genome_type="RNA genome"),
+                         tibble(name=path_genera_dna, genome_type="DNA genome")) %>%
+  left_join(viral_taxa, by="name")
+
+# Count in each sample
+mrg_hv_named_all <- mrg_hv %>% left_join(viral_taxa, by="taxid")
+hv_reads_genus_all <- raise_rank(mrg_hv_named_all, viral_taxa, "genus")
+n_path_genera <- hv_reads_genus_all %>% 
+  group_by(sample, name, taxid) %>% 
+  count(name="n_reads_viral") %>% 
+  inner_join(path_genera, by=c("name", "taxid")) %>%
+  left_join(read_counts_raw, by=c("sample")) %>%
+  mutate(p_reads_viral = n_reads_viral/n_reads_raw)
+
+# Pivot out and back to add zero lines
+n_path_genera_out <- n_path_genera %>% ungroup %>% select(sample, name, n_reads_viral) %>%
+  pivot_wider(names_from="name", values_from="n_reads_viral", values_fill=0) %>%
+  pivot_longer(-sample, names_to="name", values_to="n_reads_viral") %>%
+  left_join(read_counts_raw, by="sample") %>%
+  left_join(path_genera, by="name") %>%
+  mutate(p_reads_viral = n_reads_viral/n_reads_raw)
+
+# Aggregate across samples
+n_path_genera_stype <- n_path_genera_out %>% 
+  group_by(name, taxid, genome_type) %>%
+  summarize(n_reads_raw = sum(n_reads_raw),
+            n_reads_viral = sum(n_reads_viral), .groups = "drop") %>%
+  mutate(sample="All samples", location="All locations",
+         p_reads_viral = n_reads_viral/n_reads_raw,
+         na_type = "DNA")
+
+# Plot
+g_path_genera <- ggplot(n_path_genera_stype,
+                        aes(y=name, x=p_reads_viral)) +
+  geom_point() +
+  scale_x_log10(name="Relative abundance") +
+  facet_grid(genome_type~., scales="free_y") +
+  theme_base + theme(axis.title.y = element_blank())
+g_path_genera
+```
+
+# Conclusion
+
+I've had trouble with this dataset previously, so I was surprised at how well this analysis went. It seems the improvements I've made to the pipeline over the last couple of months have really had an effect. Like other DNA wastewater datasets I've looked at recently, this one (a) has very low HV relative abundance overall, and (b) shows a very high preponderance of human mastadenovirus F. I have one more DNA dataset from the P2RA study to analyze with this pipeline, so we'll see if this pattern persists there.
diff --git a/notebooks/2024-05-01_ng.qmd b/notebooks/2024-05-01_ng.qmd
index 009808a..cab0210 100644
--- a/notebooks/2024-05-01_ng.qmd
+++ b/notebooks/2024-05-01_ng.qmd
@@ -403,7 +403,7 @@ p_reads_summ <- p_reads_summ_prep %>%
 p_reads_summ
 ```
 
-As in previous DNA datasets, the vast majority of classified reads were bacterial in origin. The fraction of virus reads varied substantially between sample types, averaging \<0.01% in influent and final effluent but closer to 0.05% in other sample types. Interestingly (though not particularly relevantly for this analysis), the fraction of archaeal reads was much higher in influent than other sample types, in contrast to [Bengtsson-Palme](https://data.securebio.org/wills-public-notebook/notebooks/2024-05-01_bengtsson-palme.html) where it was highest in slidge.
+As in previous DNA datasets, the vast majority of classified reads were bacterial in origin. The fraction of virus reads varied substantially between sample types, averaging \<0.01% in influent and final effluent but closer to 0.05% in other sample types. Interestingly (though not particularly relevantly for this analysis), the fraction of archaeal reads was much higher in influent than other sample types, in contrast to [Bengtsson-Palme](https://data.securebio.org/wills-public-notebook/notebooks/2024-05-01_bengtsson-palme.html) where it was highest in sludge.
 
 As is common for DNA data, viral reads were overwhelmingly dominated by *Caudoviricetes* phages, though one wet-well sample contained a substantial fraction of *Alsuviricetes* (a class of mainly plant pathogens that includes *Virgaviridae*):