---
title: "Estimating the effect of read depth on duplication rate for Project Runway DNA data"
subtitle: "How deep can we go?"
author: "Will Bradshaw"
date: 2023-11-02
format:
  html:
    code-fold: true
    code-tools: true
    code-link: true
    df-print: paged
editor: visual
title-block-banner: black
---

```{r}
#| label: load-packages
#| include: false
library(tidyverse)
library(cowplot)
library(patchwork)
library(fastqcr)
source("../scripts/aux_plot-theme.R")
```

One relevant question for both Project Runway and other NAO sequencing is: what is the maximum read depth at which we can sequence a given sample while retaining an acceptable level of sequence duplication?

As discussed in a previous entry, duplicate reads can arise in sequencing data from a variety of processes, including true biological duplicates present in the raw sample; processing duplicates arising from amplification and other processes during sample and library prep; and sequencing duplicates arising from various processes on the actual flow cell.

As we sequence more deeply, we expect the fraction of biological and processing duplicates (but not, I think, sequencing duplicates) in our read data to increase. In the former case, this is because we are capturing a larger fraction of all the input molecules in our sample; in the latter, because we are sequencing copies of the same sequence over and over again. Intuitively, I expect the increase in processing duplicates to swamp that in biological duplicates at high read depth, at least for library prep protocols that involve amplification[^1].

One simple approach to investigating the overall effect of read depth on duplication levels in a sample is rarefaction: downsampling a library to different numbers of reads and seeing how the duplication rate changes as a function of read count. In this notebook entry, I apply this approach to sequencing data from the Project Runway initial DNA dataset, to see how duplication rate behaves in this case.

# Methods

To analyze the effect of read depth on duplication rates in this data, I first concatenated together the raw reads from the two sequencing replicates of each sample. This allowed the analysis to detect duplicates across replicates that would have been missed had each replicate been analyzed separately.

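Since concatenated gzip streams are themselves a valid gzip file, this can be done with plain `cat`; a minimal sketch, with hypothetical file names:

```
cat raw/${p}_rep1_1.fastq.gz raw/${p}_rep2_1.fastq.gz > raw/${p}_1.fastq.gz
cat raw/${p}_rep1_2.fastq.gz raw/${p}_rep2_2.fastq.gz > raw/${p}_2.fastq.gz
```
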
Next, I performed preprocessing with fastp to remove adapter sequences and low-quality bases that might interfere with duplicate detection. I didn't collapse read pairs together or discard read pairs with overall low quality.

```
# -Q and -L disable fastp's read filtering by quality and length;
# adapter trimming and tail-quality trimming are still applied.
for p in $(cat prefixes.txt); do
    echo $p
    fastp -i raw/${p}_1.fastq.gz -I raw/${p}_2.fastq.gz \
        -o preproc/${p}_fastp_1.fastq.gz -O preproc/${p}_fastp_2.fastq.gz \
        --failed_out preproc/${p}_fastp_failed.fastq.gz \
        --cut_tail --correction --detect_adapter_for_pe \
        --adapter_fasta adapters.fa --trim_poly_x --thread 16 -Q -L
done
```

I then took each pair of preprocessed FASTQ files and downsampled them to specified numbers of reads, from 10,000 to 100,000,000 in steps of one order of magnitude (OOM), using `seqtk sample`. I performed downsampling three times for each read count, with the intent of calculating the duplication rate separately for each downsampling replicate and taking the average.

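To illustrate (paths and seed hypothetical), each downsampling run looks something like the following; note that `seqtk sample` must be given the same seed (`-s`) for both files of a pair so that mates stay in sync:

```
seqtk sample -s42 preproc/${p}_fastp_1.fastq.gz 1000000 | gzip > downsampled/${p}_1e6_1.fastq.gz
seqtk sample -s42 preproc/${p}_fastp_2.fastq.gz 1000000 | gzip > downsampled/${p}_1e6_2.fastq.gz
```
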
Finally, I ran deduplication with Clumpify on each downsampled pair of read files, as well as on the full preprocessed files, and recorded the fraction of reads discarded in each case. I performed this twice, using different Clumpify settings each time:

1. First, I attempted to perform deduplication in a maximally comprehensive way, using Clumpify's `unpair repair` options to identify and remove duplicates in opposite orientation across read pairs (as discussed [here](https://data.securebio.org/wills-public-notebook/notebooks/2023-11-02_project-runway-comparison.html#conclusions)). This configuration finds more duplicates, but (a) might overestimate duplicates in cases where only one read in a pair matches another read, and (b) causes memory-related errors for large subsample sizes.

    ```
    clumpify.sh in=<in1> in2=<in2> out=<out1> out2=<out2> dedupe containment unpair repair
    ```

2. Second, due to the aforementioned issues with approach 1, I repeated deduplication without Clumpify's `unpair repair` options enabled, providing a lower-bound estimate of duplication levels which should be more consistent with estimates provided by, for example, FASTQC.

    ```
    clumpify.sh in=<in1> in2=<in2> out=<out1> out2=<out2> dedupe containment
    ```

# Results

Plotting the fraction of duplicate reads (with or without Clumpify's `unpair repair` options) gives us the following result:

```{r}
#| warning: false
# Import data
data_dir <- "../data/2023-11-06_pr-dedup/"
n_dup_path <- file.path(data_dir, "n_dup.csv")
n_dup <- read_csv(n_dup_path, show_col_types = FALSE) %>%
  mutate(n_dup = read_pairs_in - read_pairs_out,
         p_dup = n_dup/read_pairs_in,
         o_dup = p_dup/(1-p_dup))

# Reshape data: mean and SD of duplication rate across downsampling replicates
n_dup_flat <- n_dup %>% group_by(sample, unpair_repair, read_pairs_in) %>%
  summarize(p_dup_mean = mean(p_dup), p_dup_sd = sd(p_dup),
            p_dup_min = max(0, p_dup_mean-p_dup_sd),
            p_dup_max = min(1, p_dup_mean+p_dup_sd),
            o_dup_mean = mean(o_dup), o_dup_sd = sd(o_dup),
            o_dup_min = max(0, o_dup_mean-o_dup_sd),
            o_dup_max = min(1, o_dup_mean+o_dup_sd),
            .groups = "drop")

# Plot data
g_dup_flat_base <- ggplot(n_dup_flat, aes(x=read_pairs_in, y=p_dup_mean, color = sample)) +
  geom_line() + geom_errorbar(aes(ymin=p_dup_min, ymax=p_dup_max)) + geom_point(shape = 16) +
  scale_x_log10(name = "# Input Read Pairs") +
  facet_grid(.~unpair_repair, labeller = "label_both") +
  scale_color_brewer(palette = "Dark2") +
  theme_base + theme(aspect.ratio = 1)
g_dup_flat_lin <- g_dup_flat_base +
  scale_y_continuous(name = "Fraction of duplicate reads", limits = c(0,0.41),
                     breaks = seq(0,1,0.1), expand = c(0,0))
g_dup_flat_log <- g_dup_flat_base + scale_y_log10(name = "Fraction of duplicate reads")

# Show plots
g_dup_flat_lin
g_dup_flat_log
```

We can see that the fraction of duplicates reaches quite high levels as we approach the full read count, especially when `unpair repair` is enabled. The gradient of increase also looks quite steep on the linear-log plot, which would suggest that further OOM increases in read depth might result in quite large increases in the fraction of duplicates. It's also apparent that, for whatever reason, D23-13406 has substantially fewer duplicates at any given read depth than the other two samples.

However, further interpretation of these results, including extrapolation to greater read depths, is made difficult by the lack of a theoretical model for what we expect to see. It's also not clear which mode of visualization (linear-log? log-linear? log-log? fraction of duplicates expressed as probabilities or odds?) is most meaningful for interpretation.

To start resolving some of these roadblocks, I spent some time working on a very simple model of read duplication, to see what it might tell us about the expected pattern of duplicates as a function of read depth.

# A very very simple model of read duplication

- Imagine a sample containing \(N\) distinct molecules, which are uniformly amplified up to \(M = N \times 2^C\) molecules by a perfectly unbiased \(C\)-cycle PCR reaction. Adapters are ligated and the resulting library is washed across the flow cell to generate \(X\) total clusters (again, without bias[^2]). These clusters are then sequenced by a process that generates optical duplicates at some rate \(O\), for a total expectation of \(R=X\times(1+O)\) reads.

- Each cluster is selected from the \(M\) molecules in the library, without replacement. When \(C\) is large and amplification is uniform across molecules, this is well-approximated by selecting from the \(N\) input molecules with replacement, since each input molecule's \(2^C\) copies then make up a fraction \(1/N\) of the library, and sampling without replacement scarcely depletes any one molecule's copies. Under these conditions, the number of clusters generated from input molecule \(i\) is approximately \(X_i \sim \mathrm{B}(X,N^{-1})\).

- The number of optical duplicates generated from a given input molecule is then approximately \(P_i \sim \mathrm{B}(X_i,O)\), and the total number of reads corresponding to a given input molecule is thus \(R_i = X_i + P_i\). The number of duplicates generated from that molecule is then \(D_i = \max(0,R_i - 1)\), and the overall fraction of duplicates is \(F=\frac{D}{R}=\frac{\sum_i D_i}{\sum_i R_i}\).

- The expected fraction of duplicates under this model can be estimated analytically as follows[^3]:

    - \(E(R_i)=\sum_{r=1}^{2X} r \cdot P(R_i = r) = \sum_{r=1}^{2X}r\cdot\left[\sum_{k=0}^{r}P(X_i = k)\cdot P(P_i=r-k \mid X_i=k)\right]\)

    - \(E(D_i) = \sum_{r=1}^{2X} (r-1) \cdot P(R_i = r)\)

    - \(E(D) = E\left(\sum_i D_i\right) = N \cdot E(D_i)\)

    - \(R = \sum_i R_i = \sum_r r \cdot \mathbb{N}(R_i = r)\), where \(\mathbb{N}(R_i = r)\) denotes the number of input molecules with exactly \(r\) reads

    - When \(N\) is large, \(\mathbb{N}(R_i = r) \approx N \cdot P(R_i = r)\), and so \(R \approx \sum_r r \cdot N \cdot P(R_i = r) = N \cdot E(R_i)\)

    - Hence \(E\left(\frac{1}{R}\right) \approx \frac{1}{N \cdot E(R_i)}\)

    - Thus \(E(F) = E\left(\frac{D}{R}\right) = E(D) \cdot E\left(\frac{1}{R}\right) \approx \frac{N \cdot E(D_i)}{N \cdot E(R_i)} = \frac{E(D_i)}{E(R_i)}\)

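As a quick sanity check on this math (not part of the original analysis), the model is simple enough to simulate directly: the sketch below draws \(X\) clusters from \(N\) molecules with replacement, adds optical duplicates at rate \(O\), and computes the realized duplicate fraction, which can be compared against \(E(F)\) from the formulae above. Parameter values here are arbitrary.

```{r}
# Monte Carlo check of the duplication model (illustrative only).
simulate_dup_fraction <- function(n_molecules, total_clusters, p_opt_dup) {
  clusters <- sample.int(n_molecules, total_clusters, replace = TRUE)
  x_i <- tabulate(clusters, nbins = n_molecules) # clusters per input molecule
  p_i <- rbinom(n_molecules, x_i, p_opt_dup)     # optical duplicates per molecule
  r_i <- x_i + p_i                               # reads per molecule
  sum(pmax(0, r_i - 1)) / sum(r_i)               # F = D / R
}
simulate_dup_fraction(n_molecules = 1e6, total_clusters = 1e6, p_opt_dup = 1e-6)
```
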
Using computational implementations of these formulae, we can investigate how the fraction of duplicates varies with the number of clusters for different parameter values:

```{r}
# Define auxiliary functions
log_p_clusters <- function(n_clusters, total_clusters, n_molecules) dbinom(n_clusters, total_clusters, 1/n_molecules, log=TRUE)
log_p_opt_dups <- function(n_opt_dups, n_clusters, p_opt_dup) dbinom(n_opt_dups, n_clusters, p_opt_dup, log=TRUE)
# P(R_i = n_reads): convolve cluster-count and optical-duplicate distributions.
# Since P_i <= X_i, the cluster count is at least ceiling(n_reads/2).
p_reads <- function(n_reads, total_clusters, n_molecules, p_opt_dup){
  n_clusters <- n_reads - (0:(n_reads/2))
  log_p_n_clusters <- log_p_clusters(n_clusters, total_clusters, n_molecules)
  log_p_n_opt_dups <- log_p_opt_dups(n_reads - n_clusters, n_clusters, p_opt_dup)
  log_p_reads_clusters <- log_p_n_clusters + log_p_n_opt_dups
  return(sum(exp(log_p_reads_clusters)))
}

# Define main function
exp_duplicates <- function(n_molecules, total_clusters, p_opt_dup, initial_vector_length = 1e8){
  pr <- purrr::partial(p_reads, total_clusters = total_clusters, n_molecules = n_molecules, p_opt_dup = p_opt_dup)
  # Calculate read count probabilities, scanning upward and stopping once
  # probabilities return to zero after the bulk of the distribution has passed
  n_reads <- 1:initial_vector_length
  p_n_reads <- numeric(initial_vector_length)
  break_zero <- FALSE
  for (n in 1:length(n_reads)){
    p <- pr(n_reads[n])
    if (p == 0 && break_zero){
      n_reads <- n_reads[1:n]
      p_n_reads <- p_n_reads[1:n]
      break
    } else if (p != 0 && !break_zero){
      break_zero <- TRUE
    }
    p_n_reads[n] <- p
  }
  # Calculate fraction of duplicates
  n_duplicates <- n_reads - 1
  p_duplicates <- sum(n_duplicates * p_n_reads)/sum(n_reads * p_n_reads)
  return(p_duplicates)
}

# Parameter set 1: N = 1e6, O = 1e-6
tab_model_1 <- tibble(n_molecules = 1e6,
                      p_opt_dup = 1e-6,
                      total_clusters = round(10^seq(4,8,0.5))) %>%
  group_by(n_molecules, p_opt_dup) %>%
  mutate(p_duplicates = sapply(total_clusters, function(x) exp_duplicates(n_molecules, x, p_opt_dup)))
g_1 <- ggplot(tab_model_1, aes(x=total_clusters, y=p_duplicates)) +
  geom_vline(aes(xintercept = n_molecules), linetype = "dashed", colour = "red") + geom_line() + geom_point(shape = 16) +
  scale_x_log10(name = "# Total Clusters") +
  scale_y_continuous(name = "Fraction of duplicate reads", limits = c(0,1), breaks = seq(0,1,0.2), expand = c(0,0)) +
  labs(title = paste0("N = ", tab_model_1$n_molecules[1], ", O = ", tab_model_1$p_opt_dup[1])) +
  theme_base + theme(aspect.ratio = 1)

# Parameter set 2: N = 1e9, O = 1e-6
tab_model_2 <- tibble(n_molecules = 1e9,
                      p_opt_dup = 1e-6,
                      total_clusters = round(10^seq(6,12,0.5))) %>%
  group_by(n_molecules, p_opt_dup) %>%
  mutate(p_duplicates = sapply(total_clusters, function(x) exp_duplicates(n_molecules, x, p_opt_dup)))
g_2 <- ggplot(tab_model_2, aes(x=total_clusters, y=p_duplicates)) +
  geom_vline(aes(xintercept = n_molecules), linetype = "dashed", colour = "red") + geom_line() + geom_point(shape = 16) +
  scale_x_log10(name = "# Total Clusters") +
  scale_y_continuous(name = "Fraction of duplicate reads", limits = c(0,1), breaks = seq(0,1,0.2), expand = c(0,0)) +
  labs(title = paste0("N = ", tab_model_2$n_molecules[1], ", O = ", tab_model_2$p_opt_dup[1])) +
  theme_base + theme(aspect.ratio = 1)

# Parameter set 3: N = 1e12, O = 1e-6
tab_model_3 <- tibble(n_molecules = 1e12,
                      p_opt_dup = 1e-6,
                      total_clusters = round(10^seq(9,15,0.5))) %>%
  group_by(n_molecules, p_opt_dup) %>%
  mutate(p_duplicates = sapply(total_clusters, function(x) exp_duplicates(n_molecules, x, p_opt_dup)))
g_3 <- ggplot(tab_model_3, aes(x=total_clusters, y=p_duplicates)) +
  geom_vline(aes(xintercept = n_molecules), linetype = "dashed", colour = "red") + geom_line() + geom_point(shape = 16) +
  scale_x_log10(name = "# Total Clusters") +
  scale_y_continuous(name = "Fraction of duplicate reads", limits = c(0,1), breaks = seq(0,1,0.2), expand = c(0,0)) +
  labs(title = paste0("N = ", tab_model_3$n_molecules[1], ", O = ", tab_model_3$p_opt_dup[1])) +
  theme_base + theme(aspect.ratio = 1)
g_1 + g_2 + g_3
```

I don't want to take these results too seriously, since they're based on an extremely simple model, but there are some qualitative takeaways that I found helpful to keep in mind when looking at the real data. In particular, the general pattern of the model results is that **the fraction of duplicate reads follows a sigmoidal pattern on a linear-log plot**:

- When \(X \ll N\), the fraction of duplicates \(F\) is close to zero.

- As \(X\) approaches \(N\), the fraction of duplicates begins increasing, first slowly and then (after \(F\) reaches about 0.1) rapidly, before leveling off after \(F\) exceeds about 0.9.

- When \(X \gg N\), \(F \approx 1\).

At least under the assumptions used here, once the fraction of duplicates goes above 15% or so, further OOM increases in the number of clusters (e.g. by buying more or larger flow cells) will lead to a dramatic increase in the fraction of duplicate reads.

# Applying model takeaways

Returning to the results from the BMC data with these modeling results in mind:

```{r}
g_dup_flat_lin
```

Assuming that the real data will follow a sigmoidal pattern roughly resembling that from the model, we see that all samples (with the possible exception of D23-13406 when `unpair repair` is disabled) are in the "danger zone", such that further OOM increases in read depth will likely lead to a dramatic increase in the fraction of duplicate reads. As such, **it probably isn't worth paying for a further OOM increase (or even half-OOM increase) in read depth for these samples**.

# Conclusions

- Overall, running this analysis was a frustrating experience, due to difficulties finding a configuration of Clumpify (or any other deduplication tool I know of) that (i) I trust to remove duplicates appropriately without predictably over- or under-counting, and (ii) runs well on large FASTQ files. Ultimately, I think we should treat the true level of duplicates as somewhere in between that measured by method 1 (Clumpify with `unpair repair` enabled) and method 2 (Clumpify with `unpair repair` disabled).

    - For the actual virus-detection pipeline, my current best bet is that we should run method 2 on the full dataset, then method 1 on the specific viral hits identified by Kraken2 (or the alignment tool used for validation, once that's been implemented). However, we may ultimately want to implement our own tool for dealing with this problem.

- Nevertheless, I was able to generate rarefaction curves for duplication rate as a function of read count using both methods. The results, in combination with a very simple model I generated to aid interpretation, suggest that we're probably already at about the highest OOM of read depth from which these samples will yield useful information (with the possible exception of D23-13406).

# Footnotes

[^1]: It might be worth explicitly modeling the difference in behavior between different kinds of duplicates as sequencing depth increases, to see whether these intuitions are borne out.

[^2]: Probably the two biggest improvements that could be made to this model in future are (i) introducing biological duplicates, and (ii) introducing sequence-specific bias in PCR amplification and cluster formation.

[^3]: I'd appreciate it if someone else on the team could check my math here.
diff --git a/docs/notebooks/2023-11-02_project-runway-dna-deduplication_files/figure-html/unnamed-chunk-2-1.png b/docs/notebooks/2023-11-02_project-runway-dna-deduplication_files/figure-html/unnamed-chunk-2-1.png
new file mode 100644
index 0000000..fa3f8e4
Binary files /dev/null and b/docs/notebooks/2023-11-02_project-runway-dna-deduplication_files/figure-html/unnamed-chunk-2-1.png differ
diff --git a/docs/notebooks/2023-11-02_project-runway-dna-deduplication_files/figure-html/unnamed-chunk-2-2.png b/docs/notebooks/2023-11-02_project-runway-dna-deduplication_files/figure-html/unnamed-chunk-2-2.png
new file mode 100644
index 0000000..de354b7
Binary files /dev/null and b/docs/notebooks/2023-11-02_project-runway-dna-deduplication_files/figure-html/unnamed-chunk-2-2.png differ
diff --git a/docs/notebooks/2023-11-02_project-runway-dna-deduplication_files/figure-html/unnamed-chunk-3-1.png b/docs/notebooks/2023-11-02_project-runway-dna-deduplication_files/figure-html/unnamed-chunk-3-1.png
new file mode 100644
index 0000000..81f6c21
Binary files /dev/null and b/docs/notebooks/2023-11-02_project-runway-dna-deduplication_files/figure-html/unnamed-chunk-3-1.png differ
diff --git a/docs/notebooks/2023-11-02_project-runway-dna-deduplication_files/figure-html/unnamed-chunk-4-1.png b/docs/notebooks/2023-11-02_project-runway-dna-deduplication_files/figure-html/unnamed-chunk-4-1.png
new file mode 100644
index 0000000..fa3f8e4
Binary files /dev/null and b/docs/notebooks/2023-11-02_project-runway-dna-deduplication_files/figure-html/unnamed-chunk-4-1.png differ
diff --git a/docs/search.json b/docs/search.json
index dd3ba52..aa5732c 100644
--- a/docs/search.json
+++ b/docs/search.json
@@ -32,7 +32,7 @@
"href": "index.html",
"title": "Will's Public NAO Notebook",
"section": "",
- "text": "Comparing viral read assignments between pipelines on Project Runway data\n\n\n\n\n\n\n\n\n\nNov 2, 2023\n\n\n\n\n\n\n \n\n\n\n\nInitial analysis of Project Runway protocol testing data\n\n\n\n\n\n\n\n\n\nOct 31, 2023\n\n\n\n\n\n\n \n\n\n\n\nComparing options for read deduplication\n\n\nClumpify vs fastp\n\n\n\n\n\n\nOct 19, 2023\n\n\n\n\n\n\n \n\n\n\n\nComparing Ribodetector and bbduk for rRNA detection\n\n\nIn search of quick rRNA filtering.\n\n\n\n\n\n\nOct 16, 2023\n\n\n\n\n\n\n \n\n\n\n\nComparing FASTP and AdapterRemoval for MGS pre-processing\n\n\nTwo tools – how do they perform?\n\n\n\n\n\n\nOct 12, 2023\n\n\n\n\n\n\n \n\n\n\n\nHow does Element AVITI sequencing work?\n\n\nFindings of a shallow investigation\n\n\n\n\n\n\nOct 11, 2023\n\n\n\n\n\n\n \n\n\n\n\nExtraction experiment 2: high-level results & interpretation\n\n\nComparing RNA yields and quality across extraction kits for settled solids\n\n\n\n\n\n\nSep 21, 2023\n\n\n\n\n\n\nNo matching items"
+ "text": "Comparing viral read assignments between pipelines on Project Runway data\n\n\n\n\n\n\n\n\n\nNov 2, 2023\n\n\n\n\n\n\n \n\n\n\n\nEstimating the effect of read depth on duplication rate for Project Runway DNA data\n\n\nHow deep can we go?\n\n\n\n\n\n\nNov 2, 2023\n\n\n\n\n\n\n \n\n\n\n\nInitial analysis of Project Runway protocol testing data\n\n\n\n\n\n\n\n\n\nOct 31, 2023\n\n\n\n\n\n\n \n\n\n\n\nComparing options for read deduplication\n\n\nClumpify vs fastp\n\n\n\n\n\n\nOct 19, 2023\n\n\n\n\n\n\n \n\n\n\n\nComparing Ribodetector and bbduk for rRNA detection\n\n\nIn search of quick rRNA filtering.\n\n\n\n\n\n\nOct 16, 2023\n\n\n\n\n\n\n \n\n\n\n\nComparing FASTP and AdapterRemoval for MGS pre-processing\n\n\nTwo tools – how do they perform?\n\n\n\n\n\n\nOct 12, 2023\n\n\n\n\n\n\n \n\n\n\n\nHow does Element AVITI sequencing work?\n\n\nFindings of a shallow investigation\n\n\n\n\n\n\nOct 11, 2023\n\n\n\n\n\n\n \n\n\n\n\nExtraction experiment 2: high-level results & interpretation\n\n\nComparing RNA yields and quality across extraction kits for settled solids\n\n\n\n\n\n\nSep 21, 2023\n\n\n\n\n\n\nNo matching items"
},
{
"objectID": "notebooks/2023-10-12_fastp-vs-adapterremoval.html",
@@ -138,5 +138,19 @@
"title": "Comparing viral read assignments between pipelines on Project Runway data",
"section": "",
"text": "In my last notebook entry, I reviewed some basic initial analyses of Project Runway DNA sequencing data. One notable result was that the number of reads assigned to human-infecting viruses differed significantly between the pipeline I ran for that entry and the current public pipeline. In this entry, I dig into these differences in more depth, to see whether they tell us anything about which tools to incorporate into the next version of the public pipeline.\nAt a high level, there are three main differences between the two pipelines:\n\nThe public pipeline uses AdapterRemoval for removal of adapters and quality trimming, while my pipeline uses FASTP.\nMy pipeline uses bbduk to identify and remove ribosomal reads prior to Kraken analysis, while the public pipeline does not.\nMy pipeline applies deduplication prior to Kraken analysis using clumpify, while the public pipeline applies it after Kraken analysis via a manual method.\n\nIn principle, any of these differences could be responsible for the differences in read assignment we observe. However, since very few reads were identified as ribosomal or as duplicates by the new pipeline, it’s unlikely that these differences are those responsible.\nTo investigate this, I decided to manually identify and trace the reads assigned to human-infecting viruses in both pipelines, to see whether that tells us anything about the likely cause of the differences. To do this, I selected the three samples from the dataset that show the largest difference in the number of assigned human-infecting virus reads (henceforth HV reads):\n\nD23-13405-1 (14 HV reads assigned by the public pipeline vs 8 by the new pipeline)\nD23-13405-2 (12 vs 8)\nD23-13406-2 (17 vs 9)\n\nThe way the public pipeline does deduplication (after Kraken2 analysis, during dashboard generation) makes it difficult to directly extract the final list of HV read IDs for that pipeline, but it is quite easy to do this for the list of reads immediately prior to deduplication. Doing this for the samples specified above returned the following results:\n\nD23-13405-1: 14 read IDs (no reads lost during deduplication)\nD23-13405-2: 14 read IDs (2 reads to Simian Agent 10 lost during deduplication)\nD23-13406-2: 17 read IDs (no reads lost during deduplication)\n\nIn the first and third of these cases, I could thus directly compare the Kraken output from the two pipelines to investigate the source of the disagreements. In the second case, it wasn’t immediately possible to identify which two out of the three Simian Agent 10 reads were considered duplicates in the dashboard, but I was at least able to restrict the possibility to those three reads. 
(As we’ll see below, information from the new pipeline also helped narrow this down.)\n\nCodedata_dir <- \"../data/2023-11-01_pr-comp\"\nread_status_path <- file.path(data_dir, \"read-status.csv\")\nread_status <- read_csv(read_status_path, show_col_types = FALSE) %>%\n mutate(status = fct_inorder(status))\ntheme_kit <- theme_base + theme(\n aspect.ratio = 1/2,\n axis.text.x = element_text(hjust = 1, angle = 45),\n axis.title.x = element_blank()\n)\ng_status <- ggplot(read_status, aes(x=sample, y=n_reads, fill=status)) +\n geom_col(position = \"dodge\") +\n scale_fill_brewer(palette = \"Set1\", name = \"Status\") +\n scale_y_continuous(name = \"# Putative HV read pairs\", limits = c(0,10),\n breaks = seq(0,10,2), expand = c(0,0)) +\n theme_base + theme_kit\ng_status\n\n\n\n\nD23-13405-1\n\nIn this case, 8 HV reads were assigned by the new pipeline, and 14 reads were assigned by the the public pipeline.\nAll 8 of the reads assigned by the new pipeline were among the 14 HV reads assigned by the public pipeline.\n\nAmong the 6 remaining reads that were assigned by only the public pipeline:\n\n3 appeared in the list of HV hits for the new pipeline; that is to say, for these four reads, the new pipeline was able to identify hits to human-infecting viruses but not make an overall assignment.\n2 were included in the Kraken2 output for the new pipeline, but were not found to contain any HV k-mers and so were excluded from the list of hits. This is a more extreme case of the above situation: in this case, more stringent trimming by FASTQ has removed the putative HV k-mers.\n1 was found among the reads that FASTP discarded due to not passing quality filters; in this case, read 2 was discarded due to low quality, and read 1 was then discarded due to lacking a read pair.\n\n\n\nIn all 6 cases, therefore, the difference in assignment between the two pipelines was found to be due to difference 1, i.e. the use of different preprocessing tools. In general, FASTP appears to be more stringent than AdapterRemoval in a way that resulted in fewer HV read assignments. But are these reads false positives for the old pipeline, or false negatives for the new one?\n\nTo address this, I extracted the raw sequences of the six read pairs, manually removed adapters, and manually analyzed them with NCBI BLAST (blastn vs viral nt, then vs full nt).\nIn all six cases, no match was found by blastn between the read sequence and the human-infecting virus putatively assigned by Kraken2 in the public pipeline. 
In four out of six cases, the best match for the read was to a bacterial sequence; in one case, the best match for the forward read was bacterial while the reverse matched a phage; and in one case no significant match was found for either read.\nThese results suggest to me that FASTP’s more stringent trimming and filtering is ensuring true negatives, rather than causing false ones.\n\n\nD23-13405-2\n\nIn this case, 8 HV reads were assigned by the new pipeline, and 14 by the public pipeline excluding deduplication; 2 of the latter were removed during deduplication for the dashboard.\n\nAll 8 of the reads assigned by the new pipeline were among the 14 pre-deduplication HV reads assigned by the public pipeline; however, two of these were among the group of three reads (all to Simian Agent 10) that were collapsed into one by deduplication in the public pipeline.\n\nThis indicates that one of the reads present in the new pipeline results was removed by deduplication in the public pipeline results – that is to say, the two pipelines disagree slightly more than the raw HV read counts would suggest.\nOne read was removed as a duplicate by both pipelines; I discarded this one from consideration, bringing the number of read IDs for consideration down to 13.\n\n\n\nAmong the remaining 5 HV reads from the public pipeline:\n\n4 appeared in the list of HV hits for the new pipeline; that is to say, for these four reads, the new pipeline was able to identify hits to human-infecting viruses but not make an overall assignment.\n1 was discarded by FASTP during quality filtering: read 2 was discarded due to low quality, and read 1 was then discarded due to lacking a read pair.\n\n\n\nAs before, I extracted the raw reads corresponding to these 5 disagreements from the raw sequencing data, removed adapters manually, and ran the resulting sequences through NCBI BLAST (blastn vs viral nt, then vs full nt). 4 out of 5 read pairs showed no match to the assigned virus, while one showed a good, though partial, match.\n\nIn the case of the match, it looks like in both the public pipeline and the new pipeline, Kraken2 only identified a single k-mer from the origin virus (Sandfly fever Turkey virus); however, the public pipeline also identified 3 k-mers assigned to taxid 1 (root), while these were trimmed away in the new pipeline, and this was sufficient to prevent the overall assignment.\nI’m not sure how to feel about this case. Ex ante, making an assignment on the basis of a single unique k-mer and three uninformative k-mers feels quite dubious, but ex post it does appear that at least part of the read was correctly assigned.\n\n\n\nInvestigating the pair of reads that were flagged as duplicates by the public pipeline but not the new pipeline, I found that they were a perfect match, but with the sequence of the forward read in one pair matching the reverse read in the second pair.\n\nThis was surprising to me, as it suggests that clumpify won’t remove duplicates if one is in reverse-complement to the other, which seems like a major oversight. I checked this, and indeed Clumpify fails to detect the duplicate in the current state but succeeds if I swap the forward and reverse reads for one of the read pairs.\nThis makes me less excited about using Clumpify for deduplication. UPDATE: I found an option for Clumpify that seems to solve this problem, at least in this case. 
See Conclusions for more.\nIt’s worth noting that, due to the way FASTQC analyses files, it will also fail to detect RC duplicates.\n\n\nD23-13406-2\n\nIn this case, 9 HV reads were assigned by the new pipeline, and 17 by the public pipeline.\n\nAs before, I confirmed that all 9 HV reads assigned by the new pipeline were also assigned by the public pipeline, leaving 8 disagreements. Of these:\n\n6 appeared in the list of reads for which the new pipeline found HV hits, but wasn’t able to make an overall assignment; in all six of these cases, the HV hits were to the taxon assigned by the public pipeline.\n1 was included in the new pipeline’s Kraken output but had no viral hits.\nThe final clash is the most interesting, as this one was excluded not by FASTP or Kraken, but by bbduk: it was identified as ribosomal and discarded prior to deduplication.\n\n\n\nAs before, I extracted the raw reads corresponding to these 8 disagreements from the raw sequencing data, removed adapters manually, and ran the resulting sequences through NCBI BLAST (blastn vs viral nt, then vs nt).\n\nFor 7 of the 8 disagreements – all those arising from FASTP preprocessing – BLAST found no match between the read sequence and the taxon assigned by the public pipeline. As before, this suggests to be that FASTP is doing a good job preventing false positives through better preprocessing.\nThe BLAST result for the final disagreement is again the most interesting. The forward read indeed showed strong alignment to bacterial rRNA sequences, as found by bbduk. The reverse read, however, showed good alignment to Influenza A virus, which was the virus assigned by Kraken2 in the public pipeline. In this case, it appears we have a chimeric read, which both pipelines are in some sense processing correctly: bbduk is correctly identifying the forward read as ribosomal, and Kraken2 is correctly identifying the reverse read as viral. It’s not a priori obvious to me how we should handle these cases.\n\n\nSanity checking\n\nFinally, I wanted to check that my findings here weren’t just the result of some issue with how I’m using NCBI BLAST – that is, that BLAST as I’m using it is able to detect true positives rather than just failing to find matches all over the place.\nTo check this, I took the 24 read pairs (8 + 7 + 9) from the three samples above that both pipelines agreed arose from human-infecting viruses, and ran these through BLAST in the same way as the disagreed-upon read-pairs above.\nWhile the results weren’t as unequivocal as I’d hoped, they nevertheless showed a strong difference from those for the clashing read pairs. In total, 16/24 read pairs showed strong matching to the assigned virus, and an additional 2/24 showed a short partial match sufficient to explain the Kraken assignment. The remaining 6/24 failed to match the assigned virus. In total, 75% (18/24) of agreed-upon sequences showed a match to the assigned virus, compared to only 10% (2/20) for the clashing sequences.\nThis cautiously updates me further toward believing that the FASTP component of the new pipeline is mostly doing well at correctly rejecting true negatives, at least compared to the current public pipeline. 
That said, it also suggests that either (a) BLAST is generating a significant number of false negatives, or (b) even the new pipeline is generating a significant number of false positives.\n\n\nCoderead_hit_path <- file.path(data_dir, \"read-hit-count.csv\")\nread_hit <- read_csv(read_hit_path, show_col_types = FALSE) %>%\n mutate(status = fct_inorder(status),\n p_hit = n_hit/n_reads)\ng_hit <- ggplot(read_hit, aes(x=sample, y=p_hit, fill=status)) +\n geom_col(position = \"dodge\") +\n scale_fill_brewer(palette = \"Set3\", name = \"Status\") +\n scale_y_continuous(name = \"% reads matching HV assignment\", limits = c(0,1),\n breaks = seq(0,1,0.2), expand = c(0,0), labels = function(y) y*100) +\n theme_base + theme_kit\ng_hit\n\n\n\n\nConclusions\n\nMy updates from this exercise are different for different parts of the pipeline.\n\nFor difference 1 (preprocessing tool) I mostly found that, in cases where the new pipeline rejects a read due to preprocessing and the public pipeline does not, that read appears to be a true negative. I’m not super confident about this, since I don’t 100% trust BLAST to not be producing false negatives here, but overall I think the evidence points to FASTP doing a better job here than AdapterRemoval.\n\nI’d ideally like to shift to a version of the pipeline where we’re not relying on Kraken to make assignment decisions, and are instead running all Kraken hits through an alignment-based validation pipeline to determine final assignments. I’d be interested in seeing how these results look after making that change.\n\n\nFor difference 2 (ribodepletion) results here are equivocal. The single read pair I inspected that got removed during ribodepletion appears to include both a true ribosomal read and a true viral read. I think some internal discussion is needed to decide how to handle these cases.\n\nFor difference 3 (deduplication) I initially updated negatively about Clumpify, which appeared to be unable to handle duplicates where the forward and reverse reads in a pair are switched (this is also a case where FASTQC will be unable to detect these duplicates).\n\nHowever, I found an option for Clumpify which addresses this issue, at least in this case. Specifically, one can configure Clumpify to unpair reads, perform deduplication on the forward and reverse reads all together, and then restore pairing. This successfully removes this class of duplicates.\nI’m a little worried that this approach might sometimes cause complete loss of all duplicate reads (rather than all-but-one-pair) when the best-quality duplicate differs between forward and reverse reads. I tried this out by artificially modifying the quality scores for the duplicate reads from D23-13405-2, and this doesn’t seem to be the case at least in this instance: when quality across read pairs was concordant, I was able to control which read pair survived as expected, and when it was discordant, one read pair survived anyway. Still, this remains a niggling doubt."
+ },
+ {
+ "objectID": "notebooks/2023-11-02_project-runway-dna-deduplication.html",
+ "href": "notebooks/2023-11-02_project-runway-dna-deduplication.html",
+ "title": "Estimating the effect of read depth on duplication rate for Project Runway DNA data",
+ "section": "",
+ "text": "One relevant question for both Project Runway and other NAO sequencing is: what is the maximum read depth at which we can sequence a given sample while retaining an acceptable level of sequence duplication?\nAs discussed in a previous entry, duplicate reads can arise in sequencing data from a variety of processes, including true biological duplicates present in the raw sample; processing duplicates arising from amplification and other processes during sample and library prep; and sequencing duplicates arising from various processes on the actual flow cell.\nAs we sequence more deeply, we expect the fraction of biological and processing duplicates (but not, I think, sequencing duplicates) in our read data to increase. In the former case, this is because we are capturing a larger fraction of all the input molecules in our sample; in the latter, because we are sequencing copies of the same sequence over and over again. Intuitively, I expect the increase in processing duplicates to swamp that in biological duplicates at high read depth, at least for library prep protocols that involve amplification1.\nOne simple approach to investigate the overall effect of read depth on duplication levels in the sample is rarefaction: downsampling a library to different numbers of reads and seeing how the duplication rate changes as a function of read count. In this notebook entry, I apply this approach to sequencing data from the Project Runway initial DNA dataset, to see how duplication rate behaves in this case."
+ },
+ {
+ "objectID": "notebooks/2023-11-02_project-runway-dna-deduplication.html#footnotes",
+ "href": "notebooks/2023-11-02_project-runway-dna-deduplication.html#footnotes",
+ "title": "Estimating the effect of read depth on duplication rate for Project Runway DNA data",
+ "section": "Footnotes",
+ "text": "Footnotes\n\nIt might be worth explicitly modeling the difference in behavior between different kinds of duplicates as sequencing depth increases, to see if these intuitions are borne out.↩︎\nProbably the biggest two improvements that could be made to this model in future are (i) introducing biological duplicates, and (ii) introducing sequence-specific bias in PCR amplification and cluster formation.↩︎\nI’d appreciate it if someone else on the team can check my math here.↩︎"
}
]
\ No newline at end of file
diff --git a/notebooks/2023-11-02_project-runway-dna-deduplication.qmd b/notebooks/2023-11-02_project-runway-dna-deduplication.qmd
new file mode 100644
index 0000000..7baa9de
--- /dev/null
+++ b/notebooks/2023-11-02_project-runway-dna-deduplication.qmd
@@ -0,0 +1,235 @@
+---
+title: "Estimating the effect of read depth on duplication rate for Project Runway DNA data"
+subtitle: "How deep can we go?"
+author: "Will Bradshaw"
+date: 2023-11-02
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ code-link: true
+ df-print: paged
+editor: visual
+title-block-banner: black
+---
+
+```{r}
+#| label: load-packages
+#| include: false
+library(tidyverse)
+library(cowplot)
+library(patchwork)
+library(fastqcr)
+source("../scripts/aux_plot-theme.R")
+```
+
+One relevant question for both Project Runway and other NAO sequencing is: what is the maximum read depth at which we can sequence a given sample while retaining an acceptable level of sequence duplication?
+
+As discussed in a previous entry, duplicate reads can arise in sequencing data from a variety of processes, including true biological duplicates present in the raw sample; processing duplicates arising from amplification and other processes during sample and library prep; and sequencing duplicates arising from various processes on the actual flow cell.
+
+As we sequence more deeply, we expect the fraction of biological and processing duplicates (but not, I think, sequencing duplicates) in our read data to increase. In the former case, this is because we are capturing a larger fraction of all the input molecules in our sample; in the latter, because we are sequencing copies of the same sequence over and over again. Intuitively, I expect the increase in processing duplicates to swamp that in biological duplicates at high read depth, at least for library prep protocols that involve amplification[^1].
+
+[^1]: It might be worth explicitly modeling the difference in behavior between different kinds of duplicates as sequencing depth increases, to see if these intuitions are borne out.
+
+One simple approach to investigate the overall effect of read depth on duplication levels in the sample is rarefaction: downsampling a library to different numbers of reads and seeing how the duplication rate changes as a function of read count. In this notebook entry, I apply this approach to sequencing data from the Project Runway initial DNA dataset, to see how duplication rate behaves in this case.
+
+# Methods
+
+To analyze the effect of read depth on duplication rates in this data, I first concatenated the raw reads from the two sequencing replicates of each sample together. This allowed the analysis to detect duplicates across replicates, that would have been missed by analyzing them separately.
+
+Next, I performed preprocessing to remove adapter sequences and low-quality bases that might interfere with duplicate detection. I didn't collapse read pairs together or discard read pairs with overall low quality.
+
+```
+for p in $(cat prefixes.txt); do echo $p; fastp -i raw/${p}_1.fastq.gz -I raw/${p}_2.fastq.gz -o preproc/${p}_fastp_1.fastq.gz -O preproc/${p}_fastp_2.fastq.gz --failed_out preproc/${p}_fastp_failed.fastq.gz --cut_tail --correction --detect_adapter_for_pe --adapter_fasta adapters.fa --trim_poly_x --thread 16 -Q -L; done
+```
+
+I then took each pair of preprocessed FASTQ files and downsampled them to specified numbers of reads, from 10,000 to 100,000,000 in units of 1 OOM, using `seqtk sample`. I performed downsampling 3 times for each read count, with the intent of calculating duplication rate separately for each replicate and taking the average.
+
+Finally, I ran deduplication with Clumpify on each downsampled pair of read files, as well as the raw preprocessed files, and recorded the fraction of reads discarded in each case. I performed this twice, using different Clumpify settings each time:
+
+1. First, I attempted to perform deduplication in a maximally comprehensive way, using Clumpify's `unpair repair` options to identify and remove duplicates in opposite orientation across read pairs (as discussed [here](https://data.securebio.org/wills-public-notebook/notebooks/2023-11-02_project-runway-comparison.html#conclusions)). This configuration finds more duplicates, but (a) might overestimate duplicates in cases where only one read in a pair matches another read, and (b) causes memory-related errors for large subsample sizes.
+
+ ```
+ clumpify.sh in= in2= out= out2= dedupe containment unpair repair
+ ```
+
+2. Second, due to the aforementioned issues with approach 1, I repeated deduplication without Clumpify's `unpair repair` options enabled, providing a lower-bound estimate of duplication levels which should be more consistent with estimates provided by, for example, FASTQC.
+
+ ```
+ clumpify.sh in= in2= out= out2= dedupe containment
+ ```
+
+# Results
+
+Plotting the fraction of duplicate reads (with or without Clumpify's `unpair repair` options) gives us the following result:
+
+```{r}
+#| warning: false
+# Import data
+data_dir <- "../data/2023-11-06_pr-dedup/"
+n_dup_path <- file.path(data_dir, "n_dup.csv")
+n_dup <- read_csv(n_dup_path, show_col_types = FALSE) %>%
+ mutate(n_dup = read_pairs_in - read_pairs_out, p_dup = n_dup/read_pairs_in,
+ o_dup = p_dup/(1-p_dup))
+
+# Reshape data
+n_dup_flat <- n_dup %>% group_by(sample, unpair_repair, read_pairs_in) %>%
+ summarize(p_dup_mean = mean(p_dup), p_dup_sd = sd(p_dup), p_dup_min = max(0,p_dup_mean-p_dup_sd), p_dup_max = min(1, p_dup_mean+p_dup_sd),
+ o_dup_mean = mean(o_dup), o_dup_sd = sd(o_dup), o_dup_min = max(0,o_dup_mean-o_dup_sd), o_dup_max = min(1, o_dup_mean+o_dup_sd), .groups = "drop")
+
+# Plot data
+g_dup_flat_base <- ggplot(n_dup_flat, aes(x=read_pairs_in, y=p_dup_mean, color = sample)) +
+ geom_line() + geom_errorbar(aes(ymin=p_dup_min, ymax=p_dup_max)) + geom_point(shape = 16) +
+ scale_x_log10(name = "# Input Read Pairs") +
+ facet_grid(.~unpair_repair, labeller = "label_both") +
+ scale_color_brewer(palette = "Dark2") +
+ theme_base + theme(aspect.ratio = 1)
+g_dup_flat_lin <- g_dup_flat_base +
+ scale_y_continuous(name = "Fraction of duplicate reads", limits = c(0,0.41), breaks = seq(0,1,0.1), expand = c(0,0))
+g_dup_flat_log <- g_dup_flat_base + scale_y_log10(name = "Fraction of duplicate reads")
+
+# Show plats
+g_dup_flat_lin
+g_dup_flat_log
+```
+
+We can see that the fraction of duplicates reaches quite high levels as we approach the full read count, especially when `unpair repair` is enabled. It also looks like the gradient of increase is quite high on the linear-log plot, which would suggest that further OOM increases in read depth might result in quite large increases in the fraction of duplicates. It's also apparent that, for whatever reason, D23-13406 has substantially fewer duplicates at any given read depth than the other two samples.
+
+However, further interpretation of these results, including extrapolation to greater read depths, is made difficult by the lack of a theoretical model for what we expect to see. It's also not clear what mode of visualization (Linear-log? Log-linear? Log-log? Fraction of duplicates expressed as probabilities or odds? Etc) is most meaningful for interpretation.
+
+To start resolving some of these roadblocks, I spent some time working on a very simple model of read duplication, to see what it might tell us about the expected pattern of duplicates as a function of read depth.
+
+# A very very simple model of read duplication
+
+- Imagine a sample containing $N$ distinct molecules, which are uniformly amplified up to $M = N \times 2^C$ molecules by a perfectly unbiased $C$-cycle PCR reaction. Adapters are ligated and the resulting library is washed across the flow cell to generate $X$ total clusters (again, without bias[^2]). These clusters are then sequenced by a process that generates optical duplicates at some rate $O$, for a total expectation of $R=X\times(1+O)$ reads.
+
+- Each cluster is selected from the $M$ molecules in the library, without replacement. When $C$ is large and amplification is uniform across molecules, this is well-approximated by selecting from the $N$ input molecules with replacement. Under these conditions, the number of clusters generated from input molecule $i$ is approximately $X_i \sim \mathrm{B}(X,N^{-1})$.
+
+- The number of optical duplicates generated from a given input molecule is then approximately $P_i \sim \mathrm{B}(X_i,O)$, and the total number of reads corresponding to a given input molecule is thus $R_i = X_i + P_i$. The number of duplicates generated from that molecule is then $D_i = \max(0,R_i - 1)$, and the overall fraction of duplicates is $F=\frac{D}{R}=\frac{\sum_iD_i}{\sum_iR_i}$ .
+
+- The expected fraction of duplicates under this model can be estimated analytically as follows[^3]:
+
+ - $E(R_i)=\sum_{r=1}^{2X} r \cdot P(R_i = r) = \sum_{r=1}^{2X}r\cdot\left[\sum_{k=0}^{r}P(X_i = k)\cdot{}P(P_i=r-k|X_i=k)\right]$
+
+ - $E(D_i) = \sum_{r=1}^{2X} (r-1) \cdot P(R_i = r)$
+
+ - $E(D) = E\left(\sum_i D_i\right) = N \cdot E(D_i)$
+
+ - $R = \sum_i R_i = \sum_r r \cdot \mathbb{N}(R_i = r)$
+
+ - When $N$ is large, $\mathbb{N}(R_i = r) \approx N \cdot P(R_i = r)$, and so $R \approx \sum_r r \cdot N \cdot P(R_i = r) = N \cdot E(R_i)$
+
+ - Hence $E\left(\frac{1}{R}\right) \approx \frac{1}{N \cdot E(R_i)}$
+
+    - Thus $E(F) = E\left(\frac{D}{R}\right) \approx E(D) \cdot E\left(\frac{1}{R}\right) \approx \frac{N \cdot E(D_i)}{N \cdot E(R_i)} = \frac{E(D_i)}{E(R_i)}$, where the first approximation treats $D$ and $\frac{1}{R}$ as uncorrelated, which should hold well when $N$ is large and $R$ concentrates around its expectation
+
+[^2]: Probably the biggest two improvements that could be made to this model in future are (i) introducing biological duplicates, and (ii) introducing sequence-specific bias in PCR amplification and cluster formation.
+
+[^3]: I'd appreciate it if someone else on the team can check my math here.
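+
+As a cross-check on the derivation above, the model is also simple enough to simulate directly. Here's a minimal Monte Carlo sketch (the `sim_duplicates` helper is my own, purely for illustration); like the derivation, it uses the with-replacement approximation for cluster selection, and its output should roughly agree with the analytic estimates computed below:
+
+```{r}
+# Sketch: direct simulation of the duplication model
+sim_duplicates <- function(n_molecules, total_clusters, p_opt_dup, n_reps = 10){
+  sapply(seq_len(n_reps), function(i){
+    # Clusters per input molecule: assign each cluster to a uniformly random
+    # molecule (with-replacement approximation), then count
+    clusters_per_molecule <- tabulate(sample.int(n_molecules, total_clusters, replace = TRUE))
+    clusters_per_molecule <- clusters_per_molecule[clusters_per_molecule > 0]
+    # Optical duplicates per molecule: P_i ~ B(X_i, O)
+    opt_dups <- rbinom(length(clusters_per_molecule), clusters_per_molecule, p_opt_dup)
+    reads_per_molecule <- clusters_per_molecule + opt_dups
+    # F = sum(R_i - 1) / sum(R_i), over molecules with at least one read
+    sum(reads_per_molecule - 1) / sum(reads_per_molecule)
+  })
+}
+mean(sim_duplicates(1e6, 1e6, 1e-6))
+```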
+
+Using computational implementations of these formulae, we can investigate how the fraction of duplicates varies with the number of clusters for different parameter values.
+
+```{r}
+# Define auxiliary functions
+# P(X_i = k): clusters generated from input molecule i, X_i ~ B(X, 1/N)
+log_p_clusters <- function(n_clusters, total_clusters, n_molecules) dbinom(n_clusters, total_clusters, 1/n_molecules, log=TRUE)
+# P(P_i = d | X_i = k): optical duplicates of those clusters, P_i ~ B(X_i, O)
+log_p_opt_dups <- function(n_opt_dups, n_clusters, p_opt_dup) dbinom(n_opt_dups, n_clusters, p_opt_dup, log=TRUE)
+# P(R_i = r): marginalise over the ways of splitting r reads into clusters
+# plus optical duplicates (at most half the reads can be optical duplicates)
+p_reads <- function(n_reads, total_clusters, n_molecules, p_opt_dup){
+  n_clusters <- n_reads - (0:(n_reads/2))
+  log_p_n_clusters <- log_p_clusters(n_clusters, total_clusters, n_molecules)
+  log_p_n_opt_dups <- log_p_opt_dups(n_reads - n_clusters, n_clusters, p_opt_dup)
+  log_p_reads_clusters <- log_p_n_clusters + log_p_n_opt_dups
+  return(sum(exp(log_p_reads_clusters)))
+}
+
+# Define main function
+exp_duplicates <- function(n_molecules, total_clusters, p_opt_dup, initial_vector_length = 1e8){
+ pr <- purrr::partial(p_reads, total_clusters = total_clusters, n_molecules = n_molecules, p_opt_dup = p_opt_dup)
+  # Calculate P(R_i = r) for increasing r, stopping once the support of the
+  # distribution has been passed (i.e. at the first zero probability after a
+  # nonzero one; leading zeros can occur through underflow)
+  n_reads <- 1:initial_vector_length
+  p_n_reads <- numeric(initial_vector_length)
+  break_zero <- FALSE # TRUE once a nonzero probability has been seen
+  for (n in seq_along(n_reads)){
+    p <- pr(n_reads[n])
+    if (p == 0 && break_zero){
+      # Truncate both vectors to the observed support and stop
+      n_reads <- n_reads[1:n]
+      p_n_reads <- p_n_reads[1:n]
+      break
+    } else if (p != 0 && !break_zero){
+      break_zero <- TRUE
+    }
+    p_n_reads[n] <- p
+  }
+ # Calculate fraction of duplicates
+ n_duplicates <- n_reads - 1
+ p_duplicates <- sum(n_duplicates * p_n_reads)/sum(n_reads * p_n_reads)
+ return(p_duplicates)
+}
+
+# Helper functions, to avoid repeating the same table and plot code for each
+# parameter set
+model_dup_tab <- function(n_molecules, p_opt_dup, log10_total_clusters){
+  tibble(n_molecules = n_molecules,
+         p_opt_dup = p_opt_dup,
+         total_clusters = round(10^log10_total_clusters)) %>%
+    mutate(p_duplicates = sapply(total_clusters,
+                                 function(x) exp_duplicates(n_molecules[1], x, p_opt_dup[1])))
+}
+plot_model_dup <- function(tab){
+  ggplot(tab, aes(x=total_clusters, y=p_duplicates)) +
+    geom_vline(aes(xintercept = n_molecules), linetype = "dashed", colour = "red") +
+    geom_line() + geom_point(shape = 16) +
+    scale_x_log10(name = "# Total Clusters") +
+    scale_y_continuous(name = "Fraction of duplicate reads", limits = c(0,1), breaks = seq(0,1,0.2), expand = c(0,0)) +
+    labs(title = paste0("N = ", tab$n_molecules[1], ", O = ", tab$p_opt_dup[1])) +
+    theme_base + theme(aspect.ratio = 1)
+}
+
+# Parameter set 1: N = 1e6, O = 1e-6
+g_1 <- plot_model_dup(model_dup_tab(1e6, 1e-6, seq(4,8,0.5)))
+
+# Parameter set 2: N = 1e9, O = 1e-6
+g_2 <- plot_model_dup(model_dup_tab(1e9, 1e-6, seq(6,12,0.5)))
+
+# Parameter set 3: N = 1e12, O = 1e-6
+g_3 <- plot_model_dup(model_dup_tab(1e12, 1e-6, seq(9,15,0.5)))
+g_1 + g_2 + g_3
+```
+
+I don't want to take these results too seriously, since they're based on an extremely simple model, but there are some qualitative takeaways that I found helpful to keep in mind when looking at the real data. In particular, the general pattern of the model results is that **the fraction of duplicate reads follows a sigmoidal pattern on a linear-log plot**:
+
+- When $X \ll N$, the fraction of duplicates $F$ is close to zero.
+
+- As $X$ approaches $N$, the fraction of duplicates begins increasing, first slowly and then (after $F$ reaches about 0.1) rapidly, before leveling off after $F$ exceeds about 0.9.
+
+- When $X \gg N$, $F \approx 1$.
+
+At least under the assumptions used here, once the fraction of duplicates goes above 15% or so, further OOM increases in the number of clusters (e.g. by buying more or larger flow cells) will lead to a dramatic increase in the fraction of duplicate reads.
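+
+One way to make this threshold behaviour concrete: if we neglect optical duplicates ($O \approx 0$), each $R_i$ is approximately Poisson-distributed with rate $\lambda = X/N$, so $E(D_i) = \lambda - (1 - e^{-\lambda})$ and $E(R_i) = \lambda$, giving the closed form $E(F) \approx 1 - \frac{1 - e^{-\lambda}}{\lambda}$. A quick sketch (the `f_poisson` helper is illustrative, not part of the pipeline) reproduces the thresholds above:
+
+```{r}
+# Sketch: closed-form duplicate fraction under the Poisson approximation
+f_poisson <- function(lambda) 1 - (1 - exp(-lambda)) / lambda
+round(f_poisson(c(0.2, 0.33, 1, 10)), 2) # expect ~0.09, 0.15, 0.37, 0.90
+```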
+
+# Applying model takeaways
+
+Returning to the results from the BMC data with these modeling results in mind:
+
+```{r}
+g_dup_flat_lin
+```
+
+Assuming that the real data will follow a sigmoidal pattern roughly resembling that from the model, we see that all samples (with the possible exception of D23-13406 when `unpair repair` is disabled) are in the "danger zone", such that further OOM increases in read depth will likely lead to a dramatic increase in the fraction of duplicate reads. As such, **it probably isn't worth paying for a further OOM increase (or even half-OOM increase) in read depth for these samples**.
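+
+Taking the Poisson heuristic from the previous section at face value (a rough extrapolation, not a calibrated estimate), we can also invert an observed duplicate fraction to an effective $\lambda = X/N$ and ask directly what another OOM of depth would do:
+
+```{r}
+# Sketch: extrapolate duplication one OOM deeper via the illustrative
+# f_poisson() helper defined earlier
+lambda_from_f <- function(f) uniroot(function(l) f_poisson(l) - f, c(1e-6, 1e6))$root
+f_after_10x <- function(f_now) f_poisson(10 * lambda_from_f(f_now))
+f_after_10x(0.2) # a sample at ~20% duplicates now -> ~0.79 after a 10x increase
+```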
+
+# Conclusions
+
+- Overall, running this analysis was a frustrating experience, due to difficulties finding a configuration of Clumpify (or any other deduplication tool I know of) that (i) I trust to remove duplicates appropriately without predictably over- or under-counting, and (ii) runs well on large FASTQ files. Ultimately, I think we should treat the true level of duplicates as somewhere in between that measured by method 1 (Clumpify with `unpair repair` enabled) and method 2 (Clumpify with `unpair repair` disabled).
+
+ - For the actual virus-detection pipeline, my current best bet is that we should run method 2 on the full dataset, then method 1 on the specific viral hits identified by Kraken2 (or the alignment tool used for validation, once that's been implemented). However, we ultimately might want to implement our own tool for dealing with this problem.
+
+- Nevertheless, I was able to generate rarefaction curves for duplication rate as a function of the number of reads using both methods. The results, in combination with a very simple model I generated to aid interpretation, suggest that we're probably at about the highest OOM read depth we should aim for in terms of getting useful information from these samples (with the possible exception of D23-13406).