Adapter removal test notebook

naobservatory · Oct 12, 2023 · 5a6481d
1 parent aea1826
commit 5a6481d

Showing 20 changed files with 1,230 additions and 2 deletions.
diff --git a/docs/img/2023-10-12_cc-fastqc-adapters-ar.png b/docs/img/2023-10-12_cc-fastqc-adapters-ar.png
diff --git a/docs/img/2023-10-12_cc-fastqc-adapters-fastp.png b/docs/img/2023-10-12_cc-fastqc-adapters-fastp.png
diff --git a/docs/img/2023-10-12_cc-fastqc-adapters-raw.png b/docs/img/2023-10-12_cc-fastqc-adapters-raw.png
diff --git a/docs/img/2023-10-12_rothman-fastqc-adapters-ar.png b/docs/img/2023-10-12_rothman-fastqc-adapters-ar.png
diff --git a/docs/img/2023-10-12_rothman-fastqc-adapters-fastp.png b/docs/img/2023-10-12_rothman-fastqc-adapters-fastp.png
diff --git a/docs/img/2023-10-12_rothman-fastqc-adapters-raw.png b/docs/img/2023-10-12_rothman-fastqc-adapters-raw.png
diff --git a/docs/index.html b/docs/index.html
@@ -129,7 +129,29 @@
 
 <div class="quarto-listing quarto-listing-container-default" id="listing-listing">
 <div class="list quarto-listing-default">
-<div class="quarto-post image-right" data-index="0" data-listing-date-sort="1695268800000" data-listing-file-modified-sort="1695331351195" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="11">
+<div class="quarto-post image-right" data-index="0" data-listing-date-sort="1697860800000" data-listing-file-modified-sort="1697146223230" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="11">
+<div class="thumbnail">
+<p><a href="./notebooks/2023-10-12_fastp-vs-adapterremoval.html"> <p class="card-img-top"><img src="notebooks/2023-10-12_fastp-vs-adapterremoval_files/figure-html/unnamed-chunk-2-1.png"  class="thumbnail-image card-img"/></p> </a></p>
+</div>
+<div class="body">
+<a href="./notebooks/2023-10-12_fastp-vs-adapterremoval.html">
+<h3 class="no-anchor listing-title">
+Comparing FASTP and AdapterRemoval for MGS pre-processing
+</h3>
+<div class="listing-subtitle">
+Two tools – how do they perform?
+</div>
+</a>
+</div>
+<div class="metadata">
+<a href="./notebooks/2023-10-12_fastp-vs-adapterremoval.html">
+<div class="listing-date">
+Oct 21, 2023
+</div>
+</a>
+</div>
+</div>
+<div class="quarto-post image-right" data-index="1" data-listing-date-sort="1695268800000" data-listing-file-modified-sort="1695331351195" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="11">
 <div class="thumbnail">
 <p><a href="./notebooks/2023-09-12_settled-solids-extraction-test.html"> <p class="card-img-top"><img src="notebooks/2023-09-12_settled-solids-extraction-test_files/figure-html/plot-concentrations-1.png"  class="thumbnail-image card-img"/></p> </a></p>
 </div>

diff --git a/docs/listings.json b/docs/listings.json
@@ -2,6 +2,7 @@
   {
     "listing": "/index.html",
     "items": [
+      "/notebooks/2023-10-12_fastp-vs-adapterremoval.html",
       "/notebooks/2023-09-12_settled-solids-extraction-test.html"
     ]
   }

diff --git a/docs/notebooks/2023-10-12_fastp-vs-adapterremoval.html b/docs/notebooks/2023-10-12_fastp-vs-adapterremoval.html
diff --git a/...ooks/2023-10-12_fastp-vs-adapterremoval_files/figure-html/unnamed-chunk-2-1.png b/...ooks/2023-10-12_fastp-vs-adapterremoval_files/figure-html/unnamed-chunk-2-1.png
diff --git a/...ooks/2023-10-12_fastp-vs-adapterremoval_files/figure-html/unnamed-chunk-3-1.png b/...ooks/2023-10-12_fastp-vs-adapterremoval_files/figure-html/unnamed-chunk-3-1.png
diff --git a/...ooks/2023-10-12_fastp-vs-adapterremoval_files/figure-html/unnamed-chunk-4-1.png b/...ooks/2023-10-12_fastp-vs-adapterremoval_files/figure-html/unnamed-chunk-4-1.png
diff --git a/docs/search.json b/docs/search.json
@@ -32,6 +32,13 @@
     "href": "index.html",
     "title": "Will's Public NAO Notebook",
     "section": "",
-    "text": "Extraction experiment 2: high-level results & interpretation\n\n\nComparing RNA yields and quality across extraction kits for settled solids\n\n\n\n\n\n\nSep 21, 2023\n\n\n\n\n\n\nNo matching items"
+    "text": "Comparing FASTP and AdapterRemoval for MGS pre-processing\n\n\nTwo tools – how do they perform?\n\n\n\n\n\n\nOct 21, 2023\n\n\n\n\n\n\n  \n\n\n\n\nExtraction experiment 2: high-level results & interpretation\n\n\nComparing RNA yields and quality across extraction kits for settled solids\n\n\n\n\n\n\nSep 21, 2023\n\n\n\n\n\n\nNo matching items"
+  },
+  {
+    "objectID": "notebooks/2023-10-12_fastp-vs-adapterremoval.html",
+    "href": "notebooks/2023-10-12_fastp-vs-adapterremoval.html",
+    "title": "Comparing FASTP and AdapterRemoval for MGS pre-processing",
+    "section": "",
+    "text": "See also:\n\n[…]\n\nThe first major step in our current MGS pipeline uses AdapterRemoval to automatically identify and remove sequencing adapters, as well as trimming low-quality bases and collapsing overlapping read pairs (it can also discard low-quality reads entirely, but our current pipeline doesn’t use this). An alternative tool that can do all of this, as well as read deduplication, is fastp. I asked the pipeline’s current primary maintainer if there was a good reason we were using one tool instead of the other, and he said that there wasn’t. So I decided to do a shallow investigation of their relative behavior on some example MGS datasets to see how they compare.\nThe data\nTo carry out this test, I selected three pairs of raw Illumina FASTQ files, corresponding to one sample each from two different published studies as well as one dataset provided to us by Marc Johnson:\n\n\nStudy\nBioproject\nSample\n\n\n\nRothman et al. (2021)\nPRJNA729801\nSRR19607374\n\n\nCrits-Christoph et al. 
(2021)\nPRJNA661613\nSRR23998357\n\n\nJohnson (2023)\nN/A\nCOMO4\n\n\n\nFor each sample, I generated FASTQC report files for the raw data, then ran FASTP and AdapterRemoval independently on the FASTQ files and tabulated the results.\nThe commands\nFor processing with FASTP, I ran the following command:\nfastp -i &lt;raw-reads-1&gt; -I &lt;raw-reads-2&gt; -o &lt;output-path-1&gt; -O &lt;output-path-2&gt; --failed_out &lt;output-path-failed-reads&gt; --cut_tail --correction\n(I didn’t run deduplication for this test, as AdapterRemoval doesn’t have that functionality.)\nFor processing with AdapterRemoval, I first ran the following command to identify adapters:\nAdapterRemoval --file1 &lt;raw-reads-1&gt; --file2 &lt;raw-reads-2&gt; --identify-adapters --threads 4 &gt; adapter_report.txt\nI then ran the following command to actually carry out pre-processing, using the adapter sequences identified in the previous step (NB the minlength and maxns values are chosen to match the FASTP defaults):\nAdapterRemoval --file1 &lt;raw-reads-1&gt; --file2 &lt;raw-reads-2&gt; --basename &lt;output-prefix&gt; --adapter1 &lt;adapter1&gt; --adapter2 &lt;adapter2&gt; --gzip --trimns --trimqualities --minlength 15 --maxns 5\n1. Rothman et al. (SRR19607374)\nThis sample from Rothman et al. 
contains 11.58M read pairs in the raw FASTQ files.\nFASTP:\n\nRunning FASTP took a total of 39 seconds.\nFASTP detected and trimmed adapters on 3.88M reads (note: not read pairs).\nA total of 133 Mb of sequence was trimmed due to adapter trimming, and 55 Mb due to other trimming processes, for a total of 188 Mb of trimmed sequence.\nA total of 367,938 read pairs were discarded due to failing various filters, leaving 11.21M read pairs remaining.\n\nAdapterRemoval:\n\nRunning AdapterRemoval took a total of 323.9 seconds (a bit under 5.5 minutes).\nAdapterRemoval detected and trimmed adapters on 3.96M reads (note: again, not read pairs).\nA total of 135 Mb of sequence was trimmed across all reads; the information isn’t provided to distinguish trimmed adapter sequences vs other trimming.\nOnly 2,347 read pairs were discarded due to failing various filters, leaving the final read number almost unchanged.\n\n\nCode# Calculate read allocations for Rothman\nstatus = c(\"raw\", \"fastp\", \"AdapterRemoval\")\nbp_passed = c(1748896043+1748896043,1627794563+1627794563,1681118328+1681115431)\nbp_discarded = c(0,54014538,276107+271850)\nbp_trimmed = c(0,bp_passed[1]-bp_passed[2]-bp_discarded[2],bp_passed[1]-bp_passed[3]-bp_discarded[3])\n# Tabulate\ntab_rothman &lt;- tibble(status = status, bp_passed = bp_passed, bp_discarded = bp_discarded, bp_trimmed = bp_trimmed)\ntab_rothman\n\n\n\n  \n\n\nCode# Visualize\ntab_rothman_gathered &lt;- gather(tab_rothman, key = sequence_group, value = bp, -status) |&gt;\n  mutate(sequence_group = sub(\"bp_\", \"\", sequence_group),\n         sequence_group = factor(sequence_group, \n                                 levels = c(\"passed\", \"trimmed\", \"discarded\")),\n         status = factor(status, levels = c(\"raw\", \"fastp\", \"AdapterRemoval\")))\ng_rothman &lt;- ggplot(tab_rothman_gathered, aes(x=status, y=bp, fill = sequence_group)) +\n  geom_col(position = \"stack\") + scale_fill_brewer(palette = \"Dark2\") + \n  theme_base + 
theme(axis.title.x = element_blank())\ng_rothman\n\n\n\n\nFASTQC results:\n\nPrior to adapter removal with either tool, the sequencing reads appear to be of good quality, with a consistent average quality score of 30 across all bases in the forward read and ~29 in the reverse read. FASTP successfully raises the average quality score in the reverse read to 30 through trimming and read filtering, while AdapterRemoval leaves it unchanged.\nFASTQC judges the data to have iffy sequence composition (%A/C/G/T); neither tool affects this much.\nAll reads in the raw data are 151bp long; unsurprisingly, trimming by both tools results in a left tail in the sequencing length distribution that was absent in the raw data.\nAs previously observed, the raw data has very high duplicate levels, with only ~26% of sequences estimated by FASTQC to remain after deduplication. Increasing the comparison window to 100bp (from a default of 50bp) increases this to ~35%. Neither tool has much effect on this number – unsurprisingly, since neither carried out deduplication.\nFinally, adapter removal. Unsurprisingly, the raw data shows substantial adapter content. AdapterRemoval does a good job of removing adapters, resulting in a “pass” grade from FASTQC. Surprisingly, despite trimming adapters from fewer reads, fastp does even better (according to FASTQC) at removing adapters.\n\nThe images below show raw, fastp, and AR adapter content:\n\n\n\n\n\n\n2. Crits-Christoph et al. (SRR23998357)\nThis sample from Crits-Christoph et al. 
contains 48.46M read pairs in the raw FASTQ files.\nFASTP:\n\nRunning FASTP took a total of 99 seconds.\nFASTP detected and trimmed adapters on 13.41M reads (note: not read pairs).\nA total of 270 Mb of sequence was trimmed due to adapter trimming, and 43 Mb due to other trimming processes, for a total of 313 Mb of trimmed sequence.\nA total of 1.99M read pairs were discarded due to failing various filters, leaving 47.47M read pairs remaining.\n\nAdapterRemoval:\n\nRunning AdapterRemoval took a total of 1041.3 seconds (a bit over 17 minutes).\nAdapterRemoval detected and trimmed adapters on 8.22M reads (note: again, not read pairs).\nA total of 93.7 Mb of sequence was trimmed across all reads; the information isn’t provided to distinguish trimmed adapter sequences vs other trimming.\nOnly 32,381 read pairs were discarded due to failing various filters, leaving the final read number (again) almost unchanged.\n\n\nCode# Calculate read allocations for CritsCristoph\nstatus = c(\"raw\", \"fastp\", \"AdapterRemoval\")\nbp_passed_cc = c(3683175308+3683175308,3465517525+3467624847,3634186441+3634143668)\nbp_discarded_cc = c(0,120701257,2057612+2224529)\nbp_trimmed_cc = c(0,bp_passed_cc[1]-bp_passed_cc[2]-bp_discarded_cc[2],bp_passed_cc[1]-bp_passed_cc[3]-bp_discarded_cc[3])\n# Tabulate\ntab_cc &lt;- tibble(status = status, bp_passed = bp_passed_cc, bp_discarded = bp_discarded_cc, bp_trimmed = bp_trimmed_cc)\ntab_cc\n\n\n\n  \n\n\nCode# Visualize\ntab_cc_gathered &lt;- gather(tab_cc, key = sequence_group, value = bp, -status) |&gt;\n  mutate(sequence_group = sub(\"bp_\", \"\", sequence_group),\n         sequence_group = factor(sequence_group, \n                                 levels = c(\"passed\", \"trimmed\", \"discarded\")),\n         status = factor(status, levels = c(\"raw\", \"fastp\", \"AdapterRemoval\")))\ng_cc &lt;- ggplot(tab_cc_gathered, aes(x=status, y=bp, fill = sequence_group)) +\n  geom_col(position = \"stack\") + scale_fill_brewer(palette = \"Dark2\") + \n 
 theme_base + theme(axis.title.x = element_blank())\ng_cc\n\n\n\n\nFASTQC results:\n\nAs with Rothman, the raw data shows good sequence quality (though with some tailing off at later read positions), poor sequence composition, uniform read length (76bp in this case) and high numbers of duplicates. It also, unsurprisingly, has high adapter content.\nAs with Rothman, fastp successfully improves read quality scores, while AdapterRemoval has little effect. Also as with Rothman, neither tool (as configured) has much effect on sequence composition or duplicates.\n\nIn this case, fastp is highly effective at removing adapter sequences, while AdapterRemoval is only weakly effective. I wonder if I misconfigured AR somehow, because I’m surprised at how many adapter sequences remain in this case. The images below show raw, fastp, and AR adapter content:\n\n\n\n\n\n\n3. Johnson (COMO4)\nThis sample from Johnson contains 15.58M read pairs in the raw FASTQ files.\nFASTP:\n\nRunning FASTP took a total of 33 seconds.\nFASTP detected and trimmed adapters on 158,114 reads (note: not read pairs).\nA total of 1.3 Mb of sequence was trimmed due to adapter trimming, and 14.6 Mb due to other trimming processes, for a total of 15.9 Mb of trimmed sequence.\nA total of 0.33M read pairs were discarded due to failing various filters, leaving 15.25M read pairs remaining.\n\nAdapterRemoval:\n\nRunning AdapterRemoval took a total of 311.4 seconds (a bit over 5 minutes).\nAdapterRemoval detected and trimmed adapters on 155,360 reads (note: again, not read pairs).\nA total of 1.3 Mb of sequence was trimmed across all reads; the information isn’t provided to distinguish trimmed adapter sequences vs other trimming.\nOnly 5,512 read pairs were discarded due to failing various filters, leaving the final read number (again) almost unchanged.\n\nFASTQC results:\n\nAs with previous samples, the raw data shows good sequence quality (though with some tailing off at later read positions), poor sequence 
composition, uniform read length (76bp again) and high numbers of duplicates.\nUnlike previous samples, the raw data for this sample shows very low adapter content – plausibly the reads underwent adapter trimming before they were sent to us?\nNeither tool achieves much visible improvement on adapter content – unsurprisingly, given the very low levels in the raw data.\n\n\nCode# Calculate read allocations for Johnson\nstatus = c(\"raw\", \"fastp\", \"AdapterRemoval\")\nbp_passed_como = c(1183987584+1183987584,1157539804+1157540364,1182999562+1182970312)\nbp_discarded_como = c(0,36950446,418230+257734)\nbp_trimmed_como = c(0,bp_passed_como[1]-bp_passed_como[2]-bp_discarded_como[2],bp_passed_como[1]-bp_passed_como[3]-bp_discarded_como[3])\n# Tabulate\ntab_como &lt;- tibble(status = status, bp_passed = bp_passed_como, bp_discarded = bp_discarded_como, bp_trimmed = bp_trimmed_como)\ntab_como\n\n\n\n  \n\n\nCode# Visualize\ntab_como_gathered &lt;- gather(tab_como, key = sequence_group, value = bp, -status) |&gt;\n  mutate(sequence_group = sub(\"bp_\", \"\", sequence_group),\n         sequence_group = factor(sequence_group, \n                                 levels = c(\"passed\", \"trimmed\", \"discarded\")),\n         status = factor(status, levels = c(\"raw\", \"fastp\", \"AdapterRemoval\")))\ng_como &lt;- ggplot(tab_como_gathered, aes(x=status, y=bp, fill = sequence_group)) +\n  geom_col(position = \"stack\") + scale_fill_brewer(palette = \"Dark2\") + \n  theme_base + theme(axis.title.x = element_blank())\ng_como\n\n\n\n\nDeduplication with fastp\nGiven that all three of these samples contain high levels of sequence duplicates, I was curious to see to what degree fastp was able to improve on this metric. To test this, I reran fastp on all three samples, with the --dedup option enabled. 
I observed the following:\n\nRuntimes were consistently very slightly longer than without deduplication.\nThe number of successful output read pairs declined from 11.21M to 9.29M for the Rothman sample, from 47.47M to 31.47M for the Crits-Christoph sample, and from 15.25M to 11.03M for the Johnson sample.\nRelative to the raw data, and using the default FASTQC settings, the predicted fraction of reads surviving deduplication rose from 26% to 29% for the Rothman sample, from 45% to 64% for the Crits-Christoph sample, and from 26% to 32% for the Johnson sample, following fastp deduplication. That is to say, by this metric, deduplication was only mildly effective.\nThis relative lack of efficacy may simply be because FASTP identifies duplicates as read pairs that are entirely identical in sequence, while FASTQC only looks at the first 50 base pairs of each read in isolation.\nI think I need to learn more about read duplicates and deduplication before I have strong takeaways here.\nConclusions\nTaken together, I think these data make a decent case for using FASTP, rather than AdapterRemoval, for pre-processing and adapter trimming.\n\nFASTP is much faster than AdapterRemoval.\nFor those samples with high adapter content, FASTP appeared more effective than AdapterRemoval at removing adapters, at least for those adapter sequences that could be detected by FASTQC.\nFASTP provides substantially more functionality than AdapterRemoval, making it easier for us to add additional preprocessing steps like read filtering and (some) deduplication down the line."
   }
 ]
diff --git a/img/2023-10-12_cc-fastqc-adapters-ar.png b/img/2023-10-12_cc-fastqc-adapters-ar.png
diff --git a/img/2023-10-12_cc-fastqc-adapters-fastp.png b/img/2023-10-12_cc-fastqc-adapters-fastp.png
diff --git a/img/2023-10-12_cc-fastqc-adapters-raw.png b/img/2023-10-12_cc-fastqc-adapters-raw.png
diff --git a/img/2023-10-12_rothman-fastqc-adapters-ar.png b/img/2023-10-12_rothman-fastqc-adapters-ar.png
diff --git a/img/2023-10-12_rothman-fastqc-adapters-fastp.png b/img/2023-10-12_rothman-fastqc-adapters-fastp.png
diff --git a/img/2023-10-12_rothman-fastqc-adapters-raw.png b/img/2023-10-12_rothman-fastqc-adapters-raw.png
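
The notebook's per-sample tables derive trimmed base pairs as raw minus passed minus discarded. Below is a minimal base-R sketch of that arithmetic for the Johnson (COMO4) sample, using the counts from the notebook's code cell. `allocate_bp` is a hypothetical helper introduced here for illustration; it is not part of the notebook, which builds the same columns inline with `tibble`.

```r
# Hypothetical helper (not in the notebook): split the raw base pairs for one
# tool into passed / discarded / trimmed, where "trimmed" is the remainder.
allocate_bp <- function(tool, raw_bp, passed_bp, discarded_bp) {
  data.frame(
    tool = tool,
    passed = passed_bp,
    discarded = discarded_bp,
    trimmed = raw_bp - passed_bp - discarded_bp
  )
}

# Base-pair counts for the Johnson (COMO4) sample, copied from the notebook.
raw_bp <- 1183987584 + 1183987584
como <- rbind(
  allocate_bp("fastp", raw_bp, 1157539804 + 1157540364, 36950446),
  allocate_bp("AdapterRemoval", raw_bp, 1182999562 + 1182970312, 418230 + 257734)
)

# Express the trimmed remainder in Mb for comparison with the prose summaries.
como$trimmed_mb <- round(como$trimmed / 1e6, 1)
print(como)
```

Because trimmed is defined as a remainder, the three columns always sum back to the raw total, which makes the stacked bar plots comparable across tools.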