diff --git a/docs/img/2023-10-11_element-avidite-arm.png b/docs/img/2023-10-11_element-avidite-arm.png new file mode 100644 index 0000000..8b7314a Binary files /dev/null and b/docs/img/2023-10-11_element-avidite-arm.png differ diff --git a/docs/img/2023-10-11_element-avidite-binding.png b/docs/img/2023-10-11_element-avidite-binding.png new file mode 100644 index 0000000..6c6e3d9 Binary files /dev/null and b/docs/img/2023-10-11_element-avidite-binding.png differ diff --git a/docs/img/2023-10-11_element-avidite.png b/docs/img/2023-10-11_element-avidite.png new file mode 100644 index 0000000..5c656f4 Binary files /dev/null and b/docs/img/2023-10-11_element-avidite.png differ diff --git a/docs/img/2023-10-11_element-base-calling.png b/docs/img/2023-10-11_element-base-calling.png new file mode 100644 index 0000000..684bc02 Binary files /dev/null and b/docs/img/2023-10-11_element-base-calling.png differ diff --git a/docs/img/2023-10-11_element-elongation.png b/docs/img/2023-10-11_element-elongation.png new file mode 100644 index 0000000..5f1de84 Binary files /dev/null and b/docs/img/2023-10-11_element-elongation.png differ diff --git a/docs/img/2023-10-11_element-imaging.png b/docs/img/2023-10-11_element-imaging.png new file mode 100644 index 0000000..80c6947 Binary files /dev/null and b/docs/img/2023-10-11_element-imaging.png differ diff --git a/docs/img/2023-10-11_rolling-circle-amplification.png b/docs/img/2023-10-11_rolling-circle-amplification.png new file mode 100644 index 0000000..e2e8d7a Binary files /dev/null and b/docs/img/2023-10-11_rolling-circle-amplification.png differ diff --git a/docs/index.html b/docs/index.html index c93175b..a0cd625 100644 --- a/docs/index.html +++ b/docs/index.html @@ -151,7 +151,29 @@

-
+
+
+

+
+ + +
+
diff --git a/docs/listings.json b/docs/listings.json index 32c6c28..21a646a 100644 --- a/docs/listings.json +++ b/docs/listings.json @@ -3,6 +3,7 @@ "listing": "/index.html", "items": [ "/notebooks/2023-10-12_fastp-vs-adapterremoval.html", + "/notebooks/2023-10-12_how-does-element-sequencing-work.html", "/notebooks/2023-09-12_settled-solids-extraction-test.html" ] } diff --git a/docs/notebooks/2023-10-12_how-does-element-sequencing-work.html b/docs/notebooks/2023-10-12_how-does-element-sequencing-work.html new file mode 100644 index 0000000..5468f7e --- /dev/null +++ b/docs/notebooks/2023-10-12_how-does-element-sequencing-work.html @@ -0,0 +1,668 @@ + + + + + + + + + + + +Will’s Public NAO Notebook - How does Element AVITI sequencing work? + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+ +
+
+
+

How does Element AVITI sequencing work?

+

Findings of a shallow investigation

+
+
+ + +
+ +
+
Author
+
+

Will Bradshaw

+
+
+ +
+
Published
+
+

October 11, 2023

+
+
+ + +
+ + +
+ + + + +
+ + + + +

In September 2023, the NAO team sent several samples to the MIT BioMicro Center, for library preparation and sequencing using their new Element AVITI sequencer. This machine works on quite different principles from Illumina sequencing, but also produces high-volume, paired-end, high-accuracy short reads. Since it looks like we might be using this machine quite a lot in the future, it pays to understand what it's doing. However, I found most quick explanations of Element sequencing much harder to follow than equivalent explanations of Illumina's sequencing technology (e.g. here).

+

To try and understand this better, I dug deeper, using a combination of talks by Element staff on YouTube, their core methods paper, and aggressive interrogation of Claude 2. Given my difficulty understanding this, I figured others on the team might also benefit from a quick-ish write-up of my current best understanding, presented here. Note that this does not go into the performance of Element sequencing, only the underlying mechanisms. Note also that, given the lack of very detailed documentation about many aspects of the process, my understanding here is inevitably more high-level than it would be for e.g. Illumina sequencing.

+
+

1. Library prep

+
    +
  • The fundamental stages required in library prep for Element sequencing are mostly very similar to those for Illumina sequencing: fragmentation, addition of terminal adapter sequences, optional amplification, size selection, and cleanup. The main additional step required is circularization: to be compatible with Element's cluster generation method (see below), mature Element library molecules must be circular, with the 5' and 3' adapters joined end-to-end.

  • +
  • Given the similarity with Illumina library prep procedures, Element have sensibly designed their processes to be compatible with many standard Illumina library prep kits. There are two main ways to adapt Illumina library prep kits for Element sequencing:

    +
      +
    • In the first (Elevate) workflow, the standard kit protocol is followed, but with Element adapter oligos (including sample indices) replacing Illumina adapters. The library molecules are then circularized by ligating the ends of the adapters together, cleaned to remove linear molecules, and are ready to be introduced to the flow cell.

    • +
    • In the second (Adept) workflow, the standard kit protocol is followed completely (including use of Illumina adapters), and additional steps then convert the resulting library into an Element library: addition of terminal Element adapter oligos, circularization, and cleanup.

    • +
  • +
+
+
+

2. Cluster generation

+
    +
  • Following library prep, the libraries are denatured to produce single-stranded circular DNA molecules, then washed across a flat flow-cell studded with attached oligos complementary to the adapter sequences. 

  • +
  • Library molecules bind to these oligos, with unbound library molecules washed away. 

  • +
  • Polymerases and nucleotides are added, and the polymerases extend each attached oligo via rolling circle amplification.

    +
      +
    • Briefly, the polymerase starts at the hybridized adapter/oligo double-stranded sequence and moves around the circular library molecule. When it reaches the end of the circle, it continues on to another revolution, displacing its own previously-synthesized daughter strand as it goes.

    • +
    • This continues over repeated passes, producing a long single-stranded molecule containing many concatenated copies of (the complement of) the original library molecule sequence:

      +
        +
      • +
      • Imagine this picture, but with the blue primer attached to a flow-cell at one end.

      • +
    • +
    • The resulting long, attached molecule is referred to as a concatemer, or polony. A prepared AVITI high-output flow-cell contains roughly 1 billion polonies, each of which corresponds to one read pair. (An AVITI run comprises two flow cells run in parallel, for roughly 2 billion read pairs per run.)

    • +
  • +
+
+
+
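The rolling circle amplification step above can be sketched as a toy string model. This is only an illustration under simplifying assumptions: the function names and the 9-base example sequence are hypothetical, the circle is written from an arbitrary start point, and the real product's starting offset depends on where the surface oligo hybridizes.

```python
# Toy model of rolling circle amplification (RCA); not Element's actual chemistry.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def rolling_circle(circle: str, revolutions: int) -> str:
    """Return the concatemer made by copying a circular template repeatedly.

    `circle` is the circular library molecule, linearized at an arbitrary
    point (an assumption of this sketch). Each lap of the polymerase appends
    one more complementary copy to the growing strand.
    """
    return reverse_complement(circle) * revolutions


library = "ATGCCGTTA"  # hypothetical circular library molecule
polony = rolling_circle(library, revolutions=3)
print(polony)  # three tandem copies of the complement of the library molecule
```

The point of the sketch is that every copy in the concatemer carries the same sequence, which is what later allows many primers and polymerases to act on one polony in step.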

3. Daughter strand elongation

+
    +
  • Although Element sequencing is not sequencing-by-synthesis, it is, as it were, sequencing-with-synthesis. As in Illumina sequencing, the core of each sequencing cycle is the stepwise elongation of daughter strands complementary to the library molecule sequence in each cluster, followed by imaging to determine the next nucleotide in the sequence. The mechanism of base calling is completely different (and will be described in the next section) but the stepwise elongation process is closely related.

  • +
  • In the case of Element sequencing, elongation begins with annealing of a sequencing primer complementary to one of the Element adapter sequences. These primers will bind many times to a given polony molecule, at the beginning of each copy of the RCA-duplicated library sequence.

  • +
  • A mixture of DNA polymerase and reversible chain terminator nucleotides is then washed across the flow cell. The polymerases bind the double-stranded primer sequences and incorporate a complementary terminator nucleotide, extending the double-stranded sequence by one base pair (after which further elongation is blocked).

  • +
  • The polymerases and free nucleotides are displaced and washed away, after which the blocking group on the incorporated terminator nucleotides is removed (enabling further elongation). Base-calling occurs (see below), after which the cycle repeats with the addition of polymerase and terminator nucleotides.

  • +
  • +
  • To generate a reverse read, the same process takes place, but using primers complementary to the other adapter sequence.

  • +
+
+
+

4. Labeling, imaging, and base calling

+
    +
  • The optical generation of base calls is the most complex and distinctive aspect of Element sequencing, and the one that I had the hardest time understanding. What follows is my best attempt at an explanation, but I'm not fully confident I haven't misunderstood something fundamental.
  • +
+
+

4a. Background and justification

+
    +
  • When a polymerase binds a DNA strand, it first positions itself over the boundary between the double-stranded primer region and the single-stranded template region. It then recruits and positions a nucleotide complementary to the first base of the template region, using a combination of base pairing and direct interactions between the nucleotide and the polymerase enzyme itself. Finally, it incorporates the new nucleotide into the elongating daughter strand by connecting it to the end of that strand via a new phosphodiester bond.

    +
      +
    • Usually, the polymerase then repeats the cycle by recruiting and incorporating a nucleotide complementary to the next base of the template strand; however, if the incorporated nucleotide is a chain terminator, it is unable to do this, and stalls.
    • +
  • +
  • In Illumina sequencing, the terminator nucleotide incorporated by the polymerase is fluorescently labeled, and is imaged following incorporation. The fluorophore is then cleaved off along with the terminator group, and the cycle repeats. As a result, the processes of daughter strand elongation and base calling are closely bound together.

  • +
  • In Element sequencing, the goal is to separate the processes of daughter strand elongation (above) and base calling, so that the two can be optimized separately. To achieve this, the aim is to call the next unincorporated position in the template sequence, rather than (as in Illumina sequencing) the most recently incorporated position.

  • +
  • One theoretical way to do this would be to use an engineered polymerase that is able to recruit complementary nucleotides but not incorporate them. One could supply this polymerase with fluorescent nucleotides, and it would recruit the one complementary to the next position on the template strand. This would occur simultaneously at many different locations on each polony, corresponding to the different copies of the library sequence produced by RCA. One could then image the flow cell to identify the nucleotide type recruited at each polony.

  • +
  • The problem with the above approach is low signal persistence. Without incorporation, recruitment of nucleotides by the polymerase is weak and transient: the nucleotide binds its complementary base and the polymerase, remains for a short time, then dissociates. The result is that, for any given polony, too few nucleotides are recruited at any one time to give a sufficient signal for imaging.

  • +
  • In order for an approach like this to work, then, we need a way to improve signal persistence without relying on covalent incorporation of nucleotides. Enter avidity sequencing.

  • +
+
+
+
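The difference between calling the most recently incorporated position and calling the next unincorporated one can be made concrete with a small sketch. It is a toy model only: the function names and the example template are hypothetical, and the template is written in the order the polymerase encounters it.

```python
# Toy comparison of which template position each approach interrogates.
PAIR = {"A": "T", "C": "G", "G": "C", "T": "A"}


def call_last_incorporated(template: str, n_incorporated: int) -> str:
    """Base read from the labeled nucleotide that was just incorporated."""
    if n_incorporated < 1:
        raise ValueError("nothing has been incorporated yet")
    return PAIR[template[n_incorporated - 1]]


def call_next_unincorporated(template: str, n_incorporated: int) -> str:
    """Base read from the nucleotide recruited to the next open position."""
    return PAIR[template[n_incorporated]]


template = "GATTACA"  # hypothetical single-stranded region after the primer
for n in range(1, 4):
    print(n, call_last_incorporated(template, n), call_next_unincorporated(template, n))
```

In this model the two approaches report the same sequence of bases, offset by one elongation step; what changes is that the labeled molecule is never part of the daughter strand in the second case.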

4b. Base calling by avidity

+
    +
  • The avidity of a molecular interaction is the accumulated strength of that interaction across multiple separate noncovalent bonds. Even if any single one of these bonds is weak and transient, the overall interaction can be strong and stable if the two molecules interact at many different points.

  • +
  • In Element avidity sequencing, the avidite is a large molecular construct, comprising a fluorescently labeled protein core connected to some number of (identical) nucleotides via flexible linker regions. Each of these nucleotide groups can be independently recruited by a polymerase bound to a polony, and positioned based on base-pairing interactions. While each of these nucleotide:template:polymerase interactions is too weak and transient to sustain a strong signal, the avidite as a whole is bound to the polony via multiple such interactions, producing a strong and stable interaction overall.

    +
      +
    • +
    • Example avidite structure from the avidity sequencing paper. The core of the molecule consists of fluorescently labeled streptavidin, bound to linker regions via streptavidin:biotin interactions. Three of the four linkers shown here end in nucleotides (specifically, adenosine); the fourth mediates core:core interactions to produce an even larger avidite complex.

    • +
    • +
    • Example avidite arm structure, with biotin at one end (top-left) and adenosine at the other (bottom-right).

    • +
  • +
  • The base-calling phase of the avidity sequencing cycle thus proceeds as follows:

    +
      +
    • Prior to the base-calling phase, the polymerase and nucleotides involved in the elongation phase are detached and washed away.

    • +
    • The flow cell is then washed with a mixture containing an engineered polymerase as well as four fluorescently-labeled avidites (one each for A, C, G and T). The engineered polymerase (henceforth the avidite-binding polymerase, or ABP) is distinct from that used for elongation, and is capable of binding a template strand and recruiting a complementary nucleotide, but not capable of incorporation.

    • +
    • The ABPs bind to the double-stranded regions of each polony and position themselves at the boundary with the single-stranded template region. They then attempt to recruit nucleotides complementary to the next position on the template strand. The only nucleotides available are those attached to the avidites, which are thus recruited. 

    • +
    • Since the copies of the template sequence within a polony are synchronized, every polymerase bound to that polony attempts to recruit the same nucleotide type, and thus interacts with the same type of avidite. Each avidite molecule is thus recruited to multiple points on the polony, producing a stable overall interaction.

      +
        +
      • +
    • +
    • Many avidite molecules of the same type are thus recruited to each polony, producing a strong and uniform fluorescent signal.

      +
        +
      • +
    • +
    • The flow cell is then imaged to identify the avidite bound to each polony, and thus the next nucleotide in each read. After this, the ABPs and avidites are detached and washed away, and the cycle proceeds to the next elongation phase (see above).

    • +
  • +
  • +
+
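To get a rough feel for why multiple weak contacts add up to a stable interaction, here is a deliberately crude sketch. It assumes each avidite arm is engaged independently with the same probability at any instant, and treats the avidite as attached whenever at least one arm is engaged. Both the model and the numbers are illustrative assumptions, not values from the avidity sequencing paper.

```python
def fraction_attached(p_arm: float, n_arms: int) -> float:
    """Probability that at least one of `n_arms` independent arms is engaged.

    `p_arm` is the (hypothetical) chance that a single nucleotide arm is
    held by a polymerase at a given moment.
    """
    return 1.0 - (1.0 - p_arm) ** n_arms


# Compare a lone labeled nucleotide (one arm) with multi-armed constructs.
for n_arms in (1, 2, 4, 8):
    print(n_arms, round(fraction_attached(0.3, n_arms), 3))
```

Under these assumptions the attached fraction rises quickly with the number of arms, which matches the qualitative argument above; real binding kinetics are of course more complicated than independent coin flips.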


+

+ + + + +
+
+ +
+ +
+ + + + \ No newline at end of file diff --git a/docs/notebooks/2023-10-12_how-does-element-sequencing-work_files/figure-html/unnamed-chunk-2-1.png b/docs/notebooks/2023-10-12_how-does-element-sequencing-work_files/figure-html/unnamed-chunk-2-1.png new file mode 100644 index 0000000..d3d13cd Binary files /dev/null and b/docs/notebooks/2023-10-12_how-does-element-sequencing-work_files/figure-html/unnamed-chunk-2-1.png differ diff --git a/docs/notebooks/2023-10-12_how-does-element-sequencing-work_files/figure-html/unnamed-chunk-3-1.png b/docs/notebooks/2023-10-12_how-does-element-sequencing-work_files/figure-html/unnamed-chunk-3-1.png new file mode 100644 index 0000000..eef62bf Binary files /dev/null and b/docs/notebooks/2023-10-12_how-does-element-sequencing-work_files/figure-html/unnamed-chunk-3-1.png differ diff --git a/docs/notebooks/2023-10-12_how-does-element-sequencing-work_files/figure-html/unnamed-chunk-4-1.png b/docs/notebooks/2023-10-12_how-does-element-sequencing-work_files/figure-html/unnamed-chunk-4-1.png new file mode 100644 index 0000000..2c957e2 Binary files /dev/null and b/docs/notebooks/2023-10-12_how-does-element-sequencing-work_files/figure-html/unnamed-chunk-4-1.png differ diff --git a/docs/search.json b/docs/search.json index 110bd82..3cbeb05 100644 --- a/docs/search.json +++ b/docs/search.json @@ -32,7 +32,7 @@ "href": "index.html", "title": "Will's Public NAO Notebook", "section": "", - "text": "Comparing FASTP and AdapterRemoval for MGS pre-processing\n\n\nTwo tools – how do they perform?\n\n\n\n\n\n\nOct 21, 2023\n\n\n\n\n\n\n \n\n\n\n\nExtraction experiment 2: high-level results & interpretation\n\n\nComparing RNA yields and quality across extraction kits for settled solids\n\n\n\n\n\n\nSep 21, 2023\n\n\n\n\n\n\nNo matching items" + "text": "Comparing FASTP and AdapterRemoval for MGS pre-processing\n\n\nTwo tools – how do they perform?\n\n\n\n\n\n\nOct 21, 2023\n\n\n\n\n\n\n \n\n\n\n\nHow does Element AVITI sequencing work?\n\n\nFindings 
of a shallow investigation\n\n\n\n\n\n\nOct 11, 2023\n\n\n\n\n\n\n \n\n\n\n\nExtraction experiment 2: high-level results & interpretation\n\n\nComparing RNA yields and quality across extraction kits for settled solids\n\n\n\n\n\n\nSep 21, 2023\n\n\n\n\n\n\nNo matching items" }, { "objectID": "notebooks/2023-10-12_fastp-vs-adapterremoval.html", @@ -40,5 +40,26 @@ "title": "Comparing FASTP and AdapterRemoval for MGS pre-processing", "section": "", "text": "The first major step in our current MGS pipeline uses AdapterRemoval to automatically identify and remove sequencing adapters, as well as trimming low-quality bases and collapsing overlapping read pairs (it can also discard low-quality reads entirely, but our current pipeline doesn’t use this). An alternative tool, that can do all of this as well as read deduplication, is fastp. I asked the pipeline’s current primary maintainer if there was a good reason we were using one tool instead of the other, and he said that there wasn’t. So I decided to do a shallow investigation of their relative behavior on some example MGS datasets to see how they compare.\nThe data\nTo carry out this test, I selected three pairs of raw Illumina FASTQ files, corresponding to one sample each from two different published studies as well as one dataset provided to us by Marc Johnson:\n\n\nStudy\nBioproject\nSample\n\n\n\nRothman et al. (2021)\nPRJNA729801\nSRR19607374\n\n\nCrits-Christoph et al. 
(2021)\nPRJNA661613\nSRR23998357\n\n\nJohnson (2023)\nN/A\nCOMO4\n\n\n\nFor each sample, I generated FASTQC report files for the raw data, then ran FASTP and AdapterRemoval independently on the FASTQ files and tabulated the results\nThe commands\nFor processing with FASTP, I ran the following command:\nfastp -i <raw-reads-1> -I <raw-reads-2> -o <output-path-1> -O <output-path-2> --failed_out <output-path-failed-reads> --cut_tail --correction\n(I didn’t run deduplication for this test, as AdapterRemoval doesn’t have that functionality.)\nFor processing with AdapterRemoval, I first ran the following command to identify adapters:\nAdapterRemoval --file1 <raw-reads-1> --file2 <raw-reads-2> --identify-adapters --threads 4 > adapter_report.txt\nI then ran the following command to actually carry out pre-processing, using the adapter sequences identified in the previous step (NB the minlength and maxns values are chosen to match the FASTP defaults):\nAdapterRemoval --file1 <raw-reads-1> --file2 <raw-reads-2> --basename <output-prefix> --adapter1 <adapter1> --adapter2 <adapter2> --gzip --trimns --trimqualities --minlength 15 --maxns 5\n1. Rothman et al. (SRR19607374)\nThis sample from Rothman et al. 
contains 11.58M read pairs in the raw FASTQ files.\nFASTP:\n\nRunning FASTP took a total of 39 seconds.\nFASTP detected and trimmed adapters on 3.88M reads (note: not read pairs).\nA total of 133 Mb of sequence was trimmed due to adapter trimming, and 55 Mb due to other trimming processes, for a total of 188 Mb of trimmed sequence.\nA total of 367,938 read pairs were discarded due to failing various filters, leaving 11.21M read pairs remaining.\n\nAdapterRemoval:\n\nRunning AdapterRemoval took a total of 323.9 seconds (a bit under 5.5 minutes).\nAdapterRemoval detected and trimmed adapters on 3.96M reads (note: again, not read pairs).\nA total of 135 Mb of sequence was trimmed across all reads; the information isn’t provided to distinguish trimmed adapter sequences vs other trimming.\nOnly 2,347 read pairs were discarded due to failing various filters, leaving the final read number almost unchanged.\n\n\nCode# Calculate read allocations for Rothman\nstatus = c(\"raw\", \"fastp\", \"AdapterRemoval\")\nbp_passed = c(1748896043+1748896043,1627794563+1627794563,1681118328+1681115431)\nbp_discarded = c(0,54014538,276107+271850)\nbp_trimmed = c(0,bp_passed[1]-bp_passed[2]-bp_discarded[2],bp_passed[1]-bp_passed[3]-bp_discarded[3])\n# Tabulate\ntab_rothman <- tibble(status = status, bp_passed = bp_passed, bp_discarded = bp_discarded, bp_trimmed = bp_trimmed)\ntab_rothman\n\n\n\n \n\n\nCode# Visualize\ntab_rothman_gathered <- gather(tab_rothman, key = sequence_group, value = bp, -status) |>\n mutate(sequence_group = sub(\"bp_\", \"\", sequence_group),\n sequence_group = factor(sequence_group, \n levels = c(\"passed\", \"trimmed\", \"discarded\")),\n status = factor(status, levels = c(\"raw\", \"fastp\", \"AdapterRemoval\")))\ng_rothman <- ggplot(tab_rothman_gathered, aes(x=status, y=bp, fill = sequence_group)) +\n geom_col(position = \"stack\") + scale_fill_brewer(palette = \"Dark2\") + \n theme_base + theme(axis.title.x = element_blank())\ng_rothman\n\n\n\n\nFASTQC 
results:\n\nPrior to adapter removal with either tool, the sequencing reads appear good quality, with a consistent average quality score of 30 across all bases in the forward read and ~29 in the reverse read. FASTP successfully raises the average quality score in the reverse read to 30 through trimming and read filtering, while AdapterRemoval leaves it unchanged.\nFASTQC judges the data to have iffy sequence composition (%A/C/G/T); neither tool affects this much.\nAll reads in the raw data are 151bp long; unsurprisingly, trimming by both tools results in a left tail in the sequencing length distribution that was absent in the raw data.\nAs previously observed, the raw data has very high duplicate levels, with only ~26% of sequences estimated by FASTQC to remain after deduplication. Increasing the comparison window to 100bp (from a default of 50bp) increases this to ~35%. Neither tool has much effect on this number – unsurprisingly, since neither carried out deduplication.\nFinally, adapter removal. Unsurprisingly, the raw data shows substantial adapter content. AdapterRemoval does a good job of removing adapters, resulting in a “pass” grade from FASTQC. Surprisingly, despite trimming adapters from fewer reads, fastp does even better (according to FASTQC) at removing adapters.\n\nThe images below show raw, fastp, and AR adapter content:\n\n\n\n\n\n\n2. Crits-Christoph et al. (SRR23998357)\nThis sample from Crits-Christoph et al. 
contains 48.46M read pairs in the raw FASTQ files.\nFASTP:\n\nRunning FASTP took a total of 99 seconds.\nFASTP detected and trimmed adapters on 13.41M reads (note: not read pairs).\nA total of 270 Mb of sequence was trimmed due to adapter trimming, and 43 Mb due to other trimming processes, for a total of 313 Mb of trimmed sequence.\nA total of 1.99M read pairs were discarded due to failing various filters, leaving 47.47M read pairs remaining.\n\nAdapterRemoval:\n\nRunning AdapterRemoval took a total of 1041.3 seconds (a bit over 17 minutes).\nAdapterRemoval detected and trimmed adapters on 8.22M reads (note: again, not read pairs).\nA total of 93.7 Mb of sequence was trimmed across all reads; the information isn’t provided to distinguish trimmed adapter sequences vs other trimming.\nOnly 32,381 read pairs were discarded due to failing various filters, leaving the final read number (again) almost unchanged.\n\n\nCode# Calculate read allocations for CritsCristoph\nstatus = c(\"raw\", \"fastp\", \"AdapterRemoval\")\nbp_passed_cc = c(3683175308+3683175308,3465517525+3467624847,3634186441+3634143668)\nbp_discarded_cc = c(0,120701257,2057612+2224529)\nbp_trimmed_cc = c(0,bp_passed_cc[1]-bp_passed_cc[2]-bp_discarded_cc[2],bp_passed_cc[1]-bp_passed_cc[3]-bp_discarded_cc[3])\n# Tabulate\ntab_cc <- tibble(status = status, bp_passed = bp_passed_cc, bp_discarded = bp_discarded_cc, bp_trimmed = bp_trimmed_cc)\ntab_cc\n\n\n\n \n\n\nCode# Visualize\ntab_cc_gathered <- gather(tab_cc, key = sequence_group, value = bp, -status) |>\n mutate(sequence_group = sub(\"bp_\", \"\", sequence_group),\n sequence_group = factor(sequence_group, \n levels = c(\"passed\", \"trimmed\", \"discarded\")),\n status = factor(status, levels = c(\"raw\", \"fastp\", \"AdapterRemoval\")))\ng_cc <- ggplot(tab_cc_gathered, aes(x=status, y=bp, fill = sequence_group)) +\n geom_col(position = \"stack\") + scale_fill_brewer(palette = \"Dark2\") + \n theme_base + theme(axis.title.x = 
element_blank())\ng_cc\n\n\n\n\nFASTQC results:\n\nAs with Rothman, the raw data shows good sequence quality (though with some tailing off at later read positions), poor sequence composition, uniform read length (76bp in this case) and high numbers of duplicates. They also, unsurprisingly, have high adaptor content.\nAs with Rothman, fastp successfully improves read quality scores, while AdapterRemoval has little effect. Also as with Rothman, neither tool (as configured) has much effect on sequence composition or duplicates.\n\nIn this case, fastp is highly effective at removing adapter sequences, while AdapterRemoval is only weakly effective. I wonder if I misconfigured AR somehow, because I’m surprised at how many adapter sequences remain in this case. The images below show raw, fastp, and AR adapter content:\n\n\n\n\n\n\n3. Johnson (COMO4)\nThis sample from Johnson contains 15.58M read pairs in the raw FASTQ files.\nFASTP:\n\nRunning FASTP took a total of 33 seconds.\nFASTP detected and trimmed adapters on 158,114 reads (note: not read pairs).\nA total of 1.3 Mb of sequence was trimmed due to adapter trimming, and 14.6 Mb due to other trimming processes, for a total of 15.9 Mb of trimmed sequence.\nA total of 0.33M read pairs were discarded due to failing various filters, leaving 15.25M read pairs remaining.\n\nAdapterRemoval:\n\nRunning AdapterRemoval took a total of 311.4 seconds (a bit over 5 minutes).\nAdapterRemoval detected and trimmed adapters on 155,360 reads (note: again, not read pairs).\nA total of 93.7 Mb of sequence was trimmed across all reads; the information isn’t provided to distinguish trimmed adapter sequences vs other trimming.\nOnly 5,512 read pairs were discarded due to failing various filters, leaving the final read number (again) almost unchanged.\n\nFASTQC results:\n\nAs with previous samples, the raw data shows good sequence quality (though with some tailing off at later read positions), poor sequence composition, uniform read length 
(76bp again) and high numbers of duplicates.\nUnlike previous samples, the raw data for this sample shows very low adapter content – plausibly they underwent adapter trimming before they were sent to us?\nNeither tool achieves much visible improvement on adapter content – unsurprisingly, given the very low levels in the raw data.\n\n\nCode# Calculate read allocations for Johnson\nstatus = c(\"raw\", \"fastp\", \"AdapterRemoval\")\nbp_passed_como = c(1183987584+1183987584,1157539804+1157540364,1182999562+1182970312)\nbp_discarded_como = c(0,36950446,418230+257734)\nbp_trimmed_como = c(0,bp_passed_como[1]-bp_passed_como[2]-bp_discarded_como[2],bp_passed_como[1]-bp_passed_como[3]-bp_discarded_como[3])\n# Tabulate\ntab_como <- tibble(status = status, bp_passed = bp_passed_como, bp_discarded = bp_discarded_como, bp_trimmed = bp_trimmed_como)\ntab_como\n\n\n\n \n\n\nCode# Visualize\ntab_como_gathered <- gather(tab_como, key = sequence_group, value = bp, -status) |>\n mutate(sequence_group = sub(\"bp_\", \"\", sequence_group),\n sequence_group = factor(sequence_group, \n levels = c(\"passed\", \"trimmed\", \"discarded\")),\n status = factor(status, levels = c(\"raw\", \"fastp\", \"AdapterRemoval\")))\ng_cc <- ggplot(tab_como_gathered, aes(x=status, y=bp, fill = sequence_group)) +\n geom_col(position = \"stack\") + scale_fill_brewer(palette = \"Dark2\") + \n theme_base + theme(axis.title.x = element_blank())\ng_cc\n\n\n\n\nDeduplication with fastp\nGiven that all three of these samples contain high levels of sequence duplicates, I was curious to see to what degree fastp was able to improve on this metric. To test this, I reran fastp on all three samples, with the --dedup option enabled. 
I observed the following:\n\nRuntimes were consistently very slightly longer than without deduplication.\nThe number of successful output reads declined from 11.21M to 9.29M for the Rothman sample, from 47.47M to 31.47M for the Crits-Christoph sample, and from 15.25M to 11.03M for the Johnson sample.\nRelative to the raw data, and using the default FASTQC settings, the predicted fraction of reads surviving deduplication rose from 26% to 29% for the Rothman sample, from 45% to 64% for the Crits-Christoph sample, and from 26% to 32% for the Johnson sample, following fastp deduplication. That is to say, by this metric, deduplication was only mildly effective.\nThis relative lack of efficacy may simply be because FASTP identifies duplicates as read pairs that are entirely identical in sequence, while FASTQC only looks at the first 50 base pairs of each read in isolation.\nI think I need to learn more about read duplicates and deduplication before I have strong takeaways here.\nConclusions\nTaken together, I think these data make a decent case for using FASTP, rather than AdapterRemoval, for pre-processing and adapter trimming.\n\nFASTP is much faster than AdapterRemoval.\nFor those samples with high adapter content, FASTP appeared more effective than AdapterRemoval at removing adapters, at least for those adapter sequences that could be detected by FASTQC.\nFASTP appears to be more aggressive at quality trimming reads than AdapterRemoval, resulting in better read quality distributions in FASTQC.\nFASTP provides substantially more functionality than AdapterRemoval, making it easier for us to add additional preprocessing steps like read filtering and (some) deduplication down the line.\n\nHowever, one important caveat is that it’s unclear how well either tool will perform on Element sequencing data – or how well FASTQC will be able to detect Element adapters that remain after preprocessing." 
+ }, + { + "objectID": "notebooks/2023-10-12_how-does-element-sequencing-work.html", + "href": "notebooks/2023-10-12_how-does-element-sequencing-work.html", + "title": "How does Element AVITI sequencing work?", + "section": "", + "text": "In September 2023, the NAO team sent several samples to the MIT BioMicro Center, for library preparation and sequencing using their new Element AVITI sequencer. This machine works on quite different principles from Illumina sequencing, but also produces high-volume, paired-end, high-accuracy short reads. Since it looks like we might be using this machine quite a lot in the future, it pays to understand what it's doing. However, I found most quick explanations of Element sequencing much harder to follow than equivalent explanations of Illumina's sequencing technology (e.g. here).\nTo try and understand this better, I dug deeper, using a combination of talks by Element staff on YouTube, their core methods paper, and aggressive interrogation of Claude 2. Given my difficulty understanding this, I figured others on the team might also benefit from a quick-ish write-up of my current best understanding, presented here. Note that this does not go into the performance of Element sequencing, only the underlying mechanisms. Note also that, given the lack of very detailed documentation about many aspects of the process, my understanding here is inevitably more high-level than it would be for e.g. Illumina sequencing." + }, + { + "objectID": "notebooks/2023-10-12_how-does-element-sequencing-work.html#a.-background-and-justification", + "href": "notebooks/2023-10-12_how-does-element-sequencing-work.html#a.-background-and-justification", + "title": "How does Element AVITI sequencing work?", + "section": "4a. Background and justification", + "text": "4a. 
Background and justification\n\nWhen a polymerase binds a DNA strand, it first positions itself over the boundary between the double-stranded primer region and the single-stranded template region. It then recruits and positions a nucleotide complementary to the first base of the template region, using a combination of base pairing and direct interactions between the nucleotide and the polymerase enzyme itself. Finally, it incorporates the new nucleotide into the elongating daughter strand by connecting it to the end of that strand via a new phosphodiester bond.\n\nUsually, the polymerase then repeats the cycle by recruiting and incorporating a nucleotide complementary to the next base of the template strand; however, if the incorporated nucleotide is a chain terminator, it is unable to do this, and stalls.\n\nIn Illumina sequencing, the terminator nucleotide incorporated by the polymerase is fluorescently labeled, and is imaged following incorporation. The fluorophore is then cleaved off along with the terminator group, and the cycle repeats. As a result, the processes of daughter strand elongation and base calling are closely bound together.\nIn Element sequencing, the goal is to separate the processes of daughter strand elongation (above) and base calling, so that the two can be optimized separately. To achieve this, the aim is to call the next unincorporated position in the template sequence, rather than (as in Illumina sequencing) the most recently incorporated position.\nOne theoretical way to do this would be to use an engineered polymerase that is able to recruit complementary nucleotides but not incorporate them. One could supply this polymerase with fluorescent nucleotides, and it would recruit the one complementary to the next position on the template strand. This would occur simultaneously at many different locations on each polony, corresponding to the different copies of the library sequence produced by RCA. 
One could then image the flow cell to identify the nucleotide type recruited at each polony.\nThe problem with the above approach is low signal persistence. Without incorporation, recruitment of nucleotides by the polymerase is weak and transient: the nucleotide binds its complementary base and the polymerase, remains for a short time, then dissociates. The result is that, for any given polony, too few nucleotides are recruited at any one time to give a sufficient signal for imaging.\nIn order for an approach like this to work, then, we need a way to improve signal persistence without relying on covalent incorporation of nucleotides. Enter avidity sequencing." + }, + { + "objectID": "notebooks/2023-10-12_how-does-element-sequencing-work.html#b.-base-calling-by-avidity", + "href": "notebooks/2023-10-12_how-does-element-sequencing-work.html#b.-base-calling-by-avidity", + "title": "How does Element AVITI sequencing work?", + "section": "4b. Base calling by avidity", + "text": "4b. Base calling by avidity\n\nThe avidity of a molecular interaction is the accumulated strength of that interaction across multiple separate noncovalent bonds. Even if any single one of these bonds is weak and transient, the overall interaction can be strong and stable if the two molecules interact at many different points.\nIn Element avidity sequencing, the avidite is a large molecular construct, comprising a fluorescently labeled protein core connected to some number of (identical) nucleotides via flexible linker regions. Each of these nucleotide groups can be independently recruited by a polymerase bound to a polony, and positioned based on base-pairing interactions. While each of these nucleotide:template:polymerase interactions is too weak and transient to sustain a strong signal, the avidite as a whole is bound to the polony via multiple such interactions, producing a strong and stable interaction overall.\n\n\nExample avidite structure from the avidity sequencing paper. 
The core of the molecule consists of fluorescently labeled streptavidin, bound to linker regions via streptavidin:biotin interactions. Three of the four linkers shown here end in nucleotides (specifically, adenosine); the fourth mediates core:core interactions to produce an even larger avidite complex.\n\nExample avidite arm structure, with biotin at one end (top-left) and adenosine at the other (bottom-right).\n\nThe base-calling phase of the avidity sequencing cycle thus proceeds as follows:\n\nPrior to the base-calling phase, the polymerase and nucleotides involved in the elongation phase are detached and washed away.\nThe flow cell is then washed with a mixture containing an engineered polymerase as well as four fluorescently-labeled avidites (one each for A, C, G and T). The engineered polymerase (henceforth the avidite-binding polymerase, or ABP) is distinct from that used for elongation, and is capable of binding a template strand and recruiting a complementary nucleotide, but not capable of incorporation.\nThe ABPs bind to the double-stranded regions of each polony and position themselves at the boundary with the single-stranded template region. They then attempt to recruit nucleotides complementary to the next position on the template strand. The only nucleotides available are those attached to the avidites, which are thus recruited. \nSince each copy of the template sequence in each polony is synchronized, each polymerase bound to each polony attempts to recruit the same nucleotide type, and thus interacts with the same type of avidite. Each avidite molecule is thus recruited to multiple points on the polony, producing a stable overall interaction.\n\n\n\nMultiple copies of the same avidite molecule are thus recruited to each polony, producing a strong and uniform fluorescent signal.\n\n\n\nThe flow cell is then imaged to identify the avidite bound to each polony, and thus the next nucleotide in each read. 
After this, the ABPs and avidites are detached and washed away, and the cycle proceeds to the next elongation phase (see above)." } ] \ No newline at end of file diff --git a/img/2023-10-11_element-avidite-arm.png b/img/2023-10-11_element-avidite-arm.png new file mode 100644 index 0000000..8b7314a Binary files /dev/null and b/img/2023-10-11_element-avidite-arm.png differ diff --git a/img/2023-10-11_element-avidite-binding.png b/img/2023-10-11_element-avidite-binding.png new file mode 100644 index 0000000..6c6e3d9 Binary files /dev/null and b/img/2023-10-11_element-avidite-binding.png differ diff --git a/img/2023-10-11_element-avidite.png b/img/2023-10-11_element-avidite.png new file mode 100644 index 0000000..5c656f4 Binary files /dev/null and b/img/2023-10-11_element-avidite.png differ diff --git a/img/2023-10-11_element-base-calling.png b/img/2023-10-11_element-base-calling.png new file mode 100644 index 0000000..684bc02 Binary files /dev/null and b/img/2023-10-11_element-base-calling.png differ diff --git a/img/2023-10-11_element-elongation.png b/img/2023-10-11_element-elongation.png new file mode 100644 index 0000000..5f1de84 Binary files /dev/null and b/img/2023-10-11_element-elongation.png differ diff --git a/img/2023-10-11_element-imaging.png b/img/2023-10-11_element-imaging.png new file mode 100644 index 0000000..80c6947 Binary files /dev/null and b/img/2023-10-11_element-imaging.png differ diff --git a/img/2023-10-11_rolling-circle-amplification.png b/img/2023-10-11_rolling-circle-amplification.png new file mode 100644 index 0000000..e2e8d7a Binary files /dev/null and b/img/2023-10-11_rolling-circle-amplification.png differ diff --git a/notebooks/2023-10-12_how-does-element-sequencing-work.qmd b/notebooks/2023-10-12_how-does-element-sequencing-work.qmd new file mode 100644 index 0000000..f3e032b --- /dev/null +++ b/notebooks/2023-10-12_how-does-element-sequencing-work.qmd @@ -0,0 +1,116 @@ +--- +title: "How does Element AVITI sequencing work?" 
+subtitle: "Findings of a shallow investigation" +author: "Will Bradshaw" +date: 2023-10-11 +format: + html: + code-fold: true + code-tools: true + code-link: true + df-print: paged +editor: visual +title-block-banner: black +--- + +In September 2023, the NAO team sent several samples to the MIT BioMicro Center, for library preparation and sequencing using their new Element AVITI sequencer. This machine works on quite different principles from Illumina sequencing, but also produces high-volume, paired-end, high-accuracy short reads. Since it looks like we might be using this machine quite a lot in the future, it pays to understand what it\'s doing. However, I found most quick explanations of Element sequencing much harder to follow than equivalent explanations of Illumina\'s sequencing technology (e.g. [here](https://www.nature.com/articles/nrg.2016.49)). + +To try and understand this better, I dug deeper, using a combination of talks by Element staff on YouTube, their [core methods paper](https://doi.org/10.1038/s41587-023-01750-7), and aggressive interrogation of Claude 2. Given my difficulty understanding this, I figured others on the team might also benefit from a quick-ish write-up of my current best understanding, presented here. Note that this does not go into the performance of Element sequencing, only the underlying mechanisms. Note also that, given the lack of very detailed documentation about many aspects of the process, my understanding here is inevitably more high-level than it would be for e.g. Illumina sequencing. + +# 1. Library prep + +- The fundamental stages required in library prep for Element sequencing are mostly very similar to Illumina sequencing: fragmentation, addition of terminal adapter sequences, optional amplification, size selection, and cleanup. 
The main additional step required is circularization: to be compatible with Element\'s cluster generation method ([see below](https://docs.google.com/document/d/1BMPg3I3crFHNx7ZUsRrMQCiKdPZBZ9EcLjwJNsR9KeI/edit#heading=h.ekluzd6qkntg)) mature Element library molecules must be circular, with the 5\' and 3\' adapters joined end-to-end. + +- Given the similarity with Illumina library prep procedures, Element have sensibly designed their processes to be compatible with many standard Illumina library prep kits. There are two main ways to adapt Illumina library prep kits for Element sequencing: + + - In the first ([Elevate](https://www.elementbiosciences.com/products/elevate)) workflow, the standard kit protocol is followed, but with Element adapter oligos (including sample indices) replacing Illumina adaptors. The library molecules are then circularized by ligating the ends of the adapters together, cleaned to remove linear molecules, and are ready to be introduced to the flow cell. + + - In the second ([Adept](https://www.elementbiosciences.com/products/adept)) workflow, the standard kit protocol is followed completely (including use of Illumina adapters) followed by additional steps to convert the resulting library into an Element library: addition of terminal Element adapter oligos, circularization, and cleanup. + +# 2. Cluster generation + +- Following library prep, the libraries are denatured to produce single-stranded circular DNA molecules, then washed across a flat flow-cell studded with attached oligos complementary to the adapter sequences.  + +- Library molecules bind to these oligos, with unbound library molecules washed away.  + +- Polymerases and nucleotides are added, and elongate from each attached oligo via [rolling circle amplification](https://en.wikipedia.org/wiki/Rolling_circle_replication). + + - Briefly, the polymerase starts at the hybridized adapter/oligo double-stranded sequence and moves around the circular library molecule. 
When it reaches the end of the circle, it continues on to another revolution, displacing its own previously-synthesized daughter strand as it goes. + + - This continues over repeated passes, producing a long single-stranded molecule containing many concatenated copies of (the complement of) the original library molecule sequence: + + - ![](/img/2023-10-11_rolling-circle-amplification.png) + + - Imagine this picture, but with the blue primer attached to a flow-cell at one end. + + - The resulting long, attached molecule is referred to as a concatemer, or polony. A prepared AVITI high-output flow-cell contains roughly 1 billion polonies, each of which corresponds to one read pair. (An AVITI run comprises two flow cells run in parallel, for roughly [2 billion read pairs per run](https://www.elementbiosciences.com/products/aviti/specs).) + +# 3. Daughter strand elongation + +- Although Element sequencing is not sequencing-by-synthesis, it is, as it were, sequencing-with-synthesis. As in Illumina sequencing, the core of each sequencing cycle is the stepwise elongation of daughter strands complementary to the library molecule sequence in each cluster, followed by imaging to determine the next nucleotide in the sequence. The mechanism of base calling is completely different (and will be described in the [next section](https://docs.google.com/document/d/1BMPg3I3crFHNx7ZUsRrMQCiKdPZBZ9EcLjwJNsR9KeI/edit#heading=h.fb02g84ycd9e)) but the stepwise elongation process is closely related. + +- In the case of Element sequencing, elongation begins with annealing of a sequencing primer complementary to one of the Element adapter sequences. These primers will bind many times to a given polony molecule, at the beginning of each copy of the RCA-duplicated library sequence. + +- A mixture of DNA polymerase and reversible chain terminator nucleotides is then washed across the flow cell. 
The polymerases bind the double-stranded primer sequences and incorporate a complementary terminator nucleotide, extending the double-stranded sequence by one base pair (after which further elongation is blocked). + +- The polymerases and free nucleotides are displaced and washed away, after which the blocking group on the incorporated terminator nucleotides is removed (enabling further elongation). Base-calling occurs ([see below](https://docs.google.com/document/d/1BMPg3I3crFHNx7ZUsRrMQCiKdPZBZ9EcLjwJNsR9KeI/edit#heading=h.fb02g84ycd9e)), after which the cycle repeats with the addition of polymerase and terminator nucleotides. + +- ![](/img/2023-10-11_element-elongation.png) + +- To generate a reverse read, the same process takes place, but using primers complementary to the other adapter sequence. + +# 4. Labeling, imaging, and base calling + +- The optical generation of base calls is the most complex and distinctive aspect of Element sequencing, and the one that I had the hardest time understanding. What follows is my best attempt at an explanation, but I\'m not fully confident I haven\'t misunderstood something fundamental. + +## **4a. Background and justification** + +- When a polymerase binds a DNA strand, it first **positions** itself over the boundary between the double-stranded primer region and the single-stranded template region. It then **recruits** and positions a nucleotide complementary to the first base of the template region, using a combination of base pairing and direct interactions between the nucleotide and the polymerase enzyme itself. Finally, it **incorporates** the new nucleotide into the elongating daughter strand by connecting it to the end of that strand via a new phosphodiester bond. + + - Usually, the polymerase then repeats the cycle by recruiting and incorporating a nucleotide complementary to the next base of the template strand; however, if the incorporated nucleotide is a chain terminator, it is unable to do this, and stalls. 
+ +- In Illumina sequencing, the terminator nucleotide incorporated by the polymerase is fluorescently labeled, and is imaged following incorporation. The fluorophore is then cleaved off along with the terminator group, and the cycle repeats. As a result, the processes of daughter strand elongation and base calling are closely bound together. + +- In Element sequencing, the goal is to separate the processes of daughter strand elongation ([above](https://docs.google.com/document/d/1BMPg3I3crFHNx7ZUsRrMQCiKdPZBZ9EcLjwJNsR9KeI/edit#heading=h.ktchagtw60wa)) and base calling, so that the two can be optimized separately. To achieve this, the aim is to call the next unincorporated position in the template sequence, rather than (as in Illumina sequencing) the most recently incorporated position. + +- One theoretical way to do this would be to use an engineered polymerase that is able to recruit complementary nucleotides but not incorporate them. One could supply this polymerase with fluorescent nucleotides, and it would recruit the one complementary to the next position on the template strand. This would occur simultaneously at many different locations on each polony, corresponding to the different copies of the library sequence produced by RCA. One could then image the flow cell to identify the nucleotide type recruited at each polony. + +- The problem with the above approach is low signal persistence. Without incorporation, recruitment of nucleotides by the polymerase is weak and transient: the nucleotide binds its complementary base and the polymerase, remains for a short time, then dissociates. The result is that, for any given polony, too few nucleotides are recruited at any one time to give a sufficient signal for imaging. + +- In order for an approach like this to work, then, we need a way to improve signal persistence without relying on covalent incorporation of nucleotides. Enter avidity sequencing. + +## **4b. 
Base calling by avidity** + +- The avidity of a molecular interaction is the accumulated strength of that interaction across multiple separate noncovalent bonds. Even if any single one of these bonds is weak and transient, the overall interaction can be strong and stable if the two molecules interact at many different points. + +- In Element avidity sequencing, the avidite is a large molecular construct, comprising a fluorescently labeled protein core connected to some number of (identical) nucleotides via flexible linker regions. Each of these nucleotide groups can be independently recruited by a polymerase bound to a polony, and positioned based on base-pairing interactions. While each of these nucleotide:template:polymerase interactions is too weak and transient to sustain a strong signal, the avidite as a whole is bound to the polony via multiple such interactions, producing a strong and stable interaction overall. + + - ![](/img/2023-10-11_element-avidite.png) + + - Example avidite structure from the [avidity sequencing paper](https://doi.org/10.1038/s41587-023-01750-7). The core of the molecule consists of fluorescently labeled [streptavidin](https://en.wikipedia.org/wiki/Streptavidin), bound to linker regions via streptavidin:[biotin](https://en.wikipedia.org/wiki/Biotin) interactions. Three of the four linkers shown here end in nucleotides (specifically, adenosine); the fourth mediates core:core interactions to produce an even larger avidite complex. + + - ![](/img/2023-10-11_element-avidite-arm.png) + + - Example avidite arm structure, with biotin at one end (top-left) and adenosine at the other (bottom-right). + +- The base-calling phase of the avidity sequencing cycle thus proceeds as follows: + + - Prior to the base-calling phase, the polymerase and nucleotides involved in the elongation phase are detached and washed away. 
+ + - The flow cell is then washed with a mixture containing an engineered polymerase as well as four fluorescently-labeled avidites (one each for A, C, G and T). The engineered polymerase (henceforth the avidite-binding polymerase, or ABP) is distinct from that used for elongation, and is capable of binding a template strand and recruiting a complementary nucleotide, but not capable of incorporation. + + - The ABPs bind to the double-stranded regions of each polony and position themselves at the boundary with the single-stranded template region. They then attempt to recruit nucleotides complementary to the next position on the template strand. The only nucleotides available are those attached to the avidites, which are thus recruited.  + + - Since each copy of the template sequence in each polony is synchronized, each polymerase bound to each polony attempts to recruit the same nucleotide type, and thus interacts with the same type of avidite. Each avidite molecule is thus recruited to multiple points on the polony, producing a stable overall interaction. + + - ![](/img/2023-10-11_element-avidite-binding.png) + + - Multiple copies of the same avidite molecule are thus recruited to each polony, producing a strong and uniform fluorescent signal. + + - ![](/img/2023-10-11_element-base-calling.png) + + - The flow cell is then imaged to identify the avidite bound to each polony, and thus the next nucleotide in each read. After this, the ABPs and avidites are detached and washed away, and the cycle proceeds to the next elongation phase ([see above](https://docs.google.com/document/d/1BMPg3I3crFHNx7ZUsRrMQCiKdPZBZ9EcLjwJNsR9KeI/edit#heading=h.ktchagtw60wa)). + +- ![](/img/2023-10-11_element-imaging.png) + +\