SEP | |
---|---|
Title | Sequence insertion and replacement |
Authors | Jacob Beal ([email protected]), Raik Gruenberg |
Editor | |
Type | Data Model |
SBOL Version | 2.3 |
Status | Complete |
Created | 17-Jan-2017 |
Last modified | 27-Jan-2017 |
Issue | #28 |
A new, optional, Component.sourceLocation
field is
proposed. Component.sourceLocation
allows only a portion, rather than the
entirety, of a ComponentDefinition's sequence to be included in the Component's
parent. This serves two related purposes: (1) To support the insertion of parts
within a larger plasmid or genome "backbone". Typically, this will replace a
small segment of the backbone which, currently, cannot be expressed in SBOL. (2)
To support the "trimming" of part boundaries as it may happen during DNA
assembly.
- 1. Rationale
- 2. Specification
- 3. Example or Use Case
- 4. Backwards Compatibility
- 5. Discussion
- 6. Competing SEPs
- References
- Copyright
The SBOL sequence data model so far assumes that parts are joined together such that A + B + C gives exactly ABC. However, there are several very common scenarios where the final sequence of a genetic design is not exactly the sum of its parts:
(1) A designed sequence (e.g. synthetic gene or operon with many genes) is inserted into a common plasmid backbone or into a genome. Typically, a small but variable segment of the target vector is replaced in the process. This insert + backbone scheme probably accounts for 90% of all genetic designs published until today. Depending on assembly method, the insertion site and the extent of replacement often is variable and cannot be predicted by the designer or publisher of the backbone part.
(2) A larger sequence is assembled from re-usable fragments (e.g. promoter, RBS, CDS, terminator) but the original fragments are "trimmed" in the process. This may be a design decision (e.g. "Let's shorten this linker sequence") or a mere technical consequence of DNA assembly. For example, neighboring parts may have been registered with a complementary homology region, one copy of which disappears during assembly.
SBOL can already cleanly represent complete insertions (without replacement),
such as ABC + X --> ABXC, by using the ability to put multiple Locations
on a
SequenceAnnotation. Multiple Location
records are interpreted as the part
being split up in the parent (product) sequence. This is equivalent to the
"JOIN" statement in genbank.
At least in theory, complete insertions of a part can be handled with a mapsTo
statement such that ABC -> AXC. However, this requires that the exact
replacement site is aready annotated as a part when the backbone part is first
published. This is, in practice, rarely the case, with the notable exception of
certain highly structured Golden Gate cloning approaches such as MoClo.
Deletions such as ABC -> AC are not handled by current SBOL. Likewise, part boundaries cannot be changed.
The basic idea of this proposal is that a Component should be able to include only a portion of its source sequence rather than necessarily the entirety of it. To do this, we add a new optional field, Component.sourceLocations which can be used to specify which portions of the source's sequence are being used.
On the vector or "backbone" side, this allows us to "lose" some basepairs around the insertion site. On the "insert" side, this allows us to trim the source part before inserting or adding it to the final product.
On Component, add an optional field "sourceLocations" with cardinality 0..*, which contains a set of Locations.
If Component.sourceLocations is not set, then the whole sequence is assumed to be included.
If Component.sourceLocations is set, then the relationship between the original sequence and the sequence included is defined precisely complementary to SequenceAnnotation.locations.
[SBOL 3.x]: If a Component
has one or more location
s, the sequence range or
sequence ranges defined by Component.sourceLocation MUST add up to exactly the
same length as the range(s) covered by the Component.location record(s).
[SBOL 2.x]: For each Component.sourceLocation, if there are one or more locations associated in the containing ComponentDefinition (via SequenceAnnotation.location) then those locations must define a sequence range of exactly the same length.
Rule sbol-10520 will need to be relaxed to account for partial sequence inclusion. Additional new rules will need to be added, symmetric with those for SequenceAnnotation.locations
Let's assume we want to describe a plasmid "myPlasmid" which has been created by inserting a single part "cargo" into a common plasmid backbone called "backbone". The insert is 50 bp long and will replace 10 base pairs of the vector backbone.
CD backbone
sequence = Sequence of length 200
CD cargo
sequence = Sequence of length 50
CD myPlasmid
sequence = final Sequence of length 240
Component insert
definition -> cargo
location = Location 101..150
Component backbone
definition -> backbone
location = Location 1..100, 151..240
sourceLocation = Location 1..100, 111..200
Note that Component "backbone" can be immediately validated by checking that each of the sourceLocations has a (target) location of exactly matching length. In this particular example, location records cover the full sequence without gaps. The final (product) sequence of "myPlasmid" could therefore be constructed from the two components and there is, stricly speaking, no need to list it again.
Below is a UML object diagram that is another illustration the above example. Note
The BioBrick cloning standard defines a DNA prefix and suffix sequence to allow iterative cloning:
PREFIX[part A]SUFFIX + PREFIX[part B]SUFFIX --> PREFIX[part A]scar[part B]SUFFIX
However, the original standard (RFC 10) did not support protein fusions -- the scar sequence was 4 bp long and contained a STOP codon. An alternative Biobrick standard (RFC 25) therefore extended the original prefix and suffix with an additional set of "inner" restriction sites that could be used for protein fusions. RFC 25 is therefore compatible with both RFC 10 assembly and with protein fusions. A schematic overview:
PREFIXo0o~[part A]~xxxSUFFIX + ... --> PREFIXo0o~[part A]~~[part B]~xxxSUFFIX
The iGEM parts registry, for a long time, only supported RFC 10 parts. RFC 25 parts are therefore often registered like this:
PREFIX[o0o~part A~xxx]SUFFIX
That means, the inner part of the RFC 25 prefix (TGGCCGGC
) and suffix
(ACCGGTTAAT
) are included in the part definition. Using sourceLocation
, the
product of assembling part A and B can nevertheless be described in SBOL:
CD partA
sequence = TGGCCGGC part A ACCGGTTAAT
CD partB
sequence = TGGCCGGC part B ACCGGTTAAT
CD part AB
sequence = TGGCCGGC part A ACCGGC part B ACCGGTTAAT
Component A
sourceLocation 9..110
location 9..110
Component B
sourceLocation 9..110
location 117..128
CD ... ComponentDefinition
Note: The example assumes part A and B are both exactly 100 bp long. The
assembly scar ACCGGC
is not assigned to any sub-part definition as it does
not carry any function. The description of part AB therefore MUST include a
complete sequence.
Similar issues with the definition of part boundaries may also arise in modern
Golden Gate based assembly standards such as MoClo and GoldenBraid. MoClo and
GoldenBraid define similar "prefix" and "suffix" sequences with IIS restriction
sites, custom 4 bp overhangs as well as "filling" nucleotides. Depending on
software and user preference, prefixes, suffixes and assembly overhangs may
either be considered to belong to the part or not. Consequently,
sourceLocation
gives the flexibility to cleanly describe DNA assembly products
from such parts in either scenario.
This SEP is backward compatible, in the sense that all prior SBOL 2.x files remain valid and retain their exact same meaning.
An old tool reading a file that uses Component.sourceLocation, however, will not know how to correctly interpret the relationship between the two sequences affected by Component.sourceLocation, however, and should report violation of rule sbol-10520.
This seems unlikely to be a major issue for two reasons:
- sbol-10520 is a "best practices" rule, not a requirement
- Any well-designed tool should react by refusing to make inferences about a sequence relationship that it does not understand, thus at least not damaging files.
There are at least two scenarios in which an SBOL 2.1 compatible tool may create a wrong sequence when presented with a Component.sourceLocation field:
- The tool interprets the unknown
sourceLocation
field as a custom annotation (ignored but passed on) - There is no
location
record on the Component's SequenceAnnotation, e.g. because the sequence is defined using relative Sequence constraints.
Another failure scenario is that there is a location
record but the tool
choses to ignore the fact that the length defined by location
and the length
of the sub-part are not matching.
Although we need sourceLocation to be available at Component, it is worth considering whether it should be placed on its parent, ComponentInstance, in order to make sourceLocation available to FunctionalComponent as well.
A ModuleDefinition, however, has no Sequence for a "partial import" to be targeting. Since it is a different type of representation, every ComponentDefinition that is imported must be imported entirely. The "partial import" analog would not be on FunctionalComponent but some form of "partial import" field on Module. We are not currently aware of any compelling case for such, however (ModuleDefinitions do not have the equivalent of the massive sequence information associated ComponentDefinitions), we do not propose any such modification at this time, but note that it would be reasonable to consider in the future if sufficient motivation appears.
A competing proposal has been put forth (though not yet formalized into an SEP), for representing sequence modifications through MapsTo. The basic idea of this proposal is to make use of the "useLocal" option on MapsTo to specify a modification of the sequence. A basic example would look like this:
CD InsertionSite1
...
CD InsertionSite2
...
CD InsertionSite3
...
CD backbone
Component InsertionSite1
Location 11..20
Component InsertionSite2
Location 12..19
Component InsertionSite3
Location 10..21
Sequence of length 200 bases
CD cargo
Sequence of length 20 bases
CD MyInsertion
Component MyCargo
Location 11..30
Definition cargo
Component MyBackBone
Location 1..9, 31..210
Definition backbone
MapsTo
Local MyCargo
Remote InsertionSite1
For deletion, we can either use a sequence of length 0 or we can allow a MapsTo to have no local, implying deletion (change to the current data model). This approach is advantageous in that it can be technically implemented without any modification of the data model (using empty sequences). Conceptually, however, the MapsTo approach represents a major kludge, in that it requires the introduction of a potentially large number of ComponentDefinitions without any particular engineering meaning, that only exist to represent statements about potential insertion sites.
Our community has already strongly rejected this idea in SEP004, which allows us to represent small statements about sequence features, like targets and cut sites, without having to separate those fragments (which do not have an independent existence) as ComponentDefinition. We thus have a clear precedent for "ComponentDefinition" representing a unit of engineering modularity, rather than a unit of sequence information. Moreover, under the MapsTo approach, describing a new ComponentDefinition forces one to modify an existing ComponentDefinition, which seems to be a clear violation of modularity. There are also technical issues, such as interaction with sbol-10520 and sbol-10526, but the major challenge is the conceptual change implied by the use of MapsTo in this fashion.
It may or may not be possible to express the extraction of a part using Prov-O provenance terms. Example 3.1 could potentially be re-formulated like this:
CD backbone ## original backbone record; defined by somebody else
sequence = Sequence of length 200
CD myBackbone ## re-definition to remove insertion site
sequence = Sequence of length 190
prov:wasDerivedFrom -> backbone
prov:wasGeneratedBy -> ??SequenceExtract??
??range?? 1..100
??range?? 111..200
CD cargo
sequence = Sequence of length 50
CD myPlasmid
sequence = final Sequence of length 240
Component insert
definition -> cargo
location = Location 101..150
Component backbone
definition -> myBackbone
location = Location 1..100, 151..240
Equivalent "activities" for "??SequenceExtract?? and ??range?? may or may not
exist in the prov-O ontology. SBOL could define a custom activity
"SequenceExtract" and use Location
records within this activity. This would be
equivalent to the sourceLocation
proposal but would be much more verbose and
indirect. An additional disadvantage may be that the truncated part may, in
fact, be defined in another document which makes it more complicated to validate
that extracted sequence and the target sequence defined by location
have exactly
the same length.
In addition, it may be possible to encode Example 3.1 using both the sourceLocation property and Prov-O in a complementary fashion. In the UML object diagram below, the insertion of a cargo ComponentDefinition (pink) into a backbone ComponentDefinition (also pink) is specified by making them sub-Components (orange) of a plasmid ComponentDefinition (red) with associated Component sourceLocations (orange) and Sequence Annotation locations (yellow) as described earlier. The insertion is also specified in this diagram by an in silico assembly Activity with Usages (orange) of the cargo and backbone, plus a Plan (also orange) for their assembly. In this way, a tool that does not fully support DNA assembly could still interpret the composition of a new design in terms of non-modular pieces of previous designs by referring to the location and sourceLocation properties, while a tool that does support DNA assembly could expose a richer view of the biochemical mechanisms by which this composition could occur by referring to the Usages and assembly Plan.
None at present.
None.
To the extent possible under law,
SBOL developers
has waived all copyright and related or neighboring rights to
SEP 013.
This work is published from:
United States.