SEP | 008 |
---|---|
Title | Specify DNA / RNA topology |
Authors | Raik Gruenberg (raik.gruenberg at gmail com) |
Editor | Raik Gruenberg |
Type | Data Model |
SBOL Version | 2.2 |
Status | Deferred |
Created | 03-Sep-2016 |
Last modified | 22-Oct-2016 |
Issue | #23 |
SBOL currently is missing any means to distinguish between circular (e.g. plasmid) and linear DNA constructs. Also double- and single-stranded DNA or RNA cannot be distinguished.
Note from the editors: This proposal was deferred in favor of a short-term solution described in SEP #11.
Engineered DNA constructs are currently mostly deposited, used and exchanged as circular, double-stranded, plasmid DNA. However, there are also many applications for linear constructs which are easily and cheaply accessible through PCR or gene synthesis. Moreover, even circular plasmid DNA may be cut and stored or exchanged as linear DNA to aid the construction of new vectors. Likewise, natural or engineered genomes may be circular (bacterial chromosomes) or linear (eukaryotic chromosomes). Both linear and circular DNA or RNA molecules can, moreover, be either double-stranded (ds) or single stranded. RNA, typically, is used as single-stranded molecule whereas DNA is, mostly, utilized double-stranded (dsDNA). However, oligonucleotides (primers) are an example of single-stranded DNA (ssDNA) which, however, can also be complemented into double-stranded fragments.
Knowing whether a given construct is supposed to have circular or linear topology is of crucial relevance both for experimental design and for basic sequence analysis. Nevertheless, unlike the much older genbank format, SBOL so far lacks the possibility to distinguish between linear and circular DNA. Likewise, the distinction between single-stranded and double-stranded has important consequences but is not currently conveyed by SBOL.
(2.1) ComponentDefinition should be amended by a new field topology
pointing to either of two possible Sequence Ontology terms:
- SO:0000987 -- a.k.a. 'linear' [http://www.sequenceontology.org/browser/current_svn/term/SO:0000987]
- SO:0000988 -- a.k.a. 'circular' [http://www.sequenceontology.org/browser/current_svn/term/SO:0000988]
(2.2) A new validation rule needs to be implemented that makes this an required field for ComponentDefinitions of type 'DNA' or 'RNA'. By contrast, the field is not allowed for descriptions of proteins, small molecules or other types of entities.
(2.3) ComponentDefinition should be amended by a new field strand
pointing to either of two possible Sequence Ontology terms:
- SO:0000984 -- a.k.a. 'single' [http://www.sequenceontology.org/browser/current_svn/term/SO:0000984]
- SO:0000985 -- a.k.a. 'double' [http://www.sequenceontology.org/browser/current_svn/term/SO:0000985]
(2.4) A new validation rule needs to be implemented that makes this an required field for ComponentDefinitions of type 'DNA' or 'RNA'. By contrast, the field is not allowed for descriptions of proteins, small molecules or other types of entities.
(2.5) Consequences for software tools
topology
= circular
(SO:0000988) instructs software to interpret the beginning
/ end position of a given sequence (be it DNA or RNA) as arbitrary so that sequence
features may be mapped or identified across this junction.
strand
= double
(SO:0000985) instructs software to apply sequence searches to both strands (i.e. sequence and reverse complement of sequence).
(2.6) Missing fields
There are no default values for topology
or strand
. Absence of either of these fields should be treated as genuine lack of information.
This is such a basic feature of molecular biology that it is difficult to find an example where it is, in fact, not needed. By way of example, consider popular sequence design software such as Benchling, CLC, VectorNTI or others. In these cases, declaring a sequence as circular changes the view and the searching behaviour of the sequence editor.
Example for double-stranded / single-stranded:
While the pattern 'ATG' is not found within the sequence TTCCTAC
, the reverse complement of this sequence does contain ATG
. Whether or not a given sequence is considered double-stranded therefore decides whether or not it matches a given pattern.
Example for circular / linear:
While no 'ATG' start codon is found within the linear sequence GAATCATCATAT
, this sequence does contain a start codon if considered circular.
No backwards compatibility issues are foreseen. However, the fields will be missing from older records. Software then should fall back to making an educated guess or assuming linear topology.
(This is the solution ultimately adopted for SBOL 2.1)
Several devs suggest to not introduce any new fields but, instead, to attach topology and strand features using the existing DnaComponent.type field. type
is already supposed to point to SO terms and more than one type field can be added per ComponentDefinition.
Advantage of re-using type:
- no change of data model, only validation rules need update
- may make life slightly easier for reasoning software (less fields to consider)
- fast implementation in terms of spec and no need for a vote
Disadvantage:
- higher risk that topology and strand information is not actually given
- complex validation (if one type field points to DNA, ensure there are two more type fields, one of which ...)
- software would have to interpret a collection of type fields even if it is only interested in, for example, telling apart DNA and protein parts
- undefined result if type fields point to neither of the four given SO terms; non-reasoning software could much easier recover if it, at least, is still told whether the unknown term refers to
topology
,strand
ortype
Conversely, issues with the proposal of new fields topology, strand (e.g. pointed out by Chris):
Having two fields in CD that only make sense for some types complicates the data model unnecessarily and actually complicates software development as well, since this means not just the library needs new validation rules but the software also using the library needs to be aware of new fields. This is particularly problematic for software that is not using SBOL internally as its data model, and it will increase the chance of data loss. [Chris]
Making these fields required renders files of previous versions invalid. This would therefore become an SBOL 3.0 change, not a SBOL 2.1 change.
I originally proposed to make fields required at the data model level. However, since topology and strand information makes little or even no sense for ComponentDefinitions describing proteins or small molecules, this was dropped (following debate on sbol-dev). Two new validation rules are proposed instead.
The original SEP proposed boolean values (circular=True/False). This was critized on the grounds that we, so far, are not using boolean values anywhere in SBOL and that we should tie things in with SO instead.
SO offers terms for all 8 combinations of circular/linear and double/single stranded and DNA/RNA. An example is
the term circular_double_stranded_DNA_chromosome
. Using these terms was considered less elegant though. (1) software likely would have to decompose these terms again, (2) the 'chromosome' semantics does not fit many SBOL applications.
It would be possible to formulate topology through sequence constraints. However, sequence constraints have a different use case -- they describe incomplete designs that may, for example, be communicated to services for automated design or library construction. Standard sequence editors will likely not support any sequence constraints framework in the near future. By contrast, the current standard use case for SBOL is (or should be) the communication of a complete sequence, e.g. for a plasmid deposited at AddGene or a fragment described on parts.igem.org.
Also protein or even small / medium-sized molecules can feature circular topologies and topology
was originally envisioned to apply to all types of ComponentDefinition. However, in case of proteins, circularity is exceedingly rare and 'double-stranded' does not apply at all. Furthermore, the SO terms seem to be more or less explicitely defined for nucleic acid polymers. Explicitely excluding these fields from all non-RNA or DNA ComponenDefinitions therefore seems reasonable.
This SEP was formerly replaced by SEP 011 (SEP #11).
To the extent possible under law,
SBOL developers
have waived all copyright and related or neighboring rights to
SEP 008.
This work is published from:
United States.