SEP | |
---|---|
Title | SEP 020 -- Linking experimental results with Biological clones and replicates |
Authors | Raik Gruenberg ([email protected]) |
Editor | Nicholas Roehner ([email protected]) |
Type | Data Model |
SBOL Version | 2.3 |
Replaces | SEP 014, SEP 016, SEP 017, SEP 019 |
Status | Draft |
Created | 09-Dec-2017 |
Last modified | 02-Nov-2017 |
Issue | #47 |
Experimental information recorded for one engineered clone of cells or a batch of biomaterial is often not representative for the behavior of another clone or batch. The reason are additional variables (known or unknown) such as mutations or stochastic biochemical events that lie outside the scope of the actual design. Bioengineers are therefore used to tracking the identity of clones, batches or replicates of biological materials whenever experimental results are collected. The Implementation class (previously defined in SEP 19) is here modified in order to allow a reliable and straightforward identification of cell lines, bacterial clones and other replicates of biomaterial. In contrast to SEP 19, the present proposal introduces a mandatory and direct link ('design') from Implementation
to its intended design. SEP 20 therefore guarantees that we always know What a clone or batch is. Information about How something was constructed may be added using prov-O (provenance ontology) records just as described in SEP 19.
- 1. Rationale
- 2. Specification
- 3. Examples
- 4. Backwards Compatibility
- 5. Discussion
- 6. Relation to Other SEPs
- References
- Copyright
If I am given a sample of something, the most important question is: "What is it?". SEP 19 gives no less than three different ways of answering this question: (1) prov:wasDerivedFrom
, (2) built
, (3) prov:generatedBy
+ Activity
. This would be confusing enough but, in fact, none of these redundant links is mandatory and it is not clear which one is preferred. Therefore, while SEP 19 makes a very useful first attempt to record workflow information ("How was this made?" or "How was this measured?") the, arguably, more basic information about "What is this?" may be ambiguous, in conflict with other statements or completely missing. This is the main issue that is being addressed by the present proposal.
The main focus of SEP 20 lies therefore on the Implementation
class. However, the SBOL editors have demanded a clear choice between SEP 19 and SEP 20 and I am therefore also including workflow-related information in this proposal. These are taken
directly from SEP 19 and should be interpreted as preliminary guidelines, to be tested and further refined by practical use.
The main goals of this proposal are therefore:
- Identify cell lines (clones, replicates, batches) as needed for the exchange of biomaterials (e.g. from AddGene, ACTC, partsregistry) and as also currently required by Journals for experimental papers
- Allow to attach experimental results to clones/batches
- Guarantee that each record of a clone or batch always points to exactly one authorative design
- Workflow & provenance information make up an additional layer of meta information rather than being required
- Ensure that the link from Experimental data through clone/batch to (intended) design is as direct and unambiguous as possible and does not require iterations over chains of events / activities (with the associated risk of missing links)
An Implementation
represents a real, physical instance of a synthetic biological construct which may be associated with, typically more than one, actual laboratory samples. Note that the representation of individual samples (as in LIMS systems) remains outside the scope of SBOL. Whether or not two samples are considered to belong to the same batch or at which point they are split into separate Implementation
instances is decided by the user and will be a matter of community best practices in a given area of bioengineering.
Implementation
MUST be linked back to its intended design, either a ModuleDefinition
or ComponentDefinition
, using a mandatory field design
(this is the main difference to SEP 19). An Implementation
MAY also link, via an optional field built
, to an additional ModuleDefinition
or ComponentDefinition
which represents a more detailed picture of its actual structure as it was realized in the laboratory.
Figure 1: Diagram of the Implementation
class and its associated properties
Each Implementation
record MUST have exactly one design
property which MUST point to a ComponentDefinition
or a ModuleDefinition
. This Component/ModuleDefinition
is interpreted as the intended design. Sister clones of a bioengineering experiment will therefore always point to the same design
ComponentDefinition
regardless of whether or not they accurately implement this design.
If, after creation of a clone, additional design information becomes available that equally applies to all clones of an experiment, then the design ComponentDefinition
instance may be updated in-place. Alternatively the design
field of all affected Implementations
may be changed to point to a new version of the Component/ModuleDefinition
. In both cases, previous versions and reasons for the change can, optionally, be tracked using established prov-O versioning (e.g. ComponentDefinition.wasDerivedFrom
) as already prescribed by SBOL.
An optional Implementation.built
property references one ComponentDefinition
that describes the, possibly deviating, actual physical composition of a realized construct. built
MUST NOT reference the same ComponentDefinition
that design
points to. Instead, if no extra or deviating structural information exists for a given clone, the built
field is simply not to be used at all. [This is a difference to SEP 19]
PREFERABLY, the built
structure should be specific for this particular Implementation
. For example, it may be based on the sequencing of this particular clone. Implementation.built
SHOULD NOT be used to introduce general corrections to the original design (see 2.1.1 above how this should instead be handled).
Once created, the uri
identifier of an Implementation
instance MUST NOT change. The Implementation
record is mutable and may change over time as new information is added or corrected. However, since Implementation
represents a physical object in the real world, changes to the information about this physical object MUST NOT lead to a change of its unique identifier.
The following Best Practices are currently considered guide lines which may change in future SBOL releases.
Implementation
MAY link to experimental data using the attachment
field approved in SEP 18. It is RECOMMENDED that sbol:Attachment
instances containing experimental data are further annotated by prov-O activity records as shown in the following Figure.
Figure 2: Association of experimental data with Implementation
(Clones/Batches) and intended Design.
Such Attachments
MAY instead also be grouped into Collections
. Details will have to be determined by practical use.
Detailed information about how a given construct or organism was built in the lab can be encoded in a prov-O:Activity
record. [Arguably, this should be described in a separate SEP. The detailed description of experiments and protocols is beyond the scope of this SEP]. All sister clones of a construction experiment SHOULD point to the same Activity
record. Clones that implement the same design but were produced independently (e.g. in a different lab) SHOULD NOT point to the same Activity
record, even if the assembly protocol happened to be identical.
Figure 3: Representation of construct or organism assembly.
The Assembly Activity
MAY also have a Usage:design
record pointing to the design ComponentDefinition
. This link is redundant with the Implementation.design
field and therefore optional [A difference to SEP 19]. If present, it MUST point to the same Component/ModuleDefinition
instance as the Implementation.design
field.
prov:wasDerivedFrom
references are used to express that one batch or clone was directly created from another one. It is RECOMMENDED that this should be accompanied by a prov:Activity
record that outlines how one batch was derived from the other. However, we do not yet have a standard for formulating these Activity records. Examples are: transformation of DNA into bacteria to generate a new cell stock or purification of DNA or proteins from a given cell stock. Also the simple splitting of cell lines or inoculation of a new culture from a given bacterial stock may or may not warrant the definition of a new Implementation
record.
Figure 4: Documentation of simple cell line or clone ancestry.
It is RECOMMENDED that Implementation.wasDerivedFrom
is exclusively used for experimental strain/batch ancestry relations. This is in contrast to SEP 19, where the same term is also used to link Implementation
to its design ComponentDefinition
.
Note: This solution may create conflicts or ambiguities with SBOL versioning. Other alternatives are being discussed.
The use of prov-O vocabulary together with Implementation
is encouraged but optional. Parsing of prov-O information MUST NOT be needed for (1) the simple identification of a clone, (2) neither for determining what structure (a.k.a. sequence) it is supposed to have. Further details on the use of prov-O for workflow documentation are suggested in the Example section below.
In line with SEP 19 (see detailed discussion in 2.3.2), we RECOMMEND as a best practice that objects linked by Activities not be successive versions (of the computational record) of each other, though we leave this at the discretion of users and library developers.
Not much is needed beyond the strict data model definition:
-
Implementation.built
SHOULD NOT point to the sameComponentDefinition
asImplementation.design
(Warning) -
If a "Built"
Activity
referenced byImplementation.wasGeneratedBy
contains aUsage:design
record, then thisUsage:design
MUST point to the sameComponentDefinition
orModuleDefinition
asImplementation.design
. (Error)
SEP 19 prescribes a host of additional validation rules related to prov-O usage. Some of these may apply here as well.
The first example covers the, presumably, most common use case of transmitting information about one particular plasmid which may be shipped between labs or obtained from the iGEM registry or AddGene:
Example 1a: This example covers the, presumably, most common use case of transmitting information about one particular plasmid which may be shipped between labs or obtained from the iGEM registry or AddGene. Two scenarios are shown. Top: the by far, most common use case where the plasmid corresponds exactly to the given design. Bottom: a scenario as it may, for example, happen in intermediate stages of assembly processes, where a plasmid is tracked and found to not exactly correspond to its design.
It is instructive to compare the above example to exactly the same information expressed according to SEP 19. Most UML diagrams in SEP 19 are highly simplified. So here is the same example with everything spelled out according to SEP 19 and the current prov-O documentation:
SEP 19 Example: The exact same data as in Example 1a shown in the notation suggested by SEP 19. Note that the same information expressed by a single design
field in SEP 20, requires, in SEP 19, 6 prov-O fields, 2 prov-O class instances, and one novel sbol:"design" term.
Returning to SEP 20, the above (correct) plasmid may be transformed into an E. coli expression strain for protein production. A cell stock of the single colony picked for the expression culture will be kept for future reference. Let's again assume something went wrong and we later discover that there was a genomic mutation in this particular clone. In other words, the cell stock faithfully implements the core of the design (the expression plasmid) but its genome deviates from the original design (and from its sister clones). It may still be used in subsequent experiments and the deviation is documented using the built
link:
Example 1b: A clone of a bacterial strain harboring the same expression plasmid but with a novel genomic mutation.
Note: SBOL currently has no clear rule for the identification of organisms. Presumably, a plasmid within a cell will be represented by a ComponentDefinition
for the plasmid, contained within a ModuleDefinition
somehow representing the cell. The given scheme is thus merely a suggestion.
Competing SEP 19 emphasizes the documentation of an idealized design - build - test workflow. I am skeptical as to how realistic this is at this point. For example, automated model building or automated sequence design from models is still the exception. In my experience, models are (if at all) built after a design has been made. I therefore find it problematic to connect Experimental Results -> Implementation, Implementation -> Design and Design -> Model with prov:wasDerivedFrom
links as suggested in SEP 19. In 95% of use cases, this will be confusing or outright false. Moreover, SBOL already specifies proper fields for connecting Design and Model as well as Attachment.
The following is therefore, according to this proposal, the minimal representation of the actual result of a design - build - test workflow:
Example 2a: A minimal workflow representation without use of prov-O meta data. This example provides the essential link from experimental result to clone, design and theoretical model.
In-line with SEP 19, rich meta data may be added to annotate the factual workflow result with "activity" records:
Example 2b: A tentative full workflow representation. Prov-O activities provide an orthogonal layer of meta information on top of the native SBOL chain of objects. Note: the prov:usage terms "design", "build", "test", "learn" have been suggested in sboldev discussions and are defined in SEP 19. "biomaterial" would be a new term which, in my view, better represents the relation between experimental protocols and clones/batches. The OBI ontology for biomedical investigations contains many terms that may be of more specific use here.
Note: This example has been adapted from Example 2 in SEP 19. The main difference is that wasDerivedFrom
links are replaced by design
, model
and attachment
. Other changes to the prov-O layout are suggestions only.
Please refer to section 2.2.2.
The following example is taken 1:1 from SEP 19. There is only two modifications: (1) designs are linked via design
rather than built
., (2) there is no "wasDerivedFrom" link between Implementation and design.
Example 4: An initial plasmid kit based on part designs from the iGEM Registry is replicated and split into different samples. Note, contrary to the use (in blue) in this example, Implementation
is not actually supposed to be equaled to "Sample". Implementation
defines clones or batches that are typically distributed over many samples and only occasionally will each sample be unique enough to warrant its own Implementation
. However, as SBOL currently does not have a term for "Sample", the given scheme is a workaround.
The following example is taken 1:1 from SEP 19 (Example 6). As before, the data model is somewhat simplified by the fact that there is only one link between Implementation
and ComponentDefinition
. In other words, design
replaces the redundant use of prov:wasDerivedFrom
and built
.
Example 5: Two sets of overlapping constructs that implement the components of a CRISPR-based circuit are co-transfected to implement two different versions of the circuit.
Note: Arguably, the wasDerivedFrom
links (shown in gray color) from the CRISPR construct "mixture" or circuits on the right to the individual genes on the left can be removed. The same information is already provided by the ModuleDefinition. Alternatively, the sub-implementations should be linked together through a more expressive gene assembly "prov:Activity" activity (as outlined in section 2.2.2).
This proposal does not affect backwards compatibility.
SEP 19 relies heavily on prov-O provenance ontology terms which are freely mixed with SBOL terms. SEP 20 instead tries to establish a clear separation of concerns: SBOL terms are used to describe the core information of What we are looking at. Prov-O terms are used to add an optional layer of information about How something was created. In practice, this means the proper SBOL terms model
, design
, attachment
are used to link Experimental Results, Implementation
, Component/ModuleDefinition
, and Model
(the "What").
SEP 19 gives no less than three different ways of answering the "What are we looking at" question: (1) wasDerivedFrom
, (2) built
, (3) generatedBy
+ Activity
. None of these redundant links is mandatory and it is not clear which one is preferred.
[1] wasDerivedFrom
has the serious issue that is also being used for (i) computational version tracking, and (ii) tracking of clonal relationships between samples and, moreover, (iii) it could to be used for pointing to structural building blocks (Implementations) used during construction.
[2] Built
seems to be the most obvious and straightforward choice but it is NOT fitting either because it is explicitely reserved for "actual" structure or "actual function". Upon creation of a batch or clone in the lab, the actual physical structure is not yet known (functional behavior may never be known, is a can of worms to even specify, and is irrelevant for the question of what something is). As we MUST NOT create an Implementation record that is not pointing to its content -- after all I am holding a physical sample in my hands and have to say what the heck it is -- this only leaves us with "design".
[3] design
according to SEP 19 is formulated using a prov-O record comprising around 8 different classes and fields. The link may furthermore be indirect (going through "parent" Implementations) which means it needs to be inferred by iterating over a potential chain of Implementation instances. This chain may easily become interupted as there is no guarantee or mechanism to ensure that all parent Implementations are in the same SBOL document or even accessible elsewhere.
SEP 20 solves these issues with a direct and mandatory field design
. All use of provenance ontology terms and constructs is made optional.
As a side effect, this leads to a massive reduction of prov-O overhead and reference redundancy. In SEP 19, the expression of the design link via a prov-O Activity
comprises a massive overhead of, in many cases redundant, fields and classes (8 or 9 interlinked entities, likely translating to 20 to 40 lines of XML code) only so that additional meta information could theoretically be added on top of that. SEP 20 replaces this by a single pointer. This will be sufficient for most use cases. If, and only if, additional meta information is available, prov-O Activities
can be layered on top of that.
The high number of redundant references in the SEP 19 prov-O model (see the Figure in Example 1 above) also requires a very high number of new validation rules that verify that two different but equivalent prov-O pointers indeed reference the same Implementation
or ComponentDefinition
object. Realistically, we have to assume SBOL documents will not always meet all validation rules. SEP 19 therefore creates an abundance of possibilities for serious bugs where software tools receive conflicting references.
The use of Implementation.wasDerivedFrom
for strain ancestry (section 2.2.3) creates a potential conflict with versioning: wasDerivedFrom
is currently also used to link different (computational) versions of SBOL records. (This problem is even more acute in SEP 19 where wasDerivedFrom
is heavily used for non-versioning related links.) This issue requires further discussion. One possible alternative to wasDerivedFrom
would be the use of generatedBy
pointing to an Activity
. The Activity
could then be annotated with a OBI term describing the exact experiment and relation used to derive one clone from another.
Further alternatives:
-
Define a new "sbol:ancestor" (or similar) term to be associated with a
prov:Usage
record within the activity. -
Invent a custom SBOL term for physical ancestry of batches and strains.
As mostly philosophical difference between this proposal and SEP 19 is that SEP 19 considers the Implementation
to be "generated" or "derived" from the design. In my view, a given Implementation
either implements a design or it does not. I would argue that there should not be any wasDerivedFrom
link between Implementation
and design. One could easily argue that a given implementation has been primarily generated not from a design but from the physical building blocks that it was assembled from.
However, this argument does not rule out custom "usage" links from activity
records to ComponentDefinition
.
SEP 16 suggested the introduction of an Implementation.status
field pointing to one of a fixed set of possible values: "validated_correct", "validated_incorrect", "not_validated", "validated_ambiguous". This field was explicitly only supposed to deal with sequence validation without touching the more complicated issue whether or not a given clone in fact "works" as expected.
The field may indeed be required if we insist that design
always points to the intended design and build
is only used if there are known deviations from this design. In this case it becomes difficult to distinguish between a clone that has been shown to be exactly as specified in the design and another clone for which this is simply not yet known. Also, clones that are shown to be incorrect (e.g. fail sequencing) could be more easily flagged without always attaching the underlying sequencing information.
On the other hand, this kind of information can only be a (albeit useful) shortcut and almost always would benefit from additional meta information (how was a batch validated, how was the result evaluated and by whom?). For this reason, and in order to minimize differences with SEP 19, it has not been included here.
This proposal is a merge of SEP 16 and SEP 19 after discussion on the Github issue tracker, and on the ["Design-Build-Test" thread]: https://groups.google.com/forum/#!topic/sbol-dev/AnpwJP2_f5A and during a conference call in 11/2017.
SEP 20 relies on SEP 18 (definition of "Attachment").
SEP 20 is meant to replace SEP 19.
To the extent possible under law,
SBOL developers
has waived all copyright and related or neighboring rights to
SEP 020.
This work is published from:
United States.