Skip to content

Latest commit

 

History

History
158 lines (106 loc) · 8.32 KB

sep_041.md

File metadata and controls

158 lines (106 loc) · 8.32 KB

SEP 041 -- Make all sequence associations explicit

SEP
Title Make all sequence associations explicit
Authors Jacob Beal ([email protected])
Editor
Type Data Model
SBOL Version 3.0
Replaces
Status Accepted
Created 22-Jan-2020
Last modified 22-Jan-2020
Issue #89

Abstract

This SEP proposes to require that the association of a Sequence to complete Component be explicit, rather than implicit, and to enable this via changes to Location, Component, and a new EntireComponent location class.

1. Rationale

Currently, when a ComponentDefinition includes a Sequence, that Sequence is implicitly assumed to apply to the entire ComponentDefinition. If ComponentDefinition and ModuleDefinition are merged into Component, however, (as proposed by SEP 025), then we will often have circumstances where there is more than one Sequence in a Component.

Example: a system on two plasmids:
* DNA Plasmid 1: constitutive CDS for TetR
* DNA Plasmid 2: pTet regulation of GFP
* Proteins: TetR, GFP
* Interactions: P1 CDS produces TetR; TetR represses pTet; GFP CDS produces GFP

We need to have two Sequences to describe this Component, one for each plamid. There are two alternatives for doing so:

  1. The Sequences cannot be in this Component, but must be in the SubComponents for the plasmids.
    • Advantage: this is what we do now.
    • Disadvantage: this limits flattening, since we have to have two layers of system. Clever counter-examples might even generate unflattenable structures of arbitrary depth.
    • Disadvantage: raises the question of whether "Component" should be separated into subclasses that do or don't allow a sequence.
  2. The Sequences can be in included in the Component, but must have an explicit link between a Sequence and the thing that it describes.
    • Advantage: complete flattening is possible
    • Advantage: uniform handling of sequences in all types of Component
    • Disadvantage: requires some sort of extra relationship in simple components (like the simple promoter pBAD)

This SEP makes a formal proposal for option #2, in contrast to to SEP 025, which effectively proposes Option 1.

2. Specification

Changes to Location

The Location class's sequence property changes from OPTIONAL (cardinality [0..1]). to REQUIRED (cardinality [1]).

Changes to Component

The sequence property is removed from Component. The sequences associated with a component are now linked only via the location fields of its child SequenceAnnotation and SubComponent objects (per SEP 10 addition of location to SubComponent).

This change ensures that no Sequence is ever associated with a Component without being given an explicit location.

EntireComponent

The EntireComponent class is a new subclass of Location with no additional properties. Use of this class indicates that the linked Sequence describes the entirety of the Component or SubComponent parent of this Location object.

Note that this class is not strictly necessary, as EntireComponent can be taken as equivalent to Range(1 .. sequence size). Using EntireComponent, however, will make validation rules easier.

New validation rule for SequenceAnnotation and SubComponent

The following new validation rules are added for Component and SubComponent respectively:

If a Component has a SequenceAnnotation that refers to Sequence X, then either the Component or one of its SubComponents must also have a SequenceAnnotation that gives Sequence X an EntireComponent Location.

If a SubComponent has a Location that refers to Sequence X, then the either the SubComponent or its parent Component must also have a SequenceAnnotation that gives Sequence X an EntireComponent Location.

This validation rule ensures that at every sequence referred to has at least one EntireComponent relationship associating it with a particular entity in the representation.

Note that it is not inconsistent to have multiple SubComponents with an EntireComponent pointing to the same Sequence, or both a Component and its SubComponent. A simple example of the first case would be a system that uses the same plasmid in two different compartments. The second case is not expected to arise in practice, but would not be incorrect---such a SubComponent would simply be redundant and pointless.

3. Examples

Example: pTet promoter

The Component for the pTet promoter is given a SequenceAnnotation with an EntireComponent location for its Sequence.

Example: a system on two plasmids

The Component contains the following:

  • DNA Plasmid 1: constitutive CDS for TetR
  • DNA Plasmid 2: pTet regulation of GFP
  • Proteins: TetR, GFP
  • Interactions: P1 CDS produces TetR; TetR represses pTet; GFP CDS produces GFP

Plasmid 1 is given an EntireComponent location for its Sequence

Plasmid 2 is given an EntireComponent location for its (different) Sequence.

Example: annotation in context

The Component contains the following:

  • SubComponent for a system expressing Cas9 and a targeting gRNA
    • definition is Cas9-gRNA Component:
      • SubComponent: promoter - CDS for Cas9
      • SubComponent: U6promoter - gRNA coding sequence
      • SubComponent: Cas9 (protein)
      • SubComponent: gRNA (RNA)
      • SubComponent: Cas9-gRNA complex
      • Interaction: Association - Cas9 + gRNA -> Cas9-gRNA
  • SubComponent for a system with a genome that is going to be edited.
    • location is EntireComponent for Sequence: E-coli-genome-sequence
    • definition is a Component for E. coli genome
      • SequenceAnnotation with location EntireComponent for Sequence: E-coli-genome-sequence
  • SequenceAnnotation for each gRNA binding site:
    • role: SO binding site
    • location: Range [nn .. nn] on Sequence: E-coli-genome-sequence
  • ... similar for each Cas9m break site

In order to indicate where the binding targets and edit locations on the genome will be, we add SequenceAnnotations to this Component, each annotation referencing the Sequence for the genome.

Now, however, we also need to satisfy the rule that there is an EntireComponent location for that genome in either the Component or the SubComponent. This cannot be done as a SequenceAnnotation on the Component, since the Component also contains the Cas9 system, the gRNA, whatever else is in the genome system, etc.---the genome isn't the whole Component. Instead (assuming SEP 037 is accepted) we make a SubComponent of type ComponentReference that points to the genome within the genomic system, and give that the EntireComponent location for the Sequence.

4. Backwards Compatibility

SBOL 2 systems can be readily converted to follow this convention by changing each sequence property to a SequenceAnnotation with an EntireComponent Location linking to the Sequence.

SBOL 3 systems with precisely one EntireComponent SequenceAnnotation can be converted back to SBOL 2 by the converse. Otherwise, they will need a more complex process of extracting required sub-components.

5. Discussion

Potential alternative names for EntireComponent:

  • EntireLocation
  • EntireSequence
  • Any of the above with "Whole" instead of "Entire"

6. Competing SEPs

This SEP proposes a modification of the unification of ComponentDefinition and ModuleDefinition proposed in SEP 025. Technically, it stacks on top of SEP 025.

Copyright

CC0
To the extent possible under law, SBOL developers has waived all copyright and related or neighboring rights to SEP 041. This work is published from: United States.