SEP 006 -- Add Length field to Range Locations #6

drdozer · 2016-02-12T16:18:05Z

SEP	006
Title	Length Locations
Authors	Matthew Pocock ([email protected])
Type	Data model
Status	Draft
Created	12-Feb-2016

Abstract

This document defines Length locations, intended for use in iterative refinement of constraint-based designs.

Rationale

The current SBOL Location data model contains a number of classes to represent locations within stranded and unstranded biopolymers. This includes regions defined by a beginning and ending index into the biopolymer. During constraint-based iterative design, it is sometimes known what length a region will have but not what the beginning and ending indexes will be. In these situations it would be useful to be able to capture this length within the ComponentDefinition.

An example use case would be the design of a primer. It may be required to be 22 nt in length, and to be placed somewhere within a particular 50 nt target region. The location of the target region can be specified as a Range. The placement of the primer within the target region can be captured as a constraint. But currently there is no simple way to specify the primer length.
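
To make the primer example concrete, the following Java sketch counts the positions at which a region of known length fits inside a target region. The class and method names are hypothetical, and it assumes 1-based, inclusive indexes, so that a region's length is end - start + 1.

```java
// Hypothetical helper, not part of any SBOL library.
// Assumes 1-based, inclusive indexes, so a region's length is end - start + 1.
final class PrimerPlacement {

    // Number of distinct positions at which a region of the given length
    // fits entirely inside the target region [targetStart, targetEnd].
    static int countPlacements(int targetStart, int targetEnd, int length) {
        int targetLength = targetEnd - targetStart + 1;
        if (length < 1 || length > targetLength) {
            return 0; // the region cannot fit at all
        }
        return targetLength - length + 1;
    }

    // Largest start index that still keeps the region inside the target.
    static int lastStart(int targetEnd, int length) {
        return targetEnd - length + 1;
    }
}
```

With made-up coordinates 101..150 for the 50 nt target, a 22 nt primer has 50 - 22 + 1 = 29 possible placements, starting anywhere from 101 to 129. The length is fixed by the design while the start and end are not, which is the information the current Location classes cannot record.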

Possible Implementations

There are two relatively low impact ways to address this need by modifying the locations class hierarchy.

Introduce a Length location class

case class Length(len: Int) extends Location

public class Length extends Location {
  private int len;
  public Length(int len) { this.len = len; }
  public int getLength() { return len; }
  public void setLength(int len) { this.len = len; }
}

If a SequenceAnnotation has associated Ranges, the length should be consistent with these. In the case of one range, the length should correspond to end - start + 1. In the case of multiple ranges, it is not clear how to uniquely define the length.
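
The single-range consistency rule can be sketched as follows. This is only an illustration: the classes are minimal stand-ins rather than the real SBOL types, the method name isConsistentWith is hypothetical, and the Range constructor assumes 1-based inclusive indexes with start <= end.

```java
// Minimal stand-ins for the SBOL classes; only what the check needs.
abstract class Location {}

final class Range extends Location {
    final int start;
    final int end;

    Range(int start, int end) {
        // Assumption: 1-based inclusive indexes with start <= end.
        if (start < 1 || end < start) {
            throw new IllegalArgumentException("expected 1 <= start <= end");
        }
        this.start = start;
        this.end = end;
    }
}

final class Length extends Location {
    final int len;

    Length(int len) {
        this.len = len;
    }

    // True when this length matches the single Range: len == end - start + 1.
    boolean isConsistentWith(Range range) {
        return len == range.end - range.start + 1;
    }
}
```

For example, a Length of 22 is consistent with a Range from 5 to 26, but not with a Range from 5 to 27. The sketch deliberately has no overload for several Ranges, since the text leaves that case undefined.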

Modify the Range location type to include a length

case class Range(
    start: Option[Int],
    end: Option[Int],
    length: Option[Int],
    orientation: Option[Orientation])

The Range class is modified to include an optional length field. The start and end fields are also made optional. At least one of start, end, length, or orientation must be specified.

This allows GenericLocation to be retired, as it is the same as a Range with only the orientation field set.

The start, end and length fields, if specified, must conform to the constraint length = end - start + 1. If any two are specified, the third can be inferred using this relation, but there is no requirement to 'fill in' missing values that were not specified by the data author.
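
The relation between the three fields can be sketched in Java as below. The class name RangeSketch and the inferred* accessors are hypothetical, and orientation is omitted to keep the example short. Inference is done on demand, so the stored values remain exactly what the data author supplied.

```java
import java.util.OptionalInt;

// Hypothetical sketch of the extended Range; orientation is left out.
final class RangeSketch {
    private final OptionalInt start;
    private final OptionalInt end;
    private final OptionalInt length;

    RangeSketch(OptionalInt start, OptionalInt end, OptionalInt length) {
        // When all three are given they must satisfy length = end - start + 1.
        if (start.isPresent() && end.isPresent() && length.isPresent()
                && length.getAsInt() != end.getAsInt() - start.getAsInt() + 1) {
            throw new IllegalArgumentException("length must equal end - start + 1");
        }
        this.start = start;
        this.end = end;
        this.length = length;
    }

    // Each accessor returns the stored value, or derives it from the other two.
    OptionalInt inferredStart() {
        if (start.isPresent()) return start;
        if (end.isPresent() && length.isPresent()) {
            return OptionalInt.of(end.getAsInt() - length.getAsInt() + 1);
        }
        return OptionalInt.empty();
    }

    OptionalInt inferredEnd() {
        if (end.isPresent()) return end;
        if (start.isPresent() && length.isPresent()) {
            return OptionalInt.of(start.getAsInt() + length.getAsInt() - 1);
        }
        return OptionalInt.empty();
    }

    OptionalInt inferredLength() {
        if (length.isPresent()) return length;
        if (start.isPresent() && end.isPresent()) {
            return OptionalInt.of(end.getAsInt() - start.getAsInt() + 1);
        }
        return OptionalInt.empty();
    }
}
```

A range with start 10 and length 22 yields an inferred end of 31, while a range with only a length leaves both indexes empty.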

Discussion

The first option, introducing a Length class, is the quickest to implement. However, the alternative of extending Range seems to leave less ambiguity about what the length actually refers to (the length of that region), and allows the unification of a number of different corner cases in the current data model (including GenericLocation).

It is technically possible to simulate a length constraint by associating a SequenceAnnotation with a Component that points to a ComponentDefinition with an associated Sequence with a known length. However, this involves inventing a whole chain of instances, purely to specify a length.


jakebeal · 2016-02-12T16:52:18Z

Given that you have proposed two options, can you please indicate which of the two you recommend, and why?

cjmyers · 2016-02-12T18:04:48Z

I would think it is option 1. Option 1 is the cleanest. It allows the data model to be extended, and it does not significantly impact 2.0. Namely, nothing in 2.0 is removed or changed, it is simply an additional feature. I believe we agreed that 2.x would add features but it should not make significant changes making 2.0 invalid under 2.x.

When we considered locations, we talked about having multiple fields with some unused, and unless my memory fails me, I thought, Matthew, that you were the one arguing for not having unused fields, which is why we ended up with the abstract location class. While I don’t have a problem with unused fields, I feel like that ship has sailed, and we should stick with the current approach.

What I do think we need to consider, though, is which combinations of locations are useful together within a single SA. Namely, should we allow both a length and a range location? Or how about a generic location and a range? It seems that certain combinations add no additional information, and it would be simpler to disallow them. Note this is not a restriction on 2.0: they can be mapped to 2.0 in a clean fashion, since the information carried by these extra locations in the list is redundant.

Chris

bbartley · 2016-02-12T18:39:45Z

Can length constraints be indicated without changing the data model, for example using IUPAC encodings? A primer Definition could point to a Sequence of 23 n's, where n is an arbitrary nucleotide.

B

cjmyers · 2016-02-12T18:45:06Z

Yes, and this is mentioned in the SEP discussion section. So, this does mean if we accept it, there would be a path to convert to SBOL 2.0, if desired. I think though I can see the argument for why this is a bit clunky. I think it makes sense sometimes not to create instantiations simply to make annotations. This is why I also support adding Role to SAs. Hopefully, that SEP can be discussed soon too.

Chris

drdozer · 2016-02-13T12:02:00Z

I purposefully didn't indicate my preference between the two options. I'm leaning towards the second one, to tidy up the whole location model for biopolymers. The semantics are clearer, in that each Range is then self-contained as being one place that the annotation is within the component, and if you have multiple ranges attached, the annotation is at each of those, but e.g. the length of one applies only to that range and not to the others. You can do genbank-style messy combinations of ranges each with distinct orientation. There are cases where from a design perspective I'd like to be able to specify only the start position of an annotation, or the start and length, since those are the things fixed by my design, and it is an accident that this lets us know the end.

drdozer · 2016-02-13T12:08:01Z

@bbartley yes, a length constraint can be inferred by chasing a whole load of pointers. I'm not a fan for several reasons.

You invent a bunch of top-level entities that now pollute the global namespace just to state a constraint within your component.
If this is the primary mechanism, there are likely to be an embarrassing number of these polluting instances. It isn't DRY.
The externality of the constraint loses encapsulation: key intended design constraints are not contained within the component being designed.
It requires a bunch of special-case reasoning and inference code. What do we do for protein regions with constrained lengths (e.g. a flexible linker of 7 aa), or XNA, or PNA? We'd need a bunch of code that understands each particular encoding of sequences to figure out that the associated sequence leads to a particular length constraint.

I'd rather have these constraints documented within the object that they are constraining.

graik · 2016-02-13T13:53:07Z

Picking up from Jake's comment, when we put this to a vote, it needs to be clear whether we are voting for option 1 or 2. If there is no obvious consensus which one to pick, each option should get its own competing SEP.

@cjmyers Chris, I find the "this ship has sailed" argument very unconvincing. 2.0 is just starting to be tested and challenged for real. The implementation of alternative libraries and tools that do not use the Java library is a true acid test for the standard. So we must remain open to changes that correct mistakes, simplify, or clean things up. I am therefore actually in favour of the second option, as it seems to be the cleaner design.

Having said that, this also depends on which SBOL version is targeted. From what I understood, 2.1 is supposed to impose minimal pain on library developers so slightly more radical changes should perhaps be marked up for SBOL 2.2. @drdozer Mat, could you please put in a field "SBOL version" at the top? A short "Backwards compatibility" section would also be good (for one of the options).

jakebeal · 2016-02-13T14:03:25Z

Correction: 2.0.1 should impose minimal-to-no pain. 2.1 can require new work.

graik · 2016-02-13T14:16:16Z

Right, that sounds logical. Thanks!

cjmyers · 2016-02-13T14:24:37Z

I thought the agreement was that 2.0.1 would be just clarifications and typo fixes in the spec, that 2.1 would add new features while keeping all 2.0 files valid as 2.1 files, and that changes that would actually invalidate 2.0 files would come in 3.0.

As for the "ship has sailed" argument, what I'm really concerned about is that we now have several SBOL 2 files published as supplemental material to papers, and these must remain valid. Allowing changes that would make these files invalid when moving to 2.1 creates a lot of difficulty for library development, as it means more version conversion routines will be needed.

All this being said, we can make the library API look like option 2 with serialization as option 1. In fact, it is already that way: when you create an annotation, you just provide the fields you need and it creates the correct location class.

Matthew: this is a big change for you. You used to be very against unused fields, as I recall.

Chris
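
The idea of an option 2 style API over an option 1 style serialization might look roughly like the Java sketch below. Every name here (RangeLocation, LengthLocation, LocationFactory.create) is hypothetical and is not the libSBOLj API; null stands for a field the caller did not supply, and orientation is left out for brevity.

```java
// Hypothetical stand-ins for the option 1 location classes;
// these are not the libSBOLj types.
interface Location {}

final class RangeLocation implements Location {
    final int start;
    final int end;
    RangeLocation(int start, int end) { this.start = start; this.end = end; }
}

final class LengthLocation implements Location {
    final int len;
    LengthLocation(int len) { this.len = len; }
}

final class LocationFactory {
    // Callers pass whichever of start, end, and length they know (null = unknown).
    static Location create(Integer start, Integer end, Integer length) {
        if (start != null && end != null) {
            if (length != null && length != end - start + 1) {
                throw new IllegalArgumentException("length must equal end - start + 1");
            }
            return new RangeLocation(start, end);
        }
        if (length != null) {
            // One index plus a length determines the other index.
            if (start != null) return new RangeLocation(start, start + length - 1);
            if (end != null) return new RangeLocation(end - length + 1, end);
            return new LengthLocation(length);
        }
        throw new IllegalArgumentException(
            "start-only, end-only, and empty locations have no option 1 form in this sketch");
    }
}
```

Note the limits of this mapping: the sketch fills in a missing index because an option 1 Range needs both, and a start-only location has no representation at all, which is one of the cases option 2 is meant to cover.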

jakebeal · 2016-02-13T14:36:40Z

I believe that this is in agreement with what I said, which was in response to Raik's statement about software changes.

So far as I understand this SEP, both options would keep all 2.0 files as valid 2.1 files. Both would require changes to the code library.

Thanks,
-Jake

graik · 2016-02-13T14:49:34Z

The option that modifies the definition of RangeLocation would also deprecate GenericLocation. However, we could declare GenericLocation deprecated but still read-supported in 2.1 and then remove it for 3.0. This would make 2.0 files valid input for 2.1 tools but not necessarily the other way round. Would that be sufficient?
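
A read-side migration of this kind could be sketched in Java as follows. All names are illustrative stand-ins rather than a real library API: the enum values are assumed, Range models the extended class of option 2, and the sketch only handles a GenericLocation that actually carries an orientation.

```java
import java.util.Objects;
import java.util.Optional;
import java.util.OptionalInt;

// Assumed orientation values for illustration.
enum Orientation { INLINE, REVERSE_COMPLEMENT }

// Stand-in for the 2.0 class that would be deprecated.
final class GenericLocation {
    final Orientation orientation;
    GenericLocation(Orientation orientation) { this.orientation = orientation; }
}

// Stand-in for the extended Range of option 2, with every field optional.
final class Range {
    final OptionalInt start;
    final OptionalInt end;
    final OptionalInt length;
    final Optional<Orientation> orientation;

    Range(OptionalInt start, OptionalInt end, OptionalInt length,
          Optional<Orientation> orientation) {
        this.start = start;
        this.end = end;
        this.length = length;
        this.orientation = orientation;
    }
}

final class LocationMigration {
    // A 2.0 GenericLocation becomes a Range with only the orientation set.
    static Range fromGeneric(GenericLocation generic) {
        Orientation o = Objects.requireNonNull(
            generic.orientation, "this sketch needs an orientation to migrate");
        return new Range(OptionalInt.empty(), OptionalInt.empty(),
                         OptionalInt.empty(), Optional.of(o));
    }
}
```

A tool that reads 2.0 files could apply such a conversion on input and write only Range, which matches the asymmetry described above: 2.0 files stay readable by 2.1 tools, but not necessarily the other way round.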

cjmyers · 2016-02-13T16:20:45Z

I'm okay with the deprecation approach. I actually think combining all location types is a good long-term simplification. This would also mean, I hope, deprecating Cut and merging that in too.

Chris

neilswainston · 2016-02-15T08:47:38Z

@bbartley I think that I suggested the specification of an unknown sequence by a length of Ns a couple of months back. In reality, I can't envisage a case where we would need something more specific (e.g. NNNANNNCNNNTNN). Also, I take on board Matthew's point about unknown protein and XNA sequences. So either of Matthew's solutions would be acceptable (to me, at least).

drdozer · 2016-02-24T09:37:16Z

If we go with option 2, it could target 2.1, with a deprecation of generic location and data migration guide.

I would retain Cut, as its semantics are different to Range. Cut indexes the phosphates (or amide bonds), not the monomers. I would also deprecate the multi-location datatype, instead just attaching multiple locations to a single annotation. This would leave us with Range and Cut as the only concrete location types, suitable for biopolymers.

cjmyers · 2016-02-24T14:01:33Z

Multi location has been gone since 2.0. It is already a list of locations.

Chris

drdozer · 2016-03-12T12:30:43Z

Shall I make a separate SEP document that we could vote on that contains only option 2?

jakebeal · 2016-03-12T12:44:46Z

I would suggest instead that you modify this document to put option #2 as the proposed change and put option #1 into the discussion as an alternative that has been considered but rejected. This puts it in the desired state for a vote without losing information about why (which is one of the important intentions of SEPs).

drdozer · 2016-03-12T14:22:16Z

@jakebeal do I edit the top comment directly? Or is there some procedure where it lives in Git with a log of commits?

bbartley · 2016-03-12T19:23:37Z

Hi Matt,
When your revised SEP is ready, simply initiate a pull request at SynBioDex/SEPs with the latest version of your .md file.
Regards,
Bryan

bbartley · 2016-03-15T15:55:41Z

This was discussed at SBOL 14, but a number of shortcomings were pointed out. Option 2 would require a 3.0 spec change. This will require some additional clarification.

NeilWipat · 2018-10-11T19:49:43Z

Update as of COMBINE 2018

Needs more discussion

cjmyers · 2019-12-11T18:43:04Z

@jakebeal agreed to rewrite this as adding a length field to Range Locations.

jakebeal · 2020-10-23T11:13:53Z

I remember that we discussed this at COMBINE 2020 and decided not to move forward on it, though I can't remember the precise reason. As I continue to think about this, I think the potential value is less in adding a "length" to Range and more in having a new location type with relative coordinates, per: SynBioDex/SBOL-specification#223

@drdozer Would you be OK with the idea of withdrawing this proposal in favor of a potential future relative coordinate proposal?

graik added Draft Type: Data Model labels Feb 13, 2016

jakebeal added the Inactive label Dec 8, 2016

cjmyers removed the Draft label Jun 23, 2018

jakebeal added this to the SBOL 3.0 milestone Nov 24, 2019

cjmyers added Draft and removed Inactive labels Dec 11, 2019

cjmyers changed the title ~~SEP 006 -- Length locations~~ SEP 006 -- Add Length field to Range Locations Dec 11, 2019

cjmyers assigned ghost Dec 12, 2019

jakebeal modified the milestones: SBOL 3.0, SBOL 3.1 Jan 27, 2020

cjmyers added the Withdrawn label Mar 8, 2021

cjmyers closed this as completed Mar 8, 2021
