SEP 006 -- Add Length field to Range Locations #6

Closed
drdozer opened this issue Feb 12, 2016 · 24 comments

@drdozer
Member

drdozer commented Feb 12, 2016

SEP 006
Title Length Locations
Authors Matthew Pocock ([email protected])
Type Data model
Status Draft
Created 12-Feb-2016

Abstract

This document defines Length locations, intended for use in iterative refinement of constraint-based designs.

Rationale

The current SBOL Location data model contains a number of classes to represent locations within stranded and unstranded biopolymers. This includes regions defined by a beginning and ending index into the biopolymer. During constraint-based iterative design, it is sometimes known what length a region will have but not what the beginning and ending indexes will be. In these situations it would be useful to be able to capture this length within the ComponentDefinition.

An example use case is the design of a primer. It may be required to be 22 nt in length and to be placed somewhere within a particular 50 nt target region. The location of the target region can be specified as a Range, and the placement of the primer within the target region can be captured as a constraint. Currently, however, there is no simple way to specify the primer length.
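
To make the primer example concrete, here is a minimal sketch that counts where a region of known length could sit inside a target Range. The target coordinates (101 to 150), the class name, and the method names are assumptions chosen for illustration; none of them come from the SBOL data model.

```java
// Hypothetical helper; the class, method names, and coordinates are illustrative only.
final class PrimerPlacement {
    // Counts the start positions at which a region of the given length fits
    // inside the 1-based, inclusive target range [targetStart, targetEnd].
    static int placements(int targetStart, int targetEnd, int length) {
        int targetLength = targetEnd - targetStart + 1;
        return Math.max(0, targetLength - length + 1);
    }

    // Returns the last start position at which the region still fits.
    static int lastStart(int targetEnd, int length) {
        return targetEnd - length + 1;
    }

    public static void main(String[] args) {
        // A 22 nt primer inside an assumed 50 nt target at positions 101..150.
        System.out.println(placements(101, 150, 22)); // prints 29
        System.out.println(lastStart(150, 22));       // prints 129
    }
}
```

The length is the only fixed quantity here; the start and end stay open until a placement is chosen, which is the situation the proposal wants to capture.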

Possible Implementations

There are two relatively low impact ways to address this need by modifying the locations class hierarchy.

Introduce a Length location class

// Scala
case class Length(len: Int) extends Location

// Java equivalent
public class Length extends Location {
  private int len;
  public Length(int len) { this.len = len; }
  public int getLength() { return len; }
  public void setLength(int len) { this.len = len; }
}

If a SequenceAnnotation has associated Ranges, the length should be consistent with these. In the case of one range, the length should correspond to end - start + 1. In the case of multiple ranges, it is not clear how to uniquely define the length.
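
The ambiguity for multiple ranges can be illustrated with a short sketch. Two plausible definitions, the sum of the individual range lengths and the overall span from the first start to the last end, give different answers for the same input. The class and method names are hypothetical, and the sketch assumes a non-empty list of non-overlapping ranges.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; neither definition is endorsed by the proposal.
final class MultiRangeLength {
    // Each int[] is a 1-based, inclusive {start, end} pair.
    // Sums the lengths of the individual ranges (assumes they do not overlap).
    static int summedLength(List<int[]> ranges) {
        int total = 0;
        for (int[] r : ranges) {
            total += r[1] - r[0] + 1;
        }
        return total;
    }

    // Measures from the smallest start to the largest end (assumes a non-empty list).
    static int spannedLength(List<int[]> ranges) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int[] r : ranges) {
            min = Math.min(min, r[0]);
            max = Math.max(max, r[1]);
        }
        return max - min + 1;
    }

    public static void main(String[] args) {
        List<int[]> ranges = Arrays.asList(new int[]{1, 10}, new int[]{21, 30});
        System.out.println(summedLength(ranges));  // prints 20
        System.out.println(spannedLength(ranges)); // prints 30
    }
}
```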

Modify the Range location type to include a length

case class Range(
    start: Option[Int],
    end: Option[Int],
    length: Option[Int],
    orientation: Option[Orientation])

The range class is modified to include an optional length field. The start and end fields are also made optional. At least one of start, end, length or orientation must be specified.

This allows GenericLocation to be retired, as it is the same as a Range with only the orientation field set.
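
As a rough sketch of that equivalence, the Java below models the modified Range with four optional fields and converts a former GenericLocation into a Range that carries only an orientation. The class name, the factory method, and the two orientation constants (assumed to mirror SBOL 2's inline and reverseComplement) are illustrative and not taken from any SBOL library.

```java
import java.util.Optional;
import java.util.OptionalInt;

// Hypothetical sketch of the unified Range; not taken from any SBOL library.
final class UnifiedRange {
    // Orientation values are assumed to mirror SBOL 2's inline and reverseComplement.
    enum Orientation { INLINE, REVERSE_COMPLEMENT }

    final OptionalInt start;
    final OptionalInt end;
    final OptionalInt length;
    final Optional<Orientation> orientation;

    UnifiedRange(OptionalInt start, OptionalInt end, OptionalInt length,
                 Optional<Orientation> orientation) {
        // The proposal requires at least one of the four fields to be set.
        if (!start.isPresent() && !end.isPresent()
                && !length.isPresent() && !orientation.isPresent()) {
            throw new IllegalArgumentException("at least one field must be specified");
        }
        this.start = start;
        this.end = end;
        this.length = length;
        this.orientation = orientation;
    }

    // Converts what used to be a GenericLocation into an orientation-only Range.
    static UnifiedRange fromGenericLocation(Orientation orientation) {
        return new UnifiedRange(OptionalInt.empty(), OptionalInt.empty(),
                OptionalInt.empty(), Optional.of(orientation));
    }

    // True when the Range carries no positional information at all.
    boolean isOrientationOnly() {
        return orientation.isPresent()
                && !start.isPresent() && !end.isPresent() && !length.isPresent();
    }
}
```

A GenericLocation that carries no orientation would have no counterpart under the "at least one field" rule; this sketch leaves that case open.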

The start, end and length fields, if specified, must conform to the constraint length = end - start + 1. If any two are specified, the third can be inferred using this relation, but there is no requirement to 'fill in' missing values that were not specified by the data author.
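
The relation can be sketched in Java as follows. This is a minimal illustration under stated assumptions: the class name and the `inferred...` accessors are hypothetical, orientation is left out to keep the arithmetic in focus, and the stored fields are never overwritten, in keeping with the rule that missing values need not be filled in.

```java
import java.util.OptionalInt;

// Hypothetical sketch; orientation is omitted to keep the arithmetic in focus.
final class PartialRange {
    final OptionalInt start;
    final OptionalInt end;
    final OptionalInt length;

    PartialRange(OptionalInt start, OptionalInt end, OptionalInt length) {
        // When all three are given they must satisfy length = end - start + 1.
        if (start.isPresent() && end.isPresent() && length.isPresent()
                && length.getAsInt() != end.getAsInt() - start.getAsInt() + 1) {
            throw new IllegalArgumentException("length must equal end - start + 1");
        }
        this.start = start;
        this.end = end;
        this.length = length;
    }

    // Each accessor returns the stated value, or derives it from the other two.
    OptionalInt inferredStart() {
        if (start.isPresent()) return start;
        if (end.isPresent() && length.isPresent()) {
            return OptionalInt.of(end.getAsInt() - length.getAsInt() + 1);
        }
        return OptionalInt.empty();
    }

    OptionalInt inferredEnd() {
        if (end.isPresent()) return end;
        if (start.isPresent() && length.isPresent()) {
            return OptionalInt.of(start.getAsInt() + length.getAsInt() - 1);
        }
        return OptionalInt.empty();
    }

    OptionalInt inferredLength() {
        if (length.isPresent()) return length;
        if (start.isPresent() && end.isPresent()) {
            return OptionalInt.of(end.getAsInt() - start.getAsInt() + 1);
        }
        return OptionalInt.empty();
    }
}
```

For example, a Range with start 10 and length 22 has an inferred end of 31, while a Range with only a length yields no inferred start or end.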

Discussion

The first option, introducing a Length class, is the quickest to implement. However, the alternative of extending Range seems to leave less ambiguity about what the length actually refers to (the length of that region), and it allows a number of different corner cases in the current data model (including GenericLocation) to be unified.

It is technically possible to simulate a length constraint by associating a SequenceAnnotation with a Component that points to a ComponentDefinition with an associated Sequence with a known length. However, this involves inventing a whole chain of instances, purely to specify a length.

@jakebeal
Contributor

Given that you have proposed two options, can you please indicate which of the two you recommend, and why?

@cjmyers
Contributor

cjmyers commented Feb 12, 2016

I would think it is option 1. Option 1 is the cleanest. It allows the data model to be extended, and it does not significantly impact 2.0. Namely, nothing in 2.0 is removed or changed, it is simply an additional feature. I believe we agreed that 2.x would add features but it should not make significant changes making 2.0 invalid under 2.x.

When we considered locations, we talked about having multiple fields with some unused, and unless my memory fails me, I thought Matthew that you were the one arguing for not having unused fields which is why we ended up with the abstract location class. While I don’t have a problem with unused fields, I feel like that ship has sailed, and we should stick with the current approach.

What I do think we need to consider, though, is which combinations of locations are useful together within a single SA. Namely, should we allow both a length and a range location? Or a generic location and a range? It seems that certain combinations add no additional information, and it would be simpler to disallow them. Note that this is not a restriction on 2.0: such combinations can be mapped to 2.0 cleanly, because the information carried by the extra locations in the list is redundant.

Chris


@bbartley
Contributor

Can length constraints be indicated without changing the data model, for example by using IUPAC encodings? A primer Definition could point to a Sequence of 23 n's, where n is an arbitrary nucleotide.

B


@cjmyers
Contributor

cjmyers commented Feb 12, 2016

Yes, and this is mentioned in the SEP discussion section. So, this does mean if we accept it, there would be a path to convert to SBOL 2.0, if desired. I think though I can see the argument for why this is a bit clunky. I think it makes sense sometimes not to create instantiations simply to make annotations. This is why I also support adding Role to SAs. Hopefully, that SEP can be discussed soon too.

Chris


@drdozer
Member Author

drdozer commented Feb 13, 2016

I purposefully didn't indicate my preference between the two options. I'm leaning towards the second one, to tidy up the whole location model for biopolymers. The semantics are clearer: each Range is self-contained as one place where the annotation sits within the component. If you attach multiple ranges, the annotation is at each of them, but the length of one applies only to that range and not to the others. You can do GenBank-style messy combinations of ranges, each with a distinct orientation. There are also cases where, from a design perspective, I'd like to specify only the start position of an annotation, or the start and length, since those are the things fixed by my design, and it is an accident that this lets us know the end.

@drdozer
Member Author

drdozer commented Feb 13, 2016

@bbartley yes, a length constraint can be inferred by chasing a whole load of pointers. I'm not a fan for several reasons.

  1. You invent a bunch of top-level entities that now pollute the global namespace just to state a constraint within your component.
  2. If this is the primary mechanism, there are likely to be an embarrassing number of these polluting instances. It isn't DRY.
  3. The externality of the constraint loses encapsulation: key intended design constraints are not contained within the component being designed.
  4. It requires a bunch of special-case reasoning and inference code. What do we do for protein regions with constrained lengths (e.g. a flexible linker of 7 aa), or XNA, or PNA? We'd need a bunch of code that understands each particular encoding of sequences to figure out that the associated sequence leads to a particular length constraint.

I'd rather have these constraints documented within the object that they are constraining.

@graik
Contributor

graik commented Feb 13, 2016

Picking up from Jake's comment, when we put this to a vote, it needs to be clear whether we are voting for option 1 or 2. If there is no obvious consensus which one to pick, each option should get its own competing SEP.

@cjmyers Chris, I find the "this ship has sailed" argument very unconvincing. 2.0 is just starting to be tested and challenged for real. The implementation of alternative libraries and tools that are not using the java library is a true acid test for the standard. So we must remain open for changes that correct mistakes or simplify or clean up things. So I am actually in favour of the second option as it seems to be the cleaner design.

Having said that, this also depends on which SBOL version is targeted. From what I understood, 2.1 is supposed to impose minimal pain on library developers so slightly more radical changes should perhaps be marked up for SBOL 2.2. @drdozer Mat, could you please put in a field "SBOL version" at the top? A short "Backwards compatibility" section would also be good (for one of the options).

@jakebeal
Contributor

Correction: 2.0.1 should impose minimal-to-no pain. 2.1 can require new
work.


@graik
Contributor

graik commented Feb 13, 2016

Right, that sounds logical. Thanks!

@cjmyers
Contributor

cjmyers commented Feb 13, 2016

I thought the agreement was that 2.0.1 was just clarifications and typo fixes in the spec; 2.1 was new features, but all 2.0 files should be valid 2.1 files; and changes that would actually invalidate 2.0 files would come in 3.0.

As for the ship-sailed argument, what I'm really concerned about is that we now have several SBOL 2 files published as supplemental material to papers, and these must remain valid. Allowing changes that would make these files invalid when moving to 2.1 creates a lot of difficulty for library development, as it means more version conversion routines will be needed.

All this being said, we can make the library API look like option 2 with serialization as option 1. In fact, it is already that way: when you create an annotation, you just provide the fields you need and it creates the correct location class.
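
A minimal sketch of that idea, assuming a hypothetical factory that is not the API of any existing SBOL library: callers pass whichever of start, end, and length they know, and the factory returns one of the concrete option 1 classes.

```java
import java.util.OptionalInt;

// Hypothetical factory; it is not the API of any existing SBOL library.
final class LocationFactory {
    interface Location {}

    static final class Range implements Location {
        final int start;
        final int end;
        Range(int start, int end) { this.start = start; this.end = end; }
    }

    static final class Length implements Location {
        final int len;
        Length(int len) { this.len = len; }
    }

    // Callers pass whichever fields they know (option 2 style); the factory
    // returns one of the concrete option 1 classes for serialization.
    static Location create(OptionalInt start, OptionalInt end, OptionalInt length) {
        if (start.isPresent() && end.isPresent()) {
            int expected = end.getAsInt() - start.getAsInt() + 1;
            if (length.isPresent() && length.getAsInt() != expected) {
                throw new IllegalArgumentException("length must equal end - start + 1");
            }
            return new Range(start.getAsInt(), end.getAsInt());
        }
        if (length.isPresent()) {
            int len = length.getAsInt();
            if (start.isPresent()) {
                return new Range(start.getAsInt(), start.getAsInt() + len - 1);
            }
            if (end.isPresent()) {
                return new Range(end.getAsInt() - len + 1, end.getAsInt());
            }
            return new Length(len);
        }
        // A lone start or end has no option 1 class to hold it.
        throw new IllegalArgumentException("cannot choose a location class from these fields");
    }
}
```

One design consequence of this sketch: a start plus a length is resolved into a full Range, so the serialized form no longer records which two fields the author actually fixed.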

Matthew: this is a big change for you. You used to be very against unused fields, as I recall.

Chris


@jakebeal
Contributor

On Sat, Feb 13, 2016 at 8:24 AM, cjmyers [email protected] wrote:

I thought the agreement was 2.0.1 was just clarifications and typos in
spec, 2.1 was new features but all 2.0 files should be valid 2.1 files,
changes that would actually invalidate 2.0 files would come in 3.0.

I believe that this is in agreement with what I said, which was in response
to Raik's statement about software changes.

So far as I understand this SEP, both options would keep all 2.0 files as
valid 2.1 files. Both would require changes to the code library.

Thanks,
-Jake

@graik
Contributor

graik commented Feb 13, 2016

The option that modifies the definition of RangeLocation would also deprecate GenericLocation. However, we could declare GenericLocation deprecated but still read-supported in 2.1 and then remove it for 3.0. This would make 2.0 files valid input for 2.1 tools but not necessarily the other way round. Would that be sufficient?

@cjmyers
Contributor

cjmyers commented Feb 13, 2016

I'm okay with the deprecation approach. I actually think combining all location types is a good long-term simplification. This would also mean, I hope, deprecating Cut and merging that in too.

Chris


@neilswainston

@bbartley I think that I suggested the specification of an unknown sequence by a length of Ns a couple of months back. In reality, I can't envisage a case where we would need something more specific (e.g. NNNANNNCNNNTNN). Also, I take on board Matthew's point about unknown protein and XNA sequences. So either of Matthew's solutions would be acceptable (to me, at least).

@drdozer
Member Author

drdozer commented Feb 24, 2016

If we go with option 2, it could target 2.1, with a deprecation of generic location and a data migration guide.

I would retain Cut, as its semantics are different to Range. Cut indexes the phosphates (or amide bonds), not the monomers. I would also deprecate the multi-location datatype, instead just attaching multiple locations to a single annotation. This would leave us with Range and Cut as the only concrete location types, suitable for biopolymers.
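
The difference in indexing can be sketched as follows, assuming a Cut at position n falls on the bond between monomer n and monomer n + 1 (with 0 meaning before the first monomer) and a Range uses 1-based, inclusive monomer indexes. The class and method names are hypothetical.

```java
// Hypothetical illustration of Range versus Cut indexing on a plain string.
final class CutVersusRange {
    // Range: selects monomers start..end, 1-based and inclusive.
    static String range(String seq, int start, int end) {
        return seq.substring(start - 1, end);
    }

    // Cut: splits at the bond after monomer 'at'; at = 0 is before the first monomer.
    static String[] cut(String seq, int at) {
        return new String[] { seq.substring(0, at), seq.substring(at) };
    }

    public static void main(String[] args) {
        String seq = "acgtac";
        System.out.println(range(seq, 2, 4)); // prints cgt
        String[] parts = cut(seq, 2);
        System.out.println(parts[0] + " | " + parts[1]); // prints ac | gtac
    }
}
```

A Range always covers at least one monomer, whereas a Cut covers none, which is why the two cannot share a single start/end/length representation without extra conventions.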

@cjmyers
Contributor

cjmyers commented Feb 24, 2016

Multi location has been gone since 2.0. It is already a list of locations.

Chris


@drdozer
Member Author

drdozer commented Mar 12, 2016

Shall I make a separate SEP document that we could vote on that contains only option 2?

@jakebeal
Contributor

I would suggest instead that you modify this document to put option #2 as
the proposed change and put option #1 into the discussion as an alternative
that has been considered but rejected. This puts it in the desired state
for a vote without losing information about why (which is one of the
important intentions of SEPs).


@drdozer
Member Author

drdozer commented Mar 12, 2016

@jakebeal do I edit the top comment directly? Or is there some procedure where it lives in Git with a log of commits?

@bbartley
Contributor

Hi Matt,
When your revised SEP is ready, simply initiate a pull request at SynBioDex/SEPs with the latest version of your .md file. 
Regards,
Bryan

@bbartley
Contributor

This was discussed at SBOL 14, but a number of shortcomings were pointed out. Option 2 would require a 3.0 spec change. This will require some additional clarification.

@cjmyers cjmyers removed the Draft label Jun 23, 2018
@NeilWipat
Collaborator

Update as of COMBINE 2018

Needs more discussion

@jakebeal jakebeal added this to the SBOL 3.0 milestone Nov 24, 2019
@cjmyers cjmyers added Draft and removed Inactive labels Dec 11, 2019
@cjmyers
Contributor

cjmyers commented Dec 11, 2019

@jakebeal agreed to rewrite this as adding a length field to Range Locations.

@cjmyers cjmyers changed the title SEP 006 -- Length locations SEP 006 -- Add Length field to Range Locations Dec 11, 2019
@cjmyers cjmyers assigned ghost Dec 12, 2019
@jakebeal jakebeal modified the milestones: SBOL 3.0, SBOL 3.1 Jan 27, 2020
@jakebeal
Contributor

I remember that we discussed this at COMBINE 2020 and decided not to move forward on it, though I can't remember the precise reason. As I continue to think about this, I think the potential value is less in adding a "length" to Range and more in having a new location type with relative coordinates, per: SynBioDex/SBOL-specification#223

@drdozer Would you be OK with the idea of withdrawing this proposal in favor of a potential future relative coordinate proposal?


7 participants