SEP 006 -- Add Length field to Range Locations #6
Comments
Given that you have proposed two options, can you please indicate which of the two you recommend, and why? |
I would think it is option 1. Option 1 is the cleanest. It allows the data model to be extended, and it does not significantly impact 2.0. Namely, nothing in 2.0 is removed or changed; it is simply an additional feature. I believe we agreed that 2.x would add features but should not make significant changes that make 2.0 invalid under 2.x. When we considered locations, we talked about having multiple fields with some unused, and unless my memory fails me, I thought, Matthew, that you were the one arguing for not having unused fields, which is why we ended up with the abstract location class. While I don’t have a problem with unused fields, I feel like that ship has sailed, and we should stick with the current approach. What I do think we need to consider, though, is which combinations of locations are useful together within a single SA. Namely, should we allow both a length and range location? Or how about a generic location and a range? It seems that certain combinations add no additional information, and it would be simpler to disallow them. Note this is not a restriction on 2.0, since they can be mapped to 2.0 in a clean fashion, since the information of having these extra locations in the list is redundant. Chris
|
Can length constraints be indicated without changing the data model? For example, using IUPAC encodings? A primer Definition could point to a Sequence of 23 n's, where n is an arbitrary nucleotide. B
|
Yes, and this is mentioned in the SEP discussion section. So, this does mean if we accept it, there would be a path to convert to SBOL 2.0, if desired. I think though I can see the argument for why this is a bit clunky. I think it makes sense sometimes not to create instantiations simply to make annotations. This is why I also support adding Role to SAs. Hopefully, that SEP can be discussed soon too. Chris
|
I purposefully didn't indicate my preference between the two options. I'm leaning towards the second one, to tidy up the whole location model for biopolymers. The semantics are clearer, in that each Range is then self-contained as being one place that the annotation is within the component, and if you have multiple ranges attached, the annotation is at each of those, but e.g. the length of one applies only to that range and not to the others. You can do genbank-style messy combinations of ranges each with distinct orientation. There are cases where from a design perspective I'd like to be able to specify only the start position of an annotation, or the start and length, since those are the things fixed by my design, and it is an accident that this lets us know the end. |
@bbartley yes, a length constraint can be inferred by chasing a whole load of pointers. I'm not a fan for several reasons.
I'd rather have these constraints documented within the object that they are constraining. |
Picking up from Jake's comment, when we put this to a vote, it needs to be clear whether we are voting for option 1 or 2. If there is no obvious consensus which one to pick, each option should get its own competing SEP. @cjmyers Chris, I find the "this ship has sailed" argument very unconvincing. 2.0 is just starting to be tested and challenged for real. The implementation of alternative libraries and tools that are not using the java library is a true acid test for the standard. So we must remain open for changes that correct mistakes or simplify or clean up things. So I am actually in favour of the second option as it seems to be the cleaner design. Having said that, this also depends on which SBOL version is targeted. From what I understood, 2.1 is supposed to impose minimal pain on library developers so slightly more radical changes should perhaps be marked up for SBOL 2.2. @drdozer Mat, could you please put in a field "SBOL version" at the top? A short "Backwards compatibility" section would also be good (for one of the options). |
Correction: 2.0.1 should impose minimal-to-no pain. 2.1 can require new
|
Right, that sounds logical. Thanks! |
I thought the agreement was that 2.0.1 was just clarifications and typos in the spec, 2.1 was new features but all 2.0 files should be valid 2.1 files, and changes that would actually invalidate 2.0 files would come in 3.0. As for the ship-has-sailed argument, what I'm really concerned about is that we now have several SBOL 2 files published as supplemental material to papers, and these must remain valid. Allowing changes that would make these files invalid when moving to 2.1 creates a lot of difficulty for library development, as it means more version conversion routines will be needed. All this being said, we can make the library API look like option 2 with serialization as option 1. In fact it is already that way, since when you create an annotation you just provide the fields you need and it creates the correct location class. Matthew: this is a big change for you. You used to be very against unused fields, as I recall. Chris
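The "option 2 API over option 1 serialization" idea in the comment above could look roughly like the following. This is only a sketch: the class and function names are hypothetical stand-ins and not the actual API of libSBOLj or any other SBOL library. It assumes the SBOL 2.0 model in which a Range needs both start and end, a Cut has a single `at` position, and a GenericLocation carries only an orientation.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Range:
    start: int
    end: int
    orientation: Optional[str] = None


@dataclass(frozen=True)
class Cut:
    at: int
    orientation: Optional[str] = None


@dataclass(frozen=True)
class GenericLocation:
    orientation: Optional[str] = None


def make_location(start: Optional[int] = None,
                  end: Optional[int] = None,
                  at: Optional[int] = None,
                  orientation: Optional[str] = None) -> Union[Range, Cut, GenericLocation]:
    """Hypothetical facade: the caller supplies only the fields it knows,
    and the matching concrete 2.0 location class is chosen for it."""
    if at is not None:
        if start is not None or end is not None:
            raise ValueError("a Cut cannot also carry start/end")
        return Cut(at, orientation)
    if start is not None and end is not None:
        if end < start:
            raise ValueError("end must not be smaller than start")
        return Range(start, end, orientation)
    if start is None and end is None:
        # Nothing positional is known: fall back to an orientation-only location.
        return GenericLocation(orientation)
    raise ValueError("start and end must be given together under the 2.0 model")
```

With a facade like this, the serialized files keep the separate 2.0 classes while callers see a single entry point.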
|
On Sat, Feb 13, 2016 at 8:24 AM, cjmyers [email protected] wrote:
I believe that this is in agreement with what I said, which was in response So far as I understand this SEP, both options would keep all 2.0 files as Thanks, |
The option that modifies the definition of RangeLocation would also deprecate GenericLocation. However, we could declare GenericLocation deprecated but still read-supported in 2.1 and then remove it for 3.0. This would make 2.0 files valid input for 2.1 tools but not necessarily the other way round. Would that be sufficient? |
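One way a 2.1 reader might treat a deprecated-but-still-readable GenericLocation is sketched below. Everything here is an assumption for illustration: the dictionary record format stands in for whatever the real parser produces, and `read_location` is a hypothetical helper rather than an existing library function.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Range:
    # Option 2 style Range: every field is optional.
    start: Optional[int] = None
    end: Optional[int] = None
    length: Optional[int] = None
    orientation: Optional[str] = None


def read_location(record: dict) -> Range:
    """Read a location record (hypothetical dict format), accepting the
    deprecated GenericLocation and mapping it onto an orientation-only Range."""
    kind = record.get("type")
    if kind == "GenericLocation":
        warnings.warn(
            "GenericLocation is deprecated; reading it as an orientation-only Range",
            DeprecationWarning,
            stacklevel=2,
        )
        return Range(orientation=record.get("orientation"))
    if kind == "Range":
        return Range(
            start=record.get("start"),
            end=record.get("end"),
            length=record.get("length"),
            orientation=record.get("orientation"),
        )
    raise ValueError(f"unsupported location type: {kind!r}")
```

A writer built the same way would emit only Range, which is why 2.0 files stay valid input for 2.1 tools while 2.1 output is not guaranteed to be readable by 2.0 tools.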
I'm okay with the deprecation approach. I actually think combining all location types is a good long-term simplification. This would also mean, I hope, deprecating Cut and merging that in too. Chris
|
@bbartley I think that I suggested the specification of an unknown sequence by a length of Ns a couple of months back. In reality, I can't envisage a case where we would need something more specific (e.g. NNNANNNCNNNTNN). Also, I take on board Matthew's point about unknown protein and XNA sequences. So either of Matthew's solutions would be acceptable (to me, at least). |
If we go with option 2, it could target 2.1, with a deprecation of generic location and data migration guide. I would retain Cut, as its semantics are different to Range. Cut indexes the phosphates (or amide bonds), not the monomers. I would also deprecate the multi-location datatype, instead just attaching multiple locations to a single annotation. This would leave us with Range and Cut as the only concrete location types, suitable for biopolymers. |
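The difference between indexing monomers and indexing bonds can be made concrete with a small sketch. It assumes the SBOL 2.0 conventions as I read them: a Range is 1-based and inclusive of both ends, and a Cut at position `at` falls between monomer `at` and monomer `at + 1`, with 0 meaning before the first monomer. The function names are hypothetical.

```python
from typing import Tuple


def range_subsequence(sequence: str, start: int, end: int) -> str:
    """Monomers covered by a Range: 1-based, inclusive of both start and end."""
    return sequence[start - 1:end]


def cut_split(sequence: str, at: int) -> Tuple[str, str]:
    """Split at the bond after monomer `at`; at=0 cuts before the first monomer."""
    return sequence[:at], sequence[at:]
```

For the sequence "ACGT", `range_subsequence("ACGT", 2, 3)` selects the monomers "CG", while `cut_split("ACGT", 2)` selects no monomers at all and instead yields the two sides "AC" and "GT".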
Multi location has been gone since 2.0. It is already a list of locations. Chris
|
Shall I make a separate SEP document that we could vote on that contains only option 2? |
I would suggest instead that you modify this document to put option #2 as
|
@jakebeal do I edit the top comment directly? Or is there some procedure where it lives in GIT with a log of commits? |
This was discussed at SBOL 14, but a number of shortcomings were pointed out. Option 2 would require a 3.0 spec change. This will require some additional clarification. |
Update as of COMBINE 2018: needs more discussion. |
@jakebeal agreed to rewrite this as adding a length field to Range Locations. |
I remember that we discussed this at COMBINE 2020 and decided not to move forward on it, though I can't remember the precise reason. As I continue to think about this, I think the potential value is less in adding a "length" to Range and more in having a new location type with relative coordinates, per: SynBioDex/SBOL-specification#223 @drdozer Would you be OK with the idea of withdrawing this proposal in favor of a potential future relative coordinate proposal? |
Abstract
This document defines Length locations, intended for use in iterative refinement of constraint-based designs.
Rationale
The current SBOL Location data model contains a number of classes to represent locations within stranded and unstranded biopolymers. This includes regions defined by a beginning and ending index into the biopolymer. During constraint-based iterative design, it is sometimes known what length a region will have but not what the beginning and ending indexes will be. In these situations it would be useful to be able to capture this length within the ComponentDefinition.
An example use-case would be in the design of a primer. It may be required to be 22 nt in length, and to be placed somewhere within a particular 50 nt target region. The location of the target region can be specified as a Range. The placement of the primer within the target region can be captured as a constraint. But currently there's no simple way to specify the primer length.
Possible Implementations
There are two relatively low impact ways to address this need by modifying the locations class hierarchy.
Introduce a Length location class
If a SequenceAnnotation has associated Ranges, the length should be consistent with these. In the case of one range, the length should correspond to end - start + 1. In the case of multiple ranges, it is not clear how to uniquely define the length.
Modify the Range location type to include a length
The range class is modified to include an optional length field. The start and end fields are also made optional. At least one of start, end, length or orientation must be specified.
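A minimal sketch of such a Range, assuming a plain Python dataclass stands in for the SBOL class (the field names follow the text; the class itself is hypothetical and is not taken from any SBOL library):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Range:
    # All four fields are optional, as proposed for the modified Range.
    start: Optional[int] = None
    end: Optional[int] = None
    length: Optional[int] = None
    orientation: Optional[str] = None

    def __post_init__(self) -> None:
        # Enforce the "at least one field" rule from the proposal.
        fields = (self.start, self.end, self.length, self.orientation)
        if all(value is None for value in fields):
            raise ValueError(
                "at least one of start, end, length or orientation must be specified"
            )
```

Under this sketch, the primer from the rationale could be written as `Range(length=22)` before its position is known, and a design that fixes only the starting point could use `Range(start=101)` (the coordinate is made up for illustration).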
This allows GenericLocation to be retired, as it is the same as a Range with only the orientation field set.
The start, end and length fields, if specified, must conform to the constraint length = end - start + 1. If any two are specified, the third can be inferred using this relation, but there is no requirement to 'fill in' missing values that were not specified by the data author.
Discussion
The first option to introduce a Length class is the quickest to implement. However, the alternative to extend Range seems to have less ambiguity about what the length actually refers to (the length of that region), and allows the unification of a number of different corner cases in the current data model (including GenericLocation).
It is technically possible to simulate a length constraint by associating a SequenceAnnotation with a Component that points to a ComponentDefinition with an associated Sequence with a known length. However, this involves inventing a whole chain of instances, purely to specify a length.