-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate interval tree from cgranges to superintervals #42
base: main
Are you sure you want to change the base?
Conversation
4877a18
to
641a51e
Compare
I'd like to write a little test suite to make sure performance isn't affected... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we owe it to ourselves to do at least 1 performance benchmark on large sized real data since superintervals
is brand new and I haven't seen any public benchmarks yet.
@@ -322,18 +322,38 @@ def overlaps_any(self, interval: Span) -> bool: | |||
True if and only if the given interval overlaps with any interval in this | |||
detector. | |||
""" | |||
tree = self._refname_to_tree.get(interval.refname) | |||
tree = self._refname_to_tree.get(interval.refname, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't .get()
already return None
as the default empty value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I tend to prefer explicit None
s in this kind of use case and changed it without thinking much. I can revert if this is an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, I don't really have an opinion. Explicit is nice.
for index in reversed(tree.find_overlaps(interval.start + 1, interval.end)): | ||
yield ref_intervals[index] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The call to reversed
won't make this lazy despite its use in a generator.
Are you doing this to preserve original order?
IMHO order wouldn't matter to me.
Or if it did, I'd want the same order as is forced in get_overlaps()
further down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put the call to reversed
to yield them in insertion order. The IntervalSet
returns a list in reverse-insertion order, and reversed
iterates backwards through the list without copying, so basically no overhead. I don't feel strongly, the point of this method was to have a lightweight method for yielding all the intervals without concern for order.
# to start | ||
return tree.any_overlaps(interval.start + 1, interval.end) | ||
|
||
def iter_overlaps(self, interval: Span) -> Iterator[SpanType]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new method isn't much different in behavior than get_overlaps
so I don't think we need both publicly available. Although this one looks lazy, it isn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Responding to @clintval and @msto in the same place since you made similar comments:
- It's not intended to be lazy, just unopinionated about the container and sort order. If you just want to iterate over all the overlaps in any order, this saves processing and memory, as well as keeping any duplicates (as far as
__hash__
is concernced). - In another project I had to write my own subclass with duplicate overlap-getting logic because I explicitly wanted duplicates and couldn't get them otherwise. But I can't exactly change the behavior of
get_overlaps
without breaking the API. - To me this design seemed like a compromise that kept all the overlap logic in one place while allowing a lighter-weight call for those who wanted it. But it's outside the remit of the originating issue, so I could punt and remove this if the consensus is that it's unwanted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd usually favor including those as options in get_overlaps()
instead of having a separate public method.
e.g.
def get_overlaps(
self,
interval: Span,
sort_overlaps: bool = True,
include_duplicates: bool = False,
) -> list[SpanType]:
I agree with Clint that returning an Iterator
implies lazy evaluation. If the method can be lazy, I think our usual pattern is to return an iterator and allow the user to convert to list
at the call site. Otherwise, the method should return list
.
), | ||
) | ||
return sorted( | ||
set(self.iter_overlaps(interval)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious about the set
here - how does IntervalSet
handle duplicate intervals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IntervalSet
will only be storing indices into the intervals list that the OverlapDetector
stores. So it won't even be aware of any exact duplicates (i.e. even the index is the same). As for intervals with duplicate start and stop positions, it stores them correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will depend on the hash of the object you place in the OverlapDetector
, since the overlap detector is generic now.
If you send in a dataclass with frozen=True
, then all fields will be used as a part of the hash. However, if you make a custom interval-like object and define your own hash method, then that hash method will be used.
Both BedRecord
and Interval
hash on all their fields.
Mypy should disallow any custom interval-like object that does not have a __hash__()
dunder method:
pybedlite/pybedlite/overlap_detector.py
Line 82 in a63c492
class Span(Hashable, Protocol): |
I don't think a call to set
or sorted
should have been included in the original implementation (a user should have the choice of "unique-ifying" later, or sorting later) but because they are already a part of the implementation details, it makes sense to preserve behavior:
pybedlite/pybedlite/overlap_detector.py
Lines 362 to 374 in 9c4990e
intervals: Set[SpanType] = { | |
ref_intervals[index] | |
for _, _, index in tree.overlap(interval.refname, interval.start, interval.end) | |
} | |
return sorted( | |
intervals, | |
key=lambda intv: ( | |
intv.start, | |
intv.end, | |
self._negative(intv), | |
intv.refname, | |
), | |
) |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #42 +/- ##
==========================================
+ Coverage 95.25% 95.44% +0.19%
==========================================
Files 8 8
Lines 674 681 +7
Branches 119 119
==========================================
+ Hits 642 650 +8
Misses 18 18
+ Partials 14 13 -1 ☔ View full report in Codecov by Sentry. |
Also add an overlap iterator method that does not enforce uniqueness or sort output intervals.