Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate interval tree from cgranges to superintervals #42

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

TedBrookings
Copy link

Also add an overlap iterator method that does not enforce uniqueness or sort output intervals.

@TedBrookings TedBrookings marked this pull request as ready for review December 3, 2024 16:46
@TedBrookings TedBrookings requested review from msto and clintval December 3, 2024 16:48
@nh13
Copy link
Member

nh13 commented Dec 3, 2024

I'd like to write a little test suite to make sure performance isn't affected...

Copy link
Member

@clintval clintval left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we owe it to ourselves to do at least 1 performance benchmark on large sized real data since superintervals is brand new and I haven't seen any public benchmarks yet.

.github/workflows/wheels.yml Outdated Show resolved Hide resolved
.github/workflows/wheels.yml Outdated Show resolved Hide resolved
.gitmodules Show resolved Hide resolved
@@ -322,18 +322,38 @@ def overlaps_any(self, interval: Span) -> bool:
True if and only if the given interval overlaps with any interval in this
detector.
"""
tree = self._refname_to_tree.get(interval.refname)
tree = self._refname_to_tree.get(interval.refname, None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't .get() already return None as the default empty value?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I tend to prefer explicit Nones in this kind of use case and changed it without thinking much. I can revert if this is an issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, I don't really have an opinion. Explicit is nice.

Comment on lines +355 to +356
for index in reversed(tree.find_overlaps(interval.start + 1, interval.end)):
yield ref_intervals[index]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call to reversed won't make this lazy despite its use in a generator.

Are you doing this to preserve original order?

IMHO order wouldn't matter to me.

Or if it did, I'd want the same order as is forced in get_overlaps() further down.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put the call to reversed to yield them in insertion order. The IntervalSet returns a list in reverse-insertion order, and reversed iterates backwards through the list without copying, so basically no overhead. I don't feel strongly, the point of this method was to have a lightweight method for yielding all the intervals without concern for order.

pyproject.toml Outdated Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
pyproject.toml Show resolved Hide resolved
# to start
return tree.any_overlaps(interval.start + 1, interval.end)

def iter_overlaps(self, interval: Span) -> Iterator[SpanType]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new method isn't much different in behavior than get_overlaps so I don't think we need both publicly available. Although this one looks lazy, it isn't.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responding to @clintval and @msto in the same place since you made similar comments:

  • It's not intended to be lazy, just unopinionated about the container and sort order. If you just want to iterate over all the overlaps in any order, this saves processing and memory, as well as keeping any duplicates (as far as __hash__ is concernced).
  • In another project I had to write my own subclass with duplicate overlap-getting logic because I explicitly wanted duplicates and couldn't get them otherwise. But I can't exactly change the behavior of get_overlaps without breaking the API.
  • To me this design seemed like a compromise that kept all the overlap logic in one place while allowing a lighter-weight call for those who wanted it. But it's outside the remit of the originating issue, so I could punt and remove this if the consensus is that it's unwanted.

Copy link
Contributor

@msto msto Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd usually favor including those as options in get_overlaps() instead of having a separate public method.

e.g.

def get_overlaps(
    self,
    interval: Span,
    sort_overlaps: bool = True,
    include_duplicates: bool = False,
) -> list[SpanType]:

I agree with Clint that returning an Iterator implies lazy evaluation. If the method can be lazy, I think our usual pattern is to return an iterator and allow the user to convert to list at the call site. Otherwise, the method should return list.

pybedlite/overlap_detector.py Show resolved Hide resolved
),
)
return sorted(
set(self.iter_overlaps(interval)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious about the set here - how does IntervalSet handle duplicate intervals?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IntervalSet will only be storing indices into the intervals list that the OverlapDetector stores. So it won't even be aware of any exact duplicates (i.e. even the index is the same). As for intervals with duplicate start and stop positions, it stores them correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will depend on the hash of the object you place in the OverlapDetector, since the overlap detector is generic now.

If you send in a dataclass with frozen=True, then all fields will be used as a part of the hash. However, if you make a custom interval-like object and define your own hash method, then that hash method will be used.

Both BedRecord and Interval hash on all their fields.

Mypy should disallow any custom interval-like object that does not have a __hash__() dunder method:

class Span(Hashable, Protocol):

I don't think a call to set or sorted should have been included in the original implementation (a user should have the choice of "unique-ifying" later, or sorting later) but because they are already a part of the implementation details, it makes sense to preserve behavior:

intervals: Set[SpanType] = {
ref_intervals[index]
for _, _, index in tree.overlap(interval.refname, interval.start, interval.end)
}
return sorted(
intervals,
key=lambda intv: (
intv.start,
intv.end,
self._negative(intv),
intv.refname,
),
)

Copy link

codecov bot commented Dec 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.44%. Comparing base (9c4990e) to head (a63c492).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #42      +/-   ##
==========================================
+ Coverage   95.25%   95.44%   +0.19%     
==========================================
  Files           8        8              
  Lines         674      681       +7     
  Branches      119      119              
==========================================
+ Hits          642      650       +8     
  Misses         18       18              
+ Partials       14       13       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants