Migrate interval tree from cgranges to superintervals #42

TedBrookings · 2024-12-03T16:16:49Z

Also add an overlap iterator method that does not enforce uniqueness or sort output intervals.

nh13 · 2024-12-03T16:52:11Z

I'd like to write a little test suite to make sure performance isn't affected...

clintval

I think we owe it to ourselves to run at least one performance benchmark on large, real data, since superintervals is brand new and I haven't seen any public benchmarks yet.

.github/workflows/wheels.yml

.gitmodules

clintval · 2024-12-03T16:52:42Z

pybedlite/overlap_detector.py

@@ -322,18 +322,38 @@ def overlaps_any(self, interval: Span) -> bool:
            True if and only if the given interval overlaps with any interval in this
            detector.
        """
-        tree = self._refname_to_tree.get(interval.refname)
+        tree = self._refname_to_tree.get(interval.refname, None)


Doesn't .get() already return None as the default empty value?

Yes. I tend to prefer explicit Nones in this kind of use case and changed it without thinking much. I can revert if this is an issue.

No worries, I don't really have an opinion. Explicit is nice.
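For reference, a minimal sketch of the equivalence discussed in this thread. The dict below is only a stand-in for _refname_to_tree, and its keys and values are made up for illustration:

```python
# Hypothetical stand-in for the detector's refname -> tree mapping.
refname_to_tree = {"chr1": "tree-for-chr1"}

implicit = refname_to_tree.get("chr2")        # default is None when omitted
explicit = refname_to_tree.get("chr2", None)  # same result, spelled out

# Both spellings also agree when the key is present.
present_implicit = refname_to_tree.get("chr1")
present_explicit = refname_to_tree.get("chr1", None)
```

Either spelling behaves identically, so the choice is purely stylistic.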

clintval · 2024-12-03T16:54:31Z

pybedlite/overlap_detector.py

+            for index in reversed(tree.find_overlaps(interval.start + 1, interval.end)):
+                yield ref_intervals[index]


The call to reversed won't make this lazy despite its use in a generator.

Are you doing this to preserve original order?

IMHO order wouldn't matter to me.

Or if it did, I'd want the same order as is forced in get_overlaps() further down.

I added the call to reversed so that they are yielded in insertion order. The IntervalSet returns a list in reverse-insertion order, and reversed iterates backwards through the list without copying, so there is basically no overhead. I don't feel strongly about it; the point of this method was to have a lightweight way of yielding all the intervals without concern for order.
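A small sketch of both points raised in this thread, assuming (as described above) that the tree query returns a fully built list. find_overlaps_stub and iter_overlaps_stub are hypothetical stand-ins, not the real superintervals or pybedlite calls:

```python
from typing import Iterator, List

calls: List[str] = []

def find_overlaps_stub() -> List[int]:
    """Hypothetical stand-in for the tree query: builds the whole list up front."""
    calls.append("find_overlaps")
    return [2, 1, 0]  # reverse-insertion order, as described above

def iter_overlaps_stub(ref_intervals: List[str]) -> Iterator[str]:
    # reversed() walks the existing list backwards without copying it, but
    # the list itself was already fully materialised by the query.
    for index in reversed(find_overlaps_stub()):
        yield ref_intervals[index]

gen = iter_overlaps_stub(["a", "b", "c"])
calls_before_first_item = list(calls)  # generator body has not started yet
first = next(gen)
calls_after_first_item = list(calls)   # the full query ran to yield one item
rest = list(gen)
```

So the generator defers the query until the first item is requested, but it cannot stream results one at a time: the whole overlap list exists before anything is yielded.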

pyproject.toml

clintval · 2024-12-03T16:56:38Z

pybedlite/overlap_detector.py

+            # to start
+            return tree.any_overlaps(interval.start + 1, interval.end)
+
+    def iter_overlaps(self, interval: Span) -> Iterator[SpanType]:


This new method isn't much different in behavior than get_overlaps so I don't think we need both publicly available. Although this one looks lazy, it isn't.

Responding to @clintval and @msto in the same place since you made similar comments:

It's not intended to be lazy, just unopinionated about the container and sort order. If you just want to iterate over all the overlaps in any order, this saves processing and memory, as well as keeping any duplicates (as far as __hash__ is concerned).

In another project I had to write my own subclass with duplicate overlap-getting logic because I explicitly wanted duplicates and couldn't get them otherwise. But I can't exactly change the behavior of get_overlaps without breaking the API.

To me this design seemed like a compromise that kept all the overlap logic in one place while allowing a lighter-weight call for those who wanted it. But it's outside the remit of the originating issue, so I could punt and remove this if the consensus is that it's unwanted.

I'd usually favor including those as options in get_overlaps() instead of having a separate public method.

e.g.

def get_overlaps(
    self,
    interval: Span,
    sort_overlaps: bool = True,
    include_duplicates: bool = False,
) -> list[SpanType]:

I agree with Clint that returning an Iterator implies lazy evaluation. If the method can be lazy, I think our usual pattern is to return an iterator and allow the user to convert to list at the call site. Otherwise, the method should return list.
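To make the option-based design concrete, here is a rough standalone sketch. It is not the pybedlite implementation: the Interval class is hypothetical, a brute-force scan stands in for the interval-tree query, and the strand component of the real sort key is left out.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Interval:
    """Hypothetical minimal interval; 0-based, half-open coordinates."""
    refname: str
    start: int
    end: int

def get_overlaps(
    intervals: List[Interval],
    query: Interval,
    sort_overlaps: bool = True,
    include_duplicates: bool = False,
) -> List[Interval]:
    # Brute-force scan standing in for the interval-tree query.
    hits = [
        iv
        for iv in intervals
        if iv.refname == query.refname and iv.start < query.end and query.start < iv.end
    ]
    if not include_duplicates:
        # De-duplicate by __hash__/__eq__ while keeping first-seen order.
        hits = list(dict.fromkeys(hits))
    if sort_overlaps:
        hits.sort(key=lambda iv: (iv.start, iv.end, iv.refname))
    return hits

stored = [Interval("chr1", 10, 20), Interval("chr1", 5, 15), Interval("chr1", 10, 20)]
query = Interval("chr1", 12, 13)
default_hits = get_overlaps(stored, query)
raw_hits = get_overlaps(stored, query, sort_overlaps=False, include_duplicates=True)
```

With the defaults, callers keep the existing sorted, de-duplicated behaviour; passing both flags gives the lighter-weight result in storage order with duplicates retained, and the return type stays a list either way.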

msto · 2024-12-03T17:02:34Z

pybedlite/overlap_detector.py

-                ),
-            )
+        return sorted(
+            set(self.iter_overlaps(interval)),


Curious about the set here - how does IntervalSet handle duplicate intervals?

IntervalSet only stores indices into the intervals list that the OverlapDetector holds, so it is never even aware of exact duplicates (i.e. entries where even the index is the same). Intervals with duplicate start and stop positions are stored correctly.

It will depend on the hash of the object you place in the OverlapDetector, since the overlap detector is generic now.

If you send in a dataclass with frozen=True, then all fields will be used as a part of the hash. However, if you make a custom interval-like object and define your own hash method, then that hash method will be used.

Both BedRecord and Interval hash on all their fields.
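As an illustration of how the object's hash drives de-duplication in a set, here is a short sketch. NamedSpan and CoordSpan are hypothetical classes, not pybedlite types; note that set membership relies on __eq__ as well as __hash__.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NamedSpan:
    """Hypothetical span: frozen dataclass, so __eq__ and __hash__ use all fields."""
    refname: str
    start: int
    end: int
    name: str

class CoordSpan:
    """Hypothetical span whose custom __eq__/__hash__ ignore `name`."""

    def __init__(self, refname: str, start: int, end: int, name: str) -> None:
        self.refname, self.start, self.end, self.name = refname, start, end, name

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, CoordSpan):
            return NotImplemented
        return (self.refname, self.start, self.end) == (other.refname, other.start, other.end)

    def __hash__(self) -> int:
        return hash((self.refname, self.start, self.end))

# Same coordinates, different names.
named = {NamedSpan("chr1", 1, 10, "a"), NamedSpan("chr1", 1, 10, "b")}
coord = {CoordSpan("chr1", 1, 10, "a"), CoordSpan("chr1", 1, 10, "b")}
```

The frozen dataclass keeps both records because the name field differs, while the custom class collapses them into one, so the same set call can silently drop records depending on the type stored in the detector.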

Mypy should disallow any custom interval-like object that does not have a __hash__() dunder method:

pybedlite/pybedlite/overlap_detector.py

Line 82 in a63c492

class Span(Hashable, Protocol):
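A rough runtime illustration of why hashability matters for this protocol. Only the class line above comes from the source; the three properties, both dataclasses, and the is_hashable helper are assumptions made for the sketch, and the example checks runtime behaviour only, not what mypy reports.

```python
from collections.abc import Hashable
from dataclasses import dataclass
from typing import Protocol

class Span(Hashable, Protocol):
    """Sketch of a hashable span protocol; the members are assumed here."""

    @property
    def refname(self) -> str: ...

    @property
    def start(self) -> int: ...

    @property
    def end(self) -> int: ...

@dataclass(frozen=True)
class FrozenInterval:
    refname: str
    start: int
    end: int

@dataclass  # eq=True without frozen=True sets __hash__ to None
class MutableInterval:
    refname: str
    start: int
    end: int

def is_hashable(obj: object) -> bool:
    """Return True if hash(obj) succeeds."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

frozen_ok = is_hashable(FrozenInterval("chr1", 1, 10))
mutable_ok = is_hashable(MutableInterval("chr1", 1, 10))
```

The non-frozen dataclass cannot be placed in a set at all, which is the kind of object the Hashable base is meant to rule out.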

I don't think a call to set or sorted should have been included in the original implementation (a user should have the choice of "unique-ifying" later, or sorting later) but because they are already a part of the implementation details, it makes sense to preserve behavior:

pybedlite/pybedlite/overlap_detector.py

Lines 362 to 374 in 9c4990e

        intervals: Set[SpanType] = {
            ref_intervals[index]
            for _, _, index in tree.overlap(interval.refname, interval.start, interval.end)
        }
        return sorted(
            intervals,
            key=lambda intv: (
                intv.start,
                intv.end,
                self._negative(intv),
                intv.refname,
            ),
        )

codecov · 2024-12-03T19:10:57Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.44%. Comparing base (9c4990e) to head (a63c492).

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #42      +/-   ##
==========================================
+ Coverage   95.25%   95.44%   +0.19%     
==========================================
  Files           8        8              
  Lines         674      681       +7     
  Branches      119      119              
==========================================
+ Hits          642      650       +8     
  Misses         18       18              
+ Partials       14       13       -1


TedBrookings added 2 commits December 3, 2024 11:13

Add failing unit test

8ebed61

Add IntervalSet and replace use of cgranges

25201bd

TedBrookings had a problem deploying to github-action-ci December 3, 2024 16:22 — with GitHub Actions Failure

TedBrookings had a problem deploying to github-action-ci December 3, 2024 16:22 — with GitHub Actions Error

TedBrookings had a problem deploying to github-action-ci December 3, 2024 16:22 — with GitHub Actions Failure

TedBrookings had a problem deploying to github-action-ci December 3, 2024 16:35 — with GitHub Actions Error

TedBrookings had a problem deploying to github-action-ci December 3, 2024 16:35 — with GitHub Actions Failure

TedBrookings temporarily deployed to github-action-ci December 3, 2024 16:40 — with GitHub Actions Inactive

Remove cgranges and fix README

641a51e

TedBrookings force-pushed the tb-superintervals branch from 4877a18 to 641a51e Compare December 3, 2024 16:44

TedBrookings temporarily deployed to github-action-ci December 3, 2024 16:44 — with GitHub Actions Inactive

TedBrookings marked this pull request as ready for review December 3, 2024 16:46

TedBrookings requested review from nh13 and tfenne as code owners December 3, 2024 16:46

TedBrookings requested review from msto and clintval December 3, 2024 16:48

clintval reviewed Dec 3, 2024

msto reviewed Dec 3, 2024

Respond to some reviewer comments

a63c492

TedBrookings temporarily deployed to github-action-ci December 3, 2024 18:08 — with GitHub Actions Inactive
