Add sample mask #896

benjeffery · 2024-02-02T23:39:27Z

Needs a couple of extra tests for weird masks, but mostly there.

codecov · 2024-02-03T00:12:05Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.04%. Comparing base (2cf0975) to head (56df800).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #896   +/-   ##
=======================================
  Coverage   87.04%   87.04%           
=======================================
  Files           5        5           
  Lines        1767     1767           
  Branches      310      310           
=======================================
  Hits         1538     1538           
  Misses        140      140           
  Partials       89       89

Flag	Coverage Δ
C	`87.04% <ø> (ø)`

jeromekelleher

Nice. I think the chunk_iterator could be a bit simpler and more generic by having a mask=[dim0mask, dim1mask] argument, and each dimension mask defaults to None (np.ones in that dimension), but we can log that as a follow up issue.
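A minimal sketch of what a per-dimension mask argument could look like, assuming a plain NumPy array in place of a zarr array. The helper name `iter_selected_chunks` and the `chunk_size` parameter are hypothetical and are not the tsinfer `chunk_iterator` API. Following the `np.ones` default mentioned above, True keeps an entry here.

```python
import numpy as np


def iter_selected_chunks(array, chunk_size, mask=None):
    """Yield chunks along dimension 0, keeping only the flagged entries.

    `mask` is a list with one boolean array (or None) per dimension.
    None means "keep everything" in that dimension. Hypothetical helper,
    not the tsinfer chunk_iterator.
    """
    if mask is None:
        mask = [None] * array.ndim
    # Replace each missing per-dimension mask with an all-True array.
    mask = [
        np.ones(size, dtype=bool) if m is None else np.asarray(m, dtype=bool)
        for m, size in zip(mask, array.shape)
    ]
    for start in range(0, array.shape[0], chunk_size):
        stop = min(start + chunk_size, array.shape[0])
        rows = mask[0][start:stop]
        if not rows.any():
            continue  # nothing kept in this chunk, so skip it entirely
        chunk = array[start:stop][rows]
        # Apply the remaining dimensions one at a time.
        for dim in range(1, array.ndim):
            chunk = np.compress(mask[dim], chunk, axis=dim)
        yield chunk


a = np.arange(24).reshape(6, 4)
row_mask = [True, False, False, False, True, True]
chunks = list(iter_selected_chunks(a, 2, mask=[row_mask, None]))
```

With a chunk size of 2, the middle chunk has no kept rows and is never materialised, which is the main benefit of checking the mask before touching the data.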

hyanwong · 2024-02-05T09:53:36Z

Did we ever get to the root of the knotty question of whether to try to make the meaning of mask=1 and mask=0 the same between SGkit and tsinfer? I worry that they have opposite meanings at the moment. (SGkit sets 1 to mask out something, whereas our approach has been to use 1 to include a region.) What convention do other bioinformatics pipelines use?

jeromekelleher · 2024-02-05T09:58:55Z

Good point. We want to follow sgkit, no point in thinking any harder than that.

benjeffery · 2024-02-05T11:58:31Z

> Did we ever get to the root of the knotty question of whether to try to make the meaning of mask=1 and mask=0 the same between SGkit and tsinfer? I worry that they have opposite meanings at the moment. (SGkit sets 1 to mask out something, whereas our approach has been to use 1 to include a region.) What convention do other bioinformatics pipelines use?

Yeah, I plan to flip the masks. Will file an issue for it.
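To make the two conventions concrete, here is a small illustrative sketch. The array names are hypothetical. It assumes, as described above, that an sgkit-style mask uses True to mask something out, while the inverse "include" sense uses True to keep it.

```python
import numpy as np

# sgkit-style mask: True means "mask this entry out".
sgkit_mask = np.array([False, True, False, False, True])

# The inverse, "include" sense: True means "keep this entry".
select = ~sgkit_mask

positions = np.array([10, 20, 30, 40, 50])
kept = positions[select]  # the entries that were not masked out
```

Note that `~` is only a logical NOT for boolean arrays. If a mask is stored as 0/1 integers, convert it with `.astype(bool)` first, since `~` on integers is a bitwise complement.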

benjeffery · 2024-03-05T12:53:10Z

Would appreciate a quick review here. Seems to all be working at GeL and BMRC.

hyanwong · 2024-03-05T12:57:27Z

tsinfer/formats.py

+    def samples_mask(self):
+        # Samples in sgkit are individuals in tskit, so we need to expand
+        # the mask to cover all the samples for each individual.
+        return np.repeat(self.individuals_mask, self.ploidy)


Instead of making a new cached array, can't you broadcast a view into the individuals_mask instead, so you don't need to make a copy?

Hmm, maybe not actually, since this is row-wise. Ignore me.

Although I can't actually see anywhere that samples_mask is used in the code (although it is in the tests)? Am I missing something?

Yes, this isn't used directly, but I thought it would be a useful thing to have.
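For reference, a small standalone illustration of the `np.repeat` expansion in the diff above, using made-up values (three diploid individuals). `np.repeat` returns a new array, so the result is a copy and not a view of the per-individual mask.

```python
import numpy as np

ploidy = 2
# One flag per individual (an sgkit "sample").
individuals_mask = np.array([True, False, True])

# One flag per haploid sample (a tskit "sample"): each individual's flag
# is repeated `ploidy` times, in individual order.
samples_mask = np.repeat(individuals_mask, ploidy)
print(samples_mask.tolist())
# [True, True, False, False, True, True]
```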

hyanwong · 2024-03-05T13:16:02Z

tsinfer/formats.py

@@ -2297,9 +2299,9 @@ def __init__(self, path):
        self.path = path
        self.data = zarr.open(path, mode="r")
        genotypes_arr = self.data["call_genotype"]
-        _, self._num_individuals, self.ploidy = genotypes_arr.shape
+        _, self._num_unmasked_individuals, self.ploidy = genotypes_arr.shape


I interpreted "unmasked" as the number of individuals without the mask set, but instead it's the total number of individuals before masking. Would a better name be e.g. _total_num_individuals or num_individuals_premask, or something else maybe (also below for _num_unmasked_samples)?

hyanwong · 2024-03-05T13:17:35Z

tsinfer/formats.py

@@ -2445,9 +2460,9 @@ def provenances_record(self):
        except KeyError:
            return np.array([], dtype=object)

-    @property
+    @functools.cached_property
    def num_samples(self):


Is it worth documenting this as the number of samples that have not been masked out (equivalent to the total number of samples in the dataset if there is no masking)?

hyanwong

LGTM, modulo naming (and I'm not sure you use samples_mask, so do we actually need it?)

jeromekelleher · 2024-03-05T15:24:18Z

Would it be simpler to assume that there is always an individual and site mask, and just try to preserve the old API with those imposed? From tsinfer's perspective, we don't care about stuff that has been masked out, and if you want to look at the raw data you go back to sgkit.

So, num_individuals and num_sites etc is what is left in the dataset after we've masked out stuff, and the actual masks are internal implementation details.

benjeffery · 2024-03-08T13:57:48Z

> Would it be simpler to assume that there is always an individual and site mask, and just try to preserve the old API with those imposed? From tsinfer's perspective, we don't care about stuff that has been masked out, and if you want to look at the raw data you go back to sgkit.
>
> So, num_individuals and num_sites etc is what is left in the dataset after we've masked out stuff, and the actual masks are internal implementation details.

Yes, I think that would be cleaner, will switch it over.
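A rough sketch of that design, assuming a NumPy genotype array of shape (sites, individuals, ploidy). The class `SelectedView` and its parameter names are hypothetical and are not the tsinfer implementation. True means "keep" here, and the selections are treated as internal details.

```python
import numpy as np


class SelectedView:
    """Hypothetical wrapper: counts reflect what is left after selection."""

    def __init__(self, genotypes, sites_select=None, individuals_select=None):
        num_sites, num_individuals, self.ploidy = genotypes.shape
        # With no selection supplied, everything is kept.
        if sites_select is None:
            sites_select = np.ones(num_sites, dtype=bool)
        if individuals_select is None:
            individuals_select = np.ones(num_individuals, dtype=bool)
        self._genotypes = genotypes
        self._sites_select = np.asarray(sites_select, dtype=bool)
        self._individuals_select = np.asarray(individuals_select, dtype=bool)

    @property
    def num_sites(self):
        return int(np.sum(self._sites_select))

    @property
    def num_individuals(self):
        return int(np.sum(self._individuals_select))

    @property
    def num_samples(self):
        # Every kept individual contributes `ploidy` haploid samples.
        return self.num_individuals * self.ploidy


genotypes = np.zeros((4, 3, 2), dtype=np.int8)
view = SelectedView(
    genotypes,
    sites_select=[True, False, True, True],
    individuals_select=[True, False, True],
)
```

Callers only ever see the post-selection counts, so code written against the old API keeps working whether or not a selection is present.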

jeromekelleher · 2024-03-12T11:09:07Z

I think this is a nice pattern we could adopt here for the general iteration: https://github.com/pystatgen/vcf-zarr-publication/blob/a01de00e36d0918a7e47fb2f8c6b3a4fd810eb66/src/zarr_afdist.py#L110

So, for iterating over haplotypes we'd do:

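    import numpy as np
    import zarr

    # Wrapped in a generator function (illustrative name) so the yield is valid.
    def iter_selected_rows(call_genotype, variant_mask, sample_mask):
        # Use zarr arrays to get mask chunks aligned with the main data
        # for convenience.
        z_variant_mask = zarr.array(
            variant_mask, chunks=call_genotype.chunks[0], dtype=np.int8
        )
        for v_chunk in range(call_genotype.cdata_shape[0]):
            variant_mask_chunk = z_variant_mask.blocks[v_chunk]
            # Only load the genotype chunk if it contains a flagged variant.
            if np.sum(variant_mask_chunk) > 0:
                genotype_chunk = call_genotype.blocks[v_chunk]
                for j, row in enumerate(genotype_chunk):
                    if variant_mask_chunk[j]:
                        yield row[sample_mask]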

mergify · 2024-04-25T12:23:03Z

⚠️ The sha of the head commit of this PR conflicts with #900. Mergify cannot evaluate rules on this PR. ⚠️

benjeffery · 2024-05-14T11:58:59Z

This is pretty much done. The test failure is odd: I can't immediately recreate it, so I will build the exact env that is failing.

jeromekelleher

LGTM, but I think we need to fix the terminology here as it's horribly confusing. Let's change all the things that we currently have as x_mask to x_select, and reserve the word "mask" to specifically mean "mask something out if true". Also change unmasked_x to selected_x, which I think would be a lot easier to follow.

benjeffery · 2024-05-15T13:47:07Z

OK, I've done the mask renaming.
Still unable to recreate the odd failure we're getting: AttributeError: 'dict' object has no attribute 'astype', deep in some zarr code. Will try rolling back zarr.

jeromekelleher

LGTM, a couple of simplifications and a minor comment. Good to merge then.

jeromekelleher · 2024-05-15T15:43:30Z

tsinfer/formats.py

+        if self._sites_mask_name is None:
+            return np.full(self.data["variant_position"].shape, True, dtype=bool)
+        else:
+            try:


Basically all of the logic for this method could be moved to the __init__, couldn't it? That would make it possible to catch these kinds of errors at init time rather than later on.

Why not also just store the sites_select array then rather than faffing with a cached_property? This is a read-only view, isn't it?

Fixed in 56df800
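A hedged sketch of the init-time approach suggested above. The class `SiteSelection` is hypothetical, a plain dict of NumPy arrays stands in for the zarr group, and it assumes the stored mask is True for sites to drop. It is not the code from 56df800.

```python
import numpy as np


class SiteSelection:
    """Hypothetical sketch: resolve the site selection once, in __init__."""

    def __init__(self, data, sites_mask_name=None):
        num_sites = data["variant_position"].shape[0]
        if sites_mask_name is None:
            # No mask configured: keep every site.
            select = np.full(num_sites, True, dtype=bool)
        else:
            try:
                mask = np.asarray(data[sites_mask_name], dtype=bool)
            except KeyError:
                raise ValueError(
                    f"Sites mask {sites_mask_name!r} not found in the dataset"
                ) from None
            if mask.shape != (num_sites,):
                raise ValueError("Sites mask must have one entry per site")
            # Assumed polarity: the stored mask is True for sites to drop.
            select = ~mask
        # Store a plain read-only attribute instead of a cached_property.
        select.flags.writeable = False
        self.sites_select = select


data = {
    "variant_position": np.array([1, 5, 9]),
    "variant_mask": np.array([False, True, False]),
}
selection = SiteSelection(data, sites_mask_name="variant_mask")
```

A missing or wrongly sized mask now fails when the object is constructed, not at the first access of the property.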

jeromekelleher · 2024-05-15T15:44:06Z

tsinfer/formats.py

@@ -2333,6 +2355,26 @@ def sequence_length(self):
    def num_sites(self):
        return self._num_sites

+    @functools.cached_property
+    def individuals_select(self):
+        if self._sgkit_samples_mask_name is None:


Same comment as sites_select - can we just compute and store this at init time?

Fixed in 56df800

jeromekelleher · 2024-05-15T15:45:55Z

tsinfer/formats.py

@@ -305,7 +305,7 @@ def zarr_summary(array):
    return ret


-def chunk_iterator(array, indexes=None, mask=None, dimension=0):
+def chunk_iterator(array, indexes=None, mask=None, orthogonal_select=None, dimension=0):


mask should be select here, it's being used in the wrong sense currently.

Fixed in 3fdf3b9

benjeffery · 2024-05-15T16:11:32Z

@Mergifyio rebase

mergify · 2024-05-15T16:12:06Z

rebase

☑️ Nothing to do

-conflict [📌 rebase requirement]
-closed [📌 rebase requirement]
queue-position=-1 [📌 rebase requirement]
any of:
- #commits-behind>0 [📌 rebase requirement]
- #commits>1 [📌 rebase requirement]
- -linear-history [📌 rebase requirement]

benjeffery · 2024-05-15T17:04:03Z

Comments addressed.

benjeffery force-pushed the sample_mask branch from ddebea8 to e529ac0 Compare February 2, 2024 23:56

jeromekelleher reviewed Feb 5, 2024

benjeffery mentioned this pull request Feb 6, 2024

Flip polarity #899

Closed

benjeffery marked this pull request as ready for review March 5, 2024 12:52

hyanwong reviewed Mar 5, 2024

hyanwong approved these changes Mar 5, 2024

benjeffery mentioned this pull request Apr 25, 2024

Mask names #900

Closed

benjeffery force-pushed the sample_mask branch 4 times, most recently from 8817753 to 5938d9a Compare April 30, 2024 01:12

benjeffery force-pushed the sample_mask branch from e3590b0 to 4f15091 Compare May 13, 2024 22:31

benjeffery mentioned this pull request May 14, 2024

Batch ancestor matching #917

Merged

jeromekelleher reviewed May 15, 2024

benjeffery force-pushed the sample_mask branch from 5ac1b3e to d67e625 Compare May 15, 2024 13:37

benjeffery force-pushed the sample_mask branch from 7ac6607 to e740bd9 Compare May 15, 2024 14:01

jeromekelleher reviewed May 15, 2024

benjeffery and others added 9 commits May 15, 2024 17:37

Add sample mask

234bedb

Flip sgkit mask polarity

2504915

Add mask names

fb79f2d

Fix sample masking error

5f883dd

Extra tests of sample matching to disk with mask

51c75d0

Remove dask sample matching

185c489

Remove slice_haplotypes

350f8ce

Use blocks to interate haplotypes

e88d398

Use less RAM when iterating haplotypes

8e497df

benjeffery force-pushed the sample_mask branch from e740bd9 to 7abd8cb Compare May 15, 2024 16:45

benjeffery added 2 commits May 15, 2024 17:50

Rename positive masks to _select

3fdf3b9

Move selects to init

56df800

benjeffery force-pushed the sample_mask branch from 7abd8cb to 56df800 Compare May 15, 2024 17:03

jeromekelleher approved these changes May 15, 2024

benjeffery added the AUTOMERGE-REQUESTED label May 15, 2024

mergify bot merged commit 1d45c0c into tskit-dev:main May 15, 2024
14 checks passed

mergify bot removed the AUTOMERGE-REQUESTED label May 15, 2024

benjeffery deleted the sample_mask branch May 16, 2024 09:04
