Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sample mask #896

Merged
merged 11 commits into from
May 15, 2024
Merged

Add sample mask #896

merged 11 commits into from
May 15, 2024

Conversation

benjeffery
Copy link
Member

Needs a couple of extra tests for weird masks, but mostly there.

Copy link

codecov bot commented Feb 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.04%. Comparing base (2cf0975) to head (56df800).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #896   +/-   ##
=======================================
  Coverage   87.04%   87.04%           
=======================================
  Files           5        5           
  Lines        1767     1767           
  Branches      310      310           
=======================================
  Hits         1538     1538           
  Misses        140      140           
  Partials       89       89           
Flag Coverage Δ
C 87.04% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I think the chunk_iterator could be a bit simpler and more generic by having a mask=[dim0mask, dim1mask] argument, and each dimension mask defaults to None (np.ones in that dimension), but we can log that as a follow up issue.

@hyanwong
Copy link
Member

hyanwong commented Feb 5, 2024

Did we ever get to the root of the knotty question of whether to try to make the meaning of mask=1 and mask=0 the same between SGkit and tsinfer? I worry that they have opposite meanings at the moment. (SGkit sets 1 to mask out something, whereas our approach has been to use 1 to include a region. What convention do other bioinformatics pipelines use?

@jeromekelleher
Copy link
Member

Good point. We want to follow sgkit, no point in thinking any harder than that.

@benjeffery
Copy link
Member Author

Did we ever get to the root of the knotty question of whether to try to make the meaning of mask=1 and mask=0 the same between SGkit and tsinfer? I worry that they have opposite meanings at the moment. (SGkit sets 1 to mask out something, whereas our approach has been to use 1 to include a region. What convention do other bioinformatics pipelines use?

Yeah, I plan to flip the masks. Will file an issue for it.

@benjeffery benjeffery mentioned this pull request Feb 6, 2024
@benjeffery benjeffery marked this pull request as ready for review March 5, 2024 12:52
@benjeffery
Copy link
Member Author

Would appreciate a quick review here. Seems to all be working at GeL and BMRC.

def samples_mask(self):
# Samples in sgkit are individuals in tskit, so we need to expand
# the mask to cover all the samples for each individual.
return np.repeat(self.individuals_mask, self.ploidy)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of making a new cached array, can't you broadcast a view into the individuals_mask instead, so you don't need to make a copy?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe not actually, since this is row-wise. Ignore me.

Copy link
Member

@hyanwong hyanwong Mar 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I can't actually see anywhere that samples_mask is used in the code (although it is in the tests)? Am I missing something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this isn't used directly, but thought it would be useful thing to have.

@@ -2297,9 +2299,9 @@ def __init__(self, path):
self.path = path
self.data = zarr.open(path, mode="r")
genotypes_arr = self.data["call_genotype"]
_, self._num_individuals, self.ploidy = genotypes_arr.shape
_, self._num_unmasked_individuals, self.ploidy = genotypes_arr.shape
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I interpreted "unmasked" as the number of individuals without the mask set, but instead it's the total number of individuals before masking. Would a better name be e.g. _total_num_individuals or num_individuals_premask, or something else maybe (also below for _num_unmasked_samples

@@ -2445,9 +2460,9 @@ def provenances_record(self):
except KeyError:
return np.array([], dtype=object)

@property
@functools.cached_property
def num_samples(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth documenting this as the number of samples that have not been masked out (equivalent to the total number of samples in the dataset if there is no masking)?

Copy link
Member

@hyanwong hyanwong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, modulo naming (and I'm not sure you use samples_mask, so do we actually need it?)

@jeromekelleher
Copy link
Member

Would it be simpler to assume that there is always an individual and site mask, and just try to preserve the old API with those imposed? From tsinfer's perspective, we don't care about stuff that has been masked out, and if you want to look at the raw data you go back to sgkit.

So, num_individuals and num_sites etc is what is left in the dataset after we've masked out stuff, and the actual masks are internal implementation details.

@benjeffery
Copy link
Member Author

Would it be simpler to assume that there is always an individual and site mask, and just try to preserve the old API with those imposed? From tsinfer's perspective, we don't care about stuff that has been masked out, and if you want to look at the raw data you go back to sgkit.

So, num_individuals and num_sites etc is what is left in the dataset after we've masked out stuff, and the actual masks are internal implementation details.

Yes, I think that would be cleaner, will switch it over.

@jeromekelleher
Copy link
Member

jeromekelleher commented Mar 12, 2024

I think this is a nice pattern we could adopt here for the general iteration: https://github.com/pystatgen/vcf-zarr-publication/blob/a01de00e36d0918a7e47fb2f8c6b3a4fd810eb66/src/zarr_afdist.py#L110

So, for iterating over haplotypes we'd do:

    # Use zarr arrays to get mask chunks aligned with the main data
    # for convenience.
    z_variant_mask = zarr.array(
        variant_mask, chunks=call_genotype.chunks[0], dtype=np.int8
    )
    for v_chunk in range(call_genotype.cdata_shape[0]):
        variant_mask_chunk = z_variant_mask.blocks[v_chunk]
        count = np.sum(variant_mask_chunk)
        if count > 0:
            v_chunk = call_genotype.blocks[v_chunk]
            for j, row in enumerate(v_chunk):
                   if variant_mask_chunk[j]:
                       yield row[sample_mask]

Copy link
Contributor

mergify bot commented Apr 25, 2024

⚠️ The sha of the head commit of this PR conflicts with #900. Mergify cannot evaluate rules on this PR. ⚠️

@benjeffery
Copy link
Member Author

This pretty much done - the test failure is odd, can't immediately recreate so will make the exact env that is failing.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I think we need to fix the terminology here as it's horribly confusing. Let's change all the things that we currently have as x_mask to x_select, and reserve the work "mask" to specifically mean "mask something out if true". Also change unmasked_x to selected_x, I think would be a lot easier to follow.

tsinfer/formats.py Outdated Show resolved Hide resolved
@benjeffery
Copy link
Member Author

Ok I've done the mask renaming.
Still unable to recreate the odd failure we're getting AttributeError: 'dict' object has no attribute 'astype' deep in some zarr code. Will try rolling back zarr.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a couple of simplifications and minor comment. Good to merge then.

if self._sites_mask_name is None:
return np.full(self.data["variant_position"].shape, True, dtype=bool)
else:
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically all of the logic for this method could be moved to the __init__, couldn't it? That would make it possible to catch these kinds of errors at init time rather than later on.

Why not also just store the sites_select array then rather than faffing with a cached_property? This is a read-only view, isn't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 56df800

@@ -2333,6 +2355,26 @@ def sequence_length(self):
def num_sites(self):
return self._num_sites

@functools.cached_property
def individuals_select(self):
if self._sgkit_samples_mask_name is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as sites_select - can we just compute and store this at init time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 56df800

@@ -305,7 +305,7 @@ def zarr_summary(array):
return ret


def chunk_iterator(array, indexes=None, mask=None, dimension=0):
def chunk_iterator(array, indexes=None, mask=None, orthogonal_select=None, dimension=0):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mask should be select here, it's being used in the wrong sense currently.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 3fdf3b9

@benjeffery
Copy link
Member Author

@Mergifyio rebase

Copy link
Contributor

mergify bot commented May 15, 2024

rebase

☑️ Nothing to do

  • -conflict [📌 rebase requirement]
  • -closed [📌 rebase requirement]
  • queue-position=-1 [📌 rebase requirement]
  • any of:
    • #commits-behind>0 [📌 rebase requirement]
    • #commits>1 [📌 rebase requirement]
    • -linear-history [📌 rebase requirement]

@benjeffery
Copy link
Member Author

Comments addressed.

@mergify mergify bot merged commit 1d45c0c into tskit-dev:main May 15, 2024
14 checks passed
@benjeffery benjeffery deleted the sample_mask branch May 16, 2024 09:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants