Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extra sample nodes being created from .vcz file when ploidies are uneven #952

Open
hyanwong opened this issue Aug 20, 2024 · 10 comments
Open

Comments

@hyanwong
Copy link
Member

hyanwong commented Aug 20, 2024

If we have a mix of ploidies in a .vcz file, we create sample nodes for the maximum ploidy size for each individual, that just attach to the root (because they have no data). I.e. we make a huge polytomy at the root of "fake" samples. I'm not even sure how we identify these samples to delete them.

There are a couple of people at the Oslo workshop who are working on haplodiploids, and others who are working on mixed policy systems in plants (e.g. @situssog), so this would be good to fix.

Here's an example:

import tempfile
import os
import sys
import subprocess
import msprime
import pysam
from Bio import bgzf


# make some haploids, diploids, and tetraploids
ts = msprime.sim_ancestry(
    [msprime.SampleSet(2, ploidy=1), msprime.SampleSet(3, ploidy=2), msprime.SampleSet(1, ploidy=4)],
    population_size=1e3,
    sequence_length=1e4,
    recombination_rate=1e-6,
)
ts = msprime.sim_mutations(ts, rate=1e-7)
zarr_file_name = "tmp.vcz"
with tempfile.TemporaryDirectory() as tmpdirname:
    vcf_name = os.path.join(tmpdirname, zarr_file_name.replace("/", "") + ".vcf.gz")
    with bgzf.open(vcf_name, "wt") as f:
        ts.write_vcf(f)
    pysam.tabix_index(vcf_name, preset="vcf")
    subprocess.run([sys.executable, "-m", "bio2zarr", "vcf2zarr", "convert", "--force", vcf_name, zarr_file_name])

vdata = tsinfer.VariantData("tmp.vcz", ancestral_allele=np.array([s.ancestral_state for s in ts.sites()]))
inferred_ts = tsinfer.infer(vdata)
print("ts.num_samples:", ts.num_samples, "-- inferred_ts.num_samples:", inferred_ts.num_samples)
@benjeffery
Copy link
Member

Yes, mixed ploidy isn't supported for now. I guess detecting the not-present samples and then masking them out is the only way to do this.

@hyanwong
Copy link
Member Author

Ah yes: https://sgkit-dev.github.io/sgkit/0.5.0/vcf.html#polyploid-and-mixed-ploidy-vcf. It looks like vcf2zarr doesn't have a mixed_ploidy option like sgkit.io.vcf.vcf_to_zarr does. I assume this is something we eventually plan to incorporate? And in the same way, we would set a sentinel of -2 for each genotype call?

I guess detecting the not-present samples and then masking them out is the only way to do this.

Do you mean the only way to do this at the moment, or the only foreseeable way forward in the long term?

@jeromekelleher
Copy link
Member

Mixed_ploidy isn't a property of the dataset, and was introduced in sgkit as a way of forcing calls to appear as if they are not mixed ploidy (from the output of certain variant callers on human datasets). I think we would need to tell tsinfer explicitly what ploidy to expect for each sample to support this properly.

We could add a ploidy= argument in SampleData, which could be a single number or an array (one value for each sample)?

@hyanwong
Copy link
Member Author

hyanwong commented Sep 1, 2024

The .vcz file "knows" the ploidy in each case, though, as it has a -2 in the right place. E.g. here is the call_genotype array for the example above:

array([[[ 1, -2, -2, -2],
        [ 1, -2, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  1, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  1,  0,  0]],

       [[ 0, -2, -2, -2],
        [ 0, -2, -2, -2],
        [ 0,  0, -2, -2],
        [ 1,  0, -2, -2],
        [ 0,  0, -2, -2],
        [ 1,  0,  0,  1]],

       [[ 0, -2, -2, -2],
        [ 0, -2, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  1, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  0,  0,  0]],

       [[ 0, -2, -2, -2],
        [ 0, -2, -2, -2],
        [ 1,  1, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  1, -2, -2],
        [ 0,  0,  0,  1]],

       [[ 0, -2, -2, -2],
        [ 0, -2, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  0, -2, -2],
        [ 0,  0,  0,  1]]], dtype=int8)

Is it not just a question of the VariantData object correctly inferring the ploidy of each sample from the -2 values?

@jeromekelleher
Copy link
Member

That involves a full scan of the data, though, as you have to examine every element of the genotype matrix to find any -2s. That's definitely not an operation we want to be done "implicitly".

It's not that big a chore for a user who already knows the ploidy of their samples to supply it as an argument.

@hyanwong
Copy link
Member Author

hyanwong commented Sep 2, 2024

We could have a flag in VariantData to say to "check the first variant for non--2 values and use that as the ploidy"? I don't think -2 is used for other stuff, is it (missing data is -1). The users I was speaking to will probably just be pulling in VCFs from somewhere else, and I suspect it would be a tedious extra step to find the sample IDs and figure out the ploidy, when it's all in the VCF anyway.

Some other options:

  1. a VariantData function, check_ploidy that "fixes" the problem on an optional basis
  2. some sort of check when we do go though the variants (e.g. when building ancestors or matching samples) that the ploidies are correct?

@hyanwong
Copy link
Member Author

hyanwong commented Sep 2, 2024

Another option, which makes @jeromekelleher's suggestion easier:

  1. A function which extracts the ploidies for all samples from a single variant, which could then be used as the input to the VariantData constructor, without having to figure out sample IDs etc.
data = tsinfer.VariantData(
    "tmp.vcz",
    ploidies = tsinfer.ploidies_from_vcz("tmp.vcz", variant_id=0)
)

@jeromekelleher
Copy link
Member

Option 3 is straightforward enough all right.

@hyanwong
Copy link
Member Author

hyanwong commented Sep 2, 2024

It might also be a good idea to warn on creation of the VariantData object if the first variant has any -2 values and a ploidy argument is not specified?

@jeromekelleher
Copy link
Member

Can't do it on create - it involves reading all values and can take 10s of minutes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants