Replies: 4 comments 11 replies
-
Hi @alxsimon - you are right that windows never span contigs. This is the way that the I haven't done it, but it should be possible to define a single genome-wide window, by defining variables such that |
Beta Was this translation helpful? Give feedback.
-
@tomwhite should we add a whole genome window helper to the API for convenience? |
Beta Was this translation helpful? Give feedback.
-
+1 to adding a helper method for this use case. I tried to reproduce the error you got @alxsimon, but I wasn't able to. Does using the >>> import numpy as np
>>> import sgkit as sg
>>> import xarray as xr
>>> from sgkit.io.vcf import vcf_to_zarr
>>>
>>> vcf_to_zarr("sgkit/tests/io/vcf/data/CEUTrio.20.21.gatk3.4.g.vcf.bgz", "output.zarr")
/Users/tom/workspace/sgkit/sgkit/io/vcf/vcf_reader.py:963: MaxAltAllelesExceededWarning: Some alternate alleles were dropped, since actual max value 7 exceeded max_alt_alleles setting of 3.
warnings.warn(
>>>
>>> ds = sg.load_dataset("output.zarr")
>>> ds["sample_cohort"] = xr.DataArray(np.array([0]), dims="samples")
>>>
>>> def window_by_genome(ds):
... ds["window_start"] = (
... ["windows"],
... np.array([0]),
... )
... ds["window_stop"] = (
... ["windows"],
... np.array([ds.dims["variants"]]),
... )
... return ds
...
>>>
>>> ds = window_by_genome(ds)
>>> ds = sg.diversity(ds)
>>> ds
<xarray.Dataset>
Dimensions: (windows: 1, cohorts: 1, variants: 19910, alleles: 4,
samples: 1, ploidy: 2, filters: 2)
Dimensions without coordinates: windows, cohorts, variants, alleles, samples,
ploidy, filters
Data variables: (12/17)
stat_diversity (windows, cohorts) float64 dask.array<chunksize=(1, 1), meta=np.ndarray>
cohort_allele_count (variants, cohorts, alleles) uint64 dask.array<chunksize=(10000, 1, 4), meta=np.ndarray>
call_allele_count (variants, samples, alleles) uint8 dask.array<chunksize=(10000, 1, 4), meta=np.ndarray>
call_genotype (variants, samples, ploidy) int8 dask.array<chunksize=(10000, 1, 2), meta=np.ndarray>
call_genotype_mask (variants, samples, ploidy) bool dask.array<chunksize=(10000, 1, 2), meta=np.ndarray>
call_genotype_phased (variants, samples) bool dask.array<chunksize=(10000, 1), meta=np.ndarray>
... ...
variant_id_mask (variants) bool dask.array<chunksize=(10000,), meta=np.ndarray>
variant_position (variants) int32 dask.array<chunksize=(10000,), meta=np.ndarray>
variant_quality (variants) float32 dask.array<chunksize=(10000,), meta=np.ndarray>
sample_cohort (samples) int64 0
window_start (windows) int64 0
window_stop (windows) int64 19910
Attributes:
contigs: ['20', '21']
filters: ['PASS', 'LowQual']
max_alt_alleles_seen: 7
source: sgkit-0.4.1.dev20+g839eb9a9
vcf_header: ##fileformat=VCFv4.1\n##FILTER=<ID=PASS,Descriptio...
vcf_zarr_version: 0.1
>>>
>>> ds["stat_diversity"].compute()
<xarray.DataArray 'stat_diversity' (windows: 1, cohorts: 1)>
array([[nan]])
Dimensions without coordinates: windows, cohorts
Attributes:
comment: Genetic diversity (also known as "Tajima’s pi") for cohorts. |
Beta Was this translation helpful? Give feedback.
-
Thanks for providing a reproducer @alxsimon. It looks like there is a bug in Thinking about this more though, So, I think in this case to compute divergence, we just want to sum the per-variant Something like this: >>> import numpy as np
>>> import sgkit as sg
>>> ds = sg.simulate_genotype_call_dataset(
... n_variant=3000,
... n_sample=12,
... n_ploidy=1,
... n_allele=2,
... n_contig=2,
... seed=3,
... missing_pct=0.1,
... ).chunk(1000)
>>> ds["sample_cohort"] = ("samples", np.array([0, 1, 2, 3] * 3))
>>> # don't window
>>> div = sg.divergence(ds, merge=False)
>>> div
<xarray.Dataset>
Dimensions: (variants: 3000, cohorts_0: 4, cohorts_1: 4)
Dimensions without coordinates: variants, cohorts_0, cohorts_1
Data variables:
stat_divergence (variants, cohorts_0, cohorts_1) float64 dask.array<chunksize=(1000, 4, 4), meta=np.ndarray>
>>> genome_wide_div = div.sum(dim="variants")["stat_divergence"]
>>> genome_wide_div
<xarray.DataArray 'stat_divergence' (cohorts_0: 4, cohorts_1: 4)>
dask.array<sum-aggregate, shape=(4, 4), dtype=float64, chunksize=(4, 4), chunktype=numpy.ndarray>
Dimensions without coordinates: cohorts_0, cohorts_1
>>> from sgkit.stats.popgen import _Fst_Hudson
>>> _Fst_Hudson(genome_wide_div.compute().data)
array([[ nan, 0.01822849, 0.01228666, 0.02382808],
[0.01822849, nan, 0.01190565, 0.014855 ],
[0.01228666, 0.01190565, nan, 0.01998291],
[0.02382808, 0.014855 , 0.01998291, nan]]) If this looks like the right direction , then we can package it up as part of the library - but it would be good to get confirmation from a popgen expert that this is in fact meaningful. |
Beta Was this translation helpful? Give feedback.
-
Hi everyone,
Maybe I missed it, but is there an easy way to compute the genome-wide average Fst?
For example it's available in scikit-allel.
Tried to use only 1 overall window but this still splits between chromosomes.
Thanks
Beta Was this translation helpful? Give feedback.
All reactions