Genome-wide average Fst #888

alxsimon · 2022-08-12T09:39:55Z

alxsimon
Aug 12, 2022

Hi everyone,
Maybe I missed it, but is there an easy way to compute the genome-wide average Fst?
For example it's available in scikit-allel.

I tried using a single overall window, but the result is still split by chromosome.

Thanks

tomwhite · 2022-08-12T10:28:14Z

tomwhite
Aug 12, 2022
Maintainer

Hi @alxsimon - you are right that windows never span contigs. This is the way that the window_by_position and window_by_variant functions have been implemented. (If you don't specify a window you get per-variant stats.)

I haven't done it, but it should be possible to define a single genome-wide window, by defining variables such that ds["window_start"] is an array with a single value of 0, and ds["window_stop"] is an array with a single value that is the total number of variants. Then the popgen methods like Fst, which use window_statistic under the covers, should use that single window.

1 reply

alxsimon Aug 13, 2022
Author

Thanks for the idea, should have thought of this workaround.

Unfortunately I end up with an error.
It seems the underlying functions are still treating each chromosome as a separate block. This is the end of the traceback:

File /opt/miniconda3/envs/popgen/lib/python3.9/site-packages/sgkit/window.py:406, in window_statistic(values, statistic, window_starts, window_stops, dtype, chunks, new_axis, **kwargs)
    404     # new chunks are same except in first axis
    405     new_chunks = tuple([tuple(windows_per_chunk)] + list(desired_chunks[1:]))  # type: ignore
--> 406 return values.map_overlap(
    407     blockwise_moving_stat,
    408     dtype=dtype,
    409     chunks=new_chunks,
    410     depth=depth,
    411     boundary=0,
    412     trim=False,
...
    265         )
    266     chunks[i] = tuple(adjust_chunks[ind])
    267 else:

ValueError: Dimension 0 has 1 blocks, adjust_chunks specified with 21 blocks

This is the xarray.Dataset I am working with:

<xarray.Dataset>
Dimensions:                    (variants: 1135618, samples: 666, ploidy: 2,
                                ancestries: 3, alleles: 2, windows: 1)
Dimensions without coordinates: variants, samples, ploidy, ancestries, alleles,
                                windows
Data variables: (12/23)
    call_genotype              (variants, samples, ploidy) int8 ...
    call_genotype_mask         (variants, samples, ploidy) bool ...
    sample_admixture           (samples, ancestries) float64 0.146 ... 0.012
    sample_date_bp             (samples) int64 2240 3698 2225 ... 953 5418 5639
    sample_family_id           (samples) <U3 '1' '2' '3' ... '664' '665' '666'
    sample_filter_0            (samples) <U105 'Use' ... 'Do not pool into ma...
    ...                         ...
    variant_id                 (variants) <U16 ...
    variant_position           (variants) int32 ...
    variant_rate               (variants) float64 ...
    sample_cohort              (samples) int64 4 1 4 -9 -9 -9 4 ... 5 5 5 5 0 -9
    window_start               (windows) int64 0
    window_stop                (windows) int64 1135618
Attributes:
    contigs:  ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',...
    source:   sgkit-0.5.0

hammer · 2022-08-12T19:03:07Z

hammer
Aug 12, 2022
Maintainer

@tomwhite should we add a whole genome window helper to the API for convenience?

2 replies

alxsimon Aug 13, 2022
Author

From my end user point of view, this would indeed be a nice helper to have as we often want a few stats genome-wide.
Or you could also just add a small section in the documentation describing this trick.

timothymillar Aug 14, 2022
Maintainer

A whole-chromosome equivalent could also be useful, i.e. a better-signposted approach than asking for one window and knowing that it will be split by chromosome.

tomwhite · 2022-08-15T13:05:01Z

tomwhite
Aug 15, 2022
Maintainer

+1 to adding a helper method for this use case.

I tried to reproduce the error you got @alxsimon, but I wasn't able to. Does using the window_by_genome function below work for you? If not, it would be useful to have a dataset (simulated or otherwise) to help reproduce the problem. (I tried with a test file from the github repo - are there any files there that show the same problem?)

>>> import numpy as np
>>> import sgkit as sg
>>> import xarray as xr
>>> from sgkit.io.vcf import vcf_to_zarr
>>> 
>>> vcf_to_zarr("sgkit/tests/io/vcf/data/CEUTrio.20.21.gatk3.4.g.vcf.bgz", "output.zarr")
/Users/tom/workspace/sgkit/sgkit/io/vcf/vcf_reader.py:963: MaxAltAllelesExceededWarning: Some alternate alleles were dropped, since actual max value 7 exceeded max_alt_alleles setting of 3.
  warnings.warn(
>>> 
>>> ds = sg.load_dataset("output.zarr")
>>> ds["sample_cohort"] = xr.DataArray(np.array([0]), dims="samples")
>>> 
>>> def window_by_genome(ds):
...     ds["window_start"] = (
...         ["windows"],
...         np.array([0]),
...     )
...     ds["window_stop"] = (
...         ["windows"],
...         np.array([ds.dims["variants"]]),
...     )
...     return ds
... 
>>> 
>>> ds = window_by_genome(ds)
>>> ds = sg.diversity(ds)
>>> ds
<xarray.Dataset>
Dimensions:               (windows: 1, cohorts: 1, variants: 19910, alleles: 4,
                           samples: 1, ploidy: 2, filters: 2)
Dimensions without coordinates: windows, cohorts, variants, alleles, samples,
                                ploidy, filters
Data variables: (12/17)
    stat_diversity        (windows, cohorts) float64 dask.array<chunksize=(1, 1), meta=np.ndarray>
    cohort_allele_count   (variants, cohorts, alleles) uint64 dask.array<chunksize=(10000, 1, 4), meta=np.ndarray>
    call_allele_count     (variants, samples, alleles) uint8 dask.array<chunksize=(10000, 1, 4), meta=np.ndarray>
    call_genotype         (variants, samples, ploidy) int8 dask.array<chunksize=(10000, 1, 2), meta=np.ndarray>
    call_genotype_mask    (variants, samples, ploidy) bool dask.array<chunksize=(10000, 1, 2), meta=np.ndarray>
    call_genotype_phased  (variants, samples) bool dask.array<chunksize=(10000, 1), meta=np.ndarray>
    ...                    ...
    variant_id_mask       (variants) bool dask.array<chunksize=(10000,), meta=np.ndarray>
    variant_position      (variants) int32 dask.array<chunksize=(10000,), meta=np.ndarray>
    variant_quality       (variants) float32 dask.array<chunksize=(10000,), meta=np.ndarray>
    sample_cohort         (samples) int64 0
    window_start          (windows) int64 0
    window_stop           (windows) int64 19910
Attributes:
    contigs:               ['20', '21']
    filters:               ['PASS', 'LowQual']
    max_alt_alleles_seen:  7
    source:                sgkit-0.4.1.dev20+g839eb9a9
    vcf_header:            ##fileformat=VCFv4.1\n##FILTER=<ID=PASS,Descriptio...
    vcf_zarr_version:      0.1
>>> 
>>> ds["stat_diversity"].compute()
<xarray.DataArray 'stat_diversity' (windows: 1, cohorts: 1)>
array([[nan]])
Dimensions without coordinates: windows, cohorts
Attributes:
    comment:  Genetic diversity (also known as "Tajima’s pi") for cohorts.

3 replies

alxsimon Aug 16, 2022
Author

Thanks for investigating this. Your code above is working fine for me, but I managed to reproduce the error with a simulated dataset.

There is definitely something going on with chunking, as removing the chunking on variants removes the issue (however, this does not work on my dataset). I don't have enough knowledge of dask and xarray to pinpoint the issue.

>>> import sgkit as sg
>>> import numpy as np
>>> ds = sg.simulate_genotype_call_dataset(
...     n_variant=3000,
...     n_sample=12,
...     n_ploidy=1,
...     n_allele=2,
...     n_contig=2,
...     seed=3,
...     missing_pct=0.1).chunk(1000)

>>> ds['sample_cohort'] = (
...     'samples',
...     np.array([0, 1, 2, 3]*3)
... )

>>> def window_by_genome(ds):
...     ds["window_start"] = (
...             ["windows"],
...             np.array([0]),
...     )
...     ds["window_stop"] = (
...             ["windows"],
...             np.array([ds.dims["variants"]]),
...     )
...     return ds
... 
>>> ds = window_by_genome(ds)
>>> print(ds)
<xarray.Dataset>
Dimensions:             (variants: 3000, alleles: 2, samples: 12, ploidy: 1,
                         windows: 1)
Dimensions without coordinates: variants, alleles, samples, ploidy, windows
Data variables:
    variant_contig      (variants) int64 dask.array<chunksize=(1000,), meta=np.ndarray>
    variant_position    (variants) int64 dask.array<chunksize=(1000,), meta=np.ndarray>
    variant_allele      (variants, alleles) |S1 dask.array<chunksize=(1000, 2), meta=np.ndarray>
    sample_id           (samples) <U3 dask.array<chunksize=(12,), meta=np.ndarray>
    call_genotype       (variants, samples, ploidy) int8 dask.array<chunksize=(1000, 12, 1), meta=np.ndarray>
    call_genotype_mask  (variants, samples, ploidy) bool dask.array<chunksize=(1000, 12, 1), meta=np.ndarray>
    sample_cohort       (samples) int64 0 1 2 3 0 1 2 3 0 1 2 3
    window_start        (windows) int64 0
    window_stop         (windows) int64 3000
Attributes:
    contigs:  [0, 1]
    source:   sgkit-0.5.0
>>> sg.divergence(ds, merge=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/popgen/lib/python3.9/site-packages/sgkit/stats/popgen.py", line 273, in divergence
    div = window_statistic(
  File "/opt/miniconda3/envs/popgen/lib/python3.9/site-packages/sgkit/window.py", line 406, in window_statistic
    return values.map_overlap(
  File "/opt/miniconda3/envs/popgen/lib/python3.9/site-packages/dask/array/core.py", line 2523, in map_overlap
    return map_overlap(
  File "/opt/miniconda3/envs/popgen/lib/python3.9/site-packages/dask/array/overlap.py", line 703, in map_overlap
    x = map_blocks(func, *args, **kwargs)
  File "/opt/miniconda3/envs/popgen/lib/python3.9/site-packages/dask/array/core.py", line 775, in map_blocks
    out = blockwise(
  File "/opt/miniconda3/envs/popgen/lib/python3.9/site-packages/dask/array/blockwise.py", line 262, in blockwise
    raise ValueError(
ValueError: Dimension 0 has 1 blocks, adjust_chunks specified with 2 blocks

Module versions:

>>> import dask
>>> import xarray
>>> sg.__version__
'0.5.0'
>>> xarray.__version__
'2022.3.0'
>>> dask.__version__
'2022.01.0'

alxsimon Aug 16, 2022
Author

Playing a bit with this example, varying the number of variants and how they are chunked produces some working cases.

I'm not sure there is an intelligible pattern in what works and what doesn't.

alxsimon Aug 16, 2022
Author

The pattern seems to be that it works fine with two or fewer chunks, but breaks once you hit three chunks.

tomwhite · 2022-08-16T16:53:27Z

tomwhite
Aug 16, 2022
Maintainer

Thanks for providing a reproducer @alxsimon. It looks like there is a bug in window_statistic that can't cope with this case.

Thinking about this more though, window_statistic uses Dask's map_overlap to compute statistics in windows, and it assumes that window sizes are about the same as a block (chunk) size. In this case, the window is the whole dataset, so that assumption isn't met.

So, I think in this case to compute divergence, we just want to sum the per-variant d values across all variants - without worrying about windows.

Something like this:

>>> import numpy as np
>>> import sgkit as sg
>>> ds = sg.simulate_genotype_call_dataset(
...     n_variant=3000,
...     n_sample=12,
...     n_ploidy=1,
...     n_allele=2,
...     n_contig=2,
...     seed=3,
...     missing_pct=0.1,
... ).chunk(1000)
>>> ds["sample_cohort"] = ("samples", np.array([0, 1, 2, 3] * 3))
>>> # don't window
>>> div = sg.divergence(ds, merge=False)
>>> div
<xarray.Dataset>
Dimensions:          (variants: 3000, cohorts_0: 4, cohorts_1: 4)
Dimensions without coordinates: variants, cohorts_0, cohorts_1
Data variables:
    stat_divergence  (variants, cohorts_0, cohorts_1) float64 dask.array<chunksize=(1000, 4, 4), meta=np.ndarray>
>>> genome_wide_div = div.sum(dim="variants")["stat_divergence"]
>>> genome_wide_div
<xarray.DataArray 'stat_divergence' (cohorts_0: 4, cohorts_1: 4)>
dask.array<sum-aggregate, shape=(4, 4), dtype=float64, chunksize=(4, 4), chunktype=numpy.ndarray>
Dimensions without coordinates: cohorts_0, cohorts_1
>>> from sgkit.stats.popgen import _Fst_Hudson
>>> _Fst_Hudson(genome_wide_div.compute().data)
array([[       nan, 0.01822849, 0.01228666, 0.02382808],
       [0.01822849,        nan, 0.01190565, 0.014855  ],
       [0.01228666, 0.01190565,        nan, 0.01998291],
       [0.02382808, 0.014855  , 0.01998291,        nan]])

If this looks like the right direction, then we can package it up as part of the library - but it would be good to get confirmation from a popgen expert that this is in fact meaningful.
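For anyone following along without sgkit installed, here is a minimal sketch in plain Python of the idea above: sum the per-variant divergence matrices over all variants, then apply Hudson's estimator once to the sums. The function name `hudson_fst` and the toy numbers are made up for illustration. The formula (one minus the mean within-cohort diversity divided by the between-cohort divergence) is assumed to match what the private `_Fst_Hudson` helper computes; it is not copied from sgkit.

```python
import math


def hudson_fst(div):
    """Pairwise Hudson Fst from a square matrix of divergence values.

    div[i][i] is the within-cohort diversity of cohort i and div[i][j]
    is the divergence between cohorts i and j.
    """
    n = len(div)
    fst = [[math.nan] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and div[i][j] != 0:
                within = (div[i][i] + div[j][j]) / 2
                fst[i][j] = 1 - within / div[i][j]
    return fst


# Made-up per-variant divergence matrices for two cohorts and three variants.
per_variant = [
    [[0.25, 0.5], [0.5, 0.25]],
    [[0.0, 0.5], [0.5, 0.5]],
    [[0.25, 0.25], [0.25, 0.25]],
]

# Genome-wide estimate: sum every matrix entry over variants, then take the ratio once.
summed = [[sum(m[i][j] for m in per_variant) for j in range(2)] for i in range(2)]
genome_wide = hudson_fst(summed)

# For comparison: compute Fst per variant and average the ratios instead.
per_variant_fst = [hudson_fst(m)[0][1] for m in per_variant]
mean_of_ratios = sum(per_variant_fst) / len(per_variant_fst)

print(genome_wide[0][1], mean_of_ratios)
```

With these toy numbers the two approaches disagree (about 0.4 for the ratio of sums versus about 0.33 for the mean of per-variant ratios), so the choice to sum before taking the ratio is worth confirming.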

5 replies

alxsimon Aug 16, 2022
Author

I see.

Yes, I think your proposal gives the right result and is straightforward enough. I'll use this in the meantime.

tomwhite Aug 16, 2022
Maintainer

Thanks @alxsimon! We can get this turned into an issue and then a PR to add the new functionality.

timothymillar Aug 16, 2022
Maintainer

+1 for a general pattern that allows computation over the full genome or windows. We could potentially add a by parameter to the relevant methods, which must be one of {"genome", "window", "variant"} (maybe adding "contig" in future). I have actually been considering this for the methods identity_by_state and Weir_Goudet_beta, which are currently only calculated over the full genome but would also be useful when windowed.
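To make the suggested by parameter a bit more concrete, here is a purely hypothetical sketch in plain Python. Nothing in it is sgkit API: the function `aggregate_stat`, its `window_size` argument, and the choice to aggregate by summing are all assumptions for illustration, and contig boundaries are ignored entirely.

```python
def aggregate_stat(values, by="variant", window_size=None):
    """Hypothetical helper: sum per-variant values at the requested granularity.

    by="variant" returns the values unchanged, by="genome" returns a single
    total, and by="window" sums fixed-size runs of consecutive variants.
    """
    values = list(values)
    if by == "variant":
        return values
    if by == "genome":
        return [sum(values)]
    if by == "window":
        if window_size is None or window_size < 1:
            raise ValueError("window_size must be a positive integer when by='window'")
        return [
            sum(values[start:start + window_size])
            for start in range(0, len(values), window_size)
        ]
    raise ValueError(f"by must be one of 'genome', 'window', 'variant', got {by!r}")


per_variant = [1, 2, 3, 4, 5]
print(aggregate_stat(per_variant, by="genome"))
print(aggregate_stat(per_variant, by="window", window_size=2))
```

The point of the sketch is only that one argument selects the granularity, so the genome-wide case no longer needs a hand-built window.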

alxsimon Aug 17, 2022
Author

No problem, thanks for your responsiveness @tomwhite. Let me know if I can help somewhere in the process.

timothymillar Aug 17, 2022
Maintainer

I've opened #893 to track this, please feel free to add any suggestions/comments you have there @alxsimon!
