- (Optional) To run the simulation script, you need to set up scrm
- For the simulation script and data analysis, you need an environment with tsinfer and numpy, e.g.
conda create -n tsinfer -conda-forge tsinfer numpy ipython
There is a simulated data set of 105 sheep (i.e. 210 haploid genomes) along 1Mbp of sequence:
- 5 mouflon
- 50 Iranian
- 50 Border
Demography (N - effective population size, sampled - number of diplopids sampled) following the work of Pedro Morell Miranda et al. (2023; Demographic reconstruction of the Western sheep expansion from whole-genome sequences, https://academic.oup.com/g3journal/article/13/11/jkad199/7262565)
Both files have the same variant data, but the TS is only half the size. This ratio becomes even more beneficial for larger data sets. The TS file also contains richer information (trees).
-rw------- 1 hbecher datastore_eb_groups_HighlanderLab_admin 2.5M Feb 20 11:58 test1M.out.inf.ts
-rw------- 1 hbecher datastore_eb_groups_HighlanderLab_admin 5.6M Feb 20 11:58 test1M.out.inf.vcf
Open ipython, Jupyter notebook or similar
import tskit
# load the TS file
ts = tskit.load("test1M.out.inf.ts")
# how amny sample are there?
ts.num_samples
# samples 0 to 9 are mouflon genomes, 10 to 109 are Iranian, 110 to 209 are Border, mind the python indexing below
# compute nucleatide diversity in mouflon, Iranian, and Border
ts.diversity([range(10), range(10,110), range(110, 210)])
# Fst mouflon-Iranian
ts.Fst([range(10), range(10,110)])
> 0.0474287776006701
# Fst mouflon-Border
ts.Fst([range(10), range(110,210)])
> 0.3863240194680754
# Fst Iranian-Border
ts.Fst([range(10,110), range(110,210)])
> 0.344373582032233
# Based on gene trees and looking at the mouflon samples (first argument) who are the closest relatives (second argument = list of ranges)
ts.genealogical_nearest_neighbours(range(10), [range(10),range(10, 110), range(110, 210)])
> array([[0.59771033, 0.36771007, 0.0345796 ],
> [0.48813082, 0.48295405, 0.02891513],
> [0.51328397, 0.44918785, 0.03752818],
> [0.74977978, 0.23451288, 0.01570734],
> [0.63255684, 0.35004292, 0.01740025],
> [0.48170299, 0.49188582, 0.02641118],
> [0.51438889, 0.43455347, 0.05105764],
> [0.65187813, 0.32764667, 0.0204752 ],
> [0.46132971, 0.48903879, 0.0496315 ],
> [0.60860476, 0.35239091, 0.03900433]])
# The columns are mouflon, Iranian, Border
# The rows correspond to the mouflon genomes
# Mouflon genomes are most similar to mouflon, also simlar to Iranian, but very different to Border
# This is useful when there are unlabelled samples
See here for other stats and inspiration.