Replies: 3 comments 7 replies
-
Here's an example plot with some accompanying code that use the same selective sweep model as in #877, which I hope is a relatively stringent test of bad polarisation. It seems like even with 50% mispolarised (furthest right on the x axis), the effect on KC distance is small, and we probably need a more sensitive measure to study the effect. import msprime
import tsinfer
import numpy as np
import matplotlib.pyplot as plt
import tqdm
def make_sweep_ts(n, Ne, L, rho=1e-8, mu=1e-8, seed=1234):
sweep_model = msprime.SweepGenicSelection(
position=L/2, start_frequency=0.0001, end_frequency=0.9999, s=0.25, dt=1e-6)
models = [sweep_model, msprime.StandardCoalescent()]
ts = msprime.sim_ancestry(
n, model=models, population_size=Ne, sequence_length=L, recombination_rate=rho, random_seed=seed)
return msprime.sim_mutations(ts, rate=mu, random_seed=seed)
mu = 1e-8
pop_size = 10_000
mispolarise_proportions = [0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5]
num_sites = []
tree_seqs = {k: [] for k in mispolarise_proportions}
kc = {k: [] for k in mispolarise_proportions}
for reps in tqdm.trange(20):
sim_ts = make_sweep_ts(100, Ne=pop_size, L=5_000_000, mu=mu, seed=reps+1)
sd = tsinfer.SampleData.from_tree_sequence(sim_ts)
num_sites.append(sd.num_sites)
# switch a few ancestral alleles - use the same simulation to allow partitioning of error
for mispolarise_proportion in tqdm.tqdm(mispolarise_proportions):
aa = sd.sites_ancestral_allele[:] # perfect polarisation has these all 0
to_switch = np.random.choice(np.arange(len(aa)), round(mispolarise_proportion * len(aa)), replace=False)
aa[to_switch] = 1 # assume all sites have 2 or more alleles: just switch to the first alt
sd_bad = sd.copy() # make editable
sd_bad.sites_ancestral_allele[:] = aa
sd_bad.finalise()
inferred_ts = tsinfer.infer(sd_bad, post_process=False) # post-process separately, to keep flanks
inferred_ts = tsinfer.post_process(inferred_ts, erase_flanks=False)
inferred_ts = inferred_ts.simplify()
# do comparisons here - kc doesn't like missing flanking regions
# uncomment below to save tree seqs e.g. for dating
# tree_seqs[mispolarise_proportion].append(inferred_ts)
kc[mispolarise_proportion].append(sim_ts.kc_distance(inferred_ts))
print("Num sites in each simulation", num_sites)
for i in range(len(num_sites)):
plt.plot(kc.keys(), [v[i] for v in kc.values()])
plt.xlabel(
f"Fraction of mispolarised sites in n={sim_ts.num_samples}"
f" {sim_ts.sequence_length/1e6} Mb selective sweep simulation"
)
plt.ylabel("KC metric (each line = one replicate simulation)")
plt.xscale("log") |
Beta Was this translation helpful? Give feedback.
-
Beta Was this translation helpful? Give feedback.
-
FYI Reading recent Brandt et al ARG perspective I came upon this citation - Brandt cited it when mentioning impact of mispecified ancestral alleles Context dependence, ancestral misidentification, and spurious signatures of natural selection |
Beta Was this translation helpful? Give feedback.
-
@arka-pal was interested in the effect of wrong ("mispolarised") ancestral alleles in tsinfer:
For many species, ancestral alleles are not know, and need to be calculated e.g. from frequency and outgroups (see previous discussions at #523 and #637). As a worst-case, we might wrongly polarise (say) 30% of sites, but we might hope to get more in the range of 0-10% for well-studied groups.
We have not done extensive testing of sensitivity of tsinfer to this fraction of mispolarised alleles. While relatively easy to test, it's not clear what metrics you might use to check on quality of inference. E.g. the KC metric is not likely to be very sensitive to this (see below).
Gregor Goranj says:
Beta Was this translation helpful? Give feedback.
All reactions