Effect of wrong ancestral allele on tsinfer inference #882

hyanwong · 2024-01-17T12:32:47Z

hyanwong
Jan 17, 2024
Collaborator

@arka-pal was interested in the effect of wrong ("mispolarised") ancestral alleles in tsinfer:

How much does the knowledge of ancestral and derived alleles matter in tsinfer? I have used both tsinfer (where i have to specify this) and ARGweaver (where i don’t) and i get very similar results. I can theoretically imagine why having mis-identification of the alleles can lead to topology changes, but not sure to what extent does it alter tsinfer results?

For many species, ancestral alleles are not known and need to be estimated, e.g. from allele frequencies and outgroups (see previous discussions at #523 and #637). As a worst case, we might wrongly polarise (say) 30% of sites, but we might hope for something in the range of 0-10% for well-studied groups.

We have not done extensive testing of the sensitivity of tsinfer to the fraction of mispolarised alleles. While this is relatively easy to test, it's not clear what metrics to use to check the quality of inference. E.g. the KC metric is not likely to be very sensitive to this (see below).

Gregor Gorjanc says:

I once toyed with a tiny example and from there I could see that picking wrong allele as ancestral impacted the trees I came up in the doodles. @hannesbecher has recently had the same question because we have been spending quite some time on getting ancestral alleles for some of the species we work with - it does require some work. We were toying with idea of doing simulation but were stuck on how to actually measure the impact - tree shapes don't matter for some metrics and could matter for some other metrics.

hyanwong · 2024-01-17T12:35:11Z

hyanwong
Jan 17, 2024
Collaborator Author

Here's an example plot with some accompanying code that uses the same selective sweep model as in #877, which I hope is a relatively stringent test of bad polarisation. It seems that even with 50% of sites mispolarised (furthest right on the x axis), the effect on KC distance is small, and we probably need a more sensitive measure to study the effect.

import msprime
import tsinfer
import numpy as np
import matplotlib.pyplot as plt
import tqdm

def make_sweep_ts(n, Ne, L, rho=1e-8, mu=1e-8, seed=1234):
    sweep_model = msprime.SweepGenicSelection(
        position=L/2, start_frequency=0.0001, end_frequency=0.9999, s=0.25, dt=1e-6)
    models = [sweep_model, msprime.StandardCoalescent()]
    ts = msprime.sim_ancestry(
        n, model=models, population_size=Ne, sequence_length=L, recombination_rate=rho, random_seed=seed)
    return msprime.sim_mutations(ts, rate=mu, random_seed=seed)

mu = 1e-8
pop_size = 10_000

mispolarise_proportions = [0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5]

num_sites = []
tree_seqs = {k: [] for k in mispolarise_proportions}
kc = {k: [] for k in mispolarise_proportions}

for reps in tqdm.trange(20):
    sim_ts = make_sweep_ts(100, Ne=pop_size, L=5_000_000, mu=mu, seed=reps+1)
    sd = tsinfer.SampleData.from_tree_sequence(sim_ts)
    num_sites.append(sd.num_sites)
    # switch a few ancestral alleles - use the same simulation to allow partitioning of error
    for mispolarise_proportion in tqdm.tqdm(mispolarise_proportions):
        aa = sd.sites_ancestral_allele[:] # perfect polarisation has these all 0 
        to_switch = np.random.choice(np.arange(len(aa)), round(mispolarise_proportion * len(aa)), replace=False)
        aa[to_switch] = 1  # assume all sites have 2 or more alleles: just switch to the first alt
        sd_bad = sd.copy() # make editable
        sd_bad.sites_ancestral_allele[:] = aa
        sd_bad.finalise()
        inferred_ts = tsinfer.infer(sd_bad, post_process=False)  # post-process separately, to keep flanks
        inferred_ts = tsinfer.post_process(inferred_ts, erase_flanks=False)
        inferred_ts = inferred_ts.simplify()
        # do comparisons here - kc doesn't like missing flanking regions
        # uncomment below to save tree seqs e.g. for dating
        # tree_seqs[mispolarise_proportion].append(inferred_ts)
        kc[mispolarise_proportion].append(sim_ts.kc_distance(inferred_ts))

print("Num sites in each simulation", num_sites)
for i in range(len(num_sites)):
    plt.plot(kc.keys(), [v[i] for v in kc.values()])
plt.xlabel(
    f"Fraction of mispolarised sites in n={sim_ts.num_samples}"
    f" {sim_ts.sequence_length/1e6} Mb selective sweep simulation"
)
plt.ylabel("KC metric (each line = one replicate simulation)")
plt.xscale("log")
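
The mispolarisation step inside the loop above draws from NumPy's global random state, so the set of switched sites changes from run to run. As a minimal sketch (the helper name `mispolarise` and the fixed seed are illustrative assumptions, not part of the original script), the step could be isolated and seeded like this:

```python
import numpy as np

def mispolarise(ancestral_allele, proportion, rng):
    """Return a copy of the ancestral allele index array with a random
    `proportion` of sites switched to allele 1, plus the switched indices.

    Assumes, as in the script above, that perfect polarisation is all 0
    and that every site has at least two alleles.
    """
    aa = np.array(ancestral_allele, copy=True)
    num_switch = round(proportion * len(aa))
    to_switch = rng.choice(len(aa), size=num_switch, replace=False)
    aa[to_switch] = 1
    return aa, to_switch

rng = np.random.default_rng(42)       # hypothetical seed, for repeatability
aa = np.zeros(1000, dtype=np.int8)    # stand-in for sd.sites_ancestral_allele[:]
aa_bad, switched = mispolarise(aa, 0.05, rng)
print("Switched", len(switched), "of", len(aa), "sites")
```

Returning the switched indices alongside the array makes it easy to restrict later comparisons to correctly polarised sites.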


hyanwong · 2024-01-17T22:56:40Z

hyanwong
Jan 17, 2024
Collaborator Author

Here is a seemingly better metric: the r_squared correlation between node times under mutations in the true versus the inferred topology. This, of course, requires tsdate (and I have used the new variational_gamma method in the code below). We know that mispolarised sites are bound to have the wrong estimated dates, so I have also plotted the r_squared only at sites known to have the correct polarisations.

This is obviously quite an indirect measure, and the problems caused by bad topological inference will have a not-entirely-predictable effect on the datability of adjacent mutations. Nevertheless, I think it probably gets at a reasonably important truth, and reflects something that is probably relevant to the conclusions you might draw from the data.

Note that polarising on the basis of major allele frequency gives an expected mispolarisation rate of 0.1-0.2 (10-20%), which is about where the drop-off in dating accuracy kicks in.
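
As a rough check on that 10-20% figure, here is a back-of-envelope sketch. It assumes the standard neutral, constant-size expectation that a segregating site has derived allele count i with probability proportional to 1/i, which is an assumption of this sketch and not something the sweep simulations guarantee. Major-allele polarisation is then wrong whenever the derived allele is in the majority. The function name is hypothetical.

```python
import numpy as np

def expected_major_allele_error(n):
    """Expected fraction of segregating sites mispolarised by calling the
    major allele ancestral, for n sampled genomes.

    Assumption: neutral site frequency spectrum, P(derived count = i) ~ 1/i.
    """
    i = np.arange(1, n)                       # possible derived allele counts
    weights = 1.0 / i                         # unnormalised neutral SFS
    wrong = np.where(i > n / 2, 1.0, 0.0)     # derived allele is the major allele
    wrong[i == n / 2] = 0.5                   # exact ties: assume a coin flip
    return float(np.sum(weights * wrong) / np.sum(weights))

for n in (20, 200):
    print(n, "genomes:", round(expected_major_allele_error(n), 3))
```

For the n=200 genomes (100 diploid samples) used in the simulations here, this comes out at roughly 12%, and it rises towards 20% for smaller samples, which is in line with the range quoted above.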

import msprime
import tsinfer
import numpy as np
import matplotlib.pyplot as plt
import tqdm

def make_sweep_ts(n, Ne, L, rho=1e-8, mu=1e-8, seed=1234):
    sweep_model = msprime.SweepGenicSelection(
        position=L/2, start_frequency=0.0001, end_frequency=0.9999, s=0.25, dt=1e-6)
    models = [sweep_model, msprime.StandardCoalescent()]
    ts = msprime.sim_ancestry(
        n, model=models, population_size=Ne, sequence_length=L, recombination_rate=rho, random_seed=seed)
    return msprime.sim_mutations(ts, rate=mu, random_seed=seed)

def common_mutation_node_times(ts1, ts2):
    # Return times of nodes below mutations in ts1 and their corresponding nodes in ts2
    # index of "first" mutation at each site: assume most sites have 1 mutation
    _, muts_to_use = np.unique(ts1.mutations_site, return_index=True)
    sites = ts1.mutations_site[muts_to_use]
    nodes = ts1.mutations_node[muts_to_use]
    pos_to_index = {ts1.sites_position[s]: i for i, s in enumerate(sites)}

    node_below_mut = np.full(len(muts_to_use), -1)
    node_time_below_mut = np.full(len(muts_to_use), np.nan)
    for s in ts2.sites():
        idx = pos_to_index[s.position]
        if len(s.mutations) > 0:
            node_time_below_mut[idx] = ts2.nodes_time[s.mutations[0].node]
            node_below_mut[idx] = s.mutations[0].node
    return ts1.nodes_time[nodes], node_time_below_mut, ts1.sites_position[sites], node_below_mut

import tsdate  # needed for the dating step below

mu = 1e-8
pop_size = 10_000
mispolarise_proportions = [0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5]

num_sites = []
tree_seqs = {k: [] for k in mispolarise_proportions}
kc = {k: [] for k in mispolarise_proportions}
r_sq = {k: [] for k in mispolarise_proportions}
r_sq_correct = {k: [] for k in mispolarise_proportions}

for reps in tqdm.trange(8):
    sim_ts = make_sweep_ts(100, Ne=pop_size, L=5_000_000, mu=mu, seed=reps+1)
    sd = tsinfer.SampleData.from_tree_sequence(sim_ts)
    num_sites.append(sd.num_sites)
    sim_ts.dump(f"sim_{reps}.trees")
    # switch a few ancestral alleles - use the same simulation to allow partitioning of error
    for mispolarise_proportion in tqdm.tqdm(mispolarise_proportions):
        aa = sd.sites_ancestral_allele[:] # perfect polarisation has these all 0 
        to_switch = np.random.choice(np.arange(len(aa)), round(mispolarise_proportion * len(aa)), replace=False)
        aa[to_switch] = 1  # assume all sites have 2 or more alleles: just switch to the first alt
        sd_mispol = sd.copy() # make editable
        sd_mispol.sites_ancestral_allele[:] = aa
        sd_mispol.finalise()
        inferred_ts = tsinfer.infer(sd_mispol, post_process=False)  # post-process separately, to keep flanks
        inferred_ts = tsinfer.post_process(inferred_ts, erase_flanks=False)
        inferred_ts = inferred_ts.simplify()

        # Now run the dating algorithm
        dts = tsdate.variational_gamma(inferred_ts, population_size=pop_size, mutation_rate=mu)
        dts.dump(f"dated_{mispolarise_proportion}_{reps}.trees")

        x, y, sitepos, _ = common_mutation_node_times(sim_ts, dts)
        use = np.logical_and(x > 0, y > 0)
        # np.corrcoef returns Pearson's r, so square it to get r_squared (on log times)
        r_sq[mispolarise_proportion].append(np.corrcoef(np.log(x[use]), np.log(y[use]))[0][1] ** 2)
        # also look at only the correctly polarised sites
        site_correct = np.ones(sd_mispol.num_sites, dtype=bool)
        site_correct[to_switch] = False
        good_site_positions = sd_mispol.sites_position[:][site_correct]
        use = np.logical_and(use, np.isin(sitepos, good_site_positions))
        r_sq_correct[mispolarise_proportion].append(np.corrcoef(np.log(x[use]), np.log(y[use]))[0][1] ** 2)

print("Num sites in each simulation", num_sites)

fig, axes = plt.subplots(2, figsize=(10, 5))
for ax, y in zip(axes, (r_sq, r_sq_correct)):
    for i in range(len(num_sites)):
        ax.plot(y.keys(), [v[i] for v in y.values()])
    ax.set_xlabel(
        f"Fraction of mispolarised sites in n={sim_ts.num_samples}"
        f" {sim_ts.sequence_length/1e6} Mb selective sweep simulation"
    )
    ax.set_ylabel("R_sq true vs inferred date (line = one sim)")
    ax.set_title("All sites" if y is r_sq else "Only correctly polarised sites")
    ax.set_xscale("log")
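
The comparison metric can be tried out on its own, without running tsinfer or tsdate. The sketch below uses made-up node times and a hypothetical helper name. It drops sites with missing or zero times in the same way as the script above, and returns both Pearson's r and its square, since `np.corrcoef` itself returns r.

```python
import numpy as np

def log_time_correlation(true_times, inferred_times):
    """Pearson's r and r squared between log node times, skipping sites
    where either time is missing (NaN) or not strictly positive."""
    x = np.asarray(true_times, dtype=float)
    y = np.asarray(inferred_times, dtype=float)
    use = (x > 0) & (y > 0)          # NaN > 0 is False, so missing values drop out
    r = np.corrcoef(np.log(x[use]), np.log(y[use]))[0, 1]
    return r, r ** 2

# Synthetic stand-in data: "inferred" times are the true times with log-normal noise
rng = np.random.default_rng(1)
true_t = np.exp(rng.normal(5, 2, size=500))
inferred_t = true_t * np.exp(rng.normal(0, 0.5, size=500))
inferred_t[:10] = np.nan             # pretend some sites have no mutation in the inferred ts
r, r2 = log_time_correlation(true_t, inferred_t)
print(f"r = {r:.3f}, r_squared = {r2:.3f}")
```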

6 replies

jeromekelleher Jan 18, 2024
Maintainer

If you really wanted to get into it you could take a simulation, and then polarise using some method, which should give a good indication of how many you can expect to get wrong and what types. Maybe simulated data is too clean though, and the polarisation methods will be near-perfect.
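
A minimal sketch of that check, using major-allele polarisation as the "some method" (an assumption made for illustration; an outgroup-based tool would take its place in practice). It works on a plain 0/1 genotype matrix in which allele 0 is the true ancestral state, which is the form `sim_ts.genotype_matrix()` returns for biallelic msprime simulations. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def major_allele_polarise(G):
    """Guess the ancestral allele at each biallelic site as the more common allele.

    G is a (num_sites, num_samples) array of 0/1 genotypes in which 0 is the
    true ancestral allele. Ties default to allele 0.
    """
    derived_freq = G.mean(axis=1)
    guessed_ancestral = (derived_freq > 0.5).astype(np.int8)
    return guessed_ancestral, derived_freq

# Tiny hand-made example: 4 sites x 4 sampled genomes
G = np.array([
    [0, 0, 0, 1],  # rare derived allele: guessed correctly
    [1, 1, 1, 0],  # common derived allele: mispolarised
    [0, 0, 1, 1],  # tie: falls back to allele 0
    [1, 1, 0, 1],  # common derived allele: mispolarised
])
guessed, derived_freq = major_allele_polarise(G)
mispolarised = guessed != 0          # the truth is allele 0 at every site
error_rate = mispolarised.mean()
print("Fraction mispolarised:", error_rate)
print("Derived frequencies of the mispolarised sites:", derived_freq[mispolarised])
```

Keeping the derived frequencies of the mispolarised sites gives a handle on "what types" of site go wrong, which here are by construction the high-frequency derived alleles.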

hyanwong Jan 18, 2024
Collaborator Author

Yes, I wonder if @arka-pal would like to take this on for his project(s). If so, perhaps he could update on progress in this discussion thread.

hannesbecher Feb 5, 2024

I am just looking into this. I appreciate the last reply is from 3 weeks ago. Has anyone got results already? I am planning to use est-sfs.

hyanwong Feb 13, 2024
Collaborator Author

No, I haven't looked into this. More results would be great.

hannesbecher Feb 15, 2024

One slightly annoying thing is that for ancestral allele inference, one would use one or more outgroups. But if you simulate a demography with outgroups, then you get the whole simulation's ancestral alleles out of the TS, and these may be different to the ancestral alleles of the ingroup of interest. I talked about this on Slack earlier this week.

gregorgorjanc · 2024-01-25T06:49:05Z

gregorgorjanc
Jan 25, 2024

FYI: reading the recent Brandt et al. ARG perspective, I came upon this citation, which Brandt et al. cite when mentioning the impact of misspecified ancestral alleles:

Context dependence, ancestral misidentification, and spurious signatures of natural selection
https://pubmed.ncbi.nlm.nih.gov/17545186/

1 reply

hannesbecher Feb 5, 2024

The reference is interesting, but it is more about SFS-based statistics than ARGs: Tajima's D is not affected by ancestral misspecification, but the statistics of Fu & Li and Fay & Wu are. The latter can be corrected to account for a proportion of sites with a misspecified ancestral state.
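
A small numerical illustration of why that is, using the usual textbook estimators (written out here from their standard definitions, not copied from the paper). Tajima's D is built from the number of segregating sites S and pairwise diversity, and both are unchanged when a derived count i is swapped for n - i. Fay & Wu's H also uses theta_H, which weights sites by i squared and so does change. The example counts are made up.

```python
import numpy as np

def theta_pi(derived_counts, n):
    """Pairwise diversity from per-site derived allele counts (symmetric in i and n - i)."""
    i = np.asarray(derived_counts, dtype=float)
    return float(np.sum(2 * i * (n - i)) / (n * (n - 1)))

def theta_h(derived_counts, n):
    """Fay & Wu's theta_H, which weights high-frequency derived alleles heavily."""
    i = np.asarray(derived_counts, dtype=float)
    return float(np.sum(2 * i ** 2) / (n * (n - 1)))

n = 10
derived_counts = np.array([1, 1, 2, 3, 9])   # made-up counts at 5 segregating sites
flipped = derived_counts.copy()
flipped[0] = n - flipped[0]                  # mispolarise the first site: 1 becomes 9

for label, counts in (("correct", derived_counts), ("one site flipped", flipped)):
    pi, th = theta_pi(counts, n), theta_h(counts, n)
    print(f"{label}: S={len(counts)}, pi={pi:.3f}, theta_H={th:.3f}, H={pi - th:.3f}")
```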
