Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AFS folding (polarised=False) #2972

Open
sunyatin opened this issue Mar 9, 2024 · 6 comments
Open

AFS folding (polarised=False) #2972

sunyatin opened this issue Mar 9, 2024 · 6 comments

Comments

@sunyatin
Copy link

sunyatin commented Mar 9, 2024

Dear all,

I was checking variable implementations of the (joint) allele frequency spectrum folding and went through the polarisation subsection in msprime. I am not sure I clearly understood the procedure implemented in ts.allele_frequency_spectrum(..., polarised=False), so all my apologies if I misinterpreted something...

Consider for instance a simple two-population AFS, $\xi$ (with haploid sample sizes $n_0 = n_1$), I thought I understood that any site with derived allele counts $i$ and $j$ where $i+j=(n_0+n_1)/2$ would add a half-count in both $\xi(i,j)$ and $\xi(n_0-i,n_1-j)$. However, looking into the spectrum, it seems that all cells checking the condition of ambiguous allele minority do not equal their reverse (which I expect would be the case if indeed the said sites were following the implementation described previously).

Am I misunderstanding the msprime implementation of folding?

Many thanks for your lights!

All the best,

Rémi

@petrelharp
Copy link
Contributor

petrelharp commented Mar 10, 2024

Hm, well, maybe this bit answers your question? (It's talking about sites with more than two alleles)

Now, if we are computing the unpolarised AFS, we add one half to each entry of the folded output corresponding 
to each allele count including the ancestral allele. What does this mean? Well, polarised=False means that we 
cannot distinguish between an allele count of 6 and an allele count of 4. So, folding means that we would add our 
allele that is seen in 6 samples to AFS[4] instead of AFS[6]. So, in total, we will add 0.5 to each of AFS[4], AFS[3], 
and AFS[1]. This means that the sum of an unpolarised AFS will be equal to the total number of alleles that are 
inherited by any of the samples in the tree sequence, divided by two. Why one-half? Well, notice that if in fact 
the mutation that produced the b allele had instead produced an a allele, so that the site had only two alleles, 
with frequencies 7 and 3. Then, we would have added 0.5 to AFS[3] twice.

... if not, can you provide a concrete example (and the code to generate it)?

@sunyatin
Copy link
Author

Thank you Peter! Well, this is the passage which I have a bit of a hard time extrapolating to a joint 2D-AFS, for instance. But here is an example.

I consider two populations of size 10000 diploids each, which diverged 6000 generations ago. I sample 2 diploids (4 haploids) in each population.

Herebelow I build a polarised and unpolarised (folded) joint 2D-AFS using only one site, to showcase the problem I am considering. Specifically, I am looking at a site where the MAF is 50% (the "ambiguous" minority I was referring to in my previous message). Here, the genotypes are [1,1,1,1] in Pop A and [0,0,0,0] in Pop B.

Everything is fine for the polarised 2D-AFS.

However, for the folded version, I would expect that both entries $\xi(4,0)$ and $\xi(0,4)$ have each the value of 0.5. This is not the case: the returned folded 2D-AFS has an entry of 1 in $\xi(0,4)$... Yet, there is no reason to assume that the site is "fully" in this configuration, given that it has equal chance to be in the reverse one, given that we don't know which allele is really the major one? I think I am still misunderstanding the folding procedure done in msprime...

import msprime as msp
import numpy as np

def get_demo(Pops):
    No = 10000.
    Ts = 6000.
    #
    Demo = msp.Demography()
    for pop in Pops:
        Demo.add_population(name = pop,
                    default_sampling_time = 0.,
                    growth_rate = 0.,
                    initially_active = True,
                    initial_size = No)
    Demo.add_population_split(time = Ts, derived = ["M"], ancestral = "A")
    return Demo

Pops = ["A", "M"]
n_diploid_samples_per_pop = 2

Samples = [msp.SampleSet(n_diploid_samples_per_pop, ploidy=2, population=pop, time=0.) for pop in Pops]
Demo = get_demo(Pops)

# Simulate
ts = msp.sim_ancestry(Samples, demography=Demo, sequence_length=1e6, random_seed=12)
tsm = msp.sim_mutations(ts, rate=1e-8, random_seed=12, model="binary")

# Extract only one focal site with ambiguous MAC
ambiguous_minor_allele_count = 0.5*np.sum(np.array([x.num_samples*x.ploidy for x in Samples]))
sites_id_to_remove = []
first = True
for v in tsm.variants():
    g = v.genotypes
    if (first is True) and (np.sum(g) == ambiguous_minor_allele_count):
        print(v)
        first = False
    else:
        sites_id_to_remove += [v.site.id]
tsm_sub = tsm.delete_sites(sites_id_to_remove)

# Resulting genotype matrix
print(tsm_sub.genotype_matrix())

# AFS
Afs_Pops = [tsm_sub.samples(0), tsm_sub.samples(1)]
print(tsm_sub.allele_frequency_spectrum(Afs_Pops, polarised = True, mode="site", span_normalise=False))
print(tsm_sub.allele_frequency_spectrum(Afs_Pops, polarised = False, mode="site", span_normalise=False))

So, this is the variant at the single retained site:

Variant(site=Site(id=1, position=1061.0, ancestral_state='0', mutations=[Mutation(id=1, site=1, node=12, derived_state='1', parent=-1, metadata=b'', time=26708.361655509398)], metadata=b''), alleles=('0', '1'), genotypes=array([1, 1, 1, 1, 0, 0, 0, 0], dtype=int8))

We can see there is a single mutation.

The corresponding genotypes:
[[1 1 1 1 0 0 0 0]]

The polarised 2D-AFS:

[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]]

The unpolarised (folded) 2D-AFS:

[[0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

But this is the folded 2D-AFS I would expect:

[[0. 0. 0. 0. 0.5]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0.5 0. 0. 0. 0.]]

Many thanks again for your lights!

@sunyatin
Copy link
Author

sunyatin commented Mar 10, 2024

Oh... this is interesting. If I now select a site with genotypes [[0 0 0 0 1 1 1 1]] (i.e. corresponding to the flipped version compared to my previous example) (here again, single mutation happening on the site), here are the 2D-AFS:

Polarised:

[[0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

Folded:

[[0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

So now, the folded is identical to the polarised...

NB. In all examples, I consider only biallelic sites.

Edit: Part of the code changed:

sites_id_to_remove = []
first = True
for v in tsm.variants():
    g = v.genotypes
    if (first is True) and (all((g-np.array([0,0,0,0, 1,1,1,1])) == 0)):
        print(v)
        first = False
    else:
        sites_id_to_remove += [v.site.id]
tsm_sub = tsm.delete_sites(sites_id_to_remove)

@petrelharp
Copy link
Contributor

Ah, okay - thanks for the examples, this is much clearer (not that your initial message wasn't clear, but it's clearer to me now I've seen the example). I think that bit I quoted is irrelevant, since that's just about sites with more than two alleles. So, it sounds like you're expecting a matrix with half the counts in each of the two places they could be assigned to (i.e. , ξ[i,j] == ξ[n-i,m-j])? Instead, we're returning an array in which some of the elements have been zeroed out. For instance:

>>> tsm.allele_frequency_spectrum([tsm.samples(0)], span_normalise=False, polarised=True)
array([1117.,   85.,    2.,   49.,  591.])
>>> tsm.allele_frequency_spectrum([tsm.samples(0)], span_normalise=False, polarised=False)
array([1708.,  134.,    2.,    0.,    0.])

Looking at the joint AFS I'm surprised by what I see there:

>>> tsm.allele_frequency_spectrum([tsm.samples(i) for i in [0,1]], span_normalise=False, polarised=True)
array([[  0.,  91., 652.,   0., 374.],
       [ 85.,   0.,   0.,   0.,   0.],
       [  2.,   0.,   0.,   0.,   0.],
       [ 49.,   0.,   0.,   0.,   0.],
       [591.,   0.,   0.,   0.,   0.]])
>>> tsm.allele_frequency_spectrum([tsm.samples(i) for i in [0,1]], span_normalise=False, polarised=False)
array([[  0.,  91., 652.,   0., 965.],
       [ 85.,   0.,   0.,   0.,   0.],
       [  2.,   0.,   0.,   0.,   0.],
       [ 49.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.]])

Notice that the only slots that got added together were (4,0) and (0,4). However, this is all as expected - adjusting the parameters a bit so I get mutations in most of the slots:

Polarised:
[[ 0. 68.  7.  2.  0.]
 [64. 21. 12.  7.  1.]
 [15. 13.  5.  9.  3.]
 [ 5.  5.  9. 14.  6.]
 [ 1.  6.  8. 14.  0.]]
------------
Unpolarised:
[[ 0. 82. 15.  8.  1.]
 [70. 35. 21. 12.  0.]
 [18. 22.  5.  0.  0.]
 [ 6.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]]

The rule for which half of the entries are kept in the unpolarised version is something like "entries with i + j <= n /2". (It should be the case that out of all entries that aren't distinguishable with unpolarised data, only one of them is nonzero.)

Is this answering the question?

@sunyatin
Copy link
Author

sunyatin commented Mar 10, 2024

Thank you so much Peter! I now get it, and it is now quite clear with the rule what msprime is doing for the "non-distinguishable counts". So this is a bit different compared to how folded AFSs are calculated, for instance in fastsimcoal or Arlequin. If this is not too cumbersome, could be worth explaining the folding more explicitly in the documentation for the multi-jAFS case, as some people for instance (like me ^^) might want to use msprime for simulating theoretical AFSs and afterwards compare them to empirical AFSs, which might therefore be produced with a slightly different folding procedure using softwares specialized in handling observed data (e.g. two-entry half-value for non-distinguishable counts).

Thanks a lot for your help :)

@petrelharp
Copy link
Contributor

I agree, better docs would be nice. Hm, it currently says:

 If there is more than one sample set, the returned array is “lower triangular” in a similar way.

I think we left it at this because the precise specification of which cells are empty is not at all easy to explain (because of the edge cases, basically). Do you have any suggestions? (And, how do those other programs handle it?)

@jeromekelleher jeromekelleher transferred this issue from tskit-dev/msprime Jul 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants