Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TreeSequence.f2 is not symmetric with multiallelic sites #2919

Open
BenjaminPeter opened this issue Apr 5, 2024 · 3 comments
Open

TreeSequence.f2 is not symmetric with multiallelic sites #2919

BenjaminPeter opened this issue Apr 5, 2024 · 3 comments

Comments

@BenjaminPeter
Copy link

The implementation of the $f_2$-statistic for data with multiallelic sites is not symmetric (as would be expected from the definition of the statistic), I narrowed it down to the following minimal example:

import msprime as ms                                                          
i1, i2 = (0, 1), (2, 3)                                                       
                                                                              
ts = ms.sim_ancestry(sequence_length=1000, samples=2, ploidy=2, random_seed=1)
m = ms.sim_mutations(ts, rate=5e-3, random_seed=2)                            
                                                                              
f12 = m.f2((i1, i2), span_normalise=False)                                    
f21 = m.f2((i2, i1), span_normalise=False)                                    
                                                                              
assert f12 == f21, f"{f12} !={f21}"                                           

Digging a bit deeper, I found the issue is very likely due to the presence of a multiallelic site; coming from ms I assumed msprime would generate infinite sites by default, when it does not. Using

m = ms.sim_mutations(ts, rate=5e-3, model=ms.InfiniteSites(), random_seed=2)

fixes the problem

As F-statistics are not commonly run (or even defined for) multiallelic sites, this might not be a great practical problem, but I decided to post it anyways since it appears like the code for many statistics is shared.

pi1 = m.diversity((i1), span_normalise=False)          
pi2 = m.diversity((i2), span_normalise=False)          
pi12 = m.divergence((i1, i2), span_normalise=False)    
#fails
assert f21 == pi12 - pi1 / 2 - pi2 / 2                 
Other info:

The data set generated is

In [174]: np.vstack([i.genotypes for i in m.variants()])                                                                            
Out[174]:                                                                  
array([[1, 0, 0, 0],                                                       
       [1, 1, 0, 1],                                                       
       [1, 1, 0, 1],                                                       
       [0, 0, 1, 0],                                                       
       [1, 1, 0, 0],                                                       
       [0, 0, 0, 1],                                                       
       [0, 0, 1, 0],                                                       
       [1, 1, 0, 0],                                                       
       [0, 0, 0, 1],                                                       
       [1, 1, 0, 0],                                                       
       [1, 1, 0, 1],                                                       
       [1, 1, 0, 0],                                                       
       [0, 1, 0, 0],                                                       
       [0, 0, 1, 0],                                                       
       [0, 0, 1, 0],                                                       
       [0, 0, 1, 0],                                                       
       [0, 1, 0, 0],                                                       
       [1, 1, 0, 2],                                                       
       [0, 0, 1, 0],                                                       
       [0, 0, 0, 1],                                                       
       [0, 0, 0, 1],                                                       
       [0, 0, 1, 0],                                                       
       [0, 0, 1, 0],                                                       
       [1, 1, 0, 0],                                                       
       [0, 0, 1, 0],                                                       
       [0, 0, 1, 0],                                                       
       [0, 0, 0, 1],                                                       
       [0, 0, 1, 0],                                                       
       [0, 0, 1, 0],                                                       
       [0, 0, 0, 1],                                                       
       [0, 0, 1, 0],                                                       
       [0, 0, 0, 1],                                                       
       [0, 0, 1, 0],                                                       
       [0, 0, 1, 0],                                                       
       [0, 0, 1, 0],                                                       
       [1, 1, 0, 1],                                                       
       [1, 1, 0, 1]], dtype=int32)                                         
Versions used:
In [180]: tskit.__version__
Out[180]: '0.5.6'

In [181]: ms.__version__
Out[181]: '1.3.1'
@petrelharp
Copy link
Contributor

petrelharp commented Apr 6, 2024

Hm, you're right. This was certainly not intentional. And, it's true for f4 as well:

import msprime as ms

ts = ms.sim_ancestry(sequence_length=1000, samples=4, ploidy=2, random_seed=1)
m = ms.sim_mutations(ts, rate=5e-3, random_seed=2)

a, b, c, d = (0, 1), (2, 3), (4, 5), (6, 7)
f12 = m.f4((a, b, c, d), span_normalise=False)
f21 = m.f4((c, d, a, b), span_normalise=False)

assert f12 == f21, f"f4: {f12} !={f21}"

Here's the issue: the definition we gave for f4 (and by extension, f3 and f2) is

The average density of sites at which a and c agree but differs from b and d,
minus the average density of sites at which a and d agree but differs from b and c

This sounds symmetric under switching (a,b) with (c,d), and indeed it is for biallelic loci, since

a and c agree, but differ from b and d

is the same for biallelic loci as

a=c != b=d

and hence the same as

b and d agree, but differ from a and c

However, with more than two alleles, we can have b and d differing yet still neither agreeing with a and c.

So, I see two options:

  1. symmetrize the definition; this would then have the interpretation
The average density of sites at which a and c agree but differs from b and d, 
plus the average density of sites at which b and d agree but differs from a and c, 
minus the average density of sites at which a and d agree but differs from b and c
and the average density of sites at which b and c agree but differs from a and d,
all divided by two.
  1. leave it as-is, documenting the possible asymmetry.

I'm in favor of (2) because:
a. people can symmetrize it themselves if they want;
b. the current version seems a bit more natural and perhaps informative than the symmetrized version
c. the only downside of the current version I'm aware of is that it's surprising, so documenting should fix that.

However, if you know of a reason that the symmetrized version (or something else?) is a more natural estimator or something, then that's worth considering?

FYI, we've defined all the statistics in a way that makes sense with more-than-biallelic sites: the strategy taken is to treat each allele as a binary split between "the allele" and "all other alleles", and compute statistics in a way that sums a summary over these splits (see the paper for more discussion). I'm not aware of any other issues like this one that arises from multiallelic sites: for instance, ts.divergence( ) is still symmetric.

@BenjaminPeter
Copy link
Author

BenjaminPeter commented Apr 6, 2024

Hi Peter,

Thanks a lot for your thorough investigation and explanation of the issue.

Perhaps a lot of the difference in intuition (and why this behavior of tskit was surprising to me) is that I tend to think of $F_4$ as a sum of $F_2$ statistics, whereas your approach seems to start with $F_4$, and treats $F_2$ as a special case.

However, I think from the $F_2$ perspective, the symmetric version you describe as the more intuitive and practical extension of $f_2$ to multiallelic sites, here is my pitch: ;-)

  • For biallelic sites, the $F$-stats have some useful relations, e.g. $2F_4(a,b,c,d) = F_2(a,d) + F_2(b,c) - F_2(a,c) - F_2(b,d)$, it would be nice if they were retained in the multiallelic case.
  • For biallelic sites, $F_2$ can be written in terms of pairwise divergences $\pi_{ab}$ and within-sample diversity $\pi_a$ (eq 17 in paper):
    $F_2 = \pi_{12} - \pi_1/2 - \pi_2/2$. This definition starts from asking whether the two alleles in two haploid individuals differ from each other. Thus, if you had pressed me to come up with an extension of $F_2$-stats to multiallelic sites, this is probably what I would have used.
  • Combining these two points, one can write $F_4$ in terms of pairwise divergences: $2F_4 = \pi_{14} + \pi_{23} - \pi_{13} -\pi_{24}$. This, I think, coincides with the symmetric estimator you laid out.
  • As far as I can see, the symmetric versions of the statistic retain all the interpretations and properties from the biallelic case; symmetric $F_2$ is a tree metric, a (squared) distance, a covariance., $F_3$ and $F_4$ can be interpreted as the lengths of internal and external branches, respectively. Now, these branch lengths will be biased relative to the ones calculated without mutations, presumably in similar ways that naive estimators of branch lengths from number of mutations are biased under recurrent mutation models. However, I think they should still be tree-additive.

In addition, most applications of $F$-statistics I have seen do not calculate all permutations of arguments. This is both true for studies that use $F$-statistics for individual tests, and tools that use matrices of $F$-statistics for more complex models, such as qpAdm.

In summary, I think the symmetric version is just the nicer, more convenient and, to me, more intuitive extension of $F$-stats to multiallelic sites. It retains most of the properties and interpretations from the biallelic case, whereas the current version does not.

Ben

@petrelharp petrelharp changed the title Bug in TreeSequence.f2 with multiallelic sites TreeSequence.f2 is not symmetric with multiallelic sites Apr 15, 2024
@petrelharp
Copy link
Contributor

Thanks very much - this is compelling (but I'm not yet sure). I think we considered the $\pi$-based definitions you present here,
but used the definitions we ended up with because (a) they were equivalent for biallelic sites, and (b) they were easier to explain in words. But, do you have a good and concise way to explain $F_2(a,b) = \pi_{a,b} - \pi_a/2 - \pi_b/2$ in terms of what statistic it is of the samples?

However, I think that our current version, symmetrized (i.e., (ts.f4(a,b,c,d) + ts.f4(c,d,a,b))/2) is not equal to the pairwise divergences version. So, I think what you've got is a third proposal for the definitions, in terms of divergence and diversity, and distinct from the "let's just symmetrize" proposal. Or, have I got this wrong?

Thanks for helping get this right.

p.s. I ran across the thread in which we worked out the multiallelic thing; it would have been nice to have your input at the time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants