
Identify lowest normalised HMM likelihood value given MMR #242

Closed
szhan opened this issue Aug 27, 2024 · 25 comments

Comments

@szhan
Contributor

szhan commented Aug 27, 2024

Currently, mu (i.e. the mutation rate) is arbitrarily set to 1e-3, but by setting it to a higher value, we could improve computational speed.

We need to do some math to work out the relationship between a lower limit on the HMM relative likelihood and the precision value, given constraints on the HMM input parameters: mu (which should be as high as possible) and the MMR.

@szhan
Contributor Author

szhan commented Aug 28, 2024

Relevant to this discussion are:

  1. The function solve_num_mismatches in sc2ts.
  2. The code snippet in tsinfer that updates the HMM likelihood (linked here).

@szhan
Contributor Author

szhan commented Aug 30, 2024

Here's a first stab at thinking about this issue.

First, some notation.
k is the user-specified value of the MMR (recall that higher values favour mutation over recombination)
n is the number of matchable nodes in a tree sequence
mu is the probability of mutation to one of the three other nucleotides
rho is the probability of recombination
p_t is the probability of an HMM transition
p_e is the probability of an HMM emission
L(u) is the likelihood of node u at the current site
L_prev(u) is the likelihood of node u at the previous site
p is the precision value

Given:

  • a user-specified value of k
  • a preset value of mu, and
  • a fixed value n (because we are matching a new sequence to an existing ARG containing n matchable nodes),
    we compute rho (as done here) as follows:

rho = [n * mu ** k] / [(1 - mu) ** k + (n - 1) * mu ** k]

The HMM transition and emission probabilities are defined, as in tsinfer, as follows:
p_t = p_recomb if p_recomb > p_no_recomb else p_no_recomb
p_e = mu if mismatch else 1 - 3 * mu (assuming that there are four possible nucleotides)

where
p_recomb = rho / n, and
p_no_recomb = L_prev(u) * [1 - rho + rho / n]

Once the above are defined, we set L(u) = p_t * p_e.

After L(u) is computed for all nodes, it is normalised by the maximum likelihood value L_max across all nodes: L(u) is replaced with L(u) / L_max, which is then rounded to p decimal places (i.e. to the nearest 10 ** -p).
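
For concreteness, here is a minimal Python sketch of this per-site update, following the definitions above (the function and variable names are mine, not tsinfer's):

import numpy as np

def update_site(L_prev, rho, mu, n, mismatch, precision):
    # One HMM step over all n matchable nodes, as defined above.
    # L_prev: normalised likelihoods at the previous site (length-n array)
    # mismatch: boolean array, True where a node mismatches the query at this site
    p_recomb = rho / n
    p_no_recomb = L_prev * (1 - rho + rho / n)
    p_t = np.maximum(p_recomb, p_no_recomb)
    p_e = np.where(mismatch, mu, 1 - 3 * mu)  # assuming four possible nucleotides
    L = p_t * p_e
    L /= L.max()                   # normalise by the maximum likelihood
    return np.round(L, precision)  # round to `precision` decimal places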

We define the lowest normalised likelihood value of node u at the current site under the following scenarios.

  1. Mutation without recombination
    There exist sequences in the existing ARG that are exact clones of the query sequence, i.e. requiring no recombination or mutation to explain. There also exist sequences that have one mismatch (the same mismatch) compared to the query sequence. The current site is where the mismatch occurs at node u.

The normalised likelihood of node u at the current site is M = mu / (1 - 3 * mu). Note that no mismatch has occurred before this site, so L_prev(u) is 1.

By extension, if there exist sequences that have x mismatches (the same x mismatches) compared to the query sequence, then the normalised likelihood of node u at the current site (having emitted x - 1 mismatches already) is equal to M ** x.

  2. Recombination without mutation
    There exist sequences that are clones of the query sequence. There also exist sequences that can be explained by a single HMM transition (all switching at the same site) compared to the query sequence. The current site is where the transition occurs at node u.

The normalised likelihood of node u at the current site is R = [rho / n] / [1 - rho + rho / n]. Note that no switching has occurred until this site.

By extension, if there exist sequences that have x transitions (the same x switches) compared to the query sequence, then the normalised likelihood of node u at the current site (having switched x - 1 times already) is equal to R ** x.

@szhan
Contributor Author

szhan commented Aug 30, 2024

Generalising the above a bit…

  1. Mutation without recombination
    If there exist only sequences that differ from the query sequence by x mismatches and sequences that differ by y mismatches, where x < y, then the lowest normalised likelihood of node u at the current site, where the yth mismatch occurs, is M ** (y - x).

  2. Recombination without mutation
    Analogous to above.

@jeromekelleher
Owner

Sounds right to me

@szhan
Contributor Author

szhan commented Aug 30, 2024

Here is a simple illustration for the scenario with mutation but without recombination.

def compute_rho(mu, k, n):
    rho = n * mu ** k
    rho /= (1 - mu) ** k + (n - 1) * mu ** k
    return rho


def compute_M(mu):
    # Assume four possible states
    return mu / (1 - 3 * mu)


def compute_R(rho, n):
    R = rho / n
    R /= 1 - rho + rho / n
    return R


mu = 1e-3 # Currently used in sc2ts when k > 0
k = 3 # Currently used in sc2ts
n = 10

for i in range(5):
    print(i, compute_M(mu) ** i)
num. mismatch relative likelihood
0 1.0
1 0.0010030090270812437
2 1.0060271084064632e-06
3 1.0090542712201234e-09
4 1.0120905428486695e-12

To differentiate cases where the number of mismatches is 0 versus 1 or higher, a precision value of 0 is enough, since rounding to 0 decimal places maps the exact-match value of 1.0 to 1 and everything smaller to 0. But to differentiate cases where the number of mismatches is 0 versus 1 versus 2 or higher, a precision value of 3 is needed, which results in three possible rounded likelihood values, i.e. 1.0, 0.001, and 0.0.

By increasing mu to 1e-2, we need a lower precision value (i.e. 2) to differentiate cases where mismatch is equal to 0 versus 1 versus 2 or higher.
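
The output below is from re-running the loop above with mu = 1e-2:

mu = 1e-2

for i in range(5):
    print(i, compute_M(mu) ** i)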

num. mismatch relative likelihood
0 1.0
1 0.010309278350515464
2 0.00010628122010840684
3 1.0956826815299675e-06
4 1.1295697747731622e-08

@szhan
Contributor Author

szhan commented Aug 30, 2024

Here is a similar exercise for the scenario with recombination but without mutation.

mu = 1e-2
k = 3
n = 10

rho = compute_rho(mu, k, n)

for i in range(5):
    print(i, compute_R(rho, n) ** i)
num. switches relative likelihood
0 1.0
1 1.0306101521283648e-06
2 1.0621572856700512e-12
3 1.0946700817686625e-18
4 1.1281780995019707e-24

We need a precision value of 0 to differentiate cases where the number of switches is equal to 0 versus 1 or higher; and 6 to differentiate cases where the number of switches is equal to 0 versus 1 versus 2 or higher.

Note that as mu gets lower and/or k gets higher, a higher precision value is needed. The likelihood values, however, are less sensitive to increases in n from 10 to 1e6.
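
As a rough rule of thumb (a hypothetical helper, not part of sc2ts or tsinfer), the smallest precision at which a given relative likelihood still survives rounding (i.e. does not get rounded to zero) can be computed directly:

import math

def min_precision(v):
    # Smallest number of decimal places p such that round(v, p) > 0;
    # roughly, a value survives rounding to p decimals when v >= 0.5 * 10 ** -p.
    assert 0 < v <= 1
    return max(0, math.ceil(-math.log10(2 * v)))

# e.g. min_precision(1.03e-06) == 6 (one switch at mu=1e-2, k=3, n=10)
#      min_precision(1.003e-03) == 3 (one mismatch at mu=1e-3)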

@szhan
Contributor Author

szhan commented Aug 30, 2024

Next, let's consider the scenario with both recombination and mutation.

The lowest likelihood given i mismatches and j switches is equal to (M ** i) * (R ** j).

mu = 1e-2
k = 3
n = 1e5

rho = compute_rho(mu, k, n)

for i in range(3):
    for j in range(3):
        print(i, j, compute_M(mu) ** i * compute_R(rho, n) ** j)
num. mismatches  num. switches  relative likelihood
0 0 1.0
0 1 1.0306101521283648e-06
0 2 1.0621572856700512e-12
1 0 0.010309278350515464
1 1 1.06248469291584e-08
1 2 1.0950075110000528e-14
2 0 0.00010628122010840684
2 1 1.095345044243134e-10
2 2 1.1288737226804667e-16
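
A hypothetical helper combining the two (it reuses compute_rho, compute_M and compute_R from above, plus the same rough rounding rule; the name is mine):

import math

def min_precision_for_path(mu, k, n, i, j):
    # Smallest precision keeping a path with i mismatches and j switches
    # from rounding to zero (rough rule: v survives when v >= 0.5 * 10 ** -p).
    rho = compute_rho(mu, k, n)
    v = compute_M(mu) ** i * compute_R(rho, n) ** j
    return max(0, math.ceil(-math.log10(2 * v)))

# e.g. min_precision_for_path(1e-2, 3, 1e5, i=1, j=1) == 8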

@szhan szhan changed the title from "identify lower threshold for HMM likelihood calculation given MMR and precision value" to "identify lower thresholds for HMM likelihood given values for MMR and precision" on Aug 30, 2024
@jeromekelleher
Owner

Nice. What happens if we set mu=0.5 in these calculations? 1e-2 is arbitrary, and we want to make the magnitudes of the likelihoods as large as possible, while keeping everything as a sensible probability. Ideally we should solve for mu with these constraints, given k.

@szhan
Contributor Author

szhan commented Aug 30, 2024

I am assuming four possible nucleotides above, so the maximum value for mu is 0.25.

For these values:

mu = 0.25
k = 3
n = 1e1

I get

num. mismatches  num. switches  relative likelihood
0 0 1.0
0 1 0.03703703703703704
0 2 0.0013717421124828536
1 0 1.0
1 1 0.03703703703703704
1 2 0.0013717421124828536
2 0 1.0
2 1 0.03703703703703704
2 2 0.0013717421124828536
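
This is the degenerate case of the formulas above: with four alleles and mu = 0.25, M = mu / (1 - 3 * mu) = 0.25 / 0.25 = 1, so each extra mismatch multiplies the relative likelihood by 1 and only the switches change it. The same thing happens whenever num_alleles * mu = 1, as in the binary example below.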

@szhan
Contributor Author

szhan commented Aug 30, 2024

Assuming two possible nucleotides, and setting

mu = 0.5
k = 3
n = 1e1

I get this pathological case where the relative likelihood is always 1.0.

@szhan
Contributor Author

szhan commented Sep 2, 2024

We do want to relate precision to HMM cost rather than to the number of mismatches and switches, right? That should of course be easy, given the penalty weights for each type of event.

@jeromekelleher
Owner

jeromekelleher commented Sep 2, 2024

Ultimately we want to do something like this:

exact_matches = run_tsinfer_hmm(samples, likelihood_threshold=1)
one_mismatch = run_tsinfer_hmm(samples - exact_matches, 
    likelihood_threshold=one_mismatch_threshold(num_mismatches))
... etc

We don't have likelihood_threshold at the moment, we just have precision, which is almost the same thing except we round rather than truncate, and we only support power-of-ten values.
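
To make the correspondence concrete (a rough sketch; the helper name is hypothetical and tsinfer's exact behaviour at the rounding boundary may differ): rounding to precision p sends anything below about 0.5 * 10 ** -p to zero, so a precision value behaves approximately like a likelihood threshold.

def implied_threshold(precision):
    # Approximate likelihood threshold implied by rounding to `precision`
    # decimal places: values below this round to 0.
    return 0.5 * 10 ** -precision

# e.g. implied_threshold(6) == 5e-07, implied_threshold(0) == 0.5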

@szhan
Contributor Author

szhan commented Sep 2, 2024

Okay, just noting that I don't think it makes sense when num_alleles * mu >= 1.0, like the two examples above, where num_alleles is the number of distinct alleles at a site.

@szhan
Contributor Author

szhan commented Sep 2, 2024

Something like this might do, assuming no recombination.

def compute_M(mu, num_alleles):
    assert num_alleles * mu < 1.0
    return mu / (1 - (num_alleles - 1) * mu)


mu = 0.25 - 1e-1
k = 3
n = 10

for i in range(5):
    print(i, compute_M(mu, num_alleles=4) ** i)
num. mismatches  relative likelihood
0 1.0
1 0.2727272727272727
2 0.07438016528925619
3 0.020285499624342593
4 0.005532408988457071

@szhan
Contributor Author

szhan commented Sep 3, 2024

Just noting that in sc2ts, there are actually 5 rather than 4 distinct allele states, because gaps are treated as a distinct allele during LS matching (but N is treated as missing data).

Because the MMR is sufficiently high that recombination is pretty rare, we get:

mu = 0.20 - 1e-1
k = 3
n = 1e3

for i in range(5):
    print(i, compute_M(mu, num_alleles=5) ** i)
num. mismatches  relative likelihood
0 1.0
1 0.16666666666666669
2 0.027777777777777783
3 0.004629629629629631
4 0.0007716049382716053

@szhan
Contributor Author

szhan commented Sep 3, 2024

This example illustrates why it makes sense that we can use a lower precision when running the LS HMM without sacrificing the accuracy of the matching results.

Let's take 50 samples from the Viridian dataset from the early days of the pandemic to form a reference panel. Assuming no recombination and a mutation probability of 0.20 - 1e-1, these are the Viterbi matrices (normalised by the max likelihood per site) obtained by matching a query sequence that has one identical sequence in the ref. panel, while rounding the normalised likelihood across a range of precision values (here, the precision is just the number of decimal places to round to).

[Figure: likelihood_profile_varying_precision]

It is clear that the most likely path (the yellow line) emerges as the algorithm proceeds from the first site (site index = 0) to the last site (site index = 121). The pattern is there at the lowest precision level of 0.

Here are the unique likelihood values from each matrix.

# precision = 10
[0.00000000e+00 3.00000000e-10 2.60000000e-09 2.32000000e-08
 2.61000000e-08 2.09100000e-07 1.88170000e-06 1.69351000e-05
 1.52415800e-04 1.71467800e-04 1.37174210e-03 1.54320990e-03
 1.23456790e-02 1.11111111e-01 1.00000000e+00]
# precision = 6
[0.00000e+00 2.00000e-06 1.70000e-05 1.52000e-04 1.72000e-04 1.37200e-03
 1.54300e-03 1.23460e-02 1.11111e-01 1.00000e+00]
# precision = 2
[0.   0.01 0.11 1.  ]
# precision = 1
[0.  0.1 1. ]
# precision = 0
[0. 1.]
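
The unique values above can be pulled out of the rounded Viterbi matrix with something like this (the function name and V are hypothetical; V stands for the (num_sites, num_nodes) matrix of normalised likelihoods from the matching run):

import numpy as np

def unique_rounded_likelihoods(V, precision):
    # Unique values left in a normalised likelihood matrix V after
    # rounding every entry to `precision` decimal places.
    return np.unique(np.round(V, precision))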

@szhan szhan changed the title from "identify lower thresholds for HMM likelihood given values for MMR and precision" to "Identify lowest normalised HMM likelihood value given MMR" on Sep 3, 2024
@jeromekelleher
Owner

jeromekelleher commented Sep 4, 2024

Hmm, so it turns out that our parameter values depend quite strongly on the reference panel size. Here's the value of solve_num_mismatches for the values of n we need to deal with, when we fix mu=1e-2:

10        1.0409065987294759e-05
100       0.00010408100593776988
1000      0.0010398456498417583
10000     0.010302989499111497
1000000   0.5126128128841425
10000000  0.9206677886051955

So, the absolute values of the likelihoods are going to change depending on n, and so our precision values (or more sensibly likelihood thresholds) will also have to take this into account.

This is all very fiddly, and I wonder if it's unnecessarily complicated. A lot of the complication here stems from the fact that recombination becomes relatively more likely as the size of the reference panel increases in the LS model.

But, is this actually necessary in this case? As recombination events are rare, and we're really just doing parsimony matches in the case of sc2ts, it seems odd now that I think about it to change the relative likelihoods based on the number of sequences in the tree. Should recombination really be more likely if we sample heavily vs sparsely?

What if we made a simpler LS model where we don't scale by n by doing this in the tsinfer (and later tskit) code:

if scale_by_n:
    p_no_recomb = p_last * (1 - rho + rho / n);
    p_recomb = rho / n;
else:
    p_no_recomb = p_last;
    p_recomb = rho;

(or, equivalently, just set n = 1 when scale_by_n is false)

I think this should make things simpler and more interpretable here.

Any thoughts @szhan @hyanwong @astheeggeggs?

(I'm also wondering now if scaling by n makes sense in the tsinfer case too. The connection with the population genetic arguments behind it is pretty tenuous when we're doing ancestor matching, I think, and it may make it easier for us to understand our parameters if we stop doing it.)

@szhan
Contributor Author

szhan commented Sep 4, 2024

Just making a comment to clarify that I was looking at the relative likelihood (i.e. normalised by max likelihood) rather than the absolute likelihood above.

@jeromekelleher
Owner

Quick implementation of these ideas here for experimentation: tskit-dev/tsinfer#959

@jeromekelleher
Owner

Experimental code using this here: #258

Will run it now and see how it goes.

@szhan
Contributor Author

szhan commented Sep 4, 2024

I'm trying to interpret what rho (or r in the preprint) means in our formulation. Let's say that there are two kinds of paths: one kind is a straight path that leads to the end of the sequence with k mismatches somewhere along the way (no more switching), and the other kind involves a switch to another reference sequence but has no mismatches along the way. Here, rho is the probability of switching to any reference sequence (besides the current reference sequence) in the panel that leads to no mismatch. As the size of the panel increases, rho must also increase so that the probability of switching to any one other reference sequence is equal to the probability of staying on a straight path that leads to k mismatches somewhere later.

@szhan
Contributor Author

szhan commented Sep 4, 2024

And what we are storing in the ts compressed matrix is the relative likelihood of switching to any one other reference sequence in the panel.

If we don't scale by n, then we are storing the relative likelihood of switching to any other reference sequence (besides the current reference sequence) in the panel, right?

@szhan
Contributor Author

szhan commented Sep 4, 2024

This discussion is starting to branch a bit. I'd like to insert a comment to follow up on the earlier comment here.

This example shows that when the closest match to a query sequence in the reference panel has one mismatch compared to the query, we need a higher precision value (at least 1) to get the correct match. If the precision value is too low (equal to 0), we can get a wrong match.

[Figure: likelihood_profile_varying_precision_1mismatch]

@jeromekelleher
Owner

> If we don't scale by n, then we are storing the relative likelihood of switching to any other reference sequence (besides the current reference sequence) in the panel, right?

That sounds right, and it's what we want when recombination is rare. I think we were already doing this, effectively, by solving for a given number of mismatches. So basically, we were multiplying by n in the input parameters so that we could divide it through again during the HMM evaluation. Simpler to just factor it out entirely (as the code changes are so small)?

@hyanwong
Contributor

hyanwong commented Sep 4, 2024

> (I'm also wondering now if scaling by n makes sense in the tsinfer case too. The connection with the population genetic arguments behind it is pretty tenuous when we're doing ancestor matching I think and it may make it easier for us to understand our parameters if we stop doing it)

Yes, this tsinfer application is something that we did discuss with @astheeggeggs a while back, when thinking about the L/S parameterisation. As I recall, I was also unclear as to whether scaling by n was really the right thing to do here. The sample-size dependency makes me feel somewhat uncomfortable.
