
Identify lowest normalised HMM likelihood value given MMR #242

Closed
szhan opened this issue Aug 27, 2024 · 25 comments

Comments

@szhan
Contributor

szhan commented Aug 27, 2024

Currently, mu (i.e. the mutation rate) is arbitrarily set to 1e-3, but by setting it to a higher value, we could improve computational speed.

We need to do some math to work out the relationship between a lower limit on the HMM relative likelihood and the precision value, given constraints on the HMM input parameters: mu (which should be as high as possible) and the MMR.

@szhan
Contributor Author

szhan commented Aug 28, 2024

Relevant to this discussion are:

  1. The function solve_num_mismatches in sc2ts.
  2. The code snippet in tsinfer that updates the HMM likelihood (linked here).

@szhan
Contributor Author

szhan commented Aug 30, 2024

Here's a first stab at thinking about this issue.

First, some notation.
k is the user-specified value of the MMR (recall that higher values favour mutation over recombination)
n is the number of matchable nodes in a tree sequence
mu is the probability of mutation to one of the three other nucleotides
rho is the probability of recombination
p_t is the probability of an HMM transition
p_e is the probability of an HMM emission
L(u) is the likelihood of node u at the current site
L_prev(u) is the likelihood of node u at the previous site
p is the precision value

Given:

  • a user-specified value of k
  • a preset value of mu, and
  • a fixed value n (because we are matching a new sequence to an existing ARG containing n matchable nodes),
    we compute rho (as done here) as follows:

rho = [n * mu ** k] / [(1 - mu) ** k + (n - 1) * mu ** k]

The HMM transition and emission probabilities are defined, as in tsinfer, as follows:
p_t = p_recomb if p_recomb > p_no_recomb else p_no_recomb
p_e = mu if mismatch else 1 - 3 * mu (assuming that there are four possible nucleotides)

where
p_recomb = rho / n, and
p_no_recomb = L_prev(u) * [1 - rho + rho / n]

Once the above are defined, we set L(u) = p_t * p_e.

After L(u) is computed for all nodes, it is normalised by the maximum likelihood value L_max across all nodes: L(u) is replaced with L(u) / L_max, which is then rounded to p decimal places (i.e. to the nearest 10 ** -p).
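
For concreteness, here is a minimal Python sketch of this per-site update, following the definitions above (the function and variable names are mine, not tsinfer's):

import numpy as np

def update_site(L_prev, rho, mu, n, mismatch, precision):
    # One HMM step over all n matchable nodes, as defined above.
    # L_prev: normalised likelihoods at the previous site (length-n array)
    # mismatch: boolean array, True where a node mismatches the query at this site
    p_recomb = rho / n
    p_no_recomb = L_prev * (1 - rho + rho / n)
    p_t = np.maximum(p_recomb, p_no_recomb)
    p_e = np.where(mismatch, mu, 1 - 3 * mu)  # assuming four possible nucleotides
    L = p_t * p_e
    L /= L.max()                   # normalise by the maximum likelihood
    return np.round(L, precision)  # round to `precision` decimal places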

We define the lowest normalised likelihood value of node u at the current site under the following scenarios.

  1. Mutation without recombination
    There exist sequences in the existing ARG that are exact clones of the query sequence, i.e. requiring no recombination or mutation to explain. There also exist sequences that have one mismatch (the same mismatch) compared to the query sequence. The current site is where the mismatch occurs at node u.

The normalised likelihood of node u at the current site is M = mu / (1 - 3 * mu). Note that no mismatch has occurred before this site, so L_prev(u) is 1.

By extension, if there exist sequences that have x mismatches (the same x mismatches) compared to the query sequence, then the normalised likelihood of node u at the current site (having emitted x - 1 mismatches already) is equal to M ** x.

  2. Recombination without mutation
    There exist sequences that are clones of the query sequence. There also exist sequences that can be explained by a single HMM transition (all switching at the same site) compared to the query sequence. The current site is where the transition occurs at node u.

The normalised likelihood of node u at the current site is R = [rho / n] / [1 - rho + rho / n]. Note that no switching has occurred until this site.

By extension, if there exist sequences that have x transitions (the same x switches) compared to the query sequence, then the normalised likelihood of node u at the current site (having switched x - 1 times already) is equal to R ** x.

@szhan
Contributor Author

szhan commented Aug 30, 2024

Generalising the above a bit…

  1. Mutation without recombination
    If there exist only sequences that differ from the query sequence by x mismatches and sequences that differ by y mismatches, where x < y, then the lowest normalised likelihood of node u at the current site, where the yth mismatch occurs, is M ** (y - x).

  2. Recombination without mutation
    Analogous to above.

@jeromekelleher
Owner

Sounds right to me

@szhan
Contributor Author

szhan commented Aug 30, 2024

Here is a simple illustration for the scenario with mutation but without recombination.

def compute_rho(mu, k, n):
    rho = n * mu ** k
    rho /= (1 - mu) ** k + (n - 1) * mu ** k
    return rho


def compute_M(mu):
    # Assume four possible states
    return mu / (1 - 3 * mu)


def compute_R(rho, n):
    R = rho / n
    R /= 1 - rho + rho / n
    return R


mu = 1e-3 # Currently used in sc2ts when k > 0
k = 3 # Currently used in sc2ts
n = 10

for i in range(5):
    print(i, compute_M(mu) ** i)
num. mismatch relative likelihood
0 1.0
1 0.0010030090270812437
2 1.0060271084064632e-06
3 1.0090542712201234e-09
4 1.0120905428486695e-12

To differentiate cases where the number of mismatches is 0 versus 1 or higher, a precision value of 0 is enough, since rounding to 0 decimal places maps the exact-match value of 1.0 to 1 and everything smaller to 0. But to differentiate cases where the number of mismatches is 0 versus 1 versus 2 or higher, a precision value of 3 is needed, which results in three possible rounded likelihood values, i.e. 1.0, 0.001, and 0.0.

By increasing mu to 1e-2, we need a lower precision value (i.e. 2) to differentiate cases where mismatch is equal to 0 versus 1 versus 2 or higher.
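
The output below is from re-running the loop above with mu = 1e-2:

mu = 1e-2

for i in range(5):
    print(i, compute_M(mu) ** i)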

num. mismatch relative likelihood
0 1.0
1 0.010309278350515464
2 0.00010628122010840684
3 1.0956826815299675e-06
4 1.1295697747731622e-08

@szhan
Contributor Author

szhan commented Aug 30, 2024

Here is a similar exercise for the scenario with recombination but without mutation.

mu = 1e-2
k = 3
n = 10

rho = compute_rho(mu, k, n)

for i in range(5):
    print(i, compute_R(rho, n) ** i)
num. switches relative likelihood
0 1.0
1 1.0306101521283648e-06
2 1.0621572856700512e-12
3 1.0946700817686625e-18
4 1.1281780995019707e-24

We need a precision value of 0 to differentiate cases where the number of switches is equal to 0 versus 1 or higher; and 6 to differentiate cases where the number of switches is equal to 0 versus 1 versus 2 or higher.

Note that as mu gets lower and/or k gets higher, a higher precision value is needed. The likelihood values, however, are less sensitive to increases in n from 10 to 1e6.
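
As a rough rule of thumb (a hypothetical helper, not part of sc2ts or tsinfer), the smallest precision at which a given relative likelihood still survives rounding (i.e. does not get rounded to zero) can be computed directly:

import math

def min_precision(v):
    # Smallest number of decimal places p such that round(v, p) > 0;
    # roughly, a value survives rounding to p decimals when v >= 0.5 * 10 ** -p.
    assert 0 < v <= 1
    return max(0, math.ceil(-math.log10(2 * v)))

# e.g. min_precision(1.03e-06) == 6 (one switch at mu=1e-2, k=3, n=10)
#      min_precision(1.003e-03) == 3 (one mismatch at mu=1e-3)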

@szhan
Contributor Author

szhan commented Aug 30, 2024

Next, let's consider the scenario with both recombination and mutation.

The lowest likelihood given i mismatches and j switches is equal to (M ** i) * (R ** j).

mu = 1e-2
k = 3
n = 1e5

rho = compute_rho(mu, k, n)

for i in range(3):
    for j in range(3):
        print(i, j, compute_M(mu) ** i * compute_R(rho, n) ** j)
num. mismatches  num. switches  relative likelihood
0 0 1.0
0 1 1.0306101521283648e-06
0 2 1.0621572856700512e-12
1 0 0.010309278350515464
1 1 1.06248469291584e-08
1 2 1.0950075110000528e-14
2 0 0.00010628122010840684
2 1 1.095345044243134e-10
2 2 1.1288737226804667e-16
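
A hypothetical helper combining the two (it reuses compute_rho, compute_M and compute_R from above, plus the same rough rounding rule; the name is mine):

import math

def min_precision_for_path(mu, k, n, i, j):
    # Smallest precision keeping a path with i mismatches and j switches
    # from rounding to zero (rough rule: v survives when v >= 0.5 * 10 ** -p).
    rho = compute_rho(mu, k, n)
    v = compute_M(mu) ** i * compute_R(rho, n) ** j
    return max(0, math.ceil(-math.log10(2 * v)))

# e.g. min_precision_for_path(1e-2, 3, 1e5, i=1, j=1) == 8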

@szhan szhan changed the title from "identify lower threshold for HMM likelihood calculation given MMR and precision value" to "identify lower thresholds for HMM likelihood given values for MMR and precision" on Aug 30, 2024
@jeromekelleher
Owner

Nice. What happens if we set mu=0.5 in these calculations? 1e-2 is arbitrary, and we want to make the magnitudes of the likelihoods as large as possible, while keeping everything as a sensible probability. Ideally we should solve for mu with these constraints, given k.

@szhan
Contributor Author

szhan commented Aug 30, 2024

I am assuming four possible nucleotides above, so the maximum value for mu is 0.25.

For these values:

mu = 0.25
k = 3
n = 1e1

I get

num. mismatches  num. switches  relative likelihood
0 0 1.0
0 1 0.03703703703703704
0 2 0.0013717421124828536
1 0 1.0
1 1 0.03703703703703704
1 2 0.0013717421124828536
2 0 1.0
2 1 0.03703703703703704
2 2 0.0013717421124828536
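
This is the degenerate case of the formulas above: with four alleles and mu = 0.25, M = mu / (1 - 3 * mu) = 0.25 / 0.25 = 1, so each extra mismatch multiplies the relative likelihood by 1 and only the switches change it. The same thing happens whenever num_alleles * mu = 1, as in the binary example below.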

@szhan
Contributor Author

szhan commented Aug 30, 2024

Assuming two possible nucleotides, and setting

mu = 0.5
k = 3
n = 1e1

I get this pathological case where the relative likelihood is always 1.0.

@szhan
Contributor Author

szhan commented Sep 2, 2024

We do want to relate precision to HMM cost rather than to the number of mismatches and switches, right? That should of course be easy, given the penalty weights for each type of event.

@jeromekelleher
Owner

jeromekelleher commented Sep 2, 2024

Ultimately we want to do something like this:

exact_matches = run_tsinfer_hmm(samples, likelihood_threshold=1)
one_mismatch = run_tsinfer_hmm(samples - exact_matches, 
    likelihood_threshold=one_mismatch_threshold(num_mismatches))
... etc

We don't have likelihood_threshold at the moment, we just have precision, which is almost the same thing except we round rather than truncate, and we only support power-of-ten values.
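
To make the correspondence concrete (a rough sketch; the helper name is hypothetical and tsinfer's exact behaviour at the rounding boundary may differ): rounding to precision p sends anything below about 0.5 * 10 ** -p to zero, so a precision value behaves approximately like a likelihood threshold.

def implied_threshold(precision):
    # Approximate likelihood threshold implied by rounding to `precision`
    # decimal places: values below this round to 0.
    return 0.5 * 10 ** -precision

# e.g. implied_threshold(6) == 5e-07, implied_threshold(0) == 0.5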

@szhan
Contributor Author

szhan commented Sep 2, 2024

Okay, just noting that I don't think it makes sense when num_alleles * mu >= 1.0, like the two examples above, where num_alleles is the number of distinct alleles at a site.

@szhan
Contributor Author

szhan commented Sep 2, 2024

Something like this might do, assuming no recombination.

def compute_M(mu, num_alleles):
    assert num_alleles * mu < 1.0
    return mu / (1 - (num_alleles - 1) * mu)


mu = 0.25 - 1e-1
k = 3
n = 10

for i in range(5):
    print(i, compute_M(mu, num_alleles=4) ** i)
num. mismatches  relative likelihood
0 1.0
1 0.2727272727272727
2 0.07438016528925619
3 0.020285499624342593
4 0.005532408988457071

@szhan
Contributor Author

szhan commented Sep 3, 2024

Just noting that in sc2ts, there are actually 5 rather than 4 distinct allele states, because gaps are treated as a distinct allele during LS matching (but N is treated as missing data).

Because the MMR is sufficiently high that recombination is pretty rare, we get:

mu = 0.20 - 1e-1
k = 3
n = 1e3

for i in range(5):
    print(i, compute_M(mu, num_alleles=5) ** i)
num. mismatches  relative likelihood
0 1.0
1 0.16666666666666669
2 0.027777777777777783
3 0.004629629629629631
4 0.0007716049382716053

@szhan
Contributor Author

szhan commented Sep 3, 2024

This example illustrates why it makes sense that we can use a lower precision when running the LS HMM without sacrificing the accuracy of the matching results.

Let's take 50 samples from the Viridian dataset from the early days of the pandemic to form a reference panel. Assuming no recombination and a mutation probability of 0.20 - 1e-1, these are the Viterbi matrices (normalised by the max likelihood per site) obtained by matching a query sequence that has one identical sequence in the ref. panel, while rounding the normalised likelihood across a range of precision values (here, the precision is just the number of decimal places to round to).

[Figure: likelihood_profile_varying_precision]

It is clear that the most likely path (the yellow line) emerges as the algorithm proceeds from the first site (site index = 0) to the last site (site index = 121). The pattern is there at the lowest precision level of 0.

Here are the unique likelihood values from each matrix.

# precision = 10
[0.00000000e+00 3.00000000e-10 2.60000000e-09 2.32000000e-08
 2.61000000e-08 2.09100000e-07 1.88170000e-06 1.69351000e-05
 1.52415800e-04 1.71467800e-04 1.37174210e-03 1.54320990e-03
 1.23456790e-02 1.11111111e-01 1.00000000e+00]
# precision = 6
[0.00000e+00 2.00000e-06 1.70000e-05 1.52000e-04 1.72000e-04 1.37200e-03
 1.54300e-03 1.23460e-02 1.11111e-01 1.00000e+00]
# precision = 2
[0.   0.01 0.11 1.  ]
# precision = 1
[0.  0.1 1. ]
# precision = 0
[0. 1.]
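
The unique values above can be pulled out of the rounded Viterbi matrix with something like this (the function name and V are hypothetical; V stands for the (num_sites, num_nodes) matrix of normalised likelihoods from the matching run):

import numpy as np

def unique_rounded_likelihoods(V, precision):
    # Unique values left in a normalised likelihood matrix V after
    # rounding every entry to `precision` decimal places.
    return np.unique(np.round(V, precision))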

@szhan szhan changed the title from "identify lower thresholds for HMM likelihood given values for MMR and precision" to "Identify lowest normalised HMM likelihood value given MMR" on Sep 3, 2024
@jeromekelleher
Owner

jeromekelleher commented Sep 4, 2024

Hmm, so it turns out that our parameter values depend quite strongly on the reference panel size. Here's the value of solve_num_mismatches for the values of n we need to deal with, when we fix mu=1e-2:

10        1.0409065987294759e-05
100       0.00010408100593776988
1000      0.0010398456498417583
10000     0.010302989499111497
1000000   0.5126128128841425
10000000  0.9206677886051955

So, the absolute values of the likelihoods are going to change depending on n, and so our precision values (or more sensibly likelihood thresholds) will also have to take this into account.

This is all very fiddly, and I wonder if it's unnecessarily complicated. A lot of the complication here stems from the fact that recombination becomes relatively more likely as the size of the reference panel increases in the LS model.

But, is this actually necessary in this case? As recombination events are rare, and we're really just doing parsimony matches in the case of sc2ts, it seems odd now that I think about it to change the relative likelihoods based on the number of sequences in the tree. Should recombination really be more likely if we sample heavily vs sparsely?

What if we made a simpler LS model where we don't scale by n by doing this in the tsinfer (and later tskit) code:

if scale_by_n:
    p_no_recomb = p_last * (1 - rho + rho / n);
    p_recomb = rho / n;
else:
    p_no_recomb = p_last;
    p_recomb = rho;

(or, equivalently, just set n = 1 when scale_by_n is false)

I think this should make things simpler and more interpretable here.

Any thoughts @szhan @hyanwong @astheeggeggs?

(I'm also wondering now if scaling by n makes sense in the tsinfer case too. The connection with the population genetic arguments behind it is pretty tenuous when we're doing ancestor matching, I think, and it may make it easier for us to understand our parameters if we stop doing it.)

@szhan
Contributor Author

szhan commented Sep 4, 2024

Just making a comment to clarify that I was looking at the relative likelihood (i.e. normalised by max likelihood) rather than the absolute likelihood above.

@jeromekelleher
Owner

Quick implementation of these ideas here for experimentation: tskit-dev/tsinfer#959

@jeromekelleher
Owner

Experimental code using this here: #258

Will run it now and see how it goes.

@szhan
Contributor Author

szhan commented Sep 4, 2024

I'm trying to interpret what rho (or r in the preprint) means in our formulation. Let's say that there are two kinds of paths: one kind is a straight path that leads to the end of the sequence with k mismatches somewhere along the way (no more switching), and the other kind involves a switch to another reference sequence but has no mismatches along the way. Here, rho is the probability of switching to any reference sequence (besides the current reference sequence) in the panel that leads to no mismatch. As the size of the panel increases, rho must also increase so that the probability of switching to any one other reference sequence is equal to the probability of staying on a straight path that leads to k mismatches somewhere later.

@szhan
Contributor Author

szhan commented Sep 4, 2024

And what we are storing in the ts compressed matrix is the relative likelihood of switching to any one other reference sequence in the panel.

If we don't scale by n, then we are storing the relative likelihood of switching to any other reference sequence (besides the current reference sequence) in the panel, right?

@szhan
Contributor Author

szhan commented Sep 4, 2024

This discussion is starting to branch a bit. I'd like to insert a comment to follow up on the earlier comment here.

This example shows that when the closest match to a query sequence in the reference panel has one mismatch compared to the query, we need a higher precision value (at least 1) to get the correct match. If the precision value is too low (equal to 0), we can get a wrong match.

[Figure: likelihood_profile_varying_precision_1mismatch]

@jeromekelleher
Owner

> If we don't scale by n, then we are storing the relative likelihood of switching to any other reference sequence (besides the current reference sequence) in the panel, right?

That sounds right, and it's what we want when recombination is rare. I think we were already doing this, effectively, by solving for a given number of mismatches. So basically, we were multiplying by n in the input parameters so that we could divide it through again during the HMM evaluation. Simpler to just factor it out entirely (as the code changes are so small)?

@hyanwong
Contributor

hyanwong commented Sep 4, 2024

> (I'm also wondering now if scaling by n makes sense in the tsinfer case too. The connection with the population genetic arguments behind it is pretty tenuous when we're doing ancestor matching I think and it may make it easier for us to understand our parameters if we stop doing it)

Yes, this tsinfer application is something that we did discuss with @astheeggeggs a while back, when thinking about the L/S parameterisation. As I recall, I was also unclear as to whether scaling by n was really the right thing to do here. The sample-size dependency makes me feel somewhat uncomfortable.
