Identify lowest normalised HMM likelihood value given MMR #242
Comments
Relevant to this discussion are: …
|
Here's a first stab at thinking about this issue. First, some notation. Given a reference panel of $n$ haplotypes, a per-site mutation probability $\mu$, and a number of mismatches $k$ used to solve for the recombination probability $\rho$, the HMM probabilities of transition and emission are defined as in tsinfer as follows: the chain stays on the current haplotype with probability $1 - \rho + \rho/n$ and switches to any particular other haplotype with probability $\rho/n$, while the emitted allele matches the copied haplotype with probability $1 - 3\mu$ and takes each of the other three alleles with probability $\mu$ (assuming four possible states), where $\rho = n\mu^k / \big((1-\mu)^k + (n-1)\mu^k\big)$ (as in solve_num_mismatches). Once the above are defined, then we define the relative likelihood of a single mismatch, $M = \mu / (1 - 3\mu)$, and of a single switch, $R = (\rho/n) / (1 - \rho + \rho/n)$. We define the lowest normalised likelihood value of a node as follows: the normalised likelihood of a node whose copying path involves $i$ mismatches and $j$ switches, relative to a perfect match, is $M^i R^j$. By extension, if there exist sequences that have … |
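As a quick check of the parameterisation above, here is a minimal sketch (an illustration only, not code from sc2ts or tsinfer) confirming that the emission and transition probabilities as written sum to one:

```python
# Sanity check: the LS HMM probabilities above are properly normalised,
# assuming four alleles and a reference panel of n haplotypes.
mu = 1e-3   # per-site mutation probability
rho = 1e-5  # recombination probability
n = 10      # reference panel size

# Emission: match with probability 1 - 3*mu, each of the 3 other alleles with mu.
emission_total = (1 - 3 * mu) + 3 * mu

# Transition: stay with probability 1 - rho + rho/n, switch to each of the
# other n - 1 haplotypes with probability rho/n.
transition_total = (1 - rho + rho / n) + (n - 1) * (rho / n)

print(emission_total, transition_total)  # both 1.0 (up to floating point)
```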
Generalising the above a bit…
|
Sounds right to me |
Here is a simple illustration for the scenario with mutation but without recombination.

```python
def compute_rho(mu, k, n):
    rho = n * mu ** k
    rho /= (1 - mu) ** k + (n - 1) * mu ** k
    return rho

def compute_M(mu):
    # Assume four possible states
    return mu / (1 - 3 * mu)

def compute_R(rho, n):
    R = rho / n
    R /= 1 - rho + rho / n
    return R

mu = 1e-3  # Currently used in sc2ts when k > 0
k = 3  # Currently used in sc2ts
n = 10

for i in range(5):
    print(i, compute_M(mu) ** i)
```

To differentiate cases where the number of mismatches is equal to 0 versus 1 or higher, a precision value of 0 is needed, since $M \approx 10^{-3}$ rounds to zero while $M^0 = 1$ does not; a precision of 3 is needed to further differentiate 0 versus 1 versus 2 or higher. By increasing $\mu$, a lower precision would suffice to make these distinctions.
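Here is a small sketch (not part of the original code) making the rounding explicit, reusing compute_M from above:

```python
import numpy as np

def compute_M(mu):
    # Relative likelihood of a single mismatch, assuming four possible states.
    return mu / (1 - 3 * mu)

mu = 1e-3
likelihoods = np.array([compute_M(mu) ** i for i in range(4)])
for precision in [0, 3, 6, 10]:
    # Mismatch counts whose values collapse to the same rounded number
    # can no longer be told apart at that precision.
    print(precision, np.round(likelihoods, precision))
```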
|
Here is a similar exercise for the scenario with recombination but without mutation.

```python
mu = 1e-2
k = 3
n = 10
rho = compute_rho(mu, k, n)

for i in range(5):
    print(i, compute_R(rho, n) ** i)
```

We need a precision value of 0 to differentiate cases where the number of switches is equal to 0 versus 1 or higher; and 6 to differentiate cases where the number of switches is equal to 0 versus 1 versus 2 or higher. Note that as … |
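The same rounding check for switches (a rough sketch, with the formulas restated so it runs standalone):

```python
import numpy as np

# Same formulas as compute_rho / compute_R defined earlier in the thread.
def compute_rho(mu, k, n):
    return n * mu ** k / ((1 - mu) ** k + (n - 1) * mu ** k)

def compute_R(rho, n):
    return (rho / n) / (1 - rho + rho / n)

mu, k, n = 1e-2, 3, 10
rho = compute_rho(mu, k, n)
likelihoods = np.array([compute_R(rho, n) ** j for j in range(3)])
for precision in [0, 6, 12]:
    # Switch counts whose values collapse together are indistinguishable.
    print(precision, np.round(likelihoods, precision))
```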
Next, let's consider the scenario with both recombination and mutation. The lowest normalised likelihood given $i$ mismatches and $j$ switches is $M^i R^j$:

```python
mu = 1e-2
k = 3
n = 1e5
rho = compute_rho(mu, k, n)

for i in range(3):
    for j in range(3):
        print(i, j, compute_M(mu) ** i * compute_R(rho, n) ** j)
```
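A small helper along these lines (hypothetical, not an existing sc2ts/tsinfer function) gives the number of decimal places needed for each combination of mismatches and switches to survive rounding:

```python
import math

# Same formulas as defined earlier in the thread.
def compute_rho(mu, k, n):
    return n * mu ** k / ((1 - mu) ** k + (n - 1) * mu ** k)

def compute_M(mu):
    return mu / (1 - 3 * mu)

def compute_R(rho, n):
    return (rho / n) / (1 - rho + rho / n)

def min_precision(relative_likelihood):
    # Smallest number of decimal places at which this value still rounds
    # to something non-zero.
    return math.ceil(-math.log10(2 * relative_likelihood))

mu, k, n = 1e-2, 3, 1e5
rho = compute_rho(mu, k, n)
for i in range(3):
    for j in range(3):
        value = compute_M(mu) ** i * compute_R(rho, n) ** j
        print(i, j, value, min_precision(value))
```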
|
Nice. What happens if we set mu=0.5 in these calculations? 1e-2 is arbitrary, and we want to make the magnitudes of the likelihoods as large as possible, while keeping everything as a sensible probability. Ideally we should solve for mu with these constraints, given k. |
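Here's a rough sketch of solving for the largest usable mu (the margin value is arbitrary; the constraint a * mu < 1 comes from requiring M < 1, matching the assertion in the generalised compute_M further down):

```python
# Sketch: the largest mu for which one extra mismatch still lowers the
# relative likelihood, i.e. M = mu / (1 - (a - 1) * mu) < 1, which holds
# exactly when a * mu < 1 for a alleles.
def max_mu(num_alleles, margin=1e-6):
    return 1.0 / num_alleles - margin

for num_alleles in [2, 4, 5]:
    mu = max_mu(num_alleles)
    M = mu / (1 - (num_alleles - 1) * mu)
    print(num_alleles, mu, M)  # M is just below 1 in each case
```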
I am assuming four possible nucleotides above, so the maximum value for $\mu$ is just below 0.25. For these values I get …
|
Assuming two possible nucleotides, and setting $\mu = 0.5$, I get this pathological case where the relative likelihood is always 1.0. |
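A one-line check of that case (using the same M formula with two alleles):

```python
# With two alleles and mu = 0.5, M = 0.5 / (1 - 0.5) = 1, so every number of
# mismatches gets the same relative likelihood.
mu = 0.5
M = mu / (1 - (2 - 1) * mu)
print([M ** i for i in range(5)])  # [1.0, 1.0, 1.0, 1.0, 1.0]
```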
We do want to relate precision to HMM cost instead of number of mismatches and switches, right? Which of course should be easy, given penalty weights for each type of event. |
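One way of sketching that mapping (a hypothetical helper, not existing code; the weights are just -log10 of the per-event relative likelihoods defined earlier):

```python
import math

# Same formulas as defined earlier in the thread.
def compute_rho(mu, k, n):
    return n * mu ** k / ((1 - mu) ** k + (n - 1) * mu ** k)

def compute_M(mu):
    return mu / (1 - 3 * mu)

def compute_R(rho, n):
    return (rho / n) / (1 - rho + rho / n)

def path_cost(num_mismatches, num_switches, M, R):
    # Additive HMM cost: each event contributes its -log10 relative likelihood,
    # so the relative likelihood of the whole path is 10 ** (-cost).
    return num_mismatches * -math.log10(M) + num_switches * -math.log10(R)

mu, k, n = 1e-2, 3, 1e5
rho = compute_rho(mu, k, n)
M, R = compute_M(mu), compute_R(rho, n)
for i in range(3):
    for j in range(3):
        cost = path_cost(i, j, M, R)
        # A precision of p decimal places can only resolve paths whose total
        # cost is up to roughly p.
        print(i, j, round(cost, 2), 10 ** -cost)
```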
Ultimately we want to do something like this: …
We don't have … |
Okay, just noting that I don't think it makes sense when … |
Something like this might do, assuming no recombination.

```python
def compute_M(mu, num_alleles):
    assert num_alleles * mu < 1.0
    return mu / (1 - (num_alleles - 1) * mu)

mu = 0.25 - 1e-1
k = 3
n = 10

for i in range(5):
    print(i, compute_M(mu, num_alleles=4) ** i)
```
|
Just noting that in sc2ts, there are actually 5 rather than 4 distinct allele states, because gaps are treated as a distinct allele during LS matching (but N is treated as missing data). Because the MMR is sufficiently high, recombination is pretty rare, and we get:

```python
mu = 0.20 - 1e-1
k = 3
n = 1e3

for i in range(5):
    print(i, compute_M(mu, num_alleles=5) ** i)
```
|
This example illustrates why it makes sense that we can use a lower precision when running the LS HMM without sacrificing the accuracy of the matching results. Let's take 50 samples from the Viridian dataset from the early days of the pandemic to form a reference panel, assuming no recombination and a probability of mutation set to … It is clear that the most likely path (the yellow line) emerges as the algorithm proceeds from the first site (site index = 0) to the last site (site index = 121). The pattern is there at the lowest precision level of 0. Here are the unique likelihood values from each matrix.

```
# precision = 10
[0.00000000e+00 3.00000000e-10 2.60000000e-09 2.32000000e-08
 2.61000000e-08 2.09100000e-07 1.88170000e-06 1.69351000e-05
 1.52415800e-04 1.71467800e-04 1.37174210e-03 1.54320990e-03
 1.23456790e-02 1.11111111e-01 1.00000000e+00]
# precision = 6
[0.00000e+00 2.00000e-06 1.70000e-05 1.52000e-04 1.72000e-04 1.37200e-03
 1.54300e-03 1.23460e-02 1.11111e-01 1.00000e+00]
# precision = 2
[0.   0.01 0.11 1.  ]
# precision = 1
[0.  0.1 1. ]
# precision = 0
[0. 1.]
```
|
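The collapsing of distinct values can also be seen by rounding directly (a rough sketch; in the actual runs the rounding is applied inside the forward algorithm, so the exact numbers differ slightly from rounding after the fact):

```python
import numpy as np

# The unique values from the precision = 10 matrix above, used as a stand-in
# for the full forward matrix.
values = np.array([
    0.0, 3.0e-10, 2.6e-09, 2.32e-08, 2.61e-08, 2.091e-07, 1.8817e-06,
    1.69351e-05, 1.524158e-04, 1.714678e-04, 1.3717421e-03, 1.5432099e-03,
    1.23456790e-02, 1.11111111e-01, 1.0,
])
for precision in [10, 6, 2, 1, 0]:
    # Count how many distinct relative likelihoods survive at each precision.
    print(precision, len(np.unique(np.round(values, precision))))
```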
Hmm, so it turns out that our parameter values depend quite strongly on the reference panel size. Here's the value of solve_num_mismatches for values of n we need to deal with, when we fix mu=1e-2:
So, the absolute values of the likelihoods are going to change depending on n, and so our precision values (or more sensibly likelihood thresholds) will also have to take this into account. This is all very fiddly, and I wonder if it's unnecessarily complicated. A lot of the complication here stems from the fact that recombination becomes relatively more likely as the size of the reference panel increases in the LS model. But, is this actually necessary in this case? As recombination events are rare, and we're really just doing parsimony matches in the case of sc2ts, it seems odd now that I think about it to change the relative likelihoods based on the number of sequences in the tree. Should recombination really be more likely if we sample heavily vs sparsely? What if we made a simpler LS model where we don't scale by n by doing this in the tsinfer (and later tskit) code:
(or just set n = 1 if scale_by_n is true). I think this should make things simpler and more interpretable here. Any thoughts @szhan @hyanwong @astheeggeggs? (I'm also wondering now if scaling by n makes sense in the tsinfer case too. The connection with the population genetic arguments behind it is pretty tenuous when we're doing ancestor matching, I think, and it may make it easier for us to understand our parameters if we stop doing it.) |
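To make the effect of that change concrete, here is a sketch of one reading of the proposal, using the compute_R formula from above with a fixed rho (illustrative values only):

```python
# With the n scaling, the probability of switching to one particular other
# haplotype is rho / n; dropping the scaling (equivalently, setting n = 1 in
# the formula) makes the relative likelihood of a switch independent of the
# panel size.
def compute_R(rho, n):
    return (rho / n) / (1 - rho + rho / n)

rho = 1e-6  # a fixed recombination probability, for illustration
for n in [10, 1_000, 100_000]:
    print(n, compute_R(rho, n), compute_R(rho, n=1))
```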
Just making a comment to clarify that I was looking at the relative likelihood (i.e. normalised by max likelihood) rather than the absolute likelihood above. |
Quick implementation of these ideas here for experimentation: tskit-dev/tsinfer#959 |
Experimental code using this here: #258. Will run it now and see how it goes. |
I'm trying to interpret what … |
And what we are storing in the ts compressed matrix is the relative likelihood of switching to any one other reference sequence in the panel. If we don't scale by n, then we are storing the relative likelihood of switching to any other reference sequence (besides the current reference sequence) in the panel, right? |
This discussion is starting to branch a bit. I'd like to insert a comment here to follow up on the earlier comment. This example shows that when the closest match to a query sequence in the reference panel has one mismatch compared to the query, we need a higher precision value (at least 1) to get the correct match. If the precision value is too low (equal to 0), we can get a wrong match. |
That sounds right, and what we want when recombination is rare. I think we were already doing this, effectively, by solving for a given number of mismatches. So basically, we were multiplying by n in the input parameters so that we could divide it through again during the HMM evaluation. Simpler to just factor it out entirely (as the code changes are so small)? |
Yes, this tsinfer application is something that we did discuss with @astheeggeggs a while back, when thinking about the L/S parameterisation. As I recall, I was also unclear as to whether scaling by n was really the right thing to do here. The sample-size dependency makes me feel somewhat uncomfortable. |
Currently, mu (i.e. the mutation rate) is arbitrarily set to 1e-3, but by setting it to a higher value, we could do better in terms of computational speed.
We need to do some math to figure out the relationship between a lower limit for the HMM relative likelihood and the precision value, given constraints on the HMM input parameters, mu (which should be as high as possible) and MMR.
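One way to state the relationship being asked for (a formulation of the rounding rule only, for integer precision $p$): a relative likelihood $L$ survives rounding to $p$ decimal places when

$$ L \ge \tfrac{1}{2} \times 10^{-p} \quad\Longleftrightarrow\quad p \ge \left\lceil -\log_{10}(2L) \right\rceil, $$

so given a lower limit $L_{\min}$ on the normalised likelihood implied by mu and the MMR, the smallest usable precision would be $\lceil -\log_{10}(2 L_{\min}) \rceil$.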