-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Discrepancy Function #317
base: main
Are you sure you want to change the base?
Discrepancy Function #317
Conversation
Awesome, let me know when you want feedback. |
So we need to pass over edges and pass over roots. To do so we should write a separate function. |
|
I think that is everything we need for |
Oh sorry I just saw this! Will look ASAP. |
I still haven't checked if the tests from |
All tests in |
tests/test_evaluation.py
Outdated
dis, err = evaluation.tree_discrepancy(ts, other) | ||
true_error = np.sqrt((2 * 6 * 300**2 + 2 * 2 * 150**2) / 46) | ||
assert dis == 0.0 | ||
assert np.isclose(err, true_error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not add another test that tests with both time and span True?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you mention it, that would make sense to do.
I have some minor suggestions, then yes - ready for review! |
Co-authored-by: Peter Ralph <[email protected]>
Co-authored-by: Peter Ralph <[email protected]>
Have a look, @nspope ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks! Just have a couple questions about the "one step" procedure to find the best matching node.
tests/test_evaluation.py
Outdated
if shared_spans[i, j] == max_span[i]: | ||
match[i, j] = shared_spans[i, j] | ||
time_array[i, j] = np.abs(ts.nodes_time[i] - other.nodes_time[j]) | ||
discrepancy[i, j] = 1 / (1 + match[i, j] * time_array[i, j]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remind me of the rationale here?
The best match should be the node that has the greatest shared span, if this is unambiguous. If it is ambiguous (nodes are tied in max shared span), then we could choose the best match to be the tied node with with the closest age to the focal node.
But that two-step procedure isn't happening here, right? So we could conceivably get a "best-matching" node with a relatively small shared span but a very similar age? (and the chance of this happening would depend on the choice of time scaling?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see-- you are doing the two step procedure, sorry! What threw me off was discrepancy[i, j] = 1 / (1 + match[i, j] * time_array[i, j])
... does this need to include match
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose not, I could just clean it up and use shared_spans[i,j]
.
tsdate/evaluation.py
Outdated
time_difference = np.absolute(np.asarray(ts_times - other_times)) | ||
best_match_matrix = scipy.sparse.coo_matrix( | ||
( | ||
1 / (1 + (match_matrix.data * time_difference)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as earlier comment -- couldn't we end up with best matches that don't satisfy the "max shared span" criterion, but are very similar in terms of age?
And, the choice of time scaling would impact the chance of this happening, right? In that, if the time scaling is such that the absolute difference between ages is large relative to typical node spans, then this one-step criterion would be dominated by the difference in ages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, I see how this works-- very nice.
tsdate/evaluation.py
Outdated
match = shared_spans.data == max_span[row_ind] | ||
# Construct a matrix of potiential matches and | ||
match_matrix = scipy.sparse.coo_matrix( | ||
(shared_spans.data[match], (row_ind[match], col_ind[match])), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would work as intended if shared_spans.data[match]
was np.ones(len(match))
, correct?
tests/test_evaluation.py
Outdated
shared_spans = naive_shared_node_spans(ts, other).toarray() | ||
max_span = np.max(shared_spans, axis=1) | ||
assert len(max_span) == ts.num_nodes | ||
match = np.zeros((ts.num_nodes, other.num_nodes)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is match
needed? Seems like this would work fine with
if shared_spans[i, j] == max_span[i]:
time_array[i, j] = np.abs(...)
discrepancy[i, j] = 1 / (1 + time_array[i, j])
(also, maybe add a comment about the necessity of 1 / (1 + x) for the sparse matrix format)
tests/test_evaluation.py
Outdated
match[i, j] = shared_spans[i, j] | ||
time_array[i, j] = np.abs(ts.nodes_time[i] - other.nodes_time[j]) | ||
discrepancy[i, j] = 1 / (1 + match[i, j] * time_array[i, j]) | ||
best_match = np.argmax(discrepancy, axis=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm nitpicking here, but I think a more naive (e.g. clearer to read) implementation would not use sparse matrix operations -- instead, go over shared_spans
row by row, find the max, find ties, calculate time differences for those ties, and append the best match to to a list.
Nevermind, I misunderstood on my initial read-through. Looks great! The only real suggestion I have is to rewrite the naive test suite implementation to be more naive (e.g. clearer on a quick glance). That is, don't use sparse matrix operations, but instead loop over rows of |
Perhaps a short explanatory comment could clarify how it works - I thought it was wrong last time I looked at it also. =) |
We've got line ending problems; for the record here's what I did to fix them:
|
Is this ready for merging? I'm happy to do so if @petrelharp and/or @nspope give the OK (assume it's fine for the moment in the I have a feeling it might require a tskit release too, though? |
I think this PR is waiting on |
It is merged to tskit, but we need a release. |
Note: we need to ensure that the discrepancy function and time RMSE are calculated correctly when nodes have no matches. Currently, the tie would be broken by comparing to all nodes in the tree sequence (we think?). @hfr1tz3 will fix |
Added unit test. Edited both naive_discrepancy and tree_discrepancy to exclude non-matched nodes. Had to remove -n argument in pytest.ini file. (I have no idea why its there)
Note that when we add stuff to a tree sequence we add both right and wrong stuff; if the proprotion of wrong stuff we add is greater than the proportion of wrong stuff already there, we increase discrepancy, even if this proportion is very low. (Example: extending a true but simplified tree sequence.) So, proposal is that
|
Also, a |
I think the code does what we want but the docs had it the other way around? I could be mixed up, though - please check? |
Okay - if you agree with my suggested changes, then please hit the "commit change" button on them. |
Co-authored-by: Peter Ralph <[email protected]>
Co-authored-by: Peter Ralph <[email protected]>
TODO:
However, this currently uses |
This discrepancy functions gives us a metric for comparing two tree sequences.
The
tree_discrepancy
method returns for the discrepancy between tsa and tsb, the tuple ofa. the total shared span divided by total node span in tsa (which we need to compute); so this is proprotion of the span in tsa that is accurately represented in tsb; and
b. the root-mean-squared discrepancy in time, with the average weighted by span in tsa.