Discrepancy Function #317

hfr1tz3 · 2023-09-22T22:27:52Z

This discrepancy function gives us a metric for comparing two tree sequences.
The tree_discrepancy method returns, for the discrepancy between tsa and tsb, the tuple of
a. the total shared span divided by the total node span in tsa (which we need to compute), i.e. the proportion of the span in tsa that is accurately represented in tsb; and
b. the root-mean-squared discrepancy in time, with the average weighted by span in tsa.

nspope · 2023-09-25T18:52:32Z

Awesome, let me know when you want feedback.

hfr1tz3 · 2023-09-25T20:40:23Z

So we need one pass over edges and one pass over roots. To do so we should write a separate function.
Proposal: get the per-node total span in ts, i.e. the weights w_i in sum_i (t_x - t_y)**2 w_i / sum_i w_i.
Count up, for each node, the total span over which it is a child, which we can get from the edge table with np.bincount.
That doesn't account for nodes which are roots, so we also need to include the contribution of nodes where they are roots.
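
A minimal sketch of the span-weighted average in the proposal, with the square root taken to give an RMSE. The helper name `weighted_rmse` and all numbers are made up for illustration and are not part of tsdate.

```python
import numpy as np


def weighted_rmse(t_x, t_y, w):
    """Root-mean-squared difference between t_x and t_y, weighted by w."""
    t_x, t_y, w = (np.asarray(a, dtype=float) for a in (t_x, t_y, w))
    return np.sqrt(np.sum(w * (t_x - t_y) ** 2) / np.sum(w))


# Toy values (assumed): times of three nodes in ts, the times of their
# best matches in the other tree sequence, and each node's span in ts.
times_ts = [10.0, 20.0, 30.0]
times_other = [10.0, 23.0, 26.0]
spans = [5.0, 1.0, 4.0]
rmse = weighted_rmse(times_ts, times_other, spans)
```

Nodes with a large span dominate the average, so a badly dated node that is present over only a short stretch of genome contributes little.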

petrelharp · 2023-10-02T18:18:58Z


import numpy as np


def node_spans(ts):
    """
    Returns the array of "node spans", i.e., the `j`th entry gives
    the total span over which node `j` is in the tree (i.e., does
    not have 'missing data' there).
    """
    # Span over which each node appears as a child of some edge.
    child_spans = np.bincount(
        ts.edges_child,
        weights=ts.edges_right - ts.edges_left,
        minlength=ts.num_nodes,
    )
    # Roots are never a child, so add the span of each tree they root.
    for t in ts.trees():
        span = t.interval[1] - t.interval[0]
        for r in t.roots:
            # do this check to exempt 'missing data'
            if t.num_children(r) > 0:
                child_spans[r] += span
    return child_spans
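
To see the two passes without needing a real tree sequence, here is a sketch on a hand-written edge table. The arrays and the list of roots are assumptions standing in for `ts.edges_*` and `ts.trees()`; they describe a 4-node tree sequence on [0, 10) whose root is node 2 on [0, 6) and node 3 on [6, 10).

```python
import numpy as np

# Hypothetical edge table (child, left, right) for 4 nodes.
edges_child = np.array([0, 1, 1, 2])
edges_left = np.array([0.0, 0.0, 6.0, 6.0])
edges_right = np.array([10.0, 6.0, 10.0, 10.0])
num_nodes = 4

# Pass 1: span over which each node appears as a child.
spans = np.bincount(
    edges_child, weights=edges_right - edges_left, minlength=num_nodes
)

# Pass 2: add the span over which each node is a root with children.
# (tree interval, root) pairs are written out by hand here.
roots = [((0.0, 6.0), 2), ((6.0, 10.0), 3)]
for (left, right), root in roots:
    spans[root] += right - left
```

After the first pass node 3 has span 0 and node 2 only 4; the root pass brings them to 4 and 10 respectively.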

hfr1tz3 · 2023-10-02T22:00:44Z

I think that is everything we need for tree_discrepancy. I think I will be ready for feedback now @nspope .

nspope · 2023-10-09T19:55:32Z

Oh sorry I just saw this! Will look ASAP.

hfr1tz3 · 2023-10-09T20:34:36Z

I still haven't checked if the tests from test_evaluation.py have passed yet, but I am working on trying to do that.

petrelharp · 2023-10-09T23:10:01Z

Hold off @nspope - we looked at this today and @hfr1tz3 has homework yet.

hfr1tz3 · 2023-10-25T21:39:30Z

All tests in test_evaluation.py for tree_discrepancy now pass!
@petrelharp does this mean we are ready for review?

petrelharp · 2023-10-26T03:20:46Z

tests/test_evaluation.py

+        dis, err = evaluation.tree_discrepancy(ts, other)
+        true_error = np.sqrt((2 * 6 * 300**2 + 2 * 2 * 150**2) / 46)
+        assert dis == 0.0
+        assert np.isclose(err, true_error)


Why not add another test that tests with both time and span True?

Now that you mention it, that would make sense to do.

tsdate/evaluation.py

petrelharp · 2023-10-26T03:30:33Z

I have some minor suggestions, then yes - ready for review!

Co-authored-by: Peter Ralph <[email protected]>

petrelharp · 2023-10-27T20:07:38Z

Have a look, @nspope ?

nspope

Looks good, thanks! Just have a couple questions about the "one step" procedure to find the best matching node.

nspope · 2023-10-27T20:33:51Z

tests/test_evaluation.py

+            if shared_spans[i, j] == max_span[i]:
+                match[i, j] = shared_spans[i, j]
+                time_array[i, j] = np.abs(ts.nodes_time[i] - other.nodes_time[j])
+                discrepancy[i, j] = 1 / (1 + match[i, j] * time_array[i, j])


Can you remind me of the rationale here?

The best match should be the node that has the greatest shared span, if this is unambiguous. If it is ambiguous (nodes are tied in max shared span), then we could choose the best match to be the tied node with the closest age to the focal node.

But that two-step procedure isn't happening here, right? So we could conceivably get a "best-matching" node with a relatively small shared span but a very similar age? (and the chance of this happening would depend on the choice of time scaling?)

Oh, I see-- you are doing the two step procedure, sorry! What threw me off was discrepancy[i, j] = 1 / (1 + match[i, j] * time_array[i, j]) ... does this need to include match?

I suppose not, I could just clean it up and use shared_spans[i,j].
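
For readers following the thread, a dense NumPy sketch of how the two-step rule can be expressed as a single argmax. The arrays are invented, and this is only an illustration of the idea discussed above, not the code in the PR.

```python
import numpy as np

# Rows: focal nodes in ts; columns: candidate nodes in other.
shared_spans = np.array([
    [4.0, 4.0, 1.0],  # node 0: candidates 0 and 1 tie on shared span
    [0.0, 2.0, 5.0],  # node 1: candidate 2 is the unambiguous max
])
times_a = np.array([10.0, 50.0])
times_b = np.array([30.0, 12.0, 10.0])

# Step 1: keep only candidates that attain the row maximum.
max_span = shared_spans.max(axis=1)
tied = shared_spans == max_span[:, None]

# Step 2: among those, prefer the smallest age difference.
time_diff = np.abs(times_a[:, None] - times_b[None, :])
score = np.where(tied, 1.0 / (1.0 + time_diff), 0.0)
best_match = score.argmax(axis=1)
```

Candidate 2 has exactly the same age as node 0 but a smaller shared span, so it is masked out before the ages are compared and cannot win.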

nspope · 2023-10-27T20:48:00Z

tsdate/evaluation.py

+    time_difference = np.absolute(np.asarray(ts_times - other_times))
+    best_match_matrix = scipy.sparse.coo_matrix(
+        (
+            1 / (1 + (match_matrix.data * time_difference)),


Same as earlier comment -- couldn't we end up with best matches that don't satisfy the "max shared span" criterion, but are very similar in terms of age?

And, the choice of time scaling would impact the chance of this happening, right? In that, if the time scaling is such that the absolute difference between ages is large relative to typical node spans, then this one-step criterion would be dominated by the difference in ages?

Nevermind, I see how this works-- very nice.

nspope · 2023-10-27T21:15:14Z

tsdate/evaluation.py

+    match = shared_spans.data == max_span[row_ind]
+    # Construct a matrix of potiential matches and
+    match_matrix = scipy.sparse.coo_matrix(
+        (shared_spans.data[match], (row_ind[match], col_ind[match])),


This would work as intended if shared_spans.data[match] was np.ones(len(match)), correct?

nspope · 2023-10-27T21:22:32Z

tests/test_evaluation.py

+    shared_spans = naive_shared_node_spans(ts, other).toarray()
+    max_span = np.max(shared_spans, axis=1)
+    assert len(max_span) == ts.num_nodes
+    match = np.zeros((ts.num_nodes, other.num_nodes))


is match needed? Seems like this would work fine with

    if shared_spans[i, j] == max_span[i]:
        time_array[i, j] = np.abs(...)
        discrepancy[i, j] = 1 / (1 + time_array[i, j])

(also, maybe add a comment about the necessity of 1 / (1 + x) for the sparse matrix format)
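
A small illustration of why `1 / (1 + x)` helps, assuming that a zero in the sparse layout is read as "no entry". Dense arrays stand in for the sparse matrix here and the values are made up.

```python
import numpy as np

# One focal node, three candidates; candidates 0 and 2 tie on shared span.
tied = np.array([True, False, True])
time_diff = np.array([0.0, 7.0, 3.0])  # candidate 0 has exactly the same age

# Storing raw differences: the exact match (0.0) looks like an empty entry.
raw = np.where(tied, time_diff, 0.0)
# Storing 1 / (1 + x): every tied candidate gets a strictly positive value.
score = np.where(tied, 1.0 / (1.0 + time_diff), 0.0)

kept_raw = np.flatnonzero(raw)      # candidates that survive as nonzeros
kept_score = np.flatnonzero(score)
best = int(score.argmax())
```

With raw differences the best candidate is indistinguishable from a non-candidate; with the transformed score it has the largest value, 1.0.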

nspope · 2023-10-27T21:26:23Z

tests/test_evaluation.py

+                match[i, j] = shared_spans[i, j]
+                time_array[i, j] = np.abs(ts.nodes_time[i] - other.nodes_time[j])
+                discrepancy[i, j] = 1 / (1 + match[i, j] * time_array[i, j])
+    best_match = np.argmax(discrepancy, axis=1)


I'm nitpicking here, but I think a more naive (e.g. clearer to read) implementation would not use sparse matrix operations -- instead, go over shared_spans row by row, find the max, find ties, calculate time differences for those ties, and append the best match to a list.

nspope · 2023-10-27T21:30:43Z

Nevermind, I misunderstood on my initial read-through. Looks great!

The only real suggestion I have is to rewrite the naive test suite implementation to be more naive (e.g. clearer on a quick glance). That is, don't use sparse matrix operations, but instead loop over rows of shared_spans, find the set of nodes matching the row max with np.where, get the age difference for these, and append the node with the min age difference to a list.
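
One possible shape for the row-by-row version suggested here, as a sketch only. The function name `naive_best_match` and the toy arrays are hypothetical, and it assumes every row has at least one candidate with positive shared span.

```python
import numpy as np


def naive_best_match(shared_spans, times_a, times_b):
    """For each row, pick the max-span column, breaking ties by age."""
    best = []
    for i, row in enumerate(shared_spans):
        # Columns that attain this row's maximum shared span.
        ties = np.where(row == row.max())[0]
        # Among the ties, take the node with the smallest age difference.
        age_diff = np.abs(times_b[ties] - times_a[i])
        best.append(int(ties[np.argmin(age_diff)]))
    return best


shared_spans = np.array([[4.0, 4.0, 1.0], [0.0, 2.0, 5.0]])
times_a = np.array([10.0, 50.0])
times_b = np.array([30.0, 12.0, 10.0])
result = naive_best_match(shared_spans, times_a, times_b)
```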

petrelharp · 2023-10-27T21:56:23Z

Nevermind, I misunderstood on my initial read-through. Looks great!

Perhaps a short explanatory comment could clarify how it works - I thought it was wrong last time I looked at it also. =)

petrelharp · 2023-10-30T18:50:59Z

We've got line ending problems; for the record here's what I did to fix them:

sed -i 's/\r$//' tsdate/evaluation.py
sed -i 's/\r$//' tests/test_evaluation.py

hyanwong · 2023-11-08T12:57:36Z

Is this ready for merging? I'm happy to do so if @petrelharp and/or @nspope give the OK (assume it's fine for the moment in the tsdate repo). If we merge it this week, it should make it into the next release (although I don't think it's a documented thing yet anyway, right?)

I have a feeling it might require a tskit release too, though?

nspope · 2023-11-08T17:38:44Z

I think this PR is waiting on extend_edges for the tests to pass? Which might now be merged into the github head for tskit. But maybe we should wait until this is in a release -- I don't think there's an immediate rush.

petrelharp · 2023-11-14T05:06:05Z

It is merged to tskit, but we need a release.

nspope · 2023-12-11T19:48:57Z

Note: we need to ensure that the discrepancy function and time RMSE are calculated correctly when nodes have no matches. Currently, the tie would be broken by comparing to all nodes in the tree sequence (we think?). @hfr1tz3 will fix

Added unit test. Edited both naive_discrepancy and tree_discrepancy to exclude non-matched nodes. Had to remove the -n argument in the pytest.ini file. (I have no idea why it's there.)
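
A sketch of the failure mode and one way to guard against it. The arrays are invented, and the `-1` sentinel for "no match" is an assumption made for this example, not necessarily what tree_discrepancy does.

```python
import numpy as np

shared_spans = np.array([
    [3.0, 0.0],
    [0.0, 0.0],  # node 1 shares no span with any candidate
])
times_a = np.array([5.0, 8.0])
times_b = np.array([6.0, 8.0])

max_span = shared_spans.max(axis=1)
# Checking only equality with the row max makes an all-zero row
# "tie" with every column, so age alone would pick the match.
tied_unguarded = shared_spans == max_span[:, None]
# Guard: a candidate only counts if it shares positive span.
tied = tied_unguarded & (shared_spans > 0)
has_match = tied.any(axis=1)

time_diff = np.abs(times_a[:, None] - times_b[None, :])
score = np.where(tied, 1.0 / (1.0 + time_diff), 0.0)
best_match = np.where(has_match, score.argmax(axis=1), -1)
```

Unmatched nodes can then be dropped from the time RMSE with the `has_match` mask.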

petrelharp · 2024-02-21T19:45:50Z

Note that when we add stuff to a tree sequence we add both right and wrong stuff; if the proportion of wrong stuff we add is greater than the proportion of wrong stuff already there, we increase discrepancy, even if this proportion is very low. (Example: extending a true but simplified tree sequence.)

So, proposal is that tree_discrepancy(a, b) also returns the proportion of the total span of b that is matched in a; this would be

(1 - discrepancy) * total_span(a) / total_span(b)
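
A worked instance of the formula above with made-up numbers. The function name `matched_proportion_of_b` is hypothetical.

```python
def matched_proportion_of_b(discrepancy, total_span_a, total_span_b):
    """(1 - discrepancy) * total_span(a) / total_span(b)."""
    return (1.0 - discrepancy) * total_span_a / total_span_b


# Toy numbers (assumed): a has total span 40 and discrepancy 0.1,
# so 36 units of span are shared; b has total span 120.
prop_b = matched_proportion_of_b(0.1, 40.0, 120.0)
```

Here 90% of a is matched, but that shared span is only 30% of b, which is the asymmetry the second returned value is meant to expose.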

petrelharp · 2024-02-21T19:47:21Z

Also, an evaluation.total_span(ts) function that just does np.sum(evaluation.node_spans(ts)).

tsdate/evaluation.py

petrelharp · 2024-02-23T00:58:26Z

I think the code does what we want but the docs had it the other way around? I could be mixed up, though - please check?

petrelharp · 2024-02-23T05:54:50Z

Okay - if you agree with my suggested changes, then please hit the "commit change" button on them.

Co-authored-by: Peter Ralph <[email protected]>

petrelharp · 2024-08-09T22:09:32Z

TODO:

squash commits
rebase to main
double check things are ready to merge

However, this currently uses extend_edges, which we are renaming extend_haplotypes; so let's (a) make the change to extend_haplotypes; (b) check the tests run locally; (c) mark them as skipped so this passes; (d) make an issue to un-skip these when a new tskit with extend_haplotypes is released.

hfr1tz3 and others added 3 commits September 8, 2023 14:46

discrepancy function

2da31ba

reformat discrepancy function

fa1f4ac

discrepancy tests

13807e3

hfr1tz3 added 2 commits October 2, 2023 12:25

added node_spans

47991ab

added naive tests, cleaned up comments

f9275ee

hfr1tz3 added 3 commits October 9, 2023 11:58

some edits

87fb038

flake fix

a7c656e

flake fix

a8d8493

hfr1tz3 and others added 4 commits October 23, 2023 11:16

tests edits

dea2543

test edits

bc35e83

test node spans

fa8a8db

passes tests

2a39628