Can I get a list of descendants of a branch (edge)? #2882
Hi @hanbin973, welcome 👋 Here's an inefficient (and totally untested!) first pass:

```python
import numpy as np

# E is the set of samples that descend from each edge.
E = [set() for _ in range(ts.num_edges)]
for tree in ts.trees():
    for u in tree.nodes():
        if tree.edge(u) != -1:
            # set.update accepts the iterator returned by tree.samples()
            E[tree.edge(u)].update(tree.samples(u))

A = np.zeros((ts.num_samples, ts.num_edges))
for e in range(ts.num_edges):
    for u in E[e]:
        # WARNING! This assumes that samples are 0, ..., n - 1,
        # which isn't true in general.
        A[u, e] += 1
```

Is this the object you want to compute? If so, we can think about how to make it faster using approaches discussed in #2869.
Great! It's exactly what I was looking for. I'm trying to do some sort of complex trait analysis, so I'll return after reading #2869.
The following is an adaptation from the docs.

Result:

If I'm correct, the takeaway is that edge weights (here, the sum of phenotypes over all direct/indirect descendant nodes) can be stored more compactly at nodes, because the number of edges far exceeds the number of nodes.
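In outline, the idea is something like this (a rough sketch, with the per-sample phenotype vector `y` as a stand-in for my actual data):

```python
import numpy as np

def node_phenotype_sums(ts, y):
    # w[u] accumulates the sum of y over all samples that descend from
    # node u in at least one local tree; the weight of an edge is then
    # just the accumulated value at its child node.
    w = np.zeros(ts.num_nodes)
    w[ts.samples()] = y
    seen = set()
    for edge in ts.edges():
        # The same parent-child pair can appear in several edges over
        # different genomic intervals, so count each pair only once.
        if (edge.parent, edge.child) not in seen:
            seen.add((edge.parent, edge.child))
            # Edges are sorted so that children are processed before
            # their parents, hence a single pass suffices.
            w[edge.parent] += w[edge.child]
    return w
```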
That's very nice @hanbin973! Very clean way of doing it. Here's an implementation using numba, which should be quite fast:

```python
import numba
import numpy as np

@numba.njit
def _num_graph_descendant_samples(edges_parent, edges_child, W):
    last_parent = -1
    children = set()
    for e in range(edges_parent.shape[0]):
        if edges_parent[e] != last_parent:
            # Flush the accumulated children of the previous parent.
            if last_parent != -1:
                for c in children:
                    W[last_parent] += W[c]
            children.clear()
            last_parent = edges_parent[e]
        children.add(edges_child[e])
    if last_parent != -1:
        for c in children:
            W[last_parent] += W[c]

def num_graph_descendant_samples(ts):
    """
    Returns the number of samples that are reachable from each node in the
    graph, i.e., that descend from a particular node in at least one local tree.
    """
    W = np.zeros(ts.num_nodes)
    W[ts.samples()] = 1
    _num_graph_descendant_samples(ts.edges_parent, ts.edges_child, W)
    return W
```
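A quick usage sketch (using msprime to simulate some data; the parameters are arbitrary):

```python
import msprime

# Simulate a small tree sequence and count graph descendants per node.
ts = msprime.sim_ancestry(
    10, sequence_length=1e5, recombination_rate=1e-8, random_seed=42
)
counts = num_graph_descendant_samples(ts)
print(counts)  # counts[u] = number of samples reachable from node u
```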
Thank you for the nice refactoring. The next step is to find a way to multiply A from the edge side. If A is an N (sample nodes) * E (edges) matrix as before, how should we compute A*b, where b is an E * 1 vector? This expression frequently appears in regression contexts like E(y|A) = A*b (A is the explanatory variable and b holds the coefficients of the edges). My idea is that traversing the DAG upward from each node will do. I guess there's something we can borrow from
According to my quick look-up, the implementation of

Let N be the number of sample nodes, E the number of edges (as before), and I the number of all nodes, including the internal ones. Let B be an I * E matrix whose (i, e) entry is 1 if the i-th node is the direct child of the e-th edge (0 otherwise). Hence, this matrix is sparse, with only E non-zero elements. The matrix can be directly obtained from
or we could just loop over the edges to save memory.

Let C be an N * I matrix whose (n, i) entry is 1 if the n-th sample is a descendant (including indirect relationships) of the i-th node. An important observation is that the A matrix (the N * E matrix defined previously) factorizes as A = C * B. Hence, A*b = (C * B) * b = C * (B * b) by associativity, and each matrix-vector multiplication can be done efficiently.

The remaining step is, for each sample, to traverse upward and add elements of
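For example, the B factor can be built with scipy.sparse (the function name here is just illustrative; the C factor is what the upward traversal below handles):

```python
import numpy as np
from scipy.sparse import csr_matrix

def child_incidence_matrix(ts):
    # B[i, e] = 1 exactly when node i is the direct child of edge e,
    # so there is one non-zero entry per edge.
    data = np.ones(ts.num_edges)
    rows = ts.edges_child
    cols = np.arange(ts.num_edges)
    return csr_matrix((data, (rows, cols)), shape=(ts.num_nodes, ts.num_edges))

# B @ b collapses the edge-indexed vector b onto nodes; what remains of
# A*b = C*(B*b) is the C multiplication, i.e. the upward traversal.
```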
This is the actual code that works. I didn't notice that @jeromekelleher had already written a depth-first search over ARGs. The following is a minor modification of his code in #2869.

```python
# A better name for the functions? graph + mv (matrix-vector product) +
# nodes/samples (the side of the multiplication).

def graph_mv_nodes(ts, arg, vec):
    """
    Perform a matrix-vector multiplication through a depth-first search.
    vec: length is number of nodes
    out: length is number of samples
    """
    out = np.zeros(ts.num_samples, dtype=np.float32)  # declare outcome vector
    for s in ts.samples():
        u = s
        # Indexing out by sample ID assumes samples are 0, ..., n - 1
        # (the same caveat as noted above).
        out[s] += vec[s]
        is_ancestor = np.zeros(ts.num_nodes, dtype=bool)
        is_ancestor[u] = True
        stack = [u]
        while len(stack) > 0:
            u = stack.pop()
            for j in range(*arg.parent_range[u]):
                e = arg.parent_index[j]
                e_parent = ts.edges_parent[e]
                if not is_ancestor[e_parent]:
                    # Note: setting is_ancestor here because we can
                    # push the same node on the stack multiple times otherwise
                    is_ancestor[e_parent] = True
                    stack.append(e_parent)
                    # update out vector
                    out[s] += vec[e_parent]
    return out

def graph_mv_samples(ts, arg, vec):
    """
    Perform a matrix-vector multiplication through a depth-first search.
    vec: length is number of samples
    out: length is number of nodes
    """
    out = np.zeros(ts.num_nodes, dtype=np.float32)  # declare outcome vector
    for s in ts.samples():
        u = s
        is_ancestor = np.zeros(ts.num_nodes, dtype=bool)
        is_ancestor[u] = True
        stack = [u]
        while len(stack) > 0:
            u = stack.pop()
            for j in range(*arg.parent_range[u]):
                e = arg.parent_index[j]
                e_parent = ts.edges_parent[e]
                if not is_ancestor[e_parent]:
                    # Note: setting is_ancestor here because we can
                    # push the same node on the stack multiple times otherwise
                    is_ancestor[e_parent] = True
                    stack.append(e_parent)
        # Every ancestor of s receives s's entry of vec
        # (vec indexed by sample ID: same 0..n-1 assumption as above).
        out += is_ancestor * vec[s]
    return out
```

Execution code:

Result:
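For completeness, here is a sketch of how the `arg.parent_range` / `arg.parent_index` arrays used above could be built directly from a tree sequence (my guess at the indexing scheme from #2869, not the original code):

```python
import numpy as np
from types import SimpleNamespace

def build_arg_index(ts):
    # Group edge IDs by their child node: for node u,
    # parent_index[parent_range[u, 0]:parent_range[u, 1]] lists the edges
    # whose child is u, i.e. the edges leading to u's parents in the graph.
    parent_index = np.argsort(ts.edges_child, kind="stable")
    breaks = np.searchsorted(
        ts.edges_child[parent_index], np.arange(ts.num_nodes + 1)
    )
    parent_range = np.column_stack([breaks[:-1], breaks[1:]])
    return SimpleNamespace(parent_index=parent_index, parent_range=parent_range)

# arg = build_arg_index(ts)
# out = graph_mv_nodes(ts, arg, vec)
```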
I've not sat down to think carefully about the comparison, but here's an algorithm that does something equivalent: #1547 (comment) (it's looking at mutations rather than edges, but the basic idea is the same).

Thank you for the suggestion @petrelharp! It's a bit difficult for me to understand your algorithm. Could you expand?
Very nice @hanbin973! It should be fairly straightforward to accelerate your algorithms above using numba, and you should then be able to use them on pretty big tree sequences. Let me know if you'd like some help with this. There's probably some improvement to be made by doing the upward traversals for all samples at the same time (rather than sample-by-sample), but I haven't thought much about this.
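Something like this might work for `graph_mv_nodes` (totally untested! — the kernel takes plain numpy arrays since numba can't see through the `arg` object, and writing to `out[i]` rather than `out[s]` sidesteps the samples-are-0..n-1 assumption):

```python
import numba
import numpy as np

@numba.njit
def _graph_mv_nodes(samples, edges_parent, parent_range, parent_index, vec, num_nodes):
    out = np.zeros(samples.shape[0], dtype=np.float32)
    for i in range(samples.shape[0]):
        s = samples[i]
        out[i] += vec[s]
        is_ancestor = np.zeros(num_nodes, dtype=np.bool_)
        is_ancestor[s] = True
        stack = [s]
        while len(stack) > 0:
            u = stack.pop()
            for j in range(parent_range[u, 0], parent_range[u, 1]):
                p = edges_parent[parent_index[j]]
                if not is_ancestor[p]:
                    is_ancestor[p] = True
                    stack.append(p)
                    out[i] += vec[p]
    return out

def graph_mv_nodes_numba(ts, arg, vec):
    return _graph_mv_nodes(
        ts.samples(), ts.edges_parent,
        np.asarray(arg.parent_range), np.asarray(arg.parent_index),
        vec.astype(np.float32), ts.num_nodes,
    )
```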
Agreed @jeromekelleher. I think it's possible to propagate mutations "downwards" somehow, rather than climbing upwards. The simultaneous downward approach could remove redundant calculations; e.g., some mutations share descendants, so the shared path is traversed multiple times if not done simultaneously. @petrelharp's algorithm seems to remove these redundancies. I tried to reproduce his idea without really understanding it, and the result doesn't seem to be correct. For example, for each edge, climb up to the root and record the intermediate nodes in a stack. This would take at most "depth of the tree" steps, so O(log N). Next, pop the elements from the stack and add x[v] to x[u]. This climbs back down the same path, so it is also O(log N). Repeating over edges is therefore O(num_edges * log N). I think some duplicate additions are being made here, and my rough guess is that I'm not following how the values of x are updated when the descendant samples change over trees. Such an incremental algorithm would definitely be desirable, but I really need concrete examples to fully implement it myself.
Hi there, thank you for developing a terrific library. I'm trying to perform certain matrix operations using tskit.

Suppose that A is a matrix in which the rows (N) are individuals and the columns (E) are edges (branches). The value is 1 if the node is a descendant of an edge (not only a direct descendant but any descendant) and 0 otherwise. A is the same as the genotype matrix, given one mutation per branch.

I would like to multiply this matrix A with a phenotype matrix X of individuals (X is N x P, where P is the number of phenotypes) to get A^T X. Is this possible within the statistics functionality of tskit?
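In other words, the object I'm after is just the following (dimensions are illustrative):

```python
import numpy as np

N, E, P = 5, 8, 2  # samples, edges, phenotypes
A = np.random.randint(0, 2, size=(N, E)).astype(float)  # descendant indicators
X = np.random.randn(N, P)  # phenotypes
result = A.T @ X  # shape (E, P): per-edge sums of descendant phenotypes
```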