-
Notifications
You must be signed in to change notification settings - Fork 26
supertree via ranked trees on a taxo scaffold
This page describes an alternative supertree method summarized by:
- taxonomic groups that are not contested by any input must be in the final supertree,
- after honoring that constraint, we attempt to find supertree which maximizes the "max weight of input tree edges displayed" objective function using tree-based ranks to supply the weights.
This doc builds off of the supertrees using tree ranking doc; so that is considered background reading.
It seems that procedure for finding an optimal to the "max weight of input tree edges displayed" criterion may not scale well even when using our crude tree-based ranks.
MTH recently realized that:
- the set of optimal phylogenetic graphs (during the step of adding each new tree) may require the use of a set of phylogenetic graphs rather than just one. (each member of the set would need to be built upon until it was obvious that it was sub-optimal),
- the presence of an taxonomic group that is not contested by any input is not sufficient to guarantee that the group will be present the final trees. This complicates the requirements for the "divide" step of the divide-and-conquer method. Note that Rod Page's modified mincut supertrees algorithm (pdf here) modified the original mincut approach (Charles Semple and Mike Steel 2000 pdf here) to retain uncontested groups when feasible. Here were are assuring that the set of uncontentested groups that we constrain to be present are jointly compatible because they all come from one tree (the phylogenetic interpretation of the taxonomy).
- the order which you choose to add group from one tree affects the output graph. So an exact method would need to follow up each possibility.
A high priority desirable property for a method for producing the final tree is that it should make it easy to explain why groups are displayed and how someone from the community of systematic biologists could fix a spuriuos grouping by correcting an input tree or adding another input tree.
Making the constraint that "taxonomic groups that are not contested by any input must be in the final supertree":
- is very easy to explain,
- makes an obvious fix apparent (if you know genus X is not monophyletic, please add a tree that you know of which contests its monophyly).
- makes the construction of the tree easier (by making it easy to break the problem into several subproblems; working on the set of inputs that was used for the open tree draft version 2 tree, there are 6354 such problems that have more than one input).
The inputs are the same as the other pipeline.
Any tip not mapped to OTT cannot be used (as before). The method described below may not require the pruning of repeated or nested labels. However, nested taxa being treated as tips of the same tree should almost certainly be treated as a curation error.
short term: continure starting with our treemachine-processed trees from draft synthesis v2. These inputs have redundantly-mapped, unmapped, and nesting mapping pruned.
As before, we think that we can place any "taxonomy-only" terminal taxa as a last step in the pipeline. So an early pre-processing step is to produce a much reduce taxonomy (it is about 1% of the total number of tips) by pruning off any tips that are not represented by any phylogenetic input).
more below
(those pruned from step 2)
Just because it easier to navigate and serve a tree...
Following this page:
-
T
will be the taxonomy. -
t
will be a input tree. -
L(t)
is the label set of the treet
-
L(t, n)
for a noden
will be the label set of the subtree oft
that descends fromn
-
L(M)
for a nodeM
in the taxonomy will be the labels under this node. -
E(t)
is the set of edges in treet
-
I
is the set of input trees
self-explanatory.
Here we map each node of an input to a node of the taxonomy and we create "path pairings".
The "NodePair" mapping of source tree node n
to taxonomic node M
is accomplished by:
- if
n
is a leaf, thenL(t,n)
is guaranteed to be a taxon. SoM
is defined to be the node of the taxonomy (not necessarily a leaf) for whichL(t,n) = L(M)
. (Note we'll need a rule for handling Gregg's paradox; with some extra bookeepping to make sure that we realize that many artifactual nodes-of-outdegree=1 are produces by pruning in step 2. Using the taxonomic node furthest to the root in the case of multiple nodes that match this definition should work.) - if
n
is not a leaf, then theM
is the least-inclusive taxon (LIT) which includes all members ofL(t,n)
.
This implies that every edge in the input tree can be mapped to a path in the taxonomy. The data structure that stores this mapping is a "PathPair":
class PathPair {
Node * inputTreeChildNode;
Node * inputTreeParentNode;
Node * taxoChildNode;
Node * taxoParentNode;
bool edgeIsTrivialInInputTree; // true if inputTreeChildNode is a tip
};
Note that inputTreeParentNode
will always be the parent of inputTreeChildNode
(because in the source tree, the "path"
is just an edge).
taxoChildNode
may be identical to taxoParentNode
. Or it can be a descendant of taxoParentNode
. Whenever the 2 nodes
are identical, you have a loop in the mapping which is indicative of an input tree providing
information about how to resolve the polytomy.
Each node will store a mapping of input tree identity to the list of nodes in that input that map to the tree.
Similarly each edge in the taxonomy stores a mapping of source tree to references to any PathPair instances that cross the edge.
Step 3C: post-order sweep over the taxonomy making supertree decisions and building up the phylogenetic graph of life.
The steps here are:
-
Is this taxonomic node contested? This can be determined quickly. A taxonomic node is contested if any single source tree has:
- more than one PathPair that cross the node's subtending edge AND
- the set of source tree parent nodes in the these paths includes multiple nodes (this second condition assures that a polytomy including multiple nodes from the taxon and a node from an external group is not considered to be contesting of the node. Such polytomies can be resolved in a manner consistent with the taxonomy, so they should not be considered as conflict).
-
If the node is contested, then it should be "collapsed":
- switching all references to the node in "NodePairs" or "PathPairs" to make them refer to its' parent instead.
- add a pseudo source tree for the taxon as monophyletic (so that if the full resolution of the parent allows for monophyly [despite the fact that the grouping is contested by at least one tree], the taxonomic tree can still exert its preference for monophyly.)
-
If the node is not contested, then it should be resolved. We plan on doing this by greeding construction of a phylogenetic graph (described in the other pipeline), but with an additional "map deeper" step.
Whenever we resolve a node during the postorder traversal we are greedily resolving it subtree. To make the phylogenetic statements from trees that pass through the node be more comparable, we can also remap the lineages passing through the node such that they are mapped to the lowest branch in the subgraph that they traverse. This could correspond to mapping the tip to the taxon that was just accepted; or it might entail (for some graphs with multiple roots) noting them as terminating in an arbitarily named non-taxonmic node.
Inputs:
highest #0: ((A1,B1),C1);
next #1: ((A2,C2),B2);
taxonomy #2: ((A1,A2)A,(B1,B2)B,(C1,C2)C);
The internals of each tree would map to the root of the taxonomy.
So none of the taxanomic groups would be contested.
Since each taxon just has 2 members it is full resolved. So, after iterating over the 3 non-root internals, the tree would look like the taxonomy. But note that the "map deeper" step would remap the edges passing through these clades such that the first two inputs are effectively:
remapped highest #0: ((A,B),C);
remapped next #1: ((A,C),B);
At the root, there will be 2 loops. applying them in order favors the A+B grouping. Thus the final tree would be (((A1,A2)A,(B1,B2)B),(C1,C2)C)
Inputs:
highest #0: ((A1,B1),C1);
next #1: ((A2,C2),B2);
third #2: (((A1,A3),B2),A2);
taxonomy #3: ((A1,A2,A3)A,(B1,B2,C3)B,(C1,C2,C3)C);
All of internals other than (A1,A3)
of tree #2, would map to the root of the taxonomy. The node (A1,A3)
maps to A
in the taxonomy.
Thus A
is contested (there will be 2 paths from tree #2 that pass through A
).
Since each taxon just has 2 members it is full resolved. So, after iterating over the 3 non-root internals, the tree would look like the taxonomy. But note that the "map deeper" step would remap the edges passing through these clades such that the first two inputs are effectively:
remapped highest #0: ((A,B),C);
remapped next #1: ((A,C),B);
At the root, there will be 2 loops. applying them in order favors the A+B grouping. Thus the final tree would be (((A1,A2)A,(B1,B2)B),(C1,C2)C)