supertree via ranked trees on a taxo scaffold

This page describes an alternative supertree method summarized by:

taxonomic groups that are not contested by any input must be in the final supertree,
after honoring that constraint, we attempt to find supertree which maximizes the "max weight of input tree edges displayed" objective function using tree-based ranks to supply the weights.

This doc builds off of the supertrees using tree ranking doc; so that is considered background reading.

Motivation

It seems that procedure for finding an optimal to the "max weight of input tree edges displayed" criterion may not scale well even when using our crude tree-based ranks.

MTH recently realized that:

the set of optimal phylogenetic graphs (during the step of adding each new tree) may require the use of a set of phylogenetic graphs rather than just one. (each member of the set would need to be built upon until it was obvious that it was sub-optimal),
the presence of an taxonomic group that is not contested by any input is not sufficient to guarantee that the group will be present the final trees. This complicates the requirements for the "divide" step of the divide-and-conquer method. Note that Rod Page's modified mincut supertrees algorithm (pdf here) modified the original mincut approach (Charles Semple and Mike Steel 2000 pdf here) to retain uncontested groups when feasible. Here were are assuring that the set of uncontentested groups that we constrain to be present are jointly compatible because they all come from one tree (the phylogenetic interpretation of the taxonomy).
the order which you choose to add group from one tree affects the output graph. So an exact method would need to follow up each possibility.

Justification

A high priority desirable property for a method for producing the final tree is that it should make it easy to explain why groups are displayed and how someone from the community of systematic biologists could fix a spuriuos grouping by correcting an input tree or adding another input tree.

Making the constraint that "taxonomic groups that are not contested by any input must be in the final supertree":

is very easy to explain,
makes an obvious fix apparent (if you know genus X is not monophyletic, please add a tree that you know of which contests its monophyly).
makes the construction of the tree easier (by making it easy to break the problem into several subproblems; working on the set of inputs that was used for the open tree draft version 2 tree, there are 6354 such problems that have more than one input).

Sketch of an algorithm/pipeline

step 0: grab the input trees, pruned taxonomy, and tree ranking

The inputs are the same as the other pipeline.

step 1: prune unmapped leaves from the input trees.

Any tip not mapped to OTT cannot be used (as before). The method described below may not require the pruning of repeated or nested labels. However, nested taxa being treated as tips of the same tree should almost certainly be treated as a curation error.

short term: continure starting with our treemachine-processed trees from draft synthesis v2. These inputs have redundantly-mapped, unmapped, and nesting mapping pruned.

step 2: prune singleton leaves from the taxonomy.

As before, we think that we can place any "taxonomy-only" terminal taxa as a last step in the pipeline. So an early pre-processing step is to produce a much reduce taxonomy (it is about 1% of the total number of tips) by pruning off any tips that are not represented by any phylogenetic input).

step 3: build a the phylogenetic graph of life

more below

step 4: graft on the taxonomic singletons

(those pruned from step 2)

step 5: produce a representative tree from the graph

Just because it easier to navigate and serve a tree...

Step 3: construction of the tree

Following this page:

T will be the taxonomy.
t will be a input tree.
L(t) is the label set of the tree t
L(t, n) for a node n will be the label set of the subtree of t that descends from n
L(M) for a node M in the taxonomy will be the labels under this node.
E(t) is the set of edges in tree t
I is the set of input trees

Step 3A: read in the taxonomy

self-explanatory.

Step 3B: read in each input tree and "thread" the tree onto the taxonomic scaffold.

Here we map each node of an input to a node of the taxonomy and we create "path pairings".

The "NodePair" mapping of source tree node n to taxonomic node M is accomplished by:

if n is a leaf, then L(t,n) is guaranteed to be a taxon. So M is defined to be the node of the taxonomy (not necessarily a leaf) for which L(t,n) = L(M). (Note we'll need a rule for handling Gregg's paradox; with some extra bookeepping to make sure that we realize that many artifactual nodes-of-outdegree=1 are produces by pruning in step 2. Using the taxonomic node furthest to the root in the case of multiple nodes that match this definition should work.)
if n is not a leaf, then the M is the least-inclusive taxon (LIT) which includes all members of L(t,n).

This implies that every edge in the input tree can be mapped to a path in the taxonomy. The data structure that stores this mapping is a "PathPair":

class PathPair {
    Node * inputTreeChildNode;
    Node * inputTreeParentNode;
    Node * taxoChildNode;
    Node * taxoParentNode;
    bool edgeIsTrivialInInputTree; // true if inputTreeChildNode is a tip 
};

Note that inputTreeParentNode will always be the parent of inputTreeChildNode (because in the source tree, the "path" is just an edge).

taxoChildNode may be identical to taxoParentNode. Or it can be a descendant of taxoParentNode. Whenever the 2 nodes are identical, you have a loop in the mapping which is indicative of an input tree providing information about how to resolve the polytomy.

Each node will store a mapping of input tree identity to the list of nodes in that input that map to the tree.

Similarly each edge in the taxonomy stores a mapping of source tree to references to any PathPair instances that cross the edge.

Step 3C: post-order sweep over the taxonomy making supertree decisions and building up the phylogenetic graph of life.

The steps here are:

Is this taxonomic node contested? This can be determined quickly. A taxonomic node is contested if any single source tree has:
- more than one PathPair that cross the node's subtending edge AND
- the set of source tree parent nodes in the these paths includes multiple nodes (this second condition assures that a polytomy including multiple nodes from the taxon and a node from an external group is not considered to be contesting of the node. Such polytomies can be resolved in a manner consistent with the taxonomy, so they should not be considered as conflict).
If the node is contested, then it should be "collapsed":
1. switching all references to the node in "NodePairs" or "PathPairs" to make them refer to its' parent instead.
2. add a pseudo source tree for the taxon as monophyletic (so that if the full resolution of the parent allows for monophyly [despite the fact that the grouping is contested by at least one tree], the taxonomic tree can still exert its preference for monophyly.)
If the node is not contested, then it should be resolved. We plan on doing this by greeding construction of a phylogenetic graph (described in the other pipeline), but with an additional "map deeper" step.

"map deeper" step

Whenever we resolve a node during the postorder traversal we are greedily resolving it subtree. To make the phylogenetic statements from trees that pass through the node be more comparable, we can also remap the lineages passing through the node such that they are mapped to the lowest branch in the subgraph that they traverse. This could correspond to mapping the tip to the taxon that was just accepted; or it might entail (for some graphs with multiple roots) noting them as terminating in an arbitarily named non-taxonmic node.

An Example:

Inputs:

highest #0: ((A1,B1),C1);
next #1:  ((A2,C2),B2);
taxonomy #2: ((A1,A2)A,(B1,B2)B,(C1,C2)C);

The internals of each tree would map to the root of the taxonomy.

So none of the taxanomic groups would be contested.

Since each taxon just has 2 members it is full resolved. So, after iterating over the 3 non-root internals, the tree would look like the taxonomy. But note that the "map deeper" step would remap the edges passing through these clades such that the first two inputs are effectively:

remapped highest #0: ((A,B),C);
remapped next #1:  ((A,C),B);

At the root, there will be 2 loops. applying them in order favors the A+B grouping. Thus the final tree would be (((A1,A2)A,(B1,B2)B),(C1,C2)C)

Example: