supertree via ranked trees on a taxo scaffold

This page describes an alternative supertree method summarized by:

taxonomic groups that are not contested by any input must be in the final supertree,
after honoring that constraint, we attempt to find supertree which maximizes the "max weight of input tree edges displayed" objective function using tree-based ranks to supply the weights.

This doc builds off of the supertrees using tree ranking doc; so that is considered background reading.

Motivation

It seems that procedure for finding an optimal to the "max weight of input tree edges displayed" criterion may not scale well

even when using our crude tree-based ranks.

MTH recently realized that:

the set of optimal phylogenetic graphs (during the step of adding each new tree) may require the use of a set of phylogenetic graphs rather than just one. (each member of the set would need to be built upon until it was obvious that it was sub-optimal),
the presence of an taxonomic group that is not contested by any input is not sufficient to guarantee that the group will be present the final trees. This complicates the requirements for the "divide" step of the divide-and-conquer method.
the order which you choose to add group from one tree affects the output graph. So an exact method would need to follow up each possibility.

Justification

A high priority desirable property for a method for producing the final tree is that it should make it easy to explain why groups are displayed and how someone from the community of systematic biologists could fix a spuriuos grouping by correcting an input tree or adding another input tree.

Making the constraint that "taxonomic groups that are not contested by any input must be in the final supertree":

is very easy to explain,
makes an obvious fix apparent (if you know genus X is not monophyletic, please add a tree that you know of which contests its monophyly).
makes the construction of the tree easier.

Sketch of an algorithm/pipeline

step 0: grab the input trees, pruned taxonomy, and tree ranking

The inputs are the same as the other pipeline.

step 1: prune unmapped leaves from the input trees.

Any tip not mapped to OTT cannot be used (as before). The method described below may not require the pruning of repeated or nested labels. However, nested taxa being treated as tips of the same tree should almost certainly be treated as a curation error.

short term: continure starting with our treemachine-processed trees from draft synthesis v2. These inputs have redundantly-mapped, unmapped, and nesting mapping pruned.

step 2: prune singleton leaves from the taxonomy.

As before, we think that we can place any "taxonomy-only" terminal taxa as a last step in the pipeline. So an early pre-processing step is to produce a much reduce taxonomy (it is about 1% of the total number of tips) by pruning off any tips that are not represented by any phylogenetic input).

step 3: build a the phylogenetic graph of life

more below

step 4: graft on the taxonomic singletons

(those pruned from step 2)

step 5: produce a representative tree from the graph

Just because it easier to navigate and serve a tree...

Step 3: construction of the tree

Following this page:

T will be the taxonomy.
t will be a input tree.
L(t) is the label set of the tree t
L(t, n) for a node n will be the label set of the subtree of t that descends from n
L(M) for a node M in the taxonomy will be the labels under this node.
E(t) is the set of edges in tree t
I is the set of input trees

Step 3A: read in the taxonomy

self-explanatory.

Step 3B: read in each input tree and "thread" the tree onto the taxonomic scaffold.

Here we map each node of an input to a node of the taxonomy and we create "path pairings".

The mapping of source tree node n to taxonomic node M is accomplished by:

if n is a leaf, then L(t,n) is guaranteed to be a taxon. So M is defined to be the node of the taxonomy (not necessarily a leaf) for which L(t,n) = L(M). (Note we'll need a rule for handling Gregg's paradox; with some extra bookeepping to make sure that we realize that many artifactual nodes-of-outdegree=1 are produces by pruning in step 2. Using the taxonomic node furthest to the root in the case of multiple nodes that match this definition should work.)
if n is not a leaf, then the M is the least-inclusive taxon (LIT) which includes all members of L(t,n).

This implies that every edge in the input tree can be mapped to a path in the taxonomy. The data structure that stores this mapping is a "PathPair":

class PathPair {
    Node * inputTreeChildNode;
    Node * inputTreeParentNode;
    Node * taxoChildNode;
    Node * taxoParentNode;
    bool edgeIsTrivialInInputTree; // true if inputTreeChildNode is a tip 
};

Note that inputTreeParentNode will always be the parent of inputTreeChildNode (because in the source tree, the "path" is just an edge).

taxoChildNode may be identical to taxoParentNode

Provide feedback

Saved searches

Use saved searches to filter your results more quickly