Skip to content

phylogenetic graphs

Mark T. Holder edited this page Feb 14, 2015 · 12 revisions

Phylogenetic graph for the finding a supertree based on ranks

See supertrees using tree ranking page for some broader discussion of the role of a phylogenetic network. This page discusses how to construct a "phylogenetic graph" or "phylogenetic network" used in step 5 "build a phylogenetic graph for each subproblem" of a pipeline for producing a supertree is here..

input

The input is a ranked set of trees with a full tree from the taxonomy as the lowest ranked input.

Comments on mapping of tips

There is a proposal to remap tips to the deepest uncontested taxon that they examplify. So tips of the tree may be associated with higher taxa.

This page assumes that a this proposal or something similar is adpoted to prune any tip that was mapped by a curator to a higher taxon which is contested.

Thus we assume that, at this point in the pipeline, the inputs that are higher taxa are uncontested higher taxa.

preprocessing step: replace higher taxa with polytomies of terminal taxa.

Because we know that each of the tips mapped to a higher taxon is a case of a taxon that is going to be in the final answer, we can unproblematically create a set of inputs with no complexity of mapping to taxonomic groups by replacing such labels with polytomies include the terminal taxa. In fact, (because we know that the last input will bring in all of the leaves), it is safe to remap each leaf of the higher ranked trees as follows.

  1. create a list of all of the tips in the phylo inputs that are non-terminal taxa.
  2. partially sort the list such that, if any of these higher taxa are nested (based on the taxonomy) the shallower taxa come first.
  3. walk through this list remapping by: * A. find the union of terminal taxa used in the phylo (non taxonomy) trees. * B. if the set from A is empty, then choose one terminal taxon (arbitrarily) to stand in for the higher taxon * C. if the set from A is not empty, replace the tip with a polytomy of the set of terminals, but make an annotation on the branch that will indicate that this tree is not providing support for this grouping. Later when support for different branches are tabulated these expansions should not be included because thes are cases in which we use the taxonomy to assume that the group is monophyletic, even though the phylo input just has one tip.

the goal

Imagine that the score for a supertree, s, is a tuple:

( |c(s, N)| , |c(s, N-1)|, |c(s, N-2)|, ... |c(s, 0)| )  

where c(s,N) is the set of groups in the highest ranked input tree N, that are displayed on the supertree. |c(s,x)| indicates the size of the sets.

We would like to take these inputs and find the set of trees that has the highest score attainable if scores are computed lexicographically. So it is better to display "4 groupings from tree 4, and none from trees 0-3" than it is to display "3 groupings from tree 4, and all of the groupings from the lower ranked inputs".

The set of trees may be very large, so we would like to have a compact structure that represents the entire set. As discussed here, MTH conjectures that a phylogenetic graph will be able to accomplish this. This document will try to describe the rules for such graphs.

The summary of the supertree strategy is: walk through ranked list of trees. For each tree, walk through each grouping. For each grouping decide if it is compatible with the current phylogenetic graph or not. If it is, update the graph. If the information from the grouping is not redundant with previously constructed information, then this will change the structure of the graph. At minimum it will entail making note of the association between the graph edge and the input edge.

If the grouping is not compatible, make note of that split as an undisplayed split, but do not alter the graph.

Properties

  1. The edges are directed.
  2. there are no directed cycles.
  3. There are 2 sorts of adjectives types of edges: looped vs unlooped. dangling vs rooted. So there are four types of edges. All edges to terminals are unlooped. * unlooped edge type: These are the edges that are most like a directed edge in a phylogenetic tree. If the set of trees can be displayed in a consensus tree, then each edge makes a group of phylogenetic statements on the full set of leaves. These edges make similar statements about all of the leaves in their connected component of the graph. * looped edges: these edges create "bubbles" in the graph. They can be used to make phylogenetic statements about a subset of the full set of leaves. * dangling edges: if a connected component of the graph has multiple roots then each edge that whose ancestors include only a proper subset of all of roots is considered to be a dangling edge. In some sense, a dangling edge is a type of looped edge.

Every node in the graph has a unique set of taxa descended from it with the exception of the nodes created to close the ends of a loops. These nodes may have a set of descendants that is identical to their child node.

Some properties of these graphs when they are used in the suggested pipeline.

  1. because of the tree ranking, the graph will be start as a phylogenetic tree identical to the highest ranked tree.

Example

Consider an example with 3 ranked inputs. The highest priority tree is:

tree1

The second ranked tree is:

tree1

and the tree with the lowest priority is:

tree1

These inputs are all pairwise compatible and jointly compatible. After adding the first input, the phylogenetic graph would be identical to that input tree. After the second input is added, the graph would be:

tree1

with the looped edges shown in red and the red nodes being the degenerate nodes at the end the loop.

interpreting loops

The loop is a statement of partial relative placement of some of the taxa. If we were trying to simply add E (from tree 2) to the top tree, it would be unclear where on the path from the MRCA of A+F to the MRCA of A+B we should attach E. In fact E could be on a "side-branch" - such as sister to D and the resulting tree would still satisfy both trees.

Each "side" of the loop make some statements (e.g. C is closer to A than D is), but these statements can be mixed and matched with the statements on the other "side" of the loop to generate final trees that satisfy the phylogenetic statement made on both branches.

Note that a polytomy of E, C, and D would not correctly capture the statement: "C is closer to A than D is".

Also note that this graph makes some statements that are not in either input, but must be true of any solution that satisfies both trees. For example, A+E form a group that excludes G. Thus we see that the structure can detect some incompatibilities with subsequent inputs that would not be found if we just scanned all input trees for clades that conflict with a grouping.

Note that for the nodes inside the loop (but not the degenerate nodes at the ends), the complete set of descendant taxa in the final graph is not known. For example the node that is the parent of E must have E, B, and A as descendants (or the node could later be merged with another node in the graph, but the resulting node will still have these descendants). Similarly the "outgroup" taxa (F and G) must be outside of the loop must not be descendants of that node. But subsequently the loop may be close by another tree; at that point the node may be merged with the parent of C, or the parent of D; or the node may be placed along any of the edges in the "C + D side" of the loop. So we do not know whether the node that is currently the parent of E will have C or D as descendants. The graph structure (and rules to be described later) will guarantee that, in the final graph, if the MRCA of E and A is an ancestor of D, then it will also be an ancestor of C.

Clone this wiki locally