Skip to content

Latest commit

 

History

History
618 lines (440 loc) · 26.1 KB

README.md

File metadata and controls

618 lines (440 loc) · 26.1 KB

otcetera - phylogenetic file format parser in C++

Build Status

otcetera owes a lot of code and ideas to Paul Lewis' Nexus Class Library. See http://hydrodictyon.eeb.uconn.edu/ncl/ and https://github.com/mtholder/ncl

Some set comparisons (in util.h) were based on http://stackoverflow.com/posts/1964252/revisions by http://stackoverflow.com/users/127669/graphics-noob

The gitversion trick for the otc-version-reporter is from http://stackoverflow.com/questions/6526451/how-to-include-git-commit-number-into-a-c-executable

https://peerj.com/preprints/2538/ describes some of the tools that are a part of otcetera.

Installation

The instructions below contain all of the gory detail. There are a few quirks with OS X installation. See Short OSX instructions for an overview of the process on OS X.

prerequisites: overview

Compiler

Otcetera requires a C++20 compiler.

Build tools: meson, cmake, and ninja

To build otcetera, we need the build tools

  • meson
  • ninja
  • cmake (to build the restbed library)

HTTP library: restbed

We are using the Restbed framework to implement web services for the tree of life. By default, otcetera will not compile the web services if it can't find restbed.

Logging library: g3log

Otcetera now requires the logging library g3log.

Testing: requests

The python requests package is need for running the ninja test target because it runs tests in the ws subdirectory.

prerequisites: quick start

Mac

On a Mac, you can install dependencies with:

brew install meson cmake ninja boost
pip install requests

Linux

On recent versions of Debian or Ubuntu Linux, you can run:

sudo apt-get install meson cmake ninja-build libboost-all-dev libcurl4-openssl-dev

Meson (alternate)

If you don't have version >= 0.60 of meson, you can install it in a virtualenv

# Install meson in a virtualenv
python3 -m venv meson
source meson/bin/activate
pip3 install meson ninja

On windows, you can install meson using the MSI installer on the releases page.

Meson's installation instructions give more detail.

configuration + building: quick start

After installing prerequisites, try the following commands to build restbed and then otcetera under the directory $HOME/Applications/OpenTree/.

# Fetch source
OPENTREE=$HOME/Applications/OpenTree
mkdir -p $OPENTREE/restbed
cd $OPENTREE/restbed
git clone --recursive https://github.com/corvusoft/restbed.git
mkdir -p $OPENTREE/otcetera
cd $OPENTREE/otcetera
git clone https://github.com/OpenTreeOfLife/otcetera.git

# On Mac, check that we are using homebrew ssl in /usr/local/opt/openssl, not system ssl!
echo "CPPFLAGS=${CPPFLAGS}"
echo "LDFLAGS=${LDFLAGS}"

# Build g3log
git clone https://github.com/KjellKod/g3log.git
mkdir g3log/build
(cd g3log/build
 cmake .. -G Ninja -DUSE_DYNAMIC_LOGGING_LEVELS=ON -DCMAKE_INSTALL_PREFIX=/usr -DCPACK_PACKAGE_FILE_NAME=g3log
 nice -n10 ninja package
 sudo dpkg -i g3log.deb)

# Build restbed
alias ninja='nice -n10 ninja'
cd $OPENTREE/restbed
mkdir restbed/build       
cd restbed/build        # Go to $OPENTREE/restbed/restbed/build
cmake .. -G Ninja -DBUILD_SSL=NO -DCMAKE_INSTALL_PREFIX="$OPENTREE/local" -DCMAKE_POSITION_INDEPENDENT_CODE=ON
ninja install

# Make restbed library available too.
export CPPFLAGS="-I${OPENTREE}/local/include $CPPFLAGS"
export LDFLAGS="-L${OPENTREE}/local/library $LDFLAGS"
echo "CPPFLAGS=${CPPFLAGS}"
echo "LDFLAGS=${LDFLAGS}"
# Mac ignores LD_LIBRARY_PATH and doesn't need it, but linux needs it.
export LD_LIBRARY_PATH=${OPENTREE}/local/library

# Build otcetera
cd $OPENTREE/otcetera
meson build otcetera --prefix=$OPENTREE/local
ninja -C build install
ninja -C build test

If the meson tests fail, then examine the logs

less $OPENTREE/otctera/build/meson-logs/testlog.txt

Finally, add the bin directory to your $PATH:

export PATH=$PATH:$OPENTREE/local/bin

Documentation

A LaTeX documentation file is ./doc/summarizing-taxonomy-plus-trees.tex periodically, that is compiled and posted. The currently URL for that compiled documentation is http://phylo.bio.ku.edu/ot/summarizing-taxonomy-plus-trees.pdf

Usage

See the supertree/README.md for instructions on using otcetera to build a supertree (work in progress).

Common command line flags

The tools use the same (OTCLI) class to process command line arguments. This provides the following command line flags:

  • -h for help
  • -fFILE to treat every line of FILE as if it were a command line argument (useful for processing hundreds of filenames)
  • -v for verbose output
  • -q for quieter than normal output
  • -t for trace level (extremely verbose) output

Unless otherwise stated:

  1. the command line tools that need a tree take a filepath to a newick tree file. The numeric suffix of each label in the tree is taken to be the OTT id. This accommodates the name munging that some of the open tree of life tools perform on taxonomic names with special characters (because only the OTT id is used to associate labels in different trees)
  2. a full supertree tree and taxonomy tree have the same leaf set in terms of OTT ids

Config file

You may optionally initialize the global config file. The filepath for the config file can be set using the OTC_CONFIG environmental variable. If that is not set, the default path is ~/.opentree for the config file. The config can hold the location of the OpenTree Taxonomy (OTT). Currently the only use of this file in otcetera is to avoid specifying the taxonomy argument on the command-line to a few commands.

The config file should contain the [opentree] section with a definition for the variable ott:

[opentree]
home = /home/USER/OpenTree
...
ott = %(home)s/ott/ott2.9draft12/
...

You can optionally define a variable such as home to point to the parent directory. Then you can reference that directory by writing %(home)s in other variables in the same section.

Tools for checking a supertree against inputs

Checking for nodes in a supertree that have no tree supporting them

otc-check-supertree taxonomy.tre synth.tre inp1.tre inp2.tre ...

will report any nodes in synth.tre that are not named and which do not have any ITEB support

The taxonomy is just used for the ottID validation (on the assumption that the nodes supported by the taxonomy and the the otcchecktaxonomicnodes tool can help identify problems with those nodes).

Note that this check identifies a nodes that could be collapsed to produce a "minimal" tree (sensu Semple, 2003). As discussed in section 1.3 "Trees without unsupported groups" of the docs, if you want to get rid of such groups, then you should remove them one at a time and rerun the check. Otherwise you may cause a supertree to display fewer of the input clusters.

Checking for incorrect internal labels in a full tree

If you add a -x argument to the invocation, then the program will act like the taxonomy is also a source of support for nodes. Furthermore a report on problems with taxonomic labels in synth.tre will be reported before the summary.

Using both -x and -r will create the taxonomic report, but then clear the taxonomic support stats before analyzing the subsequent inputs. So the final summary should be equivalent to what you get by dropping both the -x and -r args.

otc-check-supertree -x -d taxonomy.tre synth.tre

will check every labelled internal node is correctly labelled. To do this, it verifies that the set of OTT ids associated with tips that descend from the node is identical to the set of OTT ids associated with terminal taxa below the corresponding node in the taxonomic tree.

otc-taxon-conflict-report takes at least 2 newick file paths: a full tree, and some number of input trees. It will write a summary of the difference in taxonomic inclusion for nodes that are in conflict:

otc-taxon-conflict-report taxonomy.tre inp1.tre inp2.tre

Removal of otc-find-unsupported-nodes and otc-check-taxonomic-nodes

The functionality that was previously in otc-find-unsupported-nodes and otc-check-taxonomic-nodes is now implemented in otc-check-supertree. This new tool has the same interface as otc-find-unsupported-nodes but the -d option from otc-check-taxonomic-nodes was also added. Thus otc-check-taxonomic-nodes is no longer necessary, and the name of the tool was changed to reflect its broader set of checks.

Counting input tree groups displayed by a full tree

The otc-displayed-stats analyzes the nodes of the inputs in the context of a summary tree.

otc-displayed-stats -x taxonomy.tre synth.tre inp1.tre inp2.tre ...

writes tab-separated output. The first column is the number of non-redundant input tree nodes that are displayed by the synth.tre. This vector of groupings displayed directly corresponds to "Weighted Input Phylogenetic Statements Displayed" described in the documentation. If you were to assign each tree a weight (based on its rank), you could then calculate a score by multiplying the tree weight by the number displayed in the first column of the output.

Or you could simply view the score of the synth.tre to be a vector of numbers that corresponds to this first column of output. The goal of they Open Tree of Life supertree operation is to maximize this score (in a lexicographic ordering with the first tree being the most significant) while introducing no unsupported groups.

The -x flag above tells the tool to treat the taxonomy as the last input. (if this is lacking the taxonomy is only used for validation of OTT identifiers in the other trees.

Explanation of the otc-displayed-stats output

The full output of the otc-displayed-stats is explained below. Each row of the output reports the number of internal nodes of the input tree that fall into each category. The two "axes" that the statistics explore are support and out-degree.

Columns starting with "F" are "forking" internal nodes with out-degree > 1. Columns starting with "R" are "redundant" internal nodes with out-degree = 1. A "D" suffix to a column header means that the node is displayed by the summary tree. A "CR" suffix means that the node is could resolve a polytomy in the summary tree (so the summary tree is not unambiguously in conflict in the node). An "I" suffix to a column header means that the node is incompatible with every resolution ofthe summary tree.

For the redundant nodes, the report indicates the conflict status of their closest non-redundant descendant. A redundant node can also be marked "T" (for "trivial")if it is an ancestor of only 1 leaf or of the root.

The "F" and "R" column are just the sums for forking and redundant entries.

The "label" shows the tree name or "Total of # trees" for the global sum The ordering of the rows is the input order. The final row shows the totals. . For columns, the order is: FD FCR FI F RD RCR RI RT R label.

Checking for additional splits that could be added

otc-find-resolution taxonomy.tre synth.tre tree1.tre tree2.tre ...

will look for groups in the input trees (tree1.tre, tree2.tre...) which could resolve polytomies in synth.tre. taxonomy.tre is used for label validation and expanding any tips in input trees that are mapped to non-terminal taxa.

Tools used in the supertree pipeline

expanding tips mapped to higher taxa and pruning the taxonomy

otc-nonterminals-to-exemplars takes an -e flag specifying an export diretory and at least 2 newick file paths: a full taxonomy tree some number of input trees. Any tip in non-taxonomic input that is mapped to non-terminal taoxn will be remapped such that the parent of the non-terminal tip will hold all of the expanded exemplars. The exemplars will be the union of tips that (a) occur below this non-terminal taxon in the taxonomy and (b) occur, or are used as an exemplar, in another input tree. The modified version of each input will be written in the export directory. Trees with no non-terminal tips should be unaltered. The taxonomy written out will be the taxonomy restricted to the set of leaves that are leaves of the exported trees:

otc-nonterminals-to-exemplars -estep_5 taxonomy.tre inp1.tre inp2.tre ...

This is intended to perform steps 2.5 and 2.6 of the supertree pipeline mentioned in the doc subdirectory.

pruning a taxonomy

otc-prune-taxonomy taxonomy.tre inp1.tre inp2.tre ...

will write (to stdout) a newick version of the taxonomy that has been pruned to not include subtrees that do not include any of the tips of the input trees. See supertree/step-2-pruned-taxonomy/README.md for a more precise description of the pruning rules. This is intended to be used in the ranked tree supertree pipeline,

decomposing set of trees into subproblems based on uncontested taxa

otc-uncontested-decompose -eEXPORT taxonomy.tre -ftree-list.txt

will create subproblems in the (existing) subdirectory EXPORT using the taxonomy.tre as the taxonomy and every tree listed in tree-list.txt. (each line of that file) should be an input tree filepath. Each output will have:

  • a name that corresponds to the OTT taxon,
  • the trees pruned down for each subproblem (in the) same order as the trees were provided in the invocation, and
  • a corresponding ott###-tree-names.txt file that list the input filenames for each tree (or "TAXONOMY" for taxonomy, which will always be the last tree).

NOTE: phylogenetic tips mapped to internal labels in the taxonomy will be pruned if the taxon is contested. This is probably not what one usually wants to do...

supertree using the decomposition and a greedy subproblem solver

otc-scaffolded-supertree is incomplete. If completed it will produces a supertree of the its inputs.

Solving a subproblem

otc-solve-subproblem subproblem.tre
otc-solve-subproblem tree1.tre tree2.tre taxonomy.tre

This will construct a synthesis tree and write it out in newick format. Here subproblem.tre contains a list of newick trees ending in the taxonomy. If more than one tree file is supplied, the trees are concatenated to form a single subproblem. Earlier trees are ranking higher.

The current solution algorithm attempts to add splits one-at-a-time, checking to see whether the split set is consistent using the BUILD algorithm.

Non-terminal taxa in the input trees are allowed if they occur in the taxonomy. Each terminal taxon contained in the non-terminal taxon is attached to the parent of the non-terminal taxon. The non-terminal taxon is them removed. This behavior can be changed to reject non-terminal taxa with

  • -ifalse for rejecting non-terminal taxa

Flags allow running the solver on non-standard input.

  • -ofalse for handling tree files without OTT ids
  • -T for handling subproblems without a taxonomy.
  • -S writes out a standardized subproblem instead of running a solver.

Miscellaneous tree manipulations and tree statistics

Calculating stats for the subproblems

This works on the outputs of otc-uncontested-decompose. Running:

otc-subproblem-stats *.tre > stats.tsv

Will create a tab-separated file of stats for the subproblems. As of 5, May 2015, the columns of the report are:

  • Subproblem name
  • InSp = # of informative (nontrivial) splits
  • LSS = size of the leaf label set
  • ILSS = size of the set of labels included in at least one "ingroup"
  • NT = The number of trees.
  • TreeSummaryName = tree index or summary name where the summary name can be Phylo-only or Total. "Total" summarizes info all trees in the file (including the taxonomy). "Phylo-only" former summarizes all of the phylogenetic inputs.

Use the -h option to see an explanation of the columns if they differ from this list.

Grafting subproblems back together

This works on the outputs of otc-solve-subproblem. Running

	otc-graft-solutions ott*-solution.tre > grafted_solution.tre

or

	cat ott*-solution.tre > solutions.tre
	otc-graft-solutions solutions.tre > grafted_solution.tre

will produce a newick tree file containing the grafted solution.

If the sub-problems do not connect into a single component, the program will exit with error code 1. The program will write multiple trees, where each tree is a connected component whose root is not found in the other trees.

The -n argument can be used to name the root if desired:

	otc-graft-solutions solutions.tre -nlife > grafted_solution.tre

Unpruning the grafted solution

This tool takes the grafted solution and re-attaches leaves that were pruned

	otc-unprune-solution grafted_solution.tre cleaned_ott.tre > full_supertree.tre

The first argument is the grafted solution. This is a solution on a reduced taxon set.

The second argument is a full (cleaned) taxonomy. This contains leaves that have been pruned.

In order for this tool to work, the grafted solution must have internal nodes corresponding to the taxonomy labelled with their OTT Ids. Currently the generation of subproblems, solution of subproblems, and grafting of solutions preserve these labels.

Typically, many leaves on the grafted solution are internal nodes in the full taxonomy. In this case, the leaves in the grafted solution are expanded to match the taxonomy.

Since many nodes in the taxonomy may have out-degree 1, unpruning involves re-inserting such nodes into the grafted solution to form the full supertree.

In theory, one could use the sub-problem solver to unprune, if the sub-problem solver would handle taxonomy nodes with out-degree 1.

Naming unnamed nodes

This tool takes a series of trees, names the unnamed nodes, and writes out the resulting trees:

otc-name-unnamed-nodes tree1.tre > tree1-named.tre

It is assumed that monotypic nodes always have OTT Ids, and are therefore named. Names for unnamed nodes are of the form mrca-ottX-ottY. To find X and Y in a unique, repeatable way, each node in the tree is annotated with the OTT Id of the smallest leaf in the include group for that node. X and Y are then the annotations of the child nodes with the smallest, and second-smallest annotations, respectively.

Creating annotations for the synthesis tree

This tool takes a series of newick trees: a full supertree, and some number of input trees.

otc-annotate-synth super.tre inp1.tre inp2.tree ...

It outputs a JSON document with fields describing relationships between the input tree edges and the supertree. Relationships include conflict, support, etc and are described in the OpenTree v3 conflict API.

Renaming tree nodes

This tool takes a Newick tree and writes out a relabelled tree.

otc-relabel-tree in.tre --format-tax="%N ott%I" --taxonomy=<ott-dir> --del-monotypic > out.tre

Format codes are given in otc-relabel-tree -h. It is also possible to relabel non-taxonomy nodes, but without refering to taxonomy fields.

It is possible to avoid specifying the taxonomy, if the the file ~/.opentree contains a config file specifying the location of OTT.

getting the full distribution of out-degree counts for a tree

otc-degree-distribution sometree.tre

will write out a tab-separated pair of columns of "out degree" and "count" that shows how many nodes in the tree tree have each outdegree (0 are leaves. 1 are redundant nodes. 2 are fully resolved internals...)

counting the number of polytomies in a tree

otc-polytomy-count sometree.tre

will write out the number of nodes with out degree greater than 2 to stdout. This is just a summary of the info reported by otcdegreedistribution.

Untested

counting the number of leaves in a tree

otc-count-leaves takes a filepath to a newick file and reports the number of leaves:

otc-count-leaves sometree.tre

Detecting contested taxa

otc-detect-contested takes at least 2 newick file paths: a full taxonomy tree, and some number of input trees. It will print out the OTT IDs of clades in the taxonomy whose monophyly is questioned by at least one input:

otc-detect-contested taxonomy.tre inp1.tre inp2.tre

Get an induced subtree

otc-induced-subtree takes at least 2 newick file paths: a full tree, and some number of input trees. It will print a newick representation of the topology of the first tree if it is pruned down to the leafset of the inputs (without removing internal nodes):

otc-induced-subtree taxonomy.tre inp1.tre

Untested

Extract a subtree from a larger tree

otc-prune-to-subtree: Reads a large tree and takes a set of OTT Ids. It finds the MRCA of the OTT Ids, and writes the subtree for that MRCA as newick. The flag preceding the comma-separated list of IDs indicates whether the user want the subtree for the MRCA node (-n flag), its parent(-p flag), each of its children (-c flag and writing one line per child), or each of its siblings (-s flag and writing one line per sib):

otc-prune-to-subtree -p5315,3512 some.tre
otc-prune-to-subtree -n5315,3512 some.tre
otc-prune-to-subtree -c5315,3512 some.tre
otc-prune-to-subtree -s5315,3512 some.tre

Untested

Find distance between a supertree and the input trees

otc-disance takes at least 2 newick file paths: a supertree, and some number of input trees. It will print the Robinson-Foulds symmetric difference between the induced tree from the full tree to each input tree (one RF distance per line), or the number of groupings in each input tree that are either displayed or not displayed by the supertree

otc-distance -r taxonomy.tre inp1.tre inp2.tre

Note the otc-missing-splits script reports just the splits in the induced tree that are missing from the subsequent trees. Comparing this number to the RF would reveal the number of groupings that are missing from the induced tree but present in a subsequent tree. Thus, one can calculate "missing" and "extra" grouping counts from the output of both tools.

Untested

Suppress nodes of outdegree=1

otc-suppress-monotypic takes a filepath to a newick file and writes a newick without any nodes that have just one child:

otc-suppress-monotypic taxonomy.tre

Suppress nodes of outdegree=1

otc-suppress-monotypic takes a filepath to a newick file and writes a newick without any nodes that have just one child:

otc-suppress-monotypic taxonomy.tre

finding the union or intersection of the OTT Ids in a set of trees

otc-set-of-ids tree1.tre tree2.tre

will print out the union of OTT Ids in tree1 and tree2. The -i flag requests the intersection rather than the union. The -t flag requests that only the tips be considered. The -n flag requests that the output should be a newick tree (a polytomy) rather than a list.

Testing

otcetera is still very much under development. You can trigger the running of the tests by:

$ make
$ make check

(currently there are no tests in the make installcheck target). The data for running these tests is in the data subdirectory (but the tests are supposed to know how to find that data, so users do not need to know the location).

unit tests

Some of the operations have unit-tests. These tests are found in the test subdir. Successful execution of these tests results in a row of periods (one per test) appearing when the make check enters the test directory.

tools tests

Some of the executables in the tools subdirectory have tests. These are executed as a part of the normal make check target. The output of the tool can be check using text comparison or tree comparisons (to handle cases in which branch rotation might result in multiple valid outputs of the same operation). Some of the tests just check the exit code.

The syntax used to describe a new test is described in ../expected/README.md and the directories that describe the expected behavior are in the expected subdirectory.

ACKNOWLEDGEMENTS

See comments above about usage of nlohmann::json

To acknowledge the contributions of the NCL code and ideas, a snapshot of the NCL credits taken from the version of NCL used to jump start otcetera is:

As of March 09, 2012, NCL is available under a Simplified BSD license (see BSDLicense.txt) in addition to the GPL license.

NCL AUTHORS -- the author of the NEXUS Class Library (NCL) version 2.0 is

Paul O. Lewis, Ph.D. Department of Ecology and Evolutionary Biology The University of Connecticut 75 North Eagleville Road, Unit 3043 Storrs, CT 06269-3043 U.S.A.

WWW: http://lewis.eeb.uconn.edu/lewishome Email: [email protected]

Versions after 2.0 contain changes primarily made by: Mark T. Holder [email protected]

Other contributors to these versions include: Derrick Zwickl, Brian O'Meara, Brandon Chisham, François Michonneau, and Jeet Sukumaran

The code in examples/phylobase... was written by Brian O'Meara and Derrick Zwickl for phylobase.

David Suárez Pascal contributed SWIG bindings which heavily influenced those found in branches/v2.2. Thanks to David for blazing the way on the swig binding, Google for funding, and NESCent (in particular Hilmar Lapp) for getting the NESCent GSoC program going.

The 2010 GSoC effort also led to enhancements in terms of annotation storage and xml parsing which are currently on. Michael Elliot contributed some code to the branches/xml branch. Thanks to NESCent and Google for supporting that work.

Many of the files used for testing were provided by Arlin Stoltzfus (see http://www.molevol.org/camel/projects/nexus/ for more information), the Mesquite package, and from TreeBase (thanks, Bill Piel!).

Docker

See https://github.com/mtholder/dockerot for the beginnings of a Docker-based build system for otcetera.