Open Tree Taxonomy moved

!!! Do not edit this page - the content is now maintained in the reference-taxonomy repo wiki

This page describes the taxonomy used by the Open Tree of Life project.

Some of us were not fond of the name "OTToL," so I'm using a new acronym "OTT" (Open Tree taxonomy), but it's the same idea.

The taxonomy is a merge of SILVA, NCBI/Genbank Taxonomy, and the GBIF "nub" taxonomy. A list of requirements for potential new inputs into OTT appears below.

Download location

Version 2.3 release notes

Incorporates SILVA (minus plants, animals, fungi) and the taxonomy from study 713
Incorporates patches from the Interim taxonomy patch feature
More thorough synonym processing
The deprecated.tsv file has a column providing a replacement ID for cases where a newly known synonym caused an ID to be deprecated
Updated to latest version of NCBI taxonomy (GBIF hasn't released a new version yet)

Version 2.1 release notes

Updated to latest version of NCBI taxonomy and GBIF nub
Includes all taxa from NCBI, with "dubious" ones (unclassified, virus, etc.) marked with a "D" flag in the last column
Various improvements to the "unique names"
Synonyms file now contains "unique names" of the form "Xyz (synonym for Pqr)"
Now includes synonyms from GBIF (about 800,000 of them)
Synonyms file now has a column-name header row, like the taxonomy file
Some IRMNG homonyms which were previously suppressed have been recovered (those taxa that have children)
Eliminated the "kill lists" of certain NCBI and GBIF taxa. This might introduce some problems; it is unclear whether it does
Fixed Rhodophyta duplications, Ciliophora mapping error, and various other minor problems
Removed dubious IRMNG genera from top level of Metazoa and Plantae
Editing system implemented, and used to add a list of about 70 fungal species

Version 2.0 release notes

OTT 2.0 employs a version of the GBIF taxonomy that was published in July 2012. The GBIF taxonomy was retrieved via this page.

This version is intended to supersede what I'll call "version 1.0" (the OTToL we've been using through March 2013, see here).

New features / changes since version 1.0:

Use of synonyms from NCBI in resolving taxa from GBIF
A more careful merging method, some errors fixed
Update to more recent version of NCBI (previous was from approximately March 1)
Inclusion of "unclassified" taxa from NCBI
About 600,000 additional taxa from GBIF
Ensures that a name occurring under disjoint nomenclatural authorities is considered homonymous
The build process is repeatable with identifier semantics preserved across updates to the sources
A full mapping from "preottol" is provided

OTT 2.0 is provided as a suite of files gathered into a compressed tarball; see http://dev.opentreeoflife.org/ott2.0/

Representation

Taxonomy

File ott2.0/taxonomy - the taxonomy itself. There is one row per taxon. The column separator (following NCBI's example) is tab-stroke-tab, that is, a tab, a vertical bar, and another tab.

Columns:

OTT identifier - these have been kept stable relative to OTToL 1.0
OTT identifier for the parent of this taxon, or empty if none
Name (e.g. "Rana palustris")
Rank ("genus" etc.)
Sources - this takes the form tag:id,tag:id where tag is a short string identifying the source taxonomy (currently just "ncbi" or "gbif") and id is the numeric accession number within that taxonomy. Examples: ncbi:8404,gbif:2427185 ncbi:1235509
Unique name - if the name is a homonym, then the name qualified with its rank and the name of its parent taxon, e.g. "Roperia (genus in family Hemidiscaceae)"
Flags - see https://github.com/OpenTreeOfLife/taxomachine/blob/master/src/main/java/opentree/taxonomy/OTTFlag.java
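
The column layout above can be read with a few lines of Python. This is only a sketch: the Taxon type, the function names, and the example row are made up for illustration, and tolerating an NCBI-style trailing "tab-stroke" terminator is an assumption, not something this page states about the OTT files.

```python
from typing import List, NamedTuple, Optional, Tuple

SEPARATOR = "\t|\t"  # tab, vertical bar, tab


class Taxon(NamedTuple):
    ott_id: str
    parent_id: Optional[str]          # None when the parent column is empty
    name: str
    rank: str
    sources: List[Tuple[str, str]]    # (tag, id) pairs such as ("ncbi", "8404")
    unique_name: str                  # empty unless the name is a homonym
    flags: str


def parse_sources(field: str) -> List[Tuple[str, str]]:
    """Split 'ncbi:8404,gbif:2427185' into [('ncbi', '8404'), ('gbif', '2427185')]."""
    pairs = []
    for item in field.split(","):
        if item:
            tag, _, accession = item.partition(":")
            pairs.append((tag, accession))
    return pairs


def parse_taxonomy_line(line: str) -> Taxon:
    """Parse one row of the taxonomy file into a Taxon."""
    text = line.rstrip("\r\n")
    # Assumption: drop an NCBI-style trailing "\t|" terminator if one is present.
    if text.endswith("\t|"):
        text = text[:-2]
    fields = text.split(SEPARATOR)
    fields += [""] * (7 - len(fields))  # pad rows that omit trailing columns
    ott_id, parent, name, rank, sources, unique_name, flags = fields[:7]
    return Taxon(ott_id, parent or None, name, rank,
                 parse_sources(sources), unique_name, flags)


# Made-up row: the OTT ids 12345 and 678 are placeholders, not real identifiers.
example = SEPARATOR.join(
    ["12345", "678", "Rana palustris", "species", "ncbi:8404,gbif:2427185", "", ""]
)
print(parse_taxonomy_line(example))
```

The file's column-name header row, if present, would need to be skipped by the caller before parsing.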

Synonyms

File ott2.0/synonyms - this is a simple mapping of synonym to OTT identifier. The content derives from NCBI; currently we don't harvest synonyms from GBIF (although it has a ton of them). Two columns, separated by tab-stroke-tab:

Name
OTT identifier
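
A sketch of turning the synonyms file into a lookup table follows. The function name and the skip_header parameter are hypothetical, and the sample names and ids are invented. Because the same synonym string could in principle point at more than one taxon, the sketch keeps a set of OTT identifiers per name rather than a single value.

```python
from collections import defaultdict

SEPARATOR = "\t|\t"  # tab, vertical bar, tab


def load_synonyms(lines, skip_header=False):
    """Map each synonym to the set of OTT identifiers it is listed under."""
    lookup = defaultdict(set)
    for index, line in enumerate(lines):
        if skip_header and index == 0:
            continue  # caller says the first row is a column-name header
        fields = line.rstrip("\r\n").split(SEPARATOR)
        if len(fields) < 2:
            continue  # not a two-column row; ignore it
        name = fields[0].strip()
        # Assumption: strip a trailing tab-stroke terminator if one is present.
        ott_id = fields[1].strip("\t| ")
        if name and ott_id:
            lookup[name].add(ott_id)
    return dict(lookup)


# Invented rows for illustration only.
sample = ["Xyz\t|\t42\n", "Pqr\t|\t42\n", "Xyz\t|\t43\n"]
print(load_synonyms(sample))
```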

Deprecated

File ott2.0/deprecated - taxa that are in version 1.0 but need to be deleted because they were deemed incorrect in some regard (incorrectly placed, ambiguous, synonyms, etc.).

Column separator is just tab (beware!). Of primary interest is the first column, which is the OTT identifier of a taxon from a previous version of OTT/OTToL that has been deprecated in this version. Any uses of such an id ought to be reprocessed by a TNRS or similar mechanism.
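
As a sketch of how a client might find the ids that need reprocessing, the following reads only the first column of the deprecated file and reports which previously stored ids appear in it. Both function names are hypothetical, and the sample rows are invented; what happens to a flagged id afterwards (for example a TNRS lookup) is outside the sketch.

```python
def load_deprecated_ids(lines):
    """Collect the first (tab-separated) column of each non-empty row."""
    deprecated = set()
    for line in lines:
        first = line.rstrip("\r\n").split("\t")[0].strip()
        if first:
            deprecated.add(first)
    return deprecated


def ids_needing_reprocessing(used_ids, deprecated):
    """Return, in input order, the ids in use that have been deprecated."""
    return [ott_id for ott_id in used_ids if ott_id in deprecated]


# Invented rows: the second column stands in for whatever else the file carries.
sample = ["1001\tsome reason\n", "1002\tanother reason\n"]
deprecated = load_deprecated_ids(sample)
print(ids_needing_reprocessing(["1000", "1001", "1003"], deprecated))
```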

Aux (Pre-OTToL mapping)

File ott2.0/aux - mapping of PreOTToL ids into OTT 2.0. There is an entry for every PreOTToL id that maps to OTT 2.0, and in addition entries for PreOTToL ids for which the OTToL 1.0 file provided a mapping. Column separator is tab.

PreOTToL identifier
OTT identifier, if the PreOTToL id maps to OTT 2.0, or empty, if OTToL gave a mapping of the PreOTToL id to OTToL 1.0 but there is no mapping to OTT 2.0.
Comment
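
The three columns above could be loaded along these lines. This is a sketch with a hypothetical function name and invented rows; it represents an empty second column as None, so that "known PreOTToL id with no OTT 2.0 mapping" stays distinguishable from "PreOTToL id not listed at all".

```python
def load_preottol_mapping(lines):
    """Map PreOTToL id -> OTT 2.0 id, or None when the OTT column is empty."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        preottol_id = fields[0].strip()
        if not preottol_id:
            continue  # skip blank rows
        ott_id = fields[1].strip() if len(fields) > 1 else ""
        mapping[preottol_id] = ott_id or None
    return mapping


# Invented rows: one mapped id, one with no mapping to OTT 2.0.
sample = ["500\t12345\tsome comment\n", "501\t\tno mapping\n"]
mapping = load_preottol_mapping(sample)
print(mapping)
print("999" in mapping)  # an id that the file does not mention at all
```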

Log file

File ott2.0/log - detailed trace of merge algorithm for those names for which the process was "interesting". Currently this is probably only readable by me (JAR). Can be used for diagnosing problems and explaining mapping decisions.

Gotchas

This taxonomy is not authoritative by any stretch of the imagination. It is a product of expedience meant to fill the particular immediate needs of Open Tree of Life, nothing else.

Mistakes might come from any of the sources (NCBI Taxonomy, GBIF) or be introduced by us.

When there is any question about parent/child taxon relationships, NCBI always takes precedence over GBIF.

When mapping GBIF to NCBI, if a name occurring in both places is deemed to be a match, all of the GBIF children that don't map to some other NCBI taxon are added as children of the merged node. Usually that will mean all of them, but there are a number of cases (about 1500) where the GBIF taxon is "paraphyletic" and the decision as to where to place the children (when they don't already belong to the corresponding NCBI taxon) is somewhat arbitrary.
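
The child-placement rule just described can be illustrated with a toy function. This is not the actual merge code; the function name, the argument shapes, and the sample ids are all assumptions made for the example. It keeps the NCBI children and adds only those GBIF children that have no NCBI counterpart.

```python
def children_of_merged_node(ncbi_children, gbif_children, gbif_to_ncbi):
    """Toy version of the rule: NCBI children, plus unmapped GBIF children.

    gbif_to_ncbi maps a GBIF child id to the NCBI taxon it matched, for
    those GBIF taxa that matched anything. A mapped GBIF child already has
    a home (under its NCBI taxon's parent), so it is not added again here.
    """
    merged = list(ncbi_children)
    for child in gbif_children:
        if child not in gbif_to_ncbi:
            merged.append(child)
    return merged


# Invented ids: gbif:2 matched ncbi:20, which lives under some other NCBI taxon.
print(children_of_merged_node(
    ncbi_children=["ncbi:10", "ncbi:11"],
    gbif_children=["gbif:1", "gbif:2", "gbif:3"],
    gbif_to_ncbi={"gbif:1": "ncbi:10", "gbif:2": "ncbi:20"},
))
```

In this toy case gbif:2 is the "paraphyletic" situation: the GBIF parent contains it, but its NCBI match sits elsewhere, so NCBI's placement wins and only gbif:3 is added.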

A name is sometimes judged by a process of elimination to name a single unified taxon. That is, there is no reason to think there is one taxon rather than two other than the shared name, but there is no evidence to the contrary either. This is the case for about 4000 tips (usually species) and 500 internal taxa. Although I'm not aware of counterexamples, this kind of argument is weak (especially in the case of genera) and the name might in fact name two different taxa homonymously, one from each source taxonomy.

Contrariwise, sometimes there is evidence that a name means different taxa in NCBI and GBIF, with no evidence that it names only one taxon, and so the merge process creates homonyms that weren't homonyms in either input taxonomy. This determination is heuristic and may be wrong in some cases (in fact, probably most of the time; typical example: Parauronematidae), with the effect that a single taxon appears to occur in multiple places in the tree. There are about 6000 of these names.

Future work

Incorporate additions from source trees proposed via treemachine and/or phylografter, as needed
Update to newer version of GBIF
Propagate provenance provided by GBIF
Add other taxonomies?
Manual corrections?
GBIF synonyms ??
Taxonomy level metadata

Requirements

Following is the analysis that led to the current design of OTT, copied from the minutes of a meeting of the software group held in January 2013:

Requirements on inputs to the opentree taxonomy synthesis step

Source of our requirements = ingest (matching tree tips) and query (searching for parts of synthetic tree)
We (opentree) can do a limited amount of programmatic synthesis/stitching (no manual steps) but...
Minimize number of input taxonomies that feed into opentree taxonomy synthesis process... we want someone to be responsible for being comprehensive
Combined set of input taxonomies should be comprehensive
NCBI at .4M is not comprehensive enough
Should pass our informal spot checks
Must be of adequate precision (in particular should not treat IRMNG homonym list as valid)
Functional hierarchy - each should be a tree (not a forest, not a graph, no orphan taxa)
Each should have a commitment to active maintenance, should be responsive to our bug reports
Should be open (probably we need public domain) (there are possible problems with some candidate input taxonomies)
We can repair problematic backbone issues in inputs, by overriding bad sources with good ones (cf. synthesis above)
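
The "functional hierarchy" requirement above (a tree, not a forest or a graph, with no orphan taxa) lends itself to a mechanical check. The following is a sketch under the assumption that a candidate taxonomy has been reduced to a dictionary from taxon id to parent id, with None marking a root; the function name and the report strings are made up.

```python
def tree_problems(parent_of):
    """Return a list of reasons why parent_of does not describe a single tree."""
    problems = []

    # A tree has exactly one root; more than one means a forest.
    roots = [taxon for taxon, parent in parent_of.items() if parent is None]
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")

    # Every named parent must itself be a known taxon, else the child is an orphan.
    for taxon, parent in sorted(parent_of.items()):
        if parent is not None and parent not in parent_of:
            problems.append(f"orphan: {taxon} has unknown parent {parent}")

    # Walking up from any taxon must never revisit a taxon (no cycles).
    for start in sorted(parent_of):
        seen = set()
        node = start
        while node is not None and node in parent_of:
            if node in seen:
                problems.append(f"cycle reachable from {start}")
                break
            seen.add(node)
            node = parent_of[node]
    return problems


# Invented ids for illustration.
print(tree_problems({"1": None, "2": "1", "3": "2"}))   # a well-formed tree
print(tree_problems({"1": None, "2": None, "3": "9"}))  # a forest with an orphan
```

The upward walk is quadratic in the worst case, which is acceptable for a spot check but would want memoization on a taxonomy with millions of taxa.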
