Open Tree Taxonomy moved
!!! Do not edit this page - the content is now maintained in the reference-taxonomy repo wiki
This page describes the taxonomy used by the Open Tree of Life project.
Some of us were not fond of the name "OTToL," so I'm using a new acronym "OTT" (Open Tree taxonomy), but it's the same idea.
The taxonomy is a merge of SILVA, NCBI/GenBank Taxonomy, and the GBIF "nub" taxonomy. There is a list of requirements for potential new inputs into OTT; see below.
- Incorporates SILVA (minus plants, animals, fungi) and the taxonomy from study 713
- Incorporates patches from the Interim taxonomy patch feature
- More thorough synonym processing
- The deprecated.tsv file has a column providing a replacement ID for cases where a newly known synonym caused an ID to be deprecated
- Updated to latest version of NCBI taxonomy (GBIF hasn't released a new version yet)
- Updated to latest version of NCBI taxonomy and GBIF nub
- Includes all taxa from NCBI, with "dubious" ones (unclassified, virus, etc.) marked with a "D" flag in the last column
- Various improvements to the "unique names"
- Synonyms file now contains "unique names" of the form "Xyz (synonym for Pqr)"
- Now includes synonyms from GBIF (about 800,000 of them)
- Synonyms file now has a column-name header row, like the taxonomy file
- Some IRMNG homonyms which were previously suppressed have been recovered (those taxa that have children)
- Eliminated the "kill lists" of certain NCBI and GBIF taxa. It is unclear whether this introduces new problems
- Fixed Rhodophyta duplications, Ciliophora mapping error, and various other minor problems
- Removed dubious IRMNG genera from top level of Metazoa and Plantae
- Editing system implemented, and used to add a list of about 70 fungal species
OTT 2.0 employs a version of the GBIF taxonomy that was published in July 2012. The GBIF taxonomy was retrieved via this page.
This version is intended to supersede what I'll call "version 1.0" (the OTToL we've been using through March 2013, see here).
New features / changes since version 1.0:
- Use of synonyms from NCBI in resolving taxa from GBIF
- A more careful merging method, some errors fixed
- Update to more recent version of NCBI (previous was from approximately March 1)
- Inclusion of "unclassified" taxa from NCBI
- About 600,000 additional taxa from GBIF
- Ensures that a name occurring under disjoint nomenclatural authorities is considered homonymous
- The build process is repeatable with identifier semantics preserved across updates to the sources
- A full mapping from "preottol" is provided
OTT 2.0 is provided as a suite of files gathered into a compressed tarball; see http://dev.opentreeoflife.org/ott2.0/
File ott2.0/taxonomy - the taxonomy itself. There is one row per taxon. The column separator (following NCBI's example) is tab-stroke-tab, that is, a tab, a vertical bar "|", and another tab.
Columns:
- OTT identifier - these have been kept stable relative to OTToL 1.0
- OTT identifier for the parent of this taxon, or empty if none
- Name (e.g. "Rana palustris")
- Rank ("genus" etc.)
- Sources - this takes the form tag:id,tag:id where tag is a short string identifying the source taxonomy (currently just "ncbi" or "gbif") and id is the numeric accession number within that taxonomy. Examples: ncbi:8404,gbif:2427185 ncbi:1235509
- Unique name - if the name is a homonym, then the name qualified with its rank and the name of its parent taxon, e.g. "Roperia (genus in family Hemidiscaceae)"
- Flags - see https://github.com/OpenTreeOfLife/taxomachine/blob/master/src/main/java/opentree/taxonomy/OTTFlag.java
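As an illustration of the column layout above, here is a minimal Python sketch of reading one row of the taxonomy file. The names `Taxon`, `parse_sources` and `parse_taxonomy_line` are made up for this example, the OTT identifiers in the sample row are invented, and the handling of a trailing tab-stroke terminator (as in NCBI's dump files) is an assumption, not something the file is known to contain.

```python
from typing import List, NamedTuple, Optional, Tuple

SEPARATOR = "\t|\t"  # "tab-stroke-tab"


class Taxon(NamedTuple):
    ott_id: str
    parent_id: Optional[str]          # None when the parent column is empty
    name: str
    rank: str
    sources: List[Tuple[str, str]]    # (tag, id) pairs, e.g. ("ncbi", "8404")
    unique_name: str
    flags: str


def parse_sources(field: str) -> List[Tuple[str, str]]:
    """Split 'tag:id,tag:id' into a list of (tag, id) pairs."""
    pairs = []
    for item in field.split(","):
        if item:
            tag, _, accession = item.partition(":")
            pairs.append((tag, accession))
    return pairs


def parse_taxonomy_line(line: str) -> Taxon:
    text = line.rstrip("\n")
    # Assumption: a row may end with a "\t|" terminator, NCBI-dump style.
    if text.endswith("\t|"):
        text = text[:-2]
    fields = text.split(SEPARATOR)
    fields += [""] * (7 - len(fields))  # pad short rows with empty columns
    ott_id, parent, name, rank, sources, unique_name, flags = fields[:7]
    return Taxon(ott_id, parent or None, name, rank,
                 parse_sources(sources), unique_name, flags)


if __name__ == "__main__":
    # The identifiers 1234 and 5678 are invented for this example.
    sample = SEPARATOR.join(
        ["1234", "5678", "Rana palustris", "species",
         "ncbi:8404,gbif:2427185", "", ""])
    print(parse_taxonomy_line(sample))
```

The same separator applies to the synonyms file, so the splitting step could be reused there with two columns instead of seven.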
File ott2.0/synonyms - this is a simple mapping of synonym to OTT identifier. The content derives from NCBI; currently we don't harvest synonyms from GBIF (although it has a ton of them). Two columns, separated by tab-stroke-tab:
- Name
- OTT identifier
File ott2.0/deprecated - taxa that are in version 1.0 but need to be deleted because they were deemed incorrect in some regard (incorrectly placed, ambiguous, synonyms, etc.).
Column separator is just tab (beware!). Of primary interest is the first column, which is an OTT identifier for a taxon in a previous version of OTT/OTToL, that in this version has been deprecated. Any uses of such an id ought to be reprocessed by a TNRS or similar mechanism.
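A rough sketch of how a client might find identifiers that need reprocessing, assuming the deprecated file has no header row and that only its first column matters. The function names `load_deprecated_ids` and `find_stale_ids` are hypothetical, and the sample rows are invented.

```python
from typing import Iterable, List, Set


def load_deprecated_ids(lines: Iterable[str]) -> Set[str]:
    """Collect the first (plain-tab-separated) column of each non-empty row."""
    deprecated = set()
    for line in lines:
        first = line.rstrip("\n").split("\t")[0]
        if first:
            deprecated.add(first)
    return deprecated


def find_stale_ids(used_ids: Iterable[str], deprecated: Set[str]) -> List[str]:
    """Return the sorted, de-duplicated ids in use that have been deprecated."""
    return sorted(set(used_ids) & deprecated)


if __name__ == "__main__":
    # Invented rows; a real run would pass an open file for ott2.0/deprecated.
    rows = ["111\tsome reason\n", "222\n"]
    stale = find_stale_ids(["222", "333"], load_deprecated_ids(rows))
    print(stale)  # ids that should be sent back through a TNRS
```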
File ott2.0/aux - mapping of PreOTToL ids into OTT 2.0. There is an entry for every PreOTToL id that maps to OTT 2.0, and in addition entries for PreOTToL ids for which the OTToL 1.0 file provided a mapping. Column separator is tab.
- PreOTToL identifier
- OTT identifier, if the PreOTToL id maps to OTT 2.0, or empty, if OTToL gave a mapping of the PreOTToL id to OTToL 1.0 but there is no mapping to OTT 2.0.
- Comment
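The three cases described for the aux file (mapped to OTT 2.0, known to OTToL 1.0 but not mapped to OTT 2.0, and not listed at all) can be told apart as in the sketch below. The function names and the status strings are invented for illustration, and the sketch assumes there is no header row.

```python
from typing import Dict, Iterable, Optional, Tuple


def load_preottol_map(lines: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map PreOTToL id -> OTT id, or None when the OTT column is empty."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank rows
        ott_id = fields[1] if len(fields) > 1 and fields[1] else None
        mapping[fields[0]] = ott_id
    return mapping


def translate(pre_id: str,
              mapping: Dict[str, Optional[str]]) -> Tuple[str, Optional[str]]:
    """Return (status, ott_id); the status labels are made up for this sketch."""
    if pre_id not in mapping:
        return ("unknown", None)
    ott_id = mapping[pre_id]
    if ott_id is None:
        return ("no-ott2-mapping", None)
    return ("mapped", ott_id)


if __name__ == "__main__":
    # Invented rows: id 10 maps to OTT 2.0, id 11 had only an OTToL 1.0 mapping.
    table = load_preottol_map(["10\t1234\tok\n", "11\t\tdropped\n"])
    for pre_id in ("10", "11", "12"):
        print(pre_id, translate(pre_id, table))
```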
File ott2.0/log - detailed trace of merge algorithm for those names for which the process was "interesting". Currently this is probably only readable by me (JAR). Can be used for diagnosing problems and explaining mapping decisions.
This taxonomy is not authoritative by any stretch of the imagination. It is a product of expedience meant to fill the particular immediate needs of Open Tree of Life, nothing else.
Mistakes might come from any of the sources (NCBI Taxonomy, GBIF) or be introduced by us.
When there is any question about parent/child taxon relationships, NCBI always takes precedence over GBIF.
When mapping GBIF to NCBI, if a name occurring in both places is deemed to be a match, all of the GBIF children that don't map to some other NCBI taxon are added as children of the merged node. Usually that will mean all of them, but there are a number of cases (about 1500) where the GBIF taxon is "paraphyletic" and the decision as to where to place the children (when they don't already belong to the corresponding NCBI taxon) is somewhat arbitrary.
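The child-placement rule can be sketched roughly as follows. This is a simplified illustration, not the actual merge code: the function name `merge_children`, the data shapes, and the identifiers are all assumptions. GBIF children with no NCBI counterpart are attached to the merged node, while GBIF children that map to an NCBI taxon outside the merged node are reported separately, since those are the "paraphyletic" cases.

```python
from typing import Dict, List, Tuple


def merge_children(
    ncbi_children: List[str],
    gbif_children: List[str],
    gbif_to_ncbi: Dict[str, str],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Return (children of the merged node, GBIF children placed elsewhere).

    gbif_to_ncbi maps a GBIF id to the NCBI id it was matched with; GBIF ids
    absent from it have no NCBI counterpart.
    """
    merged = [("ncbi", child) for child in ncbi_children]
    placed_elsewhere = []
    ncbi_set = set(ncbi_children)
    for child in gbif_children:
        target = gbif_to_ncbi.get(child)
        if target is None:
            # No NCBI counterpart: add under the merged node.
            merged.append(("gbif", child))
        elif target not in ncbi_set:
            # Matched an NCBI taxon outside this node: NCBI placement wins.
            placed_elsewhere.append((child, target))
        # Otherwise the child is already represented by its NCBI match.
    return merged, placed_elsewhere


if __name__ == "__main__":
    # Invented ids: g1 matches n1, g2 is GBIF-only, g3 matches n9 elsewhere.
    print(merge_children(["n1", "n2"], ["g1", "g2", "g3"],
                         {"g1": "n1", "g3": "n9"}))
```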
A name is sometimes judged by a process of elimination as naming a single unified taxon - that is, there is no reason to think there's only one taxon instead of two, other than that they have the same name; but no evidence to the contrary either. This is the case for about 4000 tips (usually species) and 500 internal taxa. Although I'm not aware of counterexamples, this kind of argument is weak (especially in the case of genera) and the name might in fact name two different taxa homonymously, one from each source taxonomy.
Contrariwise, sometimes there is evidence that a name means different taxa in NCBI and GBIF, with no evidence it only names one taxon, and so the merge process creates homonyms that weren't homonyms in either input taxonomy. This determination is heuristic and may be wrong in some cases (in fact, probably most of the time; typical example: Parauronematidae), with the effect that a single taxon appears to occur in multiple places in the tree. There are about 6000 of these names.
- Incorporate additions from source trees proposed via treemachine and/or phylografter, as needed
- Update to newer version of GBIF
- Propagate provenance provided by GBIF
- Add other taxonomies?
- Manual corrections?
- GBIF synonyms ??
- Taxonomy level metadata
Following is the analysis that led to the current design of OTT, copied from the minutes of a meeting of the software group held in January 2013:
Requirements on inputs to the opentree taxonomy synthesis step
- Source of our requirements = ingest (matching tree tips) and query (searching for parts of synthetic tree)
- We (opentree) can do a limited amount of programmatic synthesis/stitching (no manual steps) but...
- Minimize number of input taxonomies that feed into opentree taxonomy synthesis process... we want someone to be responsible for being comprehensive
- Combined set of input taxonomies should be comprehensive
- NCBI at .4M is not comprehensive enough
- Should pass our informal spot checks
- Must be of adequate precision (in particular should not treat IRMNG homonym list as valid)
- Functional hierarchy - each should be a tree (not a forest, not a graph, no orphan taxa)
- Each should have a commitment to active maintenance, should be responsive to our bug reports
- Should be open (probably we need public domain) (there are possible problems with some candidate input taxonomies)
- We can repair problematic backbone issues in inputs, by overriding bad sources with good ones (cf. synthesis above)
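The "functional hierarchy" requirement above (a tree, not a forest, not a graph, no orphan taxa) could be spot-checked along the lines of the sketch below. It assumes the taxonomy has been loaded as a dictionary from taxon id to parent id, with None for the root; the function name `check_is_tree` and the wording of the messages are invented for this example.

```python
from typing import Dict, List, Optional


def check_is_tree(parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return a list of problems; an empty list means the input is one tree."""
    problems = []

    # Exactly one root: more than one would make it a forest.
    roots = [node for node, parent in parent_of.items() if parent is None]
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")

    # No orphans: every named parent must itself be a known taxon.
    for node, parent in parent_of.items():
        if parent is not None and parent not in parent_of:
            problems.append(f"orphan: {node} has unknown parent {parent}")

    # No cycles: walk up from each node, remembering nodes already checked.
    checked = set()
    for start in parent_of:
        path, seen = [], set()
        node = start
        while node is not None and node in parent_of and node not in checked:
            if node in seen:
                problems.append(f"cycle involving {node}")
                break
            seen.add(node)
            path.append(node)
            node = parent_of[node]
        checked.update(path)
    return problems


if __name__ == "__main__":
    print(check_is_tree({"life": None, "a": "life", "b": "a"}))
    print(check_is_tree({"r1": None, "r2": None, "x": "missing"}))
```

Because each taxon has a single parent column, the only way the structure can fail to be a tree is through multiple roots, dangling parents, or cycles, which is why those three checks suffice here.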