
Interim taxonomy file format (moved)

Jonathan A Rees edited this page Jan 25, 2016 · 1 revision

This page is obsolete. It has been moved to, and is now maintained as part of, the reference-taxonomy wiki.


I didn't invent this format; I inherited it from Stephen, who copied it from NCBI...

A better format to use would be Darwin Core Archive, which is used by GBIF. TBD.


Following is the very limited format for taxonomy source files. Each source taxonomy (NCBI, GBIF, ...) has its own script that converts its native format into this format.

There should be one directory per taxonomy; the directory should be named descriptively. The contents of the directory are files with fixed names. Example: mycobank/taxonomy.tsv, mycobank/synonyms.tsv, mycobank/about.md.
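As a rough illustration of the layout described above, the following sketch reports which of the fixed-name files are absent from a taxonomy directory. The function name `missing_files` is hypothetical, and treating synonyms.tsv as always expected is an assumption made for the example; a taxonomy without synonyms may reasonably omit it.

```python
from pathlib import Path

# Fixed file names taken from the example above. Treating all three as
# expected is an assumption of this sketch.
EXPECTED_FILES = ("taxonomy.tsv", "synonyms.tsv", "about.md")


def missing_files(taxonomy_dir):
    """Return the expected file names that are not present in taxonomy_dir."""
    root = Path(taxonomy_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_files("mycobank")` returns an empty list when the directory is complete.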

Character encoding

All files should use the UTF-8 character encoding. Native taxonomy files often use some other encoding, so conversion might be necessary. Some aggregated taxonomies on the web have gotten this wrong and are a mess of mixed encodings and spurious re-encodings.
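A minimal sketch of the conversion step, assuming the native encoding is already known (Latin-1 is used here purely as an example). The helper names `reencode_to_utf8` and `is_valid_utf8` are hypothetical.

```python
def reencode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    # decode() may raise UnicodeDecodeError if the bytes are not valid
    # in the stated source encoding.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Example: a name stored as Latin-1 is not valid UTF-8 until converted.
latin1_bytes = "Amanita cæsarea".encode("latin-1")
utf8_bytes = reencode_to_utf8(latin1_bytes, "latin-1")
```

Passing the validity check does not show that a file was decoded from the right encoding in the first place. Mis-decoded text re-encoded as UTF-8 is still valid UTF-8, which is how the spurious re-encodings mentioned above arise.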

Taxonomy

File taxonomy.tsv:

Four columns, each followed by the three-character sequence tab, vertical bar, tab. The last column is followed by this sequence as well.

There should be a header row, which looks like:

uid	|	parent_uid	|	name	|	rank	|	

Followed by one row per taxon.

Column 1: identifier - an integer identifier for the taxon, unique within this file. Should be the native accession number whenever possible.

Column 2: parent taxon identifier, or the empty string if there is no parent.

Column 3: name - arbitrary text for the taxon name; not necessarily unique within the file.

Column 4: rank, e.g. species, family, class. Should be all lower case. If no rank is assigned, or the rank is unknown, put "no rank".
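The column rules above could be checked with something like the following sketch. The function name `validate_taxon` and the wording of the messages are hypothetical, and the checks cover only what is stated above. Uniqueness of identifiers within the file is not checked here.

```python
import re

INTEGER = re.compile(r"[0-9]+")


def validate_taxon(uid, parent_uid, name, rank):
    """Return a list of problems found in one taxon row (empty if none)."""
    problems = []
    if not INTEGER.fullmatch(uid):
        problems.append("uid is not an integer: %r" % uid)
    # An empty parent_uid is allowed: it means the taxon has no parent.
    if parent_uid and not INTEGER.fullmatch(parent_uid):
        problems.append("parent_uid is not an integer: %r" % parent_uid)
    if not rank:
        problems.append("rank is empty; use 'no rank'")
    elif rank != rank.lower():
        problems.append("rank is not lower case: %r" % rank)
    return problems
```

The name column is described as arbitrary text, so the sketch places no constraint on it.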

Example (from NCBI):

    5157	|	1028423	|	Ceratocystis	|	genus	|	
    5156	|	91171	|	Gondwanamyces proteae	|	species	|	
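A minimal sketch of reading this format, assuming every column (including the last) is terminated by tab, vertical bar, tab, as in the header row. The names `parse_row` and `read_taxonomy` are hypothetical and are not part of any existing tool. Identifiers are kept as strings.

```python
SEP = "\t|\t"  # each column is followed by tab, vertical bar, tab


def parse_row(line):
    """Split one row into its column values."""
    line = line.rstrip("\r\n")
    # Drop the terminator after the last column before splitting.
    if line.endswith(SEP):
        line = line[: -len(SEP)]
    return line.split(SEP)


def read_taxonomy(lines):
    """Map each uid to a dict of column name -> value, using the header row."""
    rows = iter(lines)
    header = parse_row(next(rows))
    taxa = {}
    for line in rows:
        if not line.strip():
            continue  # skip blank lines
        record = dict(zip(header, parse_row(line)))
        taxa[record["uid"]] = record
    return taxa


sample = (
    "uid\t|\tparent_uid\t|\tname\t|\trank\t|\t\n"
    "5157\t|\t1028423\t|\tCeratocystis\t|\tgenus\t|\t\n"
    "5156\t|\t91171\t|\tGondwanamyces proteae\t|\tspecies\t|\t\n"
)
taxa = read_taxonomy(sample.splitlines())
```

Because column names are taken from the header row, an optional extra column is picked up without changes to the code.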

Optional column:

sourceinfo: a comma-separated list of specifiers, each one either a URL or a CURIE. A URL should be either a DOI in the form of a URL, or a link to some other source such as a database. URLs begin with 'http://' or 'https://', and DOI URLs begin with 'http://dx.doi.org/10.'. A CURIE is an abbreviated URI using a prefix drawn from a known set, e.g. ncbi:1234 is taxon 1234 in the NCBI taxonomy. Other prefixes include gbif:, if: (Index Fungorum), and mb: (Mycobank). New prefixes can be added, but this is a manual process, so please request them explicitly.
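The sourceinfo rules could be applied roughly as in the sketch below. The function names are hypothetical, and `KNOWN_PREFIXES` holds only the four prefixes named above, so the real set may be larger. The DOI and database URLs in the sample are made-up placeholders.

```python
# Only the prefixes named in the text; the actual known set may differ.
KNOWN_PREFIXES = {"ncbi", "gbif", "if", "mb"}


def classify_specifier(spec):
    """Return (kind, specifier), where kind is 'doi', 'url', 'curie' or 'unknown'."""
    spec = spec.strip()
    if spec.startswith("http://dx.doi.org/10."):
        return ("doi", spec)
    if spec.startswith(("http://", "https://")):
        return ("url", spec)
    prefix, colon, local_id = spec.partition(":")
    if colon and local_id and prefix in KNOWN_PREFIXES:
        return ("curie", spec)
    return ("unknown", spec)


def parse_sourceinfo(value):
    """Split a sourceinfo value on commas and classify each specifier."""
    return [classify_specifier(s) for s in value.split(",") if s.strip()]


# Placeholder values for illustration only.
sources = parse_sourceinfo(
    "ncbi:1234,http://dx.doi.org/10.0000/placeholder,https://example.org/db/42"
)
```

Splitting on commas means that a URL containing a comma would be broken apart. The sketch does not attempt to handle that case.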

Synonyms

Usually there are synonyms. These go into a second file, synonyms.tsv. This file should have a header row:

    uid	|	name	|	type	|	rank	|	

Thereafter there are four columns:

Column 1: uid - the id for the taxon (from the taxonomy file) that this synonym resolves to

Column 2: name - the synonymic taxon name

Column 3: type - typically will be 'synonym' but could be any of the NCBI synonym types (authority, common name, etc.)

Column 4: I don't know what this is for. Seems to always be empty, and is ignored by taxonomy synthesis.

Example from NCBI:

    89373	|	Flexibacteraceae	|	synonym	|	|	
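A minimal sketch of collecting synonyms by the taxon they resolve to, assuming the same tab, vertical bar, tab separator as taxonomy.tsv. The name `read_synonyms` is hypothetical. The header row is skipped without being interpreted, and the fourth column is ignored.

```python
from collections import defaultdict

SEP = "\t|\t"  # same separator as in taxonomy.tsv


def read_synonyms(lines):
    """Map each taxon uid to a list of (synonym name, synonym type) pairs."""
    rows = iter(lines)
    next(rows)  # skip the header row; its contents are not used here
    by_uid = defaultdict(list)
    for line in rows:
        fields = line.rstrip("\r\n").split(SEP)
        if len(fields) < 3:
            continue  # blank or malformed row
        uid, name, syn_type = fields[0], fields[1], fields[2]
        by_uid[uid].append((name, syn_type))
    return dict(by_uid)


sample = (
    "uid\t|\tname\t|\ttype\t|\trank\t|\t\n"
    "89373\t|\tFlexibacteraceae\t|\tsynonym\t|\t\t|\t\n"
)
synonyms = read_synonyms(sample.splitlines())
```

The uid values can then be looked up in the taxonomy file to find the taxon each synonym resolves to.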

Metadata

Overall metadata for the taxonomy should be placed in a file as well. The metadata format is still under development, so for now you should create a markdown or plain text file called 'about.md' in the same directory as the taxonomy.tsv and synonyms.tsv files. The file should give the source of the taxonomy (article or database) and any other descriptive information that's available. The metadata is meant not just to describe the taxonomy but also to explain how to check its correctness against its source and how to make corrections and other improvements.

When using information from changing sources (databases) the date or dates of retrieval should be recorded.
