Conflict and synthesis file formats

The purpose of this wiki page is to document the format for synthetic trees, i.e. the interface between synthetic tree generators (such as propinquity) and synthetic tree consumers (such as tm-lite).

This page is derived from the sections of this google-drive doc that have to do with conflict analyses and the new synthesis tree file formats.

Support/conflict information

(There are some comments and corrections on this section here)

The fields described below are in terms of a synthetic tree being tested for conflict against a corpus of input trees. The notion of support and conflict could be made more general by thinking of one reference tree being compared to another arbitrary input tree or set of input trees. In the case in which the synthetic tree is the reference tree, it is taxonomically complete (define?). Other fields will be needed if we want to implement the conflict API for test trees that are not complete. We use synthesis tree and reference tree interchangeably in this document.

For each node in the reference tree we report the following fields if the values of the fields are not empty:

terminal - list of input trees (or nodes - see below for details) that contain this reference node as a terminal. In the case of the synthetic tree, we can easily link to the node name within the input tree since it will be the same OTT id as the node in the synthesis tree. Note: currently, we may flag terminals in the "exemplified" version of the input tree (if that input has a tip mapped to a higher taxon, that tip will be replaced with the relevant descendants) -- it is not clear if that is the optimal behavior.
supported_by - list of node identifiers in the input tree(s) that support this reference node (sensu https://github.com/OpenTreeOfLife/treemachine/blob/nonsense-1/nonsense/iteb_support_theorem.md).
conflicts_with - list of node identifiers in the input tree(s) that conflict with this reference node.
resolves - list of nodes in input trees. Each of these nodes could be resolved in a manner that would result in a tree that displays a node that corresponds to the reference node (e.g. if you took the reference tree and extracted the tree induced by the leaf set of a tree listed in the resolves list, you would get a tree with a polytomy which does not display this node. However, a resolution of that polytomy would display this node).
resolved_by - (write me)
partial_path_of - list of internal node identifiers for the input tree(s) - maximum of one node per input tree. The field means that the edge below this node in the reference tree is compatible with the edge that connects the listed nodes to their parents. However, this is not the only edge in the reference tree that maps to that node in the input tree. With respect to the leaf set of the input tree, multiple edges in the reference tree are redundant.
was_constrained - boolean. default is False. If true, then the clade was constrained to be monophyletic in the supertree algorithm.
was_uncontested - boolean. default is False. If true, then the taxon had no single input tree that showed the taxon to be non-monophyletic.

This classification results from mapping each input tree onto the synthesis tree, and classifying the synthesis tree edges according to how they correspond to input tree edges. If you take the induced tree (trace the input tree inside the synthesis tree) and contract all of the resolves edges, and you take the input tree and contract all of the conflicts_with edges, then you should get two identical trees.

Here an input tree edge x with split s(x) is considered to support a synthesis tree edge iff that synthesis tree edge is the only synthesis tree edge to display the split s(x).
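
The support rule above, together with the partial_path_of description, can be sketched in Python. This is a minimal illustration rather than the propinquity implementation: it assumes rooted trees, represents the split s(x) as the set of tip ids below the input edge, and treats a synthesis edge as displaying s(x) when its clade, restricted to the input tree's leaf set, equals that set. The function name and data layout are hypothetical.

```python
def classify_input_split(split, input_leaves, synth_clades):
    """Classify one input edge against the synthesis tree.

    split        -- tip ids below input edge x (the clade side of s(x))
    input_leaves -- set of all tip ids in the input tree
    synth_clades -- dict: synthesis node id -> set of tip ids below it
    (All three structures are assumptions made for this sketch.)
    """
    split = frozenset(split)
    # A synthesis edge displays s(x) if its clade, restricted to the
    # input tree's leaf set, is exactly the split.
    displaying = sorted(
        node for node, clade in synth_clades.items()
        if frozenset(clade) & input_leaves == split
    )
    if len(displaying) == 1:
        # The only edge displaying s(x): the input edge supports it.
        return {"supported_by": displaying, "partial_path_of": []}
    # Several redundant edges (or none) display s(x).
    return {"supported_by": [], "partial_path_of": displaying}


synth = {
    "root": {"A", "B", "C", "D"},
    "n1": {"A", "B", "C"},
    "n2": {"A", "B"},
}
full = classify_input_split({"A", "B"}, {"A", "B", "C", "D"}, synth)
pruned = classify_input_split({"A", "B"}, {"A", "B", "D"}, synth)
```

In the second call the input tree lacks tip C, so n1 and n2 induce the same split and neither is uniquely supported. The sketch does not distinguish the conflicts_with and resolves cases; both return empty lists here.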

Editorially speaking it might be possible to replace some of the above text with a reference to this page which defines support, resolution, etc. in a more general setting.

Synthetic tree

However, we do want to document the interface, so that other people can produce trees and use the treemachineLITE software to serve tree_of_life queries.

v1.0 synth format = one newick, one JSON

The tree structure is to be described in one file using the rules described below.

The JSON file for additional data contains the fields described below with the node IDs of the newick used as keys for node and edge data.

In v1.0 of the synthetic tree format the node labels are either OTT Ids or (if not OTT Ids) opaque strings.

v2.0 synth format = same information, but multiple files are supported

This is a slight tweak designed to make it easier to construct the full tree from smaller analyses.

The only difference between this format and the v1.0 is that:

the newick input can consist of multiple newicks. One newick will have the root of the tree labelled with an ID that occurs in no other newick. In all other newicks, the root of the newick will have a label that is a tip ID in the "ancestral" newick. This indicates the grafting point for the newicks. No other IDs are allowed to be re-used across files.
node and edge information in the JSON can be specified in multiple files. A parser simply takes the union of the information. It is illegal to have conflicting information about the same node or edge in different JSON files. Typically a node or edge would only be described in one file, but it is also permitted to have some complementary information about the same node/edge spread across multiple files.
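
A reader of the v2.0 format could handle both rules roughly as follows. This is a sketch under stated assumptions, not the tm-lite parser: the function names are hypothetical, the newicks are assumed to follow the simple id-labelled rules (alphanumeric labels, no branch lengths), and only the nodes and edges objects are merged.

```python
import re


def root_label(newick):
    """Root label: the text between the last ')' and the final ';'."""
    body = newick.strip().rstrip(";")
    return body[body.rfind(")") + 1:]


def tip_labels(newick):
    """In the simple format, a label right after '(' or ',' is a tip."""
    return set(re.findall(r"(?<=[(,])[a-zA-Z0-9]+", newick))


def find_root_newick(newicks):
    """Return the index of the one newick whose root is not grafted
    onto a tip of another newick; raise if there is not exactly one."""
    roots = [root_label(n) for n in newicks]
    tips = [tip_labels(n) for n in newicks]
    ungrafted = [
        i for i, root in enumerate(roots)
        if not any(root in t for j, t in enumerate(tips) if j != i)
    ]
    if len(ungrafted) != 1:
        raise ValueError(f"expected one ungrafted newick, found {len(ungrafted)}")
    return ungrafted[0]


def merge_annotations(docs):
    """Union the 'nodes' and 'edges' objects of several parsed JSON
    files, rejecting conflicting values for the same node or edge."""
    merged = {"nodes": {}, "edges": {}}
    for doc in docs:
        for section in ("nodes", "edges"):
            for ident, fields in doc.get(section, {}).items():
                target = merged[section].setdefault(ident, {})
                for key, value in fields.items():
                    if key in target and target[key] != value:
                        raise ValueError(
                            f"conflicting {key!r} for {section[:-1]} {ident!r}"
                        )
                    target[key] = value
    return merged
```

Complementary fields for the same node are combined, identical repeats are tolerated, and any disagreement is treated as an error. The grafting check only looks at root and tip labels; it does not verify that no other ID is re-used across files.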

v3.0 synth format

We are thinking of creating a registry for node/path IDs. After that is implemented, we will be able to make the restriction that every node ID in the newick corresponds to the ID of this node in the registry.

For named taxa, this will be the OTT ID as in v1 and v2. So this change is just a change in the semantics of the labels of nodes that are not in OTT (from "meaningless label" to "registered ID").

Completely simple id-labelled newick

This is for the v1.0 format. The representation is a set of files:

all_tips.tre - complete newick string
phylo_only_tips.tre - newick for tree with only tips occurring in source trees
annotations.json - tree metadata including conflict/support information

Comment on why we aren't planning to use an existing format

Newick is a commonly supported, terse format for expressing tree structure, but it is weak in terms of expressing other information because the meaning of the node labels and branch length information is not specified by the standard. The New Hampshire Extended convention and the metacomments used by BEAST and associated tools rectify this via "hot comments". While these solutions work, they increase the size of the newick. For large trees, this makes handling the tree representation cumbersome, and is particularly galling for client code that is only interested in the tree representation.

NeXML is very nice for representing rich data, but also results in a very large representation for a tree of over 2 million leaves. The richness of NeXML's annotations relies heavily on the fact that the fundamental entities of the format have IDs that can serve as the target of an annotation.

Rules for the completely simple id-labelled newick format.

The format obeys the rules of the standard newick format, but adds the following restrictions:

Every node must have a label (hence "complete")
each label fits the regex [a-zA-Z0-9]+. In other words only numbers and roman alphabet are allowed. (hence "simple")
each label begins with a letter (so that it is also an NCNAME)? - I think we agreed on this, but should make sure
each label is unique (hence "id-labelled") in the context of the newick (not necessarily a globally unique ID).
branch lengths are not included in the newick representation (and therefore colons do not appear in the tree representation)

The unique IDs can be used in accompanying data structures to uniquely refer to any node in the tree. The mandatory ID expands the size of the newicks somewhat, but requiring simple node labels makes it much easier to implement a validating parser.
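
To illustrate how little a validating parser needs under these rules, here is a minimal recursive-descent sketch in Python. The function name and the require_leading_letter switch are hypothetical; the switch is off by default because the leading-letter rule is still marked as unconfirmed above. The recursion is kept for brevity and would need to be made iterative for very deep trees.

```python
import re

LABEL = re.compile(r"[a-zA-Z0-9]+")


def parse_simple_newick(text, require_leading_letter=False):
    """Validate a completely simple id-labelled newick.

    Returns a dict mapping each node label to its parent's label
    (None for the root); raises ValueError on any rule violation.
    """
    s = text.strip()
    if not s.endswith(";"):
        raise ValueError("newick must end with ';'")
    s = s[:-1]
    parents = {}
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(s) and s[pos] == "(":
            pos += 1
            while True:
                children.append(parse_node())
                if pos < len(s) and s[pos] == ",":
                    pos += 1
                elif pos < len(s) and s[pos] == ")":
                    pos += 1
                    break
                else:
                    # Colons (branch lengths) and stray characters end up here.
                    raise ValueError(f"expected ',' or ')' at offset {pos}")
        match = LABEL.match(s, pos)
        if match is None:
            raise ValueError(f"missing or invalid label at offset {pos}")
        label = match.group()
        pos = match.end()
        if require_leading_letter and not label[0].isalpha():
            raise ValueError(f"label {label!r} does not begin with a letter")
        if label in parents:
            raise ValueError(f"duplicate label {label!r}")
        parents[label] = None
        for child in children:
            parents[child] = label
        return label

    parse_node()
    if pos != len(s):
        raise ValueError(f"unexpected character {s[pos]!r} at offset {pos}")
    return parents


parents = parse_simple_newick("((ott1,ott2)mrcaott1ott2,ott3)root;")
```

The returned parent map is one possible in-memory form; because every node carries a unique label, the same labels can be used directly as keys into annotations.json.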

Synthetic tree additional data (annotations.json)

"tree level" fields

These fields specify information about the synthetic tree's construction (and are used in the tree stats and tree_of_life/about calls):

date_completed - the date the synthetic tree's construction was completed. (As a property of the tree I think date_created, as originally proposed here, will be less surprising, more suggestive of meaning, and more memorable. - @jar398)
tree_id - a unique identifier for this version of the synthetic tree (I believe we agreed to change this to synth_id - @jar398)
taxonomy_version - the identifier for the version of the taxonomy that was used.
num_tips - the number of leaves in the tree
run_time - an estimate of the time taken to build the tree
num_source_trees - the number of input trees not counting the taxonomy
num_source_studies - the number of studies that contributed trees to the num_source_trees
root_taxon_name - (e.g. "cellular organisms")
root_ott_id
sources - list of strings. Each element is a reference to a source tree where the source_id_map is used to provide defining information for each source. If the order of trees affects the supertree, the list is given in that order.
source_id_map - object with string keys that map to objects describing the source (see below)
generated_by - list of objects that describes the software tools and versions used to build the tree (see below)
filtered_flags - list of taxon flags causing a taxonomy node to have been filtered out before synthesis (see here)

The other top-level fields hold information on the nodes and edges:

nodes - object with node_id's as keys used to describe the nodes in the tree (see below)
edges - object with node_id's as keys used to describe the edges in the tree (see below)

With the exception of nodes and edges this info occurs in only 1 JSON file at the highest level of the JSON, even in the v2 format that supports multiple files.

source_id_map objects

Have these fields:

study_id - the study's phylesystem id, e.g. 'ot_22'
tree_id - NCNAME for the tree within the study, e.g. 'tree3' or '_tree_17'
git_sha
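
The link between sources and source_id_map can be followed with a few lines of Python. This is a sketch only: the function name is hypothetical, and the example key "ot_22_tree3" and the git_sha value are made up for illustration (the key shape is guessed from the node example on this page, not specified here).

```python
def ordered_sources(annotations):
    """Return (source key, source description) pairs in the order
    given by 'sources', checking that every key is defined."""
    id_map = annotations["source_id_map"]
    missing = [s for s in annotations["sources"] if s not in id_map]
    if missing:
        raise KeyError(f"sources without a source_id_map entry: {missing}")
    return [(s, id_map[s]) for s in annotations["sources"]]


# Hypothetical data for illustration only.
example = {
    "sources": ["ot_22_tree3"],
    "source_id_map": {
        "ot_22_tree3": {
            "study_id": "ot_22",
            "tree_id": "tree3",
            "git_sha": "0000000",  # placeholder, not a real commit
        }
    },
}
pairs = ordered_sources(example)
```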

generated_by objects

name - name of the software
version - version string
git_sha - the version identifier for the source code
url - link to the tool
invocation - list of strings describing the command line arguments, possibly containing placeholders like "<STUDY_LIST>". Intended to document how to run the tool.

node fields

The value of nodes is a dictionary keyed by node id. The value under each node id is a dictionary with support/conflict information as described above (sort of). Example:

 "nodes": {
        "mrcaott4606641ott5048975": {
            "supported_by": [
                [
                    "ot_628_tree1", 
                    "node12"
                ], 
                [
                    "ot_612_tree1", 
                    "node22"
                ], 
                [
                    "ot_611_tree1", 
                    "node39"
                ], 
                [
                    "ot_616_tree2", 
                    "node8"
                ]
            ]
        },
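
A consumer might regroup these entries by source tree. The sketch below assumes, based only on the example above, that each entry is a two-element list of a source tree key followed by a node id within that tree; the function name is hypothetical.

```python
from collections import defaultdict


def group_by_source(node_annotation, field="supported_by"):
    """Map each source tree key to the list of its node ids found
    under the given field of one node's annotation."""
    grouped = defaultdict(list)
    for source_id, node_id in node_annotation.get(field, []):
        grouped[source_id].append(node_id)
    return dict(grouped)


# The node entry from the example above.
node = {
    "supported_by": [
        ["ot_628_tree1", "node12"],
        ["ot_612_tree1", "node22"],
        ["ot_611_tree1", "node39"],
        ["ot_616_tree2", "node8"],
    ]
}
by_source = group_by_source(node)
```

Fields that are absent are treated as empty lists, matching the rule that fields are only reported when their values are not empty.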

edge fields

Not in use as of 3/10/2016.

length
