Molecular JSON Draft Spec

Aaron Virshup edited this page Sep 1, 2016 · 15 revisions

WARNING: WIP

Aims

  1. Facilitate interchange between most computational chemistry/materials programs
  2. Store data in an unambiguous, easy-to-parse format
  3. Offer human-readable and -writable syntax
  4. Support extensible, self-describing, hierarchical storage with internal and external references
  5. Encourage rigorous data provenance tracking

Technical Specifications

  • Text format: JSON
  • High-performance format: HDF5
  • Encoding: UTF-8 (the JSON standard)
  • File size limits: ???

Scope

The file format is designed specifically to support common input and output for these applications:

  • Molecular dynamics: simulation (OpenMM, DESMOND, etc.) and analysis packages (MDTraj, PyTraj, etc.)
  • Excited state dynamics: Surface hopping, AIMS, MCTDH, etc.
  • Coarse grained simulations: MARTINI, rigid body MD, etc.
  • Quantum chemistry (PySCF, Psi4, NWChem, GAMESS, etc.)
  • Docking (UCSF DOCK, AutoDock, Glide, etc.)
  • Informatics (OpenBabel, RDKit, OEChem, etc.)
  • Visualization (VMD, Chimera, etc.)

JSON and HDF5

The object specifications in this document are tailored to JSON, but can be easily stored in an HDF5 file as well. HDF5 is, like JSON, hierarchical and self-describing. These similarities make it easy to perform 1-to-1 transformations between well-formed JSON and a corresponding HDF5 representation.

Unlike JSON, HDF5 is binary and requires custom libraries to read, but has far better performance and storage characteristics for numerical data. We will provide tools to easily interconvert files between JSON and HDF5. Applications that support this format should always provide JSON support; ones that require high performance should also support the HDF5 variant.
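As a rough illustration of why the two formats map onto each other, the sketch below flattens a parsed JSON document into slash-separated paths of the kind HDF5 uses for groups and datasets, then rebuilds the document from those paths. It does not call an HDF5 library. The helper names flatten and unflatten are hypothetical, and the sketch assumes a top-level object, no empty objects, and no "/" inside keys.

```python
import json

def flatten(node, prefix=""):
    """Yield (path, leaf) pairs: objects act as groups, anything else as a dataset."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{prefix}/{key}")
    else:
        yield prefix, node

def unflatten(pairs):
    """Rebuild the nested document from (path, leaf) pairs."""
    root = {}
    for path, leaf in pairs:
        parts = path.strip("/").split("/")
        cursor = root
        for part in parts[:-1]:
            cursor = cursor.setdefault(part, {})
        cursor[parts[-1]] = leaf
    return root

doc = json.loads(
    '{"name": "water", "masses": {"val": [15.999, 1.008, 1.008], "units": "amu"}}'
)
pairs = list(flatten(doc))
for path, leaf in pairs:
    print(path, leaf)
```

A real converter would also have to choose HDF5 dtypes for each leaf and decide how to store null; both are left open here.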

Data types

Note that this specification is intended to produce JSON files that are human-readable (and, to an extent, human-writable) ... especially by humans that have not read this specification.

Molecular structures will necessarily be self-referencing (e.g., a Bond object will need to reference two Atom objects). The specific method for doing so is one of the outstanding design decisions (see below).

Molecule

A molecule has these fields:

  • name (string): name of the molecule (no particular meaning)
  • type (string): "Molecule"
  • provenance (Provenance object): where this molecule came from
  • topology (Topology object): specifies atomic data, bonds and biomolecular (or materials) hierarchy
  • states (list of State objects): dynamical states with position, momentum, and calculated properties at each point
  • forcefield (optional) (Forcefield object): forcefield specification

Provenance

TBD

Topology

TBD (see design decisions below)

State

TBD

Forcefield

TBD

Physical units

All physical quantities must have associated units. The units are defined in TBD (design decision)

  • Simple units such as "angstrom", "nm", "femtosecond", "kilogram", etc. may be written as a string.
  • Compound units such as "angstrom/fs" or "kcal/mol" should be specified as TBD (design decision)

Numbers and quantities

Note that JavaScript has only one numeric data type (all numbers are floating point), and JSON likewise does not distinguish integers from floats.

  • unitless scalars: 1.0 OR {"val": 1.0, "units": null}
  • unitless arrays: [1,2,3.0,4] OR {"val": [1,2,3.0,4], "units": null}
  • scalar with units: {"val": 2.0, "units": "fs"}
  • array with units: {"val": [1,2,3.0], "units": "angstrom"}
  • complex numbers: {"val": {"real": 0.0, "imag": -1.0}, "units": null} (Complex numbers should always be written with units, even if they are null.)
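One way a reader might normalize the spellings listed above into a single (value, units) pair, sketched in Python. The function name read_quantity is hypothetical, and the input is assumed to have already been parsed from JSON, so null arrives as None.

```python
def read_quantity(obj):
    """Return (value, units) for a bare number, a bare array, or a val/units object."""
    if isinstance(obj, dict):
        value, units = obj["val"], obj.get("units")
    else:
        value, units = obj, None  # bare scalars and arrays are unitless
    if isinstance(value, dict):
        # complex numbers are written as {"real": ..., "imag": ...}
        value = complex(value["real"], value["imag"])
    return value, units

print(read_quantity(1.0))
print(read_quantity({"val": [1, 2, 3.0], "units": "angstrom"}))
print(read_quantity({"val": {"real": 0.0, "imag": -1.0}, "units": None}))
```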

Data tables

TBD (see design decisions below)

References

The format will require intramolecular references (for instance, a bond will need to reference two atoms) as well as external references (Provenance objects should be able to reference both external URLs and filesystem paths).

Format/data structure is TBD (see design decisions below)

Extensibility

Users should feel free to add metadata to these structures. However, a few notes of caution:

  1. Calculated quantities should go in "topology.properties" or "state.properties", UNLESS they are so unambiguous as to be trivially calculable (such as atomic numbers).
  2. Method-dependent metadata should be prefixed with a unique string to avoid namespace clashes. For instance, the state coefficients for a surface hopping method should be expressed as: "mdt_surface_hopping_coeffs": [{"val": {"real": 0.5, "imag": -0.1}, "units": null}, ...]

Design decisions and alternatives

Items here don't have a definite answer yet - there are multiple answers for each. Answers are currently ranked by AMV's capricious preferences.

How do we reference other objects?

JSON does not directly support object references. This makes it non-trivial to, say, maintain a list of bonds between atoms. Some solutions are:

  1. by array index (e.g., residue.atom_indices=[3,4,5,6])
  2. by JSON path reference (see, e.g., https://tools.ietf.org/html/draft-pbryan-zyp-json-ref-03)
  3. by a unique key (e.g., residue.id='a83nd83', residue.atoms=['a9n3d9', '31di3'])

Array indices are probably the best option: although they are a little fragile, they are no more fragile than path references, and they require far less overhead than unique keys.

See also: http://stackoverflow.com/q/4001474/1958900
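To make the array-index option and its fragility concrete, here is a small sketch. The bond layout with an atom_indices field is an assumption modeled on the residue.atom_indices example, and both helper functions are hypothetical.

```python
atoms = [{"name": "O"}, {"name": "H1"}, {"name": "H2"}]
bonds = [{"atom_indices": [0, 1]}, {"atom_indices": [0, 2]}]

def resolve(bond, atoms):
    """Look up the atom objects a bond refers to."""
    return [atoms[i] for i in bond["atom_indices"]]

def dangling_references(bonds, num_atoms):
    """Return the bonds whose indices fall outside the atom list."""
    return [b for b in bonds
            if any(not 0 <= i < num_atoms for i in b["atom_indices"])]

print([a["name"] for a in resolve(bonds[0], atoms)])

# Deleting an atom without renumbering the bonds shows the fragility:
# one bond now dangles, and the other silently points at a different atom.
atoms_after = atoms[:1] + atoms[2:]
print(dangling_references(bonds, len(atoms_after)))
print([a["name"] for a in resolve(bonds[0], atoms_after)])
```

A bounds check catches only the dangling case. The silently shifted reference is the failure mode that unique keys would avoid, at the cost of extra storage.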

How do we uniquely specify physical units?

  1. Publicly-available JSON file with supported units and conversions
  2. Standardize to some externally-chosen database or web service

How to specify compound units?

For instance, velocity might be "angstrom/fs". Alternatives:

  1. Require units in the form {unit_name:exponent}, e.g. atom.velocity.units={'angstrom':1, 'fs':-1}
  2. Allow strings of the form atom.velocity.units="angstrom/fs", but require that units be chosen from a specific list of specifications
  3. Allow strings of the form atom.velocity.units="angstrom/fs", and require file parsers to parse the units according to a specified syntax
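The string forms and the {unit_name: exponent} form are interconvertible once a syntax is fixed. The sketch below assumes one possible syntax, which the spec has not settled: factors joined by "*" or "/", an optional integer exponent after "^", and "/" applying only to the factor that follows it. The function name parse_units is hypothetical.

```python
import re

# One factor: an optional leading operator, a unit name, an optional ^exponent.
_FACTOR = re.compile(r"([*/]?)\s*([A-Za-z_]+)(?:\^(-?\d+))?\s*")

def parse_units(text):
    """Turn 'angstrom*amu*fs^-1' into {'angstrom': 1, 'amu': 1, 'fs': -1}."""
    text = text.strip()
    exponents = {}
    pos = 0
    while pos < len(text):
        match = _FACTOR.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse units: {text!r}")
        operator, name, power = match.groups()
        # The first factor must have no operator; every later one needs one.
        if (pos == 0) == bool(operator):
            raise ValueError(f"cannot parse units: {text!r}")
        sign = -1 if operator == "/" else 1
        exponents[name] = exponents.get(name, 0) + sign * int(power or 1)
        pos = match.end()
    # Drop units that cancel out completely.
    return {name: exp for name, exp in exponents.items() if exp != 0}

print(parse_units("angstrom/fs"))
print(parse_units("angstrom*amu*fs^-1"))
```

This sketch does not check unit names against a list of supported units, which alternative 2 would require.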

How do we represent time-dependent topology? (grand canonical, ReaxFF, etc.)

Possible answers:

  1. multiple topology objects (explicit but storage-intensive)
  2. states can store topology "patches" (saves memory, but confusing and hard to implement)
  3. Single global topology with all possible states (i.e., all possible bonds, all possible residues), states can include flags to turn elements on/off (NP-complete in some cases)

How do we store large lists of objects (such as lists of atoms or bonds)?

  1. As a table of values (tables also map well onto HDF5)
  2. As a set of arrays
  3. As a list of objects

Examples:

// 1) Storing fields as tables: creates an mmCIF/PDB-like layout
{"atoms": {"type": "table[atom]",
           "fields": ["name", "atomic_number", "mass/Dalton", "residue_index", "position/angstrom", "momentum/angstrom*amu*fs^-1"],
           "entries": [
             ["CA", 6, 12.0, 0, [0.214,12.124,1.12], [0,0,0]],
             ["N", 7, 14.20, 0, [0.214,12.124,1.12], [0,0,0]],
             ...]}}

// 2) Storing fields as arrays: much more compact, but harder to read and edit
{"num_atoms": 1234,
 "atoms": {"names": ["CA","CB","OP", ...],
           "atomic_numbers": [6,6,8, ...],
           "masses": {"val": [12.0, 12.0, 16.12, ...], "units": "amu"},
           "residue_indices": [0,0,0,1,1, ...],
           "positions": {"val": [[0.214,12.124,1.12], [0.214,12.124,1.12], ...], "units": "angstrom"},
           "momenta": {"val": [[0,0,0], [1,2,3], ...], "units": "angstrom*amu*fs^-1"}
          }
}

// 3) Storing the field names for each atom: readable, but makes the file huge
{"atoms": [
  {"name": "CA", "atnum": 6, "residue_index": 0,
   "mass": {"val": 12.00, "units": "dalton"},
   "position": {"val": [0.214,12.124,1.12], "units": "angstrom"},
   "momentum": {"val": [0.0, 0.0, 0.0], "units": "angstrom*dalton*fs^-1"}
  },
  {"name": "N", "atnum": 7, "residue_index": 0,
   "mass": {"val": 14.20, "units": "dalton"},
   "position": {"val": [0.214,12.124,1.12], "units": "angstrom"},
   "momentum": {"val": [0.0, 0.0, 0.0], "units": "angstrom*dalton*fs^-1"}
  },
  ...
  ]
}
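The three layouts carry the same information, so a tool could convert between them mechanically. The sketch below assumes every record has the same fields and ignores units for brevity. The function names are hypothetical.

```python
def objects_to_columns(records):
    """Layout 3 -> layout 2: a list of objects becomes one array per field."""
    fields = list(records[0]) if records else []
    return {field: [rec[field] for rec in records] for field in fields}

def columns_to_table(columns):
    """Layout 2 -> layout 1: per-field arrays become a fields list plus rows."""
    fields = list(columns)
    entries = [list(row) for row in zip(*(columns[f] for f in fields))]
    return {"fields": fields, "entries": entries}

atom_records = [
    {"name": "CA", "atnum": 6, "residue_index": 0},
    {"name": "N", "atnum": 7, "residue_index": 0},
]
columns = objects_to_columns(atom_records)
table = columns_to_table(columns)
print(columns)
print(table)
```

The table and array forms store each field name once. The object form repeats it for every atom, which is why it grows so quickly.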