Molecular JSON Draft Spec
WARNING: WIP
- Aims
- Technical Specifications
- Scope
- JSON and HDF5
- Data types
- Extensibility
- Design decisions and alternatives
- Facilitate interchange between most computational chemistry/materials programs
- Store data in an unambiguous, easy-to-parse format
- Offer human-readable and -writable syntax
- Support extensible, self-describing, hierarchical storage with internal and external references
- Encourage rigorous data provenance tracking
- Text format: JSON
- High-performance format: HDF5
- Encoding: UTF-8 (the JSON standard)
- File size limits: ???
The file format is designed specifically to support common input and output for these applications:
- Molecular dynamics: simulation (OpenMM, DESMOND, etc.) and analysis packages (MDTraj, PyTraj, etc.)
- Excited state dynamics: Surface hopping, AIMS, MCTDH, etc.
- Coarse grained simulations: MARTINI, rigid body MD, etc.
- Quantum chemistry (PySCF, Psi4, NWChem, GAMESS, etc.)
- Docking (UCSF-, Auto-, GLIDE, etc.)
- Informatics (OpenBabel, RDKit, OEChem, etc.)
- Visualization (VMD, Chimera, etc.)
The object specifications in this document are tailored to JSON, but can be easily stored in an HDF5 file as well. HDF5 is, like JSON, hierarchical and self-describing. These similarities make it easy to perform 1-to-1 transformations between well-formed JSON and a corresponding HDF5 representation.
Unlike JSON, HDF5 is binary and requires custom libraries to read, but has far better performance and storage characteristics for numerical data. We will provide tools to easily interconvert files between JSON and HDF5. Applications that support this format should always provide JSON support; ones that require high performance should also support the HDF5 variant.
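To make the 1-to-1 correspondence concrete, here is a minimal sketch of how a nested JSON object could map onto HDF5-style group and dataset paths, and back. It performs no actual HDF5 I/O. The helper names `flatten` and `unflatten` are hypothetical, and the sketch assumes keys never contain "/".

```python
import json

def flatten(node, prefix=""):
    """Map nested objects to HDF5-style paths; every non-object value is a leaf dataset."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}/{key}"
        if isinstance(value, dict) and value:
            flat.update(flatten(value, path))  # non-empty object -> group
        else:
            flat[path] = value  # scalar, array, or empty object -> dataset
    return flat

def unflatten(flat):
    """Rebuild the nested object from the path table produced by flatten()."""
    root = {}
    for path, value in flat.items():
        *groups, leaf = path.strip("/").split("/")
        node = root
        for group in groups:
            node = node.setdefault(group, {})
        node[leaf] = value
    return root

# Example document; the field layout below is illustrative only.
doc = json.loads(
    '{"name": "water", "topology": {"atoms": '
    '{"names": ["O", "H", "H"], "atomic_numbers": [8, 1, 1]}}}'
)
paths = flatten(doc)
restored = unflatten(paths)
```

In this sketch each path such as `/topology/atoms/names` would correspond to one dataset in an HDF5 file, and `restored` equals the original `doc`.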
Note that this specification is intended to produce JSON files that are human-readable (and, to an extent, human-writable) ... especially by humans that have not read this specification.
Molecule structures will necessarily need to be self-referencing (e.g., a `Bond` object will need to reference two `Atom` objects). The specific method for doing so is one of the outstanding design decisions (see below).
A molecule has these fields:
- `name` (string): name of the molecule (no particular meaning)
- `type` (string): "Molecule"
- `provenance` (Provenance object): where this molecule came from
- `topology` (Topology object): specifies atomic data, bonds and biomolecular (or materials) hierarchy
- `states` (list of State objects): dynamical states with position, momentum, and calculated properties at each point
- `forcefield` (optional) (Forcefield object): forcefield specification
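As a rough illustration of the top-level layout, the sketch below builds a skeleton molecule and checks which fields are missing. Since the Provenance, Topology, and State objects are still TBD, they appear only as empty placeholders. Treating every field except `forcefield` as required is an assumption drawn from the "(optional)" marker, and `missing_fields` is a hypothetical helper.

```python
import json

# Assumption: everything not marked "(optional)" in the field list is required.
REQUIRED_FIELDS = ("name", "type", "provenance", "topology", "states")
OPTIONAL_FIELDS = ("forcefield",)

def missing_fields(molecule):
    """Return the required top-level fields that are absent from a molecule object."""
    return [field for field in REQUIRED_FIELDS if field not in molecule]

molecule = {
    "name": "water",
    "type": "Molecule",
    "provenance": {},  # placeholder: Provenance layout is TBD
    "topology": {},    # placeholder: Topology layout is TBD
    "states": [],      # placeholder: State layout is TBD
}

text = json.dumps(molecule, indent=2)  # the JSON document that would be written
problems = missing_fields(json.loads(text))
```

Here `problems` is an empty list; dropping any required key from `molecule` would make that key appear in it.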
TBD
TBD (see design decisions below)
TBD
TBD
All physical quantities must have associated units. The units are defined in TBD (design decision)
- Simple units such as "angstrom", "nm", "femtosecond", "kilogram", etc. may be written as a string.
- Compound units such as "angstrom/fs" or "kcal/mol" should be specified as TBD (design decision)
Note that JavaScript has only one numeric data type (all numbers are floating point).
- unitless scalars: `1.0` OR `{val:1.0, units:null}`
- unitless arrays: `[1,2,3.0,4]` OR `{val:[1,2,3.0,4], units:null}`
- scalar with units: `scalar = {val:2.0, units:'fs'}`
- array with units: `array = {val:[1,2,3.0], units:'angstrom'}`
- complex numbers: `{val: {real:0.0, imag:-1.0}, units: null}` (Complex numbers should always be written with units, even if they are `null`.)
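A reader has to accept all of these forms. The sketch below shows one way a parser might normalize them into a `(value, units)` pair; `read_quantity` is a hypothetical helper, not part of the specification. It also shows that, unlike JavaScript, Python's `json` module keeps integers and floats distinct, so a reader that needs floats should convert explicitly.

```python
import json

def read_quantity(obj):
    """Return (value, units) for a bare number, a bare array, or a {val, units} object."""
    if isinstance(obj, dict):
        val, units = obj["val"], obj.get("units")
        if isinstance(val, dict):  # complex number stored as {real, imag}
            val = complex(val["real"], val["imag"])
        return val, units
    return obj, None  # bare scalar or bare array is unitless

scalar = read_quantity(json.loads('{"val": 2.0, "units": "fs"}'))
array = read_quantity(json.loads('{"val": [1, 2, 3.0], "units": "angstrom"}'))
bare = read_quantity(json.loads("[1, 2, 3.0, 4]"))
cplx = read_quantity(json.loads('{"val": {"real": 0.0, "imag": -1.0}, "units": null}'))

# Python parses 1 as int and 3.0 as float; convert if uniform floats are needed.
as_floats = [float(x) for x in bare[0]]
```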
TBD (see design decisions below)
The format will require intramolecular references (for instance, a bond will need to reference 2 atoms) as well as external references (`Provenance` objects should be able to reference both external URLs and filesystem paths).
Format/data structure is TBD (see design decisions below)
Users should feel free to add metadata to these structures. However, a few notes of caution:
- Calculated quantities should go in "topology.properties" or "state.properties"; UNLESS they are so unambiguous as to be trivially calculable (such as atomic numbers).
- Method-dependent metadata should be prepended with a unique string to avoid namespace clashes. For instance, a state coefficient for a surface hopping method should be expressed as:
mdt_surface_hopping_coeffs = [{type:'complex', real:0.5, imag:-0.1}, ...]
Items here don't have a definite answer yet - there are multiple answers for each. Answers are currently ranked by AMV's capricious preferences.
JSON does not directly support object references. This makes it non-trivial to, say, maintain a list of bonds between atoms. Some solutions are:
- by array index (e.g., `residue.atom_indices=[3,4,5,6]`)
- by JSON path reference (see, e.g., https://tools.ietf.org/html/draft-pbryan-zyp-json-ref-03)
- by a unique key (e.g., `residue.id='a83nd83'`, `residue.atoms=['a9n3d9', '31di3']`)
Array indices are probably the best option: although they are a little fragile, they're no more fragile than path references, and they require far less overhead than unique keys.
See also: http://stackoverflow.com/q/4001474/1958900
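The following sketch illustrates the array-index option and the fragility mentioned above: deleting an atom shifts every later index, so all references must be renumbered. The `bonds` list and the per-bond `atom_indices` field are assumed names for illustration only, since the topology layout is still TBD.

```python
# Hypothetical topology using array-index references.
topology = {
    "atoms": [{"name": "O"}, {"name": "H1"}, {"name": "H2"}],
    "bonds": [{"atom_indices": [0, 1]}, {"atom_indices": [0, 2]}],
}

def bonded_names(topology):
    """Resolve each bond's index references to the names of its atoms."""
    atoms = topology["atoms"]
    return [
        tuple(atoms[i]["name"] for i in bond["atom_indices"])
        for bond in topology["bonds"]
    ]

def remove_atom(topology, index):
    """Delete one atom, drop its bonds, and renumber references to later atoms."""
    atoms = [atom for i, atom in enumerate(topology["atoms"]) if i != index]
    bonds = []
    for bond in topology["bonds"]:
        if index in bond["atom_indices"]:
            continue  # this bond lost an endpoint
        shifted = [i - 1 if i > index else i for i in bond["atom_indices"]]
        bonds.append({"atom_indices": shifted})
    return {"atoms": atoms, "bonds": bonds}

before = bonded_names(topology)
after = bonded_names(remove_atom(topology, 1))
```

Without the renumbering step in `remove_atom`, the surviving bond would still point at index 2, which no longer exists. A unique-key scheme would avoid the renumbering at the cost of storing and looking up a key per atom.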
- Publicly-available JSON file with supported units and conversions
- Standardize to some externally-chosen database or web service
For instance, velocity might be "angstrom/fs". Alternatives:
- Require units in the form `{unit_name:exponent}`, e.g. `atom.velocity.units={'angstrom':1, 'fs':-1}`
- Allow strings of the form `atom.velocity.units="angstrom/fs"`, but require that units be chosen from a specific list of specifications
- Allow strings of the form `atom.velocity.units="angstrom/fs"`, and require file parsers to parse the units according to a specified syntax
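To give a sense of what the last alternative would demand of parsers, here is a minimal sketch that converts a unit string into the `{unit_name:exponent}` form of the first alternative. The grammar is an assumption, since no syntax has been specified: factors are joined by `*` or `/`, an optional integer exponent follows `^`, and parentheses are not supported. `parse_units` is a hypothetical helper.

```python
import re

# Assumed grammar: name[^int] factors joined by '*' or '/'; no parentheses.
_FACTOR = re.compile(r"([*/]?)\s*([A-Za-z_]+)(?:\^(-?\d+))?")

def parse_units(text):
    """Parse e.g. 'angstrom*amu*fs^-1' into a {unit_name: exponent} mapping."""
    exponents = {}
    for operator, name, power in _FACTOR.findall(text):
        exponent = int(power) if power else 1
        if operator == "/":
            exponent = -exponent  # a factor after '/' is in the denominator
        exponents[name] = exponents.get(name, 0) + exponent
    # Drop units that cancel out entirely.
    return {name: exp for name, exp in exponents.items() if exp != 0}

velocity = parse_units("angstrom/fs")
momentum = parse_units("angstrom*amu*fs^-1")
energy = parse_units("kcal/mol")
```

Under this assumed grammar, "angstrom/fs" and "angstrom*fs^-1" yield the same mapping, so the dictionary form doubles as a canonical representation for comparing units.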
Possible answers:
- multiple topology objects (explicit but storage-intensive)
- states can store topology "patches" (saves memory, but confusing and hard to implement)
- Single global topology with all possible states (i.e., all possible bonds, all possible residues), states can include flags to turn elements on/off (NP-complete in some cases)
- As a table of values (these values do very well in HDF5 as well)
- As a set of arrays
- As a list of objects
Examples:
// 1) Storing fields as tables: creates an mmCIF/PDB-like layout
{atoms: {type: 'table[atom]',
  fields: ['name', 'atomic_number', 'mass/Dalton', 'residue_index', 'position/angstrom', 'momentum/angstrom*amu*fs^-1'],
  entries: [
    ['CA', 6, 12.0, 0, [0.214,12.124,1.12], [0,0,0]],
    ['N', 7, 14.20, 0, [0.214,12.124,1.12], [0,0,0]],
    ...]}}
// 2) Storing fields as arrays: much more compact, but harder to read and edit
{num_atoms: 1234,
 atoms: {names: ['CA','CB','OP', ...],
   atomic_numbers: [6,6,8, ...],
   masses: {val: [12.0, 12.0, 16.12, ...], units: 'amu'},
   residue_indices: [0,0,0,1,1, ...],
   positions: {val: [[0.214,12.124,1.12], [0.214,12.124,1.12], ...], units: 'angstrom'},
   momenta: {val: [[0,0,0], [1,2,3], ...], units: 'angstrom*amu*fs^-1'}
 }
}
// 3) Storing the field names for each atom: readable, but makes the file huge
{atoms: [
  {name: 'CA', atnum: 6, residue_index: 0,
   mass: {val: 12.00, units: 'dalton'},
   position: {val: [0.214,12.124,1.12], units: 'angstrom'},
   momentum: {val: [0.0, 0.0, 0.0], units: 'angstrom*dalton*fs^-1'}
  },
  {name: 'N', atnum: 7, residue_index: 0,
   mass: {val: 14.20, units: 'dalton'},
   position: {val: [0.214,12.124,1.12], units: 'angstrom'},
   momentum: {val: [0.0, 0.0, 0.0], units: 'angstrom*dalton*fs^-1'}
  },
  ...
 ]
}
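The three layouts carry the same information, so a reader can convert between them mechanically. As a sketch, the function below expands the table layout (1) into the list-of-objects layout (3). It assumes that a field name of the form "name/units" carries its units after the first "/", as in the table example, and `table_to_records` is a hypothetical helper.

```python
def table_to_records(table):
    """Expand a {fields, entries} table into one object per atom."""
    columns = []
    for field in table["fields"]:
        # Assumption: text after the first '/' is the unit string for that column.
        name, _, units = field.partition("/")
        columns.append((name, units or None))

    records = []
    for entry in table["entries"]:
        record = {}
        for (name, units), value in zip(columns, entry):
            # Columns with units become {val, units}; others stay bare values.
            record[name] = {"val": value, "units": units} if units else value
        records.append(record)
    return records

atoms_table = {
    "type": "table[atom]",
    "fields": ["name", "atomic_number", "mass/Dalton", "residue_index",
               "position/angstrom", "momentum/angstrom*amu*fs^-1"],
    "entries": [
        ["CA", 6, 12.0, 0, [0.214, 12.124, 1.12], [0, 0, 0]],
        ["N", 7, 14.20, 0, [0.214, 12.124, 1.12], [0, 0, 0]],
    ],
}
records = table_to_records(atoms_table)
```

The table form states each field name and unit once, while the expanded records repeat them for every atom, which is the size trade-off noted in the comments on layouts 1 and 3.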