Evaluate alternative formats #13

Open
avirshup opened this issue Sep 24, 2016 · 6 comments

avirshup commented Sep 24, 2016

Speaking personally, the best outcome of this would be to find that someone has already solved the problems we're thinking about, or at least has a solution that can be extended to cover this project's specific application focuses (#1 and #10).

Below is a continuously updated list of other projects. Everything here should be taken with a grain of salt, as it's an attempt to glean information from many different specifications :)

MOSAIC
Formats: XML, HDF5 (with straightforward extensions based on the data model)
License: Creative Commons 3.0
Units: List of supported units in spec
Design criteria: https://mosaic-data-model.github.io/design_criteria.html
Data stored: topology, CG info, selections (i.e., subsets of the file's data), references to other "universes"; "properties" (unclear if these are whole-system properties or atomic properties?)
Specification: https://mosaic-data-model.github.io/

Rich molecule format
todo

H5MD
Type: Binary (HDF5)
Self-describing: yes
Domain: molecular dynamics
Flexible units: yes
Human readable: not without HDF5 viewer
License: GPL (need to understand copyleft implications here)
Data stored: MD state data; atoms and their connectivity. Arbitrary atom lists/groups can be defined (nothing specific for chains/residues/etc)
Specification: http://nongnu.org/h5md/

Amber NetCDF
Type: Binary (NetCDF)
Self-describing: Yes
Human readable: not directly (GUI viewers available)
Flexible units: yes
Domain: biomolecular dynamics
Data stored: Trajectory (no topology)
Specification: http://ambermd.org/netcdf/nctraj.xhtml

MDTraj HDF5
Type: HDF5
Self-describing: Yes
Human readable: no
Data stored: Trajectory (+ topology as a JSON string)
Flexible units: yes
Domain: biomolecular dynamics
Specification: https://github.com/mdtraj/mdtraj/wiki/HDF5-Trajectory-Format
Notes: This is an extension of Amber NetCDF format. FF-focused topology storage (as JSON)

Chemical Markup Language (CML)
Type: XML
Human readable: yes
Self-describing: sort of - must adhere to a schema
Specification: http://www.xml-cml.org/
Flexible units: yes
Domain: small molecule modeling
Data stored: coordinates, molecular properties, topology w/ stereochemistry, calculation parameters, electronic wavefunctions, computational metadata (i.e. hostname, programVersion, etc.). No support for biomolecules or trajectories.
Notes: I like this project's aims, but there's a LOT of conceptual overhead for understanding XML schema. I don't think I've ever used software that supports CML.

PDBx/mmCIF
Type: CIF (text-based, see http://www.iucr.org/resources/cif/spec/version1.1/cifsyntax)
Self-describing: yes
Flexible units: no
Domain: Crystallography / NMR
Specification: http://mmcif.wwpdb.org/docs/tutorials/content/atomic-description.html
http://mmcif.wwpdb.org/docs/tutorials/content/molecular-entities.html
Notes: Vast improvement over original PDB. Medium-to-high conceptual overhead. Parsers are still hard to come by.
Data stored: everything you'd expect in a PDB file: topology + coordinates + crystallographic metadata.

Chemical JSON
Type: JSON (text-based)
Self-describing: yes
Notes: I think this is more of a proof of principle (implemented in Avogadro) than a mature spec, but interesting nonetheless. JSON is by far the easiest format here to read and write, both by machine and by hand.
Specification: http://wiki.openchemistry.org/Chemical_JSON
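
To illustrate the read/write point, here is a minimal Python sketch that round-trips a small molecule record through JSON using only the standard library. The key names (`name`, `atoms`, `elements`, `coords`, `bonds`) and the approximate water coordinates are hypothetical choices for this example and have not been checked against the Chemical JSON specification.

```python
import json

# Hypothetical record: the keys and values below are illustrative only and
# have not been checked against the Chemical JSON specification.
water = {
    "name": "water",
    "atoms": {
        "elements": ["O", "H", "H"],
        # Approximate, illustrative coordinates (assumed to be in angstroms).
        "coords": [
            [0.000, 0.000, 0.117],
            [0.000, 0.757, -0.469],
            [0.000, -0.757, -0.469],
        ],
    },
    # Bonds as pairs of zero-based atom indices.
    "bonds": [[0, 1], [0, 2]],
}

# Writing: one call produces text that is also easy to edit by hand.
text = json.dumps(water, indent=2)

# Reading: one call restores plain dicts and lists.
restored = json.loads(text)
```

No schema or custom parser is involved here, which is the convenience the note refers to; the cost is that nothing validates the structure unless a schema is added separately.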

khinsen commented Sep 26, 2016

@avirshup Two comments on your summary of Mosaic:

  1. Mosaic properties are per-atom or per-site properties. Think masses, charges, force-field atom types (perhaps better served by Mosaic labels), etc.

  2. The list of units in the Mosaic spec can easily be extended if necessary. The point of having it is not to limit the list of units, but to ensure a unique spelling for each one.

That said, my experience is that units in a file format are a mixed blessing. The more flexibility a format provides for units, the easier it is to write data but at the same time it becomes harder to read data, because the reader must know all the units with their conversion factors and apply them properly. If I were to design a closed file format (i.e. fully specified without any possibility for extensions), I'd prescribe a single unit for everything. Google for "convention over configuration" for discussions of the advantages of such an approach. Mosaic being open-ended (properties, for example, can be anything and thus have unforeseeable units), that was not an option.
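
The trade-off can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the unit table, the spellings in it, the record layout, and the choice of nanometers as the internal length unit are all hypothetical and not taken from Mosaic or any other spec.

```python
# Hypothetical unit table: the spellings and the choice of nanometers as the
# internal length unit are assumptions for illustration only.
TO_NM = {
    "nm": 1.0,
    "angstrom": 0.1,
    "pm": 0.001,
}


def read_flexible(record):
    """Reader for a flexible-units format: look up and apply a conversion factor."""
    unit = record["unit"]
    if unit not in TO_NM:
        # An unknown or differently spelled unit cannot be converted.
        raise ValueError(f"unknown length unit: {unit!r}")
    factor = TO_NM[unit]
    return [x * factor for x in record["values"]]


def read_fixed(record):
    """Reader for a single-unit format: values are nanometers by convention."""
    return list(record["values"])


flexible = {"unit": "angstrom", "values": [10.0, 25.0]}
fixed = {"values": [1.0, 2.5]}

coords_a = read_flexible(flexible)
coords_b = read_fixed(fixed)
```

The flexible reader has to carry the conversion table and fails on a spelling it does not know (for example `"Angstroms"`), which is why a unique spelling per unit matters. The fixed-unit reader needs neither the table nor the error path.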

khinsen commented Sep 26, 2016

Also worth looking at is the CCPN data model, developed for describing NMR data on biomolecular systems. Like the Rich Molecule Format, it is designed to be used as a software API rather than as a file format, but once you get used to the concept of a data model, that becomes an implementation detail.

khinsen commented Sep 26, 2016

An important evaluation criterion missing from the above summary is openness for extensions. As an illustration for the utility of openness, Mosaic and H5MD were designed completely independently, but both were made open for extensions. Gluing the two together was almost trivial, as the very short spec of the interface demonstrates.

CML is as open as any XML-based format, meaning that you can embed it into a higher-level format, or define a derived schema which is no longer CML but shares features with it. CIF would be easy to make open from a technical point of view, but wwPDB retains full control over its evolution (which is probably a good thing in its specific environment). All the other formats are not open as far as I know, though I didn't look at each of them in much detail.
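
One generic way to picture openness for extensions is a reader that handles the fields it knows and carries everything else through untouched. The Python sketch below is an assumption-laden illustration: the `atoms`/`coords` core, the namespaced `org.example.trajectory` key, and the helper names are hypothetical and do not describe how Mosaic, H5MD, or CML actually implement extensions.

```python
import json

# Hypothetical layout: the core keys and the namespaced extension key are
# assumptions for illustration, not part of Mosaic, H5MD, or CML.
CORE_KEYS = {"atoms", "coords"}


def load_core(text):
    """Split a document into known core fields and untouched extension data."""
    doc = json.loads(text)
    core = {k: v for k, v in doc.items() if k in CORE_KEYS}
    extras = {k: v for k, v in doc.items() if k not in CORE_KEYS}
    return core, extras


def dump(core, extras):
    """Write core fields back together with extensions the reader did not interpret."""
    return json.dumps({**core, **extras}, sort_keys=True)


source = json.dumps(
    {
        "atoms": ["C", "O"],
        "coords": [[0.0, 0.0, 0.0], [0.0, 0.0, 1.13]],
        # Data owned by some other tool, stored under its own namespace.
        "org.example.trajectory": {"frames": 2},
    },
    sort_keys=True,
)

core, extras = load_core(source)
round_tripped = dump(core, extras)
```

Because the reader preserves keys it does not understand, a second format can attach its own data without either side changing its core definition.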

arose commented Nov 2, 2016

I would like to add a format we recently put into production on http://www.rcsb.org, see http://mmtf.rcsb.org/

MMTF (macromolecular transmission format)
Type: MessagePack/JSON
Self-describing:
Human readable: no
Data stored: coordinates, topology, some metadata
Flexible units: no
Domain: efficient transmission and parsing of macromolecules
Specification: https://github.com/rcsb/mmtf/blob/master/spec.md

arose commented Nov 2, 2016

related rdkit discussion: rdkit/rdkit#1137

avirshup commented Nov 2, 2016

Thanks @arose! Also mentioned in the rdkit discussion you linked to is http://stuchalk.github.io/scidata/ , which seems extremely relevant, particularly http://stuchalk.github.io/scidata/contexts/scidata_molsystem.jsonld

3 participants