Use cases #10

avirshup · 2016-09-17T21:30:46Z

It would be good to have some fleshed out use cases to understand what this file format will (and won't!) support. Here's a brainstormed list that probably reflects my biases:

Biomolecular

A protein/ligand structure from PDB (e.g., 3AID)
A protein/ligand structure from a docking calculation (e.g., 3AID with AutoDock)
5 ns of NPT lipid bilayer dynamics in a fully solvated periodic system
A series of snapshots from a QM/MM minimization of an activated enzyme complex

Late, embarrassing edit:
Also, um, an alchemical FEP calculation

Molecular

Chemoinformatic descriptors calculated for a small molecule
A series of PES points (minima and transition states) along a small molecule reaction path, including QM energetics and vibrational modes
A single molecular geometry with QM wavefunctions from several theories (e.g., PM6, B3LYP, MP2, and CCSD)

davidlmobley · 2016-09-23T20:05:17Z

@avirshup -- I want to see your bullet points and raise them (generalize them) a bit:

A protein/ligand structure from PDB (e.g., 3AID)

A protein/ligand structure from whatever origin, such as found in PDB files (i.e. including refined structures which have not yet been deposited in the PDB, to be clear).

Note that we would presumably NOT support electron density files from refinement in this format, I would imagine.

A protein/ligand structure from a docking calculation (e.g., 3AID with AutoDock)

Generalizing -- 3AID with some compound (let's say, nevirapine -- not that one would want to put that into 3AID, but...) docked into it via AutoDock after removal of the crystallographic ligand

For that matter, if you are doing docking, what if you want to represent many ligands docked to the same structure without wasting space by providing the structure of the same protein for each? Currently what people usually do is they provide their protein structure, and separately provide a structure of each ligand pose using the same frame of reference. In my view this is somewhat bad because it's easy to (a) lose track of one of the files in which case the other becomes useless, or (b) accidentally change the frame of reference of one of the files.

I'd suggest you might want to generalize this by also considering representing:

A single protein structure with multiple ligands docked into it separately (i.e. "we docked this library of ligands to this structure and here is the reference structure plus one pose for each ligand")
A single protein structure with multiple conformations of the same ligand docked to it (i.e. "we docked this ligand to that structure and here are all the poses we selected as likely")
A single protein structure with N ligands docked to it, each with up to M poses (i.e. "we docked this library of ligands to this structure and here is the reference structure plus up to M poses for each docked ligand")
OK, if we really generalize, you might as well consider the "multiple ligand binding" or "cofactor binding" cases as well -- i.e. what if I have a protein which binds a cofactor such as flavin, but I don't know where it binds, AND it also binds (after flavin) some series of ligands whose binding mode I don't know. So, I dock flavin first, and then I dock my ligand library one at a time. So now I have to represent two ligands bound at once, one which is present across all docked structures (flavin) and another which is a pose from a library of compounds.
This also should be generalized to handle binding of multiple copies of the same molecule simultaneously -- i.e. for some HIV integrase ligands, for example, three inhibitors bind in different sites (which are not symmetry equivalent) simultaneously
But, why restrict this to just a single protein structure? What if I've generated a bunch of protein structures of the same protein to dock to from an MSM? It seems like in the most general case (for docking) I'll have L protein structures, and to each of them I might dock up to M different molecules simultaneously (i.e. flavin plus another); I might have N(m, l) poses for each molecule (where N is a function of which molecule I'm looking at, m, and possibly a function of the structure I'm working on, l); and one or more of the M might be chosen from a library of up to O other molecules.
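To make the most general docking case above a bit more concrete, here is a rough Python sketch of one way it could be laid out. Every name in it (`Receptor`, `Ligand`, `PlacedLigand`, `Pose`, `DockingSet`) is made up for illustration and is not part of any proposed spec. The idea is only that each protein structure is stored once and poses refer to it by ID, so the reference structure and the ligand poses cannot drift apart.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Coords = List[Tuple[float, float, float]]  # one (x, y, z) per atom


@dataclass
class Receptor:
    receptor_id: str
    coords: Coords  # stored once, however many poses refer to it


@dataclass
class Ligand:
    ligand_id: str
    name: str


@dataclass
class PlacedLigand:
    ligand_id: str
    coords: Coords  # expressed in the receptor's frame of reference


@dataclass
class Pose:
    receptor_id: str
    placed: List[PlacedLigand]  # one or more molecules bound at the same time


@dataclass
class DockingSet:
    receptors: Dict[str, Receptor] = field(default_factory=dict)  # the L structures
    ligands: Dict[str, Ligand] = field(default_factory=dict)      # the library
    poses: List[Pose] = field(default_factory=list)

    def add_pose(self, pose: Pose) -> None:
        # Refuse poses that point at a receptor or ligand this set does not
        # hold, so a pose is never separated from its reference structure.
        if pose.receptor_id not in self.receptors:
            raise KeyError(f"unknown receptor: {pose.receptor_id}")
        for item in pose.placed:
            if item.ligand_id not in self.ligands:
                raise KeyError(f"unknown ligand: {item.ligand_id}")
        self.poses.append(pose)

    def poses_for(self, receptor_id: str, ligand_id: str) -> List[Pose]:
        # All poses on one receptor that include the given ligand; the
        # length of this list plays the role of N(m, l).
        return [
            pose
            for pose in self.poses
            if pose.receptor_id == receptor_id
            and any(item.ligand_id == ligand_id for item in pose.placed)
        ]
```

The flavin-plus-library case is then a `Pose` whose `placed` list has two entries, and several copies of the same inhibitor bound in different sites is a `Pose` whose entries repeat one `ligand_id` with different coordinates.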

I suppose a question on the above is whether sometimes, the information might be better contained in multiple files -- i.e. if you've docked a library of ligands to a particular protein, might it be better to have each ligand in a separate file? But then that throws you back in the direction of splitting your results across files and having to pull things from multiple files to be able to visualize/process/further work with the results. Maybe that's OK -- I'm just pointing it out.

Most existing file formats tend to handle most of these things in cumbersome ways at best -- i.e. in a PDB, there is no way to represent that THIS thing (i.e. a static protein structure) should be held constant across all the models provided (i.e. if you provide a variety of different ligands and one protein), leading to the issue I mentioned above where people split things across multiple files.
.mol2 and .sdf files have the ability to contain multiple conformers of a single molecule as well as multiple molecules, and some processors (like OpenEye's) can manage to handle this. But these are geared towards small molecules and don't really represent protein structures in a helpful way. They also don't handle the case where I might want to represent combinations of things (two molecules docked to the same protein at the same time).

Some of the use cases above have some overlap with what I might want to represent from MD (or MC!) simulations -- i.e. if I have a way of representing multiple structures of the same protein because I docked the same ligand into multiple structures, that probably generalizes fairly well to representing simulation data.

5 ns of NPT lipid bilayer dynamics in a fully solvated periodic system

Sure. Also, though, what if it's MC? Or hybrid MC MD? Or, simulated tempering, or generalized ensemble (lambda varies with time)? Or if I have a set of replica exchange or lambda exchange simulations where I have multiple trajectories that are sort of linked together in a conceptual sense?

What if it's non-periodic? Partially periodic? Contains an interface to gas phase? A droplet? A crystal? (Simulating the crystal lattice of a protein in a crystal structure?)

(Not trying to say we necessarily want to accommodate everything -- I'm just trying to think of cases that I care about presently that you might not be thinking of.)

A series of snapshots from a QM/MM minimization of an activated enzyme complex

Sure.

Molecular

Chemoinformatic descriptors calculated for a small molecule

A series of PES points (minima and transition states) along a small molecule reaction path, including QM energetics and vibrational modes

A single molecular geometry with QM wavefunctions from several theories (e.g., PM6, B3LYP, MP2, and CCSD)

Those all sound reasonable. What about:

Electrostatic potentials for small molecules (i.e. from QM)
Many representations of molecules (SMILES, IUPAC name, 2D structures (i.e. a MOL2 without coordinates is an example of this), INCHI key, ...)
Small molecule crystal structures (i.e. in the crystal)
retaining other data about molecules when available (see comments elsewhere on partial bond order, for example; partial charge is of course relevant too and already commonly used...)

More complex:

Polymers (i.e. generalized proteins)
Other molecules which are composed of repeating subunits (i.e. some hosts as in host-guest binding)

avirshup · 2016-09-24T22:50:29Z

@davidlmobley - good stuff!

A few immediate responses:

Agree that electron density for crystallographic refinement is not a core use case
Almost everything else you mention sounds basically good to me, and it's all extremely valuable to see.
For determining whether a use case is in scope: One idea is that, at least for an initial spec, the scope should be limited to atomistic 2D and real-space 3D simulation. So it would need to store information for chemoinformatics, docking, QM, MM, MD, and MC. But it would not directly support QSAR models or CG, definitely not crystalline semiconductors, and probably not even repeating polymer subunits, unless you have explicit 3D coordinates for all of them. (This is completely up for discussion of course.)
In general, dealing with objects will come up in many contexts (especially the proposed Provenance object), so it will need to be addressed somehow. My only real thought on that is that there should be an ability to create a Link object that can specify a URL or relative file path - there are some standards for this, like in the google style guide. The problem then becomes making sure the links don't break! The Materials Project handles a similar problem by making unique DOIs available for each structure.
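As a very rough sketch of what such a Link object might look like (the class name `Link`, its `target` field, and the `resolve` method are all placeholders I am assuming for illustration, nothing here is settled):

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from urllib.parse import urlparse


@dataclass(frozen=True)
class Link:
    target: str  # either an http(s) URL or a path relative to the file

    def is_url(self) -> bool:
        return urlparse(self.target).scheme in ("http", "https")

    def resolve(self, base_dir: str) -> str:
        # URLs are returned unchanged; relative paths are joined onto the
        # directory that contains the file holding the link.
        if self.is_url():
            return self.target
        path = PurePosixPath(self.target)
        if path.is_absolute():
            raise ValueError("only URLs or relative paths are allowed")
        return str(PurePosixPath(base_dir) / path)
```

Rejecting absolute paths is one possible design choice: it keeps a file and its linked neighbours movable as a unit, though it does nothing to stop a URL from going stale.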

avirshup · 2016-09-24T22:56:57Z

A few more boundary-stretching use cases:

Umbrella sampling on a small molecule in multiple solid-state crystal forms
structure of a carbon nanotube wrapped by an organic polymer
an electron-transfer calculation on a dendritic organic light-emitting-diode structure

khinsen · 2016-09-26T07:51:18Z

Defining the scope is definitely an important step. But I am not sure I understand what scope you have in mind. You say "atomistic 2D and real-space 3D simulation". But then you mention cheminformatics (which isn't specifically about simulation) and QM (which I wouldn't call atomistic but "electron-level"). Moreover, I don't really understand "2D simulation" in this context.

My view of this application space consists of three domains, at the highest level:

The computational domain: simulations ranging from electronic state to coarse-grained point-mass models, including of course the atomic scale in the middle.
The experimental domain, covering a specific experiment (crystallography, NMR, EM, ...) and its results.
The cheminformatics domain: relations between experiments, simulations, and their wider scientific context

Covering all of this in one file format is a very ambitious project. That doesn't mean it's not worth attempting, but I'd want to have experience with good file formats for each individual domain before trying to do a synthesis.

My intention with Mosaic is to deal with the computational domain exclusively, but provide interface points to other formats dealing with the other two domains. In particular, each data item in a Mosaic record has a unique ID, so if you have a unique ID for the whole record (such as a DOI from Zenodo), you can refer to Mosaic data items unambiguously from the outside.
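To illustrate the two-level addressing idea only (this is not Mosaic's actual API; `ItemRef`, `resolve`, and the identifiers in the usage example are hypothetical), an outside reference can be thought of as a pair of a record ID and an item ID:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ItemRef:
    record_id: str  # identifies the whole record, e.g. a DOI
    item_id: str    # identifies one data item inside that record


def resolve(ref: ItemRef, records: Dict[str, Dict[str, object]]) -> object:
    # Look up the record first, then the item within it; a KeyError
    # signals a dangling reference at either level.
    return records[ref.record_id][ref.item_id]


# Usage with made-up identifiers: a placeholder DOI and a placeholder item name.
records = {"10.5281/zenodo.0000000": {"universe-1": "water box topology"}}
ref = ItemRef("10.5281/zenodo.0000000", "universe-1")
item = resolve(ref, records)
```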

The distinction between domains 1 and 2 needs some explanation, because it has become blurred, in particular in structural biology. Most people think of a PDB file as belonging to the experimental domain, although it contains more computational than experimental results. For me, the experimental domain covers everything that happens outside of computers, and the computational domain everything inside. For protein crystallography, the experimental domain ends with the clichés from a detector.

In my long-term vision for chemical information processing, Mosaic will grow to cover the whole computational domain. In parallel, a similarly extensible model should deal with the experimental domain. Cheminformatics data files could refer to data items in these two formats through unique IDs. Experimental provenance would be handled in the same way, with Mosaic records of, say, X-ray clichés containing the unique ID of the data items that describe the experiment.

This was referenced Sep 24, 2016

Evaluate alternative formats #13

Open

Write a justification for why we need a format? #12

Open
