Use cases #10
Comments
@avirshup -- I want to see your bullet points and raise them (generalize them) a bit:
A protein/ligand structure from whatever origin, such as found in PDB files (i.e. including refined structures which have not yet been deposited in the PDB, to be clear). Note that we would presumably NOT support electron density files from refinement in this format, I would imagine.
Generalizing -- 3AID with some compound (let's say, nevirapine -- not that one would want to put that into 3AID, but...) docked into it via AutoDock after removal of the crystallographic ligand.

For that matter, if you are doing docking, what if you want to represent many ligands docked to the same structure without wasting space by repeating the structure of the same protein for each? Currently, people usually provide their protein structure and, separately, a structure of each ligand pose in the same frame of reference. In my view this is somewhat bad because it's easy to (a) lose track of one of the files, in which case the other becomes useless, or (b) accidentally change the frame of reference of one of the files.

I'd suggest you might want to generalize this by also considering representing:
I suppose a question on the above is whether sometimes, the information might be better contained in multiple files -- i.e. if you've docked a library of ligands to a particular protein, might it be better to have each ligand in a separate file? But then that throws you back in the direction of splitting your results across files and having to pull things from multiple files to be able to visualize/process/further work with the results. Maybe that's OK -- I'm just pointing it out.

Most existing file formats tend to handle most of these things in cumbersome ways at best -- i.e. in a PDB, there is no way to represent that THIS thing (i.e. a static protein structure) should be held constant across all the models provided (i.e. if you provide a variety of different ligands and one protein), leading to the issue I mentioned above where people split things across multiple files.

Some of the use cases above have some overlap with what I might want to represent from MD (or MC!) simulations -- i.e. if I have a way of representing multiple structures of the same protein because I docked the same ligand into multiple structures, that probably generalizes fairly well to representing simulation data.
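To make the "one protein, many ligand poses" case concrete, here is a minimal sketch of the kind of in-memory layout it implies: the receptor is stored once and every pose refers to it, so all poses share one frame of reference by construction. The names `Structure` and `DockingResult`, the pose labels, and the coordinates are hypothetical placeholders for illustration only, not part of any proposed format.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

Coord = Tuple[float, float, float]


@dataclass(frozen=True)
class Structure:
    """Atom names plus coordinates, all in one frame of reference."""
    atom_names: Tuple[str, ...]
    coords: Tuple[Coord, ...]

    def __post_init__(self) -> None:
        if len(self.atom_names) != len(self.coords):
            raise ValueError("one coordinate triple is needed per atom")


@dataclass
class DockingResult:
    """Hypothetical container: one receptor stored once, many ligand poses."""
    receptor: Structure
    poses: List[Tuple[str, Structure]] = field(default_factory=list)

    def add_pose(self, label: str, pose: Structure) -> None:
        self.poses.append((label, pose))

    def complexes(self) -> Iterator[Tuple[str, Structure, Structure]]:
        """Yield (label, receptor, pose); the receptor is never copied."""
        for label, pose in self.poses:
            yield label, self.receptor, pose


# Placeholder coordinates, not taken from any real structure.
receptor = Structure(("N", "CA", "C"),
                     ((0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0)))
result = DockingResult(receptor)
result.add_pose("pose_1", Structure(("C1", "O1"), ((5.0, 5.0, 5.0), (6.2, 5.0, 5.0))))
result.add_pose("pose_2", Structure(("C1", "O1"), ((4.0, 6.0, 5.0), (4.0, 7.2, 5.0))))

for label, rec, pose in result.complexes():
    print(label, len(rec.atom_names), "receptor atoms,", len(pose.atom_names), "ligand atoms")
```

Because everything lives in one object, neither of the two failure modes above (losing a file, or shifting one file's frame of reference) can happen silently. Whether a file format should mirror this layout or split it across files is exactly the open question.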
Sure. Also, though, what if it's MC? Or hybrid MC MD? Or, simulated tempering, or generalized ensemble (lambda varies with time)? Or if I have a set of replica exchange or lambda exchange simulations where I have multiple trajectories that are sort of linked together in a conceptual sense? What if it's non-periodic? Partially periodic? Contains an interface to gas phase? A droplet? A crystal? (Simulating the crystal lattice of a protein in a crystal structure?) (Not trying to say we necessarily want to accommodate everything -- I'm just trying to think of cases that I care about presently that you might not be thinking of.)
Sure.
Those all sound reasonable. What about:
More complex:
@davidlmobley - good stuff! A few immediate responses:
A few more boundary-stretching use cases:
Defining the scope is definitely an important step. But I am not sure I understand what scope you have in mind. You say "atomistic 2D and real-space 3D simulation". But then you mention cheminformatics (which isn't specifically about simulation) and QM (which I wouldn't call atomistic but "electron-level"). Moreover, I don't really understand "2D simulation" in this context. My view of this application space consists of three domains, at the highest level:
Covering all of this in one file format is a very ambitious project. That doesn't mean it's not worth attempting, but I'd want to have experience with good file formats for each individual domain before trying to do a synthesis. My intention with Mosaic is to deal with the computational domain exclusively, but provide interface points to other formats dealing with the other two domains. In particular, each data item in a Mosaic record has a unique ID, so if you have a unique ID for the whole record (such as a DOI from Zenodo), you can refer to Mosaic data items unambiguously from the outside.

The distinction between domains 1 and 2 needs some explanation, because it has become blurred, in particular in structural biology. Most people think of a PDB file as belonging to the experimental domain, although it contains more computational than experimental results. For me, the experimental domain covers everything that happens outside of computers, and the computational domain everything inside. For protein crystallography, the experimental domain ends with the diffraction images from a detector.

In my long-term vision for chemical information processing, Mosaic will grow to cover the whole computational domain. In parallel, a similarly extensible model should deal with the experimental domain. Cheminformatics data files could refer to data items in these two formats through unique IDs. Experimental provenance would be handled in the same way, with Mosaic records of, say, X-ray diffraction images containing the unique ID of the data items that describe the experiment.
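As a rough illustration of the ID-based cross-referencing idea, here is a minimal sketch in which an external reference is a pair of a record ID and an item ID within that record. This is not Mosaic's actual API: the `ItemRef` class, the `#` separator, the `resolve` helper, and the example DOI and item names are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class ItemRef:
    """Hypothetical reference: ID of a whole record plus ID of one item in it."""
    record_id: str  # e.g. a DOI assigned to the whole record
    item_id: str    # unique ID of the data item inside that record

    def __str__(self) -> str:
        # The "#" separator is an assumption for this sketch.
        return f"{self.record_id}#{self.item_id}"

    @classmethod
    def parse(cls, text: str) -> "ItemRef":
        record_id, sep, item_id = text.rpartition("#")
        if not sep or not record_id or not item_id:
            raise ValueError(f"not a valid item reference: {text!r}")
        return cls(record_id, item_id)


def resolve(ref: ItemRef, records: Dict[str, Dict[str, Any]]) -> Any:
    """Look up the referenced item in an in-memory collection of records."""
    return records[ref.record_id][ref.item_id]


# Placeholder record: the DOI and item IDs below are invented.
records = {
    "doi:10.0000/example-record": {
        "universe-1": {"kind": "universe", "n_atoms": 3},
        "config-1": {"kind": "configuration", "universe": "universe-1"},
    }
}

ref = ItemRef("doi:10.0000/example-record", "universe-1")
print(str(ref))
print(resolve(ref, records))
```

A cheminformatics or experimental-domain file would then only need to store the string form of such a reference, leaving the computational data itself in its own record.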
It would be good to have some fleshed out use cases to understand what this file format will (and won't!) support. Here's a brainstormed list that probably reflects my biases:
Biomolecular
Late, embarrassing edit:
Also, um, an alchemical FEP calculation
Molecular