add new data set
MDAnalysisData does not store files and trajectories. Instead, it provides accessor code to seamlessly download (and cache) files from archives.
When you contribute data, you have to do two things:
- Deposit the data in an archive under an Open Data compatible license (CC0 or CC-BY preferred).
  We currently have code to work with figshare, but it should be straightforward to add code to work with other archive-grade repositories such as zenodo or DataDryad.
- Write accessor code in MDAnalysisData.
  The accessor code needs the stable archive URL(s) for your files and SHA256 checksums to check the integrity of any downloaded files. You will also add a description of your data set.
Adding a new data set requires:
- Put the data on figshare (or another archive-grade repository such as zenodo or DataDryad; some universities also provide digital repositories that are suitable). The site must provide stable download links and must not change the content during download because we store a SHA256 checksum. Make sure to choose an Open Data compatible license (CC0 or CC-BY preferred).
- Add a Python module such as MDAnalysisData/adk_equilibrium.py; in many cases you can copy the module and adapt
  - text
  - NAME: name of the data set; will be used as a file name so do not use spaces etc.
  - DESCRIPTION: filename of the description file (restructured text format, so has suffix .rst)
  - ARCHIVE: dictionary containing RemoteFileMetadata instances. Keys should describe the file type. Typically
    - topology: topology file (PSF, TPR, ...)
    - trajectory: trajectory coordinate file (DCD, XTC, ...)
    - structure (optional): system with single frame of coordinates (typically PDB, GRO, CRD, ...)
  - name of the fetch_NAME function
  - docs of the fetch_NAME function
- Add a description file such as MDAnalysisData/descr/adk_equilibrium.rst; copy this file and adapt. Make sure to add license information.
- Import your fetch_NAME function in MDAnalysisData/datasets.py.
- Add docs in restructured text format under docs/ (take existing files as examples).
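The module-level names described in the list above can be sketched as follows. This is only an illustration, not the content of adk_equilibrium.py: the stand-in RemoteFileMetadata tuple (with fields filename, url, checksum, mirroring scikit-learn's tuple of the same name), the data set name, the URLs, and the checksum strings are all placeholders. In a real module you would use the RemoteFileMetadata shipped with MDAnalysisData and the values from your own deposit.

```python
from collections import namedtuple

# Stand-in for illustration only (assumption): the real class ships with
# MDAnalysisData; the field names mirror scikit-learn's RemoteFileMetadata.
RemoteFileMetadata = namedtuple("RemoteFileMetadata",
                                ["filename", "url", "checksum"])

# Name of the data set; used as a file name, so no spaces.
NAME = "my_dataset"  # hypothetical

# File name of the restructured text description file.
DESCRIPTION = "my_dataset.rst"  # hypothetical

# One entry per remote file; keys describe the file type.
# URLs and checksums below are placeholders, not real deposits.
ARCHIVE = {
    "topology": RemoteFileMetadata(
        filename="system.psf",
        url="https://example.org/files/system.psf",
        checksum="<SHA256 of system.psf>",
    ),
    "trajectory": RemoteFileMetadata(
        filename="trajectory.dcd",
        url="https://example.org/files/trajectory.dcd",
        checksum="<SHA256 of trajectory.dcd>",
    ),
}

# List what the accessor code would have to download.
for file_type, meta in ARCHIVE.items():
    print(file_type, meta.filename, meta.url)
```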
If your data set does not follow the same pattern as the example above (where each file is downloaded separately), then you have to write your own fetch_NAME() function. For example, you might download a tar file and then unpack the file yourself. Use scikit-learn's sklearn/datasets as examples, make sure that your function sets appropriate attributes in the returned Bunch of records, and fully document what is returned.
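A rough sketch of the unpacking part of such a custom function is shown below. It is an assumption-laden illustration rather than MDAnalysisData code: unpack_dataset is a hypothetical helper, the download step is left out (a small tar file is built locally instead), and a SimpleNamespace stands in for the Bunch that a real fetch_NAME() would return.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path
from types import SimpleNamespace


def unpack_dataset(tar_path, expected_checksum, target_dir):
    """Verify and unpack a tar file, then return a record of its contents.

    Hypothetical helper: a real fetch_NAME() would download tar_path first
    and would return a Bunch rather than a SimpleNamespace.
    """
    tar_path, target_dir = Path(tar_path), Path(target_dir)
    # Reads the whole file into memory; fine for a sketch with small files.
    actual = hashlib.sha256(tar_path.read_bytes()).hexdigest()
    if actual != expected_checksum:
        raise IOError("{} has SHA256 {}, expected {}".format(
            tar_path, actual, expected_checksum))
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as tar:
        tar.extractall(path=target_dir)
    # Expose each expected file as a documented attribute.
    return SimpleNamespace(
        topology=str(target_dir / "topology.psf"),
        DESCR="Placeholder description of the data set.",
    )


# Demonstration with a tiny tar file built locally (no network access).
workdir = Path(tempfile.mkdtemp())
source = workdir / "topology.psf"
source.write_text("placeholder topology\n")
tar_path = workdir / "dataset.tar"
with tarfile.open(tar_path, "w") as tar:
    tar.add(source, arcname="topology.psf")

checksum = hashlib.sha256(tar_path.read_bytes()).hexdigest()
records = unpack_dataset(tar_path, checksum, workdir / "unpacked")
print(records.topology)
```

Checking the checksum before unpacking keeps the same integrity guarantee that the per-file pattern provides.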
The RemoteFileMetadata is used by base._fetch_remote(). Typically you will have a local copy of the files during testing. You can compute the SHA256 with the following code:
import MDAnalysisData.base
MDAnalysisData.base._sha256(FILENAME)
or from the command line:
python -c 'import MDAnalysisData.base; print(MDAnalysisData.base._sha256("FILENAME"))'
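If MDAnalysisData is not installed on the machine that holds the files, the checksum can also be computed with the standard library alone. The following is a stand-alone hashlib sketch, not the implementation of base._sha256; the function name and chunk size are illustrative choices.

```python
import hashlib


def sha256_hexdigest(filename, chunk_size=8192):
    """Return the SHA256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(filename, "rb") as f:
        # Read fixed-size chunks so large trajectories need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Call it as sha256_hexdigest("FILENAME") and record the resulting string as the checksum for the matching file in ARCHIVE.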