add new data set
Oliver Beckstein edited this page Oct 4, 2018 · 11 revisions
Adding a new data set requires:
- Put the data on figshare (or another archive-grade repository such as zenodo or DataDryad; some universities also provide digital repositories that are suitable). The site must provide stable download links and must not alter the file contents during download, because we store a SHA256 checksum for each file. Make sure to choose an Open Data compatible license (CC0 or CC-BY preferred).
- Add a Python module such as MDAnalysisData/adk_equilibrium.py; in many cases you can copy that module and adapt the following:
  - text
  - NAME: name of the data set; will be used as a file name so do not use spaces etc.
  - DESCRIPTION: filename of the description file (reStructuredText format, so has suffix .rst)
  - ARCHIVE: dictionary containing RemoteFileMetadata instances. Keys should describe the file type. Typically:
    - topology: topology file (PSF, TPR, ...)
    - trajectory: trajectory coordinate file (DCD, XTC, ...)
    - structure (optional): system with single frame of coordinates (typically PDB, GRO, CRD, ...)
  - name of the fetch_NAME function
  - docs of the fetch_NAME function
- Add a description file such as MDAnalysisData/descr/adk_equilibrium.rst; copy this file and adapt. Make sure to add license information.
- Import your fetch_NAME function in MDAnalysisData/datasets.py.
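The module-level constants described above might look roughly like the following sketch. It is not taken from adk_equilibrium.py: the data set name, file names, URLs and checksums are placeholders, the namedtuple is a stand-in for MDAnalysisData.base.RemoteFileMetadata (its three field names are an assumption), and check_module_constants() is a hypothetical helper that only illustrates the constraints listed above.

```python
import re
from collections import namedtuple

# Stand-in for MDAnalysisData.base.RemoteFileMetadata so that the sketch runs
# on its own; the field names (filename, url, checksum) are an assumption.
RemoteFileMetadata = namedtuple("RemoteFileMetadata",
                                ["filename", "url", "checksum"])

NAME = "my_dataset"             # hypothetical; used as a file name, so no spaces
DESCRIPTION = "my_dataset.rst"  # hypothetical description file

# Keys describe the file type; URLs and checksums below are placeholders.
ARCHIVE = {
    "topology": RemoteFileMetadata(
        filename="my_dataset.psf",
        url="https://example.org/files/my_dataset.psf",
        checksum="0" * 64),
    "trajectory": RemoteFileMetadata(
        filename="my_dataset.dcd",
        url="https://example.org/files/my_dataset.dcd",
        checksum="1" * 64),
}


def check_module_constants(name, description, archive):
    """Return a list of problems found in the module-level constants."""
    problems = []
    if not re.fullmatch(r"[A-Za-z0-9_.-]+", name):
        problems.append("NAME should be usable as a file name: %r" % name)
    if not description.endswith(".rst"):
        problems.append("DESCRIPTION should end in .rst: %r" % description)
    for key, meta in archive.items():
        # A SHA256 hex digest is 64 lowercase hexadecimal characters.
        if not re.fullmatch(r"[0-9a-f]{64}", meta.checksum):
            problems.append("%s: checksum is not a SHA256 hex digest" % key)
    return problems


print(check_module_constants(NAME, DESCRIPTION, ARCHIVE))
```

In a real module you would import RemoteFileMetadata from the package instead of redefining it, and replace the placeholder checksums with the values computed from your local files.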
If your data set does not follow the same pattern as the example above (where each file is downloaded separately) then you have to write your own fetch_NAME() function. E.g., you might download a tar file and then unpack the file yourself. Use the modules in scikit-learn's sklearn/datasets as examples, make sure that your function sets appropriate attributes in the returned Bunch of records, and fully document what is returned.
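As a rough illustration of the tar-file case, the unpacking step of a custom fetch function could be sketched as follows. This is only a sketch under assumptions: unpack_dataset() is a hypothetical helper, the Bunch class is a minimal stand-in for the one used by scikit-learn, the attribute names files and DESCR are illustrative, and the download itself (including the checksum verification) is left out.

```python
import os
import tarfile


class Bunch(dict):
    """Minimal stand-in for sklearn.utils.Bunch: a dict with attribute access."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


def unpack_dataset(tar_path, data_dir):
    """Unpack an already downloaded tar archive and describe its contents."""
    os.makedirs(data_dir, exist_ok=True)
    with tarfile.open(tar_path) as tar:
        # Remember the regular files so that they can be listed in the result.
        names = [m.name for m in tar.getmembers() if m.isfile()]
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer extraction filters that reject
            # unsafe members (absolute paths, links outside data_dir, ...).
            tar.extractall(data_dir, filter="data")
        else:
            tar.extractall(data_dir)
    files = sorted(os.path.join(data_dir, name) for name in names)
    # DESCR would normally hold the text of the description file.
    return Bunch(files=files, DESCR="(contents of the description file)")
```

A full fetch_NAME() function would first download the archive into the data directory, verify its SHA256 checksum, then call a helper like this one, and document every attribute of the returned records in its docstring.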
The RemoteFileMetadata is used by base._fetch_remote(). Typically you will have a local copy of the files during testing. You can compute the SHA256 with the following code:
import MDAnalysisData.base
MDAnalysisData.base._sha256(FILENAME)
or from the command line:
python -c 'import MDAnalysisData.base; print(MDAnalysisData.base._sha256("FILENAME"))'
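If MDAnalysisData is not installed where the files live, a SHA256 hex digest can also be computed with the Python standard library alone. The function below is a sketch, not the package's own _sha256() implementation; sha256_of_file and its chunk_size parameter are names chosen for this example. For the same file it should produce the same digest, since SHA256 is a standard hash.

```python
import hashlib


def sha256_of_file(path, chunk_size=8192):
    """Return the SHA256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so that large trajectories need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting string is what goes into the checksum field of the corresponding RemoteFileMetadata entry.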