Download Bundle #59

bbengfort · 2016-04-12T18:40:50Z

Add a bundle download mechanism (as in a Scikit-Learn bundle) to the interface. This mechanism should export:

cleaned data files
readme.md
license.txt
citation.bib

looselycoupled · 2016-06-03T00:11:32Z

I'm on this one.

looselycoupled · 2016-07-22T06:25:47Z

@bbengfort and @rebeccabilbro, please check out my proposal below and let me know if anything is counter to project requirements.

Proposal:

There are a number of long-term issues with the proposal below, but it gets us closer to what we want. Resolving some of them would require answering outstanding questions or addressing issues that are unassigned or not yet identified.

S3 Buckets / Storage
Prepackaged download bundles are stored in the S3 bucket that already holds the base files. The bucket is open to the world, but browsing is not allowed. According to the current code, each dataset has its own folder at datasets/<account>/<dataset>, and each data file is stored there. We will create a bundles directory with a sub-directory for each version, as in datasets/<account>/<dataset>/bundles/12. The bundle filename will always be <dataset>-bundle-v<version>.zip, e.g. floompa-bundle-v12.zip.
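A minimal sketch of the naming scheme described above, to make the layout concrete. The function name `bundle_key` and the account name `ddl` are hypothetical and used only for illustration; the path and filename patterns are the ones given in the proposal.

```python
def bundle_key(account, dataset, version):
    """Return the storage key for a dataset bundle at the given version."""
    # Filename pattern from the proposal: <dataset>-bundle-v<version>.zip
    filename = "{}-bundle-v{}.zip".format(dataset, version)
    # Folder pattern from the proposal: datasets/<account>/<dataset>/bundles/<version>
    return "datasets/{}/{}/bundles/{}/{}".format(account, dataset, version, filename)


# "ddl" is a made-up account name for the example.
print(bundle_key("ddl", "floompa", 12))
```

Keeping the key construction in one place would make it cheap to switch to the alternative layouts raised in the questions below.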

Question: Alternatively we could create bundles/<account>/<dataset>/<version> and keep them somewhat separate. Thoughts? Is that even needed long term for security or other reasons?

Question: Should we keep bundles in a different bucket and use a UUID as the folder name to obfuscate the path, so that only those who have the link for a private dataset can download it? Should we just use a UUID as the bundle name, or is there a requirement that it be a friendly filename in some way?

Security
For the moment, if a user has a link then they can download the bundle even if it's a private dataset.

Bundle generation
Whenever an update is needed, a new Celery task is enqueued to replace (or initially add) a bundle. Presumably one could trigger a bunch of updates relatively quickly. There is a timing problem here in that only the latest bundle is ever generated. I'd like to punt on this problem until I have a better idea of how we are versioning the individual files (it seems easily solvable in the future).

User Interface
Users can use the download link on the project page. If no bundle is available yet, a pop-up message is displayed (I can also color the download button yellow until it is ready). Otherwise, the download link points directly to the S3 HTTP download. I'll likely add a new dataset field to indicate whether the bundle is ready: either a simple boolean or perhaps something more informative. What might be best is a DatasetVersion model to map DataFiles to Datasets. That would be a natural place for status and would give us more flexibility in the future.

looselycoupled · 2016-07-23T17:15:10Z

Develop new DatasetVersion model
Develop migration file for existing data?
Modify upload code to increment dataset version
Develop celery task to bundle content and update version record
Color code Download link
Provide popup with download links for available versions

bbengfort · 2016-07-24T17:23:40Z

Point on security: at the moment (I believe) the bucket requires a token to give up the goods, and that token is generated via boto through the Django Storages app. The token grants the user a download, and the link only lasts for 6 hours or something, meaning that the link isn't created for a user who doesn't have permission.

If this is not the case, then I must have manually edited the bucket for development reasons, and we should go back to the token method above.

bbengfort · 2016-07-24T17:27:37Z

Also, I'm happy to store the bundles on S3 if that's what you think we should do. However, I was planning to generate the zip file on demand with the things that are in the database via the zipfile library and StringIO objects, sort of like "Use compressed data directly – from ZIP files or gzip http response".

Maybe you're thinking this doesn't scale, which is fair, so bundles/account/dataset-version.zip seems fine to me. All the rest of your proposal looks good to me.
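For reference, the on-demand approach mentioned above could look roughly like the sketch below. It is an assumption-laden illustration, not the project's code: `build_bundle` and the sample file contents are hypothetical, and `io.BytesIO` is used in place of `StringIO` because zip data is binary and Python 3 requires a bytes buffer.

```python
import io
import zipfile


def build_bundle(files):
    """Pack a mapping of archive names to bytes into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            # writestr adds a member directly from bytes, no temp file needed
            archive.writestr(name, content)
    return buffer.getvalue()


# Placeholder contents; a real bundle would pull these from the database.
bundle = build_bundle({
    "data/floompa.csv": b"a,b\n1,2\n",
    "readme.md": b"# floompa\n",
    "license.txt": b"(license text)\n",
    "citation.bib": b"(citation entry)\n",
})
```

The resulting bytes could be returned in an HTTP response or uploaded to storage, which is where the two approaches in this thread diverge: building per request costs CPU on every download, while prebuilding costs storage per version.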

looselycoupled · 2016-08-18T05:40:28Z

Current status:
A new bundle is created whenever a file is added and the download link works correctly.

Todo:
The only major item left is to create a new many-to-many relationship so that we can keep track of which files go with which versions. Right now everything maps to the latest version, which is the only download provided. The goal is to keep track of the dataset at every version and offer downloads for each.
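To illustrate the idea of tracking which files belong to which version, here is a plain-Python stand-in for the many-to-many relationship. It is only a sketch: the class and method names are hypothetical, it is not a Django model, and it assumes every added file bumps the version by one, as the upload behavior described in this thread suggests.

```python
class Dataset(object):
    """Toy dataset that remembers its file list at every version."""

    def __init__(self, name):
        self.name = name
        # versions[i] holds the file names that make up version i + 1
        self.versions = []

    def add_file(self, filename):
        """Create a new version containing all current files plus filename."""
        current = self.versions[-1] if self.versions else ()
        self.versions.append(current + (filename,))
        return len(self.versions)

    def files_at(self, version):
        """Return the file names that made up the given (1-based) version."""
        return self.versions[version - 1]


dataset = Dataset("floompa")
dataset.add_file("measurements.csv")
dataset.add_file("labels.csv")
```

Because a file can appear in many versions and a version contains many files, the same shape in a relational schema would be a join table between files and versions.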

bbengfort · 2016-08-23T15:09:13Z

I like the idea of being able to download a dataset at previous versions - that will help with estimator reproducibility and a host of other items.

bbengfort added this to the Version 0.3 milestone Apr 12, 2016

bbengfort added type: feature priority: medium ready labels Apr 12, 2016

looselycoupled added in progress and removed ready labels Jun 3, 2016

bbengfort mentioned this issue Jul 9, 2016

Citation #57

Open

looselycoupled self-assigned this Jul 22, 2016

rebeccabilbro added the Advanced label Dec 9, 2016

This was referenced Dec 10, 2016

Data file uploading - multiple files #54

Open

Dataset Overwrite/Versioning System #7

Open

looselycoupled added a commit to looselycoupled/cultivar that referenced this issue Dec 13, 2016

adds missing celery dependencies to requirements files DistrictDataLa…

af7a215

…bs#59
