Download Bundle #59

Open
bbengfort opened this issue Apr 12, 2016 · 7 comments

Comments

@bbengfort
Member

Add a bundle (as in a Scikit-Learn bundle) download mechanism to the interface. This mechanism should export:

  • cleaned data files
  • readme.md
  • license.txt
  • citation.bib
@looselycoupled
Member

I'm on this one.

@bbengfort bbengfort mentioned this issue Jul 9, 2016
@looselycoupled looselycoupled self-assigned this Jul 22, 2016
@looselycoupled
Member

@bbengfort and @rebeccabilbro, please check out my proposal below and let me know if anything is counter to project requirements.

Proposal:

There are a number of long term issues with the proposal below but it gets us closer to what we want. Some would require answering outstanding questions or perhaps unassigned/unidentified issues.

S3 Buckets / Storage
Prepackaged download bundles are stored in the S3 bucket that already holds the base files. The bucket is open to the world, but browsing is not allowed. According to the current code, each dataset has its own folder at datasets/<account>/<dataset>, and each data file is stored there. We will create a bundles directory with a sub-directory for each version, as in datasets/<account>/<dataset>/bundles/12. The bundle filename will always be <dataset>-bundle-v<version>.zip, e.g. floompa-bundle-v12.zip.
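A minimal sketch of the key layout described above, in case it helps the discussion. The function name `bundle_key` and the account name `acme` are hypothetical and not taken from the current code; only the path pattern and the filename pattern come from the proposal.

```python
def bundle_key(account, dataset, version):
    """Build the S3 key for a dataset's prepackaged bundle.

    Follows the proposed layout:
    datasets/<account>/<dataset>/bundles/<version>/<dataset>-bundle-v<version>.zip
    """
    filename = "{}-bundle-v{}.zip".format(dataset, version)
    return "datasets/{}/{}/bundles/{}/{}".format(account, dataset, version, filename)


# "acme" is a made-up account name used only for illustration.
print(bundle_key("acme", "floompa", 12))
```

Keeping the key construction in one place would make it cheap to switch to the alternative bundles/<account>/<dataset>/<version> layout if that is preferred.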

Question: Alternatively we could create bundles/<account>/<dataset>/<version> and keep them somewhat separate. Thoughts? Is that even needed long term for security or other reasons?

Question: Should we keep bundles in a different bucket and use a UUID as the folder name to obfuscate the path, so that only those who have the link for a private dataset can download it? Should we just use a UUID as the bundle name, or is there a requirement that it be a friendly filename in some way?

Security
For the moment, if a user has a link then they can download the bundle even if it's a private dataset.

Bundle generation
Whenever an update is needed, a new Celery task is enqueued to replace (or initially add) a bundle. Presumably one could trigger a bunch of updates relatively quickly. There is a timing problem here: only the latest bundle is ever generated. I'd like to punt on this problem until I have a better idea of how we are versioning the individual files (it seems easily solvable in the future).

User Interface
Users can use the download link on the project page. If no bundle is available yet, a pop-up message is displayed (I can also color code the download button yellow until it is ready). Otherwise the download link goes directly to the S3 HTTP download. I'll likely add a new dataset field to indicate whether the bundle is ready: either a simple boolean or perhaps something more informative. What might be best is a DatasetVersion model to map DataFiles to Datasets. That would be a natural place for status and would give us more flexibility in the future.

@looselycoupled
Member

looselycoupled commented Jul 23, 2016

  • Develop new DatasetVersion model
  • Develop migration file for existing data?
  • Modify upload code to increment dataset version
  • Develop Celery task to bundle content and update the version record
  • Color code Download link
  • Provide popup with download links for available versions

@bbengfort
Member Author

bbengfort commented Jul 24, 2016

Point on security: at the moment (I believe) the bucket requires a token to give up the goods, and that token is generated via boto through the Django Storages app. The token grants the user a download, and the link only lasts for 6 hours or something. Meaning that the link isn't created for a user who doesn't have permission.

If this is not the case, then I must have manually edited the bucket for development reasons, and we should go back to the token method above.

@bbengfort
Member Author

Also, I'm happy to store the bundles on S3 if that's what you think we should do. However, I was planning to generate the zip file on demand from the things that are in the database, using the zipfile library and StringIO objects, sort of like the approach in "Use compressed data directly – from ZIP files or gzip http response".

Maybe you're thinking this doesn't scale, which is fair, so bundles/account/dataset-version.zip seems fine to me. All the rest of your proposal looks good to me.
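For reference, a rough sketch of the on-demand approach mentioned above, using only the standard library. It assumes Python 3, where `io.BytesIO` takes the place of a StringIO object for binary data. The helper name `build_bundle` and the sample file contents are hypothetical; in the real app the contents would come from the database or storage backend.

```python
import io
import zipfile


def build_bundle(files):
    """Zip a mapping of archive name -> bytes into an in-memory buffer.

    Returns a BytesIO positioned at the start, ready to be streamed
    in an HTTP response or uploaded to storage.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Sort names so the archive layout is stable across runs.
        for name in sorted(files):
            archive.writestr(name, files[name])
    buffer.seek(0)
    return buffer


# Placeholder contents for the four items the bundle should export.
bundle = build_bundle({
    "data/floompa.csv": b"a,b\n1,2\n",
    "readme.md": b"# floompa\n",
    "license.txt": b"(license text)\n",
    "citation.bib": b"(citation entry)\n",
})
```

The same function would work for either plan: call it per request for on-demand downloads, or call it once in a background task and store the result on S3.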

@looselycoupled
Member

looselycoupled commented Aug 18, 2016

Current status:
A new bundle is created whenever a file is added and the download link works correctly.

Todo:
The only major item left is to create a new many-to-many relationship so that we can keep track of which files go with which versions. Right now everything maps to the latest version, which is the only download provided. The goal is to keep track of the dataset at every version and offer downloads for each.
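To illustrate the version-to-files mapping described above, here is a plain-Python sketch rather than actual Django models. The class and method names (`Dataset`, `DatasetVersion`, `add_file`, `files_at`) and the `bundle_ready` flag are assumptions for illustration; the real schema may look quite different.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetVersion:
    number: int
    file_ids: frozenset          # files that belong to this version
    bundle_ready: bool = False   # hypothetical status flag for the UI


@dataclass
class Dataset:
    name: str
    versions: list = field(default_factory=list)

    def add_file(self, file_id):
        """Adding a file creates a new version containing all prior files."""
        current = self.versions[-1].file_ids if self.versions else frozenset()
        version = DatasetVersion(len(self.versions) + 1, current | {file_id})
        self.versions.append(version)
        return version

    def files_at(self, number):
        """Return the set of files that made up the given version."""
        return self.versions[number - 1].file_ids


dataset = Dataset("floompa")
dataset.add_file("a.csv")
dataset.add_file("b.csv")
```

Because a file can appear in many versions and a version holds many files, this is the many-to-many shape: `files_at(1)` still returns only the first file after later uploads.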

@bbengfort
Member Author

I like the idea of being able to download a dataset at previous versions - that will help with estimator reproducibility and a host of other items.


3 participants