Demo dataset infrastructure #246

Open

GenevieveBuckley opened this issue Nov 4, 2021 · 6 comments

@GenevieveBuckley (Collaborator) commented Nov 4, 2021

I think demo dataset infrastructure would be useful.

I made a PR proposal for napari here: napari/napari#3580 (it's based on scikit-image's approach: they use pooch and like it)

We could have a combination of:

  1. Experimental datasets, and
  2. Synthetic datasets (it can be quicker to generate very large images than to download them; they just need to have interesting structures, see the sketch below)
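
For illustration, here is a minimal sketch of what a synthetic dataset might look like, built lazily with dask.array and dask_image.ndfilters; the shape, chunking, sigma, and threshold are all arbitrary placeholder choices, not a settled design:

```python
# Hedged sketch: a large, lazy synthetic "blobs" image that is generated
# on demand instead of downloaded. All parameters are placeholders.
import dask.array as da
from dask_image import ndfilters

noise = da.random.random((4096, 4096), chunks=(1024, 1024))
smoothed = ndfilters.gaussian_filter(noise, sigma=10)   # smooth the noise
blobs = (smoothed > smoothed.mean()).astype("uint8")    # threshold into blobs

print(blobs)  # still a lazy dask array; nothing is computed until requested
```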

@jakirkham (Member)

I get the sense that pooch is mainly used to download data. Is that correct? Or can it also be made to query portions of data directly from the cloud?

@GenevieveBuckley (Collaborator, Author)

> I get the sense that pooch is mainly used to download data. Is that correct? Or can it also be made to query portions of data directly from the cloud?

Pooch is only for downloading and extracting data. You give it a filename/URL, and pooch fetches it for you.
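
A minimal sketch of that workflow; `pooch.retrieve` and `pooch.Unzip` are real APIs, but the URL below is a placeholder and a real dataset would pin a SHA256 hash:

```python
# Hedged sketch: fetch and extract a demo dataset with pooch.
import pooch

files = pooch.retrieve(
    url="https://example.com/demo-dataset.zip",  # placeholder URL
    known_hash=None,          # replace with "sha256:..." for integrity checks
    processor=pooch.Unzip(),  # extract the archive after download
)
print(files)  # paths to the extracted files in pooch's local cache
```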

If you want to query portions of a dataset, you'd need that dataset to be stored in some kind of chunked format to begin with, and some idea about how you want to do that querying. So it could be possible with a remote HDF5 (or zarr?) array.
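
For the HDF5 case, a hedged sketch of slicing a remote file over HTTP without downloading all of it; the URL and dataset name are placeholders, and this relies on h5py accepting file-like objects while fsspec issues range requests under the hood:

```python
# Hedged sketch: read one tile from a remote HDF5 file.
# Only (roughly) the byte ranges backing the slice are fetched.
import fsspec
import h5py

with fsspec.open("https://example.com/demo.h5", "rb") as f:  # placeholder URL
    with h5py.File(f, "r") as h5:
        tile = h5["data"][:256, :256]  # "data" is a placeholder dataset name
```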

One thing to consider is download speed. I haven't done much testing, but it seems like common sense that zipped/tarred datasets will transfer over the network more quickly. So even with the extra time it takes to extract the data once it arrives, it might be quicker overall. That doesn't mean you have to do it that way; it's just one more thing to consider.

@jakirkham (Member)

Yeah, Zarr supports Zstandard, which compresses quite efficiently. There are some filesystems that use Zstandard, and it's also being explored for Conda packages for the same reasons (faster downloads, smaller packages, etc.).
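
As a hedged sketch, creating a Zstandard-compressed Zarr array might look like this (the path, shape, chunking, dtype, and compression level are placeholders):

```python
# Hedged sketch: write a Zarr array compressed with Zstandard via numcodecs.
import zarr
from numcodecs import Zstd

z = zarr.open(
    "demo.zarr", mode="w",            # placeholder local path
    shape=(4096, 4096), chunks=(1024, 1024), dtype="uint16",
    compressor=Zstd(level=3),         # placeholder compression level
)
```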

We can also query datasets directly from the cloud with Zarr. Here's an example dataset on S3 (zarr-developers/zarr-python#385 (comment)).
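
A hedged sketch of that kind of cloud query, using a made-up bucket and path (s3fs and zarr provide these APIs; the dataset itself is a placeholder):

```python
# Hedged sketch: slice a Zarr array directly from S3. Only the chunks
# covering the requested slice are downloaded.
import s3fs
import zarr

fs = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(root="example-bucket/demo.zarr", s3=fs, check=False)
arr = zarr.open(store, mode="r")
tile = arr[:256, :256]  # fetches only the relevant chunks
```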

We can also cache downloaded chunks locally to ensure we only pull from a cloud store once.
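
One way to do that with zarr's built-in (in-memory) chunk cache, sketched against the same placeholder store; for a cache that persists on disk, fsspec's `simplecache::` protocol is another option:

```python
# Hedged sketch: wrap the remote store in an LRU cache so repeated reads
# are served locally instead of hitting S3 again.
import s3fs
import zarr

fs = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(root="example-bucket/demo.zarr", s3=fs, check=False)
cache = zarr.LRUStoreCache(store, max_size=2**28)  # ~256 MB in-memory cache
arr = zarr.open(cache, mode="r")
first = arr[:256, :256]   # pulled from S3
again = arr[:256, :256]   # served from the local cache
```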

I think this really comes down to what size datasets would be used here. If they are small, maybe pooch is fine. If they are large, maybe Zarr would be better.

@GenevieveBuckley (Collaborator, Author)

+1 for zarr wherever applicable

@GenevieveBuckley (Collaborator, Author)

A discussion about synthetic data generation is here: napari/napari#3608
