Demo dataset infrastructure #246

Open

GenevieveBuckley opened this issue Nov 4, 2021 · 6 comments

@GenevieveBuckley (Collaborator) commented Nov 4, 2021

I think demo dataset infrastructure would be useful.

I made a PR proposal for napari here: napari/napari#3580 (it's based on scikit-image's approach: they use pooch and like it)

We could have a combination of:

  1. Experimental datasets, and
  2. Synthetic datasets (it can be quicker to generate very large images than to download them; they just need to have interesting structures, see the sketch below)
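
For illustration, here is a minimal sketch of what a synthetic dataset might look like, built lazily with dask.array and dask_image.ndfilters; the shape, chunking, sigma, and threshold are all arbitrary placeholder choices, not a settled design:

```python
# Hedged sketch: a large, lazy synthetic "blobs" image that is generated
# on demand instead of downloaded. All parameters are placeholders.
import dask.array as da
from dask_image import ndfilters

noise = da.random.random((4096, 4096), chunks=(1024, 1024))
smoothed = ndfilters.gaussian_filter(noise, sigma=10)   # smooth the noise
blobs = (smoothed > smoothed.mean()).astype("uint8")    # threshold into blobs

print(blobs)  # still a lazy dask array; nothing is computed until requested
```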

@jakirkham (Member)

I get the sense that pooch is mainly used to download data. Is that correct? Or can it also be made to query portions of data directly from the cloud?

@GenevieveBuckley (Collaborator, Author)

> I get the sense that pooch is mainly used to download data. Is that correct? Or can it also be made to query portions of data directly from the cloud?

Pooch is only for downloading and extracting data. You give it a filename/URL, and pooch fetches it for you.
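
A minimal sketch of that workflow; `pooch.retrieve` and `pooch.Unzip` are real APIs, but the URL below is a placeholder and a real dataset would pin a SHA256 hash:

```python
# Hedged sketch: fetch and extract a demo dataset with pooch.
import pooch

files = pooch.retrieve(
    url="https://example.com/demo-dataset.zip",  # placeholder URL
    known_hash=None,          # replace with "sha256:..." for integrity checks
    processor=pooch.Unzip(),  # extract the archive after download
)
print(files)  # paths to the extracted files in pooch's local cache
```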

If you want to query portions of a dataset, you'd need that dataset to be stored in some kind of chunked format to begin with, and some idea about how you want to do that querying. So it could be possible with a remote HDF5 (or zarr?) array.
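
For the HDF5 case, a hedged sketch of slicing a remote file over HTTP without downloading all of it; the URL and dataset name are placeholders, and this relies on h5py accepting file-like objects while fsspec issues range requests under the hood:

```python
# Hedged sketch: read one tile from a remote HDF5 file.
# Only (roughly) the byte ranges backing the slice are fetched.
import fsspec
import h5py

with fsspec.open("https://example.com/demo.h5", "rb") as f:  # placeholder URL
    with h5py.File(f, "r") as h5:
        tile = h5["data"][:256, :256]  # "data" is a placeholder dataset name
```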

One thing to consider is download speed. I haven't done much testing, but it seems like common sense that zipped/tarred datasets will transfer over the network more quickly. So even with the extra time it takes to extract the data once it arrives, it might be quicker overall. That doesn't mean you have to do it that way; it's just one more thing to consider.

@jakirkham (Member)

Yeah, Zarr supports Zstandard, which compresses quite efficiently. There are some filesystems that use Zstandard, and it's also being explored for Conda packages for the same reasons (faster downloads, smaller packages, etc.).
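
As a hedged sketch, creating a Zstandard-compressed Zarr array might look like this (the path, shape, chunking, dtype, and compression level are placeholders):

```python
# Hedged sketch: write a Zarr array compressed with Zstandard via numcodecs.
import zarr
from numcodecs import Zstd

z = zarr.open(
    "demo.zarr", mode="w",            # placeholder local path
    shape=(4096, 4096), chunks=(1024, 1024), dtype="uint16",
    compressor=Zstd(level=3),         # placeholder compression level
)
```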

We can also query datasets directly from the cloud with Zarr. Here's an example dataset on S3 (zarr-developers/zarr-python#385 (comment)).
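
A hedged sketch of that kind of cloud query, using a made-up bucket and path (s3fs and zarr provide these APIs; the dataset itself is a placeholder):

```python
# Hedged sketch: slice a Zarr array directly from S3. Only the chunks
# covering the requested slice are downloaded.
import s3fs
import zarr

fs = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(root="example-bucket/demo.zarr", s3=fs, check=False)
arr = zarr.open(store, mode="r")
tile = arr[:256, :256]  # fetches only the relevant chunks
```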

We can also cache downloaded chunks locally to ensure we only pull from a cloud store once.
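
One way to do that with zarr's built-in (in-memory) chunk cache, sketched against the same placeholder store; for a cache that persists on disk, fsspec's `simplecache::` protocol is another option:

```python
# Hedged sketch: wrap the remote store in an LRU cache so repeated reads
# are served locally instead of hitting S3 again.
import s3fs
import zarr

fs = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(root="example-bucket/demo.zarr", s3=fs, check=False)
cache = zarr.LRUStoreCache(store, max_size=2**28)  # ~256 MB in-memory cache
arr = zarr.open(cache, mode="r")
first = arr[:256, :256]   # pulled from S3
again = arr[:256, :256]   # served from the local cache
```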

I think this really comes down to what size datasets would be used here. If they are small, maybe pooch is fine. If they are large, maybe Zarr would be better.

@GenevieveBuckley (Collaborator, Author)

+1 for zarr wherever applicable

@GenevieveBuckley (Collaborator, Author)

A discussion about synthetic data generation is here: napari/napari#3608
