Integration with Hugging Face Datasets #60
Comments
Not to hijack this thread, but just found out about Will be happy to hear any thoughts on this, I might pop in for the Pangeo ML Working Group meeting to discuss this.
Oh hi Meghan! It always surprises me how small the open source world is 😆 Will definitely see what others are up to next Monday. My initial impression was to think of it in terms of a PyTorch/TensorFlow split, or to have the two libraries specialize in loading from a GeoTIFF/Zarr file vs. in-memory xarray objects. But the lines aren't quite as clear-cut, and given that PyTorch 1.11 recently introduced TorchData/DataPipes, it'll be good to put some smart people together and think about the best way forward.
There might not be such a difference between these two approaches if you remove the "in-memory" part. When you open data with Xarray, it is automatically "lazy" about loading it into memory. It just puts a light "lazy indexing" wrapper around the underlying array in a GeoTIFF / Zarr / NetCDF / GRIB file. A downstream library (xbatcher, PyTorch, etc.) can use these arrays in a streaming fashion. The advantage of using Xarray as a loader is that it already speaks all the weird file formats. The disadvantage is that there is some overhead in creating a Dataset, particularly around eager loading of coordinates. There may be workarounds for that, particularly after the Xarray indexes refactor.
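To make the "lazy indexing" idea a bit more concrete, here is a toy sketch in plain Python. It is only an illustration of the concept, not how Xarray actually implements it: `LazyFloatArray` and `iter_batches` are hypothetical names made up for this example, and a flat binary file stands in for a GeoTIFF / Zarr / NetCDF store.

```python
import os
import struct
import tempfile


class LazyFloatArray:
    """Toy 1-D array of float64 values that stay on disk until indexed.

    Hypothetical class for illustration; not part of Xarray or xbatcher.
    """

    ITEM_SIZE = struct.calcsize("<d")  # 8 bytes per float64 value

    def __init__(self, path):
        self.path = path
        self.bytes_read = 0  # bookkeeping to show how little gets loaded

    def __len__(self):
        # The length comes from file metadata, not from reading the data.
        return os.path.getsize(self.path) // self.ITEM_SIZE

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("this sketch only supports slices")
        start, stop, step = key.indices(len(self))
        if step != 1:
            raise ValueError("this sketch only supports contiguous slices")
        count = max(stop - start, 0)
        # Seek to the requested window and read only those bytes.
        with open(self.path, "rb") as f:
            f.seek(start * self.ITEM_SIZE)
            raw = f.read(count * self.ITEM_SIZE)
        self.bytes_read += len(raw)
        return list(struct.unpack(f"<{count}d", raw))


def iter_batches(array, batch_size):
    """Yield consecutive batches, reading each one only when requested."""
    for start in range(0, len(array), batch_size):
        yield array[start:start + batch_size]


# Write 1000 values to a temporary file that plays the role of the on-disk store.
values = [float(i) for i in range(1000)]
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    tmp.write(struct.pack(f"<{len(values)}d", *values))
    path = tmp.name

lazy = LazyFloatArray(path)        # opening reads no data
batches = iter_batches(lazy, 100)  # creating the generator reads no data
first = next(batches)              # reads only the first 100 values
```

After `next(batches)`, `lazy.bytes_read` only accounts for the first batch, which is the streaming behaviour a downstream batch generator would rely on.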
Twitter thread related to Hugging Face and Zarr: https://twitter.com/rabernat/status/1517182069943713792
I've recently been learning about Hugging Face Datasets. It's a great data sharing platform for ML. The `datasets` package is based on TensorFlow Datasets.

It would be great to think about how to best integrate Xarray and Xbatcher with Hugging Face Datasets. Opening this issue just as a placeholder. Will update with more detail as I explore.