TorchData #576

Closed
austinmw opened this issue Jun 13, 2022 · 7 comments
Labels
datasets Geospatial or benchmark datasets

Comments

@austinmw

Hi, do you plan to support TorchData iterable-style and map-style datapipes in the future?

I ask because the PyTorch DataLoader V2 will eventually "only be responsible for multiprocessing, distributed, and similar functionalities, not data processing logic. All data processing features, such as the shuffling and batching, will be moved out of DataLoader to DataPipe."

https://github.com/pytorch/data#frequently-asked-questions-faq
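To make the quoted idea concrete, here is a minimal sketch of shuffling and batching written as composable pipeline stages rather than loader options. It uses plain Python generators only; the `shuffle` and `batch` helpers are hypothetical names for illustration and are not the TorchData API.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle(items: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield items in pseudo-random order using a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Move a randomly chosen element to the end, then emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Flush whatever is left once the input is exhausted.
    rng.shuffle(buffer)
    yield from buffer


def batch(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group consecutive items into lists of at most batch_size elements."""
    it = iter(items)
    while chunk := list(islice(it, batch_size)):
        yield chunk


# Stages compose like a pipeline; a loader would only need to iterate the result.
pipeline = batch(shuffle(range(10), buffer_size=4), batch_size=3)
batches = list(pipeline)
```

With this layout, the loader no longer needs `shuffle` or `batch_size` arguments, because each step is an ordinary stage that can be reordered or swapped out.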

@adamjstewart
Collaborator

Yes, we definitely plan to support DataPipes in the future. When I first talked to the torchvision devs, they mentioned the plan to rework their datasets to use DataPipes. At the time, the DataPipe stuff seemed too bleeding edge for us to use directly, but things are definitely more stable now. I need to take another look and see just how different things are.

@austinmw
Author

That's awesome, really glad to hear!

@adamjstewart added the datasets label (Geospatial or benchmark datasets) on Jun 25, 2022
@adamjstewart
Collaborator

Still need to dig deeper into how TorchData works and how torchvision is planning to migrate to TorchData, but I think this will be a good opportunity to refactor.

Right now, we have two class/subclass hierarchies:

  • GeoDataset
    • RasterDataset (uses rasterio)
    • VectorDataset (uses fiona)
    • bunch of custom datasets (uses rasterio + fiona, pandas, etc.)
  • NonGeoDataset
    • bunch of custom datasets (uses pillow, pandas, etc.)

I think it would make more sense to do something like:

  • GeoDataset:
    • RasterioDataset
    • FionaDataset
    • XarrayDataset
    • RioxarrayDataset
    • PandasDataset
  • NonGeoDataset:
    • PillowDataset
    • OpenCVDataset
    • PandasDataset

If I understand correctly, this seems to be the intention of TorchData, to create pluggable pipelines for each file format to improve reuse and avoid code duplication.
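A rough sketch of what that backend-per-format split could look like, assuming a shared base class that owns indexing while each backend subclass owns decoding. Every class name here is hypothetical, and the standard-library `csv` and `json` modules stand in for rasterio, fiona, pandas, and so on, so the example runs without geospatial dependencies.

```python
import csv
import io
import json
from abc import ABC, abstractmethod
from typing import Any, Iterable, List


class FileBackedDataset(ABC):
    """Hypothetical base class: indexing lives here, decoding in subclasses."""

    def __init__(self, payloads: Iterable[str]) -> None:
        # Raw file contents are passed in directly to keep the sketch self-contained.
        self.payloads: List[str] = list(payloads)

    def __len__(self) -> int:
        return len(self.payloads)

    def __getitem__(self, index: int) -> Any:
        return self._decode(self.payloads[index])

    @abstractmethod
    def _decode(self, raw: str) -> Any:
        """Turn one raw payload into a sample."""


class CsvDataset(FileBackedDataset):
    """Stand-in for a tabular backend such as the proposed PandasDataset."""

    def _decode(self, raw: str) -> Any:
        return list(csv.DictReader(io.StringIO(raw)))


class JsonDataset(FileBackedDataset):
    """Stand-in for a vector backend such as the proposed FionaDataset."""

    def _decode(self, raw: str) -> Any:
        return json.loads(raw)


class MyBenchmark(CsvDataset):
    """A concrete dataset inherits decoding instead of reimplementing it."""


sample = MyBenchmark(["a,b\n1,2\n"])[0]
```

The point of the split is that a custom dataset only picks a backend and declares its files, so format-specific code is written once per library rather than once per dataset.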

@adamjstewart
Collaborator

Another area where TorchData may help: we have a lot of datasets that can either be loaded from files on local disk, or streamed from a STAC API like on the Planetary Computer. I believe that was one of the main driving factors behind TorchData, so I'm interested to see if they've found a good way to have a single dataset that optionally loads from different sources like this.
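As a sketch of the "single dataset, different sources" idea, one option is to hide the storage location behind a small interface and hand it to the dataset. This is only an illustration under stated assumptions: `Source`, `LocalSource`, `InMemorySource`, and `SceneDataset` are hypothetical names, and the in-memory source stands in for a remote catalog such as a STAC API so that no network access is needed.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterator, List, Protocol


class Source(Protocol):
    """Hypothetical interface: anything that can list keys and read bytes."""

    def keys(self) -> Iterator[str]: ...

    def read(self, key: str) -> bytes: ...


class LocalSource:
    """Reads files from a directory on local disk."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def keys(self) -> Iterator[str]:
        return iter(sorted(p.name for p in self.root.iterdir() if p.is_file()))

    def read(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class InMemorySource:
    """Stand-in for a remote source; a real one would issue network requests."""

    def __init__(self, blobs: Dict[str, bytes]) -> None:
        self.blobs = dict(blobs)

    def keys(self) -> Iterator[str]:
        return iter(sorted(self.blobs))

    def read(self, key: str) -> bytes:
        return self.blobs[key]


class SceneDataset:
    """One dataset class; where the bytes come from is injected."""

    def __init__(self, source: Source) -> None:
        self.source = source
        self.index: List[str] = list(source.keys())

    def __len__(self) -> int:
        return len(self.index)

    def __getitem__(self, i: int) -> bytes:
        return self.source.read(self.index[i])


# Usage: the same dataset code runs against either source.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.bin").write_bytes(b"one")
    (Path(tmp) / "b.bin").write_bytes(b"two")
    local = SceneDataset(LocalSource(tmp))
    local_items = [local[i] for i in range(len(local))]

remote = SceneDataset(InMemorySource({"a.bin": b"one", "b.bin": b"two"}))
remote_items = [remote[i] for i in range(len(remote))]
```

Whether TorchData offers an equivalent built-in mechanism is exactly the open question here; the sketch only shows the shape of the problem.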

@adamjstewart added this to the 0.4.0 milestone on Jul 22, 2022
@adamjstewart
Collaborator

Looked through the documentation a bit. From what I can tell, my first comment is definitely supported by TorchData. I opened an issue to see if my second comment is/could be supported as well: pytorch/data#672

@austinmw
Author

@adamjstewart One more format that might be good to support down the line is a simple iterable tar format (like webdataset, but using only torchdata). For your second comment, I wonder if you're looking for something like AIStore with torchdata loaders?

https://pytorch.org/data/main/generated/torchdata.datapipes.iter.AISFileLoader.html#torchdata.datapipes.iter.AISFileLoader
https://pytorch.org/data/main/generated/torchdata.datapipes.iter.AISFileLister.html#torchdata.datapipes.iter.AISFileLister

@adamjstewart
Collaborator

Seems like TorchData is dead: pytorch/data#1196
