Adding S3 support when PyTorch framework is selected. #138

Open
zhenghh04 opened this issue Jan 12, 2024 · 7 comments

@zhenghh04
Member

zhenghh04 commented Jan 12, 2024

Check whether we can adopt the PyTorch S3 support:
https://pytorch.org/data/main/generated/torchdata.datapipes.iter.S3FileLoader.html

@zhenghh04
Member Author

@hariharan-devarajan could you take a look at whether this is good to include?

@hariharan-devarajan
Collaborator

I think it is good, but my only concern is that in PyTorch data loaders, iterable input pipelines are less parallelizable than indexed ones. We can probably convert this into an indexed pipeline.

We can use get_object and put_object to build our own pipeline and compare it against an iterable version using our native data loader implementations.
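To make the indexed-versus-iterable distinction concrete, here is a minimal sketch. It assumes a client whose `get_object(Bucket=..., Key=...)` and `put_object(Bucket=..., Key=..., Body=...)` calls have the same shape as boto3's S3 client. `FakeS3Client`, `IndexedS3Dataset`, and `iterate_s3_objects` are hypothetical names introduced only for illustration, and the fake client keeps everything in memory so the example runs without S3 or PyTorch. In a real setup the indexed class would typically subclass `torch.utils.data.Dataset`.

```python
import io


class FakeS3Client:
    """In-memory stand-in (hypothetical) that mimics the shape of boto3's
    get_object/put_object calls, so the sketch runs without real S3."""

    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body):
        self._objects[(Bucket, Key)] = bytes(Body)
        return {}

    def get_object(self, Bucket, Key):
        # boto3 returns a dict whose "Body" is a readable stream.
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


class IndexedS3Dataset:
    """Map-style (indexed) dataset: sample i is fetched by key,
    independently of every other sample."""

    def __init__(self, client, bucket, keys):
        self.client = client
        self.bucket = bucket
        self.keys = list(keys)

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, index):
        response = self.client.get_object(Bucket=self.bucket, Key=self.keys[index])
        return response["Body"].read()


def iterate_s3_objects(client, bucket, keys):
    """Iterable counterpart: yields objects strictly in key order."""
    for key in keys:
        yield client.get_object(Bucket=bucket, Key=key)["Body"].read()


client = FakeS3Client()
keys = [f"sample_{i}.bin" for i in range(4)]
for i, key in enumerate(keys):
    client.put_object(Bucket="my-bucket", Key=key, Body=bytes([i]) * 8)

dataset = IndexedS3Dataset(client, "my-bucket", keys)
streamed = list(iterate_s3_objects(client, "my-bucket", keys))
print(len(dataset), streamed == [dataset[i] for i in range(len(dataset))])
```

Because every `__getitem__` call is independent, a sampler can hand arbitrary indices to different workers. The generator version has to be sharded by hand if more than one worker consumes it.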

@krehm
Contributor

krehm commented Jan 23, 2024

FWIW, I have had S3 working with torch in my test setup for a while now, but I used a different method. My task was to get DAOS working with fsspec so that DAOS pathnames can be used with DLIO without requiring the dfuse layer. I modified the readers and generators to open files with fsspec rather than with (kernel-only) pathnames; everything after that works the same, but now I can provide paths like s3://my-bucket/my-file and daos::/my-pool/my-cont/my-file and use them with DLIO. Besides S3, DAOS, and POSIX files, fsspec has many other backends that would automatically work with DLIO.
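As a rough illustration of opening files through fsspec instead of plain pathnames, the sketch below routes reads and writes through a single helper. `open_path` and `read_sample` are hypothetical names, not the code from the branch discussed here. The fallback to the built-in `open` when fsspec is missing is an assumption added so that the example also runs on a bare Python install.

```python
import os
import tempfile

try:
    import fsspec  # optional; the URL scheme selects the backend
except ImportError:
    fsspec = None


def open_path(path, mode="rb"):
    """Open a local path or a URL (hypothetical helper, not DLIO's API)."""
    if fsspec is not None:
        # fsspec.open returns an OpenFile that works as a context manager.
        return fsspec.open(path, mode)
    if "://" in path:
        raise RuntimeError(f"fsspec is required to open {path!r}")
    # Assumption: plain local paths fall back to the built-in open.
    return open(path, mode)


def read_sample(path):
    """Read one sample as bytes, whatever backend the path points at."""
    with open_path(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    local_path = os.path.join(tmp, "sample.bin")
    with open_path(local_path, "wb") as f:
        f.write(b"hello")
    data = read_sample(local_path)

print(data)
```

With fsspec and s3fs installed and credentials configured, the same `read_sample` call should accept a URL such as `s3://my-bucket/my-file` without changes to the reader code.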

@zhenghh04
Member Author

@krehm would you mind sending a PR?

@krehm
Contributor

krehm commented Jan 23, 2024

See the following; it is a bit out of date, but it should give you the idea.

main...krehm:dlio_benchmark:feature/fsspec-storage

@hariharan-devarajan
Collaborator

@zhenghh04 I think I have seen this work before. We should create a PR, and then I can take a look at it.

From my memory, the main thing is that our storage interface is a little weird right now. Ideally, all I/O should happen through the storage interface, and the storage interface should support fsspec for different backend options.

But I am on board with the fsspec approach for sure.
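One way to picture "all I/O goes through the storage interface" is the sketch below. Every class and method name in it (`Storage`, `InMemoryStorage`, `FsspecStorage`, `read_bytes`, `write_bytes`, `generate_and_read`) is hypothetical and does not reflect DLIO's actual storage classes. The fsspec-backed variant imports fsspec lazily, so the example loads even when fsspec is not installed.

```python
from abc import ABC, abstractmethod


class Storage(ABC):
    """Hypothetical storage interface; names are illustrative only."""

    @abstractmethod
    def read_bytes(self, path):
        ...

    @abstractmethod
    def write_bytes(self, path, data):
        ...


class InMemoryStorage(Storage):
    """Dictionary-backed implementation, handy for tests."""

    def __init__(self):
        self._blobs = {}

    def read_bytes(self, path):
        return self._blobs[path]

    def write_bytes(self, path, data):
        self._blobs[path] = bytes(data)


class FsspecStorage(Storage):
    """Delegates to fsspec, so the URL scheme picks the backend."""

    def read_bytes(self, path):
        import fsspec  # lazy import: only needed when this backend is used

        with fsspec.open(path, "rb") as f:
            return f.read()

    def write_bytes(self, path, data):
        import fsspec

        with fsspec.open(path, "wb") as f:
            f.write(data)


def generate_and_read(storage, path, payload):
    """Generator and reader code depends only on the Storage interface."""
    storage.write_bytes(path, payload)
    return storage.read_bytes(path)


result = generate_and_read(InMemoryStorage(), "train/sample_0.bin", b"\x00" * 16)
print(len(result))
```

Under this design, readers and generators never call `open` directly, so switching backends would come down to choosing a different `Storage` implementation.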

@krehm
Contributor

krehm commented Jan 23, 2024

I will work on cleaning up the code and making a PR. It seems to me that there were a couple of loose ends when I last tested with it, so I need to dust off my notes. Note also that I will be in a car Wednesday through Friday, so I will be unresponsive until early next week.


3 participants