[DataPipe] pipe opener #403

tmbdev · 2022-05-13T19:56:59Z

This PR adds a filter that works similarly to FileOpener but uses subprocesses for opening files. The filter also opens local files.

The use of subprocesses allows for easy asynchronous I/O and avoids having to use third-party libraries to access object stores and cloud storage.

Files are specified as URLs, and the subprocess command line is constructed based on the URL scheme.

Out of the box, PipeOpener permits access to local files, S3, GCS, HTTP, HTTPS, and AIStore. Other URL schemes can be supported simply by adding "schema=commandline" arguments to the pipeline. The generic "pipe:" scheme allows arbitrary commands to be used as data sources.
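As an illustration of the scheme-to-command idea described above, here is a minimal sketch. It is not the PR's implementation: the function name `build_command`, the `{url}` placeholder, and the command templates in `DEFAULT_HANDLERS` are all assumptions chosen for the example, and the actual defaults in PipeOpener may differ.

```python
import shlex
from urllib.parse import urlparse

# Hypothetical scheme -> command templates; the PR's actual defaults may differ.
DEFAULT_HANDLERS = {
    "s3": "aws s3 cp {url} -",
    "gs": "gsutil cat {url}",
    "http": "curl -s -L {url}",
    "https": "curl -s -L {url}",
}


def build_command(url, **extra_handlers):
    """Return the argv list that would stream `url` to stdout."""
    # Extra keyword arguments play the role of "schema=commandline" overrides.
    handlers = {**DEFAULT_HANDLERS, **extra_handlers}
    parsed = urlparse(url)
    scheme = parsed.scheme or "file"  # bare paths are treated as local files
    if scheme == "file":
        return ["cat", parsed.path]
    if scheme not in handlers:
        raise ValueError(f"no handler for URL scheme {scheme!r}")
    # Quote the URL before substitution so it stays a single argument.
    return shlex.split(handlers[scheme].format(url=shlex.quote(url)))
```

The resulting list can be passed to `subprocess.Popen(argv, stdout=subprocess.PIPE)` without a shell, and the process's `stdout` then serves as the file-like stream.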

PipeOpener is particularly useful with WebDataset, since WebDataset benefits greatly from asynchronous I/O and location-independent datasets. But PipeOpener can be used with any remote files and provides a simple, portable way of accessing S3 and other such servers.

FileCache is a useful complement for PipeOpener, allowing the construction of pipelines that incrementally download datasets and still provide local file-based access to the dataset.

VitalyFedyunin

Leaving a placeholder review to return to later. Several concerns:

(1) We are not controlling the pool size, so any buffering DataPipe (e.g. shuffle) will spawn a very large number of subprocesses.
(2) Popen is error-prone in terms of security (e.g. pipe: rm -rf /).
(3) It will be hard to implement snapshotting, as we would need to create synchronization points somehow.

tmbdev · 2022-05-20T01:41:40Z

(1) I'm not sure why you think that. You can watch the subprocesses being spawned one at a time by using the verbose flag.

(2) Other parts of torch aren't safe against malicious arguments. Furthermore, use of the class is under developer control.

In any case, how about I add an allow_pipe flag to make this feature more explicit to users? The pipe: feature really is very useful in practice, allowing data access with scripts like "pipe:ssh host cat /..." etc.
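A rough sketch of what such an opt-in could look like follows. The helper name `pipe_command` and its behavior are assumptions made for illustration only; the thread does not show how the flag would actually be implemented.

```python
import shlex


def pipe_command(url, allow_pipe=False):
    """Return argv for a "pipe:" URL, refusing unless explicitly enabled."""
    prefix = "pipe:"
    if not url.startswith(prefix):
        raise ValueError(f"not a pipe: URL: {url!r}")
    if not allow_pipe:
        raise PermissionError("pipe: URLs are disabled; pass allow_pipe=True to enable")
    # Split into argv; running it without a shell avoids ;, && and $(...) expansion.
    argv = shlex.split(url[len(prefix):])
    if not argv:
        raise ValueError("empty pipe: command")
    return argv
```

For example, `pipe_command("pipe:ssh host cat /data/a.tar", allow_pipe=True)` yields `["ssh", "host", "cat", "/data/a.tar"]`. An opt-in like this makes the capability explicit but does not make an arbitrary command safe: a caller who enables it is still trusting every URL in the dataset list.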

(3) To support snapshotting of either compressed or streaming files, the opener class always has to reopen the current file and read it from the start when a snapshot is restored. It is the tar archive reader that needs to deal with resuming in the middle, which it can do by keeping a set of the files it has already returned from the current stream and skipping them.
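The skip-on-resume idea can be sketched with the standard library's `tarfile` module. This is an assumption-laden illustration, not code from the PR: the function name `iter_members` and the `already_returned` parameter are hypothetical, and a real snapshot mechanism would also need to record which file was being read.

```python
import tarfile


def iter_members(stream, already_returned=()):
    """Yield (name, bytes) from a tar stream, skipping names seen before a snapshot."""
    skip = set(already_returned)
    # Mode "r|" reads sequentially, so it also works on non-seekable pipes.
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            if not member.isfile() or member.name in skip:
                continue
            # In stream mode the data must be read before advancing to the next member.
            yield member.name, tar.extractfile(member).read()
```

On restore, the stream is reopened from the beginning and the set of names saved in the snapshot is passed as `already_returned`, so the reader re-reads the skipped bytes but does not emit those samples again.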

Tom added 2 commits May 13, 2022 12:32

merged (f03ad15)
added pipeopener (3e68d6c)

facebook-github-bot added the CLA Signed label May 13, 2022

VitalyFedyunin changed the title from "pipe opener" to "[DataPipe] pipe opener" May 19, 2022

VitalyFedyunin reviewed May 19, 2022
