Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce DALI proxy #5726

Open
wants to merge 22 commits into
base: main
Choose a base branch
from
Open

Conversation

jantonguirao
Copy link
Contributor

@jantonguirao jantonguirao commented Nov 27, 2024

Category:

New feature

Description:

  • DALI proxy is a new way to integrate DALI pipelines with existing torch data loading pipelines
  • The idea is that torch data processes send data to the main process, where it is processed by DALI before handling it over to the training loop.
  • The solution allows for mixing data loading from Pytorch with partial processing on DALI

Co-author: @mdabek-nvidia

Additional information:

Affected modules and functionalities:

  • Torch plugin

Key points relevant for the review:

dali/python/nvidia/dali/plugin/pytorch/init.py

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A
    Added options to run with DALI proxy to RN50 and EfficientNet examples

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • [?] RST
    • [?] Jupyter
    • Other
  • N/A
    TODO

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [20865442]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [20865442]: BUILD PASSED

@jantonguirao jantonguirao changed the title Dali proxy2 Introduce DALI proxy Nov 28, 2024
@jantonguirao jantonguirao marked this pull request as ready for review November 28, 2024 09:21
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
@szkarpinski
Copy link
Collaborator

I see there are no tests except for the resnet50 example. I believe we should have normal TL0 tests as well.

@szkarpinski
Copy link
Collaborator

Did you test error propagation between the processes? Multiprocessing doesn't automagically propagate exceptions afaik. Maybe we should have tests to check how are the errors in particular processes reported?

@jantonguirao jantonguirao force-pushed the dali_proxy2 branch 2 times, most recently from caf2d0b to 0472490 Compare November 29, 2024 10:28
Copy link
Contributor

@klecki klecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Posting few comments as the code started moving.
I also feel like we should limit the scope of the API and hide most of the implementation.

dali/python/setup.py.in Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
dali/python/nvidia/dali/plugin/pytorch/__init__.py Outdated Show resolved Hide resolved
@jantonguirao jantonguirao force-pushed the dali_proxy2 branch 7 times, most recently from c4a7f74 to 1a50983 Compare December 3, 2024 17:52
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [21050826]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [21050826]: BUILD FAILED

@jantonguirao jantonguirao force-pushed the dali_proxy2 branch 2 times, most recently from 3bdfd37 to 014032d Compare December 3, 2024 18:32
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [21052236]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [21052236]: BUILD FAILED

Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
@jantonguirao jantonguirao force-pushed the dali_proxy2 branch 3 times, most recently from 75f5921 to 32983fe Compare December 24, 2024 12:16
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [21819002]: BUILD STARTED

Signed-off-by: Joaquin Anton Guirao <[email protected]>
Signed-off-by: Joaquin Anton Guirao <[email protected]>
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [21823100]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [21823100]: BUILD PASSED

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants