DID-parallelize a loop split in Python. #3503

Merged
merged 4 commits into from
Dec 10, 2024

@wujingyue wujingyue commented Dec 1, 2024

For #2563

List of major changes:

  1. Add FusionDefinition.sched.set_allocation_as_loop to set the allocation domain of a TensorView to be the same as its loop domain.
  2. Add a convenience helper shard_tensor to be used in unit tests. This leads to some refactoring on the C++ side.
  3. Replace _create_device_mesh with a DeviceMesh constructor. This way, mesh construction doesn't require self.sched and is more flexible.
  4. Add setup/finalizeMultideviceSchedule to set the active fusion for multidevice_schedule.
  5. Add a Python test to exercise DID parallelization of loop domains.
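
For readers new to the idea, here is a conceptual sketch of what DID-parallelizing a loop split amounts to, assuming the usual scheme: an axis is outer-split by the number of devices, and the outer dimension is mapped to device IDs so that each device owns one contiguous shard. This is plain Python and makes no nvFuser calls; `outer_split` and `shard_for_device` are hypothetical names used only for illustration, and the even-divisibility requirement is an assumption of the sketch.

```python
# Conceptual sketch only: plain Python, no nvFuser API is used here.
# `outer_split` and `shard_for_device` are hypothetical helper names.


def outer_split(extent, num_devices):
    """Split an axis of size `extent` into (num_devices, extent // num_devices)."""
    if extent % num_devices != 0:
        # Assumption of this sketch: the axis divides evenly across devices.
        raise ValueError("extent must be divisible by num_devices")
    return num_devices, extent // num_devices


def shard_for_device(data, num_devices, device_id):
    """Return the contiguous slice that `device_id` owns when the outer
    dimension of the split is mapped to device IDs."""
    _, inner = outer_split(len(data), num_devices)
    return data[device_id * inner : (device_id + 1) * inner]


full = list(range(8))
num_devices = 4

# Each device sees only its own shard; the logical tensor is unchanged.
shards = [shard_for_device(full, num_devices, d) for d in range(num_devices)]
```

Concatenating the shards in device order gives back the original data, which is the property the sharding has to preserve.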

@wujingyue wujingyue marked this pull request as draft December 1, 2024 23:22
@wujingyue wujingyue changed the title Attempt to DID-parallelize a loop split in Python. DID-parallelize a loop split in Python. Dec 1, 2024
@wujingyue wujingyue force-pushed the wjy/split branch 2 times, most recently from 50ccaac to 860791e Compare December 2, 2024 04:12
@wujingyue wujingyue marked this pull request as ready for review December 2, 2024 04:12
@wujingyue
Collaborator Author

!test

Collaborator

@jjsjann123 jjsjann123 left a comment

lgtm. The failing tests look like a problem with the CI machines? I restarted the tests 🤞

@wujingyue
Collaborator Author

lgtm. The failing tests look like a problem with the CI machines? I restarted the tests 🤞

H100 tests have been failing in CI for quite a while. I already tagged @xwang233 in #3284 (comment).

Base automatically changed from wjy/comm to main December 9, 2024 23:30
@wujingyue
Collaborator Author

!test

@wujingyue
Collaborator Author

!test

@wujingyue wujingyue merged commit 98352c4 into main Dec 10, 2024
48 checks passed
@wujingyue wujingyue deleted the wjy/split branch December 10, 2024 15:32