eraseInputDistinctRootDomains supports general logical-to-allocation transforms #3458

Merged: 10 commits merged into main from wjy/root on Nov 26, 2024

Conversation

@wujingyue (Collaborator, Author) commented on Nov 21, 2024:

This is a spin-off from #3444.

The current code assumes that logical-to-allocation has to be a permutation. This assumption will no longer hold with #2563, so this PR extends eraseInputDistinctRootDomains to support more general transforms.

This can also happen in single-GPU fusions, although it's less common there. The tests added in this PR are single-GPU only because #3444 hasn't landed yet; #3444 will add multi-GPU tests.
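
For illustration, here is a minimal, hypothetical sketch (not one of the tests added in this PR) of a fusion where logical-to-allocation is more than a permutation, written with the usual nvFuser C++ test helpers:

// Hypothetical sketch: the output's allocation domain is derived from its
// logical domain via a split, so logical-to-allocation is not a permutation.
Fusion fusion;
FusionGuard fg(&fusion);

TensorView* in = makeContigConcreteTensor({2, 6});
fusion.addInput(in);
TensorView* out = set(in);
fusion.addOutput(out);

// Split the inner dimension and use the resulting loop domain as the
// allocation domain.
out->split(1, 3);
out->setAllocationDomain(out->getLoopDomain(), /*new_contiguity=*/true);

If a fusion like this gets segmented, eraseInputDistinctRootDomains presumably needs to replay the split on the replaced domains of a segment input rather than merely permute IterDomains.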

@wujingyue (Author):

!test

csrc/fusion_segmenter.cpp (review thread resolved)
Comment on lines +1441 to +1444:
// The test fails as is. The symbolic IterDomains in loop/allocation are not
// concretized. I tried to change DynamicTransformConcretizer::mutate to grab
// all expressions between root and allocation but still couldn't get it to
// work.

Collaborator:

I'm not sure why this could cause that problem. It may be just because of the concern I mentioned below.

@wujingyue (Author), Nov 21, 2024:

This issue persisted after I moved the split after addInput/addOutput. FYI, below is my failed attempt to fix dynamic transform for this test.

diff --git a/csrc/dynamic_transform.cpp b/csrc/dynamic_transform.cpp
index 24404db8..d287149d 100644
--- a/csrc/dynamic_transform.cpp
+++ b/csrc/dynamic_transform.cpp
@@ -1056,7 +1056,7 @@ void DynamicTransformConcretizer::mutate(TensorView* tv) {
     // beyond the logical domain as asserted above
     auto all_id_exprs = StmtSort::getExprsBetween(
         {tv->getRootDomain().begin(), tv->getRootDomain().end()},
-        {tv->getLogicalDomain().begin(), tv->getLogicalDomain().end()});
+        {tv->getMaybeAllocationDomain().begin(), tv->getMaybeAllocationDomain().end()});
     for (auto expr : all_id_exprs) {
       // Assume outputs of IterDomain exprs are always IterDomains. If
       // the assumption is invalidated, the logic here would need to

@jacobhinkle (Collaborator), Nov 22, 2024:

Previously we assumed there were no loop or allocation transforms, so each of those domains would just be a permutation of logical. I initially thought we should use IRBFS here to propagate from root to all of the other domains (logical, loop, and allocation). Actually, IRBFS isn't needed: at concretization we can safely assume we don't have an uncommon situation like a loop domain that is a producer of transforms leading to the root rather than the other direction. If we assume that root is a producer for all of the other domains, we can just use StmtSort::getExprsBetween like above, but we need to pass not just the logical or allocation domain but all three other domains as the "to" argument, i.e.:

std::unordered_set<Val*> to{tv->getLogicalDomain().begin(), tv->getLogicalDomain().end()};
to.insert(tv->getMaybeAllocationDomain().begin(), tv->getMaybeAllocationDomain().end());
to.insert(tv->getLoopDomain().begin(), tv->getLoopDomain().end());
auto all_id_exprs =
    StmtSort::getExprsBetween({tv->getRootDomain().begin(), tv->getRootDomain().end()}, to);

@wujingyue (Author):

Did you try it? While I agree with what you said, I doubt it helps this particular test case, where root-to-maybe-allocation already includes all expressions. Btw, I believe TensorDomain::allExprs() (std::vector<Expr*> TensorDomain::allExprs() const) can be used to capture all Exprs in a TensorView.
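
For example, a hedged one-liner of what that could look like in DynamicTransformConcretizer::mutate, assuming allExprs() covers everything between root and loop/allocation:

// Hypothetical alternative to the StmtSort::getExprsBetween call in the diff
// above, assuming TensorDomain::allExprs() returns every IterDomain
// expression within this TensorView's domains:
std::vector<Expr*> all_id_exprs = tv->domain()->allExprs();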

@jacobhinkle (Collaborator), Nov 22, 2024:

Yes, the issue here is that allocation and logical are disconnected because a replacement has been performed on logical but not on allocation. I think the problem is that StmtSort::getStmts only gives us the logical domains of the input TVs, whereas we would expect it to provide all IDs for processing before the input TensorDomain and the TV itself.

Collaborator:

Ha, looks like it's actually just loop domains.

https://github.com/NVIDIA/Fuser/blob/main/csrc/iter_visitor.cpp#L73

We should probably change this to visit all of root, logical and loop domains.
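
For illustration, a hypothetical sketch (not the actual iter_visitor.cpp code) of gathering IDs from all three domains, using only the TensorView accessors already quoted in this thread:

// Hypothetical sketch: collect IterDomains from the root, logical, and loop
// domains instead of only the loop domain, so replacements on any of them
// get visited.
std::vector<Val*> ids_to_visit;
for (const std::vector<IterDomain*>& dom :
     {tv->getRootDomain(), tv->getLogicalDomain(), tv->getLoopDomain()}) {
  ids_to_visit.insert(ids_to_visit.end(), dom.begin(), dom.end());
}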

Collaborator:

In this particular case, though, loop is equal to allocation; I originally thought that was the issue here. I agree that that line is a problem if we have unconventional tensor domains, e.g., loop domains that are producers of logical, and we should make sure all the domains are available there, but I think in this case the bad behavior is specific to input TVs.

Collaborator:

@wujingyue Is this a blocker?

@wujingyue (Author):

Not for this PR. I can stick with concrete sizes for the tests for now.

@wujingyue (Author):

!test

@wujingyue (Author):

!test

csrc/fusion_segmenter.cpp (3 review threads, outdated and resolved)
@wujingyue requested a review from naoyam on November 22, 2024 at 07:12.
csrc/ir/nodes.cpp (review thread resolved)
tests/cpp/test_allocation_domain.cpp (review thread resolved)
@wujingyue (Author):

!test

@naoyam (Collaborator) left a review:

LGTM

@wujingyue (Author):

!test

@wujingyue merged commit 58e1514 into main on Nov 26, 2024 (36 of 37 checks passed).
@wujingyue deleted the wjy/root branch on November 26, 2024 at 03:49.
@wujingyue added the enhancement (New feature or request) label on Nov 28, 2024.