eraseInputDistinctRootDomains supports general logical-to-allocation transforms #3458

Merged: 10 commits, Nov 26, 2024
6 changes: 0 additions & 6 deletions csrc/dynamic_transform.cpp

@@ -1048,12 +1048,6 @@ void DynamicTransformConcretizer::mutate(TensorView* tv) {
   // check the root to logical transforms to be sure we have concretized any
   // intermediate IterDomains.

-  // At this point, there should be no expr beyond rfactor root
-  NVF_ERROR(
-      tv->getLoopDomain() == tv->getLogicalDomain(),
-      "Invalid tensor: ",
-      tv->toString());
-
   // If it has a root domain, the IterTypes of the logical
   // IDs may need to be updated as well. Traverse the rfactor exprs
   // and mutate the IterTypes of output IDs if symbolic.
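For context, the assertion removed above is what used to reject inputs whose loop domain differs from their logical domain. A minimal sketch, mirroring the tests added at the end of this PR, of a TensorView that now legitimately reaches this code path (the helpers are nvFuser test utilities):

TensorView* in = makeContigConcreteTensor({6});      // logical: {i0}
in->split(0, 2);                                     // loop: {ceilDiv(i0, 2), 2}
in->setAllocationDomain(in->getLoopDomain(), true);
// Now in->getLoopDomain() != in->getLogicalDomain(), so the deleted
// NVF_ERROR would have fired when concretization visited this tensor.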
51 changes: 27 additions & 24 deletions csrc/fusion_segmenter.cpp

@@ -5,6 +5,9 @@
  * SPDX-License-Identifier: BSD-3-Clause
  */
 // clang-format on
+#include <algorithm>
+#include <sstream>
+
 #include <debug.h>
 #include <fusion.h>
 #include <fusion_segmenter.h>
@@ -20,9 +23,7 @@
 #include <options.h>
 #include <scheduler/debug_utils.h>
 #include <scheduler/normalization_utils.h>
-#include <algorithm>
-
-#include <sstream>
+#include <transform_iter.h>

 namespace nvfuser {

@@ -1860,34 +1861,36 @@ void eraseInputDistinctRootDomains(Fusion* fusion) {
       }
     }

     NVF_ERROR(new_logical_domain.size() == tv->domain()->contiguity().size());
     TensorDomain* new_td = nullptr;

     if (tv->domain()->hasAllocation()) {
       // we need to reorder the logical domain into allocation domain
       // consistently with the mapping from the old TensorView logical domain to
       // its allocation domain
-      const auto& alloc = tv->getAllocationDomain();
-      NVF_ERROR(
-          alloc.size() == logical.size(),
-          "size between logical and alloc doesn't match");
-      const auto rank = alloc.size();
-      std::vector<int64_t> stride_order(rank, -1);
-      for (auto i : c10::irange(rank)) {
-        bool found_match = false;
-        for (auto j : c10::irange(rank)) {
-          if (alloc[i] == logical[j]) {
-            stride_order[j] = static_cast<int64_t>(rank - 1 - i);
-            found_match = true;
-            break;
-          }
-        }
-        NVF_ERROR(
-            found_match,
-            "cannot match IterDomain between allocation domain to logical domain");
-      }
+      std::unordered_map<IterDomain*, IterDomain*> old_to_new;
+      for (const auto i : c10::irange(logical.size())) {
+        old_to_new.emplace(logical[i], new_logical_domain[i]);
+      }
+
+      ReplayTransformations replay(tv->getAllocationDomain(), old_to_new);
+      // Without this,
+      // https://github.com/NVIDIA/Fuser/blob/e613929a6c21b3095c8817b01b8f177096a26e60/csrc/transform_iter.cpp#L299
+      // tries to look for root IDs in the map, which shouldn't exist because
+      // the whole purpose of this function is to remove the root domain.
+      replay.setErrorOnFailure(false);
+      // Should we replay.setReplayRFactor(true)? I guess the logical domain
+      // shouldn't be rfactor any more because it becomes the root, but maybe
+      // other IterDomains should inherit rfactor?
+      std::vector<IterDomain*> new_alloc;
+      new_alloc.reserve(tv->getAllocationDomain().size());
+      for (IterDomain* alloc_id : tv->getAllocationDomain()) {
+        new_alloc.push_back(replay.getReplay().at(alloc_id));
+      }
       new_td = IrBuilder::create<TensorDomain>(
-          new_logical_domain, stride_order, tv->domain()->contiguity());
+          /*root_domain=*/std::vector<IterDomain*>(),
+          new_logical_domain,
+          new_alloc,
+          /*loop_domain=*/new_alloc,
+          tv->domain()->contiguity());
     } else {
       new_td = IrBuilder::create<TensorDomain>(
           new_logical_domain, tv->domain()->contiguity());
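For readers unfamiliar with ReplayTransformations, here is a hedged sketch of the replay step in isolation, using only the calls that appear in this diff; the IterDomain names (i0, j0, and so on) are hypothetical and for illustration only:

// Suppose the old logical domain is {i0}, the old allocation domain is
// {i0o, i0i} (produced by splitting i0), and new_logical_domain is {j0},
// freshly built with no root domain.
std::unordered_map<IterDomain*, IterDomain*> old_to_new;
old_to_new.emplace(i0, j0);

ReplayTransformations replay({i0o, i0i}, old_to_new);
// The old root IDs never appear in old_to_new, so failing to find them
// must not be fatal.
replay.setErrorOnFailure(false);

// The replay re-applies the split on top of j0, so each old allocation ID
// maps to a structurally equivalent new one:
IterDomain* j0o = replay.getReplay().at(i0o);  // outer output of j0's split
IterDomain* j0i = replay.getReplay().at(i0i);  // inner output of j0's split

The new TensorDomain is then created with {j0} as the logical domain (and no separate root, hence the empty root_domain argument) and {j0o, j0i} as both the allocation and loop domains.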
2 changes: 1 addition & 1 deletion csrc/ir/nodes.cpp

@@ -3251,7 +3251,7 @@ std::string TensorDomain::toString(const int indent_size, const bool loop_only)
   }
   ss << "," << std::endl;
   indent(ss, indent_size + 1)
-      << "rfactor=[ " << toDelimitedString(logical()) << " ]";
+      << "logical=[ " << toDelimitedString(logical()) << " ]";
   if (!allocation_domain_.empty()) {
     ss << "," << std::endl;
     indent(ss, indent_size + 1)
53 changes: 51 additions & 2 deletions tests/cpp/test_allocation_domain.cpp

@@ -1384,17 +1384,20 @@ TEST_F(AllocationDomainTest, ReductionVectorization) {
 }

 TEST_F(AllocationDomainTest, ClearReductionIterDomainsPatch) {
-  auto fusion = std::make_unique<Fusion>();
-  FusionGuard fg(fusion.get());
+  Fusion fusion;
+  FusionGuard fg(&fusion);
+
   auto tv0 = TensorViewBuilder()
                  .ndims(3)
                  .shape({-1, 1, -1})
                  .contiguity({true, std::nullopt, true})
                  .build();
   auto tv1 = sum(tv0, {2});
+
   tv1->setAllocationDomain(
       {tv1->axis(1), tv1->axis(2), tv1->axis(0)},
       {std::nullopt, std::nullopt, true});
+
   // copy entries from old domain for validation later
   std::vector<IterDomain*> logical_copy = tv1->getLogicalDomain();
   std::vector<IterDomain*> alloc_copy = tv1->getAllocationDomain();
@@ -1414,4 +1417,50 @@ TEST_F(AllocationDomainTest, ClearReductionIterDomainsPatch) {
       tv1->getContiguity(), ElementsAre(contig_copy[0], contig_copy[2]));
 }

+TEST_F(AllocationDomainTest, InputAllocationIsSplit_Concrete) {
+  auto fusion = std::make_unique<Fusion>();
+  FusionGuard fg(fusion.get());
+
+  TensorView* in = makeContigConcreteTensor({6});
+  TensorView* out = set(in);
+  in->split(0, 2);
+  in->setAllocationDomain(in->getLoopDomain(), true);
+
+  fusion->addInput(in);
+  fusion->addOutput(out);
+
+  FusionExecutorCache executor_cache(std::move(fusion));
+  auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA);
+  at::Tensor in_tensor = at::randn({6}, options);
+  auto out_tensors = executor_cache.runFusionWithInputs({in_tensor});
+
+  testValidate(
+      executor_cache.fusion(), out_tensors, {in_tensor}, __LINE__, __FILE__);
+}
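Note: a plain contiguous tensor remains a valid input here even though the allocation domain is split, because splitting the only axis of a contiguous 1-D allocation by a factor that divides its extent leaves the underlying memory layout unchanged.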

+// The test fails as is. The symbolic IterDomains in loop/allocation are not
+// concretized. I tried to change DynamicTransformConcretizer::mutate to grab
+// all expressions between root and allocation but still couldn't get it to
+// work.
Comment on lines +1441 to +1444:

Collaborator: I'm not sure why this could cause that problem. It may be just because of the concern I mentioned below.

wujingyue (author) on Nov 21, 2024: This issue persisted after I moved the split after addInput/addOutput. FYI, below is my failed attempt to fix dynamic transform for this test.

diff --git a/csrc/dynamic_transform.cpp b/csrc/dynamic_transform.cpp
index 24404db8..d287149d 100644
--- a/csrc/dynamic_transform.cpp
+++ b/csrc/dynamic_transform.cpp
@@ -1056,7 +1056,7 @@ void DynamicTransformConcretizer::mutate(TensorView* tv) {
     // beyond the logical domain as asserted above
     auto all_id_exprs = StmtSort::getExprsBetween(
         {tv->getRootDomain().begin(), tv->getRootDomain().end()},
-        {tv->getLogicalDomain().begin(), tv->getLogicalDomain().end()});
+        {tv->getMaybeAllocationDomain().begin(), tv->getMaybeAllocationDomain().end()});
     for (auto expr : all_id_exprs) {
       // Assume outputs of IterDomain exprs are always IterDomains. If
       // the assumption is invalidated, the logic here would need to

jacobhinkle on Nov 22, 2024: Previously we had assumed there were no loop or allocation transforms, so that each of those would just be a permutation of logical. I think what we should do is propagate from root to all of the other domains (logical, loop, and allocation). IRBFS is not actually needed, since we can safely assume at concretization that we don't have an uncommon situation like a loop domain that's a producer of transforms that lead to the root, instead of the other direction. If we assume that root is a producer for all of the other domains, we can just use StmtSort::getExprsBetween as above, but we need to pass not just the logical or allocation domain but all three other domains as the "to" argument, i.e.

std::unordered_set<Val*> to{tv->getLogicalDomain().begin(), tv->getLogicalDomain().end()};
to.insert(tv->getMaybeAllocationDomain().begin(), tv->getMaybeAllocationDomain().end());
to.insert(tv->getLoopDomain().begin(), tv->getLoopDomain().end());
auto all_id_exprs = StmtSort::getExprsBetween(
    {tv->getRootDomain().begin(), tv->getRootDomain().end()}, to);

wujingyue (author): Did you try it? While I agree with what you said, I doubt that helps this particular test case, where root->maybeAllocation includes all expressions. Btw, I believe TensorDomain::allExprs() can be used to capture all Exprs in a TensorView.
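A hedged sketch of the alternative being suggested, assuming TensorDomain::allExprs() returns every IterDomain expression in the TensorDomain in topological order (the loop body stands in for the existing IterType-mutation logic in mutate()):

for (Expr* expr : tv->domain()->allExprs()) {
  // Concretize symbolic output IterDomains of each expression, as the
  // current root-to-logical traversal does.
}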

jacobhinkle on Nov 22, 2024: Yes, the issue here is that allocation and logical are disconnected, because a replacement has been performed on logical but not on allocation. I think the issue is that StmtSort::getStmts only gives us the logical domains of the input TVs, whereas we should expect it to provide all IDs for processing before the input TensorDomain and the TV itself.

Collaborator: Ha, looks like it's actually just loop domains:

https://github.com/NVIDIA/Fuser/blob/main/csrc/iter_visitor.cpp#L73

We should probably change this to visit all of the root, logical, and loop domains.

Collaborator: In this particular case, though, loop is equal to allocation. I originally thought this was the issue here. I agree that that line is a problem if we have unconventional tensor domains, like loops that are producers of logical, and we should ensure all the domains are available there, but I think in this case the bad behavior is specific to input TVs.

Collaborator: @wujingyue Is this a blocker?

wujingyue (author): Not for this PR. I can stick with concrete sizes for the tests for now.

+TEST_F(AllocationDomainTest, DISABLED_InputAllocationIsSplit_Symbolic) {
+  auto fusion = std::make_unique<Fusion>();
+  FusionGuard fg(fusion.get());
+
+  TensorView* in = makeContigTensor(1);
+  TensorView* out = set(in);
+  in->split(0, 2);
+  in->setAllocationDomain(in->getLoopDomain(), true);
+
+  fusion->addInput(in);
+  fusion->addOutput(out);
+
+  FusionExecutorCache executor_cache(std::move(fusion));
+  auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA);
+  at::Tensor in_tensor = at::randn({6}, options);
+  auto out_tensors = executor_cache.runFusionWithInputs({in_tensor});
+
+  testValidate(
+      executor_cache.fusion(), out_tensors, {in_tensor}, __LINE__, __FILE__);
+}

 } // namespace nvfuser