Allgather with DID loop split #3284

Merged · merged 23 commits · Dec 9, 2024
Changes from 16 commits
4 changes: 2 additions & 2 deletions csrc/multidevice/communication.cpp
@@ -325,8 +325,8 @@ c10::intrusive_ptr<c10d::Work> postAllgather(
c10d::Backend* backend,
at::Tensor input_tensor,
at::Tensor output_tensor) {
auto splits = at::split(output_tensor, /*split_size=*/1, /*dim=*/0);
assertBufferCount(splits, communication->team().size());
auto splits =
at::tensor_split(output_tensor, communication->team_size(), /*dim=*/0);
assertBuffersHaveSameSize({input_tensor}, splits);
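  // Aside (not part of this diff): the two ATen calls chunk differently. Assuming
  // dim 0 of the gathered output has extent team_size * k:
  //   at::split(t, /*split_size=*/1, /*dim=*/0)  -> chunks of extent 1; there are only
  //     team_size of them when dim 0 == team_size, i.e. when there is no loop split;
  //   at::tensor_split(t, team_size, /*dim=*/0)  -> exactly team_size chunks of extent
  //     k, one per rank, which is what a DID loop split needs.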

// allgather primitive in c10d induces extra buffering time to copy out the
5 changes: 5 additions & 0 deletions csrc/multidevice/communication.h
@@ -90,6 +90,11 @@ class Communication : public Expr {
return attribute<Team>(1);
}

// A convenience helper so the user doesn't need to convert size_t to int64_t.
int64_t team_size() const {
return static_cast<int64_t>(team().size());
}
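  // Aside (not in the diff): with this helper, the call site in communication.cpp reads
  //   at::tensor_split(output_tensor, communication->team_size(), /*dim=*/0);
  // instead of converting team().size() (a size_t) to int64_t at every call site.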

DeviceIdxType root() const {
return attribute<DeviceIdxType>(2);
}
2 changes: 1 addition & 1 deletion csrc/multidevice/lower_communication.cpp
@@ -196,7 +196,7 @@ void lowerToReduceScatter(
std::vector<Communication*>& comms) {
const DeviceMesh& mesh = input_tv->getDeviceMesh();
auto reduction_axis = output_tv->getReductionAxis().value();
auto scattered_axis = getShardedAxis(output_tv);
auto scattered_axis = getShardedAxis(output_tv, ParallelType::DIDx);
Collaborator:
OK, however if the sharded dimension is split, then scattered_axis is not valid here, right?

Collaborator (author):
I can't think of an immediate problem and #3504 apparently works fine. Could be incidental and I'm happy to hear what you think is problematic.

Collaborator:
Correct me if I'm wrong, but I think this is an example where we see the problem:

d=num_devices;

tv0 [d, i1];
tv1 = sum(tv0, axis=0); // tv1 [r{i0}, i1]

tv0->axis(0)->parallelize(DIDx);
tv1->axis(1)->split(d); // [r{i0}, i1/d, d]
tv1->axis(2)->parallelize(DIDx);

In this case, the scattered axis is 2 but getShardedAxis returns 1.

Collaborator (author) @wujingyue, Dec 6, 2024:
In your case,

tv0:
  logical: [iDID{i0}, i{i1}]
tv1:
  logical: [r{i0}, i{i1}]
  allocation: [r{i0}, i{i1/d}, iDID{d}]

getShardedLogicalAxis will return 0, the tensor axis being sharded. This is correct because the output at::Tensor for tv1 will be of shape [i1/d] and indeed axis 0 is the sharded dimension. Then, scattered_axis=0 will be used to compute which input tensor axis will be sharded (which will be 1). Finally, that input scattered axis (1) will be used to split the input tensor of shape [1, i1].

Caveat: With 7cf2384, DID'ing an inner split is disallowed by code. So the above case will actually throw an exception. But what I said should be correct after we lift that limitation.
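To trace the example above through the new utilities (illustrative only; the extents are hypothetical and this is not code from the PR):

  tv1 allocation: [r{i0}, i{i1/d}, iDID{d}], local at::Tensor shape: [i1/d]
  getShardedAxis(tv1, ParallelType::DIDx):
    allocation DID id                       -> iDID{d}
    its logical input                       -> i{i1}   (via IterVisitor::getInputsTo)
    non-reduction logical ids before i{i1}  -> 0       (r{i0} is a reduction)
    returns 0
  unshardedSizes(tv1, {i1/d}):
    multiplies axis 0 by the DIDx mesh size d
    returns {i1}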

// The output tensor is sharded on scattered_axis and needs to be mapped
// back onto the input. The input has an reduced axis, so the scattered axis
// is adjusted to account for this. Ex: [DIDx(i0), i1] -> [r0, DIDx(i1)] The
134 changes: 66 additions & 68 deletions csrc/multidevice/utils.cpp
@@ -121,48 +121,77 @@ bool isSharded(const TensorView* tv) {
return is_sharded;
}

std::vector<int64_t> unshardedSizes(
const TensorView* tv,
c10::IntArrayRef sizes) {
std::vector<int64_t> unsharded_sizes = sizes.vec();

for (IterDomain* alloc_id : tv->getMaybeAllocationDomain()) {
const ParallelType parallel_type = alloc_id->getParallelType();
namespace {
// Collect device-parallel IterDomains in `domain` and return them as a
// ParallelType-to-IterDomain map.
std::unordered_map<ParallelType, IterDomain*> mapDeviceParallelTypeToId(
const std::vector<IterDomain*>& domain) {
std::unordered_map<ParallelType, IterDomain*> parallel_type_to_id;
parallel_type_to_id.reserve(kParallelTypeDIDs.size());
for (IterDomain* id : domain) {
const ParallelType parallel_type = id->getParallelType();
if (!isParallelTypeDeviceDim(parallel_type)) {
continue;
}

const auto inputs = IterVisitor::getInputsTo(
{alloc_id},
{tv->getLogicalDomain().begin(), tv->getLogicalDomain().end()});
NVF_ERROR(
!inputs.empty(),
"IterVisitor::getInputsTo shouldn't return empty unless `of` is empty.");
NVF_ERROR(
inputs.size() == 1,
"Failed to find the single logical input to ",
alloc_id,
". This is likely because there's a Merge expression from logical to allocation, which isn't supported. Inputs are: ",
toDelimitedString(inputs));

const auto iter = std::find(
tv->getLogicalDomain().begin(),
tv->getLogicalDomain().end(),
inputs[0]);
NVF_ERROR(
iter != tv->getLogicalDomain().end(),
"The found input IterDomain isn't logical. This is likely because logical doesn't dominate allocation: ",
inputs[0]);

// Count the number of non-reduction IterDomains before `iter`. Reduction
// IterDomains are not materialized in the at::Tensor's shape.
const auto index = std::count_if(
tv->getLogicalDomain().begin(), iter, [](IterDomain* id) -> bool {
return !id->isReduction();
});
unsharded_sizes.at(index) *= tv->getDeviceMesh().size(parallel_type);
parallel_type_to_id.try_emplace(parallel_type, id).second,
"Found multiple loop IterDomains with the same parallel type (",
parallel_type,
"): ",
toDelimitedString(domain));
}
return parallel_type_to_id;
}
} // namespace

int64_t getShardedAxis(const TensorView* tv, const ParallelType parallel_type) {
std::unordered_map<ParallelType, IterDomain*> parallel_type_to_id =
mapDeviceParallelTypeToId(tv->getMaybeAllocationDomain());
IterDomain* alloc_id = getOrDefault(parallel_type_to_id, parallel_type);
if (alloc_id == nullptr) {
return -1;
}

const auto inputs = IterVisitor::getInputsTo(
{alloc_id},
{tv->getLogicalDomain().begin(), tv->getLogicalDomain().end()});
NVF_ERROR(
!inputs.empty(),
"IterVisitor::getInputsTo shouldn't return empty unless `of` is empty.");
NVF_ERROR(
inputs.size() == 1,
"Failed to find the single logical input to ",
alloc_id,
". This is likely because there's a Merge expression from logical to allocation, which isn't supported. Inputs are: ",
toDelimitedString(inputs));

const auto iter = std::find(
tv->getLogicalDomain().begin(), tv->getLogicalDomain().end(), inputs[0]);
NVF_ERROR(
Collaborator:
I'm not sure I understand why this check is needed. Isn't it true that, by assumption, what getInputsTo returns is an element of tv->getLogicalDomain()?

Collaborator:
Also, I'm not sure what is meant by "dominate" in the error message.

Collaborator (author):
Re "dominate": see https://en.wikipedia.org/wiki/Dominator_(graph_theory); I extended the concept to a set of nodes dominating another set.

Re the check: I heard from @naoyam that logical won't always dominate allocation with "the new indexing system".

iter != tv->getLogicalDomain().end(),
"The found input IterDomain isn't logical. This is likely because logical doesn't dominate allocation: ",
inputs[0]);

// Count the number of non-reduction IterDomains before `iter`. Reduction
// IterDomains are not materialized in the at::Tensor's shape.
return std::count_if(
tv->getLogicalDomain().begin(), iter, [](IterDomain* id) -> bool {
return !id->isReduction();
});
}

std::vector<int64_t> unshardedSizes(
const TensorView* tv,
c10::IntArrayRef sizes) {
std::vector<int64_t> unsharded_sizes = sizes.vec();
for (ParallelType parallel_type : kParallelTypeDIDs) {
const int64_t sharded_axis = getShardedAxis(tv, parallel_type);
if (sharded_axis == -1) {
continue;
}
unsharded_sizes.at(sharded_axis) *= tv->getDeviceMesh().size(parallel_type);
}
return unsharded_sizes;
}
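// Illustration (not part of the diff): for a hypothetical tv with
//   logical:    [i{m}, i{n}]
//   allocation: [iDIDy{dy}, i{m/dy}, iDIDx{dx}, i{n/dx}]
// and local sizes {m/dy, n/dx}, the loop above finds sharded_axis 0 for DIDy and
// sharded_axis 1 for DIDx, returning the unsharded sizes {m, n}.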

@@ -174,27 +203,6 @@ int64_t numDeviceDims(const TensorView* tv) {
}

namespace {
// Collect device-parallel IterDomains in `loop_domain` and return them as a
// ParallelType-to-IterDomain map.
std::unordered_map<ParallelType, IterDomain*> mapParallelTypeToId(
const std::vector<IterDomain*>& loop_domain) {
std::unordered_map<ParallelType, IterDomain*> parallel_type_to_id;
parallel_type_to_id.reserve(kParallelTypeDIDs.size());
for (IterDomain* loop_id : loop_domain) {
const ParallelType parallel_type = loop_id->getParallelType();
if (!isParallelTypeDeviceDim(parallel_type)) {
continue;
}

NVF_ERROR(
parallel_type_to_id.try_emplace(parallel_type, loop_id).second,
"Found multiple loop IterDomains with the same parallel type (",
parallel_type,
"): ",
toDelimitedString(loop_domain));
}
return parallel_type_to_id;
}

std::vector<IterDomain*> getInputsInTargetDomain(
IterDomain* loop_id,
@@ -294,9 +302,9 @@ bool haveDifferentShardings(
// 3. Check if the two loop IterDomains are almost-exactly mapped in the
// IdModel.
std::unordered_map<ParallelType, IterDomain*> p_parallel_type_to_id =
mapParallelTypeToId(producer->getLoopDomain());
mapDeviceParallelTypeToId(producer->getLoopDomain());
std::unordered_map<ParallelType, IterDomain*> c_parallel_type_to_id =
mapParallelTypeToId(consumer->getLoopDomain());
mapDeviceParallelTypeToId(consumer->getLoopDomain());

for (const auto parallel_type : kParallelTypeDIDs) {
IterDomain* p_loop_id = getOrDefault(p_parallel_type_to_id, parallel_type);
@@ -502,16 +510,6 @@ std::set<DeviceIdxType> involvedDevices(Expr* expr) {
return ret;
}

int64_t getShardedAxis(TensorView* tv) {
auto ids = TensorDomain::noReductions(tv->getLogicalDomain());
for (size_t i = 0; i < ids.size(); ++i) {
if (ids[i]->getParallelType() == ParallelType::DIDx) {
return static_cast<int64_t>(i);
}
}
return -1;
}

void reorderDIDToFront(TensorView* tv) {
// new position to old position
std::unordered_map<int64_t, int64_t> order_map;
6 changes: 3 additions & 3 deletions csrc/multidevice/utils.h
@@ -123,9 +123,9 @@ int64_t requestedNumberOfDevices(Fusion*);
void unshard(Fusion*);
void unshard(TensorView*);

// Returns the index of the a sharded axis if none return -1.
// TODO: Assumes no merges/splits on sharded axis.
int64_t getShardedAxis(TensorView*);
// Returns the index of the sharded logical axis corresponding to
// `parallel_type`. If `tv` isn't sharded on the parallel type, returns -1.
int64_t getShardedAxis(const TensorView* tv, ParallelType parallel_type);

// Reorders a TensorView so that the DID parallelized axis are in front.
void reorderDIDToFront(TensorView*);
11 changes: 3 additions & 8 deletions tests/cpp/multidevice.cpp
@@ -128,7 +128,8 @@ at::Tensor MultiDeviceTest::shardTensor(at::Tensor tensor, TensorView* tv) {
return tensor;
}
NVF_ERROR(tv->hasDeviceMesh(), "`tv` has no DeviceMesh: ", tv);
return shardTensor(tensor, getShardedAxis(tv), tv->getDeviceMesh());
return shardTensor(
tensor, getShardedAxis(tv, ParallelType::DIDx), tv->getDeviceMesh());
}

at::Tensor MultiDeviceTest::shardTensor(
@@ -144,13 +145,7 @@ at::Tensor MultiDeviceTest::shardTensor(
auto stride = extent / nslices;
// TODO: returning slice 0 temporarily when device is not in the mesh.
i = (i < 0) ? 0 : i;
auto slice = tensor.slice(axis, i * stride, (i + 1) * stride).contiguous();
// Temporary until https://github.com/NVIDIA/Fuser/issues/2563. Adds DIDx
// axis in front representing the sharded extent of the tensor.
if (stride > 1) {
slice = slice.unsqueeze(0);
}
return slice;
return tensor.slice(axis, i * stride, (i + 1) * stride).contiguous();
}
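// Illustration (not from the diff): with the unsqueeze gone, sharding a tensor of
// shape [d * k] along axis 0 across a mesh of d devices yields a shape-[k] slice on
// each device, which matches an allocation domain of [iDIDx{d}, i{k}] produced by an
// outer DID loop split.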

} // namespace nvfuser
32 changes: 32 additions & 0 deletions tests/cpp/test_multidevice_lower_communication.cpp
@@ -202,6 +202,38 @@ TEST_F(LowerCollectiveTest, Allgather) {
EXPECT_TRUE(at::equal(out_tensor, unsharded_tensor));
}

TEST_F(LowerCollectiveTest, Allgather_LoopSplit) {
auto fusion = std::make_unique<Fusion>();
FusionGuard fg(fusion.get());

const auto num_devices = communicator_->size();
auto mesh = DeviceMesh::createForNumDevices(num_devices);

TensorView* in = makeContigTensor(1);
in->setDeviceMesh(mesh);
TensorView* out = set(in);
fusion->addInput(in);
fusion->addOutput(out);

in->split(0, num_devices, /*inner_split=*/false);
in->axis(0)->parallelize(ParallelType::DIDx);
in->setAllocationDomain(in->getLoopDomain(), true);

out->split(0, num_devices, /*inner_split=*/false);
out->setAllocationDomain(out->getLoopDomain(), true);

at::Tensor unsharded_tensor =
at::randn({num_devices * kTensorSize}, at::kFloat);
at::Tensor in_tensor =
shardTensor(unsharded_tensor, in).to(communicator_->device());

Collaborator @samnordmann, Dec 3, 2024:
Suggested change:
std::vector<int64_t> ref_in_tensor_shape = {kTensorSize};
EXPECT_EQ(in_tensor.sizes(), ref_in_tensor_shape);

Collaborator @samnordmann, Dec 3, 2024:
I don't understand how shardTensor can be correct here if it never replays the split backwards... But I might be missing something.

Collaborator (author):
Thanks for the review! I think there are two problems with the PR as is:

  1. shardTensor may slice the wrong numbers. For example, if an inner split is DID'ed, the slicing needs to be strided according to the outer split (see the sketch after this list).
  2. nvFuser doesn't error out when an Allgather is not along the outermost allocated dimension. This was guaranteed by ReorderShardedAxisPass checking isInnerResharding. However, getShardingChanges, one of its dependencies, hasn't been updated to read the loop/allocation domains:
     auto rootmap = PairwiseLogicalDomainMap(input, output).mapBroadcast(false);
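For item 1, a minimal sketch of the contiguous vs. strided case in plain ATen (illustrative only, not code from this PR), assuming a 1-D tensor t of extent i1, d devices, and device index i:

  // Outer split [d, i1/d], DIDx on the outer axis: device i owns a contiguous chunk.
  at::Tensor outer_shard = t.slice(/*dim=*/0, i * (i1 / d), (i + 1) * (i1 / d));
  // Inner split [i1/d, d], DIDx on the inner axis: device i owns every d-th element,
  // so the slicing has to be strided.
  at::Tensor inner_shard = t.slice(/*dim=*/0, /*start=*/i, /*end=*/i1, /*step=*/d);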

Collaborator (author):
Re the suggested change: I manually checked that the shape is as expected. I added some extra unit tests for shardTensor alone, so we don't have to verify it here.

Collaborator (author):
I made a couple of changes to address the problems I described in #3284 (comment).

  1. 7cf2384. It's overkill but will probably be OK for quite some time. I had a hard time finding a concrete use case that has to mix DID and host ID within one logical dimension. I agree that to properly support inner splits we'll need to "replay the split backwards". It's not a trivial change anyhow, so I'll postpone it to a separate PR.
  2. I wrote #3531 (Harden assertBuffersHaveSameSize to check shapes) to tighten the runtime checks for allgather, and added one more allgather test (Allgather_LoopSplit_Noncontiguous) to this PR. These extra checks will fire when we hit the most common limitations, before ReorderShardedAxisPass is properly fixed, which will take several decent-size PRs.

Collaborator (author):
I had a hard time finding a concrete use case that has to mix DID and host ID within one logical dimension.

In fact, there is one:
  // A has shape (S, sharded(D), M/(S*D), K)
So I'll try to file a feature request after this PR.

FusionExecutorCache fec(std::move(fusion));
at::Tensor out_tensor = fec.runFusionWithInputs({in_tensor})[0];
assertIsCompiledToHostIrContainer(fec);

EXPECT_TRUE(at::equal(out_tensor.cpu(), unsharded_tensor));
Collaborator:
Why not use validate here?

Collaborator:
I noticed allgather's lowering was not changed... I'm a bit surprised it didn't need any modifications for inputs with DID loop split! I might have missed a few earlier PRs, though.

Collaborator:
Why not use validate here?

Since validate allows for (small) differences, when two tensors are supposed to be exactly the same, using the simpler check, i.e., at::equal, is preferable.

Collaborator (author):
I'm a bit surprised it didn't need any modifications for inputs with DID loop split!

Whether we call lowerToAllgather depends on the I/O meshes and on whether the input and output are sharded:
  lowerToAllgather(input_tv, output_tv, comms);
isSharded has been reading the allocation domain since #3444.
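A simplified sketch of that decision (heavily condensed; the actual branching in lower_communication.cpp also checks the meshes and handles other collectives):

  // A Set whose input is sharded (per its allocation domain) and whose output is
  // not, with both TensorViews on the same device mesh, lowers to an Allgather.
  if (isSharded(input_tv) && !isSharded(output_tv)) {
    lowerToAllgather(input_tv, output_tv, comms);
  }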

That being said, I think this PR as-is is a bit too permissive and may lower a set to Allgather without properly checking its allocation domain. For example,

auto rootmap = PairwiseLogicalDomainMap(input, output).mapBroadcast(false);
reads root and logical and needs to be updated. I'll try to fix that.

Collaborator (author):
That being said, I think this PR as is is a bit too permissive and may lower a set to Allgather without properly checking its allocation domain.

I tried to address this in #3284 (comment).

}

TEST_F(LowerCollectiveTest, Broadcast) {
auto fusion = std::make_unique<Fusion>();
FusionGuard fg(fusion.get());