[CCL] Support rank 2 tensor for reduce_scatter TTNN cpp api #15010

Open
wooseokTT opened this issue Nov 13, 2024 · 1 comment
Labels: bug, op_cat: ccl


wooseokTT commented Nov 13, 2024

Calling the TTNN reduce_scatter op with a rank-2 tensor currently fails with the runtime error shown below. As a workaround, wrap the call in two reshape ops: one before it that expands the tensor from rank 2 to rank 4, and one after it that collapses the result from rank 4 back to rank 2.

Original MLIR op that triggers the error:

%9 = "ttnn.reduce_scatter"(%8) <{math_op = #tt.reduce_type, num_links = 1 : si32, scatter_split_dim = 1 : si32}> : (tensor<8192x16384xf32, #layout7>) -> tensor<8192x8192xf32, #layout7>

Updated MLIR ops, with a reshape before and after:

%9 = "ttnn.reshape"(%8) <{shape = [1 : i32, 1 : i32, 8192 : i32, 16384 : i32]}> : (tensor<8192x16384xf32, #layout7>) -> tensor<1x1x8192x16384xf32, #layout7>
%10 = "ttnn.reduce_scatter"(%9) <{math_op = #tt.reduce_type, num_links = 1 : si32, scatter_split_dim = 3 : si32}> : (tensor<1x1x8192x16384xf32, #layout7>) -> tensor<1x1x8192x8192xf32, #layout7>
%11 = "ttnn.all_gather"(%10) <{dim = 3 : si32, num_links = 1 : si32}> : (tensor<1x1x8192x8192xf32, #layout7>) -> tensor<1x1x8192x16384xf32, #layout7>
%12 = "ttnn.reshape"(%11) <{shape = [8192 : i32, 16384 : i32]}> : (tensor<1x1x8192x16384xf32, #layout7>) -> tensor<8192x16384xf32, #layout7>
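
The shape bookkeeping behind the reshape workaround can be sketched in plain Python. This is only an illustration, not TTNN or tt-mlir code: `pad_to_rank4` and `scattered_shape` are hypothetical helper names, and the device count of 2 is an assumption inferred from the 16384 -> 8192 shape change in the ops above.

```python
def pad_to_rank4(shape, dim):
    """Prepend size-1 dims until the shape has rank 4 and shift `dim` to match.

    Hypothetical helper: mirrors the front reshape in the workaround.
    """
    rank = len(shape)
    if not 1 <= rank <= 4:
        raise ValueError(f"expected rank between 1 and 4, got {rank}")
    if not -rank <= dim < rank:
        raise ValueError(f"dim {dim} is out of range for rank {rank}")
    dim = dim % rank          # normalize a negative dim
    pad = 4 - rank            # number of leading 1s to add
    return [1] * pad + list(shape), dim + pad


def scattered_shape(shape, dim, num_devices):
    """Per-device output shape after scattering along `dim`.

    Hypothetical helper: `num_devices` is an assumption, not read from the op.
    """
    if shape[dim] % num_devices != 0:
        raise ValueError("scatter dim must divide evenly across devices")
    out = list(shape)
    out[dim] //= num_devices
    return out


# Shapes taken from the MLIR in this issue.
shape4, dim4 = pad_to_rank4([8192, 16384], 1)   # front reshape
out4 = scattered_shape(shape4, dim4, 2)         # rank-4 reduce_scatter result
out2 = out4[-2:]                                # dropping the leading 1s again
```

With the rank-2 input from the original op, the scatter dimension moves from 1 to 3, which matches the `scatter_split_dim = 3` used in the updated ops.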


2024-11-13 16:59:57,311 - ERROR - ERROR: test=./test_ttnn.ttnn experienced an error with exception=TT_FATAL @ /proj_sw/user_dev/wooseoklee/tt-mlir/third_party/tt-metal/src/tt-metal/ttnn/cpp/ttnn/tensor/types.cpp:239: normalized_index >= 0 and normalized_index < rank
info:
Index is out of bounds for the rank, should be between 0 and 1 however is 3
backtrace:
--- /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/_ttnn.so(+0x121b448) [0x7e58b812a448]
--- /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/_ttnn.so(+0x140d63b) [0x7e58b831c63b]
--- ttnn::ccl::RingReduceScatterBaseTensorSlicer<ttnn::ccl::RingReduceScatterWrappedTensorSlicer>::RingReduceScatterBaseTensorSlicer(tt::tt_metal::Tensor const&, tt::tt_metal::Tensor const&, int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int)
--- ttnn::ccl::RingReduceScatterWrappedTensorSlicer::RingReduceScatterWrappedTensorSlicer(tt::tt_metal::Tensor const&, tt::tt_metal::Tensor const&, int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int)
--- ttnn::ccl::reduce_scatter_detail::reduce_scatter_with_workers(tt::tt_metal::Tensor const&, tt::tt_metal::Tensor const&, ttnn::operations::binary::BinaryOpType, unsigned int, unsigned int, unsigned int, unsigned int, std::optional, std::optional, ttnn::ccl::Topology, std::optional, std::optional)
--- /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/_ttnn.so(+0x4b08ea) [0x7e58b73bf8ea]
--- /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/_ttnn.so(+0x40e9c0) [0x7e58b731d9c0]
--- /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/_ttnn.so(+0x41981d) [0x7e58b732881d]
--- void ttnn::device_operation::detail::launch_on_worker_thread<tt::tt_metal::operation::OldInfraDeviceOperation<std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor > >, unsigned char, long, tt::tt_metal::operation::DeviceOperation<std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor > >, tt::tt_metal::operation::OldInfraDeviceOperation<std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor > >::tensor_args_t, std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor >, tt::tt_metal::v0::Device*>(unsigned char, long, tt::tt_metal::operation::DeviceOperation<std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor > > const&, tt::tt_metal::operation::OldInfraDeviceOperation<std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor > >::tensor_args_t const&, std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor >&, tt::tt_metal::v0::Device*&)
--- tt::tt_metal::operation::OldInfraDeviceOperation<std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor > >::tensor_return_value_t ttnn::device_operation::detail::launch_on_single_device<tt::tt_metal::operation::OldInfraDeviceOperation<std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor > > >(unsigned char, tt::tt_metal::operation::OldInfraDeviceOperation<std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor > >::operation_attributes_t const&, tt::tt_metal::operation::OldInfraDeviceOperation<std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor > >::tensor_args_t const&)
--- /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/_ttnn.so(+0x40b72e) [0x7e58b731a72e]
--- tt::tt_metal::operation::OldInfraDeviceOperation<std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor > >::tensor_return_value_t ttnn::device_operation::detail::invoke<tt::tt_metal::operation::OldInfraDeviceOperation<std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor > > >(unsigned char, tt::tt_metal::operation::OldInfraDeviceOperation<std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor > >::operation_attributes_t const&, tt::tt_metal::operation::OldInfraDeviceOperation<std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor > >::tensor_args_t const&)
--- /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/_ttnn.so(+0x40a27f) [0x7e58b731927f]
--- /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/_ttnn.so(+0x4061d5) [0x7e58b73151d5]
--- std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor > tt::tt_metal::operation::run<std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor > >(tt::tt_metal::operation::DeviceOperation<std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor > >&&, std::vector<tt::tt_metal::Tensor, std::allocatortt::tt_metal::Tensor > const&, std::vector<std::optional<tt::tt_metal::Tensor const>, std::allocator<std::optional<tt::tt_metal::Tensor const> > > const&, std::vector<std::optionaltt::tt_metal::Tensor, std::allocator<std::optionaltt::tt_metal::Tensor > > const&, unsigned char)
--- /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/_ttnn.so(+0x4ae769) [0x7e58b73bd769]
--- /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/_ttnn.so(+0x4af223) [0x7e58b73be223]
--- /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/_ttnn.so(+0x4b00c4) [0x7e58b73bf0c4]
--- /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/libtt_metal.so(+0x172c8a) [0x7e58b6dd1c8a]
--- tt::tt_metal::v0::Device::push_work(std::shared_ptr<std::function<void ()> >, bool)
--- ttnn::operations::ccl::reduce_scatter(tt::tt_metal::Tensor const&, unsigned int, ttnn::operations::reduction::ReduceType, unsigned int, tt::tt_metal::MemoryConfig const&, ttnn::ccl::Topology, std::optional, std::optional)
--- ttnn::operations::ccl::ExecuteReduceScatter::invoke(tt::tt_metal::Tensor const&, unsigned int, ttnn::operations::reduction::ReduceType, unsigned int, std::optionaltt::tt_metal::MemoryConfig const&, ttnn::ccl::Topology, std::optional, std::optional)
--- /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/_C.cpython-310-x86_64-linux-gnu.so(+0xa8293) [0x7e58b979c293]
--- /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/_C.cpython-310-x86_64-linux-gnu.so(+0xa7cbb) [0x7e58b979bcbb]
--- tt::runtime::ttnn::operations::ccl::run(tt::target::ttnn::ReduceScatterOp const*, tt::runtime::ttnn::ProgramContext&)
--- tt::runtime::ttnn::runProgram(tt::tt_metal::distributed::MeshDevice&, tt::target::ttnn::Program const*, std::vector<tt::tt_metal::Tensor*, std::allocator<tt::tt_metal::Tensor*> > const&, std::vector<tt::tt_metal::Tensor*, std::allocator<tt::tt_metal::Tensor*> > const&)
--- tt::runtime::ttnn::submit(tt::runtime::Device, tt::runtime::Binary, unsigned int, std::vector<tt::runtime::Tensor, std::allocator<tt::runtime::Tensor> > const&, std::vector<tt::runtime::Tensor, std::allocator<tt::runtime::Tensor> > const&)
--- tt::runtime::submit(tt::runtime::Device, tt::runtime::Binary, unsigned int, std::vector<tt::runtime::Tensor, std::allocator<tt::runtime::Tensor> > const&, std::vector<tt::runtime::Tensor, std::allocator<tt::runtime::Tensor> > const&)
--- /opt/ttmlir-toolchain/venv/lib/python3.10/site-packages/ttrt/runtime/_C.cpython-310-x86_64-linux-gnu.so(+0x6d185) [0x7e58b9761185]
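
The TT_FATAL condition `normalized_index >= 0 and normalized_index < rank` reads like a standard index-normalization check. The sketch below is a guess at that kind of check written in Python, not the actual code in types.cpp; `normalize_index` is a hypothetical name. Since the original op passes `scatter_split_dim = 1`, the reported index of 3 suggests (though the log alone does not confirm it) that the slicer indexes dimension 3 on the assumption of a rank-4 tensor.

```python
def normalize_index(index, rank):
    """Map a possibly negative index into [0, rank) or raise.

    Hypothetical sketch of the check the error message implies.
    """
    normalized = index + rank if index < 0 else index
    if not 0 <= normalized < rank:
        raise IndexError(
            "Index is out of bounds for the rank, should be between 0 and "
            f"{rank - 1} however is {normalized}"
        )
    return normalized


# Rank-4 tensor: index 3 (or -1) is valid.
ok_positive = normalize_index(3, 4)
ok_negative = normalize_index(-1, 4)

# Rank-2 tensor: index 3 fails, as in the log above.
try:
    normalize_index(3, 2)
    message = None
except IndexError as exc:
    message = str(exc)
```

Under this reading, padding the tensor to rank 4 makes index 3 valid, which is consistent with the reshape workaround avoiding the failure.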

SeanNijjar added the bug and op_cat: ccl labels on Nov 14, 2024

SeanNijjar commented Nov 14, 2024

Marking this as P2 because (a) there is a workaround, and (b) the VVL team will revisit it after our upgrade to V2 CCLs, which is underway and expected to take the next couple of weeks. Once the CCLs have migrated, this will become higher priority.
