IDevice interface extraction to enable MeshDevice propagation into TT-NN #16500
ayerofieiev-tt
started this conversation in
General
-
Hey Artem, awesome change, this will make it easier to use for sure! To confirm I understand correctly: does this imply that all the TTNN APIs will be adjusted to use Device* device, MeshDevice* device, const MeshDevice&, ttnn::AnyDevice, detail::OptionalAnyDevice device, etc.?
-
The change is now merged.
-
IDevice extraction
Hi everyone! Please take a look at this change. It is in active review, and I encourage you to look at it, especially if you consume tt-metal in a C++ project.
#16482
Background
As we ramp up larger WH/BH-based systems, it is critical to enable teams to work with them. To achieve this, we need to integrate new TT-Distributed entities into TT-NN. This requires us to enable existing functions to work with new structures like MeshDevice.
Who dispatches the OP workload?
tldr: Today - TT-NN. Desired - Metal.
TT-NN today supports execution of workloads on multiple devices within a single host. It implements a Single Program Multiple Devices (SPMD) computation model. In essence it works like this:
The dispatch in TT-NN looks like this - see the for loop at the end.
TT-Distributed proposes to lower the dispatch of the workload from TT-NN to Metal. So instead of looping over devices, the code might look like this
where device is an instance of a MeshDevice and workload is a MeshWorkload described in the TT-Distributed spec.
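To make the contrast concrete, here is a minimal sketch of the two dispatch styles. Every type and function name in it (`Workload`, `SingleDevice`, `MeshLike`, `dispatch_spmd`, `enqueue`) is a hypothetical stand-in chosen for illustration, not the real TT-NN, Metal, or TT-Distributed API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these are real tt-metal / TT-NN types.
struct Workload {
    std::string name;
};

struct SingleDevice {
    int id;
    std::vector<std::string> launched;  // names of workloads run on this device
    void run(const Workload& w) { launched.push_back(w.name); }
};

// Today (TT-NN dispatches): the caller loops over the devices itself.
inline void dispatch_spmd(std::vector<SingleDevice>& devices, const Workload& w) {
    for (auto& device : devices) {
        device.run(w);  // same program on every device
    }
}

// Desired (Metal dispatches): the mesh owns its devices and fans out internally,
// so the caller issues a single call.
struct MeshLike {
    std::vector<SingleDevice> devices;
    void enqueue(const Workload& w) {
        for (auto& device : devices) {
            device.run(w);
        }
    }
};
```

In the second style the loop still exists, but it lives below the API boundary, which is what lets the caller treat one device and many devices the same way.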
What does IDevice have to do with all this?
Metal Device is an abstraction that can hide the difference between a single device, multiple devices, or even a mocked device.
Such an abstraction is currently missing. The existing entity covers only a single device. Device and MeshDevice do not implement a common interface (in a general sense: they do not provide a common set of functions). Besides that, the Metal API provides numerous functions that accept a Device but do not work with a MeshDevice.
Our first step must be to define such an abstraction and update our codebase to use it.
Note!: It is possible that we need 2 separate abstractions: a Single Device (real PCIe vs. mock) and a Cluster (1 device, or N devices on a single or multiple hosts). We start with the assumption that a single interface is enough; if it is not, we will revisit the need for 2 separate concepts.
Existing Device class review
In preparation for this work, we reviewed the existing Device class.
What's in the PR?
The main change - we extract an IDevice interface.
Over the past few days we moved all public members and some public methods into private sections and removed friends from the class.
Now we want to extract an interface to further hide details of implementation.
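As a rough illustration of what extracting an interface buys, the sketch below shows two implementations behind one abstract base. The class names (`IDeviceSketch`, `SingleDeviceSketch`, `MeshDeviceSketch`) and the single `num_chips` method are assumptions made up for this example; the real IDevice is much larger, and whether MeshDevice ends up implementing it directly is not settled by this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, heavily reduced interface for illustration only.
class IDeviceSketch {
public:
    virtual ~IDeviceSketch() = default;
    virtual std::size_t num_chips() const = 0;
};

// Stand-in for a single physical device.
class SingleDeviceSketch : public IDeviceSketch {
public:
    std::size_t num_chips() const override { return 1; }
};

// Stand-in for a rows x cols mesh of devices.
class MeshDeviceSketch : public IDeviceSketch {
public:
    MeshDeviceSketch(std::size_t rows, std::size_t cols) : rows_(rows), cols_(cols) {}
    std::size_t num_chips() const override { return rows_ * cols_; }

private:
    std::size_t rows_;
    std::size_t cols_;
};

// Code written against the interface accepts either implementation
// and never sees their private details.
inline std::size_t chips_needed(const IDeviceSketch& device) {
    return device.num_chips();
}
```

A mocked device for tests would be a third class deriving from the same base, with no change to callers such as `chips_needed`.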
How does this impact me?
This will likely impact your C++ code.
Compilation of your project will break in places where you do something similar to
Change it to
or
auto device =...
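A minimal sketch of this kind of breakage and the two fixes, assuming the break comes from a factory that now returns the interface pointer instead of the concrete type. Everything in the `sketch` namespace, including `create_device` and `close_device`, is a local stand-in defined here and not the real tt-metal signature.

```cpp
#include <cassert>

namespace sketch {

// Hypothetical stand-ins for illustration only.
class IDevice {
public:
    virtual ~IDevice() = default;
    virtual int id() const = 0;
};

class Device : public IDevice {
public:
    explicit Device(int id) : id_(id) {}
    int id() const override { return id_; }

private:
    int id_;
};

// Assumed post-change shape: the factory hands back the interface type.
inline IDevice* create_device(int device_id) { return new Device(device_id); }
inline void close_device(IDevice* device) { delete device; }

inline int migrated_usage() {
    // Device* device = create_device(0);
    // ^ would not compile: IDevice* does not implicitly convert to Device*.

    IDevice* device = create_device(0);  // fix 1: name the interface type
    auto other = create_device(1);       // fix 2: let auto deduce IDevice*

    int sum = device->id() + other->id();
    close_device(device);
    close_device(other);
    return sum;
}

}  // namespace sketch
```

Using `auto` keeps call sites stable if the returned type changes again, while spelling out the interface type makes the dependency explicit.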
Except in places where you create the device or work with the specific type, I also recommend changing the include from
to
What's next?
in short
Please read the PR description for more details.
500 files, dude
The change spans 500 files. I highlighted the main points in comments.
In retrospect, I see a way to split the change into multiple individual pieces.
It might get split into pieces:
using DeviceHandle = IDevice*
Nevertheless, I am sharing this as soon as possible, to include internal and external developers in the discussion.
Please let me know if you have any questions or concerns!