
add initial fabric erisc data mover (EDM) impl #14923

Merged: 5 commits merged into main from snijjar/async-ccl on Nov 14, 2024

Conversation

@SeanNijjar (Contributor) commented Nov 10, 2024

Ticket

Link to Github Issue

Problem description

To support the new "Async CCLs" we need a fabric component that can independently route and flow-control messages across multiple hops, without worker involvement. The current EDM doesn't support this.

What's changed

The fabric Erisc Data Mover (EDM) is a component that can be used to build very simple linear topology fabrics.
One of these EDMs can be instantiated on each ethernet link. It is built from 3 "channels" (though the definition
of channel here is a little loose, since two of the three merge traffic, so this could also be interpreted as a
two-channel setup). This EDM implements packet-based transfers only; concepts like sockets are not supported.

EDM Structure

There are two sender channels and one receiver channel. "Sender" and "receiver" are relative to the Ethernet link,
not the chip. Sender sends over the link and receiver receives from the link.

Each sender channel serves a different purpose:

  • Sender channel 0: accepts packets from workers on the local chip
  • Sender channel 1: accepts packets from an upstream EDM (i.e. an upstream
    EDM receiver channel on the same chip but a different core)

The receiver channel accepts packets from the Ethernet link and can do one (or both) of:

  • Write the packet to the local chip if it is the intended destination (unicast or mcast)
  • Forward the packet to the next chip in the line if:
    • Unicast and not the target chip
    • Multicast and this chip is in the multicast target range

Sender channels will merge traffic into the remote EDM's receiver channel.
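
As a rough sketch of the receiver-side decision described above (the field names here are hypothetical stand-ins; the real header is defined in fabric_edm_packet_header.hpp, covered under "Packet Structure" below):

```
#include <cstdint>

// Illustrative only: hypothetical routing fields standing in for the real
// packet header in fabric_edm_packet_header.hpp.
struct RoutingFieldsSketch {
    bool is_mcast;            // unicast vs multicast packet
    uint8_t hops_to_dest;     // unicast: hops remaining to the target chip
    uint8_t mcast_remaining;  // mcast: chips remaining in the multicast range
};

// Receiver channel decision: write locally, forward downstream, or both.
inline void route(const RoutingFieldsSketch& r, bool& write_local, bool& forward) {
    if (!r.is_mcast) {
        write_local = (r.hops_to_dest == 0);   // this chip is the target
        forward     = (r.hops_to_dest > 0);    // keep moving down the line
    } else {
        write_local = (r.mcast_remaining > 0); // this chip is inside the range
        forward     = (r.mcast_remaining > 1); // more chips remain downstream
    }
}
```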

Below is a diagram showing how EDMs can be connected over an ethernet link. In this case, the two
EDM kernels run on separate, but connected, ethernet link cores.

 ┌───────────────────────┐           ┌───────────────────────┐
 │    Sender Channel 0   │           │    Receiver Channel   │
 │   ┌────────────────┐  │           │   ┌────────────────┐  │
 │   │                ┼──┼───┬───────┼───►                │  │
 │   │                │  │   │       │   │                │  │
 │   └────────────────┘  │   │       │   └────────────────┘  │
 │    Sender Channel 1   │   │       │    Sender Channel 1   │
 │   ┌────────────────┐  │   │       │   ┌────────────────┐  │
 │   │                ┼──┼───┘       │   │                │  │
 │   │                │  │         ┌─┼───┼                │  │
 │   └────────────────┘  │         │ │   └────────────────┘  │
 │    Receiver Channel   │         │ │    Sender Channel 0   │
 │   ┌────────────────┐  │         │ │   ┌────────────────┐  │
 │   │                │  │         │ │   │                │  │
 │   │                ◄──┼─────────┴─┼───┼                │  │
 │   └────────────────┘  │           │   └────────────────┘  │
 │                       │           │                       │
 │                       │           │                       │
 └───────────────────────┘           └───────────────────────┘

Building a "Fabric"

Only linear topologies are (and will be) supported, with one EDM per ethernet link along a given line.
Below shows the intended connectivity of EDMs across chips in a hypothetical 3-chip fabric. For longer
lines, the pattern would be extended.

           CHIP 0                              CHIP 1                             CHIP 2
     ┌─────────────────┐                ┌─────────────────┐                ┌─────────────────┐
     │                 │                │                 │                │                 │
┌────┴─────┐ ▲   ┌─────┴────┐      ┌────┴─────┐ ▲   ┌─────┴────┐      ┌────┴─────┐ ▲   ┌─────┴────┐
│   EDM    │ │   │   EDM    │      │   EDM    │ │   │   EDM    │      │   EDM    │ │   │   EDM    │
│ ┌──────┐ │ │   │ ┌──────┐ │      │ ┌──────┐ │ │   │ ┌──────┐ │      │ ┌──────┐ │ │   │ ┌──────┐ │
│ │ Rx   ┼─┼─┴───┼─► S1   ┼─┼─┬────┼─► Rx   ┼─┼─┴───┼─► S1   ┼─┼┬─────┼─► Rx   ┼─┼─┘   | | S1   │ │
│ └──────┘ │     │ └──────┘ │ │    │ └──────┘ │     │ └──────┘ ││     │ └──────┘ │     │ └──────┘ │
│ ┌──────┐ │     │ ┌──────┐ │ │    │ ┌──────┐ │     │ ┌──────┐ ││     │ ┌──────┐ │     │ ┌──────┐ │
│ │ S0   ◄─┼──┬──┼─► S0   ┼─┼─┘   ┌┼─┼ S0   ◄─┼──┬──┼─► S0   ┼─┼┘    ┌┼─┼ S0   ◄─┼──┬──┼─► S0   │ │
│ └──────┘ │  │  │ └──────┘ │     ││ └──────┘ │  │  │ └──────┘ │     ││ └──────┘ │  │  │ └──────┘ │
│ ┌──────┐ │  │  │ ┌──────┐ │     ││ ┌──────┐ │  │  │ ┌──────┐ │     ││ ┌──────┐ │  │  │ ┌──────┐ │
│ │ S1   | |  │ ┌┼─┼ Rx   ◄─┼─────┴┼─┼ S1   ◄─┼─┐│ ┌┼─┼ Rx   ◄─┼─────┴┼─┼ S1   ◄─┼─┐│ ┌┼─┼ Rx   │ │
│ └──────┘ │  | |│ └──────┘ │      │ └──────┘ │ └┼─┤│ └──────┘ │      │ └──────┘ │ └┼─┤│ └──────┘ │
└────┬─────┘  │ │└─────┬────┘      └────┬─────┘  │ │└─────┬────┘      └────┬─────┘  │ │└─────┬────┘
     │          ▼      │                │          ▼      │                │          ▼      │
     └─────────────────┘                └─────────────────┘                └─────────────────┘

Connecting Workers to Channels

As mentioned, only one worker can push to a given EDM sender channel at a time. To send to an EDM
sender channel, the worker must first establish a connection. The connection protocol is as follows and is
initiated by the worker (the EDM is the passive side of this protocol).

NOTE: If multiple workers try to connect to the same EDM sender channel at the same time, the behavior is undefined.
NOTE: Additionally, if a worker pushes packets to a channel it isn't connected to, the behavior is undefined.
NOTE: Undefined == likely hang

The WorkerToFabricEdmSender from ttnn/cpp/ttnn/operations/ccl/kernels/edm_fabric/edm_fabric_worker_adapters.hpp
provides an implementation of the connection protocol. WorkerToFabricEdmSender also acts as a wrapper around that
protocol, so workers can simply call open() to execute the connection protocol without having to reimplement it
manually in each kernel.
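
A minimal usage sketch, assuming the adapter exposes open() and a matching close() around the protocol below (the construction details here are assumptions, not the adapter's actual signature):

```
// Hypothetical worker-kernel usage of WorkerToFabricEdmSender; only open() is
// named above, so the constructor arguments and close() are assumptions.
#include "ttnn/cpp/ttnn/operations/ccl/kernels/edm_fabric/edm_fabric_worker_adapters.hpp"

void kernel_main() {
    auto sender = WorkerToFabricEdmSender(/* EDM noc x/y, channel base address,
                                             buffer_index address, flow control
                                             semaphore, ... (assumed) */);
    sender.open();   // runs the worker-side connection protocol below
    // ... populate packet headers and push packets (see "Sending Packets") ...
    sender.close();  // tear down: writes 0 to the connection semaphore
}
```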

Protocol

Worker:

  • Read from EDM sender channel buffer_index address
    • Required so that the worker knows where to write its first packet (since the channel may already contain packets from
      a previous connection)
  • Write worker core X/Y (NOC 0 based)
  • Write worker flow control semaphore L1 address

EDM Sender Channel:

  • Check the local connection-valid semaphore for a newly established connection
    • When the connection semaphore indicates an active connection, the channel assumes all other relevant fields were
      correctly populated by the worker:
      • Worker core_x (on NOC 0)
      • Worker core_y (on NOC 0)
      • Worker flow control semaphore L1 address

Tearing Down Connections

Every worker is required to explicitly tear down its connection with the EDM before terminating. To do this, the worker
simply writes a 0 to the EDM sender channel's connection semaphore address. As long as the worker has sent all
of its packets to the EDM before this, the EDM guarantees the messages will be forwarded correctly.

At this point, it is safe for another kernel to establish a connection.
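
In kernel code, the teardown write might look like this sketch (variable names are illustrative; flush/ordering details are elided):

```
// Sketch: release the connection by writing 0 to the EDM sender channel's
// connection semaphore over the NOC. Variable names here are illustrative.
uint64_t conn_sem_noc_addr =
    get_noc_addr(edm_noc_x, edm_noc_y, edm_connection_semaphore_addr);
noc_inline_dw_write(conn_sem_noc_addr, 0);  // 0 == connection torn down
```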

Packet Structure

Workers are responsible for populating packet headers before sending to the EDM. The packet header structure is defined
in ttnn/cpp/ttnn/operations/ccl/kernels/edm_fabric/fabric_edm_packet_header.hpp.

Channel structure

Each EDM channel is built from one or more buffers. Each buffer is the same size and can hold at most one packet.
Neighbouring packets occupy neighbouring buffers, with the exception of the last buffer index: the next packet after a write
into the last buffer index wraps around to the first buffer index. Even if packets do not occupy the full buffer, subsequent
packets will always be written into the next logical buffer. A gap will exist in memory, but the EDM will not send that padded data
(unless it is more performant, which is possible in some special cases).
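
A minimal sketch of the wrapping buffer arithmetic just described (NUM_BUFFERS and BUFFER_SIZE_BYTES are illustrative names, not the EDM's actual identifiers):

```
#include <cstdint>

constexpr uint32_t NUM_BUFFERS = 8;          // e.g. the 8-buffer channel below
constexpr uint32_t BUFFER_SIZE_BYTES = 4096; // every buffer is the same size

// Packets always land at slot boundaries, even if smaller than the slot.
uint32_t buffer_address(uint32_t channel_base_address, uint32_t index) {
    return channel_base_address + index * BUFFER_SIZE_BYTES;
}

// The slot after the last one wraps back to slot 0.
uint32_t next_buffer_index(uint32_t index) {
    return (index == NUM_BUFFERS - 1) ? 0 : index + 1;
}
```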

Example channel with 8 buffers

┌───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐
│       │       │       │       │       │       │       │       │
│       │       │       │       │       │       │       │       │
└───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
buf 0   buf 1   buf 2   buf 3   buf 4   buf 5   buf 6   buf 7

Here we have an example of a channel with 4 buffers, filled with some number of packets. Each packet is a different size.
Packets 0, 2, and 3 are smaller than the full buffer size, while packet 1 is the full buffer size.

┌───────────────┬───────────────┬───────────────┬───────────────┐
│H|Payload| / / │H|Payload      │H|Pyld| / / / /│H|Payload  |/ /│
│ |       |/ / /│ |             │ |    |/ / / / │ |         | / │
└───────────────┴───────────────┴───────────────┴───────────────┘
  buf 0           buf 1           buf 2           buf 3

A detail of the channel structure is omitted from the above diagram, namely the EDM <-> EDM flow control region for each buffer.
Each buffer really looks something like this:

             &header->  |----------------| channel_base_address
                        |    header      |
            &payload->  |----------------|
                        |                |
                        |    payload     |
                        |                |
       &channel_sync->  |----------------|
                        |  channel_sync  |  // This is new
                        ------------------

The "channel_sync" is an eth_channel_sync_t and is internal to the EDM implementation and is used to indicate packet
transmission state between sender and receiver EDMs.

The protocol for its use is:

  1. Sender updates the field indicating new data:
    • set bytes_sent to a non-zero value indicating new data
    • clear receiver_ack to 0
    • set src_id to the sender channel id so the receiver knows who the sender was (and where the ack should go)
  2. Sender sends this channel sync to the corresponding location in the receiver channel (either in the same transmission
    as the packet or separately)
  3. Receiver sees that bytes_sent is non-zero, indicating a new packet. It sends back an acknowledgement (first level):
    • set receiver_ack to non-zero
      NOTE IMPORTANT: To avoid a race, the receiver must send its channel_sync_t from a different address than the one
      it uses for the second level acknowledgement
    • When the sender receives this ack, it knows it can overwrite its local copy of the packet with new data
  4. After the receiver properly writes out its packet, it sends a second level acknowledgement, indicating it can receive new
    data into this specific buffer index:
    • clear the bytes_sent and receiver_ack fields and send the channel_sync back to the sender
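
To make the field roles concrete, here is a sketch of the sync structure and the sender-side update from step 1 (the field names come from the protocol above; the actual eth_channel_sync_t layout in tt-metal may differ):

```
#include <cstdint>

// Illustrative layout; see eth_channel_sync_t in tt-metal for the real one.
struct eth_channel_sync_sketch_t {
    volatile uint32_t bytes_sent;    // non-zero => a new packet is in flight
    volatile uint32_t receiver_ack;  // first-level ack: sender may reuse buffer
    volatile uint32_t src_id;        // sender channel id (where the ack goes)
};

// Step 1: sender marks the buffer as holding new data before transmitting.
void mark_packet_sent(eth_channel_sync_sketch_t& sync,
                      uint32_t payload_bytes, uint32_t sender_channel_id) {
    sync.bytes_sent = payload_bytes;  // non-zero signals new data
    sync.receiver_ack = 0;            // clear the previous ack
    sync.src_id = sender_channel_id;  // tell the receiver who sent it
}
```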

Sending Packets

Sending a packet is done as follows:

  1. Worker waits for a flow control semaphore increment from the EDM sender channel
    • Indicates there is space at the next buffer index for a packet
  2. Worker performs a noc write of its packet to the EDM sender channel at that buffer index

NOTE: !!!ALL PACKETS MUST CONTAIN DESTINATION NOC X/Y AS NOC 0 COORDINATES, REGARDLESS OF THE noc_index OF THE SENDER!!!
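
A minimal sketch of those two steps in a worker kernel, in tt-metal dataflow-API style (the semaphore and address variables are illustrative):

```
// 1) Wait for the EDM sender channel to signal free space at the next slot.
noc_semaphore_wait_min(worker_flow_control_sem_addr, expected_credits);

// 2) Write the packet (header + payload) into the EDM sender channel slot.
//    Destination x/y in the header must be NOC 0 coordinates (see NOTE above).
uint64_t edm_slot_noc_addr = get_noc_addr(edm_noc_x, edm_noc_y, edm_buffer_l1_addr);
noc_async_write(local_packet_l1_addr, edm_slot_noc_addr, packet_size_bytes);
noc_async_write_barrier();
```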

Building a Line Fabric

Building a simple fabric for testing with operations:

  1. First, build it. Build a bidirectional fabric along a line of devices:
    ttnn::ccl::EdmLineFabricOpInterface(devices, program_ptrs, 1);
    where the devices and program_ptrs correspond to each other by index.
    The third argument is an optional field that specifies the number of links
    (wide) the fabric should span. By default, this will choose the largest
    number of links possible for the provided span of devices.

  2. Next, connect your workers. For each worker, connect to the fabric like:

auto chip0_worker_fabric_connection =
    line_fabric.uniquely_connect_worker(
        devices[0],
        ttnn::ccl::EdmLineFabricOpInterface::FORWARD);

where the valid directions are FORWARD and BACKWARD. FORWARD is
in the direction of ascending device indices (from the device list provided
to the constructor) and BACKWARD is toward the front of the list.

Note that, for the time being, a worker that wishes to broadcast in both
directions of the line will need to call connect twice:
once in the forward direction and once in the backward direction.

  3. Collect the termination info, for proper teardown of the fabric. This will only be needed temporarily, until
    a create_persistent_fabric that launches the fabric on persistent subcore
    meshes is provided. A worker will be required to send terminate signals to
    all the fabric endpoints to let the workload complete.

auto const& edm_termination_infos =
    line_fabric.generate_ordered_termination_info_farthest_to_nearest();

These termination infos specify the fabric locations for each endpoint,
relative to the first chip in the fabric.

  4. Finally, build the EDM kernels:
    line_fabric.build_kernels();
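
Putting the four steps together, a host-side sketch (device and program setup elided; argument details follow the snippets above):

```
#include <vector>

// Sketch only: combines the snippets above; device/program setup and worker
// kernel creation are elided.
std::vector<Device*> devices;        // the line of devices, in order
std::vector<Program*> program_ptrs;  // one program per device, same order
// ... populate devices and program_ptrs ...

// 1) Build a bidirectional fabric, one link wide, along the line of devices.
ttnn::ccl::EdmLineFabricOpInterface line_fabric(devices, program_ptrs, 1);

// 2) Connect a worker on the first device in the forward direction.
auto chip0_worker_fabric_connection = line_fabric.uniquely_connect_worker(
    devices[0], ttnn::ccl::EdmLineFabricOpInterface::FORWARD);

// 3) Collect termination info so a worker can tear the fabric down.
auto const& edm_termination_infos =
    line_fabric.generate_ordered_termination_info_farthest_to_nearest();

// 4) Build the EDM kernels.
line_fabric.build_kernels();
```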

Future work

This functionality is still partly a work in progress from a functional perspective: chip multicast support must still be validated. However, it is not strictly required for functional bringup of operations such as all-gather, reduce-scatter, all-reduce, send/receive, etc. Update: basic mcast support has now been tested in this PR.

Additional functional needs:

  • graceful termination support (Update: added)
  • Maybe:
    • Stats gathering region that can be queried and potentially useful for debug
      • num messages committed locally
      • num messages forwarded
      • "live" throughput

From a performance perspective there are a handful of items to work through (non-exhaustive):

  • General (careful) review of use of volatiles. Cache, coalesce, and/or remove volatile reads/writes where possible
  • Conditionally send padding for packets < buffer size (identify the break-even point and don't split channel sync above that size)
  • Consider busy wait when splitting eth_channel_sync_t from payload and we are waiting for eth_tx_cmd_q to clear.
  • Make local chip writes non-blocking
  • Align enum and state values so we can do sender_states[sender_channel_index] += send_status for SENDER_WAITING_FOR_WORKER state update
  • Short-circuit from RECEIVER_WAITING_FOR_ETH to RECEIVER_SENDING_PAYLOAD without having to exit the receiver state machine step
    • Currently only avoided for code-size reasons and for the time being want to have flexibility to add asserts/dprints while all features are enabled
  • For mcasted writes, let EDM commit local writes and EDM forward writes independently
    • Currently we require that we can do both together in the same state step
  • Optimize sender channel index increment
  • Remove redundant safe_to_send branch in forward_payload_to_downstream_edm
  • Flatten the CommandType and NocSendType enums to collapse a handful of switch-case blocks, reducing the number of branch condition checks and the depth of branching in some cases
  • Tune context switch timeouts and conditionality
  • Move to word-granularity rd/wr pointer to improve L1 utilization (increase effective buffering capacity, should help perf in some cases, especially with amortizing sends of multiple packets at a time over the link)


@SeanNijjar (Contributor, Author) commented Nov 10, 2024

FYI @davorchap @jvegaTT @xuncaiTT @avoraTT. A little later than anticipated, but here's the first iteration, which should be sufficient for prototyping all of our async CCLs. Mcast support is added but untested, so for now we'll need to emulate mcast on the sender side by looping over the packets (e.g. in the all-gather case).

Next on the list is tested mcast support, then a function for instantiating the EDMs in the line topologies so the ops can call a single function for all setup, then perf characterization.

@davorchap (Collaborator):

Great!

@SeanNijjar (Contributor, Author):

Update: mcast functionality has now been tested and is functional for the tests added so far.

@SeanNijjar force-pushed the snijjar/async-ccl branch 2 times, most recently from 7c1c1f3 to 4dfef5b on November 12, 2024 04:44
@SeanNijjar (Contributor, Author):

Note: my pipelines are failing due to a build failure in slice. I'm looking into why, but this is unexpected.

@SeanNijjar (Contributor, Author) commented Nov 13, 2024

Somehow this is compiling on main...:

tt-metal/ttnn/cpp/ttnn/operations/data_movement/tilize/device/tilize_program_factory.cpp:98:75: note: in instantiation of function template specialization 'ttnn::decorators::registered_operation_t<reflect::fixed_string<char, 11UL - 1>{{{116, 116, 110, 110, 58, 58, 108, 111, 103, 50}}}, ttnn::operations::unary::ExecuteUnary<ttnn::operations::unary::UnaryOpType::LOG2>, true>::operator()<unsigned int &>' requested here
   98 |     uint32_t log2_stick_size = stick_size_is_power_of_two ? **(uint32_t)log2(stick_size)** : 0;


/home/ubuntu/actions-runner-2/_work/tt-metal/tt-metal/ttnn/cpp/ttnn/operations/eltwise/unary/unary.hpp:38:19: note: candidate function not viable: no known conversion from 'unsigned int' to 'const Tensor' (aka 'const tt::tt_metal::Tensor') for 1st argument
   38 |     static Tensor invoke(
      |                   ^
   39 |         const Tensor& input_tensor,
      |         ~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/ubuntu/actions-runner-2/_work/tt-metal/tt-metal/ttnn/cpp/ttnn/operations/eltwise/unary/unary.hpp:43:71: note: candidate function not viable: requires 2 arguments, but 1 was provided
   43 |     static typename ExecuteUnaryInvokeResult<unary_op_types...>::type invoke(
      |                                                                       ^
   44 |         const ComplexTensor& input_tensor,
      |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   45 |         const MemoryConfig& memory_config);
      |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/ubuntu/actions-runner-2/_work/tt-metal/tt-metal/ttnn/cpp/ttnn/operations/eltwise/unary/unary.hpp:32:19: note: candidate function not viable: requires at least 2 arguments, but 1 was provided

... What?

but then in ttnn/cpp/ttnn/operations/eltwise/unary/unary.hpp there is:

#define REGISTER_UNARY_OPERATION(operation_name, operation_type) \
    constexpr auto operation_name = ttnn::register_operation_with_auto_launch_op<    \
        "ttnn::" #operation_name,                                \
        ttnn::operations::unary::ExecuteUnary<ttnn::operations::unary::UnaryOpType::operation_type>>();

and

REGISTER_UNARY_OPERATION(log2, LOG2);

So log2 (intended to be std::log2 from the math header) was, I think, resolving to this completely unrelated symbol. So PSA: scope these math function calls, folks :)

How is this working on main???
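
For reference, a minimal sketch of the fix implied here: qualify the call so overload resolution cannot pick up the registered ttnn::log2 operation.

```
#include <cmath>

// Explicitly std::log2, not the unrelated ttnn::log2 unary op.
uint32_t log2_stick_size =
    stick_size_is_power_of_two ? static_cast<uint32_t>(std::log2(stick_size)) : 0;
```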

@ntarafdar (Contributor) left a comment:

changes to concat and tilize look good

@tt-aho (Contributor) left a comment:

metal changes look fine

@@ -292,6 +292,7 @@ inline __attribute__((always_inline)) void noc_fast_write_dw_inline(uint32_t noc

uint32_t be32 = be;
uint32_t be_shift = (dest_addr & (NOC_WORD_BYTES-1));
// If we're given a misaligned address, don't write to the bytes in the word below the address
(Contributor):

Did we want to keep this comment?
Would this apply to gs/bh as well?

@SeanNijjar (Contributor, Author):

I only added the comment because I was confused at first about what was happening here. I can add the corresponding comment there too if it is common to those implementations.

(Contributor):

Does misaligned here mean not aligned to L1_ALIGNMENT?

@SeanNijjar (Contributor, Author) commented Nov 13, 2024:

I'll be honest, there isn't any documentation on these byte-enable registers. I was told we need to look at the RTL to figure out exactly how they behave, but from the line of code here, it is relative to the noc word size.

@SeanNijjar (Contributor, Author):

Would you prefer I reword it to say "an address that is not a multiple of the noc word size"? Again, I don't know why this matters; the comment is really just there to describe what the shift itself is trying to do. Maybe it's obvious in hindsight, but I had to sit there for a minute and think about it before I realized what was going on.

(Contributor):

I see. I think either is fine; I just wanted clarification, since we typically only dealt with noc alignment.

Note: only line topologies are supported. Fabric mcast is currently untested and a work in progress. In the meantime, for functional bringup by fabric EDM users, replace mcast with looped unicasts.

}

void FabricEriscDatamoverBuilder::connect_to_downstream_edm(FabricEriscDatamoverBuilder const& downstream_edm) {
auto const& adapter_spec = downstream_edm.build_connection_to_fabric_channel();
(Member):

should ideally assert that no connection is already established?

@SeanNijjar (Contributor, Author):

Maybe. For now I'm keeping this in the wrapper fabric builder.
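
For reference, a sketch of the suggested guard (the member flag is hypothetical, not the builder's actual state):

```
// Hypothetical guard against double-connection; the flag name is illustrative.
void FabricEriscDatamoverBuilder::connect_to_downstream_edm(FabricEriscDatamoverBuilder const& downstream_edm) {
    TT_ASSERT(!this->connected_to_downstream_edm, "connect_to_downstream_edm called twice");
    this->connected_to_downstream_edm = true;
    auto const& adapter_spec = downstream_edm.build_connection_to_fabric_channel();
    // ... wire up the connection as before ...
}
```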

@ayerofieiev-tt (Member) left a comment:

Approving so as not to block progress. Made a quick sanity pass.

  • A lot of volatile usage; I don't have enough background to say whether it is necessary everywhere.
  • A lot of uninitialized member variables (I can't find where they are initialized).

@SeanNijjar, whose high-quality review do you need the most here?

@SeanNijjar (Contributor, Author) commented Nov 13, 2024:

This review of volatile use is on the perf todo list mentioned in the description (understandably, it's easy to miss in the large description):

  • General (careful) review of use of volatiles. Cache, coalesce, and/or remove volatile reads/writes where possible

Since volatile is viral, I opted to be safe with its use (using it probably more than necessary, in places where I could otherwise cache its reads to a register and make it non-volatile). There are definitely places where I mark the packet header pointer volatile where I could otherwise safely cache the packet header (or parts of it) to registers for the duration of an (inlined) call chain.
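
A small illustration of the caching idea (a sketch, not the EDM's actual code; handle_new_packet is a stand-in):

```
// Read the volatile field once into a register-resident local, then reuse it,
// instead of re-reading L1 through the volatile pointer on every use.
volatile uint32_t* bytes_sent_ptr = &channel_sync->bytes_sent;
uint32_t bytes_sent = *bytes_sent_ptr;  // single volatile load
if (bytes_sent != 0) {
    handle_new_packet(bytes_sent);      // cached value; no additional volatile reads
}
```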

@SeanNijjar SeanNijjar merged commit ce6ff4c into main Nov 14, 2024
182 of 259 checks passed
@SeanNijjar SeanNijjar deleted the snijjar/async-ccl branch November 14, 2024 12:59