[skip ci] #0: Prog Examples edits (#12972)
#0: Prog Examples readme
mbahnasTT authored Sep 21, 2024
1 parent c6d822f commit ec317dc
Showing 15 changed files with 531 additions and 905 deletions.
README.md (2 changes: 1 addition & 1 deletion)
```diff
@@ -110,7 +110,7 @@ Get started with [simple kernels](https://docs.tenstorrent.com/tt-metalium/lates
 - [Dram Loopback Data Movement](./tech_reports/prog_examples/dram_loopback/dram_loopback.md)
 ### Eltwise
 - [Eltwise Unary OP in Vector Engine (SFPU)](./tech_reports/prog_examples/eltwise_sfpu/eltwise_sfpu.md)
-- [Eltwise Binary OP in MAtrix Engine (FPU)](./tech_reports/prog_examples/eltwise_binary/eltwise_binary.md)
+- [Eltwise Binary OP in Matrix Engine (FPU)](./tech_reports/prog_examples/eltwise_binary/eltwise_binary.md)
 ### Matmul
 - [Matmul OP on a Single_core](./tech_reports/prog_examples/matmul_single_core/matmul_single_core.md)
 - [Matmul OP on Multi_core (Basic)](./tech_reports/prog_examples/matmul_multi_core/matmul_multi_core.md)
```
---
title: Add 2 Integers in Compute Kernel
---
# Add 2 Integers in Compute Kernel

In this example, we will build a TT-Metal program that adds two vectors of integers using data movement and compute kernels.

This program can be found in
[tt_metal/programming_examples/add_2_integers_in_compute/add_2_integers_in_compute.cpp](../../../tt_metal/programming_examples/add_2_integers_in_compute/add_2_integers_in_compute.cpp).

To build and execute, you may use the following commands. Note that we include the necessary environment variables here, but you may need more depending on the most up-to-date installation methods.

```bash
export ARCH_NAME=<arch name>
export TT_METAL_HOME=<this repo dir>
./build_metal.sh
./build/programming_examples/add_2_integers_in_compute

```
## Set up accelerator and program/collaboration mechanisms

``` cpp
Device *device = CreateDevice(0);
CommandQueue& cq = device->command_queue();
Program program = CreateProgram();
constexpr CoreCoord core = {0, 0};
```
We follow the standard procedure for the initial steps in setting up the host program. The device that the program will execute on is identified, and the corresponding command queue is accessed. The program is initialized, and the core used in this example is at coordinates `{0, 0}` in the logical mesh layout.
## Configure and initialize DRAM buffer
``` cpp
constexpr uint32_t single_tile_size = 2 * 1024;
tt_metal::InterleavedBufferConfig dram_config{
    .device = device,
    .size = single_tile_size,
    .page_size = single_tile_size,
    .buffer_type = tt_metal::BufferType::DRAM
};
```

We define the tile size to fit BFloat16 values before setting up the configuration for the DRAM buffer. Each tile holds 32x32 = 1024 values; since each BFloat16 value occupies 2 bytes, a tile occupies 2 * 1024 bytes. We specify the device to create the buffers on as well as the size of the buffers. Our DRAM configuration will be interleaved for this example, which makes the data layout row-based. Note that our choice of data format and buffer configuration has a significant impact on the performance of the application, as we are able to reduce data traffic by packing values.
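
To make the arithmetic explicit, here is a small compile-time check (illustrative only; the constant names are ours, not part of the example):

``` cpp
// A 32 x 32 tile of BFloat16 values: 1024 values at 2 bytes each.
constexpr uint32_t TILE_HEIGHT = 32;
constexpr uint32_t TILE_WIDTH = 32;
constexpr uint32_t BYTES_PER_BFLOAT16 = 2;
static_assert(TILE_HEIGHT * TILE_WIDTH * BYTES_PER_BFLOAT16 == 2 * 1024,
              "matches single_tile_size above");
```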

``` cpp
std::shared_ptr<tt::tt_metal::Buffer> src0_dram_buffer = CreateBuffer(dram_config);
std::shared_ptr<tt::tt_metal::Buffer> src1_dram_buffer = CreateBuffer(dram_config);
std::shared_ptr<tt::tt_metal::Buffer> dst_dram_buffer = CreateBuffer(dram_config);
```

Next, we allocate a buffer with the specified configuration for each of the input vectors, and another buffer for the output vector. The source data will be sent to the corresponding DRAM buffers to be accessed by the cores, and the results of the computation will be sent to the DRAM to be read into the destination vector.

``` cpp
auto src0_dram_noc_coord = src0_dram_buffer->noc_coordinates();
auto src1_dram_noc_coord = src1_dram_buffer->noc_coordinates();
auto dst_dram_noc_coord = dst_dram_buffer->noc_coordinates();
uint32_t src0_dram_noc_x = src0_dram_noc_coord.x;
uint32_t src0_dram_noc_y = src0_dram_noc_coord.y;
uint32_t src1_dram_noc_x = src1_dram_noc_coord.x;
uint32_t src1_dram_noc_y = src1_dram_noc_coord.y;
uint32_t dst_dram_noc_x = dst_dram_noc_coord.x;
uint32_t dst_dram_noc_y = dst_dram_noc_coord.y;
```

For this example, we will also specify the NoC coordinates to pass into the kernel functions as runtime arguments. We will use this to ensure that the kernels will access the data at the correct NoC addresses.
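
On the kernel side, runtime arguments are read back positionally with `get_arg_val`. A sketch of how the reader kernel might unpack them, assuming the argument order of the `SetRuntimeArgs` call shown later:

``` cpp
// Kernel-side sketch: runtime args arrive in the order the host set them.
uint32_t src0_addr       = get_arg_val<uint32_t>(0);
uint32_t src1_addr       = get_arg_val<uint32_t>(1);
uint32_t src0_dram_noc_x = get_arg_val<uint32_t>(2);
uint32_t src0_dram_noc_y = get_arg_val<uint32_t>(3);
uint32_t src1_dram_noc_x = get_arg_val<uint32_t>(4);
uint32_t src1_dram_noc_y = get_arg_val<uint32_t>(5);
```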

``` cpp
constexpr uint32_t src0_cb_index = CB::c_in0;
constexpr uint32_t num_input_tiles = 1;
// cb_src0 and cb_src1 are configured and created the same way as cb_output
// below, using src0_cb_index and src1_cb_index (elided in this excerpt).
constexpr uint32_t output_cb_index = CB::c_out0;
constexpr uint32_t num_output_tiles = 1;
CircularBufferConfig cb_output_config = CircularBufferConfig(num_output_tiles * single_tile_size, {{output_cb_index, tt::DataFormat::Float16_b}}).set_page_size(output_cb_index, single_tile_size);
CBHandle cb_output = tt_metal::CreateCircularBuffer(program, core, cb_output_config);
```
L1 circular buffers will be used to communicate data to and from the compute engine. We create circular buffers for the source vectors and the destination vector. The source data will be sent from the DRAM buffers to the circular buffers of the specified core, and the results will be stored at another circular buffer index before being sent to DRAM.
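
The circular-buffer API implements a producer/consumer handshake. A minimal sketch of the pattern, one tile at a time as in this example (`cb_id` stands for any of the indices above):

``` cpp
// Producer side (data movement kernel): reserve a slot, fill it, publish it.
cb_reserve_back(cb_id, 1);               // wait until one tile slot is free
uint32_t wr_ptr = get_write_ptr(cb_id);  // L1 address to write the tile into
// ... fill wr_ptr with one tile of data ...
cb_push_back(cb_id, 1);                  // make the tile visible to the consumer

// Consumer side (compute kernel): wait for data, use it, release the slot.
cb_wait_front(cb_id, 1);                 // block until one tile is available
// ... consume the tile ...
cb_pop_front(cb_id, 1);                  // free the slot for the producer
```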
## Kernel setup
``` cpp
KernelHandle binary_reader_kernel_id = CreateKernel(
    // ... reader kernel source path, core, and DataMovementConfig (elided) ...
);

KernelHandle unary_writer_kernel_id = CreateKernel(
    // ... writer kernel source path, core, and DataMovementConfig (elided) ...
);
```

Data movement kernels are used for reading from and writing to the DRAM. A kernel is initialized for each of these operations, with a unique RISC-V processor assigned to each kernel. These kernels read the data from the DRAM buffers into the circular buffers prior to the addition operation, then write the output data from the circular buffers back to DRAM so that it may be accessed by the host.
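
For reference, a sketch of what the elided `CreateKernel` arguments for a data movement kernel typically look like; the source path here is a placeholder, not the example's actual file:

``` cpp
KernelHandle reader_id = CreateKernel(
    program,
    "path/to/kernels/dataflow/reader.cpp",  // placeholder path
    core,
    DataMovementConfig{
        .processor = DataMovementProcessor::RISCV_0,  // a dedicated RISC-V core
        .noc = NOC::RISCV_0_default});
```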

``` cpp
vector<uint32_t> compute_kernel_args = {};
KernelHandle eltwise_binary_kernel_id = CreateKernel(
    // ... compute kernel source path, core, and ComputeConfig (elided) ...
);
```
In addition to the data movement kernels, we need to create a compute kernel for the addition operation. We use the kernel code for adding 2 tiles as specified in the above code block. The kernel function will use the data provided in the circular buffers for the computation.
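
For reference, a compute-kernel `CreateKernel` call supplies a `ComputeConfig` instead; a sketch under default settings (the source path is again a placeholder):

``` cpp
KernelHandle compute_id = CreateKernel(
    program,
    "path/to/kernels/compute/add_2_tiles.cpp",  // placeholder path
    core,
    ComputeConfig{
        .math_fidelity = MathFidelity::HiFi4,
        .compile_args = compute_kernel_args});
```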
## Program execution
``` cpp
std::vector<uint32_t> src0_vec;
std::vector<uint32_t> src1_vec;
// ... each source vector is filled with a constant bfloat16 value (elided) ...

EnqueueWriteBuffer(cq, src0_dram_buffer, src0_vec, false);
EnqueueWriteBuffer(cq, src1_dram_buffer, src1_vec, false);
```

Next, we create two source vectors, each loaded with a constant value, before queueing the commands that feed them to the corresponding DRAM buffers using `EnqueueWriteBuffer`.
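
The host-side vectors hold `uint32_t` even though the payload is BFloat16: each 32-bit word packs two 16-bit values. An illustrative helper showing the packing (our own names, not part of the example):

``` cpp
#include <cstdint>
#include <cstring>

// Convert a float to bfloat16 by keeping the upper 16 bits.
uint16_t to_bfloat16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);
}

// Pack two bfloat16 values into one uint32_t word (a in the low half).
uint32_t pack_two_bfloat16(float a, float b) {
    return (static_cast<uint32_t>(to_bfloat16(b)) << 16) | to_bfloat16(a);
}
```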

``` cpp
SetRuntimeArgs(program, binary_reader_kernel_id, core, { src0_dram_buffer->address(), src1_dram_buffer->address(), src0_dram_noc_x, src0_dram_noc_y, src1_dram_noc_x, src1_dram_noc_y});
// ... runtime args for the compute and writer kernels are set similarly, the
// writer receiving the destination buffer address and NoC coordinates ...

EnqueueProgram(cq, program, false);
Finish(cq);
```
For each of the kernels, we will set up the corresponding runtime arguments before executing the program on the device. The reader kernel reads the source data into the circular buffers before having the compute kernel run the tile addition operation.
## Reader kernel function
``` cpp
uint32_t ublock_size_bytes_0 = get_tile_size(cb_id_in0);
// ... compute NoC addresses and read one tile from each source buffer into
// its circular buffer (elided); each read is followed by a barrier ...
noc_async_read_barrier();
cb_push_back(cb_id_in1, 1);
```

The reader kernel reads in one tile from each of the two source vectors stored in DRAM and places the tiles in circular buffers on the given core, with each source vector having its own corresponding circular buffer.
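
A sketch of the per-tile read sequence for one of the sources, using the standard dataflow calls (variable names follow the excerpt above):

``` cpp
// Read one tile of src0 from DRAM into the cb_id_in0 circular buffer.
uint64_t src0_noc_addr = get_noc_addr(src0_dram_noc_x, src0_dram_noc_y, src0_addr);
cb_reserve_back(cb_id_in0, 1);                       // wait for a free tile slot
uint32_t l1_write_addr_in0 = get_write_ptr(cb_id_in0);
noc_async_read(src0_noc_addr, l1_write_addr_in0, ublock_size_bytes_0);
noc_async_read_barrier();                            // wait for the read to land
cb_push_back(cb_id_in0, 1);                          // hand the tile to compute
```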

## Compute kernel function

``` cpp
binary_op_init_common(cb_in0, cb_in1, cb_out0);
// ... wait on one tile from each input, add them with add_tiles(), and pack
// the result (elided) ...
cb_pop_front(cb_in0, 1);
cb_pop_front(cb_in1, 1);
cb_push_back(cb_out0, 1);
```
In the compute kernel, a single tile is read from each of the circular buffers corresponding to the source data. These values are unpacked from their original data formats into unsigned integers. Then, `add_tiles()` computes the result of the addition between the two retrieved tiles. The result is then packed back into the original data format and written back to the corresponding circular buffer.
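
One common shape of that sequence is sketched below; the exact register-management calls vary across TT-Metal versions, so treat this as illustrative:

``` cpp
add_tiles_init();                      // configure the FPU for tile addition
cb_wait_front(cb_in0, 1);              // wait for one tile from each input
cb_wait_front(cb_in1, 1);

tile_regs_acquire();                   // claim destination registers
add_tiles(cb_in0, cb_in1, 0, 0, 0);    // dst reg 0 = in0 tile 0 + in1 tile 0
tile_regs_commit();

tile_regs_wait();
pack_tile(0, cb_out0);                 // pack dst reg 0 into the output buffer
tile_regs_release();
```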
## Writer kernel function
``` cpp
uint64_t dst_noc_addr = get_noc_addr(dst_dram_noc_x, dst_dram_noc_y, dst_addr);
// ... wait for the output tile and write it from L1 to DRAM (elided) ...
noc_async_write_barrier();
cb_pop_front(cb_id_out0, 1);
```

At this point, the results of the addition are computed and stored in the circular buffers. We can now write these values to DRAM so that they can be accessed by the host.

``` cpp
std::vector<uint32_t> result_vec;
EnqueueReadBuffer(cq, dst_dram_buffer, result_vec, true);
CloseDevice(device);
```
When the program finishes executing, the output data is stored in the DRAM and must be read from the device using `EnqueueReadBuffer`.
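
To inspect the result on the host, the packed words can be converted back to floats. An illustrative helper mirroring the packing sketch above (names are ours):

``` cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Convert a bfloat16 (in the low 16 bits of a word) back to float.
float from_bfloat16(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Example: print the first value of result_vec.
// std::printf("result[0] = %f\n",
//             from_bfloat16(static_cast<uint16_t>(result_vec[0] & 0xFFFF)));
```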