[skip ci] #0: Prog Examples edits (#12972)
#0: Prog Examples readme
mbahnasTT authored Sep 21, 2024
1 parent c6d822f commit ec317dc
Showing 15 changed files with 531 additions and 905 deletions.
README.md (2 changes: 1 addition & 1 deletion)
```diff
@@ -110,7 +110,7 @@ Get started with [simple kernels](https://docs.tenstorrent.com/tt-metalium/lates
 - [Dram Loopback Data Movement](./tech_reports/prog_examples/dram_loopback/dram_loopback.md)
 ### Eltwise
 - [Eltwise Unary OP in Vector Engine (SFPU)](./tech_reports/prog_examples/eltwise_sfpu/eltwise_sfpu.md)
-- [Eltwise Binary OP in MAtrix Engine (FPU)](./tech_reports/prog_examples/eltwise_binary/eltwise_binary.md)
+- [Eltwise Binary OP in Matrix Engine (FPU)](./tech_reports/prog_examples/eltwise_binary/eltwise_binary.md)
 ### Matmul
 - [Matmul OP on a Single_core](./tech_reports/prog_examples/matmul_single_core/matmul_single_core.md)
 - [Matmul OP on Multi_core (Basic)](./tech_reports/prog_examples/matmul_multi_core/matmul_multi_core.md)
```
---
title: Add 2 Integers in Compute Kernel
---
# Add 2 Integers in Compute Kernel

In this example, we will build a TT-Metal program that adds two vectors of integers using data movement and compute kernels.

This program can be found in
[tt_metal/programming_examples/add_2_integers_in_compute/add_2_integers_in_compute.cpp](../../../tt_metal/programming_examples/add_2_integers_in_compute/add_2_integers_in_compute.cpp).

To build and execute, you may use the following commands. Note that we include the necessary environment variables here, but you may need more depending on the most up-to-date installation methods.

```bash
export ARCH_NAME=<arch name>
export TT_METAL_HOME=<this repo dir>
./build_metal.sh
./build/programming_examples/add_2_integers_in_compute

```
## Set up accelerator and program/collaboration mechanisms

``` cpp
Device *device = CreateDevice(0);
CommandQueue& cq = device->command_queue();
Program program = CreateProgram();
constexpr CoreCoord core = {0, 0};
```
We follow the standard procedure for the initial steps in setting up the host program. The device that the program will execute on is identified, and the corresponding command queue is accessed. The program is initialized, and the core used in this example is at coordinates `{0, 0}` in the logical mesh layout.
## Configure and initialize DRAM buffer
``` cpp
constexpr uint32_t single_tile_size = 2 * 1024;
tt_metal::InterleavedBufferConfig dram_config{
    .device = device,
    .size = single_tile_size,
    .page_size = single_tile_size,
    .buffer_type = tt_metal::BufferType::DRAM
};
```

We define the tile size to fit BFloat16 values before setting up the configuration for the DRAM buffer. Each tile holds 32x32 = 1024 values; since each BFloat16 value occupies 2 bytes, a tile occupies 2 * 1024 bytes. We specify the device to create the buffers on as well as the size of the buffers. Our DRAM configuration will be interleaved for this example, which makes the data layout row-based. Note that our choice of data format and buffer configuration has a significant impact on the performance of the application, as we are able to reduce data traffic by packing values.
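
To make the arithmetic explicit, here is a small compile-time check (illustrative only; the constant names are ours, not part of the example):

``` cpp
// A 32 x 32 tile of BFloat16 values: 1024 values at 2 bytes each.
constexpr uint32_t TILE_HEIGHT = 32;
constexpr uint32_t TILE_WIDTH = 32;
constexpr uint32_t BYTES_PER_BFLOAT16 = 2;
static_assert(TILE_HEIGHT * TILE_WIDTH * BYTES_PER_BFLOAT16 == 2 * 1024,
              "matches single_tile_size above");
```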

``` cpp
std::shared_ptr<tt::tt_metal::Buffer> src0_dram_buffer = CreateBuffer(dram_config);
std::shared_ptr<tt::tt_metal::Buffer> src1_dram_buffer = CreateBuffer(dram_config);
std::shared_ptr<tt::tt_metal::Buffer> dst_dram_buffer = CreateBuffer(dram_config);
```

Next, we allocate a buffer with the specified configuration for each of the input vectors, and another buffer for the output vector. The source data will be sent to the corresponding DRAM buffers to be accessed by the cores, and the results of the computation will be sent to the DRAM to be read into the destination vector.

``` cpp
auto src0_dram_noc_coord = src0_dram_buffer->noc_coordinates();
auto src1_dram_noc_coord = src1_dram_buffer->noc_coordinates();
auto dst_dram_noc_coord = dst_dram_buffer->noc_coordinates();
uint32_t src0_dram_noc_x = src0_dram_noc_coord.x;
uint32_t src0_dram_noc_y = src0_dram_noc_coord.y;
uint32_t src1_dram_noc_x = src1_dram_noc_coord.x;
uint32_t src1_dram_noc_y = src1_dram_noc_coord.y;
uint32_t dst_dram_noc_x = dst_dram_noc_coord.x;
uint32_t dst_dram_noc_y = dst_dram_noc_coord.y;
```

For this example, we will also specify the NoC coordinates to pass into the kernel functions as runtime arguments. We will use this to ensure that the kernels will access the data at the correct NoC addresses.
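
On the kernel side, runtime arguments are read back positionally with `get_arg_val`. A sketch of how the reader kernel might unpack them, assuming the argument order of the `SetRuntimeArgs` call shown later:

``` cpp
// Kernel-side sketch: runtime args arrive in the order the host set them.
uint32_t src0_addr       = get_arg_val<uint32_t>(0);
uint32_t src1_addr       = get_arg_val<uint32_t>(1);
uint32_t src0_dram_noc_x = get_arg_val<uint32_t>(2);
uint32_t src0_dram_noc_y = get_arg_val<uint32_t>(3);
uint32_t src1_dram_noc_x = get_arg_val<uint32_t>(4);
uint32_t src1_dram_noc_y = get_arg_val<uint32_t>(5);
```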

``` cpp
constexpr uint32_t src0_cb_index = CB::c_in0;
constexpr uint32_t num_input_tiles = 1;
// cb_src0 and cb_src1 are configured and created the same way as cb_output
// below, using src0_cb_index and src1_cb_index (elided in this excerpt).
constexpr uint32_t output_cb_index = CB::c_out0;
constexpr uint32_t num_output_tiles = 1;
CircularBufferConfig cb_output_config = CircularBufferConfig(num_output_tiles * single_tile_size, {{output_cb_index, tt::DataFormat::Float16_b}}).set_page_size(output_cb_index, single_tile_size);
CBHandle cb_output = tt_metal::CreateCircularBuffer(program, core, cb_output_config);
```
L1 circular buffers will be used to communicate data to and from the compute engine. We create circular buffers for the source vectors and the destination vector. The source data will be sent from the DRAM buffers to the circular buffers of the specified core, and the results will be stored at another circular buffer index before being sent to DRAM.
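
The circular-buffer API implements a producer/consumer handshake. A minimal sketch of the pattern, one tile at a time as in this example (`cb_id` stands for any of the indices above):

``` cpp
// Producer side (data movement kernel): reserve a slot, fill it, publish it.
cb_reserve_back(cb_id, 1);               // wait until one tile slot is free
uint32_t wr_ptr = get_write_ptr(cb_id);  // L1 address to write the tile into
// ... fill wr_ptr with one tile of data ...
cb_push_back(cb_id, 1);                  // make the tile visible to the consumer

// Consumer side (compute kernel): wait for data, use it, release the slot.
cb_wait_front(cb_id, 1);                 // block until one tile is available
// ... consume the tile ...
cb_pop_front(cb_id, 1);                  // free the slot for the producer
```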
## Kernel setup
``` cpp
KernelHandle binary_reader_kernel_id = CreateKernel(
    // ... reader kernel source path, core, and DataMovementConfig (elided) ...
);

KernelHandle unary_writer_kernel_id = CreateKernel(
    // ... writer kernel source path, core, and DataMovementConfig (elided) ...
);
```

Data movement kernels are used for reading from and writing to the DRAM. A kernel is initialized for each of these operations, with a unique RISC-V processor assigned to each kernel. These kernels read the data from the DRAM buffers into the circular buffers prior to the addition operation, then write the output data from the circular buffers back to DRAM so that it may be accessed by the host.
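
For reference, a sketch of what the elided `CreateKernel` arguments for a data movement kernel typically look like; the source path here is a placeholder, not the example's actual file:

``` cpp
KernelHandle reader_id = CreateKernel(
    program,
    "path/to/kernels/dataflow/reader.cpp",  // placeholder path
    core,
    DataMovementConfig{
        .processor = DataMovementProcessor::RISCV_0,  // a dedicated RISC-V core
        .noc = NOC::RISCV_0_default});
```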

``` cpp
vector<uint32_t> compute_kernel_args = {};
KernelHandle eltwise_binary_kernel_id = CreateKernel(
    // ... compute kernel source path, core, and ComputeConfig (elided) ...
);
```
In addition to the data movement kernels, we need to create a compute kernel for the addition operation. We use the kernel code for adding 2 tiles as specified in the above code block. The kernel function will use the data provided in the circular buffers for the computation.
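
For reference, a compute-kernel `CreateKernel` call supplies a `ComputeConfig` instead; a sketch under default settings (the source path is again a placeholder):

``` cpp
KernelHandle compute_id = CreateKernel(
    program,
    "path/to/kernels/compute/add_2_tiles.cpp",  // placeholder path
    core,
    ComputeConfig{
        .math_fidelity = MathFidelity::HiFi4,
        .compile_args = compute_kernel_args});
```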
## Program execution
``` cpp
std::vector<uint32_t> src0_vec;
std::vector<uint32_t> src1_vec;
// ... each source vector is filled with a constant bfloat16 value (elided) ...

EnqueueWriteBuffer(cq, src0_dram_buffer, src0_vec, false);
EnqueueWriteBuffer(cq, src1_dram_buffer, src1_vec, false);
```

Next, we create two source vectors, each loaded with a constant value, before queueing the commands that feed them to the corresponding DRAM buffers using `EnqueueWriteBuffer`.
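
The host-side vectors hold `uint32_t` even though the payload is BFloat16: each 32-bit word packs two 16-bit values. An illustrative helper showing the packing (our own names, not part of the example):

``` cpp
#include <cstdint>
#include <cstring>

// Convert a float to bfloat16 by keeping the upper 16 bits.
uint16_t to_bfloat16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);
}

// Pack two bfloat16 values into one uint32_t word (a in the low half).
uint32_t pack_two_bfloat16(float a, float b) {
    return (static_cast<uint32_t>(to_bfloat16(b)) << 16) | to_bfloat16(a);
}
```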

``` cpp
SetRuntimeArgs(program, binary_reader_kernel_id, core, { src0_dram_buffer->address(), src1_dram_buffer->address(), src0_dram_noc_x, src0_dram_noc_y, src1_dram_noc_x, src1_dram_noc_y});
// ... runtime args for the compute and writer kernels are set similarly, the
// writer receiving the destination buffer address and NoC coordinates ...

EnqueueProgram(cq, program, false);
Finish(cq);
```
For each of the kernels, we will set up the corresponding runtime arguments before executing the program on the device. The reader kernel reads the source data into the circular buffers before having the compute kernel run the tile addition operation.
## Reader kernel function
``` cpp
uint32_t ublock_size_bytes_0 = get_tile_size(cb_id_in0);
// ... compute NoC addresses and read one tile from each source buffer into
// its circular buffer (elided); each read is followed by a barrier ...
noc_async_read_barrier();
cb_push_back(cb_id_in1, 1);
```

The reader kernel reads in one tile from each of the two source vectors stored in DRAM and places the tiles in circular buffers on the given core, with each source vector having its own corresponding circular buffer.
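
A sketch of the per-tile read sequence for one of the sources, using the standard dataflow calls (variable names follow the excerpt above):

``` cpp
// Read one tile of src0 from DRAM into the cb_id_in0 circular buffer.
uint64_t src0_noc_addr = get_noc_addr(src0_dram_noc_x, src0_dram_noc_y, src0_addr);
cb_reserve_back(cb_id_in0, 1);                       // wait for a free tile slot
uint32_t l1_write_addr_in0 = get_write_ptr(cb_id_in0);
noc_async_read(src0_noc_addr, l1_write_addr_in0, ublock_size_bytes_0);
noc_async_read_barrier();                            // wait for the read to land
cb_push_back(cb_id_in0, 1);                          // hand the tile to compute
```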

## Compute kernel function

``` cpp
binary_op_init_common(cb_in0, cb_in1, cb_out0);
// ... wait on one tile from each input, add them with add_tiles(), and pack
// the result (elided) ...
cb_pop_front(cb_in0, 1);
cb_pop_front(cb_in1, 1);
cb_push_back(cb_out0, 1);
```
In the compute kernel, a single tile is read from each of the circular buffers corresponding to the source data. These values are unpacked from their original data formats into unsigned integers. Then, `add_tiles()` computes the result of the addition between the two retrieved tiles. The result is then packed back into the original data format and written back to the corresponding circular buffer.
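
One common shape of that sequence is sketched below; the exact register-management calls vary across TT-Metal versions, so treat this as illustrative:

``` cpp
add_tiles_init();                      // configure the FPU for tile addition
cb_wait_front(cb_in0, 1);              // wait for one tile from each input
cb_wait_front(cb_in1, 1);

tile_regs_acquire();                   // claim destination registers
add_tiles(cb_in0, cb_in1, 0, 0, 0);    // dst reg 0 = in0 tile 0 + in1 tile 0
tile_regs_commit();

tile_regs_wait();
pack_tile(0, cb_out0);                 // pack dst reg 0 into the output buffer
tile_regs_release();
```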
## Writer kernel function
``` cpp
uint64_t dst_noc_addr = get_noc_addr(dst_dram_noc_x, dst_dram_noc_y, dst_addr);
// ... wait for the output tile and write it from L1 to DRAM (elided) ...
noc_async_write_barrier();
cb_pop_front(cb_id_out0, 1);
```

At this point, the results of the addition are computed and stored in the circular buffers. We can now write these values to DRAM so that they can be accessed by the host.

``` cpp
std::vector<uint32_t> result_vec;
EnqueueReadBuffer(cq, dst_dram_buffer, result_vec, true);
CloseDevice(device);
```
When the program finishes executing, the output data is stored in the DRAM and must be read from the device using `EnqueueReadBuffer`.
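
To inspect the result on the host, the packed words can be converted back to floats. An illustrative helper mirroring the packing sketch above (names are ours):

``` cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Convert a bfloat16 (in the low 16 bits of a word) back to float.
float from_bfloat16(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Example: print the first value of result_vec.
// std::printf("result[0] = %f\n",
//             from_bfloat16(static_cast<uint16_t>(result_vec[0] & 0xFFFF)));
```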