Commit
Update 9-non-portable-kernel-models.rst
csccva authored Sep 17, 2024
1 parent 3249de4 commit fa5c276
8 changes: 4 additions & 4 deletions content/9-non-portable-kernel-models.rst
@@ -801,7 +801,7 @@ The shared memory can be used to improve performance in two ways. It is possible


Matrix Transpose
-^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~
Matrix transpose is a classic example where shared memory can significantly improve the performance. The use of shared memory reduces global memory accesses and exploits the high bandwidth and low latency of shared memory.

.. figure:: img/concepts/transpose_img.png
@@ -1222,7 +1222,7 @@ By padding the array, the data is slightly shifted, resulting in no bank conflicts
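The tiling and padding discussed above can be sketched as a CUDA/HIP transpose kernel. This is a minimal sketch, not code from the lesson; the names `transpose_lds_kernel` and `TILE_DIM` are illustrative, and a square `TILE_DIM x TILE_DIM` thread block is assumed:

```cuda
#define TILE_DIM 16

__global__ void transpose_lds_kernel(float *out, const float *in,
                                     int width, int height)
{
    // +1 padding places consecutive rows in different banks,
    // avoiding shared-memory bank conflicts on the strided read below
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced read

    __syncthreads();

    // swap the block indices so the global write is also coalesced
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}
```

Both the global read and the global write touch consecutive addresses; only the shared-memory access is strided, which the padding makes conflict-free.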


Reductions
-^^^^^^^^^^
+~~~~~~~~~~

`Reductions` refer to operations in which the elements of an array are aggregated into a single value through operations such as summing, finding the maximum or minimum, or performing logical operations.
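A block-level tree reduction using shared memory can be sketched as follows. This is an illustrative sketch, not the lesson's code; it assumes `blockDim.x` is a power of two, and the name `block_sum_kernel` is made up:

```cuda
__global__ void block_sum_kernel(const float *in, float *block_sums, int n)
{
    extern __shared__ float sdata[];  // one float per thread, sized at launch

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x * 2 + tid;

    // each thread loads and adds two elements, halving the blocks needed
    float v = 0.0f;
    if (i < n)               v  = in[i];
    if (i + blockDim.x < n)  v += in[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    // halve the number of active threads at each step of the tree
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        block_sums[blockIdx.x] = sdata[0];  // one partial sum per block
}
```

The per-block partial sums are then reduced again, either by relaunching the kernel on `block_sums` or by a final pass on the host, until a single value remains.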

@@ -1333,8 +1333,8 @@ For a detailed analysis of how to optimize reduction operations in CUDA/HIP check
- Parallel reduction on GPUs involves dividing the problem into subsets, performing reductions within blocks of threads using shared memory, and repeatedly reducing the number of elements (two per GPU thread) until only one remains.


-CUDA/HIP Streams
-^^^^^^^^^^^^^^^^
+Overlapping Computations and Memory transfer. CUDA/HIP Streams
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Modern GPUs can overlap independent operations. They can do transfers between the CPU and the GPU and execute kernels at the same time, or they can execute kernels concurrently. CUDA/HIP streams are independent execution units, sequences of operations that execute in issue order on the GPU. Operations issued in different streams can be executed concurrently.

Consider the previous case of vector addition, which involves copying data from the CPU to the GPU, performing the computation, and then copying the result back to the CPU. Issued this way, in a single stream, none of these operations can overlap.
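Splitting the vectors into chunks and issuing each chunk's copy-compute-copy sequence into its own stream lets the chunks overlap. A minimal sketch, assuming pinned host buffers `h_a`, `h_b`, `h_c` (allocated with `cudaMallocHost` so the copies can be asynchronous), device buffers `d_a`, `d_b`, `d_c`, a `vector_add` kernel, and `n` divisible by `n_streams` and by the block size:

```cuda
const int n_streams = 4;
cudaStream_t streams[n_streams];
for (int i = 0; i < n_streams; i++)
    cudaStreamCreate(&streams[i]);

int chunk = n / n_streams;  // assumes n is divisible by n_streams
for (int i = 0; i < n_streams; i++) {
    int off = i * chunk;
    // H2D copies, kernel, and D2H copy for this chunk all go to stream i;
    // work in different streams can overlap on the GPU
    cudaMemcpyAsync(d_a + off, h_a + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(d_b + off, h_b + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    vector_add<<<chunk / 256, 256, 0, streams[i]>>>(d_c + off, d_a + off,
                                                    d_b + off, chunk);
    cudaMemcpyAsync(h_c + off, d_c + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize();  // wait for all streams to finish

for (int i = 0; i < n_streams; i++)
    cudaStreamDestroy(streams[i]);
```

While stream 0 is computing its chunk, stream 1 can already be copying its input, hiding part of the transfer time behind the kernel execution.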
