
Commit

Merge branch 'develop' into amber/workspace-abstraction
amberhassaan authored Dec 12, 2023
2 parents 0b9a725 + a64e600 commit 9880704
Showing 17 changed files with 7,295 additions and 1,662 deletions.
20 changes: 20 additions & 0 deletions .github/workflows/update_develop_nightly.yml
@@ -0,0 +1,20 @@
name: Sync branch

on:
  pull_request:
  workflow_dispatch:
  push:
    branches:
      - develop
jobs:
  sync-branch:
    name: Update nightly branch
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@main
      - uses: connor-baer/action-sync-branch@main
        with:
          branch: develop_nightly
          token: ${{ secrets.GITHUB_TOKEN }}
          force: false
12 changes: 6 additions & 6 deletions README.md
@@ -2,7 +2,7 @@

AMD's library for high performance machine learning primitives.
Sources and binaries can be found at [MIOpen's GitHub site](https://github.com/ROCmSoftwarePlatform/MIOpen).
-The latest released documentation can be read online [here](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/index.html).
+The latest released documentation can be read online [here](https://rocm.docs.amd.com/projects/MIOpen/en/latest/index.html).

MIOpen supports two programming models

@@ -28,7 +28,7 @@ python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
## Prerequisites

* More information about ROCm stack via [ROCm Information Portal](https://docs.amd.com/).
-* A ROCm enabled platform, more info [here](https://rocm.github.io/install.html).
+* A ROCm enabled platform, more info [here](https://rocmdocs.amd.com/en/latest/).
* Base software stack, which includes:
* HIP -
* HIP and HCC libraries and header files.
@@ -39,7 +39,7 @@ python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
* [ROCm cmake](https://github.com/RadeonOpenCompute/rocm-cmake) - provide cmake modules for common build tasks needed for the ROCM software stack.
* [Half](http://half.sourceforge.net/) - IEEE 754-based half-precision floating point library
* [Boost](http://www.boost.org/)
-* MIOpen uses `boost-system` and `boost-filesystem` packages to enable persistent [kernel cache](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/cache.html)
+* MIOpen uses `boost-system` and `boost-filesystem` packages to enable persistent [kernel cache](https://rocm.docs.amd.com/projects/MIOpen/en/latest/cache.html)
* Version 1.79 is recommended; older versions may need patches to work on newer systems, e.g. boost1{69,70,72} w/glibc-2.34
* [SQLite3](https://sqlite.org/index.html) - reading and writing performance database
* lbzip2 - multi-threaded compress or decompress utility
@@ -174,7 +174,7 @@ cmake -DMIOPEN_BACKEND=OpenCL -DBUILD_DEV=On ..

Database paths can be explicitly customized by means of `MIOPEN_SYSTEM_DB_PATH` (System PerfDb) and `MIOPEN_USER_DB_PATH` (User PerfDb) cmake variables.

-More information about the performance database can be found [here](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/perfdatabase.html).
+More information about the performance database can be found [here](https://rocm.docs.amd.com/projects/MIOpen/en/latest/perfdatabase.html).

### Persistent Program Cache

@@ -184,7 +184,7 @@ Users can also disable the cache during runtime using the environmental variable

#### For MIOpen version 2.3 and earlier

-If the compiler changes, or the user modifies the kernels then the cache must be deleted for the MIOpen version in use; e.g., `rm -rf ~/.cache/miopen/<miopen-version-number>`. More information about the cache can be found [here](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/cache.html).
+If the compiler changes or the user modifies the kernels, the cache must be deleted for the MIOpen version in use; e.g., `rm -rf ~/.cache/miopen/<miopen-version-number>`. More information about the cache can be found [here](https://rocm.docs.amd.com/projects/MIOpen/en/latest/cache.html).

#### For MIOpen version 2.4 and later

@@ -218,7 +218,7 @@ The driver can be built using the `MIOpenDriver` target:

` cmake --build . --config Release --target MIOpenDriver ` **OR** ` make MIOpenDriver `

-Documentation on how to run the driver is [here](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/driver.html).
+Documentation on how to run the driver is [here](https://rocm.docs.amd.com/projects/MIOpen/en/latest/driver.html).

## Running the tests

2 changes: 1 addition & 1 deletion driver/rnn_seq_driver.hpp
@@ -948,7 +948,7 @@ int RNNSeqDriver<Tgpu, Tref>::AllocateBuffersAndCopy()
}

// Unless the seed is persistent between runs, validation using a cache stored in a file is impossible.
-srand(0);
+prng::reset_seed();

auto fill_array_via_gen = [](auto& dst, size_t dst_sz, double range_l, double range_r) {
for(size_t it = 0; it < dst_sz; it++)
34 changes: 30 additions & 4 deletions include/miopen/miopen.h
@@ -649,7 +649,7 @@ MIOPEN_EXPORT miopenStatus_t miopenSet4dTensorDescriptor(

/*! @brief Set shape of ND tensor with specific layout
*
-* Interface for setting N-D tensor shape. This interface support NHWC, NCHW, NCHWc*, CHWNc*
+* Interface for setting N-D packed tensor shape. This interface supports NHWC, NCHW, NCHWc*, CHWNc*
* @param tensorDesc Tensor descriptor (input/output)
* @param dataType MIOpen datatype (input)
* @param tensorLayout Tensor layout (input)
@@ -665,7 +665,10 @@ miopenSetNdTensorDescriptorWithLayout(miopenTensorDescriptor_t tensorDesc,
int num_lens);
/*! @brief Set shape and stride of 4D tensor
*
-* Interface for setting 4-D tensor shape and stride.
+* Interface for setting 4-D tensor shape and stride. It allows the creation of a non-packed
+* tensor, i.e. one whose strides may leave gaps between elements, so the elements are not
+* required to occupy contiguous storage.
*
* @param tensorDesc Tensor descriptor (input/output)
* @param dataType MIOpen datatype (input)
@@ -719,8 +722,7 @@ MIOPEN_EXPORT miopenStatus_t miopenGet4dTensorDescriptor(miopenTensorDescriptor_

/*! @brief Set shape of N-dimensional tensor
*
-* Interface for setting tensor shape. MIOpen has support for 1, 2, 3, 4, 5 dimensional tensor of
-* layout.
+* Interface for setting non-packed tensor shape.
* @param tensorDesc Tensor descriptor (input/output)
* @param dataType MIOpen datatype (input)
* @param nbDims Number of dimensions in the dimsA array (input)
@@ -1731,6 +1733,14 @@ miopenFindConvolutionForwardAlgorithm(miopenHandle_t handle,
* The scaling parameter alpha (float) and shift parameter beta (float) are only supported for
* alpha = 1 and beta = 0.
*
* The forward convolution is designed to accommodate both packed and non-packed tensor strides for
* multiple data types and dimensions across various platforms. This flexibility ensures optimal
* performance in handling diverse computational scenarios. To configure tensor parameters,
* including strides, users can utilize the APIs miopenSetTensorDescriptor() and
* miopenGetTensorDescriptor(). These APIs empower developers to seamlessly set and retrieve tensor
* information, facilitating a more intuitive and efficient workflow. The tensor strides are
* non-packed by default.
*
* If using Group/Depthwise convolution mode, call miopenSetConvolutionGroupCount() before running
* this.
*
@@ -1875,6 +1885,14 @@ miopenFindConvolutionBackwardDataAlgorithm(miopenHandle_t handle,
* determine the required memory needed for the workspace and the best convolutional
* algorithm.
*
* The backward data convolution is designed to accommodate both packed and non-packed tensor
* strides for multiple data types and dimensions across various platforms. This flexibility ensures
* optimal performance in handling diverse computational scenarios. To configure tensor parameters,
* including strides, users can utilize the APIs miopenSetTensorDescriptor() and
* miopenGetTensorDescriptor(). These APIs empower developers to seamlessly set and retrieve tensor
* information, facilitating a more intuitive and efficient workflow. The tensor strides are
* non-packed by default.
*
* If using Group/Depthwise convolution mode, call miopenSetConvolutionGroupCount() before running
* this.
*
@@ -1999,6 +2017,14 @@ miopenFindConvolutionBackwardWeightsAlgorithm(miopenHandle_t handle,
* been executed previously to determine the required memory needed for the workspace and the
* best convolutional algorithm.
*
* The backward weights convolution is designed to accommodate both packed and non-packed tensor
* strides for multiple data types and dimensions across various platforms. This flexibility ensures
* optimal performance in handling diverse computational scenarios. To configure tensor parameters,
* including strides, users can utilize the APIs miopenSetTensorDescriptor() and
* miopenGetTensorDescriptor(). These APIs empower developers to seamlessly set and retrieve tensor
* information, facilitating a more intuitive and efficient workflow. The tensor strides are
* non-packed by default.
*
* If using Group/Depthwise convolution mode, call miopenSetConvolutionGroupCount() before running
* this.
*
180 changes: 101 additions & 79 deletions src/ocl/rnnocl.cpp
@@ -570,6 +570,26 @@ void RNNDescriptor::RNNForwardTraining_MS(Handle& handle,
const std::vector<size_t> hcy_dst_stride{
static_cast<size_t>(hidden_size * max_batch), static_cast<size_t>(hidden_size), 1};

if(in_n.at(0) < max_batch)
{
float beta = 0.;
const std::vector<size_t> zero_set_size{1,
static_cast<size_t>(max_batch - in_n.at(0)),
static_cast<size_t>(hidden_size)};
auto set_batch_offset = in_n.at(0) * hidden_size;

auto set_desc =
miopen::TensorDescriptor(wDesc.GetType(), zero_set_size, hcy_dst_stride);
if(hy != nullptr)
{
SetTensor(handle, set_desc, hy, &beta, hcy_layer_offset + set_batch_offset);
}
if(cy != nullptr)
{
SetTensor(handle, set_desc, cy, &beta, hcy_layer_offset + set_batch_offset);
}
}

for(int time_i = seq_len - 1; time_i >= 0; time_i--)
{
auto copy_batch = (time_i == seq_len - 1) ? in_n.at(time_i)
@@ -2879,86 +2899,89 @@ void RNNDescriptor::RNNForwardTrainingPackedTensors(
}
else
{
sp_size[1] = batch_n - in_n.at(0);
sp_size[2] = wei_len;
sp_desc = miopen::TensorDescriptor(wDesc.GetType(), sp_size, sp_stride);
w_size[1] = 1;
w_size[2] = wei_len;
w_desc = miopen::TensorDescriptor(wDesc.GetType(), w_size, w_stride);
if(batch_n - in_n.at(0) > 0)
{
sp_size[1] = batch_n - in_n.at(0);
sp_size[2] = wei_len;
sp_desc = miopen::TensorDescriptor(wDesc.GetType(), sp_size, sp_stride);
w_size[1] = 1;
w_size[2] = wei_len;
w_desc = miopen::TensorDescriptor(wDesc.GetType(), w_size, w_stride);

OpTensor(handle,
miopenTensorOpAdd,
&alpha0,
sp_desc,
reserveSpace,
&alpha1,
w_desc,
w,
&beta_t,
sp_desc,
reserveSpace,
hid_shift + in_n.at(0) * hy_stride,
wei_shift_bias_temp,
hid_shift + in_n.at(0) * hy_stride,
true);
// Update time
profileRNNkernels(handle, 1, ctime);
OpTensor(handle,
miopenTensorOpAdd,
&alpha0,
sp_desc,
reserveSpace,
&alpha1,
w_desc,
w,
&beta_t,
sp_desc,
reserveSpace,
hid_shift + in_n.at(0) * hy_stride,
wei_shift_bias_temp,
hid_shift + in_n.at(0) * hy_stride,
true);
// Update time
profileRNNkernels(handle, 1, ctime);

if(dirMode != 0u)
{
if(in_n.at(0) == in_n.at(seqLen - 1))
{
OpTensor(handle,
miopenTensorOpAdd,
&alpha0,
sp_desc,
reserveSpace,
&alpha1,
w_desc,
w,
&beta_t,
sp_desc,
reserveSpace,
hid_shift + wei_len,
wei_shift_bias_temp + wei_len,
hid_shift + wei_len,
true);
// Update time
profileRNNkernels(handle, 1, ctime);
}
else
if(dirMode != 0u)
{
int cur_batch = 0;
for(int ti = 0; ti < seqLen; ti++)
if(in_n.at(0) == in_n.at(seqLen - 1))
{
if(ti != (seqLen - 1))
OpTensor(handle,
miopenTensorOpAdd,
&alpha0,
sp_desc,
reserveSpace,
&alpha1,
w_desc,
w,
&beta_t,
sp_desc,
reserveSpace,
hid_shift + wei_len,
wei_shift_bias_temp + wei_len,
hid_shift + wei_len,
true);
// Update time
profileRNNkernels(handle, 1, ctime);
}
else
{
int cur_batch = 0;
for(int ti = 0; ti < seqLen; ti++)
{
offset = hid_shift + cur_batch * hy_stride;
if(ti != (seqLen - 1))
{
offset = hid_shift + cur_batch * hy_stride;

sp_size[1] = in_n.at(ti + 1);
sp_size[2] = wei_len;
sp_desc =
miopen::TensorDescriptor(wDesc.GetType(), sp_size, sp_stride);
sp_size[1] = in_n.at(ti + 1);
sp_size[2] = wei_len;
sp_desc = miopen::TensorDescriptor(
wDesc.GetType(), sp_size, sp_stride);

OpTensor(handle,
miopenTensorOpAdd,
&alpha0,
sp_desc,
reserveSpace,
&alpha1,
w_desc,
w,
&beta_t,
sp_desc,
reserveSpace,
static_cast<int>(offset) + wei_len,
wei_shift_bias_temp + wei_len,
static_cast<int>(offset) + wei_len,
true);
// Update time
profileRNNkernels(handle, 1, ctime);
OpTensor(handle,
miopenTensorOpAdd,
&alpha0,
sp_desc,
reserveSpace,
&alpha1,
w_desc,
w,
&beta_t,
sp_desc,
reserveSpace,
static_cast<int>(offset) + wei_len,
wei_shift_bias_temp + wei_len,
static_cast<int>(offset) + wei_len,
true);
// Update time
profileRNNkernels(handle, 1, ctime);
}
cur_batch += in_n.at(ti);
}
cur_batch += in_n.at(ti);
}
}
}
@@ -5374,18 +5397,17 @@ void RNNDescriptor::RNNBackwardDataPackedTensors(
// dinput
if(inputMode == miopenRNNskip)
{
sp_size[1] = batch_n;
sp_size[2] = hy_h;
x_size[1] = batch_n;
x_size[2] = hy_h;
x_desc = miopen::TensorDescriptor(rnn_data_type, x_size, x_stride);
sp_desc = miopen::TensorDescriptor(rnn_data_type, sp_size, sp_stride);
const std::vector<int> dx_size{1, batch_n, hy_h};
x_desc = miopen::TensorDescriptor(rnn_data_type, dx_size, x_stride);
sp_desc = miopen::TensorDescriptor(rnn_data_type, dx_size, sp_stride);

alpha0 = 1;
alpha1 = 1;
beta_t = 0;

for(int gi = 0; gi < nHiddenTensorsPerLayer * bi; gi++)
CopyTensor(handle, sp_desc, workSpace, x_desc, dx, 0, 0, true);
profileRNNkernels(handle, 1, ctime);
for(int gi = 1; gi < nHiddenTensorsPerLayer * bi; gi++)
{
OpTensor(handle,
miopenTensorOpAdd,
4 changes: 4 additions & 0 deletions src/rnn/rnn_util.cpp
@@ -317,6 +317,10 @@ void RNNTensorBaseLayoutConverter::ChangeTensorGPUDataPadding(
const std::vector<size_t> packed_stride =
get_packed_stride(copy_size, tensor_desc.GetLayoutVector());

// Nothing to copy, avoiding error with zero lens in TensorDescriptor
if(!std::all_of(copy_size.cbegin(), copy_size.cend(), [](size_t x) { return x > 0; }))
continue;

const auto packed_desc =
miopen::TensorDescriptor(tensor_desc.GetType(), copy_size, packed_stride);
const auto padded_desc =
