
Commit

Merge branch 'develop' into amber/workspace-abstraction
amberhassaan authored Dec 12, 2023
2 parents 0b9a725 + a64e600 commit 9880704
Showing 17 changed files with 7,295 additions and 1,662 deletions.
20 changes: 20 additions & 0 deletions .github/workflows/update_develop_nightly.yml
@@ -0,0 +1,20 @@
name: Sync branch

on:
  pull_request:
  workflow_dispatch:
  push:
    branches:
      - develop
jobs:
  sync-branch:
    name: Update nightly branch
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@main
      - uses: connor-baer/action-sync-branch@main
        with:
          branch: develop_nightly
          token: ${{ secrets.GITHUB_TOKEN }}
          force: false
12 changes: 6 additions & 6 deletions README.md
@@ -2,7 +2,7 @@

AMD's library for high performance machine learning primitives.
Sources and binaries can be found at [MIOpen's GitHub site](https://github.com/ROCmSoftwarePlatform/MIOpen).
-The latest released documentation can be read online [here](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/index.html).
+The latest released documentation can be read online [here](https://rocm.docs.amd.com/projects/MIOpen/en/latest/index.html).

MIOpen supports two programming models

@@ -28,7 +28,7 @@ python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
## Prerequisites

* More information about ROCm stack via [ROCm Information Portal](https://docs.amd.com/).
-* A ROCm enabled platform, more info [here](https://rocm.github.io/install.html).
+* A ROCm enabled platform, more info [here](https://rocmdocs.amd.com/en/latest/).
* Base software stack, which includes:
* HIP -
* HIP and HCC libraries and header files.
@@ -39,7 +39,7 @@ python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
* [ROCm cmake](https://github.com/RadeonOpenCompute/rocm-cmake) - provide cmake modules for common build tasks needed for the ROCM software stack.
* [Half](http://half.sourceforge.net/) - IEEE 754-based half-precision floating point library
* [Boost](http://www.boost.org/)
-* MIOpen uses `boost-system` and `boost-filesystem` packages to enable persistent [kernel cache](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/cache.html)
+* MIOpen uses `boost-system` and `boost-filesystem` packages to enable persistent [kernel cache](https://rocm.docs.amd.com/projects/MIOpen/en/latest/cache.html)
* Version 1.79 is recommended; older versions may need patches to work on newer systems, e.g. boost1{69,70,72} w/glibc-2.34
* [SQLite3](https://sqlite.org/index.html) - reading and writing performance database
* lbzip2 - multi-threaded compress or decompress utility
@@ -174,7 +174,7 @@ cmake -DMIOPEN_BACKEND=OpenCL -DBUILD_DEV=On ..

Database paths can be explicitly customized by means of `MIOPEN_SYSTEM_DB_PATH` (System PerfDb) and `MIOPEN_USER_DB_PATH` (User PerfDb) cmake variables.

-More information about the performance database can be found [here](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/perfdatabase.html).
+More information about the performance database can be found [here](https://rocm.docs.amd.com/projects/MIOpen/en/latest/perfdatabase.html).

### Persistent Program Cache

@@ -184,7 +184,7 @@ Users can also disable the cache during runtime using the environmental variable

#### For MIOpen version 2.3 and earlier

-If the compiler changes, or the user modifies the kernels then the cache must be deleted for the MIOpen version in use; e.g., `rm -rf ~/.cache/miopen/<miopen-version-number>`. More information about the cache can be found [here](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/cache.html).
+If the compiler changes or the user modifies the kernels, the cache must be deleted for the MIOpen version in use; e.g., `rm -rf ~/.cache/miopen/<miopen-version-number>`. More information about the cache can be found [here](https://rocm.docs.amd.com/projects/MIOpen/en/latest/cache.html).

#### For MIOpen version 2.4 and later

@@ -218,7 +218,7 @@ The driver can be built using the `MIOpenDriver` target:

` cmake --build . --config Release --target MIOpenDriver ` **OR** ` make MIOpenDriver `

-Documentation on how to run the driver is [here](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/driver.html).
+Documentation on how to run the driver is [here](https://rocm.docs.amd.com/projects/MIOpen/en/latest/driver.html).

## Running the tests

2 changes: 1 addition & 1 deletion driver/rnn_seq_driver.hpp
@@ -948,7 +948,7 @@ int RNNSeqDriver<Tgpu, Tref>::AllocateBuffersAndCopy()
}

// Unless the seed is persistent between runs, validation using a cache stored in a file is impossible.
-srand(0);
+prng::reset_seed();

auto fill_array_via_gen = [](auto& dst, size_t dst_sz, double range_l, double range_r) {
for(size_t it = 0; it < dst_sz; it++)
34 changes: 30 additions & 4 deletions include/miopen/miopen.h
@@ -649,7 +649,7 @@ MIOPEN_EXPORT miopenStatus_t miopenSet4dTensorDescriptor(

/*! @brief Set shape of ND tensor with specific layout
*
-* Interface for setting N-D tensor shape. This interface support NHWC, NCHW, NCHWc*, CHWNc*
+* Interface for setting N-D packed tensor shape. This interface supports NHWC, NCHW, NCHWc*, CHWNc*
* @param tensorDesc Tensor descriptor (input/output)
* @param dataType MIOpen datatype (input)
* @param tensorLayout Tensor layout (input)
@@ -665,7 +665,10 @@ miopenSetNdTensorDescriptorWithLayout(miopenTensorDescriptor_t tensorDesc,
int num_lens);
/*! @brief Set shape and stride of 4D tensor
*
-* Interface for setting 4-D tensor shape and stride.
+* Interface for setting 4-D tensor shape and stride. It allows the creation of a non-packed
+* tensor, i.e. one whose strides may leave gaps between elements, so the elements are not
+* required to occupy contiguous storage.
*
* @param tensorDesc Tensor descriptor (input/output)
* @param dataType MIOpen datatype (input)
@@ -719,8 +722,7 @@ MIOPEN_EXPORT miopenStatus_t miopenGet4dTensorDescriptor(miopenTensorDescriptor_

/*! @brief Set shape of N-dimensional tensor
*
-* Interface for setting tensor shape. MIOpen has support for 1, 2, 3, 4, 5 dimensional tensor of
-* layout.
+* Interface for setting non-packed tensor shape.
* @param tensorDesc Tensor descriptor (input/output)
* @param dataType MIOpen datatype (input)
* @param nbDims Number of dimensions in the dimsA array (input)
@@ -1731,6 +1733,14 @@ miopenFindConvolutionForwardAlgorithm(miopenHandle_t handle,
* The scaling parameter alpha (float) and shift parameter beta (float) are only supported for
* alpha = 1 and beta = 0.
*
* The forward convolution is designed to accommodate both packed and non-packed tensor strides for
* multiple data types and dimensions across various platforms. This flexibility ensures optimal
* performance in handling diverse computational scenarios. To configure tensor parameters,
* including strides, users can utilize the APIs miopenSetTensorDescriptor() and
* miopenGetTensorDescriptor(). These APIs empower developers to seamlessly set and retrieve tensor
* information, facilitating a more intuitive and efficient workflow. The tensor strides are
* non-packed by default.
*
* If using Group/Depthwise convolution mode, call miopenSetConvolutionGroupCount() before running
* this.
*
@@ -1875,6 +1885,14 @@ miopenFindConvolutionBackwardDataAlgorithm(miopenHandle_t handle,
* determine the required memory needed for the workspace and the best convolutional
* algorithm.
*
* The backward data convolution is designed to accommodate both packed and non-packed tensor
* strides for multiple data types and dimensions across various platforms. This flexibility ensures
* optimal performance in handling diverse computational scenarios. To configure tensor parameters,
* including strides, users can utilize the APIs miopenSetTensorDescriptor() and
* miopenGetTensorDescriptor(). These APIs empower developers to seamlessly set and retrieve tensor
* information, facilitating a more intuitive and efficient workflow. The tensor strides are
* non-packed by default.
*
* If using Group/Depthwise convolution mode, call miopenSetConvolutionGroupCount() before running
* this.
*
@@ -1999,6 +2017,14 @@ miopenFindConvolutionBackwardWeightsAlgorithm(miopenHandle_t handle,
* been executed previously to determine the required memory needed for the workspace and the
* best convolutional algorithm.
*
* The backward weights convolution is designed to accommodate both packed and non-packed tensor
* strides for multiple data types and dimensions across various platforms. This flexibility ensures
* optimal performance in handling diverse computational scenarios. To configure tensor parameters,
* including strides, users can utilize the APIs miopenSetTensorDescriptor() and
* miopenGetTensorDescriptor(). These APIs empower developers to seamlessly set and retrieve tensor
* information, facilitating a more intuitive and efficient workflow. The tensor strides are
* non-packed by default.
*
* If using Group/Depthwise convolution mode, call miopenSetConvolutionGroupCount() before running
* this.
*
180 changes: 101 additions & 79 deletions src/ocl/rnnocl.cpp
@@ -570,6 +570,26 @@ void RNNDescriptor::RNNForwardTraining_MS(Handle& handle,
const std::vector<size_t> hcy_dst_stride{
static_cast<size_t>(hidden_size * max_batch), static_cast<size_t>(hidden_size), 1};

if(in_n.at(0) < max_batch)
{
float beta = 0.;
const std::vector<size_t> zero_set_size{1,
static_cast<size_t>(max_batch - in_n.at(0)),
static_cast<size_t>(hidden_size)};
auto set_batch_offset = in_n.at(0) * hidden_size;

auto set_desc =
miopen::TensorDescriptor(wDesc.GetType(), zero_set_size, hcy_dst_stride);
if(hy != nullptr)
{
SetTensor(handle, set_desc, hy, &beta, hcy_layer_offset + set_batch_offset);
}
if(cy != nullptr)
{
SetTensor(handle, set_desc, cy, &beta, hcy_layer_offset + set_batch_offset);
}
}

for(int time_i = seq_len - 1; time_i >= 0; time_i--)
{
auto copy_batch = (time_i == seq_len - 1) ? in_n.at(time_i)
@@ -2879,86 +2899,89 @@ void RNNDescriptor::RNNForwardTrainingPackedTensors(
}
else
{
sp_size[1] = batch_n - in_n.at(0);
sp_size[2] = wei_len;
sp_desc = miopen::TensorDescriptor(wDesc.GetType(), sp_size, sp_stride);
w_size[1] = 1;
w_size[2] = wei_len;
w_desc = miopen::TensorDescriptor(wDesc.GetType(), w_size, w_stride);
if(batch_n - in_n.at(0) > 0)
{
sp_size[1] = batch_n - in_n.at(0);
sp_size[2] = wei_len;
sp_desc = miopen::TensorDescriptor(wDesc.GetType(), sp_size, sp_stride);
w_size[1] = 1;
w_size[2] = wei_len;
w_desc = miopen::TensorDescriptor(wDesc.GetType(), w_size, w_stride);

OpTensor(handle,
miopenTensorOpAdd,
&alpha0,
sp_desc,
reserveSpace,
&alpha1,
w_desc,
w,
&beta_t,
sp_desc,
reserveSpace,
hid_shift + in_n.at(0) * hy_stride,
wei_shift_bias_temp,
hid_shift + in_n.at(0) * hy_stride,
true);
// Update time
profileRNNkernels(handle, 1, ctime);
OpTensor(handle,
miopenTensorOpAdd,
&alpha0,
sp_desc,
reserveSpace,
&alpha1,
w_desc,
w,
&beta_t,
sp_desc,
reserveSpace,
hid_shift + in_n.at(0) * hy_stride,
wei_shift_bias_temp,
hid_shift + in_n.at(0) * hy_stride,
true);
// Update time
profileRNNkernels(handle, 1, ctime);

if(dirMode != 0u)
{
if(in_n.at(0) == in_n.at(seqLen - 1))
{
OpTensor(handle,
miopenTensorOpAdd,
&alpha0,
sp_desc,
reserveSpace,
&alpha1,
w_desc,
w,
&beta_t,
sp_desc,
reserveSpace,
hid_shift + wei_len,
wei_shift_bias_temp + wei_len,
hid_shift + wei_len,
true);
// Update time
profileRNNkernels(handle, 1, ctime);
}
else
if(dirMode != 0u)
{
int cur_batch = 0;
for(int ti = 0; ti < seqLen; ti++)
if(in_n.at(0) == in_n.at(seqLen - 1))
{
if(ti != (seqLen - 1))
OpTensor(handle,
miopenTensorOpAdd,
&alpha0,
sp_desc,
reserveSpace,
&alpha1,
w_desc,
w,
&beta_t,
sp_desc,
reserveSpace,
hid_shift + wei_len,
wei_shift_bias_temp + wei_len,
hid_shift + wei_len,
true);
// Update time
profileRNNkernels(handle, 1, ctime);
}
else
{
int cur_batch = 0;
for(int ti = 0; ti < seqLen; ti++)
{
offset = hid_shift + cur_batch * hy_stride;
if(ti != (seqLen - 1))
{
offset = hid_shift + cur_batch * hy_stride;

sp_size[1] = in_n.at(ti + 1);
sp_size[2] = wei_len;
sp_desc =
miopen::TensorDescriptor(wDesc.GetType(), sp_size, sp_stride);
sp_size[1] = in_n.at(ti + 1);
sp_size[2] = wei_len;
sp_desc = miopen::TensorDescriptor(
wDesc.GetType(), sp_size, sp_stride);

OpTensor(handle,
miopenTensorOpAdd,
&alpha0,
sp_desc,
reserveSpace,
&alpha1,
w_desc,
w,
&beta_t,
sp_desc,
reserveSpace,
static_cast<int>(offset) + wei_len,
wei_shift_bias_temp + wei_len,
static_cast<int>(offset) + wei_len,
true);
// Update time
profileRNNkernels(handle, 1, ctime);
OpTensor(handle,
miopenTensorOpAdd,
&alpha0,
sp_desc,
reserveSpace,
&alpha1,
w_desc,
w,
&beta_t,
sp_desc,
reserveSpace,
static_cast<int>(offset) + wei_len,
wei_shift_bias_temp + wei_len,
static_cast<int>(offset) + wei_len,
true);
// Update time
profileRNNkernels(handle, 1, ctime);
}
cur_batch += in_n.at(ti);
}
cur_batch += in_n.at(ti);
}
}
}
@@ -5374,18 +5397,17 @@ void RNNDescriptor::RNNBackwardDataPackedTensors(
// dinput
if(inputMode == miopenRNNskip)
{
sp_size[1] = batch_n;
sp_size[2] = hy_h;
x_size[1] = batch_n;
x_size[2] = hy_h;
x_desc = miopen::TensorDescriptor(rnn_data_type, x_size, x_stride);
sp_desc = miopen::TensorDescriptor(rnn_data_type, sp_size, sp_stride);
const std::vector<int> dx_size{1, batch_n, hy_h};
x_desc = miopen::TensorDescriptor(rnn_data_type, dx_size, x_stride);
sp_desc = miopen::TensorDescriptor(rnn_data_type, dx_size, sp_stride);

alpha0 = 1;
alpha1 = 1;
beta_t = 0;

for(int gi = 0; gi < nHiddenTensorsPerLayer * bi; gi++)
CopyTensor(handle, sp_desc, workSpace, x_desc, dx, 0, 0, true);
profileRNNkernels(handle, 1, ctime);
for(int gi = 1; gi < nHiddenTensorsPerLayer * bi; gi++)
{
OpTensor(handle,
miopenTensorOpAdd,
4 changes: 4 additions & 0 deletions src/rnn/rnn_util.cpp
@@ -317,6 +317,10 @@ void RNNTensorBaseLayoutConverter::ChangeTensorGPUDataPadding(
const std::vector<size_t> packed_stride =
get_packed_stride(copy_size, tensor_desc.GetLayoutVector());

// Nothing to copy, avoiding error with zero lens in TensorDescriptor
if(!std::all_of(copy_size.cbegin(), copy_size.cend(), [](size_t x) { return x > 0; }))
continue;

const auto packed_desc =
miopen::TensorDescriptor(tensor_desc.GetType(), copy_size, packed_stride);
const auto padded_desc =
