Releases: ROCm/aomp
rocm-6.3.1
ROCm release v6.3.1
AOMP Release 20.0-1
These are the release notes for AOMP 20.0-1. AOMP uses AMD developer modifications to the upstream LLVM development trunk. These differences are managed in a branch called the "amd-staging". This branch is found in a mirror of upstream LLVM found at https://github.com/ROCm/llvm-project. The amd-staging branch is constantly changing as it merges the upstream development trunk with its downstream development updates. The AMD modifications are experimental while under review for the upstream trunk. AOMP uses a snapshot of amd-staging at the commit ids and dates listed below. AOMP also includes builds of related ROCm components. We call AOMP a "standalone" build as it does not use or require ROCm with the exception of the kernel module (amdgpu-dkms) and libdrm which are often part of the Linux distribution. AOMP is isolated from any ROCm installations by installing into /usr/lib/aomp and the use of RPATH for runtime libraries.
For AOMP 20.0-1, the last LLVM trunk commit is 151901c762b724ef6ffe6f3db163475071e7b215 on December 11, 2024. The last amd-only commit is e82d86c7c81631754d1af5cb72ceef2385d215e3 on December 12, 2024. These commits form a frozen branch now called "aomp-20.0-1". See https://github.com/ROCm/llvm-project/tree/aomp-20.0-1.
The integrated ROCm components for this AOMP release were built with ROCm 6.3.0 sources.
This is the 2nd AOMP release based on upstream LLVM 20 development.
While Linux distros usually have the amdgpu kernel module, we strongly recommend using the ROCm 6.3 amdgpu-dkms and amdgpu-dkms-firmware packages, which resolve a long-standing SDMA firmware issue.
In this release of AOMP, we disabled the OpenMP workaround of the SDMA firmware issue. The OpenMP workaround for the SDMA issue was to not chain automatic asynchronous data transfers to the kernel completion signal. The workaround synchronously initiated data transfers after kernel completion was detected by the host CPU. This resulted in some loss of performance.
The environment variable LIBOMPTARGET_SYNC_COPY_BACK is the trigger to use the workaround. Before AOMP 20.0-1 it had a default value of true to force synchronous copy backs. In this release we set the default to false which will improve performance for kernels with lots of return maps. But if your machine does not have the ROCm 6.3 firmware, you should set LIBOMPTARGET_SYNC_COPY_BACK=true to avoid potential errors.
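As a minimal sketch of the guidance above, a launcher script could set the variable only when the ROCm 6.3 firmware is missing. The helper name `offload_env` and the `has_rocm63_firmware` flag are hypothetical; how the firmware is detected is left to the caller and is not specified by these notes.

```python
import os


def offload_env(has_rocm63_firmware, base_env=None):
    """Return a copy of the environment suitable for launching an offload app.

    Hypothetical helper: if the ROCm 6.3 firmware is absent, force the
    synchronous copy-back workaround; otherwise leave the variable untouched
    so the release default (false in AOMP 20.0-1) applies.
    """
    env = dict(os.environ if base_env is None else base_env)
    if not has_rocm63_firmware:
        env["LIBOMPTARGET_SYNC_COPY_BACK"] = "true"
    return env


# Example use (app_exec is a placeholder binary name):
#   import subprocess
#   subprocess.run(["./app_exec"], env=offload_env(has_rocm63_firmware=False))
```

Leaving the variable unset on machines with the new firmware keeps the faster asynchronous copy-back path described above.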
Changes since AOMP 20.0-0:
- Changed default LIBOMPTARGET_SYNC_COPY_BACK=false
- Dropped support for CentOS 7/8/9, Ubuntu 20.04, SLES15-SP4
- Added support for RHEL 8/9, Ubuntu 24.04, SLES15-SP5
- Updated to ROCm 6.3 sources
- Added a new component, SPIRV-LLVM-Translator. This is initial support for SPIR-V JIT offloading. It includes a SPIR-V to LLVM IR translation tool installed in the compiler bin directory as lib/llvm/bin/amd-llvm-spirv. Toolchain support for SPIR-V is still in development.
- Added a new release file showing the summary of relevant git commits since the last release. See llvm-project-20-0-1-gitlog-summary.txt
- Upgraded cmake to 3.25.2
- Changed the commands for OpenMP offload linking to use the clang-linker-wrapper command. The old method was a set of intermediate commands that passed files between the various steps of the heterogeneous linking process. The default command line option before 20.0-1 was --opaque-offload-linker. The default is now --no-opaque-offload-linker. While both methods performed similar GPU linking, IR optimizations, and backend steps, there were minor differences in the final offloading image that caused issues; these have been resolved. One can still see the commands from the old method with the command line options "-v -save-temps --opaque-offload-linker".
- Corrected the installation lib-debug directories to contain debug builds of various runtime libraries. The sources of all debug runtimes are also installed so that gdbtui will automatically find the sources.
- Merged roct and rocr into a single aomp build component.
- Renamed the flang-legacy binary to **flang-classic**, as it is better known by the flang community. Yes, this will be deprecated in the future in favor of the new LLVM flang. Currently "flang" is a symbolic link to the flang-classic binary.
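To illustrate the offload-linking change in the list above, here is a small sketch that assembles a compile command and optionally appends the options quoted in the notes for viewing the old linking steps. The function name, the compiler path /usr/lib/aomp/bin/clang, the output name app_exec, and the example architecture are assumptions for illustration; `-fopenmp` and `--offload-arch` are standard clang options.

```python
def offload_compile_command(source, gpu_arch, show_old_link_steps=False,
                            clang="/usr/lib/aomp/bin/clang"):
    """Build an argv list for an OpenMP offload compile (illustrative only).

    The clang path is an assumed location under the AOMP install directory.
    """
    cmd = [clang, "-fopenmp", f"--offload-arch={gpu_arch}", source, "-o", "app_exec"]
    if show_old_link_steps:
        # Options quoted in the release notes for inspecting the old method.
        cmd += ["-v", "-save-temps", "--opaque-offload-linker"]
    return cmd


# Example: print the command instead of running it.
print(" ".join(offload_compile_command("vecadd.c", "gfx90a", show_old_link_steps=True)))
```

Without the extra options, the command uses the release default, --no-opaque-offload-linker.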
Errata:
- Potential data corruption as a result of an SDMA issue when AOMP generated binaries are run without the ROCm 6.3 amdgpu-dkms-firmware. Set LIBOMPTARGET_SYNC_COPY_BACK=true to avoid this problem with OpenMP.
- THIS RELEASE CANNOT BE BUILT FROM SOURCE EXTERNALLY. This is because there is a new AMD repository that is not yet available. In the next release this repository will be made public and put in the aomp manifest for cloning to support source build of aomp.
rocm-6.3.0
ROCm release v6.3.0
rocm-6.2.4
ROCm release v6.2.4
AOMP Release 20.0-0
These are the release notes for AOMP 20.0-0. AOMP uses AMD developer modifications to the upstream LLVM development trunk. These differences are managed in a branch called the "amd-staging". This branch is found in a mirror of upstream LLVM found at https://github.com/ROCm/llvm-project. The amd-staging branch is constantly changing as it merges the upstream development trunk with its downstream development updates. The AMD modifications are experimental while under review for the upstream trunk. AOMP uses a snapshot of amd-staging at the commit ids and dates listed below. AOMP also includes builds of related ROCm components. We call AOMP a "standalone" build as it does not use or require ROCm with the exception of the kernel module (dkms) and libdrm which are often part of the Linux distribution. AOMP is isolated from any ROCm installations by installing into /usr/lib/aomp and the use of RPATH for runtime libraries.
For AOMP 20.0-0, the last LLVM trunk commit is 7fa0d05a04056aac4365c69c4b515f613a43e454 on October 8, 2024. The last amd-only commit is 5809bc885c815fa281320094be6549458e15cf14 on October 10, 2024. These commits form a frozen branch now called "aomp-20.0-0". See https://github.com/ROCm/llvm-project/tree/aomp-20.0-0.
The integrated ROCm components for this AOMP release were built with ROCm 6.2.2 sources.
This is the 1st AOMP release based on upstream LLVM 20 development.
Changes since AOMP 19.0-3:
- Switched to ROCm 6.2.2 sources. This introduced a new component called rocprofiler-register.
- Move the install of llvm to lib/llvm, which is where ROCm installs llvm.
- AOMP now creates and uses rocm.cfg, clang.cfg, clang++.cfg, etc.
- Add support for multiple devices (-md option) to gpurun utility.
- AOMP example updates:
  - Use a common include file to set LLVM_INSTALL_DIR and LLVM_GPU_ARCH using amdgpu-arch and nvptx-arch.
  - Remove mygpu dependency from every example.
  - Create a new category, stress, for complex examples not in CI.
  - Build Kokkos with a makefile instead of a script.
- Added build support for gfx90c, gfx1103, gfx1150, gfx1151, and gfx1152.
- Add ROCm SMI and AMD SMI as AOMP components.
Errata for AOMP 20.0-0:
- amdflang-new symbolic link should not exist as there is no flang-new binary.
rocm-6.2.2
ROCm release v6.2.2
rocm-6.2.1
ROCm release v6.2.1
rocm-6.2.0
ROCm release v6.2.0
AOMP Release 19.0-3
These are the release notes for AOMP 19.0-3. AOMP uses AMD developer modifications to the upstream LLVM development trunk. These differences are managed in a branch called the "amd-staging". This branch is found in a mirror of upstream LLVM found at https://github.com/ROCm/llvm-project. The amd-staging branch is constantly changing as it merges the upstream development trunk with its downstream development updates. The AMD modifications are experimental while under review for the upstream trunk. AOMP uses a snapshot of amd-staging at the commit ids and dates listed below. AOMP also includes builds of related ROCm components. We call AOMP a "standalone" build as it does not use or require ROCm with the exception of the kernel module (dkms) and libdrm which are often part of the Linux distribution. AOMP is isolated from any ROCm installations by installing into /usr/lib/aomp and the use of RPATH for runtime libraries.
For AOMP 19.0-3, the last LLVM trunk commit is 40954d7f9bb38b2407fe48a524befc5216f13ccc on July 22, 2024. This was the last trunk commit before the trunk forked to LLVM-20. The last amd-only commit is baa883c3ad5d70e1f4da5b6f80f6d06c00b73c3a on July 22, 2024. These commits form a frozen branch now called "aomp-19.0-3". See https://github.com/ROCm/llvm-project/tree/aomp-19.0-3.
The integrated ROCm components for this AOMP release were built with ROCm 6.1.2 sources.
This is the 3rd AOMP release based on upstream LLVM 19 development. Since the LLVM trunk has moved to development of LLVM 20, the next AOMP release will be based on LLVM-20.
Changes since AOMP 19.0-2:
- Support for requires atomic default mem order clause was added.
- OMPT no longer falls back into synchronous execution mode when profiler is attached.
- OMPT now supports callbacks for omp_target_associate_ptr and omp_target_disassociate_ptr.
- Xteam Reduction enabled by default at all opt levels.
- Some HIP interoperability issues with tracking HIP memory allocations on MI200 were resolved.
- Remove deprecated utility offload-arch. This was replaced with amdgpu-arch or nvptx-arch.
AOMP Release 19.0-2
These are the release notes for AOMP 19.0-2. AOMP uses AMD developer modifications to the upstream LLVM development trunk. These differences are managed in a branch called the "amd-staging". This branch is found in a mirror of upstream LLVM found at https://github.com/ROCm/llvm-project. The amd-staging branch is constantly changing as it merges the upstream development trunk with its downstream development updates. The AMD modifications are experimental while under review for the upstream trunk. AOMP uses a snapshot of amd-staging at the commit ids and dates listed below. AOMP also includes builds of related ROCm components. We call AOMP a "standalone" build as it does not use or require ROCm with the exception of the kernel module (dkms) and libdrm which are often part of the Linux distribution. AOMP is isolated from any ROCm installations by installing into /usr/lib/aomp and the use of RPATH for runtime libraries.
For AOMP 19.0-2, the last trunk commit is 6012de2b4ec24826574fe9f2d74c7d2ff2b52f23 on June 20, 2024. The last amd-only commit is c3a455408b118b8c22f23c7a65d2b5dbf491ab56 on June 20, 2024. These commits form a frozen branch now called "aomp-19.0-2". See https://github.com/ROCm/llvm-project/tree/aomp-19.0-2.
The integrated ROCm components for this AOMP release were built with ROCm 6.1.2 sources.
This is the 2nd AOMP release based on LLVM 19 development.
AOMP 19.0-1 was tagged, but will not be released.
Changes since AOMP 19.0-0:
- Significant runtime features to support zero-copy for CPU-GPU unified shared memory. See subsections below.
- Merge of the LLVM upstream relocation of libomptarget into the high level "offload" directory. This establishes the long term objective of the LLVM community to unify offload support for different offloading programming models.
- The integrated ROCm components (non-compiler) were built from ROCm 6.1.2 sources.
- Significant enhancements to the gpurun utility including: support for multiple devices, heterogeneous devices, malloc control inherited from numactl -m -l options, and CPU core binding to the same NUMA node as the selected GPU. These changes preserve gpurun's ability to oversubscribe (run multiple processes per GPU) by segmenting a GPU's CUs to different processes. Known issue, to be fixed in 19.0-3: gpurun fails in TPX mode on MI300X.
- Changes in runtime library locations unique to CPU target triple including fixes for lib64 in Red Hat package.
- Support for fp16 and bfloat16 reductions
- Removed long deprecated utilities mygpu, mymcpu, aompcc, aompExtractRegion, clang-ocl, and cloc.sh.
Errata for AOMP 19.0-2:
- gpurun fails in TPX mode for MI300X
- LIBOMPTARGET_SYNC_COPY_BACK default is still true. This is to circumvent a long-standing SDMA problem where signal values appear incorrect to SDMA engines.
- Failure in dynamic_module_load which impacts application termination that uses offloading in shared objects.
Implicit Zero-Copy behavior on MI300A
OpenMP provides a relaxed shared memory model. Map clauses provided in the source code indicate how data is used and copied to and from the GPU device for each target region. On GPUs that provide unified shared memory like the MI300A, these clauses are optional but provide portability to discrete memory GPUs. There is an OpenMP pragma called "requires unified_shared_memory" which tells the compiler and runtime that the code is NOT portable to discrete memory GPUs, and must be compiled and executed on GPUs such as the MI300A. The MI300A is one of several AMD GPUs that has a feature, called xnack, to disable/enable page migration between CPU and GPU. In this release of the compiler and runtime, we set the runtime behavior depending on the status of xnack and the existence of the pragma "requires unified_shared_memory".
MI300A | NO requires unified_shared_memory | requires unified_shared_memory |
---|---|---|
XNACK enabled | Implicit Zero-Copy | Zero-Copy |
XNACK disabled | Copy | Runtime warning(*) |
(*) The runtime warning when running an application using #pragma omp requires unified_shared_memory in XNACK disabled mode can be turned into a runtime error by setting the environment variable OMPX_STRICT_SANITY_CHECKS to true (e.g., OMPX_STRICT_SANITY_CHECKS=true ./app_exec).
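The MI300A table and its footnote can be read as a small lookup. The sketch below only mirrors the table; it is not the runtime's implementation, and the function and parameter names are hypothetical.

```python
def mi300a_mapping_mode(xnack_enabled, requires_usm, strict_sanity_checks=False):
    """Return the behavior listed in the MI300A table above.

    requires_usm: the program uses '#pragma omp requires unified_shared_memory'.
    strict_sanity_checks: stands for OMPX_STRICT_SANITY_CHECKS=true, which per
    the footnote turns the runtime warning into a runtime error.
    """
    if xnack_enabled:
        return "Zero-Copy" if requires_usm else "Implicit Zero-Copy"
    if requires_usm:
        return "Runtime error" if strict_sanity_checks else "Runtime warning"
    return "Copy"


# Walk through all four cells of the table.
for xnack in (True, False):
    for usm in (False, True):
        print(xnack, usm, mi300a_mapping_mode(xnack, usm))
```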
Implicit Zero-Copy on MI200 and MI300X and any other discrete GPU:
- On discrete memory GPUs, for applications not using #pragma omp requires unified_shared_memory, turn on implicit zero-copy behavior by running applications in an XNACK enabled environment and setting the environment variable OMPX_APU_MAPS to true (e.g., HSA_XNACK=1 OMPX_APU_MAPS=1 ./app_exec).
- All other configurations, for applications not using #pragma omp requires unified_shared_memory, will be run in copy behavior.
MI200, MI300X, etc. | not unified_shared_memory | unified_shared_memory |
---|---|---|
XNACK enabled and OMPX_APU_MAPS=1 | Implicit Zero-Copy | Zero-Copy |
XNACK enabled | Copy | Zero-Copy |
XNACK disabled | Copy | Runtime warning(*) |
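For discrete GPUs, the same table can be expressed in terms of the environment a process is launched with. This is a sketch under stated assumptions: it treats HSA_XNACK=1 as standing in for "XNACK enabled" (the actual state also depends on the system), it accepts "1" or "true" as true values, and the function names are hypothetical.

```python
def _is_true(value):
    # Assumption: "1" and "true" (any case) count as true.
    return str(value).strip().lower() in ("1", "true")


def discrete_gpu_mapping_mode(env, requires_usm):
    """Return the behavior listed in the MI200/MI300X table above.

    env: mapping of environment variables for the launched process.
    requires_usm: the program uses '#pragma omp requires unified_shared_memory'.
    """
    xnack_enabled = _is_true(env.get("HSA_XNACK", "0"))
    apu_maps = _is_true(env.get("OMPX_APU_MAPS", "0"))
    if requires_usm:
        return "Zero-Copy" if xnack_enabled else "Runtime warning"
    if xnack_enabled and apu_maps:
        return "Implicit Zero-Copy"
    return "Copy"


# Mirrors the example launch line: HSA_XNACK=1 OMPX_APU_MAPS=1 ./app_exec
print(discrete_gpu_mapping_mode({"HSA_XNACK": "1", "OMPX_APU_MAPS": "1"}, False))
```

Note that OMPX_APU_MAPS alone does not change behavior in this reading of the table; without XNACK enabled, applications not using the pragma run in copy behavior.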
MI300A host memory pre-faulting in Zero-Copy modes
On MI300A, host memory TLB prefaulting applies when running in Implicit Zero-Copy mode and when using #pragma omp requires unified_shared_memory:
- By default, for all memory copies with size larger than or equal to 1MB, the OpenMP runtime makes the copied host memory visible to the target device agent before calling the copy function.
- The environment variable LIBOMPTARGET_APU_PREFAULT_MEMCOPY controls this behavior and is set to true by default. Setting it to false disables prefaulting for all memory copy sizes (e.g., disable prefaulting with LIBOMPTARGET_APU_PREFAULT_MEMCOPY=false ./app_exec).
- The environment variable LIBOMPTARGET_APU_PREFAULT_MEMCOPY_SIZE controls the minimum size at which prefaulting is performed. It is currently set to 1MB, meaning that all memory copies of at least that size that are performed in a synchronous way will have the host memory prefaulted first. Changing the minimum size moves this threshold (e.g., to prefault all memory copies of 1KB or larger, run with LIBOMPTARGET_APU_PREFAULT_MEMCOPY_SIZE=1024 ./app_exec).
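The interaction of the two prefault variables can be summarized as a predicate. This sketch only restates the rules above and is not the runtime's code; it assumes that 1MB means 1024 * 1024 bytes (consistent with 1KB being written as 1024), that the size variable is given in bytes, and that "false" or "0" disables prefaulting. The function name is hypothetical.

```python
ONE_MB = 1024 * 1024  # assumed byte value of the documented 1MB default


def will_prefault(copy_size_bytes, env):
    """Return True if a host memory copy of this size would be prefaulted.

    env: mapping of environment variables for the launched process.
    """
    enabled_text = env.get("LIBOMPTARGET_APU_PREFAULT_MEMCOPY", "true")
    enabled = enabled_text.strip().lower() not in ("false", "0")
    min_size = int(env.get("LIBOMPTARGET_APU_PREFAULT_MEMCOPY_SIZE", ONE_MB))
    return enabled and copy_size_bytes >= min_size


# Default settings: a 4MB copy qualifies for prefaulting, a 4KB copy does not.
print(will_prefault(4 * ONE_MB, {}), will_prefault(4096, {}))
```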