Update documentation for Arm Compute Library #22232

Open. Wants to merge 1 commit into base: gh-pages
77 changes: 13 additions & 64 deletions docs/build/eps.md
@@ -396,75 +396,24 @@ The DirectML execution provider supports building for both x64 and x86 architect

---

## ARM Compute Library
## Arm Compute Library
See more information on the ACL Execution Provider [here](../execution-providers/community-maintained/ACL-ExecutionProvider.md).

### Prerequisites
{: .no_toc }

* Supported backend: i.MX8QM Armv8 CPUs
* Supported BSP: i.MX8QM BSP
* Install i.MX8QM BSP: `source fsl-imx-xwayland-glibc-x86_64-fsl-image-qt5-aarch64-toolchain-4*.sh`
* Set up the build environment
```
source /opt/fsl-imx-xwayland/4.*/environment-setup-aarch64-poky-linux
alias cmake="/usr/bin/cmake -DCMAKE_TOOLCHAIN_FILE=$OECORE_NATIVE_SYSROOT/usr/share/cmake/OEToolchainConfig.cmake"
```
* See [Build ARM](inferencing.md#arm) below for information on building for ARM devices

### Build Instructions
{: .no_toc }

1. Configure ONNX Runtime with ACL support:
```
cmake ../onnxruntime-arm-upstream/cmake -DONNX_CUSTOM_PROTOC_EXECUTABLE=/usr/bin/protoc -Donnxruntime_RUN_ONNX_TESTS=OFF -Donnxruntime_GENERATE_TEST_REPORTS=ON -Donnxruntime_DEV_MODE=ON -DPYTHON_EXECUTABLE=/usr/bin/python3 -Donnxruntime_USE_CUDA=OFF -Donnxruntime_USE_NSYNC=OFF -Donnxruntime_CUDNN_HOME= -Donnxruntime_USE_JEMALLOC=OFF -Donnxruntime_ENABLE_PYTHON=OFF -Donnxruntime_BUILD_CSHARP=OFF -Donnxruntime_BUILD_SHARED_LIB=ON -Donnxruntime_USE_EIGEN_FOR_BLAS=ON -Donnxruntime_USE_OPENBLAS=OFF -Donnxruntime_USE_ACL=ON -Donnxruntime_USE_DNNL=OFF -Donnxruntime_USE_MKLML=OFF -Donnxruntime_USE_OPENMP=ON -Donnxruntime_USE_TVM=OFF -Donnxruntime_USE_LLVM=OFF -Donnxruntime_ENABLE_MICROSOFT_INTERNAL=OFF -Donnxruntime_USE_BRAINSLICE=OFF -Donnxruntime_USE_EIGEN_THREADPOOL=OFF -Donnxruntime_BUILD_UNIT_TESTS=ON -DCMAKE_BUILD_TYPE=RelWithDebInfo
```
The ```-Donnxruntime_USE_ACL=ON``` option will use, by default, the 19.05 version of the Arm Compute Library. To set the right version you can use:
```-Donnxruntime_USE_ACL_1902=ON```, ```-Donnxruntime_USE_ACL_1905=ON```, ```-Donnxruntime_USE_ACL_1908=ON``` or ```-Donnxruntime_USE_ACL_2002=ON```;

To use a library outside the normal environment you can set a custom path by using the ```-Donnxruntime_ACL_HOME``` and ```-Donnxruntime_ACL_LIBS``` tags, which define the path to the ComputeLibrary directory and the build directory respectively.
You must first build Arm Compute Library 24.07 for your platform as described in the [documentation](https://github.com/ARM-software/ComputeLibrary).
See [here](inferencing.md#arm) for information on building for Arm®-based devices.

```-Donnxruntime_ACL_HOME=/path/to/ComputeLibrary```, ```-Donnxruntime_ACL_LIBS=/path/to/build```
Add the following options to `build.sh` to enable the ACL Execution Provider:
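A full invocation might look like the following sketch (the paths are illustrative and should point at your own Compute Library checkout and its build output):

```bash
# Sketch only: adjust the paths to your Compute Library checkout and build directory
./build.sh --use_acl --acl_home=/path/to/ComputeLibrary --acl_libs=/path/to/ComputeLibrary/build
```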


2. Build ONNX Runtime library, test and performance application:
```
make -j 6
```

3. Deploy ONNX runtime on the i.MX 8QM board
```
libonnxruntime.so.0.5.0
onnxruntime_perf_test
onnxruntime_test_all
--use_acl --acl_home=/path/to/ComputeLibrary --acl_libs=/path/to/ComputeLibrary/build
```

### Native Build Instructions
{: .no_toc }

*Validated on Jetson Nano and Jetson Xavier*

1. Build ACL Library (skip if already built)

```bash
cd ~
git clone -b v20.02 https://github.com/Arm-software/ComputeLibrary.git
cd ComputeLibrary
sudo apt-get install -y scons g++-arm-linux-gnueabihf
scons -j8 arch=arm64-v8a Werror=1 debug=0 asserts=0 neon=1 opencl=1 examples=1 build=native
```

1. CMake is needed to build ONNX Runtime. Because the minimum required version is 3.13, it is necessary to build CMake from source. Download the Unix/Linux sources from https://cmake.org/download/ and follow https://cmake.org/install/ to build from source. Versions 3.17.5 and 3.18.4 have been tested on Jetson; a sketch is shown below.
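A from-source build might look roughly like the following sketch (the version number is illustrative; any release at or above the minimum should work):

```bash
# Illustrative sketch: build and install CMake 3.18.4 from source
wget https://github.com/Kitware/CMake/releases/download/v3.18.4/cmake-3.18.4.tar.gz
tar xzf cmake-3.18.4.tar.gz
cd cmake-3.18.4
./bootstrap && make -j"$(nproc)" && sudo make install
```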

1. Build onnxruntime with the --use_acl flag, using one of the supported ACL version flags (ACL_1902 | ACL_1905 | ACL_1908 | ACL_2002), for example as sketched below.
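For example (sketch only; the extra options are illustrative and the exact flag syntax can vary between ONNX Runtime releases, so check `./build.sh --help`):

```bash
./build.sh --use_acl ACL_2002 --config Release --build_shared_lib --parallel
```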

---

## ArmNN
## Arm NN

See more information on the ArmNN Execution Provider [here](../execution-providers/community-maintained/ArmNN-ExecutionProvider.md).
See more information on the Arm NN Execution Provider [here](../execution-providers/community-maintained/ArmNN-ExecutionProvider.md).

### Prerequisites
{: .no_toc }
@@ -480,7 +429,7 @@ source /opt/fsl-imx-xwayland/4.*/environment-setup-aarch64-poky-linux
alias cmake="/usr/bin/cmake -DCMAKE_TOOLCHAIN_FILE=$OECORE_NATIVE_SYSROOT/usr/share/cmake/OEToolchainConfig.cmake"
```

* See [Build ARM](inferencing.md#arm) below for information on building for ARM devices
* See [here](inferencing.md#arm) for information on building for Arm-based devices

### Build Instructions
{: .no_toc }
@@ -490,20 +439,20 @@ alias cmake="/usr/bin/cmake -DCMAKE_TOOLCHAIN_FILE=$OECORE_NATIVE_SYSROOT/usr/sh
./build.sh --use_armnn
```

The Relu operator is set by default to use the CPU execution provider for better performance. To use the ArmNN implementation build with --armnn_relu flag
The Relu operator is set by default to use the CPU execution provider for better performance. To use the Arm NN implementation, build with the --armnn_relu flag:

```bash
./build.sh --use_armnn --armnn_relu
```

The Batch Normalization operator is set by default to use the CPU execution provider. To use the ArmNN implementation build with --armnn_bn flag
The Batch Normalization operator is set by default to use the CPU execution provider. To use the Arm NN implementation, build with the --armnn_bn flag:

```bash
./build.sh --use_armnn --armnn_bn
```

To use a library outside the normal environment you can set a custom path by providing the --armnn_home and --armnn_libs parameters to define the path to the ArmNN home directory and build directory respectively.
The ARM Compute Library home directory and build directory must also be available, and can be specified if needed using --acl_home and --acl_libs respectively.
To use a library outside the normal environment you can set a custom path by providing the --armnn_home and --armnn_libs parameters to define the path to the Arm NN home directory and build directory respectively.
The Arm Compute Library home directory and build directory must also be available, and can be specified if needed using --acl_home and --acl_libs respectively.

```bash
./build.sh --use_armnn --armnn_home /path/to/armnn --armnn_libs /path/to/armnn/build --acl_home /path/to/ComputeLibrary --acl_libs /path/to/acl/build
@@ -519,7 +468,7 @@ See more information on the RKNPU Execution Provider [here](../execution-provide


* Supported platform: RK1808 Linux
* See [Build ARM](inferencing.md#arm) below for information on building for ARM devices
* See [here](inferencing.md#arm) for information on building for Arm-based devices
* Use gcc-linaro-6.3.1-2017.05-x86_64_aarch64-linux-gnu instead of gcc-linaro-6.3.1-2017.05-x86_64_arm-linux-gnueabihf, and modify CMAKE_CXX_COMPILER & CMAKE_C_COMPILER in tool.cmake:

```
23 changes: 12 additions & 11 deletions docs/build/inferencing.md
@@ -88,7 +88,8 @@ If you would like to use [Xcode](https://developer.apple.com/xcode/) to build th

Without this flag, the cmake build generator defaults to Unix Makefiles.

Today, Mac computers are either Intel-Based or Apple silicon(aka. ARM) based. By default, ONNX Runtime's build script only generate bits for the CPU ARCH that the build machine has. If you want to do cross-compiling: generate ARM binaries on a Intel-Based Mac computer, or generate x86 binaries on a Mac ARM computer, you can set the "CMAKE_OSX_ARCHITECTURES" cmake variable. For example:
Today, Mac computers are either Intel-based or Apple silicon-based. By default, ONNX Runtime's build script only generates binaries for the CPU architecture of the build machine. If you want to cross-compile, that is, generate arm64 binaries on an Intel-based Mac or x86 binaries on a Mac with Apple silicon, you can set the "CMAKE_OSX_ARCHITECTURES" cmake variable. For example:

Build for Intel CPUs:
```bash
@@ -311,21 +312,21 @@ ORT_DEBUG_NODE_IO_DUMP_DATA_TO_FILES=1
```


### ARM
### Arm

There are a few options for building ONNX Runtime for ARM.
There are a few options for building ONNX Runtime for Arm®-based devices.

First, you may do it on a real ARM device, or on a x86_64 device with an emulator(like qemu), or on a x86_64 device with a docker container with an emulator(you can run an ARM container on a x86_64 PC). Then the build instructions are essentially the same as the instructions for Linux x86_64. However, it wouldn't work if your the CPU you are targeting is not 64-bit since the build process needs more than 2GB memory.
First, you may do it on a real Arm-based device, on an x86_64 device with an emulator (like qemu), or on an x86_64 device with a docker container running an emulator (you can run an Arm-based container on an x86_64 PC). In those cases the build instructions are essentially the same as the instructions for Linux x86_64. However, it won't work if the CPU you are targeting is not 64-bit, since the build process needs more than 2GB of memory.

* [Cross compiling for ARM with simulation (Linux/Windows)](#cross-compiling-for-arm-with-simulation-linuxwindows) - **Recommended**; Easy, slow, ARM64 only(no support for ARM32)
* [Cross compiling for Arm-based devices with simulation (Linux/Windows)](#cross-compiling-for-arm-with-simulation-linuxwindows) - **Recommended**; easy, slow, ARM64 only (no support for ARM32)
* [Cross compiling on Linux](#cross-compiling-on-linux) - Difficult, fast
* [Cross compiling on Windows](#cross-compiling-on-windows)

#### Cross compiling for ARM with simulation (Linux/Windows)
#### Cross compiling for Arm-based devices with simulation (Linux/Windows)

*EASY, SLOW, RECOMMENDED*

This method relies on qemu user mode emulation. It allows you to compile using a desktop or cloud VM through instruction level simulation. You'll run the build on x86 CPU and translate every ARM instruction to x86. This is much faster than compiling natively on a low-end ARM device. The resulting ONNX Runtime Python wheel (.whl) file is then deployed to an ARM device where it can be invoked in Python 3 scripts. The build process can take hours, and may run of memory if the target CPU is 32-bit.
This method relies on qemu user-mode emulation. It allows you to compile using a desktop or cloud VM through instruction-level simulation. You'll run the build on an x86 CPU and translate every Arm architecture instruction to x86. This is potentially much faster than compiling natively on a low-end device. The resulting ONNX Runtime Python wheel (.whl) file is then deployed to an Arm-based device where it can be invoked in Python 3 scripts. The build process can take hours, and may run out of memory if the target CPU is 32-bit.
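One way to set this up is Docker combined with qemu user-mode emulation. The sketch below is illustrative only: the binfmt helper image, the build image name, and the build.sh flags are assumptions rather than part of the official instructions.

```bash
# One-time setup on the x86_64 host: register qemu user-mode emulators for foreign architectures
docker run --privileged --rm tonistiigi/binfmt --install arm64

# Run the build inside an arm64 container; every Arm instruction is emulated on the x86 host.
# "my-arm64-build-image" is a hypothetical image that already contains the build
# prerequisites (compilers, CMake, Python).
docker run --rm --platform linux/arm64 -v "$PWD":/onnxruntime -w /onnxruntime my-arm64-build-image \
  bash -c "./build.sh --config Release --build_wheel --parallel --skip_tests"
```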

#### Cross compiling on Linux

@@ -364,12 +365,12 @@ This option is very fast and allows the package to be built in minutes, but is c

You must also know what kind of flags your target hardware needs, which can differ greatly. For example, if you take a generic ARMv7 compiler and use it for Raspberry Pi V1 directly, it won't work because the Raspberry Pi V1 only supports ARMv6. Generally every hardware vendor will provide a toolchain; check how that one was built.

A target env is identifed by:
A target env is identified by:

* Arch: x86_32, x86_64, armv6, armv7, armv7l, aarch64, ...
* OS: bare-metal or linux.
* Libc: gnu libc/ulibc/musl/...
* ABI: ARM has mutilple ABIs like eabi, eabihf...
* ABI: Arm has multiple ABIs like eabi, eabihf...

You can get all of this information from the previous output; please make sure it is all correct.
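For instance, a cross-compiler's target triple usually encodes most of this information (the toolchain names below are illustrative):

```bash
$ aarch64-linux-gnu-gcc -dumpmachine
aarch64-linux-gnu
$ arm-linux-gnueabihf-gcc -dumpmachine
arm-linux-gnueabihf
```

Here the second triple indicates a 32-bit Arm target on Linux with glibc and the hard-float EABI.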

@@ -528,8 +529,8 @@ This option is very fast and allows the package to be built in minutes, but is c

**Using Visual C++ compilers**

1. Download and install Visual C++ compilers and libraries for ARM(64).
If you have Visual Studio installed, please use the Visual Studio Installer (look under the section `Individual components` after choosing to `modify` Visual Studio) to download and install the corresponding ARM(64) compilers and libraries.
1. Download and install Visual C++ compilers and libraries for Arm(64).
If you have Visual Studio installed, please use the Visual Studio Installer (look under the section `Individual components` after choosing to `modify` Visual Studio) to download and install the corresponding Arm(64) compilers and libraries.

2. Use `.\build.bat` and specify `--arm` or `--arm64` as the build option to start building, for example as shown below. Preferably use `Developer Command Prompt for VS`, or make sure all the installed cross-compilers can be found from the command prompt being used to build, via the PATH environment variable.
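For example, a cross-compile targeting Arm64 might be started as follows (sketch; the remaining flags depend on your configuration):

```
.\build.bat --arm64 --config Release --parallel
```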

6 changes: 3 additions & 3 deletions docs/execution-providers/Vitis-AI-ExecutionProvider.md
@@ -27,9 +27,9 @@ The following table lists AMD targets that are supported by the Vitis AI ONNX Ru
| **Architecture** | **Family** | **Supported Targets** | **Supported OS** |
|---------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------|
| AMD64 | Ryzen AI | AMD Ryzen 7040U, 7040HS | Windows |
| ARM64 Cortex-A53 | Zynq UltraScale+ MPSoC | ZCU102, ZCU104, KV260 | Linux |
| ARM64 Cortex-A72 | Versal AI Core / Premium | VCK190 | Linux |
| ARM64 Cortex-A72 | Versal AI Edge | VEK280 | Linux |
| Arm® Cortex®-A53 | Zynq UltraScale+ MPSoC | ZCU102, ZCU104, KV260 | Linux |
| Arm® Cortex®-A72 | Versal AI Core / Premium | VCK190 | Linux |
| Arm® Cortex®-A72 | Versal AI Edge | VEK280 | Linux |


AMD Adaptable SoC developers can also leverage the Vitis AI ONNX Runtime Execution Provider to support custom (chip-down) designs.
2 changes: 1 addition & 1 deletion docs/execution-providers/Xnnpack-ExecutionProvider.md
@@ -8,7 +8,7 @@ nav_order: 9

# XNNPACK Execution Provider

Accelerate ONNX models on Android/iOS devices and WebAssembly with ONNX Runtime and the XNNPACK execution provider. [XNNPACK](https://github.com/google/XNNPACK) is a highly optimized library of floating-point neural network inference operators for ARM, WebAssembly, and x86 platforms.
Accelerate ONNX models on Android/iOS devices and WebAssembly with ONNX Runtime and the XNNPACK execution provider. [XNNPACK](https://github.com/google/XNNPACK) is a highly optimized library of floating-point neural network inference operators for Arm®-based, WebAssembly, and x86 platforms.

## Contents
{: .no_toc }
docs/execution-providers/community-maintained/ACL-ExecutionProvider.md
@@ -10,14 +10,7 @@ redirect_from: /docs/reference/execution-providers/ACL-ExecutionProvider
# ACL Execution Provider
{: .no_toc }

The integration of ACL as an execution provider (EP) into ONNX Runtime accelerates performance of ONNX model workloads across Armv8 cores. [Arm Compute Library](https://github.com/ARM-software/ComputeLibrary){:target="_blank"} is an open source inference engine maintained by Arm and Linaro companies.


## Contents
{: .no_toc }

* TOC placeholder
{:toc}
The ACL Execution Provider enables accelerated performance on Arm®-based CPUs through [Arm Compute Library](https://github.com/ARM-software/ComputeLibrary){:target="_blank"}.


## Build
@@ -30,10 +23,44 @@ For build instructions, please see the [build page](../../build/eps.md#arm-compu
```
Ort::Env env = Ort::Env{ORT_LOGGING_LEVEL_ERROR, "Default"};
Ort::SessionOptions sf;
bool enable_cpu_mem_arena = true;
Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_ACL(sf, enable_cpu_mem_arena));
bool enable_fast_math = true;
Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_ACL(sf, enable_fast_math));
```
The C API details are [here](../../get-started/with-c.html).

### Python
{: .no_toc }

```
import onnxruntime

providers = [("ACLExecutionProvider", {"enable_fast_math": "true"})]
sess = onnxruntime.InferenceSession("model.onnx", providers=providers)
```

## Performance Tuning
When/if using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest){:target="_blank"}, use the flag -e acl
Arm Compute Library has a fast math mode that can increase performance with some potential decrease in accuracy for MatMul and Conv operators. It is disabled by default.

When using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest){:target="_blank"}, use the flag `-e acl` to enable the ACL Execution Provider. You can additionally use `-i 'enable_fast_math|true'` to enable fast math.

Arm Compute Library uses the ONNX Runtime intra-operator thread pool when running via the execution provider. You can control the size of this thread pool using the `-x` option.
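Putting these together, an invocation might look like the following sketch (the model path and thread count are illustrative):

```bash
# Sketch: ACL EP with fast math enabled and 4 intra-op threads
./onnxruntime_perf_test -e acl -i 'enable_fast_math|true' -x 4 /path/to/model.onnx
```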

## Supported Operators

|Operator|Supported types|
|---|---|
|AveragePool|float|
|BatchNormalization|float|
|Concat|float|
|Conv|float, float16|
|FusedConv|float|
|FusedMatMul|float, float16|
|Gemm|float|
|GlobalAveragePool|float|
|GlobalMaxPool|float|
|MatMul|float, float16|
|MatMulIntegerToFloat|uint8, int8, uint8+int8|
|MaxPool|float|
|NhwcConv|float|
|Relu|float|
|QLinearConv|uint8, int8, uint8+int8|
docs/execution-providers/community-maintained/ArmNN-ExecutionProvider.md
@@ -7,7 +7,7 @@ nav_order: 2
redirect_from: /docs/reference/execution-providers/ArmNN-ExecutionProvider
---

# ArmNN Execution Provider
# Arm NN Execution Provider
{: .no_toc}

## Contents
@@ -16,14 +16,14 @@ redirect_from: /docs/reference/execution-providers/ArmNN-ExecutionProvider
* TOC placeholder
{:toc}

Accelerate performance of ONNX model workloads across Armv8 cores with the ArmNN execution provider. [ArmNN](https://github.com/ARM-software/armnn) is an open source inference engine maintained by Arm and Linaro companies.
Accelerate performance of ONNX model workloads across Arm®-based devices with the Arm NN execution provider. [Arm NN](https://github.com/ARM-software/armnn) is an open source inference engine maintained by Arm and Linaro.

## Build
For build instructions, please see the [BUILD page](../../build/eps.md#armnn).

## Usage
### C/C++
To use ArmNN as execution provider for inferencing, please register it as below.
To use Arm NN as an execution provider for inferencing, register it as shown below.
```
Ort::Env env = Ort::Env{ORT_LOGGING_LEVEL_ERROR, "Default"};
Ort::SessionOptions so;