Contemporary M1 / M2 / M3 / M4 machines from Apple have (at least) four different ways for low-level programmers to perform heavy computations:

  1. Standard ARMv8 SIMD/NEON vector instructions on CPU cores (128 bits wide, issue up to four per cycle on Firestorm)
  2. Apple's undocumented AMX instructions, issued from CPU, executed on a special accelerator execution unit
  3. The Neural Engine (called ANE or NPU)
  4. The GPU (e.g. Metal Compute Shaders)

This repository is all about the second of those: Apple's AMX instructions. These instructions are neither documented nor supported by Apple. A potential source of confusion is the name: Apple's AMX instructions are completely distinct from Intel's AMX instructions, though both are intended for issuing matrix multiply operations from a CPU.

The research was done on an Apple M1 Max (2021), with follow-up work on an M2 (2023), and additional follow-up work on an M3 (2023) and M4 Max (2024). Older or newer chips might have different AMX instructions. Some sources report that the M1 contains version 2 of the AMX instructions, which seems plausible (possibly everything using 7-bit writemasks comes from version 1, and everything using 9-bit writemasks is new in version 2). The M1 to M2 transition adds bf16 support, along with a few other tweaks. The M2 to M3 transition adds one extra mode to each of ldx and ldy and matint. The M3 to M4 transition causes some modes of extrh, extrv, vecfp, and vecint to ignore low bits of X/Y offset.

A good one-image summary of AMX is the following figure from abandoned patent US20180074824A1. Consider a 32x32 grid of compute units, where each unit can perform 16-bit multiply-accumulate, or a 2x2 subgrid of units can perform 32-bit multiply-accumulate, or a 4x4 subgrid can perform 64-bit multiply-accumulate. To feed this grid, there is a pool of X registers each containing 32 16-bit elements (or 16 32-bit elements, or 8 64-bit elements) and a pool of Y registers similarly containing 32 16-bit elements (or 16 32-bit elements, or 8 64-bit elements). A single instruction can perform a full outer product: multiply every element of an X register with every element of a Y register, and accumulate with the Z element in the corresponding position.

[Figure: US20180074824A1, Figure 2]
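As a rough mental model of the outer product described above, the following Python sketch emulates one accumulation step in software. The names `lanes` and `outer_product_accumulate`, the list-of-lists layout of Z, and the choice of Y as the row index are assumptions made for illustration only; they are not the actual instruction semantics or register layout.

```python
# Software model of an outer-product accumulate (illustrative only).
# Assumption: a register holds 64 bytes, which matches the element counts
# given above (32 x 16-bit, 16 x 32-bit, 8 x 64-bit).
REGISTER_BYTES = 64


def lanes(element_bytes):
    """Number of elements of the given width that fit in one X or Y register."""
    return REGISTER_BYTES // element_bytes


def outer_product_accumulate(z, x, y):
    """Accumulate x[i] * y[j] onto z[j][i] for every pair (i, j), in place.

    Which operand indexes rows is an assumption of this sketch.
    """
    for j, yj in enumerate(y):
        row = z[j]
        for i, xi in enumerate(x):
            row[i] += xi * yj
    return z


n = lanes(4)                            # 32-bit elements -> 16 lanes
x = [float(i) for i in range(n)]        # 0, 1, 2, ...
y = [float(2 * j) for j in range(n)]    # 0, 2, 4, ...
z = [[0.0] * n for _ in range(n)]       # 16 x 16 accumulator grid

outer_product_accumulate(z, x, y)
outer_product_accumulate(z, x, y)       # a second step adds onto the first
```

Each call touches all `n * n` accumulator positions while reading only `2 * n` input elements, which is what makes the outer-product form attractive for matrix multiplication.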

A single row of the 32x32 grid can also be used to perform vector operations (rather than matrix operations) between X and Yᵀ.
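Vector mode can be pictured as the same hardware restricted to one row: each lane combines the matching elements of the two operands instead of every pair. The sketch below is a software model under that assumption; `vector_multiply_accumulate` is a hypothetical name, and multiply-accumulate is used only as an example operation.

```python
def vector_multiply_accumulate(z_row, x, y):
    """Accumulate x[i] * y[i] onto z_row[i] for each lane i, in place."""
    if not (len(z_row) == len(x) == len(y)):
        raise ValueError("all operands must have the same number of lanes")
    for i in range(len(x)):
        z_row[i] += x[i] * y[i]
    return z_row


# Four lanes are used here for brevity; the lane count is not significant.
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
z_row = [0.5, 0.5, 0.5, 0.5]

vector_multiply_accumulate(z_row, x, y)
```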

In terms of available data types, the general pattern is:

  • IEEE754 f16 or f32 or f64 (same width for all three fused-multiply-add operands)
  • IEEE754 f16 multiplicands, accumulating onto f32
  • On M2 hardware, bf16 multiplicands, accumulating onto bf16 or IEEE754 f32
  • Integer 8-bit or 16-bit multiplicands, accumulating onto 16 or 32 bits (in various signednesses)
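To illustrate why accumulating narrow multiplicands onto a wider type is useful, the sketch below sums f16 products into an f16 accumulator and into an f32 accumulator. It is a pure software model built on Python's `struct` module (format `e` is IEEE754 binary16, `f` is binary32); the helper names `round_to` and `accumulate` are hypothetical, and the rounding here is not claimed to match the hardware bit for bit.

```python
import struct


def round_to(fmt, value):
    """Round a Python float to the precision of the given struct format."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def accumulate(products, acc_fmt):
    """Sum the products, rounding the accumulator to acc_fmt after each add."""
    acc = 0.0
    for p in products:
        acc = round_to(acc_fmt, acc + p)
    return acc


# f16 multiplicands; the product 1.0 * 1.0 is exact in both formats.
a = round_to("<e", 1.0)
b = round_to("<e", 1.0)
products = [a * b] * 4096

narrow = accumulate(products, "<e")  # f16 accumulator
wide = accumulate(products, "<f")    # f32 accumulator

print(narrow, wide)  # 2048.0 4096.0
```

The f16 accumulator stalls at 2048 because f16 has an 11-bit significand: above 2048 the spacing between representable values is 2, so adding 1 rounds back to the same value. The f32 accumulator has enough precision to reach the exact sum.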

This repository provides: