Instruction | General theme | Writemask | Optional special features |
---|---|---|---|
vecfp |
z[_][i] ±= f(x[i], y[i]) |
9 bit | Indexed X or Y, shuffle X, shuffle Y, broadcast Y element, positive selection, min , max |
Bit | Width | Meaning | Notes |
---|---|---|---|
10 | 22 | A64 reserved instruction | Must be 0x201000 >> 10 |
5 | 5 | Instruction | Must be 19 |
0 | 5 | 5-bit GPR index | See below for the meaning of the 64 bits in the GPR |
Bit | Width | Meaning | Notes |
---|---|---|---|
57 | 7 | Ignored | |
54 | 3 | Must be zero | No-op otherwise |
53 | 1 | Indexed load (1 ) or regular load (0 ) |
|
(53=1) 52 | 1 | Ignored | |
(53=1) 49 | 3 | Register to index into | |
(53=1) 48 | 1 | Indices are 4 bits (1 ) or 2 bits (0 ) |
|
(53=1) 47 | 1 | Indexed load of Y (1 ) or of X (0 ) |
|
(53=0) 47 | 6 | ALU mode | |
46 | 1 | Ignored | |
42 | 4 | Lane width mode | |
41 | 1 | Ignored | |
(31=1) 35 | 6 | Ignored | |
(31=1) 32 | 3 | Broadcast mode | |
(31=0) 38 | 3 | Write enable or broadcast mode | |
(31=0) 37 | 1 | Ignored | |
(31=0) 32 | 5 | Write enable value or broadcast lane index | Meaning dependent upon associated mode |
31 | 1 | Perform operation for multiple vectors (1 )or just one vector ( 0 ) |
M2 only (always reads as 0 on M1) |
29 | 2 | X shuffle | |
27 | 2 | Y shuffle | |
26 | 1 | Ignored | |
(31=1) 25 | 1 | "Multiple" means four vectors (1 )or two vectors ( 0 ) |
Top two bits of Z row ignored if operating on four vectors |
20 | 6 | Z row | Low bits ignored in some lane width modes When 31=1, top bit or top two bits ignored |
19 | 1 | Ignored | |
10 | 9 | X offset (in bytes) | On M4, when 31=1, some low bits ignored |
9 | 1 | Ignored | |
0 | 9 | Y offset (in bytes) | On M4, when 31=1, some low bits ignored |
ALU modes:
Floating-point operation | 47 | Notes |
---|---|---|
z + x*y |
0 |
|
z - x*y |
1 |
|
x <= 0 ? 0 : y |
4 |
Z input not used |
min(x, z) |
5 |
Y input not used |
max(x, z) |
7 |
Y input not used |
x * y |
10 |
M2 only. Z input not used |
z + x |
11 |
M2 only. Y input not used |
z + y |
12 |
M2 only. X input not used |
no-op | anything else |
Lane width modes:
X,Y | Z | 42 | Notes |
---|---|---|---|
bf16 | bf16 (one row) | 0 |
M2 only |
bf16 | f32 (two rows, interleaved pair) | 1 |
M2 only |
f16 | f32 (two rows, interleaved pair) | 3 |
|
f32 | f32 (one row) | 4 |
|
f64 | f64 (one row) | 7 |
|
f16 | f16 (one row) | anything else |
Write enable or broadcast modes when 31=0:
Mode | Meaning of value (N) |
---|---|
0 |
Enable all lanes (0 ), or odd lanes only (1 ), or even lanes only (2 ), or enable all lanes but override the ALU output to 0.0 (3 ) or enable all lanes but override X values to 0.0 (4 ) or enable all lanes but override Y values to 0.0 (5 ) or no lanes enabled (anything else) |
1 |
Enable all lanes, but broadcast Y lane #N to all lanes of Y |
2 |
Only enable the first N lanes, or all lanes when N is zero |
3 |
Only enable the last N lanes, or all lanes when N is zero |
4 |
Only enable the first N lanes (no lanes when N is zero) |
5 |
Only enable the last N lanes (no lanes when N is zero) |
6 |
No lanes enabled |
7 |
No lanes enabled |
Broadcast modes when 31=1:
Mode | X inputs | Y inputs | Other effects |
---|---|---|---|
0 |
Consecutive registers | Consecutive registers | |
1 |
Ignored | Ignored | Override ALU output to 0.0 |
2 |
Use same register for every iteration | Consecutive registers | |
3 |
Consecutive registers | Use same register for every iteration | |
4 |
Override values to 0.0 |
Consecutive registers | |
5 |
Consecutive registers | Override values to 0.0 |
|
6 |
Use same register for every iteration, and broadcast lane #0 to all lanes |
Consecutive registers | |
7 |
Consecutive registers | Use same register for every iteration, and broadcast lane #0 to all lanes |
Performs a pointwise fused-multiply-add (or other ALU operation) between an X vector, a Y vector, and a Z vector, accumulating onto the Z vector. All three vectors have the same element type, either f16 or f32 or f64 (or bf16 on M2). Alternatively, when X and Y are both f16 (or bf16 on M2), Z can have type f32, in which case two rows of Z are used (see Mixed lane widths).
On M2, the whole operation can optionally be repeated multiple times, by setting bit 31. Bit 25 controls the repetition count; either two times or four times. By default, consecutive X or Y registers are used as the source operands, but broadcast mode settings can cause the same vector (or lane therein) to be used multiple times. If repeated twice, the top bit of Z row is ignored, and Z row is incremented by 32 for the 2nd iteration. If repeated four times, the top two bits of Z row are ignored, and Z row is incremented by 16 on each iteration.
See vecfp.c. Note the code in test.c to set the DN bit of fpcr
.
A representative sample is:
void emulate_AMX_VECFP(amx_state* state, uint64_t operand) {
if ((operand >> 54) & 7) {
return;
}
operand &=~ (1ull << 37);
int alumode = (operand & VECFP_INDEXED_LOAD) ? 0 : (operand >> 47) & 0x3f;
switch (alumode) {
case 0: case 1: case 4: case 5: case 7:
break;
case 10: case 11: case 12:
if (AMX_VER >= AMX_VER_M2) {
break;
} else {
return;
}
default:
return;
}
uint32_t xybits, zbits, bf16 = 0;
switch ((operand >> 42) & 0xf) {
case 0: xybits = 16; if (AMX_VER >= AMX_VER_M2) { zbits = 16; bf16 = 1; } else { zbits = 16; } break;
case 1: xybits = 16; if (AMX_VER >= AMX_VER_M2) { zbits = 32; bf16 = 1; } else { zbits = 16; } break;
case 3: xybits = 16; zbits = 32; break;
case 4: xybits = 32; zbits = 32; break;
case 7: xybits = 64; zbits = 64; break;
default: xybits = 16; zbits = 16; break;
}
uint32_t xybytes = xybits / 8;
uint64_t z_row = operand >> 20;
uint64_t z_step = 64;
uint64_t x_step = 64;
uint64_t y_step = 64;
int32_t ximask = -1;
if ((AMX_VER >= AMX_VER_M2) && (operand & (1ull << 31))) {
uint64_t bmode = (operand >> 32) & 0x7;
operand &=~ (0x1ffull << 32);
switch (bmode) {
case 1: operand |= 3ull << 32; break; // override ALU operation to 0
case 2: x_step = 0; break; // same x vector for all operations
case 3: y_step = 0; break; // same y vector for all operations
case 4: operand |= 4ull << 32; break; // override x operand to zero
case 5: operand |= 5ull << 32; break; // override y operand to zero
case 6: x_step = 0; ximask = 0; break; // use lane 0 of x vector 0 for all operations
case 7: y_step = 0; operand |= 1ull << 38; break; // use lane 0 of y vector 0 for all operations
}
z_step = z_row & 32 ? 16 : 32;
}
uint64_t x_offset = operand >> 10;
uint64_t y_offset = operand;
for (z_row &= z_step - 1; z_row <= 63; z_row += z_step) {
amx_reg x;
amx_reg y;
load_xy_reg(&x, state->x, x_offset & 0x1FF); x_offset += x_step;
load_xy_reg(&y, state->y, y_offset & 0x1FF); y_offset += y_step;
if (operand & VECFP_INDEXED_LOAD) {
uint32_t src_reg = (operand >> 49) & 7;
uint32_t ibits = (operand & VECFP_INDEXED_LOAD_4BIT) ? 4 : 2;
if (operand & VECFP_INDEXED_LOAD_Y) {
load_xy_reg_indexed(y.u8, state->y[src_reg].u8, ibits, xybits);
y_offset -= y_step - y_step * ibits / xybits;
} else {
load_xy_reg_indexed(x.u8, state->x[src_reg].u8, ibits, xybits);
x_offset -= x_step - x_step * ibits / xybits;
}
}
xy_shuffle(x.u8, (operand >> 29) & 3, xybytes);
xy_shuffle(y.u8, (operand >> 27) & 3, xybytes);
uint64_t x_enable = parse_writemask(operand >> 32, xybytes, 9);
bool broadcast_y = ((operand >> (32+6)) & 7) == 1;
int32_t omask = -1;
if (broadcast_y) {
x_enable = ~(uint64_t)0;
} else if (((operand >> (32+6)) & 7) == 0) {
uint32_t val = (operand >> 32) & 0x3F;
if (val == 3) {
omask = 0;
} else if (val == 4) {
memset(&x, 0, 64);
} else if (val == 5) {
memset(&y, 0, 64);
}
}
if (zbits == 16) {
if (bf16) {
...
} else {
for (uint32_t i = 0; i < 32; i += 1) {
if (!((x_enable >> (i*xybytes)) & 1)) continue;
uint32_t j = broadcast_y ? ((operand >> 32) & 0x1f) : i;
_Float16* z = &state->z[z_row].f16[i];
*z = omask ? vecfp_alu16(x.f16[i & ximask], y.f16[j], *z, alumode) : 0;
}
}
} else {
...
}
}
}
_Float16 vecfp_alu16(_Float16 x, _Float16 y, _Float16 z, int alumode) {
switch (alumode) {
case 0: __asm("fmadd %h0, %h1, %h2, %h3" : "=w"(z) : "w"(x), "w"(y), "w"(z)); break;
case 1: __asm("fmsub %h0, %h1, %h2, %h3" : "=w"(z) : "w"(x), "w"(y), "w"(z)); break;
case 4: z = (x <= (_Float16)0) ? (_Float16)0 : y; break;
case 5: __asm("fmin %h0, %h1, %h2" : "=w"(z) : "w"(x), "w"(z)); break;
case 7: __asm("fmax %h0, %h1, %h2" : "=w"(z) : "w"(x), "w"(z)); break;
case 10: z = x * y; break;
case 11: z = z + x; break;
case 12: z = z + y; break;
}
return z;
}
Note that a fused-multiply-add counts as two floating-point operations. A measurement of 1.0 GFLOPS would mean 109 floating-point operations per second. The measurements are done without any load or store instructions; real-world workloads will need loads and stores, and thus will achieve lower numbers.
X and Y being f16[32]
, each Z accumulator being f16[32]
, ALU operation being z + x*y
or z - x*y
:
Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
---|---|---|---|---|---|---|
1 per thread | 45.4 GFLOPS | 92.4 GFLOPS | 105.2 GFLOPS | 140.1 GFLOPS | 177.4 GFLOPS | 182.1 GFLOPS |
2 per thread | 90.0 GFLOPS | 174.4 GFLOPS | 210.3 GFLOPS | 301.1 GFLOPS | 384.2 GFLOPS | 378.0 GFLOPS |
3 per thread | 135.2 GFLOPS | 275.3 GFLOPS | 311.2 GFLOPS | 409.9 GFLOPS | 500.4 GFLOPS | 463.3 GFLOPS |
4 per thread | 183.7 GFLOPS | 369.3 GFLOPS | 417.2 GFLOPS | 549.9 GFLOPS | 595.2 GFLOPS | 570.3 GFLOPS |
5 per thread | 230.2 GFLOPS | 458.5 GFLOPS | 434.6 GFLOPS | 605.1 GFLOPS | 678.5 GFLOPS | 636.2 GFLOPS |
6 per thread | 272.3 GFLOPS | 546.9 GFLOPS | 522.3 GFLOPS | 710.3 GFLOPS | 743.6 GFLOPS | 758.3 GFLOPS |
7 per thread | 321.9 GFLOPS | 646.6 GFLOPS | 579.7 GFLOPS | 757.0 GFLOPS | 777.5 GFLOPS | 787.9 GFLOPS |
8 per thread | 369.5 GFLOPS | 737.6 GFLOPS | 667.2 GFLOPS | 796.3 GFLOPS | 792.8 GFLOPS | 744.4 GFLOPS |
9 per thread | 339.9 GFLOPS | 685.4 GFLOPS | 625.4 GFLOPS | 786.4 GFLOPS | 803.6 GFLOPS | 812.0 GFLOPS |
10 per thread | 362.0 GFLOPS | 703.8 GFLOPS | 642.9 GFLOPS | 755.8 GFLOPS | 742.5 GFLOPS | 811.2 GFLOPS |
11 per thread | 361.1 GFLOPS | 731.6 GFLOPS | 704.4 GFLOPS | 789.6 GFLOPS | 787.2 GFLOPS | 822.5 GFLOPS |
12 per thread | 359.3 GFLOPS | 728.4 GFLOPS | 649.4 GFLOPS | 777.5 GFLOPS | 785.2 GFLOPS | 830.3 GFLOPS |
13 per thread | 370.4 GFLOPS | 729.3 GFLOPS | 655.7 GFLOPS | 794.0 GFLOPS | 816.7 GFLOPS | 792.1 GFLOPS |
14 per thread | 365.9 GFLOPS | 731.8 GFLOPS | 649.5 GFLOPS | 789.4 GFLOPS | 782.8 GFLOPS | 806.1 GFLOPS |
15 per thread | 364.8 GFLOPS | 697.6 GFLOPS | 657.2 GFLOPS | 802.0 GFLOPS | 776.0 GFLOPS | 821.9 GFLOPS |
16 per thread | 369.1 GFLOPS | 733.9 GFLOPS | 662.7 GFLOPS | 790.0 GFLOPS | 789.8 GFLOPS | 809.1 GFLOPS |
X and Y being f16[32]
, each Z accumulator being f32[2][16]
, ALU operation being z + x*y
or z - x*y
:
Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
---|---|---|---|---|---|---|
1 per thread | 44.6 GFLOPS | 91.9 GFLOPS | 95.6 GFLOPS | 139.6 GFLOPS | 171.6 GFLOPS | 189.6 GFLOPS |
2 per thread | 91.4 GFLOPS | 183.1 GFLOPS | 226.3 GFLOPS | 242.7 GFLOPS | 273.5 GFLOPS | 256.7 GFLOPS |
3 per thread | 137.9 GFLOPS | 276.6 GFLOPS | 285.3 GFLOPS | 391.2 GFLOPS | 461.9 GFLOPS | 407.9 GFLOPS |
4 per thread | 184.1 GFLOPS | 367.3 GFLOPS | 403.4 GFLOPS | 448.1 GFLOPS | 470.6 GFLOPS | 460.9 GFLOPS |
5 per thread | 171.7 GFLOPS | 345.7 GFLOPS | 365.0 GFLOPS | 406.9 GFLOPS | 461.2 GFLOPS | 423.5 GFLOPS |
6 per thread | 184.6 GFLOPS | 369.3 GFLOPS | 406.4 GFLOPS | 433.2 GFLOPS | 465.7 GFLOPS | 456.0 GFLOPS |
7 per thread | 182.0 GFLOPS | 366.9 GFLOPS | 377.2 GFLOPS | 420.8 GFLOPS | 463.3 GFLOPS | 434.6 GFLOPS |
8 per thread | 178.9 GFLOPS | 363.8 GFLOPS | 389.1 GFLOPS | 436.8 GFLOPS | 474.7 GFLOPS | 449.5 GFLOPS |
9 per thread | 185.1 GFLOPS | 369.9 GFLOPS | 376.3 GFLOPS | 442.7 GFLOPS | 465.5 GFLOPS | 424.9 GFLOPS |
10 per thread | 178.1 GFLOPS | 365.0 GFLOPS | 352.8 GFLOPS | 418.3 GFLOPS | 445.1 GFLOPS | 429.2 GFLOPS |
11 per thread | 182.2 GFLOPS | 362.9 GFLOPS | 417.9 GFLOPS | 435.3 GFLOPS | 457.6 GFLOPS | 455.3 GFLOPS |
12 per thread | 179.0 GFLOPS | 362.1 GFLOPS | 395.5 GFLOPS | 442.8 GFLOPS | 452.1 GFLOPS | 440.0 GFLOPS |
13 per thread | 184.2 GFLOPS | 368.3 GFLOPS | 368.7 GFLOPS | 433.7 GFLOPS | 465.8 GFLOPS | 434.9 GFLOPS |
14 per thread | 184.6 GFLOPS | 369.9 GFLOPS | 391.5 GFLOPS | 445.7 GFLOPS | 477.9 GFLOPS | 459.5 GFLOPS |
15 per thread | 182.7 GFLOPS | 368.3 GFLOPS | 382.6 GFLOPS | 436.2 GFLOPS | 470.8 GFLOPS | 442.0 GFLOPS |
16 per thread | 182.0 GFLOPS | 369.2 GFLOPS | 391.2 GFLOPS | 435.9 GFLOPS | 460.0 GFLOPS | 449.9 GFLOPS |
X and Y being f32[16]
, each Z accumulator being f32[16]
, ALU operation being z + x*y
or z - x*y
:
Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
---|---|---|---|---|---|---|
1 per thread | 22.8 GFLOPS | 46.2 GFLOPS | 52.5 GFLOPS | 72.2 GFLOPS | 85.5 GFLOPS | 101.1 GFLOPS |
2 per thread | 45.9 GFLOPS | 91.5 GFLOPS | 99.8 GFLOPS | 151.0 GFLOPS | 176.2 GFLOPS | 193.5 GFLOPS |
3 per thread | 68.8 GFLOPS | 137.7 GFLOPS | 147.9 GFLOPS | 198.9 GFLOPS | 226.3 GFLOPS | 232.9 GFLOPS |
4 per thread | 91.6 GFLOPS | 183.2 GFLOPS | 192.7 GFLOPS | 268.0 GFLOPS | 295.6 GFLOPS | 285.9 GFLOPS |
5 per thread | 113.9 GFLOPS | 229.0 GFLOPS | 235.0 GFLOPS | 313.3 GFLOPS | 358.5 GFLOPS | 361.0 GFLOPS |
6 per thread | 138.4 GFLOPS | 276.6 GFLOPS | 320.7 GFLOPS | 357.2 GFLOPS | 380.4 GFLOPS | 377.3 GFLOPS |
7 per thread | 156.8 GFLOPS | 318.0 GFLOPS | 299.3 GFLOPS | 378.7 GFLOPS | 387.0 GFLOPS | 380.0 GFLOPS |
8 per thread | 184.6 GFLOPS | 369.2 GFLOPS | 330.5 GFLOPS | 404.1 GFLOPS | 424.1 GFLOPS | 401.2 GFLOPS |
9 per thread | 172.3 GFLOPS | 346.6 GFLOPS | 318.7 GFLOPS | 389.4 GFLOPS | 382.3 GFLOPS | 404.6 GFLOPS |
10 per thread | 180.5 GFLOPS | 359.7 GFLOPS | 327.1 GFLOPS | 400.3 GFLOPS | 391.0 GFLOPS | 401.5 GFLOPS |
11 per thread | 176.7 GFLOPS | 355.4 GFLOPS | 323.6 GFLOPS | 381.9 GFLOPS | 380.1 GFLOPS | 393.9 GFLOPS |
12 per thread | 181.7 GFLOPS | 367.4 GFLOPS | 330.8 GFLOPS | 400.9 GFLOPS | 388.6 GFLOPS | 412.3 GFLOPS |
13 per thread | 180.7 GFLOPS | 363.7 GFLOPS | 340.4 GFLOPS | 398.4 GFLOPS | 386.9 GFLOPS | 409.7 GFLOPS |
14 per thread | 183.2 GFLOPS | 367.5 GFLOPS | 334.0 GFLOPS | 397.1 GFLOPS | 414.1 GFLOPS | 410.0 GFLOPS |
15 per thread | 185.1 GFLOPS | 369.0 GFLOPS | 332.6 GFLOPS | 367.8 GFLOPS | 395.1 GFLOPS | 416.4 GFLOPS |
16 per thread | 184.5 GFLOPS | 368.5 GFLOPS | 333.3 GFLOPS | 398.5 GFLOPS | 395.9 GFLOPS | 411.7 GFLOPS |
X and Y being f64[8]
, each Z accumulator being f64[8]
, ALU operation being z + x*y
or z - x*y
:
Z Accumulators | 1 Thread | 2 Threads | 3 Threads | 4 Threads | 5 Threads | 6 Threads |
---|---|---|---|---|---|---|
1 per thread | 11.5 GFLOPS | 23.1 GFLOPS | 26.5 GFLOPS | 34.5 GFLOPS | 50.2 GFLOPS | 49.3 GFLOPS |
2 per thread | 23.1 GFLOPS | 46.2 GFLOPS | 54.1 GFLOPS | 71.2 GFLOPS | 93.8 GFLOPS | 97.8 GFLOPS |
3 per thread | 34.6 GFLOPS | 69.3 GFLOPS | 85.3 GFLOPS | 104.0 GFLOPS | 121.7 GFLOPS | 118.4 GFLOPS |
4 per thread | 46.2 GFLOPS | 92.3 GFLOPS | 116.1 GFLOPS | 137.0 GFLOPS | 163.0 GFLOPS | 146.3 GFLOPS |
5 per thread | 57.7 GFLOPS | 115.5 GFLOPS | 127.1 GFLOPS | 157.4 GFLOPS | 176.2 GFLOPS | 171.2 GFLOPS |
6 per thread | 68.8 GFLOPS | 138.5 GFLOPS | 133.5 GFLOPS | 178.3 GFLOPS | 189.1 GFLOPS | 185.1 GFLOPS |
7 per thread | 80.7 GFLOPS | 161.8 GFLOPS | 150.3 GFLOPS | 190.6 GFLOPS | 193.9 GFLOPS | 200.1 GFLOPS |
8 per thread | 92.4 GFLOPS | 184.9 GFLOPS | 166.9 GFLOPS | 200.4 GFLOPS | 203.5 GFLOPS | 210.6 GFLOPS |
9 per thread | 85.2 GFLOPS | 171.2 GFLOPS | 158.2 GFLOPS | 194.7 GFLOPS | 199.1 GFLOPS | 202.7 GFLOPS |
10 per thread | 91.1 GFLOPS | 182.2 GFLOPS | 162.5 GFLOPS | 194.0 GFLOPS | 196.1 GFLOPS | 203.5 GFLOPS |
11 per thread | 91.7 GFLOPS | 182.9 GFLOPS | 164.8 GFLOPS | 200.6 GFLOPS | 195.7 GFLOPS | 193.7 GFLOPS |
12 per thread | 91.7 GFLOPS | 184.5 GFLOPS | 165.8 GFLOPS | 198.8 GFLOPS | 198.0 GFLOPS | 205.5 GFLOPS |
13 per thread | 92.4 GFLOPS | 184.6 GFLOPS | 166.7 GFLOPS | 201.5 GFLOPS | 204.2 GFLOPS | 206.4 GFLOPS |
14 per thread | 92.7 GFLOPS | 184.9 GFLOPS | 165.4 GFLOPS | 197.8 GFLOPS | 198.0 GFLOPS | 209.4 GFLOPS |
15 per thread | 92.0 GFLOPS | 184.4 GFLOPS | 167.2 GFLOPS | 201.2 GFLOPS | 212.0 GFLOPS | 195.4 GFLOPS |
16 per thread | 92.3 GFLOPS | 184.7 GFLOPS | 166.4 GFLOPS | 199.8 GFLOPS | 198.8 GFLOPS | 203.2 GFLOPS |