Instruction | General theme | Writemask | Optional special features |
---|---|---|---|
`extrh` (26=0) | `x[i] = z[_][i]` | 7 bit | |
`extrh` (26=1,10=0) | `x[i] = f(z[_][i])` | 9 bit | Integer right shift, integer saturation |
`extrh` (26=1,10=1) | `y[i] = f(z[_][i])` | 9 bit | Integer right shift, integer saturation |
Bit | Width | Meaning | Notes |
---|---|---|---|
10 | 22 | A64 reserved instruction | Must be 0x201000 >> 10 |
5 | 5 | Instruction | Must be 8 |
0 | 5 | 5-bit GPR index | See below for the meaning of the 64 bits in the GPR |
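
The encoding table above can be sketched as a small helper that assembles the 32-bit instruction word. This is only an illustration of the bit layout: the function name `amx_encode_extr` is hypothetical and not part of any existing header.

```c
#include <stdint.h>

/* Hypothetical helper (name is an assumption): assembles the 32-bit
 * instruction word from the three fields in the encoding table. */
static uint32_t amx_encode_extr(uint32_t gpr_index) {
    const uint32_t reserved_base = 0x201000u; /* bits 10..31: A64 reserved instruction */
    const uint32_t instruction = 8u;          /* bits 5..9: must be 8 */
    return reserved_base | (instruction << 5) | (gpr_index & 0x1Fu); /* bits 0..4: GPR */
}
```

For example, `amx_encode_extr(3)` yields `0x201103`, which could then be emitted with an assembler `.word` directive.
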
Bit | Width | Meaning | Notes |
---|---|---|---|
63 | 1 | Lane width mode (hi) | See bit 11 |
58 | 5 | Right shift amount | Only applies in mixed lane-width modes, ignored otherwise |
57 | 1 | Z is signed (1) or unsigned (0) | Only applies in mixed lane-width modes, ignored otherwise |
56 | 1 | Z saturation is signed (1) or unsigned (0) | Only applies in mixed lane-width modes, ignored otherwise |
55 | 1 | Saturate Z (1) or truncate Z (0) | Only applies in mixed lane-width modes, ignored otherwise |
54 | 1 | Right shift is rounding (1) or truncating (0) | Only applies in mixed lane-width modes, ignored otherwise |
41 | 13 | Ignored | |
38 | 3 | Write enable mode | |
32 | 6 | Write enable value | Meaning dependent upon associated mode |
27 | 5 | Ignored | |
26 | 1 | Must be 1 for this decode variant | |
20 | 6 | Z row | |
15 | 5 | Ignored | |
11 | 4 | Lane width mode (lo) | See bit 63 |
10 | 1 | Destination is Y (1) or is X (0) | |
9 | 1 | Ignored | |
0 | 9 | Destination offset (in bytes) |
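
As a rough illustration of the 26=1 operand layout, the sketch below packs the non-ignored fields into a 64-bit GPR value. The function name and parameter names are assumptions made for this example; the shift, signedness and saturation bits (54 through 62) are left at zero, which is only appropriate for the same-width lane modes.

```c
#include <stdint.h>

/* Hypothetical helper (names are assumptions): packs the 26=1 operand.
 * Bits 54..62 (shift and saturation controls) are left as zero. */
static uint64_t extrh_operand_hv(unsigned z_row, int to_y, unsigned dst_offset,
                                 unsigned lane_hi, unsigned lane_lo,
                                 unsigned we_mode, unsigned we_value) {
    uint64_t op = 0;
    op |= (uint64_t)(lane_hi & 1u) << 63;     /* lane width mode (hi) */
    op |= (uint64_t)(we_mode & 7u) << 38;     /* write enable mode */
    op |= (uint64_t)(we_value & 0x3Fu) << 32; /* write enable value */
    op |= (uint64_t)1 << 26;                  /* selects this decode variant */
    op |= (uint64_t)(z_row & 0x3Fu) << 20;    /* Z row */
    op |= (uint64_t)(lane_lo & 0xFu) << 11;   /* lane width mode (lo) */
    op |= (uint64_t)(to_y ? 1u : 0u) << 10;   /* destination is Y or X */
    op |= (uint64_t)(dst_offset & 0x1FFu);    /* destination offset in bytes */
    return op;
}
```

For instance, `extrh_operand_hv(5, 1, 64, 0, 8, 0, 0)` describes copying Z row 5 to Y at byte offset 64 with 32-bit lanes and all lanes enabled.
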
Lane widths:
X (or Y) | Z | 63 | 11 | Notes |
---|---|---|---|---|
i8 or u8 | i8 or u8 | 0 | 0 | |
i32 or u32 | i32 or u32 | 0 | 8 | |
i16 or u16 | i32 or u32 (two rows, interleaved pair) | 0 | 9 | Shift and saturation supported |
i16 or u16 | i32 or u32 (four rows, interleaved pair from those) | 0 | 10 | Shift and saturation supported |
i8 or u8 | i32 or u32 (four rows, interleaved quartet) | 0 | 11 | Shift and saturation supported |
i8 or u8 | i16 or u16 (two rows, interleaved pair) | 0 | 13 | Shift and saturation supported |
i16 or u16 | i16 or u16 | 0 | anything else | |
f64 | f64 | 1 | 1 | |
f32 | f32 | 1 | 8 | |
f16 | f16 | 1 | anything else | |
Write enable modes (with regard to X or Y):
Mode | Meaning of value (N) |
---|---|
0 | Enable all lanes (0 or 4 or 5), or odd lanes only (1), or even lanes only (2), or enable all lanes but write 0 to them regardless of Z (3), or no lanes enabled (anything else) |
1 | Only enable lane #N |
2 | Only enable the first N lanes, or all lanes when N is zero |
3 | Only enable the last N lanes, or all lanes when N is zero |
4 | Only enable the first N lanes (no lanes when N is zero) |
5 | Only enable the last N lanes (no lanes when N is zero) |
6 | No lanes enabled |
7 | No lanes enabled |
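
The 9-bit write enable table above can be read as a function from (mode, N) to a set of enabled lanes. The sketch below is one possible reading, not the emulator's `parse_writemask`: the function name is hypothetical, it returns one bit per lane rather than per byte, and the behaviour when N is at least the lane count (treated here as "all lanes" for first/last and "no lane" for mode 1) is an assumption that the table does not state.

```c
#include <stdint.h>

/* Hypothetical sketch: bit i of the result is set when lane i is enabled.
 * lane_count must be between 1 and 64. Out-of-range N handling is assumed. */
static uint64_t extrh_lane_mask9(unsigned mode, unsigned n, unsigned lane_count) {
    uint64_t all = (lane_count >= 64) ? ~(uint64_t)0
                                      : (((uint64_t)1 << lane_count) - 1);
    uint64_t first_n = (n >= lane_count) ? all : (((uint64_t)1 << n) - 1);
    uint64_t last_n = (n >= lane_count) ? all : (all & ~(all >> n));
    switch (mode) {
    case 0:
        switch (n) {
        case 0: case 4: case 5: return all;
        case 1: return all & 0xAAAAAAAAAAAAAAAAull; /* odd lanes */
        case 2: return all & 0x5555555555555555ull; /* even lanes */
        case 3: return all; /* all lanes enabled; zeros are written instead of Z */
        default: return 0;
        }
    case 1: return (n < lane_count) ? ((uint64_t)1 << n) : 0; /* lane #N only */
    case 2: return n ? first_n : all;
    case 3: return n ? last_n : all;
    case 4: return first_n; /* no lanes when N is zero */
    case 5: return last_n;  /* no lanes when N is zero */
    default: return 0;      /* modes 6 and 7 */
    }
}
```

With sixteen lanes, mode 3 and N = 3 gives `0xE000` (the last three lanes), while mode 4 and N = 0 gives no lanes at all.
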
Bit | Width | Meaning | Notes |
---|---|---|---|
48 | 16 | Ignored | |
46 | 2 | Write enable mode | |
41 | 5 | Write enable value | Meaning dependent upon associated mode |
30 | 11 | Ignored | |
28 | 2 | Lane width mode | |
27 | 1 | Must be 0 | Otherwise decodes as extrx |
26 | 1 | Must be 0 for this decode variant | |
20 | 6 | Z row | |
19 | 1 | Ignored | |
10 | 9 | Destination offset (in bytes) | Destination is always X for this decode variant |
0 | 10 | Ignored |
Lane width modes:
X,Z | 28 |
---|---|
any 64-bit | 0 |
any 32-bit | 1 |
any 16-bit | 2 |
any 16-bit, but with high 8 bits of each lane disabled | 3 |
Write enable modes (with regard to X):
Mode | Meaning of value (N) |
---|---|
0 | Enable all lanes (0), or odd lanes only (1), or even lanes only (2), or no lanes (anything else) |
1 | Only enable lane #N |
2 | Only enable the first N lanes, or all lanes when N is zero |
3 | Only enable the last N lanes, or all lanes when N is zero |
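
For the 26=0 variant, the operand fields sit at different bit positions. The following sketch packs them and maps the 2-bit lane width mode to a lane size in bytes. Both function names are hypothetical, and the byte widths simply restate the lane width table (mode 3 is still a 16-bit lane, with the high byte of each lane disabled).

```c
#include <stdint.h>

/* Hypothetical helper (name is an assumption): packs the 26=0 operand.
 * Bits 26 and 27 stay zero, as this decode variant requires. */
static uint64_t extrh_operand_x(unsigned z_row, unsigned dst_offset,
                                unsigned lane_mode,
                                unsigned we_mode, unsigned we_value) {
    uint64_t op = 0;
    op |= (uint64_t)(we_mode & 3u) << 46;       /* write enable mode */
    op |= (uint64_t)(we_value & 0x1Fu) << 41;   /* write enable value */
    op |= (uint64_t)(lane_mode & 3u) << 28;     /* lane width mode */
    op |= (uint64_t)(z_row & 0x3Fu) << 20;      /* Z row */
    op |= (uint64_t)(dst_offset & 0x1FFu) << 10; /* destination offset in bytes */
    return op;
}

/* Lane size in bytes for lane width modes 0..3 (64-, 32-, 16-, 16-bit). */
static unsigned extrh_x_lane_bytes(unsigned lane_mode) {
    static const unsigned char widths[4] = {8, 4, 2, 2};
    return widths[lane_mode & 3u];
}
```
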
When X/Y/Z all have the same lane width (which is always the case when 26=0), this operation is simple: the field at bit 20 identifies a Z row, and that row is copied to X (or transposed and copied to Y). The lane width only affects the write-enable logic.
When Z is wider than X/Y, this operation is more complex, as it needs to perform narrowing. The four mixed-width modes are 9, 10, 11, and 13. All of these modes support a right shift and optional saturation of the Z values, after which the low bits are taken.
Mode 9 (32-bit Z elements, 16-bit X/Y elements), correspondence between X/Y lanes and a pair of Z registers:
Z0 | 0 | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 | 18 | 20 | 22 | 24 | 26 | 28 | 30 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Z1 | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 15 | 17 | 19 | 21 | 23 | 25 | 27 | 29 | 31 |
Mode 10 (32-bit Z elements, 16-bit X/Y elements), correspondence between X/Y lanes and a quartet of Z registers:
Z0 | 0 | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 | 18 | 20 | 22 | 24 | 26 | 28 | 30 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Z1 | ||||||||||||||||
Z2 | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 15 | 17 | 19 | 21 | 23 | 25 | 27 | 29 | 31 |
Z3 |
Mode 11 (32-bit Z elements, 8-bit X/Y elements), correspondence between X/Y lanes and a quartet of Z registers:
Z0 | 0 | 4 | 8 | 12 | 16 | 20 | 24 | 28 | 32 | 36 | 40 | 44 | 48 | 52 | 56 | 60 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Z1 | 1 | 5 | 9 | 13 | 17 | 21 | 25 | 29 | 33 | 37 | 41 | 45 | 49 | 53 | 57 | 61 |
Z2 | 2 | 6 | 10 | 14 | 18 | 22 | 26 | 30 | 34 | 38 | 42 | 46 | 50 | 54 | 58 | 62 |
Z3 | 3 | 7 | 11 | 15 | 19 | 23 | 27 | 31 | 35 | 39 | 43 | 47 | 51 | 55 | 59 | 63 |
Mode 13 (16-bit Z elements, 8-bit X/Y elements), correspondence between X/Y lanes and a pair of Z registers:
Z0 | 0 | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 | 18 | 20 | 22 | 24 | 26 | 28 | 30 | 32 | 34 | 36 | 38 | 40 | 42 | 44 | 46 | 48 | 50 | 52 | 54 | 56 | 58 | 60 | 62 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Z1 | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 15 | 17 | 19 | 21 | 23 | 25 | 27 | 29 | 31 | 33 | 35 | 37 | 39 | 41 | 43 | 45 | 47 | 49 | 51 | 53 | 55 | 57 | 59 | 61 | 63 |
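
The four correspondence tables follow a simple pattern, which the sketch below restates as arithmetic. The struct and function names are hypothetical; `z_reg` is the register index within the pair or quartet as labelled in the tables (Z0 to Z3), not an absolute Z row number, and `z_lane` is the column within that register.

```c
#include <stdint.h>

/* Hypothetical types and names, introduced only for this illustration. */
typedef struct {
    unsigned z_reg;  /* 0..3: which of Z0..Z3 in the tables above */
    unsigned z_lane; /* element index within that Z register */
} extrh_source;

/* Maps an X/Y lane to its source in Z for the mixed-width modes.
 * Returns 0 on success, or -1 if mode is not 9, 10, 11 or 13. */
static int extrh_source_of_lane(unsigned mode, unsigned xy_lane, extrh_source *out) {
    switch (mode) {
    case 9:  /* pair of registers, lanes alternate Z0, Z1 */
    case 13:
        out->z_reg = xy_lane % 2;
        out->z_lane = xy_lane / 2;
        return 0;
    case 10: /* quartet, but only Z0 and Z2 supply lanes */
        out->z_reg = (xy_lane % 2) * 2;
        out->z_lane = xy_lane / 2;
        return 0;
    case 11: /* quartet, lanes cycle through Z0..Z3 */
        out->z_reg = xy_lane % 4;
        out->z_lane = xy_lane / 4;
        return 0;
    default:
        return -1;
    }
}
```

For example, in mode 11 the X/Y lane 62 comes from Z2, column 15, which matches the last entry of the Z2 row in the mode 11 table.
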
See extr.c.
A representative sample is:
```c
void emulate_AMX_EXTRX(amx_state* state, uint64_t operand) {
    void* dst;
    uint64_t dst_offset;
    uint64_t z_row = (operand >> 20) & 63;
    uint64_t store_enable = ~(uint64_t)0; // one bit per destination byte
    uint8_t buffer[64];
    uint32_t stride = 0; // non-zero only in the mixed lane-width modes
    uint32_t zbytes, xybytes;

    if (operand & EXTR_HV) {
        // 26=1 variant: destination is X or Y, 9-bit writemask
        dst = (operand & EXTR_HV_TO_Y) ? state->y : state->x;
        dst_offset = operand & 0x1FF;
        // Lane width mode: bit 63 (hi) combined with bits 11..14 (lo)
        switch (((operand >> 63) << 4) | ((operand >> 11) & 0xF)) {
        case  0: xybytes = 1; zbytes = 1; break;             // i8/u8
        case  8: xybytes = 4; zbytes = 4; break;             // i32/u32
        case  9: xybytes = 2; zbytes = 4; stride = 1; break; // 32-bit Z to 16-bit X/Y
        case 10: xybytes = 2; zbytes = 4; stride = 2; break; // 32-bit Z to 16-bit X/Y
        case 11: xybytes = 1; zbytes = 4; stride = 1; break; // 32-bit Z to 8-bit X/Y
        case 13: xybytes = 1; zbytes = 2; stride = 1; break; // 16-bit Z to 8-bit X/Y
        case 17: xybytes = 8; zbytes = 8; break;             // f64
        case 24: xybytes = 4; zbytes = 4; break;             // f32
        default: xybytes = 2; zbytes = 2; break;             // any other: 16-bit
        }
        store_enable &= parse_writemask(operand >> 32, xybytes, 9);
    } else if (operand & EXTR_BETWEEN_XY) {
        ...
    } else {
        // 26=0 variant: destination is always X, 7-bit writemask
        dst = state->x;
        dst_offset = (operand >> 10) & 0x1FF;
        xybytes = 8 >> ((operand >> 28) & 3);
        if (xybytes == 1) {
            // Lane width mode 3: 16-bit lanes with the high 8 bits of each lane disabled
            xybytes = 2;
            store_enable &= 0x5555555555555555ull;
        }
        store_enable &= parse_writemask(operand >> 41, xybytes, 7);
        zbytes = xybytes;
    }

    uint32_t signext = (operand & EXTR_SIGNED_INPUT) ? 64 - zbytes*8 : 0;
    for (uint32_t i = 0; i < 64; i += xybytes) {
        uint64_t zoff = (i & (zbytes - 1)) / xybytes * stride;
        int64_t val = load_int(&state->z[bit_select(z_row, z_row + zoff, zbytes - 1)].u8[i & -zbytes], zbytes, signext);
        if (stride) val = extr_alu(val, operand, xybytes*8); // shift and saturate when narrowing
        store_int(buffer + i, xybytes, val);
    }
    if ((operand & EXTR_HV) && (((operand >> 32) & 0x1ff) == 3)) {
        // Write enable mode 0 with value 3: write zeros regardless of Z
        memset(buffer, 0, sizeof(buffer));
    }
    store_xy_row(dst, dst_offset, buffer, store_enable);
}

int64_t extr_alu(int64_t val, uint64_t operand, uint32_t outbits) {
    uint32_t shift = (operand >> 58) & 0x1f;
    if (shift && (operand & EXTR_ROUNDING_SHIFT)) {
        val += 1 << (shift - 1); // add half of the shifted-out range before shifting
    }
    val >>= shift;
    if (operand & EXTR_SATURATE) {
        if (operand & EXTR_SIGNED_OUTPUT) outbits -= 1;
        int64_t hi = 1ull << outbits;
        if (operand & EXTR_SIGNED_INPUT) {
            int64_t lo = (operand & EXTR_SIGNED_OUTPUT) ? -hi : 0;
            if (val < lo) val = lo;
            if (val >= hi) val = hi - 1;
        } else {
            if ((uint64_t)val >= (uint64_t)hi) val = hi - 1;
        }
    }
    return val;
}
```