Full SME(1) instruction support and STREAMING Groups #415

FinnWilkinson · 2024-06-12T10:50:19Z

This PR implements all available SME (version 1) instructions that are contained within LLVM 14.0.5. Specifically, this is Version 2021-06 of the Armv9-A A64 ISA.

FP16 and BF16 instructions are not supported, because C++17 lacks native types for them. All Quad-Word instruction variants are emulated using 64-bit data types.
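As a rough illustration of the quad-word emulation mentioned above, the sketch below treats each 128-bit element as two consecutive 64-bit words. The helper name `copyQuadWord` and the flat `std::array` layout are assumptions made for this example only; they are not taken from the SimEng sources.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not SimEng code): copy the `idx`-th 128-bit element
// from `src` to `dst`, where both buffers store every quad-word as two
// consecutive 64-bit words (low word first).
template <std::size_t N64>
void copyQuadWord(const std::array<std::uint64_t, N64>& src,
                  std::array<std::uint64_t, N64>& dst, std::size_t idx) {
  static_assert(N64 % 2 == 0, "buffer must hold whole 128-bit elements");
  assert(2 * idx + 1 < N64);
  dst[2 * idx] = src[2 * idx];          // low 64 bits
  dst[2 * idx + 1] = src[2 * idx + 1];  // high 64 bits
}
```

With this layout, a predicated quad-word move reduces to calling the helper once per active 128-bit element.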

In addition, new STREAMING_SVE and STREAMING_PREDICATE groups have been introduced (along with corresponding decode logic) so that these instructions can be given a different pipeline / latency configuration when SVE Streaming Mode (the context mode in which SME instructions are executed) is enabled. This allows a co-processor style implementation of SME to be modelled within SimEng: additional latency / reduced throughput can be configured to mimic an offload penalty, and different execution or LD/STR hardware can be modelled for the co-processor than for the main core.
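A minimal sketch of the idea, assuming a simplified group enum: when Streaming Mode is enabled, a decoded instruction is moved from its base group to the matching streaming group, so a config file can assign it different ports or latencies. The `Group` enum and `selectGroup` function are hypothetical and much smaller than the real definitions in `InstructionGroups.hh`.

```cpp
#include <cstdint>

// Hypothetical, simplified group identifiers for illustration only.
enum class Group : std::uint16_t {
  SVE,
  PREDICATE,
  STREAMING_SVE,
  STREAMING_PREDICATE
};

// Return the group an instruction should be scheduled under. Base SVE and
// predicate groups are remapped to their streaming counterparts when
// Streaming Mode is enabled; every other group is left unchanged.
constexpr Group selectGroup(Group base, bool streamingModeEnabled) {
  if (!streamingModeEnabled) return base;
  switch (base) {
    case Group::SVE:
      return Group::STREAMING_SVE;
    case Group::PREDICATE:
      return Group::STREAMING_PREDICATE;
    default:
      return base;
  }
}
```

Keeping the remapping in one function means the rest of the pipeline model only ever sees a group identifier and does not need to know about the mode.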

Add STREAMING Group support
Add execution logic and regression tests for all missing SME instructions

FinnWilkinson · 2024-07-09T11:18:52Z

#rerun tests

src/include/simeng/arch/aarch64/Architecture.hh

src/lib/arch/aarch64/Architecture.cc

src/include/simeng/arch/aarch64/InstructionGroups.hh

src/lib/arch/aarch64/Instruction_address.cc

src/lib/arch/aarch64/Instruction_execute.cc

test/regression/aarch64/instructions/sme.cc

jj16791

Some comments, and I agree with several of Alex's comments. I think it would be good to get the ARM SME/SVE loops into our functional verification checks to help test these new instructions. I assume that would have to be done somewhere private though (not sure if we already have that guarantee in the upcoming CI/CD pipelines)?

CMakeLists.txt

configs/a64fx_SME.yaml

docs/sphinx/assets/instruction_groups_AArch64.png

src/lib/arch/aarch64/Instruction_address.cc

src/lib/arch/aarch64/Instruction.cc

src/lib/arch/aarch64/Architecture.cc

src/lib/arch/aarch64/Instruction_decode.cc

src/include/simeng/arch/aarch64/helpers/neon.hh

src/lib/arch/aarch64/Instruction_execute.cc

test/regression/aarch64/instructions/sme.cc

dANW34V3R

Haven't finished the review but posting comments to prevent overlaps

CMakeLists.txt

src/include/simeng/arch/aarch64/Architecture.hh

src/lib/arch/aarch64/Instruction_decode.cc

src/include/simeng/Register.hh

configs/a64fx_SME.yaml

dANW34V3R

LOTS of new instructions, well done for grinding through them. Bring on SAIL

src/include/simeng/arch/aarch64/Instruction.hh

src/lib/arch/riscv/Instruction_decode.cc

src/lib/arch/aarch64/Instruction_execute.cc

FinnWilkinson · 2024-12-03T16:51:12Z

Now outdated, as the STREAMING groups logic (the only cause of the slowdown) has since been removed.
See below for this PR's performance compared to `dev` (times averaged over 5 runs):

| Benchmark | `dev` Time (ms) | `dev` StdDev | This PR Time (ms) | % diff to `dev` | This PR StdDev |
| --- | --- | --- | --- | --- | --- |
| CloverLeaf serial gcc8.3.0 armv8.4 | 13194.4 | 60.1 | 13557.0 | 2.71% | 132.52 |
| CloverLeaf serial gcc9.3.0 armv8.4 | 13050.6 | 102.7 | 13580.2 | 3.98% | 84.94 |
| CloverLeaf serial gcc10.3.0 armv8.4 | 13290.4 | 47.9 | 13623.0 | 2.47% | 44.06 |
| CloverLeaf serial armclang20 armv8.4 | 11804.4 | 39.1 | 12343.2 | 4.46% | 77.05 |
| CloverLeaf openmp gcc8.3.0 armv8.4 | 17509.4 | 161.5 | 17889.8 | 2.15% | 65.83 |
| CloverLeaf openmp gcc9.3.0 armv8.4 | 17584.4 | 182.0 | 17995.2 | 2.31% | 152.27 |
| CloverLeaf openmp gcc10.3.0 armv8.4 | 17119.8 | 61.3 | 17651.4 | 3.06% | 79.05 |
| CloverLeaf openmp armclang20 armv8.4 | 15820.8 | 95.4 | 16211.0 | 2.44% | 83.58 |
| miniBUDE openmp gcc8.3.0 armv8.4 | 24691.2 | 52.3 | 24505.6 | -0.75% | 276.93 |
| miniBUDE openmp gcc9.3.0 armv8.4 | 24500.0 | 175.6 | 24412.8 | -0.36% | 155.77 |
| miniBUDE openmp gcc10.3.0 armv8.4 | 24438.0 | 146.7 | 24260.6 | -0.73% | 77.47 |
| miniBUDE openmp armclang20 armv8.4 | 22725.2 | 150.0 | 22343.4 | -1.69% | 67.39 |
| STREAM serial gcc8.3.0 armv8.4 | 7378.0 | 40.3 | 7769.8 | 5.17% | 29.84 |
| STREAM serial gcc9.3.0 armv8.4 | 7380.4 | 48.6 | 7722.6 | 4.53% | 68.62 |
| STREAM serial gcc10.3.0 armv8.4 | 7530.6 | 71.7 | 7632.6 | 1.35% | 39.53 |
| STREAM serial armclang20 armv8.4 | 8948.0 | 70.6 | 8317.4 | -7.30% | 36.88 |
| STREAM openmp gcc8.3.0 armv8.4 | 11552.6 | 139.5 | 12020.4 | 3.97% | 111.61 |
| STREAM openmp gcc9.3.0 armv8.4 | 11737.0 | 133.1 | 11855.8 | 1.01% | 48.96 |
| STREAM openmp gcc10.3.0 armv8.4 | 11357.4 | 36.4 | 11768.0 | 3.55% | 95.17 |
| STREAM openmp armclang20 armv8.4 | 12701.0 | 227.5 | 12309.0 | -3.13% | 87.32 |
| TeaLeaf 2D serial gcc8.3.0 armv8.4 | 13964.4 | 41.8 | 13605.8 | -2.60% | 42.13 |
| TeaLeaf 2D serial gcc9.3.0 armv8.4 | 13976.2 | 40.8 | 13553.6 | -3.07% | 88.90 |
| TeaLeaf 2D serial gcc10.3.0 armv8.4 | 14231.0 | 92.2 | 13961.2 | -1.91% | 109.20 |
| TeaLeaf 2D serial armclang20 armv8.4 | 25691.8 | 86.2 | 24628.8 | -4.22% | 199.33 |
| TeaLeaf 2D openmp gcc8.3.0 armv8.4 | 20085.2 | 88.6 | 20070.4 | -0.07% | 110.76 |
| TeaLeaf 2D openmp gcc9.3.0 armv8.4 | 19980.2 | 79.3 | 20492.8 | 2.53% | 146.48 |
| TeaLeaf 2D openmp gcc10.3.0 armv8.4 | 19684.8 | 88.1 | 19522.4 | -0.83% | 100.20 |
| TeaLeaf 2D openmp armclang20 armv8.4 | 58068.6 | 251.6 | 61880.2 | 6.36% | 284.36 |
| TeaLeaf 3D serial gcc8.3.0 armv8.4 | 15853.0 | 128.6 | 15818.2 | -0.22% | 57.76 |
| TeaLeaf 3D serial gcc9.3.0 armv8.4 | 16483.8 | 58.3 | 16334.6 | -0.91% | 87.93 |
| TeaLeaf 3D serial gcc10.3.0 armv8.4 | 16839.8 | 86.0 | 16521.4 | -1.91% | 28.94 |
| TeaLeaf 3D serial armclang20 armv8.4 | 23052.2 | 157.0 | 22959.8 | -0.40% | 134.67 |
| TeaLeaf 3D openmp gcc8.3.0 armv8.4 | 26103.0 | 145.5 | 26294.8 | 0.73% | 190.12 |
| TeaLeaf 3D openmp gcc9.3.0 armv8.4 | 26203.6 | 103.0 | 27278.8 | 4.02% | 239.28 |
| TeaLeaf 3D openmp gcc10.3.0 armv8.4 | 26068.2 | 278.0 | 26129.6 | 0.24% | 112.81 |
| TeaLeaf 3D openmp armclang20 armv8.4 | 45312.4 | 179.0 | 48379.4 | 6.55% | 136.36 |
| CloverLeaf serial gcc8.3.0 armv8.4+sve | 12763.0 | 89.1 | 13372.0 | 4.66% | 59.14 |
| CloverLeaf serial gcc9.3.0 armv8.4+sve | 12675.4 | 52.4 | 13300.4 | 4.81% | 134.66 |
| CloverLeaf serial gcc10.3.0 armv8.4+sve | 12665.4 | 88.7 | 13086.4 | 3.27% | 63.11 |
| CloverLeaf serial armclang20 armv8.4+sve | 12512.8 | 79.5 | 12963.4 | 3.54% | 71.92 |
| CloverLeaf openmp gcc8.3.0 armv8.4+sve | 16973.8 | 119.5 | 17630.2 | 3.79% | 197.66 |
| CloverLeaf openmp gcc9.3.0 armv8.4+sve | 17076.6 | 132.9 | 17460.8 | 2.22% | 53.09 |
| CloverLeaf openmp gcc10.3.0 armv8.4+sve | 16814.4 | 96.4 | 17264.4 | 2.64% | 76.24 |
| CloverLeaf openmp armclang20 armv8.4+sve | 16436.8 | 82.2 | 16844.2 | 2.45% | 98.85 |
| miniBUDE openmp gcc8.3.0 armv8.4+sve | 9745.6 | 125.8 | 10291.4 | 5.45% | 90.47 |
| miniBUDE openmp gcc9.3.0 armv8.4+sve | 9172.0 | 41.3 | 10081.6 | 9.45% | 64.37 |
| miniBUDE openmp gcc10.3.0 armv8.4+sve | 9180.0 | 36.6 | 10054.0 | 9.09% | 61.30 |
| miniBUDE openmp armclang20 armv8.4+sve | 9746.6 | 63.0 | 10098.8 | 3.55% | 85.55 |
| STREAM serial gcc8.3.0 armv8.4+sve | 3915.0 | 18.9 | 4139.4 | 5.57% | 15.92 |
| STREAM serial gcc9.3.0 armv8.4+sve | 3919.4 | 16.7 | 4139.2 | 5.46% | 18.14 |
| STREAM serial gcc10.3.0 armv8.4+sve | 3862.0 | 29.9 | 4086.2 | 5.64% | 23.04 |
| STREAM serial armclang20 armv8.4+sve | 2550.2 | 3.7 | 2593.4 | 1.68% | 17.33 |
| STREAM openmp gcc8.3.0 armv8.4+sve | 7977.4 | 32.4 | 8196.2 | 2.71% | 38.70 |
| STREAM openmp gcc9.3.0 armv8.4+sve | 7987.4 | 87.9 | 8265.6 | 3.42% | 12.76 |
| STREAM openmp gcc10.3.0 armv8.4+sve | 7999.2 | 69.2 | 8051.0 | 0.65% | 34.07 |
| STREAM openmp armclang20 armv8.4+sve | 6836.0 | 10.0 | 6990.8 | 2.24% | 35.39 |
| TeaLeaf 2D serial gcc8.3.0 armv8.4+sve | 14022.8 | 99.5 | 13579.0 | -3.22% | 59.23 |
| TeaLeaf 2D serial gcc9.3.0 armv8.4+sve | 13996.4 | 63.8 | 13610.4 | -2.80% | 64.07 |
| TeaLeaf 2D serial gcc10.3.0 armv8.4+sve | 14362.6 | 59.8 | 13831.0 | -3.77% | 65.83 |
| TeaLeaf 2D serial armclang20 armv8.4+sve | 9835.2 | 75.5 | 9782.0 | -0.54% | 113.49 |
| TeaLeaf 2D openmp gcc8.3.0 armv8.4+sve | 19885.8 | 62.1 | 20026.2 | 0.70% | 69.48 |
| TeaLeaf 2D openmp gcc9.3.0 armv8.4+sve | 20028.2 | 143.4 | 20322.0 | 1.46% | 111.62 |
| TeaLeaf 2D openmp gcc10.3.0 armv8.4+sve | 19695.6 | 83.6 | 19575.4 | -0.61% | 38.66 |
| TeaLeaf 2D openmp armclang20 armv8.4+sve | 57176.4 | 405.7 | 59327.4 | 3.69% | 324.77 |
| TeaLeaf 3D serial gcc8.3.0 armv8.4+sve | 13828.8 | 50.1 | 14023.6 | 1.40% | 51.09 |
| TeaLeaf 3D serial gcc9.3.0 armv8.4+sve | 13901.6 | 36.4 | 14065.6 | 1.17% | 35.20 |
| TeaLeaf 3D serial gcc10.3.0 armv8.4+sve | 14043.8 | 58.0 | 14203.2 | 1.13% | 103.32 |
| TeaLeaf 3D serial armclang20 armv8.4+sve | 22478.8 | 138.4 | 22850.6 | 1.64% | 51.21 |
| TeaLeaf 3D openmp gcc8.3.0 armv8.4+sve | 23927.6 | 73.3 | 24201.0 | 1.14% | 94.22 |
| TeaLeaf 3D openmp gcc9.3.0 armv8.4+sve | 23638.8 | 119.3 | 24663.4 | 4.24% | 138.94 |
| TeaLeaf 3D openmp gcc10.3.0 armv8.4+sve | 23550.4 | 130.2 | 24060.4 | 2.14% | 31.31 |
| TeaLeaf 3D openmp armclang20 armv8.4+sve | 48104.8 | 253.4 | 50293.2 | 4.45% | 319.78 |
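
For anyone wanting to reproduce this kind of comparison, here is one plausible way to derive the per-benchmark statistics from raw run times. The PR does not state the exact formulas used, so the sample (n - 1) standard deviation and the "relative to the baseline" percentage are both assumptions, and the function names are made up for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Arithmetic mean of a set of run times (ms).
double mean(const std::vector<double>& xs) {
  assert(!xs.empty());
  double sum = 0.0;
  for (double x : xs) sum += x;
  return sum / static_cast<double>(xs.size());
}

// Sample standard deviation (n - 1 denominator). Assumption: the PR may
// equally have used the population form.
double sampleStdDev(const std::vector<double>& xs) {
  assert(xs.size() > 1);
  const double mu = mean(xs);
  double sq = 0.0;
  for (double x : xs) sq += (x - mu) * (x - mu);
  return std::sqrt(sq / static_cast<double>(xs.size() - 1));
}

// Relative change of the candidate against the baseline, in percent.
// Positive values mean the candidate is slower than the baseline.
double percentDiff(double baseMs, double candidateMs) {
  assert(baseMs != 0.0);
  return (candidateMs - baseMs) / baseMs * 100.0;
}
```

Feeding the five `dev` timings and the five PR timings for one benchmark through `mean` and `sampleStdDev`, then passing the two means to `percentDiff`, would produce one row of a table like the one above.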

jj16791

Looks good; we just need to finish resolving the existing conversations for the sake of clarity.

src/include/simeng/arch/aarch64/helpers/neon.hh

test/regression/aarch64/instructions/sme.cc

jj16791 · 2024-12-14T11:18:56Z

src/include/simeng/arch/aarch64/helpers/neon.hh

@@ -585,9 +590,14 @@ RegisterValue vecUMinP(srcValContainer& sourceValues) {
  const T* n = sourceValues[0].getAsVector<T>();
  const T* m = sourceValues[1].getAsVector<T>();

+  // Concatenate the vectors
+  T temp[2 * I];
+  memcpy(temp, m, sizeof(T) * I);


Shouldn't m and n be switched here?

Good spot, I had only updated maxP to be in line with HW...
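
For reference, a standalone sketch of the operand order under discussion. In the Arm pseudocode for pairwise operations the first source (n) forms the low half of the concatenated vector and the second source (m) the high half, so n should be copied first. `pairwiseMin` is a hypothetical free function written for this example; it is not the `vecUMinP` helper from `neon.hh` and it uses `std::array` rather than `RegisterValue`.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Pairwise minimum over the concatenation of n (low half) and m (high half).
// Result element i is the minimum of concatenated elements 2i and 2i + 1.
template <typename T, std::size_t I>
std::array<T, I> pairwiseMin(const std::array<T, I>& n,
                             const std::array<T, I>& m) {
  static_assert(I % 2 == 0, "element count must be even");
  // Concatenate the vectors: n first, then m.
  std::array<T, 2 * I> temp;
  std::copy(n.begin(), n.end(), temp.begin());
  std::copy(m.begin(), m.end(), temp.begin() + I);

  std::array<T, I> out;
  for (std::size_t i = 0; i < I; ++i) {
    out[i] = std::min(temp[2 * i], temp[2 * i + 1]);
  }
  return out;
}
```

With n = {1, 5, 3, 2} and m = {9, 8, 7, 6}, the low half of the result comes from pairs of n and the high half from pairs of m; swapping the copy order would swap those halves.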

FinnWilkinson added the enhancement (New feature or request) and 0.9.7 (Part of SimEng Release 0.9.7) labels Jun 12, 2024

FinnWilkinson self-assigned this Jun 12, 2024

FinnWilkinson force-pushed the additional-sme-support branch from 531ebd0 to 7974237 Compare August 9, 2024 15:58

FinnWilkinson marked this pull request as ready for review August 28, 2024 13:41

FinnWilkinson requested review from dANW34V3R, jj16791, JosephMoore25 and ABenC377 August 28, 2024 13:41

ABenC377 requested changes Aug 30, 2024

FinnWilkinson changed the title ~~[WIP] Full SME(1) instruction support and STREAMING Groups~~ Full SME(1) instruction support and STREAMING Groups Sep 2, 2024

jj16791 requested changes Oct 26, 2024

dANW34V3R reviewed Oct 28, 2024

FinnWilkinson force-pushed the additional-sme-support branch from 4ad3b6e to aa40d88 Compare October 29, 2024 14:46

FinnWilkinson force-pushed the additional-sme-support branch from 91c4336 to 5945bae Compare November 6, 2024 16:44

ABenC377 previously approved these changes Nov 11, 2024

dANW34V3R previously approved these changes Dec 4, 2024

FinnWilkinson dismissed stale reviews from dANW34V3R and ABenC377 via 0af7abc December 5, 2024 13:35

ABenC377 previously approved these changes Dec 5, 2024

jj16791 reviewed Dec 8, 2024

jj16791 requested changes Dec 14, 2024

FinnWilkinson mentioned this pull request Dec 16, 2024

[AArch64] NEON, SVE2 and SME2 instruction support with tests #439

Open

FinnWilkinson dismissed ABenC377’s stale review via 31a3e6b December 16, 2024 12:36

ABenC377 previously approved these changes Dec 17, 2024

FinnWilkinson force-pushed the additional-sme-support branch from 31a3e6b to c0b2316 Compare December 17, 2024 17:37

dANW34V3R previously approved these changes Dec 18, 2024

FinnWilkinson added 26 commits December 20, 2024 11:17

Fixed execution logic for vertical ST1D and ST1W SME stores.

40228df

Implemented SME ST1B and ST1H (H and V) instruction logic.

a55e45d

Implemented SME LD1B and LD1H (H and V) instruction logic.

b73ca9e

Added SME LD1B and LD1H regression tests.

9461680

Updated ST1D and ST1W SME regression tests.

a3ba507

Added SME ST1B and ST1H regression tests.

e9d4cf2

Implemented SME MOVA (Tile to Vec, horizontal) instructions and regression test (B, H, S, D)

faf54a7

Implemented SME MOVA (Tile to Vec, vertical) instructions and regression test (B, H, S, D)

594a5b8

Implemented SME MOV (Tile to Vec, vertical and horizontal) instruction alias and regression tests (B, H, S, D)

c3aed6d

Implemented SME MOVA/MOV (Vec to Tile, vertical and horizontal) instructions and aliases and regression tests (B, H, S, D)

a927b37

Implemented SME LDR instruction and regression tests.

0869be6

Implemented SME ADDHA and ADDVA (S and D) instructions and regression tests.

dca22ea

Updated ADDHA test to make more specific.

b585701

Corrected ADDVA execution logic.

53959cf

Updated ADDVA test to make more specific.

a6b61e7

Added SME MOVA (tile to vec, vec to tile) Quad-word instructions and regression tests.

857cd9b

Implemented SME ST1Q and LD1Q (V and H) instructions and regression tests.

882ce0a

Removed werror.

d33d1c1

NEON instruction logic fixes.

b5c4cda

Attended PR comments.

7b74b34

Switched order of concatonation for NEON UMAXP instruction to match Hardware.

790f3df

Fixed LD1W (into ZA, 32-bit) buffer overflow error.

e1ab10c

Removed STREAMING_SVE and STREAMING_PREDICATE groups and associated logic.

a39dd23

Reverted docs aarch64 instruction groups image.

611d607

Fixed order of vector concat for NEON uminp.

f3088d5

Post rebase fixes.

1232bcc

FinnWilkinson dismissed stale reviews from dANW34V3R and ABenC377 via 1232bcc December 20, 2024 11:28

FinnWilkinson force-pushed the additional-sme-support branch from c0b2316 to 1232bcc Compare December 20, 2024 11:28

ABenC377 approved these changes Dec 20, 2024
