Full SME(1) instruction support and STREAMING Groups #415

Open
wants to merge 48 commits into base: dev

Contributor

@FinnWilkinson FinnWilkinson commented Jun 12, 2024

This PR implements all available SME (version 1) instructions that are contained within LLVM 14.0.5. Specifically, this is Version 2021-06 of the Armv9-A A64 ISA.

FP16 and BF16 instructions are not supported because C++17 provides no corresponding native types. All quad-word instruction variants are emulated using 64-bit data types.
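
As an illustration of the quad-word emulation mentioned above, the following is a minimal sketch only, not code from this PR. It assumes each 128-bit element is stored as two adjacent `uint64_t` lanes; the function name `copyActiveQuadWords` and the per-element `active` flags are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: copies every active 128-bit element from src to dst.
// Each quad-word element q is represented by the two 64-bit lanes 2q and 2q+1,
// so no native 128-bit type is needed.
void copyActiveQuadWords(const uint64_t* src, uint64_t* dst,
                         const std::vector<bool>& active) {
  for (std::size_t q = 0; q < active.size(); q++) {
    if (!active[q]) continue;       // inactive elements keep their old value
    dst[2 * q] = src[2 * q];        // low 64 bits of element q
    dst[2 * q + 1] = src[2 * q + 1];  // high 64 bits of element q
  }
}
```

Any operation that only moves or selects whole elements can be expressed this way; arithmetic on 128-bit values would need extra carry handling that this sketch does not cover.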

In addition, new STREAMING_SVE and STREAMING_PREDICATE groups have been introduced (along with corresponding decode logic) so that these instructions can be given a different pipeline / latency configuration when SVE Streaming Mode (the context mode in which SME instructions are executed) is enabled. This allows a co-processor style implementation of SME to be modelled within SimEng: additional latency / reduced throughput can be configured to mimic an offload penalty, and different execution or LD/STR hardware can be modelled for the co-processor than for the main core.

  • Add STREAMING Group support
  • Add execution logic and regression tests for all missing SME instructions
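
The group selection described above could be sketched roughly as follows. This is an assumption-laden illustration, not SimEng's decode logic: the `Group` enum, `selectGroup`, `latencyFor`, and the `offloadPenalty` parameter are all hypothetical names; only STREAMING_SVE and STREAMING_PREDICATE come from the PR description.

```cpp
#include <cstdint>

// Hypothetical, reduced set of instruction groups.
enum class Group : uint8_t {
  SVE,
  PREDICATE,
  STREAMING_SVE,
  STREAMING_PREDICATE,
  OTHER
};

// Remap a decoded group to its streaming counterpart when Streaming Mode is on.
constexpr Group selectGroup(Group base, bool streamingModeEnabled) {
  if (!streamingModeEnabled) return base;
  switch (base) {
    case Group::SVE:
      return Group::STREAMING_SVE;
    case Group::PREDICATE:
      return Group::STREAMING_PREDICATE;
    default:
      return base;  // groups without a streaming variant are left unchanged
  }
}

// Add a configurable penalty to streaming groups to mimic a co-processor offload.
constexpr uint16_t latencyFor(Group g, uint16_t baseLatency,
                              uint16_t offloadPenalty) {
  const bool streaming =
      g == Group::STREAMING_SVE || g == Group::STREAMING_PREDICATE;
  return static_cast<uint16_t>(streaming ? baseLatency + offloadPenalty
                                         : baseLatency);
}
```

Keeping the remapping in one place means a configuration file only has to assign ports and latencies to the streaming groups to model the separate hardware.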

@FinnWilkinson FinnWilkinson added the enhancement (New feature or request) and 0.9.7 (Part of SimEng Release 0.9.7) labels Jun 12, 2024
@FinnWilkinson FinnWilkinson self-assigned this Jun 12, 2024
@FinnWilkinson
Contributor Author

#rerun tests

@FinnWilkinson FinnWilkinson force-pushed the additional-sme-support branch from 531ebd0 to 7974237 Compare August 9, 2024 15:58
@FinnWilkinson FinnWilkinson marked this pull request as ready for review August 28, 2024 13:41
src/include/simeng/arch/aarch64/Architecture.hh (outdated, resolved)
src/lib/arch/aarch64/Architecture.cc (outdated, resolved)
src/include/simeng/arch/aarch64/InstructionGroups.hh (outdated, resolved)
src/lib/arch/aarch64/Instruction_address.cc (outdated, resolved)
src/lib/arch/aarch64/Instruction_execute.cc (outdated, resolved)
test/regression/aarch64/instructions/sme.cc (resolved)
@FinnWilkinson FinnWilkinson changed the title [WIP] Full SME(1) instruction support and STREAMING Groups Full SME(1) instruction support and STREAMING Groups Sep 2, 2024
Contributor

@jj16791 jj16791 left a comment

Some comments, and I agree with several of Alex's comments. I think it would be good to add the ARM SME/SVE loops to our functional verification checks to help test these new instructions. I assume that would have to be done somewhere private, though (I'm not sure whether we already have that guarantee in the upcoming CI/CD pipelines)?

CMakeLists.txt (outdated, resolved)
configs/a64fx_SME.yaml (outdated, resolved)
docs/sphinx/assets/instruction_groups_AArch64.png (outdated, resolved)
src/lib/arch/aarch64/Instruction_address.cc (outdated, resolved)
src/lib/arch/aarch64/Instruction.cc (outdated, resolved)
src/lib/arch/aarch64/Architecture.cc (outdated, resolved)
src/lib/arch/aarch64/Instruction_decode.cc (outdated, resolved)
src/lib/arch/aarch64/Instruction_execute.cc (outdated, resolved)
test/regression/aarch64/instructions/sme.cc (resolved)
Contributor

@dANW34V3R dANW34V3R left a comment

Haven't finished the review but posting comments to prevent overlaps

CMakeLists.txt (outdated, resolved)
src/include/simeng/arch/aarch64/Architecture.hh (outdated, resolved)
src/lib/arch/aarch64/Instruction_decode.cc (outdated, resolved)
src/include/simeng/Register.hh (resolved)
configs/a64fx_SME.yaml (outdated, resolved)
Contributor

@dANW34V3R dANW34V3R left a comment

LOTS of new instructions, well done for grinding through them. Bring on SAIL.

src/include/simeng/arch/aarch64/Instruction.hh (outdated, resolved)
src/lib/arch/riscv/Instruction_decode.cc (outdated, resolved)
src/lib/arch/aarch64/Instruction_execute.cc (resolved)
src/lib/arch/aarch64/Instruction_execute.cc (resolved)
src/lib/arch/aarch64/Instruction_execute.cc (resolved)
src/lib/arch/aarch64/Instruction_execute.cc (outdated, resolved)
ABenC377 previously approved these changes Nov 11, 2024
@FinnWilkinson
Contributor Author

FinnWilkinson commented Dec 3, 2024

Now outdated: the STREAMING groups logic, which was the only cause of the slowdown, has been removed.
See below for this PR's performance compared to dev (times averaged over 5 runs):

| Benchmark | dev Time (ms) | dev StdDev | This PR Time (ms) | % diff to dev | This PR StdDev |
| --- | --- | --- | --- | --- | --- |
| CloverLeaf serial gcc8.3.0 armv8.4 | 13194.4 | 60.1 | 13557.0 | 2.71% | 132.52 |
| CloverLeaf serial gcc9.3.0 armv8.4 | 13050.6 | 102.7 | 13580.2 | 3.98% | 84.94 |
| CloverLeaf serial gcc10.3.0 armv8.4 | 13290.4 | 47.9 | 13623.0 | 2.47% | 44.06 |
| CloverLeaf serial armclang20 armv8.4 | 11804.4 | 39.1 | 12343.2 | 4.46% | 77.05 |
| CloverLeaf openmp gcc8.3.0 armv8.4 | 17509.4 | 161.5 | 17889.8 | 2.15% | 65.83 |
| CloverLeaf openmp gcc9.3.0 armv8.4 | 17584.4 | 182.0 | 17995.2 | 2.31% | 152.27 |
| CloverLeaf openmp gcc10.3.0 armv8.4 | 17119.8 | 61.3 | 17651.4 | 3.06% | 79.05 |
| CloverLeaf openmp armclang20 armv8.4 | 15820.8 | 95.4 | 16211.0 | 2.44% | 83.58 |
| miniBUDE openmp gcc8.3.0 armv8.4 | 24691.2 | 52.3 | 24505.6 | -0.75% | 276.93 |
| miniBUDE openmp gcc9.3.0 armv8.4 | 24500.0 | 175.6 | 24412.8 | -0.36% | 155.77 |
| miniBUDE openmp gcc10.3.0 armv8.4 | 24438.0 | 146.7 | 24260.6 | -0.73% | 77.47 |
| miniBUDE openmp armclang20 armv8.4 | 22725.2 | 150.0 | 22343.4 | -1.69% | 67.39 |
| STREAM serial gcc8.3.0 armv8.4 | 7378.0 | 40.3 | 7769.8 | 5.17% | 29.84 |
| STREAM serial gcc9.3.0 armv8.4 | 7380.4 | 48.6 | 7722.6 | 4.53% | 68.62 |
| STREAM serial gcc10.3.0 armv8.4 | 7530.6 | 71.7 | 7632.6 | 1.35% | 39.53 |
| STREAM serial armclang20 armv8.4 | 8948.0 | 70.6 | 8317.4 | -7.30% | 36.88 |
| STREAM openmp gcc8.3.0 armv8.4 | 11552.6 | 139.5 | 12020.4 | 3.97% | 111.61 |
| STREAM openmp gcc9.3.0 armv8.4 | 11737.0 | 133.1 | 11855.8 | 1.01% | 48.96 |
| STREAM openmp gcc10.3.0 armv8.4 | 11357.4 | 36.4 | 11768.0 | 3.55% | 95.17 |
| STREAM openmp armclang20 armv8.4 | 12701.0 | 227.5 | 12309.0 | -3.13% | 87.32 |
| TeaLeaf 2D serial gcc8.3.0 armv8.4 | 13964.4 | 41.8 | 13605.8 | -2.60% | 42.13 |
| TeaLeaf 2D serial gcc9.3.0 armv8.4 | 13976.2 | 40.8 | 13553.6 | -3.07% | 88.90 |
| TeaLeaf 2D serial gcc10.3.0 armv8.4 | 14231.0 | 92.2 | 13961.2 | -1.91% | 109.20 |
| TeaLeaf 2D serial armclang20 armv8.4 | 25691.8 | 86.2 | 24628.8 | -4.22% | 199.33 |
| TeaLeaf 2D openmp gcc8.3.0 armv8.4 | 20085.2 | 88.6 | 20070.4 | -0.07% | 110.76 |
| TeaLeaf 2D openmp gcc9.3.0 armv8.4 | 19980.2 | 79.3 | 20492.8 | 2.53% | 146.48 |
| TeaLeaf 2D openmp gcc10.3.0 armv8.4 | 19684.8 | 88.1 | 19522.4 | -0.83% | 100.20 |
| TeaLeaf 2D openmp armclang20 armv8.4 | 58068.6 | 251.6 | 61880.2 | 6.36% | 284.36 |
| TeaLeaf 3D serial gcc8.3.0 armv8.4 | 15853.0 | 128.6 | 15818.2 | -0.22% | 57.76 |
| TeaLeaf 3D serial gcc9.3.0 armv8.4 | 16483.8 | 58.3 | 16334.6 | -0.91% | 87.93 |
| TeaLeaf 3D serial gcc10.3.0 armv8.4 | 16839.8 | 86.0 | 16521.4 | -1.91% | 28.94 |
| TeaLeaf 3D serial armclang20 armv8.4 | 23052.2 | 157.0 | 22959.8 | -0.40% | 134.67 |
| TeaLeaf 3D openmp gcc8.3.0 armv8.4 | 26103.0 | 145.5 | 26294.8 | 0.73% | 190.12 |
| TeaLeaf 3D openmp gcc9.3.0 armv8.4 | 26203.6 | 103.0 | 27278.8 | 4.02% | 239.28 |
| TeaLeaf 3D openmp gcc10.3.0 armv8.4 | 26068.2 | 278.0 | 26129.6 | 0.24% | 112.81 |
| TeaLeaf 3D openmp armclang20 armv8.4 | 45312.4 | 179.0 | 48379.4 | 6.55% | 136.36 |
| CloverLeaf serial gcc8.3.0 armv8.4+sve | 12763.0 | 89.1 | 13372.0 | 4.66% | 59.14 |
| CloverLeaf serial gcc9.3.0 armv8.4+sve | 12675.4 | 52.4 | 13300.4 | 4.81% | 134.66 |
| CloverLeaf serial gcc10.3.0 armv8.4+sve | 12665.4 | 88.7 | 13086.4 | 3.27% | 63.11 |
| CloverLeaf serial armclang20 armv8.4+sve | 12512.8 | 79.5 | 12963.4 | 3.54% | 71.92 |
| CloverLeaf openmp gcc8.3.0 armv8.4+sve | 16973.8 | 119.5 | 17630.2 | 3.79% | 197.66 |
| CloverLeaf openmp gcc9.3.0 armv8.4+sve | 17076.6 | 132.9 | 17460.8 | 2.22% | 53.09 |
| CloverLeaf openmp gcc10.3.0 armv8.4+sve | 16814.4 | 96.4 | 17264.4 | 2.64% | 76.24 |
| CloverLeaf openmp armclang20 armv8.4+sve | 16436.8 | 82.2 | 16844.2 | 2.45% | 98.85 |
| miniBUDE openmp gcc8.3.0 armv8.4+sve | 9745.6 | 125.8 | 10291.4 | 5.45% | 90.47 |
| miniBUDE openmp gcc9.3.0 armv8.4+sve | 9172.0 | 41.3 | 10081.6 | 9.45% | 64.37 |
| miniBUDE openmp gcc10.3.0 armv8.4+sve | 9180.0 | 36.6 | 10054.0 | 9.09% | 61.30 |
| miniBUDE openmp armclang20 armv8.4+sve | 9746.6 | 63.0 | 10098.8 | 3.55% | 85.55 |
| STREAM serial gcc8.3.0 armv8.4+sve | 3915.0 | 18.9 | 4139.4 | 5.57% | 15.92 |
| STREAM serial gcc9.3.0 armv8.4+sve | 3919.4 | 16.7 | 4139.2 | 5.46% | 18.14 |
| STREAM serial gcc10.3.0 armv8.4+sve | 3862.0 | 29.9 | 4086.2 | 5.64% | 23.04 |
| STREAM serial armclang20 armv8.4+sve | 2550.2 | 3.7 | 2593.4 | 1.68% | 17.33 |
| STREAM openmp gcc8.3.0 armv8.4+sve | 7977.4 | 32.4 | 8196.2 | 2.71% | 38.70 |
| STREAM openmp gcc9.3.0 armv8.4+sve | 7987.4 | 87.9 | 8265.6 | 3.42% | 12.76 |
| STREAM openmp gcc10.3.0 armv8.4+sve | 7999.2 | 69.2 | 8051.0 | 0.65% | 34.07 |
| STREAM openmp armclang20 armv8.4+sve | 6836.0 | 10.0 | 6990.8 | 2.24% | 35.39 |
| TeaLeaf 2D serial gcc8.3.0 armv8.4+sve | 14022.8 | 99.5 | 13579.0 | -3.22% | 59.23 |
| TeaLeaf 2D serial gcc9.3.0 armv8.4+sve | 13996.4 | 63.8 | 13610.4 | -2.80% | 64.07 |
| TeaLeaf 2D serial gcc10.3.0 armv8.4+sve | 14362.6 | 59.8 | 13831.0 | -3.77% | 65.83 |
| TeaLeaf 2D serial armclang20 armv8.4+sve | 9835.2 | 75.5 | 9782.0 | -0.54% | 113.49 |
| TeaLeaf 2D openmp gcc8.3.0 armv8.4+sve | 19885.8 | 62.1 | 20026.2 | 0.70% | 69.48 |
| TeaLeaf 2D openmp gcc9.3.0 armv8.4+sve | 20028.2 | 143.4 | 20322.0 | 1.46% | 111.62 |
| TeaLeaf 2D openmp gcc10.3.0 armv8.4+sve | 19695.6 | 83.6 | 19575.4 | -0.61% | 38.66 |
| TeaLeaf 2D openmp armclang20 armv8.4+sve | 57176.4 | 405.7 | 59327.4 | 3.69% | 324.77 |
| TeaLeaf 3D serial gcc8.3.0 armv8.4+sve | 13828.8 | 50.1 | 14023.6 | 1.40% | 51.09 |
| TeaLeaf 3D serial gcc9.3.0 armv8.4+sve | 13901.6 | 36.4 | 14065.6 | 1.17% | 35.20 |
| TeaLeaf 3D serial gcc10.3.0 armv8.4+sve | 14043.8 | 58.0 | 14203.2 | 1.13% | 103.32 |
| TeaLeaf 3D serial armclang20 armv8.4+sve | 22478.8 | 138.4 | 22850.6 | 1.64% | 51.21 |
| TeaLeaf 3D openmp gcc8.3.0 armv8.4+sve | 23927.6 | 73.3 | 24201.0 | 1.14% | 94.22 |
| TeaLeaf 3D openmp gcc9.3.0 armv8.4+sve | 23638.8 | 119.3 | 24663.4 | 4.24% | 138.94 |
| TeaLeaf 3D openmp gcc10.3.0 armv8.4+sve | 23550.4 | 130.2 | 24060.4 | 2.14% | 31.31 |
| TeaLeaf 3D openmp armclang20 armv8.4+sve | 48104.8 | 253.4 | 50293.2 | 4.45% | 319.78 |

dANW34V3R previously approved these changes Dec 4, 2024
@FinnWilkinson FinnWilkinson dismissed stale reviews from dANW34V3R and ABenC377 via 0af7abc December 5, 2024 13:35
ABenC377 previously approved these changes Dec 5, 2024
Contributor

@jj16791 jj16791 left a comment

Looks good, just need to finish resolving the existing conversations for the sake of clarity.

@@ -585,9 +590,14 @@ RegisterValue vecUMinP(srcValContainer& sourceValues) {
const T* n = sourceValues[0].getAsVector<T>();
const T* m = sourceValues[1].getAsVector<T>();

// Concatenate the vectors
T temp[2 * I];
memcpy(temp, m, sizeof(T) * I);
Contributor

Shouldn't m and n be switched here?

Contributor Author

Good spot, only updated maxP to be in-line with HW...
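
For context, the ordering being discussed can be sketched as below. This is an illustrative sketch, not the code merged in this PR: it assumes the pairwise-minimum semantics in which the first source (`n`) supplies the low half of the concatenated vector and the second source (`m`) the high half. The name `pairwiseMin` and the `std::array` return type are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Pairwise minimum over the concatenation {n, m}: result element i is the
// smaller of concatenated elements 2i and 2i+1.
template <typename T, std::size_t I>
std::array<T, I> pairwiseMin(const T* n, const T* m) {
  T temp[2 * I];
  std::memcpy(temp, n, sizeof(T) * I);      // n occupies the low half
  std::memcpy(temp + I, m, sizeof(T) * I);  // m occupies the high half
  std::array<T, I> out{};
  for (std::size_t i = 0; i < I; i++) {
    out[i] = std::min(temp[2 * i], temp[2 * i + 1]);
  }
  return out;
}
```

With this ordering, the low half of the result comes from adjacent pairs of `n` and the high half from adjacent pairs of `m`; copying `m` first would swap the two halves.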

ABenC377 previously approved these changes Dec 17, 2024
dANW34V3R previously approved these changes Dec 18, 2024
…uctions and aliases and regression tests (B, H, S, D)
@FinnWilkinson FinnWilkinson dismissed stale reviews from dANW34V3R and ABenC377 via 1232bcc December 20, 2024 11:28
Labels
0.9.7 (Part of SimEng Release 0.9.7), enhancement (New feature or request)
Projects
Status: Changes Requested
4 participants