Loop-unrolled transposed [SD]GEMV kernels for A64FX and Neoverse V1 #4996

iha-taisei · 2024-12-02T10:42:32Z

This pull request proposes a patch for issue #4989.
Loop-unrolling of kernels for transposed [SD]GEMV is implemented.
The graphs below show performance improvement of 2.3x on A64FX and 1.2x on Neoverse V1, compared with v0.3.28 on average.

Mousius · 2024-12-02T12:08:26Z

Hi @iha-taisei, these graphs look positive 😸

Can we delete gemv_t_sve.c now that we have two better-optimized variants?

Also, can you run the reproducer from #4324 to check if this produces the correct results?

martin-frbg · 2024-12-05T20:53:11Z

@Mousius same result for #4324 at least on my Pixel8 (last element zero instead of 1e-16), so (un)rolling is neither the problem nor the solution for that issue. Don't see it as a reason to reject the nice performance gains from this PR though...

Mousius · 2024-12-05T21:28:20Z

@Mousius same result for #4324 at least on my Pixel8 (last element zero instead of 1e-16), so (un)rolling is neither the problem nor the solution for that issue. Don't see it as a reason to reject the nice performance gains from this PR though...

Ah, thanks for checking, I agree if the numbers are the same then this is a good improvement. Can decide to delete the other file later.

martin-frbg · 2024-12-05T21:36:10Z

I think keeping the old version has the advantage that one can quickly switch implementations in the KERNEL file without having to dig though git history first...

iha-taisei · 2024-12-12T06:23:44Z

Hi @Mousius, @martin
Thank you for confirming this pull request.

Can we delete gemv_t_sve.c now that we have two better-optimized variants?

gemv_t_sve.c was improved after my pull request, so I don't know how it will be used in the future. It doesn't matter if it is deleted or not.

Also, can you run the reproducer from #4324 to check if this produces the correct results?

This is the result of running A64FX and Neoverse V1 for the "Reproduction Code" shown in #4324. DDOT and DGEMV have different results, but I think we should also consider DDOT's loop-unrolling optimization.

A64FX:

 BLAS                         MAT
1.175201193643801378e+00        1.175201193643801822e+00
1.103638323514327002e+00        1.103638323514327224e+00
3.578143506473726032e-01        3.578143506473724922e-01
7.045563366848872633e-02        7.045563366848883735e-02
9.965128148869295543e-03        9.965128148869309421e-03
1.099586127207341502e-03        1.099586127207556390e-03
9.945433911356937884e-05        9.945433911360671605e-05
7.620541308761552557e-06        7.620541308896932986e-06
5.064719744707346649e-07        5.064719744437437483e-07
2.971814139218764694e-08        2.971814140421127963e-08
1.560886697671293177e-09        1.560886564391797127e-09
7.419931336016816203e-11        7.419902813631537848e-11
3.221423128252354218e-12        3.221406076314455686e-12
1.290079154614431900e-13        1.289813102624490508e-13
4.551914400963141816e-15        4.440892098500626162e-15
-3.885780586188047891e-16       -4.440892098500626162e-16
ddot: -3.053113317719180486e-16
dgemv: -3.885780586188047891e-16
ddot == dgemv? NO
dgemv is 0.0? NO

NEOVERSE V1:

 BLAS                         MAT
1.175201193643801600e+00        1.175201193643801822e+00
1.103638323514327002e+00        1.103638323514327224e+00
3.578143506473725477e-01        3.578143506473724922e-01
7.045563366848892062e-02        7.045563366848883735e-02
9.965128148869406566e-03        9.965128148869309421e-03
1.099586127207563546e-03        1.099586127207556390e-03
9.945433911373591229e-05        9.945433911360671605e-05
7.620541309094619464e-06        7.620541308896932986e-06
5.064719743597123625e-07        5.064719744437437483e-07
2.971814122565419325e-08        2.971814140421127963e-08
1.560886628282354138e-09        1.560886564391797127e-09
7.419923009344131515e-11        7.419902813631537848e-11
3.221256594798660444e-12        3.221406076314455686e-12
1.290079154614431900e-13        1.289813102624490508e-13
4.440892098500626162e-15        4.440892098500626162e-15
-2.220446049250313081e-16       -4.440892098500626162e-16
ddot: -3.885780586188047891e-16
dgemv: -2.220446049250313081e-16
ddot == dgemv? NO
dgemv is 0.0? NO

martin-frbg · 2024-12-12T08:58:04Z

Thank you for showing the results of the reproducer from 4324. Interesting to see that it shows slightly different results from my tests with the NEOVERSEV1 kernel - in particular having a non-zero value for the last vector element (even if it is does not match the result of the naive multiplication loop). At the very least this should confirm my interpretation that it is a FMA-induced difference rather than some part of the algorithm forgetting to update this element.

iha-taisei · 2024-12-26T11:44:34Z

Thank you for confirming my results.
I think you are comparing with the results of Neoverse V1. Could you tell me your operating environment and the results of the reproducer from #4324?

martin-frbg · 2024-12-27T21:06:50Z

Sorry for being a bit cryptic - my results were indeed from running the reproducer with the NEOVERSEV1 target on either actual AWS Graviton3 hardware or a Cortex X2 equipped phone (Pixel8pro). Results for the latter are

BLAS                         MAT                 1.17520119364380138     1.17520119364380204
1.103638323514327       1.10363832351432722
0.357814350647372548    0.357814350647372381
0.0704556336684889206   0.0704556336684888235
0.00996512814886942391  0.00996512814886940657
0.00109958612720750804  0.00109958612720744407
9.94543391135901955e-05 9.94543391136320457e-05
7.62054130892808601e-06 7.62054130889501869e-06
5.06471974359712362e-07 5.06471974322999519e-07
2.97181412534097689e-08 2.97181411972819313e-08
1.56088661440456633e-09 1.560886608374941e-09
7.4199313360168162e-11  7.4199280392763292e-11
3.22142312825235422e-12 3.22124132100942957e-12             1.28785870856518159e-13 1.28947776208071182e-13
4.44089209850062616e-15 4.40108483686630767e-15
0       -4.02512398736335814e-16
ddot: -2.220446049250313081e-16
dgemv: 0.000000000000000000e+00                             ddot == dgemv? NO
dgemv is 0.0? YES

(with the results for an actual NeoverseV1 not fundamentally different if I remember correctly - definitely having the last vector element show as either exact zero or an infinitesimal value much smaller than the desired -4e-16)

Loop-unrolled transposed [SD]GEMV kernels for A64FX and Neoverse V1

4918bee

Mousius approved these changes Dec 5, 2024

View reviewed changes

martin-frbg added this to the 0.3.29 milestone Dec 5, 2024

martin-frbg merged commit 230e665 into OpenMathLib:develop Dec 5, 2024
83 checks passed

martin-frbg mentioned this pull request Dec 5, 2024

Performance improvement of transposed [SD]GEMV on A64FX and Neoverse V1. #4989

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Loop-unrolled transposed [SD]GEMV kernels for A64FX and Neoverse V1 #4996

Loop-unrolled transposed [SD]GEMV kernels for A64FX and Neoverse V1 #4996

iha-taisei commented Dec 2, 2024

Mousius commented Dec 2, 2024

martin-frbg commented Dec 5, 2024 •

edited

Loading

Mousius commented Dec 5, 2024

martin-frbg commented Dec 5, 2024

iha-taisei commented Dec 12, 2024

martin-frbg commented Dec 12, 2024

iha-taisei commented Dec 26, 2024

martin-frbg commented Dec 27, 2024

Loop-unrolled transposed [SD]GEMV kernels for A64FX and Neoverse V1 #4996

Loop-unrolled transposed [SD]GEMV kernels for A64FX and Neoverse V1 #4996

Conversation

iha-taisei commented Dec 2, 2024

Mousius commented Dec 2, 2024

martin-frbg commented Dec 5, 2024 • edited Loading

Mousius commented Dec 5, 2024

martin-frbg commented Dec 5, 2024

iha-taisei commented Dec 12, 2024

martin-frbg commented Dec 12, 2024

iha-taisei commented Dec 26, 2024

martin-frbg commented Dec 27, 2024

martin-frbg commented Dec 5, 2024 •

edited

Loading