Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loop-unrolled transposed [SD]GEMV kernels for A64FX and Neoverse V1 #4996

Merged
merged 1 commit into from
Dec 5, 2024

Conversation

iha-taisei
Copy link
Contributor

This pull request proposes a patch for issue #4989.
Loop-unrolling of kernels for transposed [SD]GEMV is implemented.
The graphs below show performance improvement of 2.3x on A64FX and 1.2x on Neoverse V1, compared with v0.3.28 on average.

20241202_gemv_unroll_A64FX

20241202_gemv_unroll_NEOVERSEV1

@Mousius
Copy link
Contributor

Mousius commented Dec 2, 2024

Hi @iha-taisei, these graphs look positive 😸

Can we delete gemv_t_sve.c now that we have two better-optimized variants?

Also, can you run the reproducer from #4324 to check if this produces the correct results?

@martin-frbg
Copy link
Collaborator

martin-frbg commented Dec 5, 2024

@Mousius same result for #4324 at least on my Pixel8 (last element zero instead of 1e-16), so (un)rolling is neither the problem nor the solution for that issue. Don't see it as a reason to reject the nice performance gains from this PR though...

@Mousius
Copy link
Contributor

Mousius commented Dec 5, 2024

@Mousius same result for #4324 at least on my Pixel8 (last element zero instead of 1e-16), so (un)rolling is neither the problem nor the solution for that issue. Don't see it as a reason to reject the nice performance gains from this PR though...

Ah, thanks for checking, I agree if the numbers are the same then this is a good improvement. Can decide to delete the other file later.

@martin-frbg
Copy link
Collaborator

I think keeping the old version has the advantage that one can quickly switch implementations in the KERNEL file without having to dig though git history first...

@martin-frbg martin-frbg added this to the 0.3.29 milestone Dec 5, 2024
@martin-frbg martin-frbg merged commit 230e665 into OpenMathLib:develop Dec 5, 2024
83 checks passed
@iha-taisei
Copy link
Contributor Author

Hi @Mousius, @martin
Thank you for confirming this pull request.

Can we delete gemv_t_sve.c now that we have two better-optimized variants?

gemv_t_sve.c was improved after my pull request, so I don't know how it will be used in the future. It doesn't matter if it is deleted or not.

Also, can you run the reproducer from #4324 to check if this produces the correct results?

This is the result of running A64FX and Neoverse V1 for the "Reproduction Code" shown in #4324. DDOT and DGEMV have different results, but I think we should also consider DDOT's loop-unrolling optimization.

A64FX:

 BLAS                         MAT
1.175201193643801378e+00        1.175201193643801822e+00
1.103638323514327002e+00        1.103638323514327224e+00
3.578143506473726032e-01        3.578143506473724922e-01
7.045563366848872633e-02        7.045563366848883735e-02
9.965128148869295543e-03        9.965128148869309421e-03
1.099586127207341502e-03        1.099586127207556390e-03
9.945433911356937884e-05        9.945433911360671605e-05
7.620541308761552557e-06        7.620541308896932986e-06
5.064719744707346649e-07        5.064719744437437483e-07
2.971814139218764694e-08        2.971814140421127963e-08
1.560886697671293177e-09        1.560886564391797127e-09
7.419931336016816203e-11        7.419902813631537848e-11
3.221423128252354218e-12        3.221406076314455686e-12
1.290079154614431900e-13        1.289813102624490508e-13
4.551914400963141816e-15        4.440892098500626162e-15
-3.885780586188047891e-16       -4.440892098500626162e-16
ddot: -3.053113317719180486e-16
dgemv: -3.885780586188047891e-16
ddot == dgemv? NO
dgemv is 0.0? NO

NEOVERSE V1:

 BLAS                         MAT
1.175201193643801600e+00        1.175201193643801822e+00
1.103638323514327002e+00        1.103638323514327224e+00
3.578143506473725477e-01        3.578143506473724922e-01
7.045563366848892062e-02        7.045563366848883735e-02
9.965128148869406566e-03        9.965128148869309421e-03
1.099586127207563546e-03        1.099586127207556390e-03
9.945433911373591229e-05        9.945433911360671605e-05
7.620541309094619464e-06        7.620541308896932986e-06
5.064719743597123625e-07        5.064719744437437483e-07
2.971814122565419325e-08        2.971814140421127963e-08
1.560886628282354138e-09        1.560886564391797127e-09
7.419923009344131515e-11        7.419902813631537848e-11
3.221256594798660444e-12        3.221406076314455686e-12
1.290079154614431900e-13        1.289813102624490508e-13
4.440892098500626162e-15        4.440892098500626162e-15
-2.220446049250313081e-16       -4.440892098500626162e-16
ddot: -3.885780586188047891e-16
dgemv: -2.220446049250313081e-16
ddot == dgemv? NO
dgemv is 0.0? NO

@martin-frbg
Copy link
Collaborator

Thank you for showing the results of the reproducer from 4324. Interesting to see that it shows slightly different results from my tests with the NEOVERSEV1 kernel - in particular having a non-zero value for the last vector element (even if it is does not match the result of the naive multiplication loop). At the very least this should confirm my interpretation that it is a FMA-induced difference rather than some part of the algorithm forgetting to update this element.

@iha-taisei
Copy link
Contributor Author

Thank you for confirming my results.
I think you are comparing with the results of Neoverse V1. Could you tell me your operating environment and the results of the reproducer from #4324?

@martin-frbg
Copy link
Collaborator

Sorry for being a bit cryptic - my results were indeed from running the reproducer with the NEOVERSEV1 target on either actual AWS Graviton3 hardware or a Cortex X2 equipped phone (Pixel8pro). Results for the latter are

BLAS                         MAT                 1.17520119364380138     1.17520119364380204
1.103638323514327       1.10363832351432722
0.357814350647372548    0.357814350647372381
0.0704556336684889206   0.0704556336684888235
0.00996512814886942391  0.00996512814886940657
0.00109958612720750804  0.00109958612720744407
9.94543391135901955e-05 9.94543391136320457e-05
7.62054130892808601e-06 7.62054130889501869e-06
5.06471974359712362e-07 5.06471974322999519e-07
2.97181412534097689e-08 2.97181411972819313e-08
1.56088661440456633e-09 1.560886608374941e-09
7.4199313360168162e-11  7.4199280392763292e-11
3.22142312825235422e-12 3.22124132100942957e-12             1.28785870856518159e-13 1.28947776208071182e-13
4.44089209850062616e-15 4.40108483686630767e-15
0       -4.02512398736335814e-16
ddot: -2.220446049250313081e-16
dgemv: 0.000000000000000000e+00                             ddot == dgemv? NO
dgemv is 0.0? YES

(with the results for an actual NeoverseV1 not fundamentally different if I remember correctly - definitely having the last vector element show as either exact zero or an infinitesimal value much smaller than the desired -4e-16)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants