Loop-unrolled transposed [SD]GEMV kernels for A64FX and Neoverse V1 #4996
Hi @iha-taisei, these graphs look positive 😸 Can we delete Also, can you run the reproducer from #4324 to check if this produces the correct results?
Ah, thanks for checking, I agree if the numbers are the same then this is a good improvement. We can decide whether to delete the other file later.
I think keeping the old version has the advantage that one can quickly switch implementations in the KERNEL file without having to dig through git history first...
Hi @Mousius, @martin
gemv_t_sve.c was improved after my pull request, so I don't know how it will be used in the future. It doesn't matter if it is deleted or not.
These are the results of running the "Reproduction Code" shown in #4324 on A64FX and Neoverse V1. DDOT and DGEMV give different results, but I think we should also consider a loop-unrolling optimization for DDOT.
A64FX:
NEOVERSE V1:
Thank you for showing the results of the reproducer from #4324. Interesting to see that it shows slightly different results from my tests with the NEOVERSEV1 kernel - in particular having a non-zero value for the last vector element (even if it does not match the result of the naive multiplication loop). At the very least this should confirm my interpretation that it is an FMA-induced difference rather than some part of the algorithm forgetting to update this element.
Thank you for confirming my results.
Sorry for being a bit cryptic - my results were indeed from running the reproducer with the NEOVERSEV1 target on either actual AWS Graviton3 hardware or a Cortex X2 equipped phone (Pixel8pro). Results for the latter are
(with the results for an actual NeoverseV1 not fundamentally different if I remember correctly - definitely having the last vector element show as either exact zero or an infinitesimal value much smaller than the desired -4e-16)
This pull request proposes a patch for issue #4989.
Loop-unrolling of kernels for transposed [SD]GEMV is implemented.
The graphs below show an average performance improvement of 2.3x on A64FX and 1.2x on Neoverse V1 compared with v0.3.28.
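For readers unfamiliar with the technique, the following is a minimal scalar sketch of loop unrolling in a transposed GEMV, written in plain C. It is not the SVE kernel from this pull request: the function name `gemv_t_unrolled4`, the unroll factor of 4, and the column-major layout with unit strides are all assumptions made for illustration. The idea is that each `x[i]` is loaded once and reused for several columns of A, with one independent accumulator per column.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine: y += alpha * A^T * x.
 * A is m-by-n, column-major, with leading dimension lda (assumed layout). */
void gemv_t_unrolled4(size_t m, size_t n, double alpha,
                      const double *a, size_t lda,
                      const double *x, double *y)
{
    assert(lda >= m);
    size_t j = 0;

    /* Main loop: process four columns per pass. */
    for (; j + 4 <= n; j += 4) {
        const double *a0 = a + (j + 0) * lda;
        const double *a1 = a + (j + 1) * lda;
        const double *a2 = a + (j + 2) * lda;
        const double *a3 = a + (j + 3) * lda;
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;

        for (size_t i = 0; i < m; i++) {
            const double xi = x[i];   /* loaded once, used four times */
            s0 += a0[i] * xi;
            s1 += a1[i] * xi;
            s2 += a2[i] * xi;
            s3 += a3[i] * xi;
        }
        y[j + 0] += alpha * s0;
        y[j + 1] += alpha * s1;
        y[j + 2] += alpha * s2;
        y[j + 3] += alpha * s3;
    }

    /* Remainder loop: leftover columns when n is not a multiple of 4. */
    for (; j < n; j++) {
        const double *aj = a + j * lda;
        double s = 0.0;
        for (size_t i = 0; i < m; i++)
            s += aj[i] * x[i];
        y[j] += alpha * s;
    }
}
```

The four accumulators are independent of each other, which gives the hardware more multiply-add work to overlap per pass over `x`. How much this helps on a given core depends on the actual kernel and is not something this sketch can show.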