Optimize crc32 & crc32c on NVIDIA Grace #2204

krenzland · 2024-05-16T13:30:07Z

This pull request adds hardware-accelerated routines for CRC32 and CRC32C for Arm AArch64 CPUs. The changes here have been tested on NVIDIA Grace.
In detail, it contains:

- Routines for computing CRC32 and CRC32C hashes on a dataset using the CRC intrinsics. On Grace/Neoverse V2, this can process 8 bytes/cycle.
- A vectorized implementation of the gf_multiply_crc32c_hw and gf_multiply_crc32_hw functions used in routines to merge partial CRC checksums. These functions are close to a 1:1 translation of the x86 vectorized routines.
- Feature flags for the AES and SHA extensions on Arm CPUs. The feature checks for the vectorized functions are a bit messier than on x86 because CPUs can implement a subset of these extensions.

This should resolve issue #2027.

facebook-github-bot · 2024-05-16T20:16:36Z

@Orvid has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

meteorfox · 2024-05-17T16:17:17Z

folly/hash/Checksum.cpp

@@ -77,6 +77,10 @@ uint32_t crc32_hw(
 }

 bool crc32c_hw_supported() {
+  return crc32_hw_supported_sse42();


This has a typo: the c after crc32 is missing.

The correct name should be crc32c_hw_supported_sse42.

meteorfox · 2024-05-17T17:51:12Z

folly/hash/detail/Crc32CombineDetail.cpp

@@ -105,7 +111,35 @@ static uint32_t gf_multiply_crc32_hw(uint64_t crc1, uint64_t crc2, uint32_t) {
  return _mm_cvtsi128_si32(_mm_srli_si128(_mm_xor_si128(res3, res1), 4));
 }

-#else
+#elif FOLLY_NEON && FOLLY_ARM_FEATURE_CRC32 && FOLLY_ARM_FEATURE_AES && FOLLY_ARM_FEATURE_SHA2


This line is too long.
Can you resubmit after a clang-format?

Sorry, we should have this automated as part of a git hook or something, not sure why we don't have it.

krenzland · 2024-05-21

Okay. May I suggest adding clang-format to https://github.com/facebook/folly/blob/main/CONTRIBUTING.md?


krenzland · 2024-05-21T12:03:03Z

Thanks for the review! I forgot to add that this should be compiled with the flags
python3 build/fbcode_builder/getdeps.py --allow-system-packages build --extra-cmake-defines '{"CMAKE_CXX_FLAGS": "-march=armv8.5-a+crc+crypto"}'
or similar (+crypto could be replaced by +aes+sha2?) to enable all required features.
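One way to check that a given set of -march flags enables everything the vectorized path needs is to inspect the ACLE feature macros at compile time. The sketch below is illustrative only: the names kHasNeon, kHasCrc32, kHasAes, kHasSha2, kVectorizedCombineUsable, and print_compiled_arm_features are hypothetical, and it assumes that folly's FOLLY_NEON and FOLLY_ARM_FEATURE_* flags track the corresponding ACLE macros (__ARM_NEON, __ARM_FEATURE_CRC32, __ARM_FEATURE_AES, __ARM_FEATURE_SHA2), which this thread does not confirm.

```cpp
#include <cstdio>

// Each constant mirrors one ACLE feature macro. On non-Arm targets
// (or when a feature is not enabled by -march) the value is false.
#if defined(__ARM_NEON)
constexpr bool kHasNeon = true;
#else
constexpr bool kHasNeon = false;
#endif

#if defined(__ARM_FEATURE_CRC32)
constexpr bool kHasCrc32 = true;
#else
constexpr bool kHasCrc32 = false;
#endif

#if defined(__ARM_FEATURE_AES)
constexpr bool kHasAes = true;
#else
constexpr bool kHasAes = false;
#endif

#if defined(__ARM_FEATURE_SHA2)
constexpr bool kHasSha2 = true;
#else
constexpr bool kHasSha2 = false;
#endif

// Hypothetical combined flag, analogous to the four-way preprocessor
// guard on the vectorized gf_multiply routines in this PR.
constexpr bool kVectorizedCombineUsable =
    kHasNeon && kHasCrc32 && kHasAes && kHasSha2;

// Prints which features the current translation unit was compiled with.
inline void print_compiled_arm_features() {
  std::printf("NEON:  %d\n", kHasNeon ? 1 : 0);
  std::printf("CRC32: %d\n", kHasCrc32 ? 1 : 0);
  std::printf("AES:   %d\n", kHasAes ? 1 : 0);
  std::printf("SHA2:  %d\n", kHasSha2 ? 1 : 0);
  std::printf("vectorized combine usable: %d\n",
              kVectorizedCombineUsable ? 1 : 0);
}
```

Compiling this file with the same CMAKE_CXX_FLAGS as the build and calling print_compiled_arm_features() shows whether +crypto (or +aes+sha2) turned on every macro for the compiler in use.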

facebook-github-bot · 2024-05-22T18:27:14Z

@Orvid has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

meteorfox · 2024-07-08T22:21:24Z

@krenzland Hey, after internal discussions, we would like to ask you to move your contributions under folly/external/nvidia-crc32 so that the copyright lines are more clearly defined. You can still define them in the folly namespace.

AFAIU, CMake should pick them up automatically, as we have auto_source with recurse.

Thanks in advance

yfeldblum · 2024-07-26T01:27:35Z

folly/hash/Checksum.cpp

-#else
+#elif FOLLY_ARM_FEATURE_CRC32
+uint32_t crc32_hw(const uint8_t* buf, size_t len, uint32_t crc) {
+  auto* buf_64 = reinterpret_cast<const uint64_t*>(buf);


This looks like unaligned access.

Note that while an architecture may support unaligned access, the language generally deems unaligned access of this form to be undefined behavior, and it is the kind of undefined behavior that may end up being subject to miscompilation.

To avoid this unaligned access, do we need to start off with an optional round of each of __crc32b, __crc32h, __crc32w, and __crc32d? Alternatively, perhaps we can memcpy from buf instead of reinterpret_cast - modern compilers recognize this idiom and lower to mov or ldr instructions without emitting calls to memcpy.

Here and crc32c_hw below.

krenzland · 2024-07-29

Thanks, you're totally right. Unaligned access isn't directly forbidden in C++, but my implementation violates the strict aliasing rule. While this is typically fine when operating on bytes, my code is still incorrect.

The memcpy version is a good idea. It's obviously correct and, at first glance, seems to lead to (nearly) the same code for GCC/clang. I'll test it and update the PR.
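For illustration, the memcpy-based load discussed in this review thread could look roughly like the sketch below. This is not the code that was merged: crc32c_step_u8, crc32c_step_u64, and crc32c_sketch are hypothetical names, the function applies the conventional ~0 initial value and final inversion (which may differ from folly's own conventions), and a little-endian host is assumed. When __ARM_FEATURE_CRC32 is not defined, a bitwise software fallback is used so the sketch compiles on any platform.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>
#endif

// Hypothetical helper: fold one byte into a running CRC32C state.
inline uint32_t crc32c_step_u8(uint32_t crc, uint8_t byte) {
#if defined(__ARM_FEATURE_CRC32)
  return __crc32cb(crc, byte);
#else
  // Portable fallback using the reflected CRC32C polynomial.
  crc ^= byte;
  for (int i = 0; i < 8; ++i) {
    crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  }
  return crc;
#endif
}

// Hypothetical helper: fold eight bytes (held in a 64-bit word that was
// loaded from memory on a little-endian host) into the CRC32C state.
inline uint32_t crc32c_step_u64(uint32_t crc, uint64_t word) {
#if defined(__ARM_FEATURE_CRC32)
  return __crc32cd(crc, word);
#else
  for (int i = 0; i < 8; ++i) {
    crc = crc32c_step_u8(crc, static_cast<uint8_t>(word >> (8 * i)));
  }
  return crc;
#endif
}

// Sketch of a CRC32C routine that avoids reinterpret_cast on the buffer.
inline uint32_t crc32c_sketch(const uint8_t* buf, size_t len) {
  uint32_t crc = ~0u;
  while (len >= sizeof(uint64_t)) {
    uint64_t word;
    // memcpy has no alignment or aliasing requirements; compilers
    // typically lower a fixed-size copy like this to a single load.
    std::memcpy(&word, buf, sizeof(word));
    crc = crc32c_step_u64(crc, word);
    buf += sizeof(word);
    len -= sizeof(word);
  }
  // Remaining 0 to 7 tail bytes are processed one at a time.
  while (len > 0) {
    crc = crc32c_step_u8(crc, *buf++);
    --len;
  }
  return ~crc;
}
```

Because the buffer is only ever read through memcpy or as individual bytes, the caller can pass a pointer with any alignment, including one that starts at an odd offset into a larger buffer.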

facebook-github-bot · 2024-08-20T20:19:50Z

@Orvid has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-08-28T18:20:55Z

@Orvid has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-09-26T17:40:29Z

@r1mikey merged this pull request in 8fc0e33.

Summary:
This pull request adds hardware accelerated routines for CRC32 and CRC32C for Arm AARCH64 CPUs. The changes here have been tested on NVIDIA Grace.
In detail, it contains routines for:
- Computing CRC32 and CRC32C hashes on dataset using the CRC intrinsics. On Grace/Neoverse V2, this can process 8 bytes/cycle.
- A vectorized implementation of the `gf_multiply_crc32c_hw` and `gf_multiply_crc32_hw` functions used in routines to merge partial CRC checksums. These functions are more or less a 1:1 translation of the x86 vectorized routines.
- I've introduced feature flags for AES, and SHA extensions for Arm CPUs. The feature checks for the vectorized functions are a bit more messy than on x86 because CPUs can implement a subset of these extensions.

This should resolve issue facebook/folly#2027.

X-link: facebook/folly#2204

Reviewed By: yfeldblum

Differential Revision: D57456858

Pulled By: r1mikey

fbshipit-source-id: 8ff7be6c7b03bff8cf6df46a76a9a2b5ad8555ef

krenzland added 2 commits May 16, 2024 15:16

Add Arm implementation for CRC32(C)

c9783f4

Add SSE4.2 feature flag for CRC

4499efa

facebook-github-bot added the CLA Signed label May 16, 2024

krenzland changed the title ~~Optimize crc32 & crc32c on Nvidia Grace~~ Optimize crc32 & crc32c on NVIDIA Grace May 16, 2024

meteorfox reviewed May 17, 2024


Address review.

afbaa63

- Fix typo - Clang format all changed files

yfeldblum reviewed Jul 26, 2024


krenzland added 2 commits August 20, 2024 01:26

Merge branch 'main' into optimize-crc32c

c585881

Remove undefined behavior in crc32(c)

4f99658

krenzland force-pushed the optimize-crc32c branch from 81c22f5 to 4f99658 Compare August 20, 2024 11:49

Move NVIDIA contribs to external directory

aacedc0

facebook-github-bot closed this in 8fc0e33 Sep 26, 2024

facebook-github-bot added the Merged label Sep 26, 2024
