Concurrent cuckoo hash table implemented in CUDA
```sh
# Install dependencies
sudo apt install meson ninja-build nvidia-cuda-toolkit
# Build
meson setup build
meson compile -C build
# Run
build/chash
```
Initially, we follow Alcantara's advice and use the hash function `(c0 + c1 * k) mod 1900813`, where `c0` and `c1` are two constants generated at initialization for each subtable. However, experiment 1 shows that this is not an ideal choice: with the capacity set to 2^25 and a load factor of about 0.3, the straightforward implementation of cuckoo hashing rehashes indefinitely and never terminates.
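For reference, here is a minimal sketch of this hash family; the final reduction into a subtable of `capacity` slots and the function name are our own illustration, not code taken verbatim from the repository:

```cuda
// Alcantara-style hash: (c0 + c1 * k) mod p with p = 1900813, then reduced
// into the subtable. c0 and c1 are regenerated for each subtable on rehash.
__device__ __forceinline__ unsigned linear_hash(unsigned c0, unsigned c1,
                                                unsigned k, unsigned capacity)
{
    return ((c0 + c1 * k) % 1900813u) % capacity;
}
```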
After some time researching online and comparing the merits and demerits of several hash functions, we settled on xxHash32. Benchmarks performed by Jarzynski et al. suggest that xxHash32 strikes a good balance between runtime performance and collision rate. Switching to xxHash32 proved to be a wise choice, bringing a substantial decrease in collision count at negligible runtime overhead; the maximum load factor that the straightforward table can sustain is now over 0.5. Another benefit of xxHash32 is that it can be seeded and requires only a single u32 seed, which allows us to generate a different hash function for each subtable very efficiently.
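For a single u32 key, seeded xxHash32 specializes to a short sequence of multiplies, rotates, and shifts. Below is a sketch of that specialization; the constants are the published xxHash32 primes, while the function name is ours:

```cuda
#include <cstdint>

// xxHash32 specialized to one 4-byte key, seeded per subtable.
__device__ __forceinline__ uint32_t xxhash32_u32(uint32_t key, uint32_t seed)
{
    const uint32_t P2 = 2246822519u, P3 = 3266489917u;
    const uint32_t P4 = 668265263u,  P5 = 374761393u;
    uint32_t h = seed + P5 + 4u;        // 4 = input length in bytes
    h += key * P3;
    h = ((h << 17) | (h >> 15)) * P4;   // rotl(h, 17), then multiply
    h ^= h >> 15; h *= P2;              // final avalanche
    h ^= h >> 13; h *= P3;
    h ^= h >> 16;
    return h;
}
```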
In the beginning, we used the same hash function as the hash table, xxHash32, to generate random integers. We also tried CUDA's default random number generator, the cuRAND library; unfortunately, it only supports generating floating-point numbers and is not suitable for our use case. To keep our reported numbers reproducible, our final implementation uses the `rand()` function from `<stdlib.h>`. Random integers are generated and stored in host memory and later copied to device memory with `cudaMemcpy`. Duplicates are removed using `std::set`.
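A sketch of this host-side pipeline follows; how `rand()` is widened to cover the full 32-bit range is our assumption, not a detail taken from the source:

```cuda
#include <cstdint>
#include <cstdlib>
#include <set>
#include <vector>
#include <cuda_runtime.h>

// Generate n distinct 32-bit keys on the host and copy them to the device.
uint32_t *generate_keys(size_t n, unsigned seed)
{
    srand(seed);
    std::set<uint32_t> unique;      // std::set removes duplicates
    while (unique.size() < n)
        unique.insert(((uint32_t)rand() << 16) ^ (uint32_t)rand());
    std::vector<uint32_t> keys(unique.begin(), unique.end());

    uint32_t *d_keys = nullptr;
    cudaMalloc(&d_keys, n * sizeof(uint32_t));
    cudaMemcpy(d_keys, keys.data(), n * sizeof(uint32_t), cudaMemcpyHostToDevice);
    return d_keys;
}
```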
We first implement a straightforward version of cuckoo hashing. While cuckoo hashing is naturally a recursive process, recursion is expensive in device code, so the eviction chain is implemented as a for loop whose maximum iteration count is the eviction bound. If an empty slot is found, we exit the loop early. If no open slot is found by the time the bound is reached, we increment a `collision` counter, flagging to the host that a rehash is needed. In practice, this rarely happens when the load factor is below 0.5.
Note that, unlike in conventional hash tables, table slots are accessed simultaneously by multiple GPU threads. To cope with this concurrency, we use `atomicExch` to atomically swap the key being inserted with the contents of the target slot. We terminate the loop if an empty value is retrieved, i.e., the key has been successfully inserted; otherwise, we continue along the eviction chain with the evicted key.
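A minimal sketch of this loop is shown below. The empty sentinel, the subtable layout, and the way the chain cycles through subtables are illustrative assumptions rather than our exact code:

```cuda
#include <cstdint>

#define EMPTY 0xFFFFFFFFu   // assumed sentinel for an empty slot
#define NUM_SUBTABLES 3     // assumed number of subtables

__device__ uint32_t xxhash32_u32(uint32_t key, uint32_t seed); // see sketch above

// Insert `key`, walking the eviction chain for at most `bound` steps.
__device__ bool insert_key(uint32_t *table, const uint32_t *seeds,
                           uint32_t subtable_cap, uint32_t key,
                           int bound, unsigned *collision)
{
    for (int i = 0; i < bound; ++i) {
        uint32_t t    = i % NUM_SUBTABLES;   // cycle through the subtables
        uint32_t slot = t * subtable_cap + xxhash32_u32(key, seeds[t]) % subtable_cap;
        key = atomicExch(&table[slot], key); // atomically swap key and slot
        if (key == EMPTY)
            return true;                     // empty value retrieved: success
        // otherwise `key` now holds the evicted occupant; continue the chain
    }
    atomicAdd(collision, 1u);                // flag the host that a rehash is needed
    return false;
}
```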
Drawing on what we learned in class, frequent uncoalesced access to global memory is often a performance killer, let alone when every access must be made atomically. Following this insight, we implement a multilevel hash table: the keys are first hashed into buckets in global memory, and cuckoo hashing is then performed on the individual buckets in shared memory. If any first-level bucket is found to be full, we generate a new hash function and rehash to distribute the keys into the buckets more evenly. If an unfortunate choice of hash function is made during the per-bucket cuckoo hashing, we only rehash the conflicting buckets.
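A hedged sketch of the first-level bucketing pass; the bucket capacity, the overflow signaling, and all names are assumptions:

```cuda
#include <cstdint>

__device__ uint32_t xxhash32_u32(uint32_t key, uint32_t seed); // see sketch above

// Scatter keys into fixed-capacity buckets in global memory. A full bucket
// raises `overflow`, telling the host to pick a new seed and redistribute.
__global__ void bucket_keys(const uint32_t *keys, uint32_t n, uint32_t seed,
                            uint32_t num_buckets, uint32_t bucket_cap,
                            uint32_t *buckets, unsigned *counts, int *overflow)
{
    uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    uint32_t b   = xxhash32_u32(keys[i], seed) % num_buckets;
    uint32_t pos = atomicAdd(&counts[b], 1u);  // claim the next slot in bucket b
    if (pos < bucket_cap)
        buckets[b * bucket_cap + pos] = keys[i];
    else
        *overflow = 1;                         // bucket full: host must rehash
}
```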
While the insertion time alone is shorter than for the straightforward table, adding the time needed to bucket the input puts the multilevel table at a disadvantage, quite contrary to our initial expectation. After extensive research online, we found that global atomics have been optimized since the Kepler and Maxwell architectures, and performing atomics in shared memory can actually hurt performance. We regard this as a real-life example of how an architectural improvement in hardware can turn a software optimization into a deoptimization. Our result is consistent with the more recent publications [1] and [2].
There are still some advantages to the multilevel table, though. The naive table simply throws in the towel when the load factor exceeds 0.7, while the multilevel table persists up to 0.85, that is, capacity = 1.2n. It is also less sensitive to the eviction bound, reducing the time spent tuning that parameter.
A third approach that we took stems from two observations:
- When the load factor is small, collisions happen relatively infrequently. A complete table rehash in this case is not worth the extra overhead.
- When the load factor is large, collisions are inevitable. Whether the hash table can contain all keys becomes a matter of probability, and that probability decreases sharply once the load factor exceeds 0.7.
Following these observations, we conclude that it is worthwhile to set aside a small stash to store conflicting keys. This preserves the excellent performance of the straightforward table when the load is small and carries us through even the ill-conditioned case of capacity = 1.01n, achieving the best of both worlds. Experimental results show that the stash table has the best overall performance.
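A sketch of the stash fallback, assuming a small fixed-capacity stash probed linearly (the size and names are ours):

```cuda
#include <cstdint>

#define STASH_SIZE 128u   // assumed stash capacity

// On eviction-chain failure, append the key to the stash instead of rehashing.
__device__ bool stash_insert(uint32_t *stash, unsigned *stash_count, uint32_t key)
{
    unsigned pos = atomicAdd(stash_count, 1u);
    if (pos >= STASH_SIZE) return false;   // stash overflow: rehash after all
    stash[pos] = key;
    return true;
}

// Lookups probe the stash linearly after missing in the main table.
__device__ bool stash_lookup(const uint32_t *stash, unsigned stash_count, uint32_t key)
{
    unsigned n = stash_count < STASH_SIZE ? stash_count : STASH_SIZE;
    for (unsigned i = 0; i < n; ++i)
        if (stash[i] == key) return true;
    return false;
}
```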
All performance figures are reported in Mops, or million operations per second. Additional metrics, namely the total number of collisions, mean time, and standard deviation, are reported for the stash table. Each experiment is repeated 32 times to gain statistical confidence (N/A means the run timed out).
Config | Value |
---|---|
CPU | Intel Xeon E5-2690 |
GPU | Nvidia K40m (2880 CUDA cores) |
Memory | 252 GiB |
Distro | Ubuntu 18.04 |
Kernel | 3.10.0 |
CUDA | 9.0 |
We verify the correctness of our implementation by inserting 512 elements and performing lookups on the same elements. If every lookup succeeds, we conclude that our implementation is correct. Different seeds are used, and many trials are performed to obtain a reliable conclusion.
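A sketch of this smoke test; the `insert_keys`/`lookup_keys` wrappers are hypothetical stand-ins for our kernel launchers:

```cuda
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical host wrappers around the insert and lookup kernels.
void insert_keys(const uint32_t *keys, size_t n);
void lookup_keys(const uint32_t *keys, uint8_t *found, size_t n);

void smoke_test(unsigned seed)
{
    const size_t N = 512;
    srand(seed);
    std::vector<uint32_t> keys(N);
    for (auto &k : keys) k = ((uint32_t)rand() << 16) ^ (uint32_t)rand();

    insert_keys(keys.data(), N);
    std::vector<uint8_t> found(N, 0);
    lookup_keys(keys.data(), found.data(), N);
    for (size_t i = 0; i < N; ++i)
        assert(found[i] && "every inserted key must be found");
}
```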
Create a hash table of size 2^25 in GPU global memory, where each table entry stores a 32-bit integer. Insert a set of 2^s random integer keys into the hash table, for s = 10, 11, ..., 24.
s | Naive/Mops | Multilevel/Mops | Stash/Mops | Mean/ms | StdDev/ms |
---|---|---|---|---|---|
10 | 0.5086 | 0.0622 | 0.5308 | 1.9290 | 0.1554 |
11 | 1.0097 | 0.1242 | 1.0876 | 1.8830 | 0.0140 |
12 | 1.9824 | 0.2475 | 2.1593 | 1.8969 | 0.0157 |
13 | 3.9719 | 0.4878 | 4.3357 | 1.8894 | 0.0106 |
14 | 8.0463 | 0.8993 | 8.6290 | 1.8987 | 0.0147 |
15 | 15.7511 | 1.4739 | 16.8820 | 1.9410 | 0.0127 |
16 | 30.4085 | 2.3080 | 32.8732 | 1.9936 | 0.0145 |
17 | 58.0311 | 3.7947 | 62.5064 | 2.0969 | 0.0091 |
18 | 104.1098 | 6.8521 | 114.4582 | 2.2903 | 0.0212 |
19 | 176.4422 | 12.7828 | 192.5563 | 2.7228 | 0.0192 |
20 | 269.0715 | 22.2279 | 287.5881 | 3.6461 | 0.1272 |
21 | 347.9472 | 37.5608 | 381.7428 | 5.4936 | 0.0262 |
22 | 391.0260 | 66.5917 | 434.2630 | 9.6584 | 0.0359 |
23 | 374.9033 | 101.5353 | 424.4112 | 19.7653 | 0.0273 |
24 | 270.1257 | 118.2038 | 336.0414 | 49.9260 | 0.0452 |
Performance dips when s = 24 for the naive and stash table because collisions begin to happen. However, the stash table performs better because no rehash is needed.
Insert a set S of 2^24 random keys into a hash table of size 2^25, then perform lookups for the following sets of keys S_0, ..., S_10. Each set S_i should contain 2^24 keys, where (100 - 10i) percent of the keys are randomly chosen from S and the remainder are random 32-bit keys. For example, S_0 should contain only keys drawn from S, while S_5 should contain 50% keys from S and 50% completely random keys.
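An illustrative construction of S_i; the `rand()` widening trick is the same assumption as in the key-generation sketch above:

```cuda
#include <cstdint>
#include <cstdlib>
#include <vector>

// Build S_i: (100 - 10*i)% of entries sampled from S, the rest fresh random keys.
std::vector<uint32_t> make_lookup_set(const std::vector<uint32_t> &S, int i)
{
    const size_t n      = S.size();                  // 2^24
    const size_t from_S = n * (100 - 10 * i) / 100;
    std::vector<uint32_t> q;
    q.reserve(n);
    for (size_t j = 0; j < from_S; ++j)
        q.push_back(S[rand() % n]);                  // randomly chosen from S
    while (q.size() < n)
        q.push_back(((uint32_t)rand() << 16) ^ (uint32_t)rand()); // random 32-bit key
    return q;
}
```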
i | Naive/Mops | Multilevel/Mops | Stash/Mops | Mean/ms | StdDev/ms |
---|---|---|---|---|---|
0 | 377.6520 | 252.7055 | 534.9546 | 31.3619 | 0.0567 |
1 | 378.4491 | 253.6125 | 523.1317 | 32.0707 | 0.0355 |
2 | 379.3551 | 254.0673 | 509.9108 | 32.9023 | 0.0314 |
3 | 379.9102 | 254.6413 | 497.3425 | 33.7337 | 0.0280 |
4 | 380.7537 | 255.2533 | 483.8156 | 34.6769 | 0.0349 |
5 | 381.4734 | 255.0833 | 470.7332 | 35.6406 | 0.0282 |
6 | 382.4053 | 255.7710 | 456.3187 | 36.7664 | 0.0560 |
7 | 382.7881 | 256.4347 | 443.1547 | 37.8586 | 0.0309 |
8 | 383.9326 | 256.7915 | 428.6367 | 39.1409 | 0.0411 |
9 | 384.3994 | 257.3034 | 413.5550 | 40.5683 | 0.0174 |
10 | 384.8294 | 257.6221 | 394.6965 | 42.5066 | 0.0265 |
The increase in performance for the naive and multilevel tables as i grows can be explained by reduced divergence. While finding a key early means exiting the lookup loop early, other threads in the same warp may not take the same path. This is especially true when i = 0, where keys could be located in any of the 3 subtables, so the warp scheduler may have to execute every iteration of the loop twice, depending on the distribution of keys. When i = 10, all key lookups fail, all threads take the same path, and divergence drops to zero.
For the stash table, on the other hand, the trend is reversed. This is due to the extra time needed to linearly probe the stash whenever a lookup misses in the main table. Since the stash is small, performance only degrades slightly.
Fix a set of n = 2^24 random keys and measure the time to insert the keys into hash tables of sizes s = 1.1n, 1.2n, ..., 2n. Also measure the insertion times for hash tables of sizes 1.01n, 1.02n, and 1.05n. Terminate the experiment if it takes too long and report the time used.
s/n | Naive/Mops | Multilevel/Mops | Stash/Mops | Collision/% | Mean/ms | StdDev/ms |
---|---|---|---|---|---|---|
2.0 | 270.3744 | 118.0719 | 336.2089 | 0.00 | 49.9012 | 0.0420 |
1.9 | 258.9772 | 116.1746 | 324.2233 | 0.00 | 51.7459 | 0.0472 |
1.8 | 246.4734 | 113.6895 | 310.7758 | 0.00 | 53.9850 | 0.0564 |
1.7 | 232.1953 | 110.5201 | 295.0688 | 0.00 | 56.8587 | 0.0712 |
1.6 | 216.0954 | 105.8690 | 276.5650 | 0.00 | 60.6628 | 0.0935 |
1.5 | 198.1926 | 100.0926 | 253.2920 | 0.00 | 66.2367 | 0.2379 |
1.4 | N/A | 92.0229 | 224.8730 | 0.00 | 74.6075 | 0.1195 |
1.3 | N/A | 81.3341 | 191.8155 | 0.04 | 87.4654 | 0.0855 |
1.2 | N/A | 60.8193 | 156.2301 | 0.35 | 107.3879 | 0.1266 |
1.1 | N/A | N/A | 120.4658 | 2.30 | 139.2695 | 0.1096 |
1.05 | N/A | N/A | 104.7163 | 4.52 | 160.2158 | 0.1488 |
1.02 | N/A | N/A | 96.4342 | 6.22 | 173.9757 | 0.1293 |
1.01 | N/A | N/A | 93.8997 | 6.83 | 178.6717 | 0.1222 |
This experiment shows that using a stash to cope with collisions is indeed a very effective technique.
Using n = 2^24 random keys and a hash table of size 1.4n, experiment with different bounds on the maximum length of an eviction chain before restarting. Which bound gives the best running time for constructing the hash table? Note, however, that you are not required to find the optimal bound.
e (bound = e * log n) | Naive/Mops | Multilevel/Mops | Stash/Mops | Collision/% | Mean/ms | StdDev/ms |
---|---|---|---|---|---|---|
0.2 | N/A | N/A | 233.1928 | 4.29 | 71.9457 | 0.1570 |
0.3 | N/A | 66.2509 | 221.4478 | 1.78 | 75.7615 | 0.2969 |
0.4 | N/A | 61.8062 | 216.2449 | 0.60 | 77.5843 | 0.2400 |
0.5 | N/A | 65.6564 | 217.5787 | 0.21 | 77.1087 | 0.1529 |
0.6 | N/A | 80.5926 | 218.5072 | 0.08 | 76.7811 | 0.1830 |
0.7 | N/A | 76.2219 | 218.3521 | 0.06 | 76.8356 | 0.1114 |
0.8 | N/A | 76.6323 | 220.7911 | 0.02 | 75.9868 | 0.1059 |
0.9 | N/A | 85.7437 | 222.7855 | 0.01 | 75.3066 | 0.0954 |
1.0 | N/A | 82.7799 | 225.0036 | 0.00 | 74.5642 | 0.0702 |
2.0 | 108.6175 | 91.4688 | 230.3316 | 0.00 | 72.8394 | 0.1075 |
4.0 | 176.5028 | 91.9897 | 230.4251 | 0.00 | 72.8098 | 0.0923 |
8.0 | 176.4994 | 91.9756 | 230.3617 | 0.00 | 72.8299 | 0.0671 |
16.0 | 176.5834 | 91.9972 | 230.3030 | 0.00 | 72.8484 | 0.1776 |
The initial decrease and subsequent increase in the performance of the stash table can be attributed to two factors:
- The eviction bound
- The collision rate
When the eviction bound is tiny, the number of iterations needed to complete an insertion is also small. However, this comes at the cost of a higher collision rate; the collision rate falls as the eviction bound grows. Insertion performance for all tables converges once e >= 4.0, i.e., a bound of 4 log n, which is the suggested default. Setting e to 0.1 or smaller causes endless rehashing for all tables, so those figures are not presented.