Class of Bijective 128-Bit Hashes for GPU #26
Here's the double XOR-rotate inversion function I forgot to include:
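(A hedged reconstruction of the idea, not necessarily the original function: per the Reynolds construction referenced later in the thread, the double XOR-swizzle is the GF(2) polynomial I + C³² + C⁶⁴ in the 32-bit lane rotation C, and the seven-iteration inverse mentioned in the next comment simply re-applies the forward step.)

```glsl
// Reconstruction sketch: invert p ^= p.yzwx ^ p.zwxy by re-applying it.
// Seven applications compute (I + C^32 + C^64)^7, which equals the inverse
// because (I + C^32 + C^64)^4 = I (see the simplification below).
uvec4 invertXorSwizzle7(uvec4 p) {
    for (int i = 0; i < 7; i++)
        p ^= p.yzwx ^ p.zwxy;
    return p;
}
```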
This can probably be collapsed somewhat. I'll try to do it when I have time.
Okay, I found that only three, instead of seven, iterations of the rotation inversions above are needed. I collapsed the inversion function by simplifying the cyclic permutations, (I + C³² + C⁶⁴)(I + C³² + C⁶⁴)(I + C³² + C⁶⁴), which means the inversion function above can be simplified to:
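(A sketch consistent with the algebra above; the updated Shadertoy has the author's actual code. Since C¹²⁸ = I, (I + C³² + C⁶⁴)² = C⁶⁴, so the triple product reduces to C⁶⁴(I + C³² + C⁶⁴) = I + C⁶⁴ + C⁹⁶, i.e. a single double XOR-swizzle.)

```glsl
// One-step inverse of p ^= p.yzwx ^ p.zwxy:
// (I + C^32 + C^64)^3 = I + C^64 + C^96.
uvec4 invertXorSwizzle(uvec4 p) {
    return p ^ p.zwxy ^ p.wxyz;
}
```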
The Shadertoy link I gave earlier has been updated to reflect this.
This is very cool. I'm not sure how hash-prospector could test this, since even getting reasonable comparisons of 64-bit hashes is challenging... The component swizzling approach for uvec4 seems solid, and I can't find any easy ways to speed up your hash (maybe the inverse can be faster if …).
For modern GCC/clang (and icc/icx), I think, it translates rather straightforwardly:

```c
#include <stdint.h>

typedef uint32_t uvec4 __attribute__((vector_size(16)));

uvec4 bjhash128(uvec4 p) {
#define SWIZZLE(v, a, b, c, d) (uvec4){v[a], v[b], v[c], v[d]}
    const uvec4 shft = (uvec4){14U, 15U, 16U, 17U};
    p ^= p >> shft;
    p *= (uvec4){0xEAF649A9U, 0x6AF649A9U, 0x050C2D35U, 0xAAF649A9U};
    p ^= SWIZZLE(p, 1, 2, 3, 0) ^ SWIZZLE(p, 2, 3, 0, 1); // p ^= p.yzwx ^ p.zwxy;
    p ^= p >> SWIZZLE(shft, 1, 2, 3, 0);                  // p ^= p >> shft.yzwx;
    p *= (uvec4){0x21F0AAADU, 0x0D0C2D35U, 0xE8F649A9U, 0xD30332D3U};
    p ^= SWIZZLE(p, 1, 2, 3, 0) ^ SWIZZLE(p, 2, 3, 0, 1); // p ^= p.yzwx ^ p.zwxy;
    return p ^ p >> SWIZZLE(shft, 3, 0, 1, 2);            // return p ^ p >> shft.wxyz;
#undef SWIZZLE
}
```

Of course, you can also use intrinsics for a specific instruction set (SSE2/NEON/whatever) instead. Note: parenthesizing of the macro arguments is omitted for brevity. This definition of SWIZZLE can also be swapped for a compiler builtin:

```c
#define SWIZZLE(v, a, b, c, d) __builtin_shufflevector(v, v, a, b, c, d)  // <-- clang (6.0.0+) and newer GCC (12.1+)
#define SWIZZLE(v, a, b, c, d) __builtin_shuffle(v, (uvec4){a, b, c, d})  // <-- older GCC (4.8.1+), but not clang
```

As for the hash itself, …
Note that all 16 possible versions of this pass the birthday test, incidentally.
The constants being so similar looks a bit suspicious. Edit: just realized the constants are from #19.
Yeah, the constants were arbitrarily chosen from the list in #19 just to get something to start off with. There was no statistical motivation for choosing them, but I'm reasonably confident that tweaking the constants and shift values could yield better results. It was somewhat disappointing to see that one of the swizzles failed at just 16KB.

I'm still working on a battery of tests for the GPU. I started off with OpenGL 4.6, but decided to switch to WGSL, as the WebGPU API is supposed to be broadly implemented in Chromium-based browsers sometime near the end of May, I believe, and it supports compute shaders. It would also probably be quicker to prototype something, since I wouldn't have to compile any of the code before running it.
It's just how the constants interact, e.g.

```glsl
p *= uvec4(0xEAF649A9U, 0x6AF649A9U, 0x050C2D35U, 0xAAF649A9U);
p ^= p.yzwx ^ p.zwxy;
```

is the same as

```glsl
p = uvec4(p.x * 0xEAF649A9U, p.y * 0x6AF649A9U, p.z * 0x050C2D35U, p.w * 0xAAF649A9U)
  ^ uvec4(p.y * 0x6AF649A9U, p.z * 0x050C2D35U, p.w * 0xAAF649A9U, p.x * 0xEAF649A9U)
  ^ uvec4(p.z * 0x050C2D35U, p.w * 0xAAF649A9U, p.x * 0xEAF649A9U, p.y * 0x6AF649A9U);
```

and each resulting component XORs together products by those near-identical constants. Of course, in the above tests only one input component was non-zero anyway. I myself tend to use constants from https://gist.github.com/Marc-B-Reynolds/53ecc4f9648d9ba4348403b9afd06804 with equally little justification.

Edit: the other thing, the shift amounts being a linear progression {14,15,16,17} may interact undesirably with the swizzles moving components by 1 place (or maybe not).
The linear shifts were also an arbitrary choice, as I swizzle the constant vector that contains them between steps, hoping that would be enough. I also thought about using …, etc., and you're not constrained to using all the swizzles, so an option could be something along the lines of the sketch below, and the like.
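(A purely hypothetical illustration of that kind of variation, nothing tested: made-up, non-arithmetic shift amounts, and a round built from a different swizzle pair. The pair p.yzwx ^ p.wxyz corresponds to I + C³² + C⁹⁶, which is its own inverse, since its square is I + C⁶⁴ + C¹⁹² = I + C⁶⁴ + C⁶⁴ = I.)

```glsl
// Hypothetical variant round, for illustration only: the shift amounts are
// made up, and the multiply constants are the forward hash's second-round
// constants reshuffled arbitrarily.
uvec4 variantRound(uvec4 p) {
    p ^= p >> uvec4(13U, 17U, 14U, 16U);
    p *= uvec4(0x21F0AAADU, 0xD30332D3U, 0xE8F649A9U, 0x0D0C2D35U);
    p ^= p.yzwx ^ p.wxyz;   // I + C^32 + C^96: a self-inverse mixing step
    return p;
}
```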
Interesting to see the generalization to higher bit counts. That is, to what extent can you apply different generated hash functions one after the other, instead of applying one repeatedly over the whole bitstring?
They don't quite need to be secure for my use-case, but anything naive fails (like literally putting them side by side, because the effects of bit changes stay within their own bucket).
If you wanted to extend the hash above to higher bit counts, it would probably work fine for any power-of-2-times-32-bit input, but in order to ensure bits propagate across the entire set, you would have to do more rounds of double-XOR-swizzles.

I'm not sure why you would want to compare hash functions directly. Could you explain your reason for that? It might help me answer your question more clearly. The typical way to test the quality of a hash is to use a battery of tests that output some kind of p-value, where the threshold is set where a truly random set of numbers would pass every test (though false negatives do occasionally come up, even among sets of random numbers). So in essence, this is equivalent to testing a hash against the set of random numbers itself, and the comparison comes from how closely each hash stacks up against that.

As far as directly comparing hashes, depending on what exactly you want to do, it could be trivially fast on a GPU in a fragment or compute shader, or it could be as rigorous as running a battery of tests like those mentioned above, possibly even more so. A direct comparison of output values or bits could be done in realtime with a very large set of inputs in a fragment shader and would be very easy to implement. Checking for bit-pattern overlaps between hashes would be far more intensive, though, especially among bijective hashes, as the full period length of the input state is so prodigiously large that it's not reasonable to check the full domain (128 bits have over 340 undecillion possible outputs). It's likely that using the results of a test to then drive another hash (which is what it sounds like you want to do) in realtime would be very difficult, and maybe not even possible, depending on what kinds of comparisons you want to drive it with.

Bijective hashes may be used for encryption, but that doesn't necessarily mean they are "secure". It depends on what you want. If you want something fast, then you'll likely have to sacrifice quality. If you want something more random, you'll probably have to forgo speed. And finding something in between is where testing hashes is most invaluable.
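(To make the "more rounds" point concrete, here is a hypothetical, untested sketch of one double XOR-swizzle step widened to 256 bits by treating two uvec4s as 8 lanes. The same GF(2) algebra applies: with C now the 8-lane rotation, C⁸ = I, and the step I + C + C² is invertible, its inverse being seven forward applications.)

```glsl
// Hypothetical 256-bit step (illustration only): rotate the 8-lane vector
// (lo, hi) by one and by two lanes, then XOR both rotations in.
void step256(inout uvec4 lo, inout uvec4 hi) {
    uvec4 r1lo = uvec4(lo.yzw, hi.x), r1hi = uvec4(hi.yzw, lo.x); // rotate by 1 lane
    uvec4 r2lo = uvec4(lo.zw, hi.xy), r2hi = uvec4(hi.zw, lo.xy); // rotate by 2 lanes
    lo ^= r1lo ^ r2lo;
    hi ^= r1hi ^ r2hi;
}
```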
Just thought I'd give an update on my GPU hash test suite progress. I'm too clumsy with GitHub to really maintain a codebase here yet, as I'm still pretty new to the platform.

Since I'm going with the NIST SP 800-22 recommendations for tests, I needed to build a small codebase of fast (though slightly less accurate) variants of some of the functions in the cephes and math.h libraries. Namely, I've made shader-language equivalents of erf() and the gamma-related functions that are so ubiquitous throughout the last steps of each test NIST recommends.

Some tests will be blazing-fast on a GPU, like the random walk test, where you could literally test strings of hundreds of millions of bits in less than a second on a modern mid-tier graphics card. Even the discrete Fourier transform test can be performed on a very large set in almost no time at all. But the monobit frequency test is a bit trickier regarding fast memory management and parallelism. It's a bit more involved trying to keep as many of the results of the different stages of computation in GPU memory without leaving a huge gap of under-utilized memory somewhere, or having to write to and read from non-volatile memory. The main difficulty, though, is that workgroups are limited to a 3D structure while the computations need to happen in 4D, without too much overhead in the transforms from workgroup IDs to the integer space of a 4x32-bit hash.
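(One hypothetical way to frame that mapping, sketched in GLSL compute rather than WGSL, with all names made up: flatten the 3D invocation ID into a linear index for the low lane, and let a per-dispatch uniform offset supply the higher lanes so successive dispatches can sweep further through the 4D input space.)

```glsl
layout(local_size_x = 256) in;

uniform uvec3 dispatchOffset; // hypothetical per-dispatch offset for the high lanes

uvec4 inputBlock() {
    uvec3 n = gl_NumWorkGroups * gl_WorkGroupSize;   // total invocations per axis
    uint ix = gl_GlobalInvocationID.x
            + n.x * (gl_GlobalInvocationID.y
            + n.y * gl_GlobalInvocationID.z);        // flattened 3D index (wraps past 2^32)
    return uvec4(ix, dispatchOffset);                // 4x32-bit hash input
}
```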
It seems odd that NIST recommends using erf() for evaluating the statistical properties of a digital algorithm, since there are many different approximations of it, but it doesn't seem to be directly computable. I wonder what alternative hash tests like SMHasher use... It sounds like you've already done a lot of hard work, thank you for that!
In the case of NIST's recommendations, erf() and igamc() only seem to be used in the final step of each test, namely calculating p-values, so there really shouldn't be much in the way of cumulative error issues. The erf() approximation I settled on is Winitzki's, as it's a lot faster than using continued fractions and is more accurate than the Bürmann series. It's accurate to about 7 significant digits.

The upper incomplete gamma function was a bit tougher, as I went through about a dozen "approximation" methods that seemed to work fine with really high upper bounds, but were terrible when the upper bounds were small and close to the lower bounds. Some were off by more than a factor of 2 at lower values. I actually had to look in a 1992 edition of "Numerical Recipes in FORTRAN 77" to find that the best way was just to use the odd-numbered continued fractions to get it to converge really quickly. The accuracy I've found to be anywhere between 6 and 11 significant digits with just four iterations, with the accuracy being on the higher end for lower values.

An accuracy of 3 significant digits is all that's needed for NIST's tests, as the p-value threshold is 0.01 for all of them, as far as I'm aware. Either way, all of the transcendental versions of these functions can be represented as a continued fraction with integers, so the accuracy can be whatever a user desires. Good point, @tommyettinger! I should perhaps give users the option to choose which form of the functions to use, either the faster ones or the transcendental ones where they could choose the number of iterations for each that suit their needs.
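(For reference, the commonly published form of Winitzki's approximation, in GLSL; the tuned variant behind the 7-digit figure above may well differ from this basic version, whose usual quoted accuracy is a few decimal places.)

```glsl
// Winitzki's closed-form erf() approximation with the standard constant
// a = 0.147: erf(x) ~= sign(x) * sqrt(1 - exp(-x^2 (4/pi + a x^2)/(1 + a x^2))).
float erfApprox(float x) {
    const float a = 0.147;
    float x2 = x * x;
    float t = (4.0 / 3.14159265 + a * x2) / (1.0 + a * x2);
    return sign(x) * sqrt(1.0 - exp(-x2 * t));
}
```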
First of all, big fan of this class of hashes. Came across your work through Shadertoy and went way too far down the rabbit hole trying to see if I could come up with a good adaptation of this for 128-bit hashes on a GPU. I don't really know any C, so I couldn't give you the C equivalent here, but these are super fast on a GPU because 4-element vectors can swap 32-bit values between them at zero cost. I found an article by Marc Reynolds, http://marc-b-reynolds.github.io/math/2017/10/13/XorRotate.html, about reversing double XOR-rotates and came to the realization that a vector swizzle is just a rotation by multiples of 32 bits. So the GLSL code goes like this:
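(The GLSL below is reconstructed from the C translation given earlier in the thread, which mirrors it swizzle for swizzle; the function name follows that translation.)

```glsl
uvec4 bjhash128(uvec4 p) {
    const uvec4 shft = uvec4(14U, 15U, 16U, 17U);
    p ^= p >> shft;
    p *= uvec4(0xEAF649A9U, 0x6AF649A9U, 0x050C2D35U, 0xAAF649A9U);
    p ^= p.yzwx ^ p.zwxy;
    p ^= p >> shft.yzwx;
    p *= uvec4(0x21F0AAADU, 0x0D0C2D35U, 0xE8F649A9U, 0xD30332D3U);
    p ^= p.yzwx ^ p.zwxy;
    return p ^ (p >> shft.wxyz);
}
```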
And the inverse:
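(A structural sketch only, not the author's code; see the Shadertoy below for the actual inverse. MINV1 and MINV2 are placeholders for the per-lane modular inverses, mod 2³², of the two forward multiply constant vectors, and the XOR-swizzle inverse uses the simplified single-step form derived earlier in the thread.)

```glsl
const uvec4 MINV1 = uvec4(0U); // PLACEHOLDER: inverses of the 1st multiply constants
const uvec4 MINV2 = uvec4(0U); // PLACEHOLDER: inverses of the 2nd multiply constants

uint unxorshift(uint x, uint s) {
    // Invert x ^= x >> s by re-applying with doubled shift amounts.
    for (uint t = s; t < 32U; t *= 2U) x ^= x >> t;
    return x;
}

uvec4 unxorshift4(uvec4 p, uvec4 s) {
    return uvec4(unxorshift(p.x, s.x), unxorshift(p.y, s.y),
                 unxorshift(p.z, s.z), unxorshift(p.w, s.w));
}

uvec4 bjhash128_inv(uvec4 p) {
    const uvec4 shft = uvec4(14U, 15U, 16U, 17U);
    p = unxorshift4(p, shft.wxyz); // undo the final p ^= p >> shft.wxyz
    p ^= p.zwxy ^ p.wxyz;          // undo p ^= p.yzwx ^ p.zwxy
    p *= MINV2;                    // undo the second multiply
    p = unxorshift4(p, shft.yzwx); // undo p ^= p >> shft.yzwx
    p ^= p.zwxy ^ p.wxyz;          // undo the first double XOR-swizzle
    p *= MINV1;                    // undo the first multiply
    return unxorshift4(p, shft);   // undo the initial p ^= p >> shft
}
```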
And the Shadertoy is here https://www.shadertoy.com/view/mstXD2 with further explanation and some basic debugging tools. Obviously the above code was written with vector parallelism in mind, so I apologize if this doesn't translate to C very well, but I'm hoping this might give you a good starting point for developing a faster 128-bit variant in C.
I was rolling around the idea of writing a compute shader to do what your Hash Prospector does, but on the GPU so as to be able to scale it efficiently for 128-bit testing. As such, I still haven't done any tests for statistical robustness other than a visual inspection, so there's room for improvement in choosing the correct constants and bitwise shifts.
That being said, all the bits seem to be visually random, both the lowest and highest bits, so that gives me hope that this will be a feasible variant. I'm just curious as to what your thoughts are.