-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Q] Seeded integer hash functions #16
Comments
the first idea that come to mind |
Thanks! I guess I meant "integer hash function" in the sense that the README uses it, which seems to imply reversibility. |
From my experience getting Pearson B. Hashing to pass SMHasher especially with seeding, I would use some scramble of the seed, e.g. For example, and depending on your needs, a simple two-step |
You're on the right track. XORing the input is what I've done in the past myself. It's safe in that it cannot negatively affect the permutation regardless of the seed. It's is better than ADDing ( I haven't done any thorough testing on this yet, but another possibility I've considered is mixing the seed in the middle: uint32_t hash32(uint32_t x, uint32_t seed)
{
x ^= x >> A;
x *= B;
x ^= x >> C;
x ^= seed;
x *= D;
x ^= x >> E;
} This more fundamentally effects the permutation, though maybe in a bad way (why this needs more testing). Also, is it better before or after the middle xorshift? I don't know yet. There are also other exotic options: // WARNING: This is just for illustration purposes!
uint32_t hash32(uint32_t x, uint32_t seed)
{
uint32_t hi = (seed >> 15) & 0x1fffe;
uint32_t lo = (seed << 1) & 0x1fffe;
x ^= x >> A;
x *= B ^ hi;
x ^= x >> C;
x *= D ^ lo;
x ^= x >> E;
} However changing the multiplier constants undermines the whole point of using the good constants, and some seeds are going to create lousy permutations. Probably not worth it. |
afair, crcN(uintN) is always reversible function, so author can include it in his list of reversible primitives that said, half of his primitives are parameterized, so you can use any sequence of the primitives that employs enough entropy from the seed. but since crc and mul are better bit-mixers than other ops, crc(x*(seed|1)) still looks for me like the most efficient reversible hash function and regarding quality, you can reuse his code to compute AVERAGE bias for random-generated operation parameters instead of searching for MINIMAL one, for various operation sequences. actually, it would make nice addition to the current mode of operation
they can't do it as far as each individual operation is reversible
it's useless except when your seeds are from extremally poor PRNG such as
|
Thanks all!
Makes sense, luckily for my use case I know the seeds will be good.
Good to know I'm not doing something terrible. Although it does seem to me that I'm also curious about thoughts on how to use, say, 128 or 256 or more bits of seed. The naive thing I thought of is repeatedly mixing the seed in the middle, but maybe there's something more efficient:
Yeah, measurement is an interesting question (and the path forward from here). You'd want to ensure hash functions with different seeds aren't correlated (e.g., if you just ignored the seed, lowbias32 would perform pretty well if you just examined each seeded hash function in isolation). I have a measurement that's specific to my use case, but it's pretty blunt. Like skeeto, I'm a little untrusting of solutions that mix seeds into the multiplier constants; your suggestion is probably a good way to validate whether that's a good idea. |
This calls for cryptography. SPECK offers a version featuring 32-bit block size and 64-bit key size. Other than that, if you implemented reversible bit-permutation as an intermediate step, it would consume log₂ (32!) ≈ 118 bits of your seed to name a specific permutation. Leaves a few more for an XOR mask to invert some of the bits – or CLMUL, see below... However, I doubt that bit-permutation meets computation efficiency requirements... 😉
Why not use hardware-accelerated carry-less multiplication CLMUL instead of XOR? CLMUL changes the (2³² --> 2³²) function's structure in a more complex way than adding or exclusive-oring a constant at some (intermediate) point(s). It steers how and which of Also, I would not speak against arithmetic multiplication by the seed (with LSB set) as long as it remains an additional intermediate step and does not replace the multiplication with the good constants – and well, neither immediately follows nor directly precedes it to not de-associativate its effect. Compared to CLMUL, the propagated carry will have an increasing smearing effect on the upper-most bits (this imbalance is mitigated by the following and well matching xor-shift-right in The CLMUL-step, lacking the carry, could be combined with a prepended shift-right-add ( P.S. Still calculating log₂ (2³²!) for optimal minimum seed length... 😉 |
@skeeto have you considered testing the bias of the options you've suggested vs the XOR? In the meantime, would you recommend XOR? |
Related, could use the XOR method for a list of integers? Eg declare function triple32inc(x: number): number;
function hash(seed: number, values: number[]): number {
for (const value of values) {
seed = triple32inc(seed ^ value);
}
return seed;
} |
What's the purpose of seeding? I think the purpose of seeding can be viewed as generating different permutations, with this in mind, xor the input with seed is not sufficient. Consider two permutations, one that maps Another view on this is that the hash function itself is multiple rounds of weak permutation stacked on top of another, xor the input can be viewed as adding another round of weak permutation to the hash function, which does not change the hash function much. By the way, I have tested ideas of adding seed to the
|
Thank you for the cool project and writeup!
I was curious — do you have recommendations for integer hash function families? That is, given some amount of seed entropy, choose an integer hash function; for example, a naive thing one could do with a 32 bit seed is
return lowbias32(x ^ seed)
.The text was updated successfully, but these errors were encountered: