Use memory address instead of counter for hashes of some mutable types #2934

d-torrance · 2023-09-20T10:54:27Z

This is a first draft addressing the points in #2591 (comment). I've started with a relatively conservative approach:

We introduce a new function in the interpreter, hashFromAddress, which takes an Expr and returns a hash code based on its address in memory. Since the addresses are all multiples of a power of 2, they make awful hash codes on their own, with lots of collisions. So we use Fibonacci hashing to spread them out.
Every Expr type that previously used nextHash to determine its hash code but is not one of the input types for youngest or serialNumber now uses hashFromAddress.
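To illustrate why aligned addresses collide and how Fibonacci hashing helps, here is a rough Python sketch. It is not the interpreter code: the 16-byte alignment, the base address, the helper names, and taking the bucket index from the top bits of the hash are all assumptions made for the example.

```python
# Illustrative sketch only; the real hashFromAddress lives in the interpreter.
A = 2654435769   # floor((sqrt 5 - 1)/2 * 2^32), the constant quoted in the commit message
W = 2**32        # word size w = 2^32

def fib_hash(address: int) -> int:
    """Multiplicative (Fibonacci) hash of an address, reduced mod 2^32."""
    return (address * A) % W

def raw_bucket(address: int, bits: int) -> int:
    """Bucket index taken straight from the low bits of the address."""
    return address % (1 << bits)

def fib_bucket(address: int, bits: int) -> int:
    """Bucket index taken from the top bits of the Fibonacci hash (an assumption)."""
    return fib_hash(address) >> (32 - bits)

BITS = 11  # 2^11 = 2048 buckets
# Hypothetical addresses: 1000 objects, each 16-byte aligned.
addresses = [0x7F0000000000 + 16 * i for i in range(1000)]

raw_used = len({raw_bucket(a, BITS) for a in addresses})
fib_used = len({fib_bucket(a, BITS) for a in addresses})
print("buckets used, raw addresses: ", raw_used)
print("buckets used, Fibonacci hash:", fib_used)
```

With these made-up addresses the raw low bits can reach only 128 of the 2048 buckets, since the bottom four bits are always zero. The multiplied hash reaches more of them.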

Before

i1 : hash new Type of BasicList

o1 = 1248871

i2 : hash new Type of BasicList

o2 = 1248879

After

i1 : hash new Type of BasicList 

o1 = 1186750

i2 : hash new Type of BasicList 

o2 = 1186752

Questions

Is there a better choice of hashing function? There are some comments online throwing shade at Fibonacci hashing, but it seems to do pretty well. For example, the bucket sizes are pretty close to what you would expect if each element were equally likely to land in each bucket:

i2 : B = buckets set apply(1000, i -> x -> x);

i3 : #B

o3 = 2048

i4 : tally \\ length \ B

o4 = Tally{0 => 1257}
           1 => 597
           2 => 181
           3 => 11
           4 => 2

o4 : Tally

i5 : needsPackage "Probability"; X = binomialDistribution(1000, 1/2048.);

i7 : apply(5, i -> 2048 * density_X i)

o7 = {1256.67, 613.907, 149.803, 24.3451, 2.96435}
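The same comparison can be reproduced outside Macaulay2. This is a minimal sketch assuming the model used above, where each of the 1000 keys lands in one of the 2048 buckets independently and uniformly. The function name is made up for the example.

```python
from math import comb

def expected_bucket_counts(n_keys: int, n_buckets: int, max_size: int) -> list[float]:
    """Expected number of buckets holding exactly k keys, for k = 0..max_size,
    when bucket sizes follow Binomial(n_keys, 1/n_buckets)."""
    p = 1 / n_buckets
    return [
        n_buckets * comb(n_keys, k) * p**k * (1 - p) ** (n_keys - k)
        for k in range(max_size + 1)
    ]

# Observed tally copied from the session above (o4).
observed = {0: 1257, 1: 597, 2: 181, 3: 11, 4: 2}
expected = expected_bucket_counts(1000, 2048, 4)

for k, e in enumerate(expected):
    print(f"size {k}: observed {observed[k]:5d}, expected {e:9.3f}")
```

The expected values should agree with o7 to the printed precision. The observed tally is close to them for empty buckets and sits a little below for size 1 and above for size 2.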

The documentation for youngest says it only accepts mutable hash tables, but in reality it also accepts files, compiled function closures, and symbols. Are these necessary? Should we move these over to hashFromAddress as well?
Should we limit the use of nextHash to just Type objects, or keep it for all mutable hash tables?
The serialNumber function, which also uses hash codes for some types to determine the age of an object, is only used by the Serialization package, which is "still experimental and preliminary". Do we need this? Or could we move these over to hashFromAddress, too?

hashFromAddress computes hash codes for mutable objects from their address in memory. We first cast the pointer to a long and then to an int to avoid "cast from pointer to integer of different size" compiler warnings. We then apply Fibonacci hashing to decrease collisions.
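As a sanity check on the constant and the truncation step, here is a small Python model. It assumes a 32-bit int, treats the arithmetic as unsigned, and uses a hypothetical function name. The interpreter's actual code may differ in these details.

```python
from math import floor, sqrt

W = 2**32                               # w = 2^32
A = floor((sqrt(5) - 1) / 2 * W)        # golden-ratio multiplier from TAOCP 6.4
print(A, hex(A))

def hash_from_address_model(address: int) -> int:
    """Model of the described steps: keep the low 32 bits of the address
    (the pointer -> long -> int casts), then multiply by A modulo 2^32."""
    low32 = address & 0xFFFFFFFF        # assumption: int is 32 bits wide
    return (low32 * A) % W
```

Because only the low 32 bits survive the cast, two addresses that differ only above bit 31 hash to the same value in this model.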

d-torrance added 7 commits September 19, 2023 21:18

Add Fibonacci hash function to interpreter

98ebeff

From TAOCP Section 6.4. We use w = 2^32 and A = (sqrt 5 - 1)/2 * 2^32 = 2654435769.

Use hashFromAddress for mutable instances of PythonObject

94d7cc6

Use hashFromAddress for CompiledFunction objects

3cd54b9

Add a helper function newCompiledFunction for simplification.

Use hashFromAddress for FunctionClosure objects

05b49c9

Use hashFromAddress for Database objects

5cb76bc

The functions dbmopenin and dbmopenout are nearly identical, so we take the opportunity to refactor them to both call a helper function.

Use hashFromAddress for FunctionBody objects

f63e441

d-torrance changed the title ~~Use memory address instead of counter for some mutable types~~ Use memory address instead of counter for hashes of some mutable types Sep 20, 2023

d-torrance marked this pull request as ready for review October 16, 2023 00:33

DanGrayson merged commit fb79a8f into Macaulay2:development Oct 26, 2023
6 checks passed

d-torrance deleted the hash-codes branch November 9, 2023 02:18

d-torrance mentioned this pull request Jun 29, 2024

Broken hash functions #2591

Closed

d-torrance mentioned this pull request Aug 6, 2024

Deduplicate identical entries in "code methods X" #3388

Merged
