Use memory address instead of counter for hashes of some mutable types #2934

Merged: 7 commits, Oct 26, 2023

d-torrance
Member

This is a first draft addressing the points in #2591 (comment). I've started with a relatively conservative approach:

  • We introduce a new function in the interpreter, hashFromAddress, which takes an Expr and returns a hash code based on its address in memory. Since the addresses are all multiples of powers of 2, by themselves, they make awful hash codes with lots of collisions. So we use Fibonacci hashing to spread them out.
  • Every Expr type that previously used nextHash to determine its hash code but is not one of the input types for youngest or serialNumber now uses hashFromAddress.
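
To illustrate why raw addresses collide and how Fibonacci hashing spreads them out, here is a minimal Python sketch. It is not the interpreter's code: the base address, the 16-byte spacing, and the choice of taking the top bits of the product as the bucket index (Knuth's formulation) are all assumptions made for the demonstration.

```python
W_BITS = 32
A = 2654435769  # (sqrt(5) - 1) / 2 * 2**32, rounded down

def naive_bucket(address: int, bucket_bits: int) -> int:
    """Use the low bits of the address directly as the bucket index."""
    return address & ((1 << bucket_bits) - 1)

def fib_bucket(address: int, bucket_bits: int) -> int:
    """Multiply by A modulo 2**32, then keep the top bucket_bits bits."""
    product = (address * A) & 0xFFFFFFFF
    return product >> (W_BITS - bucket_bits)

# Hypothetical addresses: 1000 objects allocated 16 bytes apart.
addresses = [0x7F0000001000 + 16 * i for i in range(1000)]
bucket_bits = 11  # 2**11 = 2048 buckets

naive_used = len({naive_bucket(a, bucket_bits) for a in addresses})
fib_used = len({fib_bucket(a, bucket_bits) for a in addresses})

print("buckets used without hashing:", naive_used)
print("buckets used with Fibonacci hashing:", fib_used)
```

Because every address in this sketch is a multiple of 16, the low four bits are always zero, so the unhashed addresses can only ever reach 128 of the 2048 buckets. The multiplicative step mixes the higher bits in and reaches far more of them.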

Before

i1 : hash new Type of BasicList

o1 = 1248871

i2 : hash new Type of BasicList

o2 = 1248879

After

i1 : hash new Type of BasicList 

o1 = 1186750

i2 : hash new Type of BasicList 

o2 = 1186752

Questions

  • Is there a better choice of hashing function? There are some comments online throwing shade at Fibonacci hashing, but it seems to do pretty well. For example, the bucket sizes are pretty close to what you would expect if an element were equally likely to land in each bucket:
i2 : B = buckets set apply(1000, i -> x -> x);

i3 : #B

o3 = 2048

i4 : tally \\ length \ B

o4 = Tally{0 => 1257}
           1 => 597
           2 => 181
           3 => 11
           4 => 2

o4 : Tally

i5 : needsPackage "Probability"; X = binomialDistribution(1000, 1/2048.);

i7 : apply(5, i -> 2048 * density_X i)

o7 = {1256.67, 613.907, 149.803, 24.3451, 2.96435}
  • The documentation for youngest says it only accepts mutable hash tables, but in reality it also accepts files, compiled function closures, and symbols. Are these necessary? Should we move these over to hashFromAddress as well?
  • Should we limit the use of nextHash to just Type objects, or keep it for all mutable hash tables?
  • The serialNumber function, which also uses hash codes for some types to determine the age of an object, is only used by the Serialization package, which is "still experimental and preliminary". Do we need this? Or could we move these over to hashFromAddress, too?
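
The expected bucket counts in the first question can be cross-checked outside Macaulay2. The sketch below assumes the same model as the session above: each of 1000 elements lands independently and uniformly in one of 2048 buckets, so the number of elements in a given bucket is binomial with n = 1000 and p = 1/2048. The function name is made up for this example.

```python
from math import comb

n = 1000        # elements inserted
buckets = 2048  # number of buckets reported by #B
p = 1 / buckets

def expected_buckets_with(k: int) -> float:
    """Expected number of buckets holding exactly k elements."""
    return buckets * comb(n, k) * p**k * (1 - p) ** (n - k)

observed = [1257, 597, 181, 11, 2]  # tally from o4 above
for k, seen in enumerate(observed):
    print(k, round(expected_buckets_with(k), 3), seen)
```

The computed expectations should agree with o7, and the observed tally sits close to them for the small bucket sizes, where most of the mass is.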

From TAOCP Section 6.4.

We use w = 2^32 and A = (sqrt 5 - 1)/2 * 2^32 = 2654435769.
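
As a quick check of the constant, the value of A can be recomputed with exact integer arithmetic. This is only a verification sketch; rounding down (rather than to nearest) is an assumption, though both give the same integer here.

```python
from math import isqrt

W = 2**32

# isqrt(5 * W * W) is floor(sqrt(5) * 2**32), so this is
# floor((sqrt(5) - 1) / 2 * 2**32) without any floating point error.
A = (isqrt(5 * W * W) - W) // 2

print(A, hex(A))
```

The result is odd, so it is relatively prime to w = 2^32, which means multiplication by A modulo w is a bijection on 32-bit values.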
hashFromAddress computes hash codes for mutable objects from their
address in memory.  We first cast the pointer to a long and then to an
int to avoid "cast from pointer to integer of different size" compiler
warnings.

We use Fibonacci hashing to decrease collisions.
Add a helper function newCompiledFunction for simplification.
The functions dbmopenin and dbmopenout are nearly identical, so we
take the opportunity to refactor them to both call a helper function.
@d-torrance d-torrance changed the title Use memory address instead of counter for some mutable types Use memory address instead of counter for hashes of some mutable types Sep 20, 2023
@d-torrance d-torrance marked this pull request as ready for review October 16, 2023 00:33
@DanGrayson DanGrayson merged commit fb79a8f into Macaulay2:development Oct 26, 2023
6 checks passed
@d-torrance d-torrance deleted the hash-codes branch November 9, 2023 02:18