Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

det_hash is not consistent with __eq__ #602

Open
apoorvkh opened this issue Sep 26, 2023 · 0 comments
Open

det_hash is not consistent with __eq__ #602

apoorvkh opened this issue Sep 26, 2023 · 0 comments
Labels
bug Something isn't working

Comments

@apoorvkh
Copy link
Contributor

🐛 Describe the bug

Problem

I found that det_hash can sometimes hash equal objects differently. This was causing me a lot of grief when trying to manually initializing a step and retrieve its cached results. I feel that the default hashing in Tango should (at least) respect __eq__ for built-in types and fallback to Pickle serialization for more complex and custom data structures.

Examples

from tango.common.det_hash import det_hash

x = ["bert", "bert"]
y = ["bert", "bert "[:-1]]
assert x == y
assert det_hash(x) == det_hash(y)  # AssertionError

# Also, suppose you invoke this script as `python script.py bert bert`
assert x == sys.argv[1:]
assert det_hash(x) == det_hash(sys.argv[1:])  # AssertionError

# I understand that this is because the object hierarchies may be different for `x` and `y`.

x_addr = [id(i) for i in x]
y_addr = [id(i) for i in y]
assert x_addr[0] == x_addr[1]
assert y_addr[0] == y_addr[1]  # AssertionError

# Therefore, `x` and `y` pickle differently.

import dill
assert dill.dumps(x) == dill.dumps(y)  # AssertionError

Solutions

In Python, object hashes (hash or __hash__) are indeed expected to respect equality: i.e.

x == y implies both that x is y and hash(x) == hash(y)

However, there are a few problems with using __hash__:

  1. mutable data types are not supposed to implement __hash__

  2. not all classes will implement __hash__

  3. determinism can only be toggled when initializing the interpreter

I propose the following implementation:

  1. At hash time in Tango, we just want the object's instantaneous hash and do not actually care whether objects are immutable. So we can manually hash the mutable built-in types. Also, since mutable types may be nested, we can hash recursively.

  2. We can simply fall back to pickle serialization for objects that are not hashable. I am assuming that hash collisions between __hash__ and pickle are negligible.

  3. Two suggestions:

    3.1. We could write a simple C extension to temporarily disable hash determinism during Tango's hashing function. (Note: I'm not personally familiar with cpython, am just assuming that this variable can be changed.) Then, we will have deterministic hashing for all types that implement __hash__.
    3.2. Alternatively, we could write a custom function with deterministic hashing logic for all built-in Python types. This would be recursive (like rec_hash above) to handle nested data structures. But custom classes that do implement __hash__ will not be supported.

Together (with 3.1) that's:

def rec_hash(o: Any) -> str:
    if isinstance(o, collections.abc.Sequence):  # tuples, lists, ranges
        return hash((rec_hash(x) for x in o))
    elif isinstance(o, set):
        # set elements are guaranteed to be hashable
        return hash(frozenset(o))
    elif isinstance(o, dict):
        # dict keys are guaranteed to be hashable
        return hash(sorted(hash((k, rec_hash(v)) for k,v in o.items())))
    elif isinstance(o, collections.abc.Hashable):
        # nested types may be un-hashable, could raise TypeError
        return hash(o)
    raise TypeError(f"unhashable type: '{type(o).__name__}'")


def det_hash(o: Any) -> str:
    try:
        with hash_seed(0):  # TODO: need to implement this
            h = rec_hash(o)
        return base58.b58encode_int(h)
    except TypeError:
        pass

    # Fallback to pickling
    m = hashlib.blake2b()
    with io.BytesIO() as buffer:
        pickler = _DetHashPickler(buffer)
        pickler.dump(o)
        m.update(buffer.getbuffer())
        return base58.b58encode(m.digest()).decode()

Please let me know what you think, thanks!

Versions

Python 3.9.16
ai2-tango==1.2.1
base58==2.1.1
black==23.7.0
boto3==1.28.54
botocore==1.31.54
cached-path==1.4.0
cachetools==5.3.1
certifi==2023.7.22
charset-normalizer==3.2.0
click==8.1.7
click-help-colors==0.9.2
dill==0.3.7
filelock==3.12.4
fsspec==2023.9.2
gitdb==4.0.10
GitPython==3.1.37
glob2==0.7
google-api-core==2.12.0
google-auth==2.23.0
google-cloud-core==2.3.3
google-cloud-storage==2.11.0
google-crc32c==1.5.0
google-resumable-media==2.6.0
googleapis-common-protos==1.60.0
huggingface-hub==0.16.4
idna==3.4
jmespath==1.0.1
markdown-it-py==3.0.0
mdurl==0.1.2
more-itertools==9.1.0
packaging==23.1
petname==2.6
protobuf==4.24.3
pyasn1==0.5.0
pyasn1-modules==0.3.0
Pygments==2.16.1
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
requests==2.31.0
rich==13.5.3
rjsonnet==0.5.3
rsa==4.9
s3transfer==0.6.2
six==1.16.0
smmap==5.0.1
sqlitedict==2.1.0
tqdm==4.66.1
typing_extensions==4.8.0
urllib3==1.26.16
xxhash==3.3.0

@apoorvkh apoorvkh added the bug Something isn't working label Sep 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant