I found that det_hash can sometimes hash equal objects differently. This was causing me a lot of grief when trying to manually initialize a step and retrieve its cached results. I feel that the default hashing in Tango should (at least) respect __eq__ for built-in types and fall back to pickle serialization for more complex and custom data structures.
Examples
import sys

import dill
from tango.common.det_hash import det_hash

x = ["bert", "bert"]
y = ["bert", "bert "[:-1]]
assert x == y
assert det_hash(x) == det_hash(y)  # AssertionError

# Also, suppose you invoke this script as `python script.py bert bert`
assert x == sys.argv[1:]
assert det_hash(x) == det_hash(sys.argv[1:])  # AssertionError

# I understand that this is because the object hierarchies may be different for `x` and `y`.
x_addr = [id(i) for i in x]
y_addr = [id(i) for i in y]
assert x_addr[0] == x_addr[1]
assert y_addr[0] == y_addr[1]  # AssertionError

# Therefore, `x` and `y` pickle differently.
assert dill.dumps(x) == dill.dumps(y)  # AssertionError
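The identity dependence can also be reproduced without Tango or dill. The following is a minimal sketch using only the standard-library pickle module, on the assumption that dill inherits the same memoization behaviour: when the same object appears twice, the pickler writes a back-reference for the second occurrence instead of the value, so two equal lists can serialize to different bytes. The runtime `"".join(...)` is just one way to obtain an equal but distinct string object.

```python
import pickle

same = "bert"
x = [same, same]                     # both slots reference one str object
y = [same, "".join(["be", "rt"])]    # equal content, second slot is a separate object

x_bytes = pickle.dumps(x)
y_bytes = pickle.dumps(y)

print("equal lists:        ", x == y)
print("same element ids (x):", x[0] is x[1])
print("same element ids (y):", y[0] is y[1])
print("equal pickles:      ", x_bytes == y_bytes)
# Both byte strings still load back to equal lists.
print("equal after loading:", pickle.loads(x_bytes) == pickle.loads(y_bytes))
```

Hashing the pickled bytes therefore encodes the sharing structure of the object graph, not just its value.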
Solutions
In Python, object hashes (hash or __hash__) are indeed expected to respect equality, i.e.
x == y implies hash(x) == hash(y)
However, there are a few problems with using __hash__:
mutable data types are not supposed to implement __hash__
not all classes will implement __hash__
determinism can only be toggled when initializing the interpreter (hash randomization for str and bytes is controlled by PYTHONHASHSEED, which is read at startup)
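To make these three points concrete, here is a small sketch using only the standard library. The helper names (`is_hashable`, `str_hash_in_fresh_interpreter`) and the `Point` class are made up for illustration. It checks that a list and a class defining only `__eq__` are unhashable, and that `hash("bert")` depends on the `PYTHONHASHSEED` the interpreter was started with.

```python
import os
import subprocess
import sys


def is_hashable(o) -> bool:
    """Return True if hash(o) succeeds."""
    try:
        hash(o)
    except TypeError:
        return False
    return True


class Point:
    """Defines __eq__ without __hash__, so Python sets __hash__ to None."""

    def __init__(self, x):
        self.x = x

    def __eq__(self, other):
        return isinstance(other, Point) and self.x == other.x


def str_hash_in_fresh_interpreter(seed: str) -> int:
    """Compute hash('bert') in a new interpreter started with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('bert'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    print("list hashable: ", is_hashable(["bert"]))
    print("Point hashable:", is_hashable(Point(1)))
    print("seed 0 twice:  ", str_hash_in_fresh_interpreter("0"), str_hash_in_fresh_interpreter("0"))
    print("seed 1 vs 2:   ", str_hash_in_fresh_interpreter("1"), str_hash_in_fresh_interpreter("2"))
```

The seed can only be chosen for a new process here, which is why the sketch has to launch a subprocess rather than change it in place.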
I propose the following implementation:
1. At hash time in Tango, we just want the object's instantaneous hash and do not actually care whether objects are immutable. So we can manually hash the mutable built-in types. Also, since mutable types may be nested, we can hash recursively.
2. We can simply fall back to pickle serialization for objects that are not hashable. I am assuming that hash collisions between __hash__ and pickle are negligible.
3. For determinism, two suggestions:
3.1. We could write a simple C extension to temporarily disable hash randomization during Tango's hashing function. (Note: I'm not personally familiar with CPython and am just assuming that this variable can be changed.) Then, we will have deterministic hashing for all types that implement __hash__.
3.2. Alternatively, we could write a custom function with deterministic hashing logic for all built-in Python types. This would be recursive (like rec_hash below) to handle nested data structures. But custom classes that do implement __hash__ will not be supported.
Together (with 3.1) that's:
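import collections.abc
import hashlib
import io
from typing import Any

import base58

# `_DetHashPickler` is the existing pickler in tango/common/det_hash.py.
# `hash_seed` does not exist yet (see the TODO below).


def rec_hash(o: Any) -> int:
    if isinstance(o, (str, bytes)):
        # str and bytes are sequences too; hash them directly to avoid infinite recursion
        return hash(o)
    elif isinstance(o, collections.abc.Sequence):  # tuples, lists, ranges
        # build a tuple: hashing a bare generator would hash the generator object itself
        return hash(tuple(rec_hash(x) for x in o))
    elif isinstance(o, set):
        # set elements are guaranteed to be hashable
        return hash(frozenset(o))
    elif isinstance(o, dict):
        # dict keys are guaranteed to be hashable
        return hash(tuple(sorted(hash((k, rec_hash(v))) for k, v in o.items())))
    elif isinstance(o, collections.abc.Hashable):
        # nested types may be un-hashable, could raise TypeError
        return hash(o)
    raise TypeError(f"unhashable type: '{type(o).__name__}'")


def det_hash(o: Any) -> str:
    try:
        with hash_seed(0):  # TODO: need to implement this
            h = rec_hash(o)
        # hash() can be negative; map it to an unsigned 64-bit integer before encoding
        return base58.b58encode_int(h % (1 << 64)).decode()
    except TypeError:
        pass

    # Fallback to pickling
    m = hashlib.blake2b()
    with io.BytesIO() as buffer:
        pickler = _DetHashPickler(buffer)
        pickler.dump(o)
        m.update(buffer.getbuffer())
        return base58.b58encode(m.digest()).decode()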
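For comparison, option 3.2 could look roughly like the sketch below. It feeds built-in values into a hashlib digest by content, so the result depends neither on object identity nor on PYTHONHASHSEED. The names `stable_hash` and `_feed` are hypothetical, and this is not Tango's implementation. One deliberate simplification: values are tagged with their type name, so `1`, `1.0` and `True` get different digests even though they compare equal.

```python
import hashlib
from typing import Any


def stable_hash(o: Any) -> str:
    """Hypothetical helper: hex digest of a built-in value, computed from its content."""
    m = hashlib.blake2b(digest_size=16)
    _feed(m, o)
    return m.hexdigest()


def _feed(m, o: Any) -> None:
    if o is None or isinstance(o, (bool, int, float, complex)):
        m.update(f"{type(o).__name__}:{o!r};".encode())
    elif isinstance(o, str):
        data = o.encode("utf-8")
        m.update(b"str:%d:" % len(data) + data)
    elif isinstance(o, (bytes, bytearray)):
        m.update(b"bytes:%d:" % len(o) + bytes(o))
    elif isinstance(o, (list, tuple)):
        # ordered containers: feed the elements in order
        m.update(f"{type(o).__name__}:{len(o)}[".encode())
        for item in o:
            _feed(m, item)
        m.update(b"]")
    elif isinstance(o, (set, frozenset)):
        # unordered: sort the element digests so iteration order does not matter
        m.update(b"set[")
        for digest in sorted(stable_hash(item) for item in o):
            m.update(digest.encode())
        m.update(b"]")
    elif isinstance(o, dict):
        # sort (key digest, value digest) pairs so insertion order does not matter
        m.update(b"dict[")
        for k_digest, v_digest in sorted((stable_hash(k), stable_hash(v)) for k, v in o.items()):
            m.update(k_digest.encode() + b"=" + v_digest.encode())
        m.update(b"]")
    else:
        # custom classes are out of scope for this sketch; a caller could fall back to pickling
        raise TypeError(f"unsupported type: '{type(o).__name__}'")


if __name__ == "__main__":
    x = ["bert", "bert"]
    y = ["bert", "".join(["be", "rt"])]  # equal content, distinct string object
    print(stable_hash(x) == stable_hash(y))
```

As noted in 3.2, anything that is not a built-in type raises TypeError here, so the pickle fallback would still be needed for custom classes.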
Please let me know what you think, thanks!
Versions
Python 3.9.16
ai2-tango==1.2.1
base58==2.1.1
black==23.7.0
boto3==1.28.54
botocore==1.31.54
cached-path==1.4.0
cachetools==5.3.1
certifi==2023.7.22
charset-normalizer==3.2.0
click==8.1.7
click-help-colors==0.9.2
dill==0.3.7
filelock==3.12.4
fsspec==2023.9.2
gitdb==4.0.10
GitPython==3.1.37
glob2==0.7
google-api-core==2.12.0
google-auth==2.23.0
google-cloud-core==2.3.3
google-cloud-storage==2.11.0
google-crc32c==1.5.0
google-resumable-media==2.6.0
googleapis-common-protos==1.60.0
huggingface-hub==0.16.4
idna==3.4
jmespath==1.0.1
markdown-it-py==3.0.0
mdurl==0.1.2
more-itertools==9.1.0
packaging==23.1
petname==2.6
protobuf==4.24.3
pyasn1==0.5.0
pyasn1-modules==0.3.0
Pygments==2.16.1
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
requests==2.31.0
rich==13.5.3
rjsonnet==0.5.3
rsa==4.9
s3transfer==0.6.2
six==1.16.0
smmap==5.0.1
sqlitedict==2.1.0
tqdm==4.66.1
typing_extensions==4.8.0
urllib3==1.26.16
xxhash==3.3.0