Support for xxh3-64 #16

Open
vivadavid opened this issue Jul 8, 2023 · 9 comments

Comments

@vivadavid

Hi,

I wanted to know if support for xxh3-64 could be added. Thanks!

@idrassi
Owner

idrassi commented Jul 22, 2023

Yes, this is planned, although I have a preference for XXH128 because of its better collision resistance.

@vivadavid
Author

Hi! TeraCopy now uses xxh3-64 as the default algorithm for hash checks, and I've seen people recommend it in other places too. Is xxh3-128 faster? In my case, I don't need the hash check for security purposes: I just need to find out whether two files are identical, or make sure a file stored on my hard drive or SSD isn't corrupted.

@idrassi
Owner

idrassi commented Jul 22, 2023

Thank you for bringing Teracopy's use of xxh3-64 to my attention.

Concerning the speed, xxh3-128 is slower than xxh3-64. The increased computation time is because it generates a longer hash value, specifically twice as long. While this does mean that operations are slower, it also significantly reduces the risk of collision.

The risk of collision is directly related to the size of the hash. With a 64-bit hash, there are 2^64 different possible hash values. With a 128-bit hash, there are 2^128 different possible hash values. The difference between these two numbers is colossal. The more potential outputs a hash function can have, the less likely it is for two different inputs to produce the same output (a collision).
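To put rough numbers on this, here is a minimal sketch using the standard birthday-bound approximation, p ≈ 1 − exp(−n(n−1) / 2^(b+1)), for n uniformly random b-bit hashes. The function name and the one-million-file example are illustrative assumptions, and real hash functions only approximate uniform randomness.

```python
import math


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_items random hashes collide."""
    space = 2.0 ** hash_bits
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2^bits))
    exponent = -num_items * (num_items - 1) / (2.0 * space)
    # expm1 keeps precision when the probability is extremely small
    return -math.expm1(exponent)


# Hypothetical collection of one million files
for bits in (32, 64, 128):
    print(f"{bits:>3}-bit: {collision_probability(1_000_000, bits):.3e}")
```

For a million files, the 32-bit case is a near-certain collision, the 64-bit case is on the order of one in tens of millions, and the 128-bit case is vanishingly small. This only covers accidental collisions, not deliberate ones.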

While it's true that, in theory, a 64-bit hash could be used to identify unique data, in practice the risk of collision is too high for most use cases. This is one reason hashing algorithms like MD5 produce a 128-bit hash. Two different inputs can share an MD5 hash, and such collisions can be constructed deliberately, but the sheer number of possible hash values makes an accidental collision highly unlikely.

If you're just using a hash to check for file corruption, a 64-bit hash might be sufficient. However, if you're using the hash to uniquely identify a file or other data (which is often the case in data integrity checks and other similar uses), a 128-bit hash is generally preferable due to the significantly reduced risk of collision.

That being said, in the spirit of flexibility and catering to diverse user needs, I plan to integrate both xxh3-64 and xxh3-128. This will empower users to select the hashing algorithm most suitable for their specific circumstances and requirements.

@vivadavid
Author

Hi, @idrassi ,

Thank you for integrating both algorithms.

For aficionados like me, hashing is a tricky issue. I struggle to make certain decisions, most notably the algorithm that best suits my needs. My use case for hashing is as follows:

  1. I download, for example, a PDF from two sources and I want to know if they correspond exactly to the same file.
  2. I have the ISO for Windows or Linux and, before copying it to a USB drive with Rufus, I want to make sure that the file, after being stored for months on my hard drive, isn't corrupted.
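Both checks come down to hashing a file in chunks and comparing digests. Here is a minimal sketch, assuming Python's standard `hashlib`; the names `hash_file` and `same_content` are hypothetical, and SHA-256 is only a stand-in, since xxh3-64 and BLAKE3 are not in the standard library and would need third-party packages.

```python
import hashlib
from pathlib import Path


def hash_file(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to bound memory use."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()


def same_content(path_a, path_b, algorithm="sha256"):
    """Check whether two files have identical content by comparing digests."""
    # Files of different sizes cannot be identical, so skip hashing entirely
    if Path(path_a).stat().st_size != Path(path_b).stat().st_size:
        return False
    return hash_file(path_a, algorithm) == hash_file(path_b, algorithm)
```

For the first use case, `same_content("a.pdf", "b.pdf")` would answer the question directly. For the second, the digest from `hash_file` would be recorded when the ISO is downloaded and recomputed before writing it to the USB drive.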

Basically I'm not concerned about security, about the fact that somebody has replaced one of my files for obscure purposes. All my files are stored locally.

So far, I've been using MD5. I didn't think about it much before I chose it, but it seemed more modern and safer than, for example, CRC-32 and faster than others.

It looks like 64 bits generate a huge number of combinations, despite the fact that it's not remotely comparable with 128 bits. Speed is an issue for me. Checksums tend to take more time than reasonable, as I don't have portable SSDs, but just hard drives. I presume that the time taken by the calculation depends on CPU speed and the reading speed of your storage device, so having hard drives is my bottleneck.
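One way to check whether the drive or the CPU is the bottleneck is to time the reads and the hashing separately. This is a rough sketch, not a rigorous benchmark; the function name is hypothetical and SHA-256 from the standard library stands in for whichever algorithm is being considered.

```python
import hashlib
import time


def time_read_and_hash(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file while accumulating time spent reading versus hashing."""
    hasher = hashlib.new(algorithm)
    read_seconds = 0.0
    hash_seconds = 0.0
    total_bytes = 0
    with open(path, "rb") as f:
        while True:
            t0 = time.perf_counter()
            chunk = f.read(chunk_size)
            t1 = time.perf_counter()
            if not chunk:
                break
            hasher.update(chunk)
            t2 = time.perf_counter()
            read_seconds += t1 - t0
            hash_seconds += t2 - t1
            total_bytes += len(chunk)
    return {
        "bytes": total_bytes,
        "read_seconds": read_seconds,
        "hash_seconds": hash_seconds,
        "digest": hasher.hexdigest(),
    }
```

If `read_seconds` dominates, a faster algorithm will not help much. Note that the operating system may cache a file after the first read, so a second run on the same file can make the drive look much faster than it is.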

Anyway, I keep having second thoughts about the algorithm that makes more sense to me, so having all the possible options available is always great. Thanks again!

@vivadavid
Author

Hi,

Just a quick note to inform you that, after reading your comments and other information out there, I eventually switched to Blake3 instead of xxh3-64.

In any case, having xxh3-64 available at some point would be great so users can choose the algorithm they need.

@redactedscribe

redactedscribe commented Nov 26, 2023

Related: #10, #19

@sergeevabc

sergeevabc commented Dec 28, 2023

@vivadavid, there's a substantiated tutorial by @jolynch on when it's appropriate to use xxHash and BLAKE3. Being a user, I've been thinking about this together with developers since the emergence of both algorithms, that is for years, and came to the same conclusion.

Ideally, I should

  • use XXH3-64 to verify the integrity of non-sensitive data (e.g. to make sure the media collection has no signs of data decay and missing files)
  • use BLAKE3 to reduce the possibility of intentional data tampering (e.g. to share things with my name on it -- apps, scripts, patches, translations, etc -- in a way that you can verify their content has not been changed by the third party).
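The first case, checking a media collection for data decay and missing files, can be sketched as a manifest of digests that is built once and verified later. This is an illustration under assumptions: the function names are hypothetical, and BLAKE2b from Python's standard library stands in for XXH3-64, which would need a third-party package.

```python
import hashlib
from pathlib import Path


def digest(path, chunk_size=1024 * 1024):
    """Hash one file in chunks (BLAKE2b with a 16-byte digest as a stand-in)."""
    hasher = hashlib.blake2b(digest_size=16)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return (missing, changed) lists of relative paths."""
    root = Path(root)
    missing, changed = [], []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            missing.append(rel_path)
        elif digest(target) != expected:
            changed.append(rel_path)
    return missing, changed
```

The manifest could be saved as JSON next to the collection. A file listed under `changed` has different content than when the manifest was built, which covers both decay and legitimate edits, so the two still have to be told apart by hand.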

But I see problems with that idealistic approach (as of December 2023):

  1. What is the state of the software that deals with hashes? It's a mess with a poorly designed façade and no established or trendy solutions like KeePass and VeraCrypt in their areas. People en masse, in my experience, rely (if they rely at all) on SHA256 for everything without giving a second thought as to why and how. As for hash developers, BLAKE2 let us down by rolling out several flavors with no clarity about which one is meant for a general audience, and the same uncertainty is paralyzing the widespread use of xxHash. Its default flavor is XXH64, but XXH3-64 is recommended by the author @Cyan4973, whereas respected @idrassi argues above that XXH3-128 is a better choice. The yearning for clarity and certainty prompts one to opt for BLAKE3, because a single flavor is promoted and it sounds vernacular (e.g. William Blake, an English poet and painter).
  2. Where do I publish a cryptographic BLAKE3 hash? Where it is more difficult for an attacker to gain access than to the file itself. If both are published on the same site (e.g. forum), an attacker will likely change both. So I should either sign a hash or keep it separately (e.g. on a server I am sure of, but do I have one?). Otherwise, it's a false sense of security, a non-cryptographic XXH3-64 hash is enough then. But… this acronym sounds unmemorable, can be easily confused with other xxHash flavors, it is not a default although recommended, not to mention a common drive to escape anxiety with more bits (128 vs 64), so… BLAKE3 then? Aggrrhh!

Now, how do I explain to comrades that hashing is a useful trick in the baggage of worldly experience, like taking care of shoes after a walk on a snowy street sprinkled with anti-ice chemicals? (Happy New Year, btw.) I see with the shoes: wipe it down with a rag and put it out to dry, then it will last you longer. Giving similar easy-to-follow instructions about hashing and its benefits is still a challenge. At times it seems that hashing is never meant to be a desktop tool, but just a tech under the hood of other apps (backup, messengers, torrents, etc), and yet, I believe, it's a path worth exploring.

@vivadavid
Author

Thanks for your message, @sergeevabc , and for providing the tutorial. So far, I've been using BLAKE3, even if my purpose is "to verify the integrity of non-sensitive data" because it's reliable and fast, but I'm always open to change. It's a very complex issue.

@vivadavid
Author

Today I've decided to do some testing on Teracopy to compare xxh3-64 and Blake3. My results indicate that xxh3-64 is noticeably faster, so I'm switching back.

I can't wait to use xxh3-64 on HashCheck whenever it's feasible! Until then, I'll stick to Blake3 as second best for my purposes.
