hash_algos() docs should clarify which algos are cryptographic #3616

Open
cmb69 opened this issue Jul 26, 2024 · 11 comments
Labels
enhancement New feature or request Extension: hash

Comments

@cmb69
Member

cmb69 commented Jul 26, 2024

Triggered by https://news-web.php.net/php.internals/124613. Thanks, @IMSoP!

hash_hmac() has a respective changelog entry:

<entry>Usage of non-cryptographic hash functions (adler32, crc32, crc32b, fnv132, fnv1a32, fnv164, fnv1a64, joaat) was disabled.</entry>

I think it's a good idea to also state that in the hash_algos() docs.

@cmb69
Member Author

cmb69 commented Jul 26, 2024

Maybe it is sufficient to clarify that hash_hmac_algos() lists these.
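
As a rough illustration of that idea (a sketch only, assuming PHP 7.2 or later, where hash_hmac_algos() is available; the variable names are made up for this example), the algorithms that hash() accepts but hash_hmac() rejects can be listed by comparing the two arrays:

```php
<?php
// All algorithms known to hash().
$all = hash_algos();

// Algorithms accepted by hash_hmac(), i.e. the ones treated as cryptographic.
$hmacCapable = hash_hmac_algos();

// Whatever is left over is treated as non-cryptographic by the hash extension.
$nonCryptographic = array_values(array_diff($all, $hmacCapable));

echo implode(', ', $nonCryptographic), PHP_EOL;
```

The exact list depends on the PHP version and build, so no output is shown here.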

@Girgias Girgias added enhancement New feature or request Extension: hash labels Jul 26, 2024
@IMSoP
Collaborator

IMSoP commented Jul 26, 2024

This would only solve a tiny portion of the problem I was pointing out.

  • First, it assumes that the user knows what a "cryptographic hash" is, and when they should use one over the opposite (a "non-cryptographic hash"?)
  • Second, it still leaves them with a list of 44 algorithms to choose from, and no guidance whatsoever

What's really needed is:

  • An explanation of different hashing use cases, and terms like "cryptographic hash"
  • An explanation of when to use hash(), hash_hmac(), or password_hash()
  • A list or table with the available algorithms, giving more than just their names
  • Guidance on which algorithms to avoid (here's where you can talk about the weaknesses of MD5 and SHA1!)
  • Some kind of recommendation of what algorithm users should pick for common use cases, if they're not constrained by compatibility

@claudepache

hash_algos() docs should clarify which algos are cryptographic

I’m not sure it’s actually useful information; at least, it’s largely insufficient. For instance, md4 is “cryptographic”, but you shouldn’t use it for anything cryptography-related unless someone holds a gun to your head.

@jimwins
Member

jimwins commented Jul 27, 2024

A common theme in the user-contributed notes for hash() was performance benchmarks, so it's probably worth adding some discussion of that (including why you may not even want the fastest algo). Also, if we're going to have a table of algo information in the documentation, the expected/maximum output size of each would be a good data point to add.

hash() and hash_hmac() should definitely have a common paragraph about their possible use in password situations with reference to password_hash().
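
To make the distinction concrete, here is a minimal sketch (not proposed manual wording; the password, key and message values are placeholders). password_hash() and password_verify() are meant for storing passwords, while hash_hmac() with a secret key is meant for authenticating messages:

```php
<?php
// Password storage: use the dedicated password API, not hash() or hash_hmac().
$password = 'correct horse battery staple';              // placeholder value
$stored   = password_hash($password, PASSWORD_DEFAULT);  // salted, deliberately slow

$loginOk  = password_verify($password, $stored);         // matches the stored hash
$loginBad = password_verify('wrong guess', $stored);     // does not match

// Message authentication: a keyed hash over data that is not a password.
$key     = random_bytes(32);             // secret shared by both parties
$message = 'order=42&amount=10';         // placeholder payload
$tag     = hash_hmac('sha256', $message, $key);

// Recompute the tag on the receiving side and compare in constant time.
$authentic = hash_equals($tag, hash_hmac('sha256', $message, $key));
```

A plain hash() call is fast by design, which is exactly what makes it a poor choice for passwords.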

(I deleted a bunch of the notes on hash(), there were quite a few that were just benchmarks from 5-10 years ago.)

@damianwadley
Member

I just want to be sure of something here: is the goal of this documentation to talk about the PHP functions and how they work, or is the goal to teach developers about how to implement their own version of cryptography?

@IMSoP
Collaborator

IMSoP commented Jul 28, 2024

@damianwadley What do you mean by "implement their own version"? I don't think anyone's expecting users to come up with new, novel, hashing algorithms.

What I am hoping for is some description beyond a name for the 60 different algorithms currently supported by hash(), with some explanation of why a user might want to use them, or why they should avoid them.

@cmb69
Member Author

cmb69 commented Jul 28, 2024

While I agree that the current documentation is somewhat insufficient, I wouldn't go too much into the details; perhaps we can find some good article(s) to link to, instead.

  • An explanation of different hashing use cases, and terms like "cryptographic hash"

A short explanation might be in order, but certainly not a thorough treatment like on https://en.wikipedia.org/wiki/Hash_function or https://en.wikipedia.org/wiki/Cryptographic_hash_function.

  • An explanation of when to use hash(), hash_hmac(), or password_hash()

ACK

  • A list or table with the available algorithms, giving more than just their names

Hmm, maybe some rough categorization might be in order, but detailed explanation about every single algorithm seems out of scope of the PHP manual. Besides, it's already not easy to keep the simple list up to date.

  • Guidance on which algorithms to avoid (here's where you can talk about the weaknesses of MD5 and SHA1!)

That's difficult. Depending on the use case, MD5 and SHA1 might still be fine (and sometimes just necessary for interoperability with already existing hashes). See https://en.wikipedia.org/wiki/Cryptographic_hash_function#Properties for details.

  • Some kind of recommendation of what algorithm users should pick for common use cases, if they're not constrained by compatibility

That's difficult, again. Maybe we could attempt some rough categorization of the available algorithms.

A common theme in the user-contributed notes for hash() was performance benchmarks, so it's probably worth adding some discussion of that (including why you may not even want the fastest algo).

A rough explanation of the performance might make sense, but these benchmarks are pretty useless, in my opinion. After all, some of the algorithms may be implemented with SIMD instructions (with a fallback if these instructions are not available), a few might even have hardware support (e.g. php/php-src#4108), and the implementations may change over time.

@IMSoP
Collaborator

IMSoP commented Jul 28, 2024

Hmm, maybe some rough categorization might be in order, but detailed explanation about every single algorithm seems out of scope of the PHP manual. Besides, it's already not easy to keep the simple list up to date.

I didn't say "detailed explanation", I said "some description beyond a name". The context being that multiple people are claiming that users should be using the hash() function, and choosing the right algorithm; and they don't seem keen on simply adding a function for sha256(), or whatever the "best" algorithm is. So I'm assuming there is more to say about the strengths and weaknesses of different algorithms, in which case we need to present that to users.

Maybe there are some algorithms that can just be labelled "rarely used, included for compatibility with other systems", but right now we don't even have that.

@cmb69
Member Author

cmb69 commented Jul 29, 2024

I'm not an expert on hash functions, so take the following with a huge grain of salt (and please correct me, if I'm wrong). As I see it, there are roughly three categories of hash functions:

  • checksum algorithms (like crc32, adler32):
    These can be used to calculate checksums, for instance, to check for transmission errors. For this reason they are supposed to be very fast (and likely simple to implement). They might also be used if you need an integer value (since they require only a couple of bytes), e.g. for a very simple hash table implementation.
  • other non-cryptographic algorithms (like fnv, murmur):
    These can be used to calculate hash values for hash tables (if you ever need to implement one yourself). They should be fast, but still yield a good distribution over arbitrary string inputs.
  • cryptographic algorithms (like md5, sha*, blake*):
    These are supposed to yield hash values which are representative of their inputs, but are neither guessable (i.e. robust against "pre-image" attacks: you can't feasibly recover an input from its hash value), nor prone to collisions (i.e. it is infeasible to find two distinct inputs that yield the same hash value; that's basically the same as being representative). Their performance is of secondary concern. Of course, unguessability and collision resistance also depend on the entropy of their result (i.e. the number of relevant bits). E.g. a hash function with only one bit of entropy would only be able to distinguish two "categories" of inputs, and would as such be severely prone to collisions (although it would be almost impossible to guess the input from the hash value). This is the reason that there are different variants of several of the hash algorithms: choose the output size your use case needs. Some of the early cryptographic hash algorithms (such as md4, md5, sha1) have been shown to be vulnerable to collision attacks, and as such are better avoided, unless you use them in a way where this doesn't matter much (e.g. for caching, if you ensure that collisions won't be a problem).

So "usually" this boils down to:

  • compatibility: use whatever algorithm is required
  • cryptographic purposes: use sha2 or sha3
  • checksums: use crc32* or adler32
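
A small sketch of how these categories look in practice (illustrative only; the algorithm picks and the input string are just examples, and sha3-256 assumes PHP 7.1 or later). The length of the hex digest gives a feel for how many bits each algorithm produces:

```php
<?php
$input = 'The quick brown fox jumps over the lazy dog';

// One representative per rough category from the list above.
$examples = [
    'crc32b'   => 'checksum',
    'fnv1a64'  => 'non-cryptographic',
    'sha256'   => 'cryptographic (sha2)',
    'sha3-256' => 'cryptographic (sha3)',
];

$bits = [];
foreach ($examples as $algo => $category) {
    $digest = hash($algo, $input);
    // Each hex character encodes 4 bits of output.
    $bits[$algo] = strlen($digest) * 4;
    printf("%-8s %-22s %3d bits  %s\n", $algo, $category, $bits[$algo], $digest);
}
```

Output size alone says nothing about security, of course; it only bounds how many distinct inputs can be told apart.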

jimwins added a commit to jimwins/doc-en that referenced this issue Jul 30, 2024
Based on this comment by @cmb69:
php#3616 (comment)

Related to issue php#3616.
@jimwins
Member

jimwins commented Jul 30, 2024

@cmb69, I thought that was a good starting point for beefing up the introduction to the documentation for the hash extension! PR is just a draft, feel free to suggest changes and additions and maybe we can address some of the other areas that @IMSoP identified.

@cmb69
Member Author

cmb69 commented Jul 30, 2024

Quick note to not forget about it: maybe link to https://csrc.nist.gov/projects/hash-functions (see https://news-web.php.net/php.internals/124678).

jimwins added a commit that referenced this issue Jul 31, 2024
* Add more description for hash extension

Based on this comment by @cmb69:
#3616 (comment)

With additional feedback from @TimWolla