Rearrange alphabet
to make specials and digits more likely to be boosted when there is modulo bias.
alanhkarp committed Jul 8, 2024
1 parent 08d7e82 commit 9b26e1a
Showing 3 changed files with 24 additions and 18 deletions.
Binary file modified notes/Collisions.pdf
Binary file not shown.
30 changes: 18 additions & 12 deletions notes/Collisions.tex
@@ -112,49 +112,55 @@ \section{Computing an Acceptable Password}
\begin{enumerate}
\item Rotate the array of characters by one position and try again until an acceptable password is found or the array is back to where it started.

-\item Convert more bytes of the hash to characters than needed for the password, and convert the first $L$ bytes into the candidate password. If it doesn't pass, drop the first of those bytes and try again until an acceptable password is found or the characters have all been used. The number of characters produced from the hash is limited by the required latency.
+\item Compute more bytes of the hash than needed for the password. Convert the first set of bytes to the candidate password. If it doesn't pass, drop the first byte and try again until an acceptable password is found or the bytes have all been used. The number of bytes produced from the hash is limited by the required latency (see the sketch after this list).

\item Keep computing additional hashes until an acceptable password is generated. These should be fast hashes in order to increase the chance of finding a valid password within the required latency. Stop if one hasn't been found in a time limited by the latency requirement.

-\item Insert characters of the the required types at the beginning of the output and fill the rest with characters selected from the alphabet. Then shuffle the position of the characters based on bytes from the hash to make the site password less guessable. There are four variants.
+\item Insert characters of the required types at the beginning of the output and fill the rest with characters selected from the alphabet using bytes of the hash. Then shuffle the position of the characters based on bytes from the hash to make the site password less guessable. There are four variants.

\begin{enumerate}
-\item Use the same bytes for both selecting characters and for shuffling. This approach makes the site password more guessable since the adversary knows the random numbers used for the shuffle.
+\item Use the same bytes for both selecting characters and for shuffling. This approach makes the site password more guessable since the adversary knows the random numbers used for the shuffle of a particular guess.

\item Use one additional byte from the hash per character for shuffling.

\item Use one additional byte from the hash as the seed for a pseudo-random number generator.

-\item Use one byte of the hash to both select a random character from the allowed set and another byte to select a random place to put it in the site password. Select another byte if that location is occupied. Experiments show it uses 3-5 additional bytes per character in the calculated password. As a result, the hash must return more bytes than may be needed, reducing the number of iterations an adversary needs to do.
+\item Use one byte of the hash to select a random character from the allowed set and another byte to select a random place to put it in the site password. Select another byte if that location is occupied. Experiments show this algorithm uses 3-5 additional bytes per character in the calculated password.
\end{enumerate}
\end{enumerate}
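A minimal sketch of approach 2 in JavaScript (the language of src/generate.js below). bytesToPassword is a hypothetical helper that maps hash bytes onto the allowed alphabet; verifyPassword is the rule check that appears in src/generate.js.

// Approach 2 (sketch): derive extra hash bytes up front, then slide an
// L-byte window along the hash until the candidate passes the site's rules.
function passwordFromExtraBytes(hashBytes, L, settings) {
    for (let start = 0; start + L <= hashBytes.length; start++) {
        const candidate = bytesToPassword(hashBytes.subarray(start, start + L), settings);
        if (verifyPassword(candidate, settings)) return candidate;
    }
    return ""; // ran out of bytes without finding an acceptable password
}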

-These approaches have been tested and are able to generate an acceptable password with reasonable probability in even extreme cases, such as requiring one number, one upper case letter, one lower case letter, and 5 special characters for a 12-character site password.
+These approaches have been tested and are able to generate an acceptable password with high probability in even extreme cases, such as requiring one number, one upper case letter, one lower case letter, and 5 special characters for a 12-character site password. However, 2 and 4d compute bits of the hash that may not be used in constructing the site password, reducing the number of iterations that can be done in the allowed time. As a result, an adversary who only needs to compute the bits actually used in constructing the password gains an advantage.

-The approach 4 is guaranteed to find an acceptable password if one exists,{ \em e.g.}, the total number of required characters doesn't exceed the password length. However, that password may be weak, {\em i.e}, it could consist of many copies of the same character. As a result, that step may have to be repeated
+Approach 4 is guaranteed to find an acceptable password if one exists, {\em i.e.}, the total number of required characters doesn't exceed the password length. However, that password may be weak, {\em i.e.}, it could consist of many copies of the same character. As a result, that step may have to be repeated.

-The possibility that the password derived from the initial hash doesn't meet the site's rules, which results in more collisions, favors using one of the first three options The ability to always construct an acceptable site password favors one of the variants of the last one. A compromise solution is to use 4(a) only when 3 fails to produce an acceptable password, which protects your super password at the cost of making the site password more guessable. Note that 3 has never failed in over 10 years' use and testing on 100s of additional sites.
+The possibility that the password derived from the initial hash doesn't meet the site's rules, which results in more collisions, favors using one of the first three options. The ability to always construct an acceptable site password favors one of the variants of the last one. A compromise solution is to use 4(a) only when 3 fails to produce an acceptable password, which protects your super password at the cost of making the site password more guessable. Note that 3 has never failed in over 10 years' use and testing on hundreds of additional sites.
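The compromise policy fits in a few lines. This sketch assumes hypothetical helpers: generateCandidate computes one more fast hash per try (approach 3), and insertRequiredChars implements variant 4(a).

// Compromise (sketch): retry fast hashes a bounded number of times, then
// fall back to deterministic insertion with a hash-byte shuffle (4a).
async function acceptablePassword(superpw, salt, settings, maxTries = 16) {
    for (let i = 0; i < maxTries; i++) {
        const candidate = await generateCandidate(superpw, salt, i, settings);
        if (verifyPassword(candidate, settings)) return candidate;
    }
    return insertRequiredChars(superpw, salt, settings); // always succeeds
}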

\section{Further Discussion of 4}

Consider how approach 4 works for the case where the password rules require one each of upper case, lower case, digit, and special character. These will appear in the first four positions before the shuffle. The adversary knows that the first character was selected from a set of 52; the second, a set of 26; the third a set of 10; and the fourth from a set specified in the password rules, say 8. In other words, about 100 times fewer guesses are needed if the adversary can un-shuffle the characters. That task is simplified in 4a because the adversary knows the bytes to use for the shuffle.

-Approach 4b uses additional bytes from the hash to do the shuffle, either one byte as the seed for a pseudo-random number generator or one additional byte per character in the site password. In either case, the adversary needs fewer online tests because more bytes go into computing the site password. To see this assume that a guess has all the characters that appear in a known site password but not necessarily in the right order. In addition, assume the guess has produced a site password structured as it would be before the shuffle. The adversary then uses the next byte(s) in the hash to shuffle the characters, and rejects the guess if the result doesn't match the known site password. As a result, the exponent of the number of collisions that must be tested online is reduced from $S-P$ to $S-P-8N$, where $N=L$ if a byte per character is used for the shuffle or $N=1$ if one byte is used as a random number seed.
+Approach 4b uses additional bytes from the hash to do the shuffle, either one byte as the seed for a pseudo-random number generator or one additional byte per character in the site password. In either case, the adversary needs fewer online tests because more bytes go into computing the site password. To see this, assume that a guess has all the characters that appear in a known site password but not necessarily in the right order. The adversary then uses the next byte(s) in the hash to unshuffle the characters, and rejects the guess if the result doesn't match the known site password. As a result, the exponent of the number of collisions that must be tested online is reduced from $S-P$ to $S-P-8N$, where $N=L$ if a byte per character is used for the shuffle or $N=1$ if one byte is used as a random number seed.
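For concreteness, variant 4(b)'s shuffle might look like the sketch below; shuffleWithHashBytes is a hypothetical name, and bytes are the extra hash bytes beyond those that selected the characters.

// Fisher-Yates shuffle driven by one extra hash byte per swap (sketch).
// Anyone who can recompute `bytes` can run the shuffle backwards, which
// is why the online-collision exponent drops from S-P to S-P-8N.
function shuffleWithHashBytes(password, bytes) {
    const a = password.split("");
    for (let i = a.length - 1; i > 0; i--) {
        const j = bytes[a.length - 1 - i] % (i + 1); // small modulo bias here too
        [a[i], a[j]] = [a[j], a[i]];
    }
    return a.join("");
}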

\section{Strength Meters}

-Users are more likely to pick strong super passwords if there is a meter to warn them of weak ones. The strength estimator of zxcvbn provides one that looks for a number of patterns that can reduce the number of tries to guess the password. It falls back on brute force for characters that aren't in any other pattern and uses $10^M$, where $M$ is the number of character in the brute force segment. The rationale is that there are words that don't appear in standard dictionaries but might show up in others. Their paper gives Teiubesc, which means "I love you" in Romanian as an example. Better to be conservative they say.
+Users are more likely to pick strong super passwords if there is a meter to warn them of weak ones. The strength estimator of zxcvbn provides one that looks for a number of patterns that can reduce the number of tries needed to guess the password. It falls back on brute force for characters that aren't in any other pattern and uses $10^M$, where $M$ is the number of characters in the brute force segment. The rationale is that there are words that don't appear in standard dictionaries but might show up in others. Their paper gives Teiubesc, which means "I love you" in Romanian, as an example. Better to be conservative, they say.

This value is supported by the work of Shannon, who estimated that there are between 1 and 2 bits of entropy per character in English text. That value is certainly an underestimate for someone constructing a password. It is more likely that there are at least 4 bits of entropy per character in a human-chosen password. Using 4 instead of $\log_2 |A|$ for the substrings for which zxcvbn uses brute force is likely to dramatically underestimate the password's strength.
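A worked comparison for an assumed 12-character brute-force segment over a 96-character alphabet:

// Entropy of a 12-character brute-force segment under each estimate
// (the segment length and alphabet size are assumptions for illustration).
const M = 12, A = 96;
const zxcvbnBits = M * Math.log2(10); // ~39.9 bits: zxcvbn's 10 guesses per character
const humanBits  = M * 4;             //  48 bits:   ~4 bits per human-chosen character
const randomBits = M * Math.log2(A);  // ~79.0 bits: uniform draws from the alphabet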

-The problem with using the more conservative value for the site password is that it encourages people to use much longer site passwords than they need. As a result, the entropy difference between it and the super password is likely to be small or even negative, resulting in offline guessing attacks being tractable.
+The problem with using the more conservative value is that it encourages people to use much longer site passwords than they need. As a result, the entropy difference between it and the super password is likely to be small or even negative, resulting in offline guessing attacks being tractable.

-Using a conservative estimate makes perfect sense for the super password, which is picked by a person and matches the zxcvbn use case, but not the site password, which is calculated with a good hash function. As shown earlier, the likelihood of generating guessable substrings is small but non-zero, hence the need to have a meter. Calculating the zxcvbn score with $\log_2 |A|$ for the brute force substrings is more appropriate for the site password. There is still the chance to produce words like Teiubesc, but that chance is vanishingly small compared to the probability of a person choosing it.
+Using a conservative estimate makes perfect sense for the super password, which is picked by a person and matches the zxcvbn use case, but not the site password, which is calculated with a good hash function. SitePassword only returns site passwords that must be guessed by brute force, so calculating the strength with $\log_2 |A|$ for these strings is more appropriate. There is still the chance to produce words like Teiubesc, but that chance is vanishingly small compared to the probability of a person choosing it.

One way to encourage people to pick super passwords that have more entropy than their site passwords is to report that the super password has less entropy than the modified zxcvbn calculation. A simple heuristic that can encourage choosing super passwords for which offline guessing is impractical is to reduce the calculated strength by some number of bits. In this case, the super password strength meter will report Strong only when the super password has approximately that many more bits of entropy than a site password that has just enough entropy for its meter to report Strong. Decreasing the calculated entropy by 16 bits will result in enough collisions to make offline guessing impractical.
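The heuristic amounts to a one-line adjustment on top of the zxcvbn score; in this sketch the 16-bit penalty is the tunable assumption.

import zxcvbn from "zxcvbn";

// Report the super password as 16 bits weaker than calculated so its
// meter says Strong only with ~16 bits of headroom over the site password.
function superPasswordBits(pw, penaltyBits = 16) {
    const bits = Math.log2(zxcvbn(pw).guesses);
    return Math.max(0, bits - penaltyBits);
}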

\section{A Key Assumption}

-Many of the arguments made in this document hinge on the assumption that the adversary needs fewer online tests the more bytes go into creating the site password. Given an alphabet of size A and random values in the range $0 \leq r < R$, the number of values of $r$ that produce the same value of $\bmod(R,A)$ is either $\lfloor \bmod(R,A)\rfloor$ or $\lceil \bmod(R,A)\rceil$. Say that the adversary knows the first character of a site password, say a, A guess for the super password produces bytes with a value of $0 \leq B < R$.
+A key assumption made here is that algorithms that compute bits that are not used to construct the site password make it easier to guess your super password. That's because SitePassword uses the built-in PBKDF2 code that does not provide access to the internal state of the algorithm, while an adversary can use custom code that does.

+Consider an example. PBKDF2 is not memory intensive, but it can be made so by computing a very large key, say 64 MB. The built-in code will use a lot of memory because it has to store the entire key. An adversary who knows which bits of the key are used to construct the site password can produce a version that only stores those bits.

+Approaches 2 and 4d each compute a hash with more bits than needed for most site passwords. Computing more bits means that fewer iterations can be done in the same amount of time. The adversary, on the other hand, can construct a high-performance version of PBKDF2 that can be restarted by saving its internal state.

+The difference isn't trivial. Most passwords won't need any extra bits when using algorithm 2, but some will need several times the size of the site password. The same applies to 4d: while 3-5 times as many bits are needed in the common case, 10 times that many may sometimes be needed. A smaller number of extra bits reduces the adversary's advantage at the expense of needing the weaker deterministic algorithm more often.
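To make the cost asymmetry concrete, here is how the extra bits would be derived with WebCrypto's built-in PBKDF2; the iteration count is a placeholder.

// PBKDF2's cost grows linearly with the output length (one full HMAC
// chain per 256-bit block with SHA-256), so deriving 4x the bits costs
// the defender ~4x the time per guess. An adversary's custom code can
// stop as soon as it has the bits a given guess actually uses.
async function deriveExtraBits(superpw, salt, nBits, iterations = 100000) {
    const key = await crypto.subtle.importKey(
        "raw", new TextEncoder().encode(superpw), "PBKDF2", false, ["deriveBits"]);
    return crypto.subtle.deriveBits(
        { name: "PBKDF2", salt, iterations, hash: "SHA-256" }, key, nBits);
}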

\end{document}
12 changes: 6 additions & 6 deletions src/generate.js
@@ -73,7 +73,7 @@ async function computePassword(superpw, salt, settings) {
    let lower = settings.allowlower? config.lower: "";
    let digits = settings.allownumber ? config.digits: "";
    let specials = settings.allowspecial ? settings.specials: "";
-    let cset = upper + lower + digits + specials;
+    let cset = specials + digits + upper + lower;
    if (!cset) return "";
    if (settings.startwithletter) {
        let alphabet = "";
@@ -198,7 +198,10 @@ function verifyPassword(pw, settings) {
export function characters(settings) {
    // generate a set of no more than 256 characters for encoding
    let chars = "";
-    if (settings.allownumber) {
+    if (settings.allowspecial) {
+        chars += settings.specials;
+    }
+    if (settings.allownumber) {
        chars += config.digits;
    }
    if (settings.allowupper) {
@@ -207,10 +207,7 @@
    if (settings.allowlower) {
        chars += config.lower;
    }
-    if (settings.allowspecial) {
-        chars += settings.specials;
-    }
-  return chars.substring(0, 256); // substring just in case...
+    return chars.substring(0, 256); // substring just in case...
}
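The reordering in characters() above (and of cset in computePassword) matters because of modulo bias; a sketch of the effect, assuming a character is selected as chars[byte % chars.length]:

// Bytes near 256 wrap around and land on the START of chars, so the
// leading characters get extra probability mass. Listing specials and
// digits first makes the rarest required types more likely to appear.
function biasCounts(chars) {
    const counts = new Array(chars.length).fill(0);
    for (let byte = 0; byte < 256; byte++) counts[byte % chars.length]++;
    return counts; // the first (256 % chars.length) slots get one extra hit
}
// e.g. with 94 characters, 256 % 94 = 68: slots 0-67 are hit 3 times,
// slots 68-93 only twice.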
export function normalize(name) {
    if (name) {
