Hashing Emojis in strings is not consistent #11

Erdromian · 2020-04-15T20:29:34Z

Running into an issue where strings don't hash the same cross-platform when emojis are present. I assume this extends to more Unicode as well.

(is (= "1a934e63-1467-50af-b644-a35343cb16b9" (str (hasch/uuid "👍 👍 👍")))) ; in Clojure
(is (= "28fe1faf-09a0-5def-a1e6-78326e03882b" (str (hasch/uuid "👍 👍 👍")))) ; in ClojureScript

whilo · 2020-04-16T02:06:13Z

Ouch. Thanks for reporting this. We have only tested some unicode symbols here: https://github.com/replikativ/hasch/blob/master/test/hasch/test.cljc#L35 It seems this was not sufficient. Do you have some experience with unicode?

Erdromian · 2020-04-16T02:26:11Z

Unfortunately not. An internal user of mine just started using emoticons. I use hasch to validate things cross-platform, so I was considering disabling this specific validation for now.

But since you are actively maintaining this, I wouldn't mind looking into unicode and making a pull request in the next few days if you want.

whilo · 2020-04-16T03:19:04Z

That would be very good. Let me know when you get stuck and I will try to help. This bug is bad.

Erdromian · 2020-04-16T16:25:35Z

What is your setup for running and testing this project? So far, I have been getting stuck with the error Unable to resolve var: cemerick.piggieback/wrap-cljs-repl in this context on startup.

In the meantime, I started taking snippits of the code to test against.

Getting the byte values, things are consistent between the two so far. However, the 'utf8' step in
(defn- ^bytes str->utf8 [x] (-> x str (.getBytes "UTF-8")))
vs
(defn- str->utf8 [x] (-> x str utf8))
Is Returning differently cross-platform.

(defn bytes-to-int ([bytes] (bytes-to-int bytes 0)) ([bytes offset] (reduce + 0 (map (fn [i] (let [shift (* (- 4 1 i) 8)] (bit-shift-left (bit-and (nth bytes (+ i offset)) 0x000000FF) shift))) (range 0 4)))))

#?(:cljs (is (= -308232723 (bytes-to-int (utf8 emoji)))) :clj (is (= 4036989325 (bytes-to-int (.getBytes emoji "UTF-8")))))

Maybe my bytes-to-int function is off, I am just using it for seeing the results ATM. I will try some different ways to visualize the bytes to narrow it down. Then maybe I will look into the .getBytes Java implementation

Erdromian · 2020-04-16T19:39:54Z

Alright... I think the problem is that cljs utf8 coerces numbers to positive automatically, whereas (.getBytes emoji "UTF-8") has negative numbers in it. As soon as I get the project to start, I am going to add some more unicode tests, including emojis, and play with some changes.

I think either the .cljs version of 'utf8' needs to go into a signed int array instead of unsigned, or we need to coerce the results of (.getBytes s "UTF-8") to positive numbers using something like #(if (neg? %) (+ % 256) %) to make the results consistent.

whilo · 2020-04-23T11:12:28Z

Hey, sorry for taking so long, I had to work through some other issues first. I have updated the develop branch to properly compile and run tests with up to date dependencies, either with lein test for Clojure or lein cljsbuild test. The latter needs to be also run by the former...

Could you open a separate branch against develop and update me on your experiments? I think this issue is related to https://lambdaisland.com/blog/2017-06-12-clojure-gotchas-surrogate-pairs. Unfortunately I did not expect this inconsistency when I wrote the test for Unicode, I should have been more careful.

whilo · 2020-05-06T18:14:15Z

@Erdromian Did that fix your issue with the piggieback middleware?

Erdromian · 2020-05-06T19:15:31Z

No, I basically just stopped using it in my project.
As for the fix, I ran out of time and had to move onto my main work for now. I did make a branch and added some more tests to it. However, I couldn't figure out how to make things consistent.

whilo · 2020-05-14T00:16:08Z

Ok, no problem. Thanks for reporting back. How did you make things consistent in your project? Did you manage to make JavaScript's and Java's UTF8 representation match?

mikekap · 2020-08-06T18:39:03Z

Just ran into this - I think the JS algorithm isn't entirely correct and doesn't handle UTF-16 surrogate pairs in JS. I was able to get around this by using the google closure fn that seems to do the right thing: goog.crypt.stringToUtf8ByteArray.

whilo · 2020-08-06T19:09:34Z

This is great news! Thanks so much for coming back here. I can incorporate it as soon as possible, but if you feel like opening a PR and become a contributor then I am also happy to review that.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hashing Emojis in strings is not consistent #11

Hashing Emojis in strings is not consistent #11

Erdromian commented Apr 15, 2020

whilo commented Apr 16, 2020

Erdromian commented Apr 16, 2020

whilo commented Apr 16, 2020

Erdromian commented Apr 16, 2020 •

edited

Loading

Erdromian commented Apr 16, 2020

whilo commented Apr 23, 2020

whilo commented May 6, 2020

Erdromian commented May 6, 2020

whilo commented May 14, 2020

mikekap commented Aug 6, 2020

whilo commented Aug 6, 2020

Hashing Emojis in strings is not consistent #11

Hashing Emojis in strings is not consistent #11

Comments

Erdromian commented Apr 15, 2020

whilo commented Apr 16, 2020

Erdromian commented Apr 16, 2020

whilo commented Apr 16, 2020

Erdromian commented Apr 16, 2020 • edited Loading

Erdromian commented Apr 16, 2020

whilo commented Apr 23, 2020

whilo commented May 6, 2020

Erdromian commented May 6, 2020

whilo commented May 14, 2020

mikekap commented Aug 6, 2020

whilo commented Aug 6, 2020

Erdromian commented Apr 16, 2020 •

edited

Loading