Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hashing Emojis in strings is not consistent #11

Open
Erdromian opened this issue Apr 15, 2020 · 11 comments
Open

Hashing Emojis in strings is not consistent #11

Erdromian opened this issue Apr 15, 2020 · 11 comments

Comments

@Erdromian
Copy link

Running into an issue where strings don't hash the same cross-platform when emojis are present. I assume this extends to more Unicode as well.

(is (= "1a934e63-1467-50af-b644-a35343cb16b9" (str (hasch/uuid "👍 👍 👍")))) ; in Clojure
(is (= "28fe1faf-09a0-5def-a1e6-78326e03882b" (str (hasch/uuid "👍 👍 👍")))) ; in ClojureScript

@whilo
Copy link
Member

whilo commented Apr 16, 2020

Ouch. Thanks for reporting this. We have only tested some unicode symbols here: https://github.com/replikativ/hasch/blob/master/test/hasch/test.cljc#L35 It seems this was not sufficient. Do you have some experience with unicode?

@Erdromian
Copy link
Author

Unfortunately not. An internal user of mine just started using emoticons. I use hasch to validate things cross-platform, so I was considering disabling this specific validation for now.

But since you are actively maintaining this, I wouldn't mind looking into unicode and making a pull request in the next few days if you want.

@whilo
Copy link
Member

whilo commented Apr 16, 2020

That would be very good. Let me know when you get stuck and I will try to help. This bug is bad.

@Erdromian
Copy link
Author

Erdromian commented Apr 16, 2020

What is your setup for running and testing this project? So far, I have been getting stuck with the error Unable to resolve var: cemerick.piggieback/wrap-cljs-repl in this context on startup.

In the meantime, I started taking snippits of the code to test against.

Getting the byte values, things are consistent between the two so far. However, the 'utf8' step in
(defn- ^bytes str->utf8 [x] (-> x str (.getBytes "UTF-8")))
vs
(defn- str->utf8 [x] (-> x str utf8))
Is Returning differently cross-platform.

(defn bytes-to-int ([bytes] (bytes-to-int bytes 0)) ([bytes offset] (reduce + 0 (map (fn [i] (let [shift (* (- 4 1 i) 8)] (bit-shift-left (bit-and (nth bytes (+ i offset)) 0x000000FF) shift))) (range 0 4)))))

#?(:cljs (is (= -308232723 (bytes-to-int (utf8 emoji)))) :clj (is (= 4036989325 (bytes-to-int (.getBytes emoji "UTF-8")))))

Maybe my bytes-to-int function is off, I am just using it for seeing the results ATM. I will try some different ways to visualize the bytes to narrow it down. Then maybe I will look into the .getBytes Java implementation

@Erdromian
Copy link
Author

Alright... I think the problem is that cljs utf8 coerces numbers to positive automatically, whereas (.getBytes emoji "UTF-8") has negative numbers in it. As soon as I get the project to start, I am going to add some more unicode tests, including emojis, and play with some changes.

I think either the .cljs version of 'utf8' needs to go into a signed int array instead of unsigned, or we need to coerce the results of (.getBytes s "UTF-8") to positive numbers using something like #(if (neg? %) (+ % 256) %) to make the results consistent.

@whilo
Copy link
Member

whilo commented Apr 23, 2020

Hey, sorry for taking so long, I had to work through some other issues first. I have updated the develop branch to properly compile and run tests with up to date dependencies, either with lein test for Clojure or lein cljsbuild test. The latter needs to be also run by the former...

Could you open a separate branch against develop and update me on your experiments? I think this issue is related to https://lambdaisland.com/blog/2017-06-12-clojure-gotchas-surrogate-pairs. Unfortunately I did not expect this inconsistency when I wrote the test for Unicode, I should have been more careful.

@whilo
Copy link
Member

whilo commented May 6, 2020

@Erdromian Did that fix your issue with the piggieback middleware?

@Erdromian
Copy link
Author

No, I basically just stopped using it in my project.
As for the fix, I ran out of time and had to move onto my main work for now. I did make a branch and added some more tests to it. However, I couldn't figure out how to make things consistent.

@whilo
Copy link
Member

whilo commented May 14, 2020

Ok, no problem. Thanks for reporting back. How did you make things consistent in your project? Did you manage to make JavaScript's and Java's UTF8 representation match?

@mikekap
Copy link
Contributor

mikekap commented Aug 6, 2020

Just ran into this - I think the JS algorithm isn't entirely correct and doesn't handle UTF-16 surrogate pairs in JS. I was able to get around this by using the google closure fn that seems to do the right thing: goog.crypt.stringToUtf8ByteArray.

@whilo
Copy link
Member

whilo commented Aug 6, 2020

This is great news! Thanks so much for coming back here. I can incorporate it as soon as possible, but if you feel like opening a PR and become a contributor then I am also happy to review that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants