-
Notifications
You must be signed in to change notification settings - Fork 216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
is our realtime sync implementation for strings impacted by an issue with surrogate pairs in diff match patch? Yes, but not much. #6327
Comments
Another comment -- there is a typescript rewrite of diff-match-patch here: google/diff-match-patch#74 It would be cool to create a new github repo based on that, which has my fix(es) to dmp.js in it, plus that, and is targeted at Javascript only. It could also maybe have one wasm version at some point, if performance can be made better than javascript along some dimensions. |
I was worried the typescript version might vanish, so I made a clone here: https://github.com/williamstein/diff-match-patch-typescript |
OK, I looked into this:
I.e., it treats both strings as two characters and transforms one to the other. And in all cases editing in cocalc this works fine, at least if the patch applies cleanly . After all the patch is a function to transform one string to another, and it doesn't matter how that string is interpreted. If there were a merge conflict, i.e., the patch is not applied cleanly, it seems like we could definitely end up with something weird, e.g., an invalid unicode character. So this problem does in fact potentially impact us, but ONLY when there are two people editing the same text at the same time, when there would likely be something mangled anyways. It seems like for some users of dmp who use the algorithm differently, they always run into huge problems due to this bug. But not us. Still, I definitely would like to fix this someday. |
@rgbkrk I finally looked into this. |
I don't know whether, or to what extent, this might impact cocalc. I've so far never been aware of such an issue. Maybe (?) when CoCalc hits it, an error is thrown, and our diff algorithm generates a very large diff that is just "replace the entire document by this other one", so for us things are not efficient, but not broken either. I don't know. It also might be very unlikely to hit in the context of Jupyter notebooks, where most text is ascii, and markdown where we usually write emojis as
:thing:
instead of unicode.In any case, I'll look into this and report back here. Since the original author of DMP doesn't maintain it anymore, it could also make sense to try to modernize the library and make a new independent supported version, which contains fixes for the above issue. As some motivation, the @cocalc/util package has a copy of dmp with at least one important bugfix (from my point of view). That's in https://github.com/sagemathinc/cocalc/blob/master/src/packages/util/dmp.js
NOTE: the file dmp.js had a GPL header applied to it by some automated script that @haraldschilly wrote. However, I just fixed that and reverted the license header back to the original Apache V2 license.
The text was updated successfully, but these errors were encountered: