Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

is our realtime sync implementation for strings impacted by an issue with surrogate pairs in diff match patch? Yes, but not much. #6327

Open
williamstein opened this issue Jan 9, 2023 · 4 comments

Comments

@williamstein
Copy link
Contributor

"Surrogate pairs are strings that contain a supplemental code point (especially emojis) that cause diff indices to be offset. It can either mess up the text or cause DMP to error (within toDelta/fromDelta)." See

I don't know whether, or to what extent, this might impact cocalc. I've so far never been aware of such an issue. Maybe (?) when CoCalc hits it, an error is thrown, and our diff algorithm generates a very large diff that is just "replace the entire document by this other one", so for us things are not efficient, but not broken either. I don't know. It also might be very unlikely to hit in the context of Jupyter notebooks, where most text is ascii, and markdown where we usually write emojis as :thing: instead of unicode.

In any case, I'll look into this and report back here. Since the original author of DMP doesn't maintain it anymore, it could also make sense to try to modernize the library and make a new independent supported version, which contains fixes for the above issue. As some motivation, the @cocalc/util package has a copy of dmp with at least one important bugfix (from my point of view). That's in https://github.com/sagemathinc/cocalc/blob/master/src/packages/util/dmp.js

NOTE: the file dmp.js had a GPL header applied to it by some automated script that @haraldschilly wrote. However, I just fixed that and reverted the license header back to the original Apache V2 license.

@williamstein williamstein changed the title is our sync algorithm impacted by an issue with surrogate pairs in diff match patch? is our realtime sync implementation for strings impacted by an issue with surrogate pairs in diff match patch? Jan 9, 2023
@williamstein
Copy link
Contributor Author

Another comment -- there is a typescript rewrite of diff-match-patch here: google/diff-match-patch#74

It would be cool to create a new github repo based on that, which has my fix(es) to dmp.js in it, plus that, and is targeted at Javascript only. It could also maybe have one wasm version at some point, if performance can be made better than javascript along some dimensions.

@williamstein
Copy link
Contributor Author

I was worried the typescript version might vanish, so I made a clone here: https://github.com/williamstein/diff-match-patch-typescript

@williamstein
Copy link
Contributor Author

OK, I looked into this:

image
  • Definitely when making diffs/patches, our current dmp version does treat the two surrogate pairs as just separate characters, i.e., the string s="🌈" is viewed as a string of length 2.
  • If you change that green chinese character to the red one, here's what patch we get:
[[[[0,"\ud83c"],[-1,"\ude2f"],[1,"\ude32"]],0,0,2,2]]

I.e., it treats both strings as two characters and transforms one to the other.

And in all cases editing in cocalc this works fine, at least if the patch applies cleanly . After all the patch is a function to transform one string to another, and it doesn't matter how that string is interpreted.

If there were a merge conflict, i.e., the patch is not applied cleanly, it seems like we could definitely end up with something weird, e.g., an invalid unicode character.

So this problem does in fact potentially impact us, but ONLY when there are two people editing the same text at the same time, when there would likely be something mangled anyways.

It seems like for some users of dmp who use the algorithm differently, they always run into huge problems due to this bug. But not us.

Still, I definitely would like to fix this someday.

@williamstein
Copy link
Contributor Author

@rgbkrk I finally looked into this.

@williamstein williamstein changed the title is our realtime sync implementation for strings impacted by an issue with surrogate pairs in diff match patch? is our realtime sync implementation for strings impacted by an issue with surrogate pairs in diff match patch? Yes, but not much. Oct 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant