Skip to content
This repository has been archived by the owner on Aug 5, 2024. It is now read-only.

JavaScript implementation crashes on Unicode code points #10

Open
sffc opened this issue May 4, 2018 · 5 comments · May be fixed by #80
Open

JavaScript implementation crashes on Unicode code points #10

sffc opened this issue May 4, 2018 · 5 comments · May be fixed by #80

Comments

@sffc
Copy link

sffc commented May 4, 2018

I stumbled upon this project from a bug in a downstream project that uses this library, Codiad.

The following function throws an exception:

function testPatchUnicode() {
  var cp = '\uD800\uDDE4'; // U+101E4; cannot put directly in source file
  var patches = dmp.patch_make(cp + cp + cp + cp + cp + 'a', cp + cp + cp + cp + cp + 'ab');
  dmp.patch_toText(patches);
}

In general, any string that contains a supplemental code point, which are much more common recently with the rise of emoji, causes diff indices to be offset by some number of code points. This leads to strange or undefined behavior when applying the outputted patches.

This is a rather serious bug that is quietly affecting any downstream project that uses this library.

I think the best fix would be to rewrite the patch-to-string function to operate entirely in code point space instead of JavaScript's default code unit space.

This might also affect non-JavaScript implementations; I haven't looked.

P.S. I am on Google's i18n team and have seen issues like this before.

@NeilFraser
Copy link
Contributor

Agreed, this is currently the highest priority issue.

Since you are at Google, you might be interested in this:
https://critique.corp.google.com/#review/165311359
It does a post-diff pass that pastes any split the supplemental code points back together.

I'm not sure why the author rolled it back. But it seems like the right approach. Need to find time to dig into this issue and if it's the right solution port it to the other languages.

@NeilFraser
Copy link
Contributor

For reference, here's the (Google internal) reason for why this patch was rolled back:
https://critique.corp.google.com/#review/179020104

@sffc
Copy link
Author

sffc commented May 22, 2018

I'm hacking in a copy of diff_match_patch locally. Haven't gotten something to completely work yet, but making some progress.

@sffc
Copy link
Author

sffc commented May 22, 2018

The comments on 179020104 and the related bug thread corroborate that a custom code-point-based string implementation is a tractable fix to the issue. That's what I'm trying to do in JavaScript.

@ndvbd
Copy link

ndvbd commented May 24, 2021

Any progress here?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants