Adapt for multibyte strings as UTF-8 #1

Seb35 · 2018-09-29T12:12:14Z

There is no native function mb_str_split, hence writing a loop.

In most cases there are no differences in matches, but there are differences in lengths and indexes. The first added test illustrates this, the previous version gives the same match, but its length was 8 instead of 7 (the "é" counts as 2 bytes) and indexes were 9 / 9 instead of 5 / 7 (the "é" counts as 2 bytes and "’" counts as 3 bytes).

The second test illustrates that in some cases there could be differences in matches because UTF-8 is a multibyte encoding and some individual bytes could match although the whole characters won’t match.

There is no native function mb_str_split, hence writing a loop.

codecov-io · 2018-09-29T12:14:06Z

Codecov Report

Merging #1 into master will not change coverage.
The diff coverage is 100%.

@@          Coverage Diff          @@
##           master     #1   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files           4      4           
  Lines         113    117    +4     
=====================================
+ Hits          113    117    +4

Impacted Files	Coverage Δ
src/Solver.php	`100% <100%> (ø)`	⬆️
src/MatchesSolver.php	`100% <100%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb35375...82a891f. Read the comment docs.

Seb35 · 2018-09-29T12:17:38Z

Also, thanks for this library, I searched this algorithm in PHP and it was the only implementation I found. I use it in a more complete environment for diffing texts (using the Ratcliff-Obershelp algorithm on top of this LCS algorithm) in the web interface of Archeo Lex (source code), aiming at an easy viewer of French law texts and particularly its changes.

webignition · 2019-01-11T12:46:18Z

@gonzalom This change is something that I would certainly appreciate. Any chance you can merge?

webignition · 2019-01-11T12:51:20Z

src/Solver.php

-        $charsB = str_split($stringB);
+        $charsA = [];
+        $charsB = [];
+        for ($i=0; $i < max(mb_strlen($stringA), 1); $i++) {


I appreciate that this loop is present due to a lack of mb_str_split().

Can this be logic be factored out to a method (perhaps named mb_str_split() so that the code at this point reads more clearly?

webignition · 2019-01-11T13:08:43Z

src/MatchesSolver.php

@@ -46,7 +46,7 @@ protected function result(
        array $matrix
    ) {
        return array_map(function (Match $result) use ($stringA) {
-            $result->value = substr($stringA, $result->index(), $result->length);
+            $result->value = mb_substr($stringA, $result->index(), $result->length);


mb_substr() is lacking the optional fourth argument. The string will be assumed to be encoded using whatever happens to be the return value of mb_internal_encoding(). This may or may not be the encoding that the string uses.

We can try to get the correct encoding using mb_detect_encoding(). But, hey, that can get tricky since the return value might be false. Oh, and mb_detect_encoding() can get things wrong if a string has invalidly-encoded characters which is why we need to also set the optional third argument to true. Doing so has the (very useful) side benefit of significantly reducing the chance of an invalid encoding attack.

Not entirely sure how you'd work this into these changes, but here's an example of the logic you'd need to use:

function detectMultibyteStringEncoding(string $string): string { $encoding = mb_detect_encoding($string, mb_detect_order(), true); // Return the detected encoding or 'utf-8' if not detected. // This is just an example, ideally 'utf-8' would be defined in a well-named constant. return is_string($encoding) ? $encoding : 'utf-8'; } // ... $result->value = mb_substr($stringA, $result->index(), $result->length, $encoding);

The same applies to all calls to mb_*() functions that can accept an optional $encoding argument, of which there are some more in these set of changes.

Aren't mb strings fun!

Adapt for multibyte strings as UTF-8

82a891f

There is no native function mb_str_split, hence writing a loop.

Seb35 mentioned this pull request Sep 30, 2018

Faire des diffs mot-à-mot nativement Legilibre/Archeo-Lex-web#2

Open

webignition reviewed Jan 11, 2019

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adapt for multibyte strings as UTF-8 #1

Adapt for multibyte strings as UTF-8 #1

Seb35 commented Sep 29, 2018

codecov-io commented Sep 29, 2018 •

edited

Loading

Seb35 commented Sep 29, 2018

webignition commented Jan 11, 2019

webignition Jan 11, 2019

webignition Jan 11, 2019

Adapt for multibyte strings as UTF-8 #1

Are you sure you want to change the base?

Adapt for multibyte strings as UTF-8 #1

Conversation

Seb35 commented Sep 29, 2018

codecov-io commented Sep 29, 2018 • edited Loading

Codecov Report

Seb35 commented Sep 29, 2018

webignition commented Jan 11, 2019

webignition Jan 11, 2019

Choose a reason for hiding this comment

webignition Jan 11, 2019

Choose a reason for hiding this comment

codecov-io commented Sep 29, 2018 •

edited

Loading