Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adapt for multibyte strings as UTF-8 #1

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

Seb35
Copy link

@Seb35 Seb35 commented Sep 29, 2018

There is no native function mb_str_split, hence writing a loop.

In most cases there are no differences in matches, but there are differences in lengths and indexes. The first added test illustrates this, the previous version gives the same match, but its length was 8 instead of 7 (the "é" counts as 2 bytes) and indexes were 9 / 9 instead of 5 / 7 (the "é" counts as 2 bytes and "’" counts as 3 bytes).

The second test illustrates that in some cases there could be differences in matches because UTF-8 is a multibyte encoding and some individual bytes could match although the whole characters won’t match.

There is no native function mb_str_split, hence writing a loop.
@codecov-io
Copy link

codecov-io commented Sep 29, 2018

Codecov Report

Merging #1 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@          Coverage Diff          @@
##           master     #1   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files           4      4           
  Lines         113    117    +4     
=====================================
+ Hits          113    117    +4
Impacted Files Coverage Δ
src/Solver.php 100% <100%> (ø) ⬆️
src/MatchesSolver.php 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb35375...82a891f. Read the comment docs.

@Seb35
Copy link
Author

Seb35 commented Sep 29, 2018

Also, thanks for this library, I searched this algorithm in PHP and it was the only implementation I found. I use it in a more complete environment for diffing texts (using the Ratcliff-Obershelp algorithm on top of this LCS algorithm) in the web interface of Archeo Lex (source code), aiming at an easy viewer of French law texts and particularly its changes.

@webignition
Copy link

@gonzalom This change is something that I would certainly appreciate. Any chance you can merge?

$charsB = str_split($stringB);
$charsA = [];
$charsB = [];
for ($i=0; $i < max(mb_strlen($stringA), 1); $i++) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate that this loop is present due to a lack of mb_str_split().

Can this be logic be factored out to a method (perhaps named mb_str_split() so that the code at this point reads more clearly?

@@ -46,7 +46,7 @@ protected function result(
array $matrix
) {
return array_map(function (Match $result) use ($stringA) {
$result->value = substr($stringA, $result->index(), $result->length);
$result->value = mb_substr($stringA, $result->index(), $result->length);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mb_substr() is lacking the optional fourth argument. The string will be assumed to be encoded using whatever happens to be the return value of mb_internal_encoding(). This may or may not be the encoding that the string uses.

We can try to get the correct encoding using mb_detect_encoding(). But, hey, that can get tricky since the return value might be false. Oh, and mb_detect_encoding() can get things wrong if a string has invalidly-encoded characters which is why we need to also set the optional third argument to true. Doing so has the (very useful) side benefit of significantly reducing the chance of an invalid encoding attack.

Not entirely sure how you'd work this into these changes, but here's an example of the logic you'd need to use:

function detectMultibyteStringEncoding(string $string): string
{
    $encoding = mb_detect_encoding($string, mb_detect_order(), true);

    // Return the detected encoding or 'utf-8' if not detected. 
    // This is just an example, ideally 'utf-8' would be defined in a well-named constant.
    return is_string($encoding) ? $encoding : 'utf-8';
}

// ...

$result->value = mb_substr($stringA, $result->index(), $result->length, $encoding);

The same applies to all calls to mb_*() functions that can accept an optional $encoding argument, of which there are some more in these set of changes.

Aren't mb strings fun!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants