-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adapt for multibyte strings as UTF-8 #1
base: master
Are you sure you want to change the base?
Conversation
There is no native function mb_str_split, hence writing a loop.
Codecov Report
@@ Coverage Diff @@
## master #1 +/- ##
=====================================
Coverage 100% 100%
=====================================
Files 4 4
Lines 113 117 +4
=====================================
+ Hits 113 117 +4
Continue to review full report at Codecov.
|
Also, thanks for this library, I searched this algorithm in PHP and it was the only implementation I found. I use it in a more complete environment for diffing texts (using the Ratcliff-Obershelp algorithm on top of this LCS algorithm) in the web interface of Archeo Lex (source code), aiming at an easy viewer of French law texts and particularly its changes. |
@gonzalom This change is something that I would certainly appreciate. Any chance you can merge? |
$charsB = str_split($stringB); | ||
$charsA = []; | ||
$charsB = []; | ||
for ($i=0; $i < max(mb_strlen($stringA), 1); $i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate that this loop is present due to a lack of mb_str_split()
.
Can this be logic be factored out to a method (perhaps named mb_str_split()
so that the code at this point reads more clearly?
@@ -46,7 +46,7 @@ protected function result( | |||
array $matrix | |||
) { | |||
return array_map(function (Match $result) use ($stringA) { | |||
$result->value = substr($stringA, $result->index(), $result->length); | |||
$result->value = mb_substr($stringA, $result->index(), $result->length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mb_substr()
is lacking the optional fourth argument. The string will be assumed to be encoded using whatever happens to be the return value of mb_internal_encoding()
. This may or may not be the encoding that the string uses.
We can try to get the correct encoding using mb_detect_encoding()
. But, hey, that can get tricky since the return value might be false
. Oh, and mb_detect_encoding()
can get things wrong if a string has invalidly-encoded characters which is why we need to also set the optional third argument to true
. Doing so has the (very useful) side benefit of significantly reducing the chance of an invalid encoding attack.
Not entirely sure how you'd work this into these changes, but here's an example of the logic you'd need to use:
function detectMultibyteStringEncoding(string $string): string
{
$encoding = mb_detect_encoding($string, mb_detect_order(), true);
// Return the detected encoding or 'utf-8' if not detected.
// This is just an example, ideally 'utf-8' would be defined in a well-named constant.
return is_string($encoding) ? $encoding : 'utf-8';
}
// ...
$result->value = mb_substr($stringA, $result->index(), $result->length, $encoding);
The same applies to all calls to mb_*()
functions that can accept an optional $encoding
argument, of which there are some more in these set of changes.
Aren't mb strings fun!
There is no native function mb_str_split, hence writing a loop.
In most cases there are no differences in matches, but there are differences in lengths and indexes. The first added test illustrates this, the previous version gives the same match, but its length was 8 instead of 7 (the "é" counts as 2 bytes) and indexes were 9 / 9 instead of 5 / 7 (the "é" counts as 2 bytes and "’" counts as 3 bytes).
The second test illustrates that in some cases there could be differences in matches because UTF-8 is a multibyte encoding and some individual bytes could match although the whole characters won’t match.