Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test: grapheme_extract should slide properly past error bytes. #17404

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

dmsnell
Copy link
Contributor

@dmsnell dmsnell commented Jan 8, 2025

Notes

  • I’m quite confused on the custom syntax in the PHPT file so I may have
    done a poor job creating my new failing test. Still, the behavior is
    confirmed in the code sample below which leads to an infinite loop
    duplicate iteration.

Adds a test to assert that the $next parameter of grapheme_extract() points to the next byte offset in the input $haystack after accounting for the moved offset, according to the docs:

If offset does not point to the first byte of a UTF-8 character,
the start position is moved to the next character boundary.

It seems that the existing behavior is to find the next grapheme boundary from the original provided offset, but if the offset doesn’t point to a valid starting byte, the assigned $next value will point to the byte that was immediately decoded in the same call, leading to re-visiting already-visited characters in the string.

$at = 0;
while ( $at < strlen( "\x85PHP" ) ) {
        $grapheme = grapheme_extract( "\x85PHP", 1, GRAPHEME_EXTR_COUNT, $at, $at );
	// sees 'P', 'P', 'H', 'P'
}

Adds a test to assert that the `$next` parameter of `grapheme_extract()`
points to the next byte offset in the input `$haystack` after accounting
for the moved offset, according to the docs:

> If offset does not point to the first byte of a UTF-8 character,
> the start position is moved to the next character boundary.

It seems that the existing behavior is to find the next grapheme
boundary from the original provided offset, but if the offset doesn’t
point to a valid starting byte, the assigned `$next` value will point
to the byte that was immediately decoded in the same call, leading to
possible infinite loops in user-space code.

```
while ( $at < strlen( $s ) ) {
        $grapheme = grapheme_extract( "\x85PHP", 1, GRAPHEME_EXTR_COUNT, $at, $at );
	// never moves past the second byte, always returns 'P'
}
```
@cmb69
Copy link
Member

cmb69 commented Jan 8, 2025

Possible fix:

 ext/intl/grapheme/grapheme_string.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ext/intl/grapheme/grapheme_string.c b/ext/intl/grapheme/grapheme_string.c
index f33dab8eaf..8fbe24e8dc 100644
--- a/ext/intl/grapheme/grapheme_string.c
+++ b/ext/intl/grapheme/grapheme_string.c
@@ -799,7 +799,7 @@ PHP_FUNCTION(grapheme_extract)
 	if ( -1 != grapheme_ascii_check((unsigned char *)pstr, MIN(size + 1, str_len)) ) {
 		size_t nsize = MIN(size, str_len);
 		if ( NULL != next ) {
-			ZVAL_LONG(next, start+nsize);
+			ZVAL_LONG(next, start+nsize+1);
 		}
 		RETURN_STRINGL(pstr, nsize);
 	}

@dmsnell
Copy link
Contributor Author

dmsnell commented Jan 9, 2025

@cmb69 that change only breaks existing cases. I tried playing around with this yesterday and thought that the following patch should fix it, but it didn’t and then I was confused why.

diff --git a/ext/intl/grapheme/grapheme_string.c b/ext/intl/grapheme/grapheme_string.c
index f33dab8eaf..5b9a28815e 100644
--- a/ext/intl/grapheme/grapheme_string.c
+++ b/ext/intl/grapheme/grapheme_string.c
@@ -833,7 +833,7 @@ PHP_FUNCTION(grapheme_extract)
 	ubrk_close(bi);
 
 	if ( NULL != next ) {
-		ZVAL_LONG(next, start+ret_pos);
+		ZVAL_LONG(next, (pstr-str)+ret_pos);
 	}
 
 	RETURN_STRINGL(((char *)pstr), ret_pos);

In a second look just now I realize that the reason is because we are advancing pstr but not adjusting start, which matches the problem description perfectly. Turns out that does in fact fix it, so I’ve pushed a new commit to this branch.

@devnexen devnexen requested a review from youkidearitai January 9, 2025 05:26
Copy link
Contributor

@youkidearitai youkidearitai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I confirmed pass invalid bytes.

<?php

$at = 0;
while ( $at < strlen( "\x85PH\x85P" ) ) {
        $grapheme = grapheme_extract( "\x85PH\x85P", 1, GRAPHEME_EXTR_COUNT, $at, $at ); // add invalid byte to \x85
        var_dump($grapheme);
}

Result is below:

$ ~/php84/bin/php gh17404.php
string(1) "P"
string(1) "P"
string(1) "H"
string(1) "P"
string(1) "P"
$ sapi/cli/php gh17404.php
string(1) "P"
string(1) "H"
string(1) "P"

Just in case, I suggest write this behavior changes to UPGRADING.

@dmsnell
Copy link
Contributor Author

dmsnell commented Jan 10, 2025

UPGRADING added - good suggestion @youkidearitai. I don’t feel any attachment to the description I wrote so please edit it at your leisure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants