-
Notifications
You must be signed in to change notification settings - Fork 7.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test: grapheme_extract should slide properly past error bytes. #17404
base: master
Are you sure you want to change the base?
Test: grapheme_extract should slide properly past error bytes. #17404
Conversation
Adds a test to assert that the `$next` parameter of `grapheme_extract()` points to the next byte offset in the input `$haystack` after accounting for the moved offset, according to the docs: > If offset does not point to the first byte of a UTF-8 character, > the start position is moved to the next character boundary. It seems that the existing behavior is to find the next grapheme boundary from the original provided offset, but if the offset doesn’t point to a valid starting byte, the assigned `$next` value will point to the byte that was immediately decoded in the same call, leading to possible infinite loops in user-space code. ``` while ( $at < strlen( $s ) ) { $grapheme = grapheme_extract( "\x85PHP", 1, GRAPHEME_EXTR_COUNT, $at, $at ); // never moves past the second byte, always returns 'P' } ```
Possible fix: ext/intl/grapheme/grapheme_string.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/ext/intl/grapheme/grapheme_string.c b/ext/intl/grapheme/grapheme_string.c
index f33dab8eaf..8fbe24e8dc 100644
--- a/ext/intl/grapheme/grapheme_string.c
+++ b/ext/intl/grapheme/grapheme_string.c
@@ -799,7 +799,7 @@ PHP_FUNCTION(grapheme_extract)
if ( -1 != grapheme_ascii_check((unsigned char *)pstr, MIN(size + 1, str_len)) ) {
size_t nsize = MIN(size, str_len);
if ( NULL != next ) {
- ZVAL_LONG(next, start+nsize);
+ ZVAL_LONG(next, start+nsize+1);
}
RETURN_STRINGL(pstr, nsize);
} |
@cmb69 that change only breaks existing cases. I tried playing around with this yesterday and thought that the following patch should fix it, but it didn’t and then I was confused why. diff --git a/ext/intl/grapheme/grapheme_string.c b/ext/intl/grapheme/grapheme_string.c
index f33dab8eaf..5b9a28815e 100644
--- a/ext/intl/grapheme/grapheme_string.c
+++ b/ext/intl/grapheme/grapheme_string.c
@@ -833,7 +833,7 @@ PHP_FUNCTION(grapheme_extract)
ubrk_close(bi);
if ( NULL != next ) {
- ZVAL_LONG(next, start+ret_pos);
+ ZVAL_LONG(next, (pstr-str)+ret_pos);
}
RETURN_STRINGL(((char *)pstr), ret_pos); In a second look just now I realize that the reason is because we are advancing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I confirmed pass invalid bytes.
<?php
$at = 0;
while ( $at < strlen( "\x85PH\x85P" ) ) {
$grapheme = grapheme_extract( "\x85PH\x85P", 1, GRAPHEME_EXTR_COUNT, $at, $at ); // add invalid byte to \x85
var_dump($grapheme);
}
Result is below:
$ ~/php84/bin/php gh17404.php
string(1) "P"
string(1) "P"
string(1) "H"
string(1) "P"
string(1) "P"
$ sapi/cli/php gh17404.php
string(1) "P"
string(1) "H"
string(1) "P"
Just in case, I suggest write this behavior changes to UPGRADING
.
Notes
done a poor job creating my new failing test. Still, the behavior is
confirmed in the code sample below which leads to
an infinite loopduplicate iteration.
Adds a test to assert that the
$next
parameter ofgrapheme_extract()
points to the next byte offset in the input$haystack
after accounting for the moved offset, according to the docs:It seems that the existing behavior is to find the next grapheme boundary from the original provided offset, but if the offset doesn’t point to a valid starting byte, the assigned
$next
value will point to the byte that was immediately decoded in the same call, leading to re-visiting already-visited characters in the string.