Avoid UTF-8 validating same string multiple times #3957

nikic · 2019-03-18T12:04:34Z

Fix for https://bugs.php.net/bug.php?id=72685.

We currently have a huge performance problem when implementing lexers that work on UTF-8 strings in PHP. This kind of code tends to perform a large number of matches at different offsets on a single string, which is generally fast. However, if /u mode is used, the full string is UTF-8 validated on every match, so the total runtime becomes quadratic in the length of the string.

This patch fixes the issue by adding an IS_STR_VALID_UTF8 flag, which is set once we have determined that the string is valid UTF-8, so that further validation is skipped.

A limitation of this approach is that we can't set the flag for interned strings. I think this is not a problem for this use case, which will generally work on dynamic data. If we want to use this flag for other purposes as well (mbstring?), then it might be worthwhile to UTF-8 validate strings during interning. But right now this doesn't seem useful.
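
For readers unfamiliar with the idea, here is a minimal, self-contained sketch of caching a validation result in a per-string flag. It is not the code from this patch: `demo_string`, `DEMO_STR_VALID_UTF8`, `DEMO_STR_INTERNED`, `demo_utf8_scan`, `demo_check_utf8` and the `demo_full_scans` counter are all hypothetical names invented for illustration, and the struct does not mirror the real zend_string layout. The sketch assumes, as described above, that the flag is never set on interned strings.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits; the real flag values in php-src may differ. */
#define DEMO_STR_VALID_UTF8 (1u << 0)
#define DEMO_STR_INTERNED   (1u << 1)

/* Hypothetical stand-in for a string header with a flags word. */
typedef struct {
    uint32_t flags;
    size_t len;
    const unsigned char *val;
} demo_string;

/* Counts full O(n) scans so the effect of the cache is observable. */
static size_t demo_full_scans = 0;

/* Full UTF-8 validation: rejects truncated sequences, overlong forms,
 * surrogates and code points above U+10FFFF. Returns 1 if valid. */
static int demo_utf8_scan(const unsigned char *s, size_t len)
{
    size_t i = 0;
    demo_full_scans++;
    while (i < len) {
        unsigned char c = s[i];
        size_t need, k;
        uint32_t cp, min;
        if (c < 0x80) { i++; continue; }
        if ((c & 0xE0) == 0xC0)      { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0; /* stray continuation byte or invalid lead byte */
        if (len - i - 1 < need) return 0; /* sequence is truncated */
        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
        i += need + 1;
    }
    return 1;
}

/* Validates once per string: a set flag short-circuits the scan. */
static int demo_check_utf8(demo_string *str)
{
    if (str->flags & DEMO_STR_VALID_UTF8) {
        return 1; /* already known to be valid, skip the scan */
    }
    if (!demo_utf8_scan(str->val, str->len)) {
        return 0; /* invalid strings are not cached in this sketch */
    }
    if (!(str->flags & DEMO_STR_INTERNED)) {
        str->flags |= DEMO_STR_VALID_UTF8; /* remember the result */
    }
    return 1;
}
```

With this shape, calling `demo_check_utf8` repeatedly on the same non-interned string performs one full scan in total, while an interned string is rescanned on every call because the flag is never stored on it.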

nikic · 2019-03-18T12:05:47Z

@dstogov Can you take a look at this?

dstogov · 2019-03-18T13:25:03Z

Looks fine to me.

cmb69 · 2019-03-18T15:17:39Z

This could be a base for solving bug #52998, too.

nikic · 2019-03-18T16:00:55Z

Merged as 2b9acd3 into 7.4.

Avoid UTF-8 validating same string multiple times

ed837a9

KalleZ added the Bug label Mar 18, 2019

nikic closed this Mar 18, 2019

ju1ius mentioned this pull request Nov 28, 2022

feat: allows ZendStr to contain null bytes davidcole1340/ext-php-rs#202

Merged
