Some extended page names get broken by punycode transformation #2015

inreti-sb · 2024-12-28T17:02:34Z

Short description of the issue

Some (at least german) special characters get broken in extended page names.
Before being stored in the database, special characters are transformed to punycode.
Looks like it is using some non-standard punycode to encode the names and this is where some characters seem to break.

Expected behavior

(Pagetitle → Pagename)

Fuß → fuß
Füße → füße

Actual behavior

Fuß → s
Füße → füsse

Optional: Screenshots/Links that demonstrate the issue

We verified the bug on different and fresh installs and discussed it with screenshots in this forum thread:
https://processwire.com/talk/topic/30444-prepend-date-in-page-name-by-hook-creates-improper-results-on-some-existing-pages/

Optional: Suggestion for a possible fix

The Processwire function to convert unicode characters to ASCII characters seems to have a bug / be non-standard.

Steps to reproduce the issue

Fresh Processwire install with blank profile
Enable Extended Page Name Feature for UTF-8 support like here: https://processwire.com/blog/posts/page-name-charset-utf8/
Create a new page with page title Fuß and save the page
Look at the wrong generated pagename
Compare the according punycode stored in in the database with the standard punycode by a IDN Punycode converter.

Setup/Environment

ProcessWire version: The current newest stable one 3.0.229
(Optional) PHP version: 8.2
(Optional) MariaDB version: 10.11.10
(Optional) Any 3rd party modules that are installed and could be related to the issue: Even on a fresh install with blank site profile.

matjazpotocnik · 2024-12-30T12:41:51Z

I can replicate it.

Interesting:

$p = $pages->add("test_template", "/", "Füße Füße");
$pages->uncacheAll();
d($p->name); // 'füße-füße'

$string="Köln";
d($sanitizer->pageName($string)); // 'kln'
d($sanitizer->pageName($string, 16)); // 'köln'
d($sanitizer->pageNameUTF8($string)); // 'köln'

Same with Fuß and Füße.

In page editor it's correct, until you save:

after save:

ryancramerdesign · 2024-12-31T16:06:16Z

@inreti-sb @matjazpotocnik Thanks, I've pushed a fix for this. It wasn't in the Punycode library that we use, as that is only a fallback if PHP's IDN functions aren't available. And the Punycode library we are using appears to work correctly when forcing it to use it over PHP's IDN functions. It's the the PHP IDN functions that had changed behavior in some recent PHP version. PHP changed the default behavior of idn_to_ascii and idn_to_utf8, moving the previous behavior to a transitional flag argument. Using IDNs that aren't actual domains now requires that flag, so I have added it, which makes the functions work how we want them to again.

There was also another issue related to page names longer than 50 characters, which is longer than is supported in an IDN, so we have to split them up into 12 byte chunks and then only encode or decode the chunks that need it.

Can yo confirm it is also fixed there? Thanks.

matjazpotocnik · 2024-12-31T18:24:11Z

@ryancramerdesign I can confirm it's fixed. Thank you!

inreti-sb · 2025-01-01T19:09:44Z

Happy new year to everyone. Thanks for verifying @matjazpotocnik and thanks so much for the quick fix at the end of the year @ryancramerdesign!

Short pagename seem to work in my first tests, longer ones not really.

E.g.:
Köln gegen Köln in Köln mit allen Köln Köln gegen Köln in Köln mit allen Köln → köln-gegen-kln-in-kln-mit-allen-köln-köln-gegen-köln-in-köln-mit-allen-kln/

Can someone verify this, please?

matjazpotocnik · 2025-01-02T07:10:05Z

You're right; it works on shorter strings but not on longer ones. It looks like 50 is the magic number:

Köln gegen Köln in Köln mit allen Köln Köln   -> 48 bytes, ok
Köln gegen Köln in Köln mit allen Köln KölnA  -> 49 bytes, ok
Köln gegen Köln in Köln mit allen Köln KölnAB -> 50 bytes, not ok
Köln gegen Köln in Köln mit allen Köln KolnAB -> 49 bytes, ok (replaced last ö with o)

matjazpotocnik · 2025-01-02T15:27:16Z

What I found:

It looks like idn_to_ascii works with a length of up to 64 bytes, at least in PHP 8.2
Max IDN length is 255 bytes encoded
Fallback library (very old) works as expected :-) no need for splitting the string. There are some deprecations, but there is a fix (line 154 and 158):

154       $code = $t + (((int) $q - $t) % (static::BASE - $t)); //MP
155       $output .= static::$encodeTable[$code];
156       $q = ($q - $t) / (static::BASE - $t);
157    }
158    $output .= static::$encodeTable[(int) $q]; //MP

idn_to_ascii("öln-in-köln-", 16) return false! error IDNA_ERROR_TRAILING_HYPHEN (16)
idn_to_ascii("0---0.dk") return false! error: IDNA_ERROR_HYPHEN_3_4 (32) - UTS#46 does not allow labels with hyphens in the 3rd and 4th positions.
and so on...

What to do? We could call idn_to_ascii("öln-in-köln-", 16, INTL_IDNA_VARIANT_UTS46, $info); and use $info['result']? Or use fallback library?

ryancramerdesign · 2025-01-02T20:17:11Z

@matjazpotocnik Those sound like good solutions. I'm not sure how best to handle this case though. If the output of the punyEncodeName() changes, then UTF-8 pages that had their names encoded with it after PHP 7.4, might be incorrect. But correcting it means those pages will no longer be accessible by URL, because the URL might very well not be a correct conversion, but it's what's already being used and saved in the DB. So I think I may have to revert back to the original version before the fix, and then use the new/corrected version for new installations after January 2 2025. I'm also thinking that if you set your $config->pageNameCharset to "UTF-8" rather than "UTF8" then it would use the new/correct version. Then I could update any instructions to refer to "UTF-8", but existing installations could continue as-is. It would be great if PHP hadn't changed the behavior of the function in 7.4 but since they did, I just need to figure out the best way to fix it without affecting existing installations.

matjazpotocnik · 2025-01-02T20:36:58Z

I don't know what the best solution would be. Wouldn't changing from "UTF8" to "UTF-8" be confusing? After reading the comment by Bruce MacNaughton at https://processwire.com/blog/posts/page-name-charset-utf8/, maybe we could use a different name? But then again, that might be even more confusing. Whatever solution would be acceptable if it works correctly and the documentation is in place.

ryancramerdesign · 2025-01-02T20:42:58Z

My thought is it wouldn't be confusing because we wouldn't be telling anyone to use UTF8. It would just be a way for existing installations to be untouched, with no need to even know about UTF-8 vs UTF8. It could be something else, but I don't see any reason to have it be an "either or" option since there would be no reason to have the old/broken behavior except if already stuck with it.

…

On Thu, Jan 2, 2025, 3:37 PM Matjaž Potočnik ***@***.***> wrote: I don't know what the best solution would be. Wouldn't changing from "UTF8" to "UTF-8" be confusing? After reading the comment by Bruce MacNaughton at https://processwire.com/blog/posts/page-name-charset-utf8/, maybe we could use a different name? But then again, that might be even more confusing. Whatever solution would be acceptable if it works correctly and the documentation is in place. — Reply to this email directly, view it on GitHub <#2015 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACQEUB53MLV4QASGC2IP7T2IWPP7AVCNFSM6AAAAABUJ6VOZOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNRYGM2DOOBRGA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

ryancramerdesign · 2025-01-02T20:45:33Z

Installs after Jan 2 2025 would use the fixed behavior either way, UTF8 or UTF-8 or whatever alternative name is used.

…

On Thu, Jan 2, 2025, 3:42 PM Ryan Cramer ***@***.***> wrote: My thought is it wouldn't be confusing because we wouldn't be telling anyone to use UTF8. It would just be a way for existing installations to be untouched, with no need to even know about UTF-8 vs UTF8. It could be something else, but I don't see any reason to have it be an "either or" option since there would be no reason to have the old/broken behavior except if already stuck with it. On Thu, Jan 2, 2025, 3:37 PM Matjaž Potočnik ***@***.***> wrote: > I don't know what the best solution would be. Wouldn't changing from > "UTF8" to "UTF-8" be confusing? After reading the comment by Bruce > MacNaughton at https://processwire.com/blog/posts/page-name-charset-utf8/, > maybe we could use a different name? But then again, that might be even > more confusing. Whatever solution would be acceptable if it works correctly > and the documentation is in place. > > — > Reply to this email directly, view it on GitHub > <#2015 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AACQEUB53MLV4QASGC2IP7T2IWPP7AVCNFSM6AAAAABUJ6VOZOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNRYGM2DOOBRGA> > . > You are receiving this because you were mentioned.Message ID: > ***@***.***> >

ryancramerdesign · 2025-01-03T17:48:16Z

@matjazpotocnik @inreti-sb I've pushed another update that makes it use the old/buggy behavior for existing installations and the fixed behavior for new installations (after tomorrow). I decided not to do the UTF-8 vs UTF8 toggle because it was too many lines of code to change and I'm feeling too lazy for that. Instead, you can force the behavior by prepending it to your $config->pageNameWhitelist as v1, v2 or v3. The v1 is the old/buggy behavior in PHP 7.4+, the v2 is the dedicated Punycode library, and the v3 is fixed behavior in PHP 7.4+.

$config->pageNameWhitelist = 'v3' . $config->pageNameWhitelist;

This is just to ensure there's a way for existing installations to use the fixed behavior, should they not have concerns about existing page that might already in the DB.

inreti-sb · 2025-01-03T22:15:10Z

Thank you @ryancramerdesign and @matjazpotocnik for making this important fix possible quickly!
In my first tests the fix seem to work perfectly.

matjazpotocnik · 2025-01-04T09:33:21Z

$s="öln-in-köln-";
$a=$sanitizer->punyEncodeName($s,3); // return 'ln-in-kln-' WRONG
$b=$sanitizer->punyDecodeName($a,3); // return 'ln-in-kln-' WRONG

this line is the problem:

if(strlen($_value) >= 50) $value = $info['result'];

IDN return false even on shorter strings, as the input value is incorrect. So, above line should be

if($value === false) $value = $info['result']; or
if(empty($value)) $value = $info['result']; or simply
$value = $info['result'];

ryancramerdesign · 2025-01-05T15:12:03Z

Thanks @matjazpotocnik I've updated it to always use $info[result]

inreti-sb · 2025-01-07T07:26:30Z

Don't know if it's related to this bug and fix, but ! (exclamation marks) are sometimes substituted with - (dashes) although dashes are in the pageNameWhitelist. E.g.:

Werde Teil des Projekts! → werde-teil-des-projekts-
Schluss mit diesem Blödsinn! Gegen Spiele! → schluss-mit-diesem-blödsinn!-gegen-spiele!

matjazpotocnik · 2025-01-07T20:00:00Z

I tried and got:

Werde Teil des Projekts! → werde-teil-des-projekts
Schluss mit diesem Blödsinn! Gegen Spiele! → schluss-mit-diesem-blödsinn-gegen-spiele

inreti-sb · 2025-01-07T20:37:59Z

Did you added ! in the whitelist, @matjazpotocnik ? :)

inreti-sb · 2025-01-08T14:00:12Z

If i add " (customs mark) to the pageNameWhitelist all this characters get substituted by quot (with a dash).
"Foo" → quot-foo-quot

Maybe it is helpful - if not already the case - to set up tests to check the correct pagenames (output) for a list of pagetitles (input)?

BernhardBaumrock · 2025-01-08T14:03:08Z

Maybe it is helpful - if not already the case - to set up tests to check the correct pagenames (output) for a list of pagetitles (input)?

That would be an extremely useful step imho! Please make that happen :)

poljpocket · 2025-01-08T14:58:44Z

@inreti-sb You maybe can take my special cases I've tested as a starting point. They should definitely work now. Let me know if I can help.

…re/processwire-issues#2015

ryancramerdesign · 2025-01-08T15:13:57Z

@inreti-sb The double quote is part of the page name blacklist, so isn't supported even if part of your whitelist. Here's the blacklist:

$blacklist = '/\\%"\'<>?#@:;,+=*^$()[]{}|&';

While exclamation point isn't part of our blacklist, google seems to suggest it isn't supported in Punycode, so maybe it should be added to the blacklist, I'm not sure, I have to test it out. I did just push some other improvements also.

matjazpotocnik · 2025-01-08T16:36:17Z

@inreti-sb I didn't add ! to the whitelist as I think these characters are not allowed:

(){}[]|`¬¦! "£$%^&*"<>:;#~_-+=,@

inreti-sb · 2025-01-09T12:47:53Z

If i add " (customs mark) to the pageNameWhitelist all this characters get substituted by quot (with a dash).
"Foo" → quot-foo-quot

I do not yet know the cause, but this is only happening on pagenames generated by my hook here:
https://processwire.com/talk/topic/30444-prepend-date-in-page-name-by-hook-creates-improper-results-on-some-existing-pages/
Will do more research and perhaps ask for help and write the solution there.

inreti-sb · 2025-01-09T12:59:29Z

Now I use the new Sanitizer.php version but still get:

✗ I have a dream! → i-have-a-dream-
✗ Was! folgt? → was--folgt-
✓ Was! folgt? nach 100 € Katar? → was-folgt-nach-100-katar

I would expect that not allowed characters should be removed.
On shorter pagenames this seems not do be the case.

inreti-sb · 2025-01-09T13:05:30Z

And I would expect and prefer to only have one dash here. What do you think?

✗ This is the dash - test → this-is-the-dash---test

ryancramerdesign added a commit to processwire/processwire that referenced this issue Dec 31, 2024

Fix issue processwire/processwire-issues#2015

4be3890

ryancramerdesign added the status: fixed-please confirm label Dec 31, 2024

ryancramerdesign added a commit to processwire/processwire that referenced this issue Jan 3, 2025

Additional updates for processwire/processwire-issues#2015

e6ace73

ryancramerdesign added a commit to processwire/processwire that referenced this issue Jan 5, 2025

Update for processwire/processwire-issues#2015

00a6baa

ryancramerdesign added a commit to processwire/processwire that referenced this issue Jan 8, 2025

Some more improvements to UTF-8 page names, also related to processwi…

8702840

…re/processwire-issues#2015

ryancramerdesign added a commit to processwire/processwire that referenced this issue Jan 9, 2025

More updates for processwire/processwire-issues#2015

4604c09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some extended page names get broken by punycode transformation #2015

Some extended page names get broken by punycode transformation #2015

inreti-sb commented Dec 28, 2024

matjazpotocnik commented Dec 30, 2024

ryancramerdesign commented Dec 31, 2024

matjazpotocnik commented Dec 31, 2024

inreti-sb commented Jan 1, 2025

matjazpotocnik commented Jan 2, 2025

matjazpotocnik commented Jan 2, 2025

ryancramerdesign commented Jan 2, 2025

matjazpotocnik commented Jan 2, 2025

ryancramerdesign commented Jan 2, 2025 via email

ryancramerdesign commented Jan 2, 2025 via email

ryancramerdesign commented Jan 3, 2025

inreti-sb commented Jan 3, 2025

matjazpotocnik commented Jan 4, 2025

ryancramerdesign commented Jan 5, 2025

inreti-sb commented Jan 7, 2025 •

edited

Loading

matjazpotocnik commented Jan 7, 2025

inreti-sb commented Jan 7, 2025

inreti-sb commented Jan 8, 2025 •

edited

Loading

BernhardBaumrock commented Jan 8, 2025

poljpocket commented Jan 8, 2025 •

edited

Loading

ryancramerdesign commented Jan 8, 2025

matjazpotocnik commented Jan 8, 2025

inreti-sb commented Jan 9, 2025

inreti-sb commented Jan 9, 2025 •

edited

Loading

inreti-sb commented Jan 9, 2025 •

edited

Loading

Some extended page names get broken by punycode transformation #2015

Some extended page names get broken by punycode transformation #2015

Comments

inreti-sb commented Dec 28, 2024

Short description of the issue

Expected behavior

Actual behavior

Optional: Screenshots/Links that demonstrate the issue

Optional: Suggestion for a possible fix

Steps to reproduce the issue

Setup/Environment

matjazpotocnik commented Dec 30, 2024

ryancramerdesign commented Dec 31, 2024

matjazpotocnik commented Dec 31, 2024

inreti-sb commented Jan 1, 2025

matjazpotocnik commented Jan 2, 2025

matjazpotocnik commented Jan 2, 2025

ryancramerdesign commented Jan 2, 2025

matjazpotocnik commented Jan 2, 2025

ryancramerdesign commented Jan 2, 2025 via email

ryancramerdesign commented Jan 2, 2025 via email

ryancramerdesign commented Jan 3, 2025

inreti-sb commented Jan 3, 2025

matjazpotocnik commented Jan 4, 2025

ryancramerdesign commented Jan 5, 2025

inreti-sb commented Jan 7, 2025 • edited Loading

matjazpotocnik commented Jan 7, 2025

inreti-sb commented Jan 7, 2025

inreti-sb commented Jan 8, 2025 • edited Loading

BernhardBaumrock commented Jan 8, 2025

poljpocket commented Jan 8, 2025 • edited Loading

ryancramerdesign commented Jan 8, 2025

matjazpotocnik commented Jan 8, 2025

inreti-sb commented Jan 9, 2025

inreti-sb commented Jan 9, 2025 • edited Loading

inreti-sb commented Jan 9, 2025 • edited Loading

inreti-sb commented Jan 7, 2025 •

edited

Loading

inreti-sb commented Jan 8, 2025 •

edited

Loading

poljpocket commented Jan 8, 2025 •

edited

Loading

inreti-sb commented Jan 9, 2025 •

edited

Loading

inreti-sb commented Jan 9, 2025 •

edited

Loading