Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some extended page names get broken by punycode transformation #2015

Open
inreti-sb opened this issue Dec 28, 2024 · 25 comments
Open

Some extended page names get broken by punycode transformation #2015

inreti-sb opened this issue Dec 28, 2024 · 25 comments

Comments

@inreti-sb
Copy link

Short description of the issue

Some (at least german) special characters get broken in extended page names.
Before being stored in the database, special characters are transformed to punycode.
Looks like it is using some non-standard punycode to encode the names and this is where some characters seem to break.

Expected behavior

(Pagetitle → Pagename)

Fuß → fuß
Füße → füße

Actual behavior

Fuß → s
Füße → füsse

Optional: Screenshots/Links that demonstrate the issue

image png 388272219694a695ee29131ed8009c92

Processwire extended page name bug punycode in database

We verified the bug on different and fresh installs and discussed it with screenshots in this forum thread:
https://processwire.com/talk/topic/30444-prepend-date-in-page-name-by-hook-creates-improper-results-on-some-existing-pages/

Optional: Suggestion for a possible fix

The Processwire function to convert unicode characters to ASCII characters seems to have a bug / be non-standard.

Steps to reproduce the issue

  1. Fresh Processwire install with blank profile
  2. Enable Extended Page Name Feature for UTF-8 support like here: https://processwire.com/blog/posts/page-name-charset-utf8/
  3. Create a new page with page title Fuß and save the page
  4. Look at the wrong generated pagename
  5. Compare the according punycode stored in in the database with the standard punycode by a IDN Punycode converter.

Setup/Environment

  • ProcessWire version: The current newest stable one 3.0.229
  • (Optional) PHP version: 8.2
  • (Optional) MariaDB version: 10.11.10
  • (Optional) Any 3rd party modules that are installed and could be related to the issue: Even on a fresh install with blank site profile.
@matjazpotocnik
Copy link
Collaborator

I can replicate it.

Interesting:

$p = $pages->add("test_template", "/", "Füße Füße");
$pages->uncacheAll();
d($p->name); // 'füße-füße'

$string="Köln";
d($sanitizer->pageName($string)); // 'kln'
d($sanitizer->pageName($string, 16)); // 'köln'
d($sanitizer->pageNameUTF8($string)); // 'köln'

Same with Fuß and Füße. 

In page editor it's correct, until you save:

image

after save:

image

ryancramerdesign added a commit to processwire/processwire that referenced this issue Dec 31, 2024
@ryancramerdesign
Copy link
Member

@inreti-sb @matjazpotocnik Thanks, I've pushed a fix for this. It wasn't in the Punycode library that we use, as that is only a fallback if PHP's IDN functions aren't available. And the Punycode library we are using appears to work correctly when forcing it to use it over PHP's IDN functions. It's the the PHP IDN functions that had changed behavior in some recent PHP version. PHP changed the default behavior of idn_to_ascii and idn_to_utf8, moving the previous behavior to a transitional flag argument. Using IDNs that aren't actual domains now requires that flag, so I have added it, which makes the functions work how we want them to again.

There was also another issue related to page names longer than 50 characters, which is longer than is supported in an IDN, so we have to split them up into 12 byte chunks and then only encode or decode the chunks that need it.

Can yo confirm it is also fixed there? Thanks.

@matjazpotocnik
Copy link
Collaborator

@ryancramerdesign I can confirm it's fixed. Thank you!

@inreti-sb
Copy link
Author

Happy new year to everyone. Thanks for verifying @matjazpotocnik and thanks so much for the quick fix at the end of the year @ryancramerdesign!

Short pagename seem to work in my first tests, longer ones not really.

E.g.:
Köln gegen Köln in Köln mit allen Köln Köln gegen Köln in Köln mit allen Köln → köln-gegen-kln-in-kln-mit-allen-köln-köln-gegen-köln-in-köln-mit-allen-kln/

Can someone verify this, please?

@matjazpotocnik
Copy link
Collaborator

You're right; it works on shorter strings but not on longer ones. It looks like 50 is the magic number:

Köln gegen Köln in Köln mit allen Köln Köln   -> 48 bytes, ok
Köln gegen Köln in Köln mit allen Köln KölnA  -> 49 bytes, ok
Köln gegen Köln in Köln mit allen Köln KölnAB -> 50 bytes, not ok
Köln gegen Köln in Köln mit allen Köln KolnAB -> 49 bytes, ok (replaced last ö with o)

@matjazpotocnik
Copy link
Collaborator

What I found:

  • It looks like idn_to_ascii works with a length of up to 64 bytes, at least in PHP 8.2
  • Max IDN length is 255 bytes encoded
  • Fallback library (very old) works as expected :-) no need for splitting the string. There are some deprecations, but there is a fix (line 154 and 158):
154       $code = $t + (((int) $q - $t) % (static::BASE - $t)); //MP
155       $output .= static::$encodeTable[$code];
156       $q = ($q - $t) / (static::BASE - $t);
157    }
158    $output .= static::$encodeTable[(int) $q]; //MP
  • idn_to_ascii("öln-in-köln-", 16) return false! error IDNA_ERROR_TRAILING_HYPHEN (16)
  • idn_to_ascii("0---0.dk") return false! error: IDNA_ERROR_HYPHEN_3_4 (32) - UTS#46 does not allow labels with hyphens in the 3rd and 4th positions.
  • and so on...

What to do? We could call idn_to_ascii("öln-in-köln-", 16, INTL_IDNA_VARIANT_UTS46, $info); and use $info['result']? Or use fallback library?

@ryancramerdesign
Copy link
Member

@matjazpotocnik Those sound like good solutions. I'm not sure how best to handle this case though. If the output of the punyEncodeName() changes, then UTF-8 pages that had their names encoded with it after PHP 7.4, might be incorrect. But correcting it means those pages will no longer be accessible by URL, because the URL might very well not be a correct conversion, but it's what's already being used and saved in the DB. So I think I may have to revert back to the original version before the fix, and then use the new/corrected version for new installations after January 2 2025. I'm also thinking that if you set your $config->pageNameCharset to "UTF-8" rather than "UTF8" then it would use the new/correct version. Then I could update any instructions to refer to "UTF-8", but existing installations could continue as-is. It would be great if PHP hadn't changed the behavior of the function in 7.4 but since they did, I just need to figure out the best way to fix it without affecting existing installations.

@matjazpotocnik
Copy link
Collaborator

I don't know what the best solution would be. Wouldn't changing from "UTF8" to "UTF-8" be confusing? After reading the comment by Bruce MacNaughton at https://processwire.com/blog/posts/page-name-charset-utf8/, maybe we could use a different name? But then again, that might be even more confusing. Whatever solution would be acceptable if it works correctly and the documentation is in place.

@ryancramerdesign
Copy link
Member

ryancramerdesign commented Jan 2, 2025 via email

@ryancramerdesign
Copy link
Member

ryancramerdesign commented Jan 2, 2025 via email

ryancramerdesign added a commit to processwire/processwire that referenced this issue Jan 3, 2025
@ryancramerdesign
Copy link
Member

@matjazpotocnik @inreti-sb I've pushed another update that makes it use the old/buggy behavior for existing installations and the fixed behavior for new installations (after tomorrow). I decided not to do the UTF-8 vs UTF8 toggle because it was too many lines of code to change and I'm feeling too lazy for that. Instead, you can force the behavior by prepending it to your $config->pageNameWhitelist as v1, v2 or v3. The v1 is the old/buggy behavior in PHP 7.4+, the v2 is the dedicated Punycode library, and the v3 is fixed behavior in PHP 7.4+.

$config->pageNameWhitelist = 'v3' . $config->pageNameWhitelist; 

This is just to ensure there's a way for existing installations to use the fixed behavior, should they not have concerns about existing page that might already in the DB.

@inreti-sb
Copy link
Author

Thank you @ryancramerdesign and @matjazpotocnik for making this important fix possible quickly!
In my first tests the fix seem to work perfectly.

@matjazpotocnik
Copy link
Collaborator

$s="öln-in-köln-";
$a=$sanitizer->punyEncodeName($s,3); // return 'ln-in-kln-' WRONG
$b=$sanitizer->punyDecodeName($a,3); // return 'ln-in-kln-' WRONG

this line is the problem:

if(strlen($_value) >= 50) $value = $info['result'];

IDN return false even on shorter strings, as the input value is incorrect. So, above line should be

if($value === false) $value = $info['result']; or
if(empty($value)) $value = $info['result']; or simply
$value = $info['result'];

ryancramerdesign added a commit to processwire/processwire that referenced this issue Jan 5, 2025
@ryancramerdesign
Copy link
Member

Thanks @matjazpotocnik I've updated it to always use $info[result]

@inreti-sb
Copy link
Author

inreti-sb commented Jan 7, 2025

Don't know if it's related to this bug and fix, but ! (exclamation marks) are sometimes substituted with - (dashes) although dashes are in the pageNameWhitelist. E.g.:

Werde Teil des Projekts! → werde-teil-des-projekts-
Schluss mit diesem Blödsinn! Gegen Spiele! → schluss-mit-diesem-blödsinn!-gegen-spiele!

@matjazpotocnik
Copy link
Collaborator

I tried and got:

Werde Teil des Projekts! → werde-teil-des-projekts
Schluss mit diesem Blödsinn! Gegen Spiele! → schluss-mit-diesem-blödsinn-gegen-spiele

@inreti-sb
Copy link
Author

Did you added ! in the whitelist, @matjazpotocnik ? :)

@inreti-sb
Copy link
Author

inreti-sb commented Jan 8, 2025

If i add " (customs mark) to the pageNameWhitelist all this characters get substituted by quot (with a dash).
"Foo" → quot-foo-quot

Maybe it is helpful - if not already the case - to set up tests to check the correct pagenames (output) for a list of pagetitles (input)?

@BernhardBaumrock
Copy link

Maybe it is helpful - if not already the case - to set up tests to check the correct pagenames (output) for a list of pagetitles (input)?

That would be an extremely useful step imho! Please make that happen :)

@poljpocket
Copy link

poljpocket commented Jan 8, 2025

@inreti-sb You maybe can take my special cases I've tested as a starting point. They should definitely work now. Let me know if I can help.

@ryancramerdesign
Copy link
Member

@inreti-sb The double quote is part of the page name blacklist, so isn't supported even if part of your whitelist. Here's the blacklist:

$blacklist = '/\\%"\'<>?#@:;,+=*^$()[]{}|&';

While exclamation point isn't part of our blacklist, google seems to suggest it isn't supported in Punycode, so maybe it should be added to the blacklist, I'm not sure, I have to test it out. I did just push some other improvements also.

@matjazpotocnik
Copy link
Collaborator

@inreti-sb I didn't add ! to the whitelist as I think these characters are not allowed:

(){}[]|`¬¦! "£$%^&*"<>:;#~_-+=,@

@inreti-sb
Copy link
Author

If i add " (customs mark) to the pageNameWhitelist all this characters get substituted by quot (with a dash).
"Foo" → quot-foo-quot

I do not yet know the cause, but this is only happening on pagenames generated by my hook here:
https://processwire.com/talk/topic/30444-prepend-date-in-page-name-by-hook-creates-improper-results-on-some-existing-pages/
Will do more research and perhaps ask for help and write the solution there.

@inreti-sb
Copy link
Author

inreti-sb commented Jan 9, 2025

Now I use the new Sanitizer.php version but still get:

✗ I have a dream! → i-have-a-dream-
✗ Was! folgt? → was--folgt-
✓ Was! folgt? nach 100 € Katar? → was-folgt-nach-100-katar

I would expect that not allowed characters should be removed.
On shorter pagenames this seems not do be the case.

@inreti-sb
Copy link
Author

inreti-sb commented Jan 9, 2025

And I would expect and prefer to only have one dash here. What do you think?

✗ This is the dash - test → this-is-the-dash---test

ryancramerdesign added a commit to processwire/processwire that referenced this issue Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants