-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some extended page names get broken by punycode transformation #2015
Comments
I can replicate it. Interesting: $p = $pages->add("test_template", "/", "Füße Füße"); $string="Köln"; Same with Fuß and Füße. In page editor it's correct, until you save: after save: |
@inreti-sb @matjazpotocnik Thanks, I've pushed a fix for this. It wasn't in the Punycode library that we use, as that is only a fallback if PHP's IDN functions aren't available. And the Punycode library we are using appears to work correctly when forcing it to use it over PHP's IDN functions. It's the the PHP IDN functions that had changed behavior in some recent PHP version. PHP changed the default behavior of idn_to_ascii and idn_to_utf8, moving the previous behavior to a transitional flag argument. Using IDNs that aren't actual domains now requires that flag, so I have added it, which makes the functions work how we want them to again. There was also another issue related to page names longer than 50 characters, which is longer than is supported in an IDN, so we have to split them up into 12 byte chunks and then only encode or decode the chunks that need it. Can yo confirm it is also fixed there? Thanks. |
@ryancramerdesign I can confirm it's fixed. Thank you! |
Happy new year to everyone. Thanks for verifying @matjazpotocnik and thanks so much for the quick fix at the end of the year @ryancramerdesign! Short pagename seem to work in my first tests, longer ones not really. E.g.: Can someone verify this, please? |
You're right; it works on shorter strings but not on longer ones. It looks like 50 is the magic number:
|
What I found:
What to do? We could call |
@matjazpotocnik Those sound like good solutions. I'm not sure how best to handle this case though. If the output of the punyEncodeName() changes, then UTF-8 pages that had their names encoded with it after PHP 7.4, might be incorrect. But correcting it means those pages will no longer be accessible by URL, because the URL might very well not be a correct conversion, but it's what's already being used and saved in the DB. So I think I may have to revert back to the original version before the fix, and then use the new/corrected version for new installations after January 2 2025. I'm also thinking that if you set your |
I don't know what the best solution would be. Wouldn't changing from "UTF8" to "UTF-8" be confusing? After reading the comment by Bruce MacNaughton at https://processwire.com/blog/posts/page-name-charset-utf8/, maybe we could use a different name? But then again, that might be even more confusing. Whatever solution would be acceptable if it works correctly and the documentation is in place. |
My thought is it wouldn't be confusing because we wouldn't be telling
anyone to use UTF8. It would just be a way for existing installations to be
untouched, with no need to even know about UTF-8 vs UTF8. It could be
something else, but I don't see any reason to have it be an "either or"
option since there would be no reason to have the old/broken behavior
except if already stuck with it.
…On Thu, Jan 2, 2025, 3:37 PM Matjaž Potočnik ***@***.***> wrote:
I don't know what the best solution would be. Wouldn't changing from
"UTF8" to "UTF-8" be confusing? After reading the comment by Bruce
MacNaughton at https://processwire.com/blog/posts/page-name-charset-utf8/,
maybe we could use a different name? But then again, that might be even
more confusing. Whatever solution would be acceptable if it works correctly
and the documentation is in place.
—
Reply to this email directly, view it on GitHub
<#2015 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACQEUB53MLV4QASGC2IP7T2IWPP7AVCNFSM6AAAAABUJ6VOZOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNRYGM2DOOBRGA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Installs after Jan 2 2025 would use the fixed behavior either way, UTF8 or
UTF-8 or whatever alternative name is used.
…On Thu, Jan 2, 2025, 3:42 PM Ryan Cramer ***@***.***> wrote:
My thought is it wouldn't be confusing because we wouldn't be telling
anyone to use UTF8. It would just be a way for existing installations to be
untouched, with no need to even know about UTF-8 vs UTF8. It could be
something else, but I don't see any reason to have it be an "either or"
option since there would be no reason to have the old/broken behavior
except if already stuck with it.
On Thu, Jan 2, 2025, 3:37 PM Matjaž Potočnik ***@***.***>
wrote:
> I don't know what the best solution would be. Wouldn't changing from
> "UTF8" to "UTF-8" be confusing? After reading the comment by Bruce
> MacNaughton at https://processwire.com/blog/posts/page-name-charset-utf8/,
> maybe we could use a different name? But then again, that might be even
> more confusing. Whatever solution would be acceptable if it works correctly
> and the documentation is in place.
>
> —
> Reply to this email directly, view it on GitHub
> <#2015 (comment)>,
> or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/AACQEUB53MLV4QASGC2IP7T2IWPP7AVCNFSM6AAAAABUJ6VOZOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNRYGM2DOOBRGA>
> .
> You are receiving this because you were mentioned.Message ID:
> ***@***.***>
>
|
@matjazpotocnik @inreti-sb I've pushed another update that makes it use the old/buggy behavior for existing installations and the fixed behavior for new installations (after tomorrow). I decided not to do the UTF-8 vs UTF8 toggle because it was too many lines of code to change and I'm feeling too lazy for that. Instead, you can force the behavior by prepending it to your $config->pageNameWhitelist = 'v3' . $config->pageNameWhitelist; This is just to ensure there's a way for existing installations to use the fixed behavior, should they not have concerns about existing page that might already in the DB. |
Thank you @ryancramerdesign and @matjazpotocnik for making this important fix possible quickly! |
this line is the problem:
IDN return false even on shorter strings, as the input value is incorrect. So, above line should be
|
Thanks @matjazpotocnik I've updated it to always use $info[result] |
Don't know if it's related to this bug and fix, but ! (exclamation marks) are sometimes substituted with - (dashes) although dashes are in the pageNameWhitelist. E.g.:
|
I tried and got: Werde Teil des Projekts! → werde-teil-des-projekts |
Did you added ! in the whitelist, @matjazpotocnik ? :) |
If i add " (customs mark) to the pageNameWhitelist all this characters get substituted by quot (with a dash). Maybe it is helpful - if not already the case - to set up tests to check the correct pagenames (output) for a list of pagetitles (input)? |
That would be an extremely useful step imho! Please make that happen :) |
@inreti-sb You maybe can take my special cases I've tested as a starting point. They should definitely work now. Let me know if I can help. |
@inreti-sb The double quote is part of the page name blacklist, so isn't supported even if part of your whitelist. Here's the blacklist: $blacklist = '/\\%"\'<>?#@:;,+=*^$()[]{}|&'; While exclamation point isn't part of our blacklist, google seems to suggest it isn't supported in Punycode, so maybe it should be added to the blacklist, I'm not sure, I have to test it out. I did just push some other improvements also. |
@inreti-sb I didn't add
|
I do not yet know the cause, but this is only happening on pagenames generated by my hook here: |
Now I use the new Sanitizer.php version but still get:
I would expect that not allowed characters should be removed. |
And I would expect and prefer to only have one dash here. What do you think?
|
Short description of the issue
Some (at least german) special characters get broken in extended page names.
Before being stored in the database, special characters are transformed to punycode.
Looks like it is using some non-standard punycode to encode the names and this is where some characters seem to break.
Expected behavior
(Pagetitle → Pagename)
Fuß → fuß
Füße → füße
Actual behavior
Fuß → s
Füße → füsse
Optional: Screenshots/Links that demonstrate the issue
We verified the bug on different and fresh installs and discussed it with screenshots in this forum thread:
https://processwire.com/talk/topic/30444-prepend-date-in-page-name-by-hook-creates-improper-results-on-some-existing-pages/
Optional: Suggestion for a possible fix
The Processwire function to convert unicode characters to ASCII characters seems to have a bug / be non-standard.
Steps to reproduce the issue
Setup/Environment
The text was updated successfully, but these errors were encountered: