Parser hangs on some Unicode numbers and symbols in identifiers #10275
From [email protected]
This is a bug report for perl from dgl@dgl.cx
If a source file or eval is in UTF-8 (i.e. use utf8 for files, or a UTF-8
The codepoints that cause a hang depend on the version of unicode perl is using.
Perl 5.10.0:
Perl 5.12.0 RC3:
Flags:
Site configuration information for perl 5.12.0:
Configured by dgl at Sat Apr 3 10:17:37 BST 2010.
Summary of my perl5 (revision 5 version 12 subversion 0) configuration:
Locally applied patches:
@INC for perl 5.12.0:
Environment for perl 5.12.0: |
From @khwilliamson
dgl@mogao.nonet (via RT) wrote:
This problem goes away if 'use utf8' is added to the program, which it |
The RT System itself - Status changed from 'new' to 'open' |
From @dgl
On Sat, Apr 03, 2010 at 08:28:42PM -0600, karl williamson wrote:
That doesn't always seem to be the case:
perl -C -le'print "use utf8;\n\x{212e}"' | perl |
From @khwilliamson
David Leadbeater wrote:
It is looping, and eventually my computer runs out of memory. If I |
From @cpansprout
This was broken by change 30148
In Perl_yylex in toke.c:
switch (*s) {
isIDFIRST_lazy_if returns true for characters in ID_Continue that are
/* The ID_Start of Unicode is quite limiting: it assumes a L-class
)
Then further down (in toke.c):
keylookup: {
S_scan_word has:
else if (UTF && UTF8_IS_START(*s) && isALNUM_utf8((U8*)s)) {
So characters in \p{OtherIDContinue}, such as U+387 and U+1369, get
I think scan_word should be using is_utf8_idcont, rather than |
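A minimal sketch of how a predicate mismatch like the one described above can stall a lexer, assuming the mechanism is that the dispatcher accepts a character as an identifier start while the word scanner then consumes nothing. The `toy_*` names and the use of `.` as a stand-in for an OtherIDContinue character such as U+0387 are illustrative assumptions; none of this is code from toke.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the predicates; NOT Perl's real Unicode tables.
 * '.' plays the role of an OtherIDContinue character: it may continue
 * an identifier and is not a digit, but it is not a "word" character. */
static bool toy_is_digit(char c) { return c >= '0' && c <= '9'; }
static bool toy_is_word(char c) {
    return (c >= 'a' && c <= 'z') || toy_is_digit(c) || c == '_';
}
static bool toy_is_idcont(char c) { return toy_is_word(c) || c == '.'; }

/* Dispatcher test, modelled on the old isIDFIRST_utf8 definition
 * quoted in handy.h: "ID_Continue but not digits". */
static bool toy_is_idfirst(char c) {
    return toy_is_idcont(c) && !toy_is_digit(c);
}

/* Scanner: consumes characters accepted by `accept`, returns the count. */
static size_t toy_scan_word(const char *s, bool (*accept)(char)) {
    size_t n = 0;
    while (s[n] != '\0' && accept(s[n]))
        n++;
    return n;
}

/* Returns the number of dispatch rounds needed to lex `s`, or -1 if
 * the round cap is exceeded. The cap is hypothetical: it only exists
 * so that this sketch reports the stall instead of hanging. */
static int toy_lex(const char *s, bool (*accept)(char), int max_rounds) {
    int rounds = 0;
    assert(max_rounds > 0);
    while (*s != '\0') {
        if (++rounds > max_rounds)
            return -1;                      /* no progress: would loop forever */
        if (toy_is_idfirst(*s))
            s += toy_scan_word(s, accept);  /* may advance by zero */
        else
            s++;                            /* any other character: skip it */
    }
    return rounds;
}
```

With `toy_is_word` as the scanner predicate, `toy_lex("ab.", toy_is_word, 1000)` hits the round cap and returns -1, because the dispatcher keeps handing `.` to a scanner that rejects it. With `toy_is_idcont`, which matches what the dispatcher accepts, the same input lexes in one round.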
From @cpansprout
Inline Patch
diff -Nurp blead/t/comp/parser.t blead-74022/t/comp/parser.t
--- blead/t/comp/parser.t 2010-01-05 14:08:09.000000000 -0800
+++ blead-74022/t/comp/parser.t 2010-04-25 14:12:11.000000000 -0700
@@ -3,7 +3,7 @@
# Checks if the parser behaves correctly in edge cases
# (including weird syntax errors)
-print "1..122\n";
+print "1..123\n";
sub failed {
my ($got, $expected, $name) = @_;
@@ -353,6 +353,11 @@ eval q{
};
is($@, "", "multiline whitespace inside substitute expression");
+# bug #74022: Loop on characters in \p{OtherIDContinue}
+# This test hangs if it fails.
+eval chr 0x387;
+is(1,1, '[perl #74022] Parser looping on OtherIDContinue chars');
+
# Add new tests HERE:
# More awkward tests for #line. Keep these at the end, as they will screw
Inline Patch
diff -Nurp blead/toke.c blead-74022/toke.c
--- blead/toke.c 2010-04-23 05:36:10.000000000 -0700
+++ blead-74022/toke.c 2010-04-25 14:03:13.000000000 -0700
@@ -11662,7 +11662,7 @@ S_scan_word(pTHX_ register char *s, char
*d++ = *s++;
*d++ = *s++;
}
- else if (UTF && UTF8_IS_START(*s) && isALNUM_utf8((U8*)s)) {
+ else if (UTF && UTF8_IS_START(*s) && is_utf8_idcont((U8*)s)) {
char *t = s + UTF8SKIP(s);
size_t len;
while (UTF8_IS_CONTINUED(*t) && is_utf8_mark((U8*)t)) |
From @cpansprout
On Apr 25, 2010, at 2:21 PM, Father Chrysostomos wrote:
The tests had not finished running when I sent that. lib/utf8.t is
/^(?!\p{IsDigit})[\p{ID_Continue}_]+/
whereas what it actually matches (ignoring package separators) is
/^([\p{IsWord}_]\pM?)*/
My patch prevents qq·aaa· from being valid syntax, because U+B7 is
So there is a potential for breakage if we make everything match |
From @khwilliamson
Father Chrysostomos wrote:
Thanks for finding this. I've wondered about the comment in handy.h
/* The ID_Start of Unicode is quite limiting: it assumes a L-class
Jarkko wrote that comment in 2002. Since then (actually quite a long
Jarkko wrote me last year that "Unicode knows best". In other words,
In 5.12, I took Jarkko's advice, and changed our definitions of \p
I had been planning to look at this area too, and your posts spurred me
The middle dot that caused your test to fail is one that Unicode has had
Actually, I think we should move not to ID_Start, but to Unicode's
Unicode is keeping ID_Start around for backwards compatibility. I don't
To summarize, I propose that we use Unicode's XID_Start and XID_Continue |
From @khwilliamson
Since this causes Perl to hang, I think it should be addressed somehow
Father Chrysostomos wrote:
My first take is that I think we would just change the meanings. The
And ID_Continue contains 19 more characters than XID_Continue:
So the differences are minimal; we would be recognizing 23 or 19 fewer
But I need to further study things to come up with a recommendation
Thanks. Have you considered adding a timeout? test.pl has one that will |
From @cpansprout
On Apr 27, 2010, at 6:56 PM, karl williamson wrote:
Would we change the meanings of is_utf8_idcont and is_utf8_idfirst, or
In anticipation of this change, I’ve attached a patch that corrects |
From @cpansprout
Inline Patch
diff -Nurp blead/lib/utf8.t blead-74022/lib/utf8.t
--- blead/lib/utf8.t 2009-11-19 08:51:39.000000000 -0800
+++ blead-74022/lib/utf8.t 2010-04-30 18:11:43.000000000 -0700
@@ -329,8 +329,9 @@ END
SKIP: {
skip("Embedded UTF-8 does not work in EBCDIC", 1) if ord("A") == 193;
use utf8;
- eval qq{is(q \xc3\xbc test \xc3\xbc, qq\xc2\xb7 test \xc2\xb7,
- "utf8 quote delimiters [perl #16823]");};
+ is eval qq{q \xc3\xbc test \xc3\xbc . qq\xc2\xa1 test \xc2\xa1},
+ ' test test ',
+ "utf8 quote delimiters [perl #16823]";
}
# Test the "internals". |
From @khwilliamson
Father Chrysostomos wrote:
That sounds good to me. |
From @cpansprout
On May 2, 2010, at 3:39 PM, karl williamson wrote:
Perhaps die in scan_word if the identifier has no length?
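A minimal sketch of the kind of guard suggested above, assuming the idea is for the scanner to report a zero-length identifier to its caller rather than return as if it had made progress. `checked_scan_word` is a hypothetical ASCII-only helper for illustration, not Perl's S_scan_word, and the -1 return stands in for whatever error reporting the real code would use.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: scans an ASCII word ([A-Za-z0-9_]+) at `s`.
 * On success stores the word length in *len_out and returns 0.
 * Returns -1 when nothing was consumed, so the caller can raise an
 * error instead of re-dispatching on the same character forever. */
static int checked_scan_word(const char *s, size_t *len_out) {
    size_t n = 0;
    assert(s != NULL && len_out != NULL);
    while (isalnum((unsigned char)s[n]) || s[n] == '_')
        n++;
    if (n == 0)
        return -1;          /* identifier has no length: refuse to continue */
    *len_out = n;
    return 0;
}
```

A caller would check the return value before advancing its input pointer, which turns a silent stall into an explicit failure path.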
In anticipation of what I think your recommendation will be, I’m
That patch (open_jE1FFxzb.txt) was to fix the existing ‘qq· test ·’
This new patch adds a test to t/comp/parser.t, which does not use test.pl
The open_jE1FFxzb.txt patch should be applied first, to keep tests |
From @cpansprout
Inline Patch
diff -Nurp blead/handy.h blead-74022/handy.h
--- blead/handy.h 2010-04-26 02:12:12.000000000 -0700
+++ blead-74022/handy.h 2010-05-02 18:07:59.000000000 -0700
@@ -626,10 +626,9 @@ parameter, casts can silently truncate a
#define isBLANK_LC_uni(c) isBLANK(c) /* could be wrong */
#define isALNUM_utf8(p) is_utf8_alnum(p)
-/* The ID_Start of Unicode is quite limiting: it assumes a L-class
- * character (meaning that you cannot have, say, a CJK character).
- * Instead, let's allow ID_Continue but not digits. */
-#define isIDFIRST_utf8(p) (is_utf8_idcont(p) && !is_utf8_digit(p))
+/* The isIDFIRST_utf8 macro has changed in the past. See
+ http://rt.perl.org/rt3/Public/Bug/Display.html?id=74022 for details. */
+#define isIDFIRST_utf8(p) is_utf8_idfirst(p)
#define isALPHA_utf8(p) is_utf8_alpha(p)
#define isSPACE_utf8(p) is_utf8_space(p)
#define isDIGIT_utf8(p) is_utf8_digit(p)
diff -Nurp blead/t/comp/parser.t blead-74022/t/comp/parser.t
--- blead/t/comp/parser.t 2010-01-05 14:08:09.000000000 -0800
+++ blead-74022/t/comp/parser.t 2010-05-02 18:16:12.000000000 -0700
@@ -3,7 +3,7 @@
# Checks if the parser behaves correctly in edge cases
# (including weird syntax errors)
-print "1..122\n";
+print "1..123\n";
sub failed {
my ($got, $expected, $name) = @_;
@@ -353,6 +353,11 @@ eval q{
};
is($@, "", "multiline whitespace inside substitute expression");
+# bug #74022: Loop on characters in \p{OtherIDContinue}
+# This test hangs if it fails.
+eval chr 0x387;
+is(1,1, '[perl #74022] Parser looping on OtherIDContinue chars');
+
# Add new tests HERE:
# More awkward tests for #line. Keep these at the end, as they will screw
diff -Nurp blead/toke.c blead-74022/toke.c
--- blead/toke.c 2010-04-26 02:44:11.000000000 -0700
+++ blead-74022/toke.c 2010-05-02 18:08:59.000000000 -0700
@@ -11654,11 +11654,9 @@ S_scan_word(pTHX_ register char *s, char
*d++ = *s++;
*d++ = *s++;
}
- else if (UTF && UTF8_IS_START(*s) && isALNUM_utf8((U8*)s)) {
+ else if (UTF && UTF8_IS_START(*s) && is_utf8_idcont((U8*)s)) {
char *t = s + UTF8SKIP(s);
size_t len;
- while (UTF8_IS_CONTINUED(*t) && is_utf8_mark((U8*)t))
- t += UTF8SKIP(t);
len = t - s;
if (d + len > e)
Perl_croak(aTHX_ ident_too_long);
diff -Nurp blead/utf8.c blead-74022/utf8.c
--- blead/utf8.c 2010-04-15 01:33:10.000000000 -0700
+++ blead-74022/utf8.c 2010-05-02 18:07:13.000000000 -0700
@@ -1327,7 +1327,7 @@ Perl_is_utf8_idfirst(pTHX_ const U8 *p)
if (*p == '_')
return TRUE;
/* is_utf8_idstart would be more logical. */
- return is_utf8_common(p, &PL_utf8_idstart, "IdStart");
+ return is_utf8_common(p, &PL_utf8_idstart, "XIdStart");
}
bool
@@ -1339,7 +1339,7 @@ Perl_is_utf8_idcont(pTHX_ const U8 *p)
if (*p == '_')
return TRUE;
- return is_utf8_common(p, &PL_utf8_idcont, "IdContinue");
+ return is_utf8_common(p, &PL_utf8_idcont, "XIdContinue");
}
bool |
From @cpansprout
On Sun May 02 17:46:08 2010, sprout wrote:
I’ve applied this test patch (but not the fix for the original bug |
From @khwilliamsonFather Chrysostomos via RT wrote:
I had thought some about this and concluded that the only way to |
From @cpansprout
On Fri Sep 24 13:01:54 2010, public@khwilliamson.com wrote:
I have thought about it a bit now. The problem with allowing Unicode
To make sure we don’t change q·foo· unknowingly, I’ve changed the test
(As a side note, ECMAScript allows Unicode characters in identifiers, |
@cpansprout - Status changed from 'open' to 'resolved' |
From [email protected]
As someone who has within the last 24 hours spent a bit of time
This command:
% echo "ácütê" | perl -CS -d -S leo
Yields this kind of garbage:
Loading DB routines from perl5db.pl version 1.33
Enter h or `h h' for help, or `man perldebug' for more help.
main::(/Users/tomchristiansen/scripts/leo:38):
I've used -CS on the command line, and I've even used it
use 5.010_000;
use utf8;
I can't think of anything else to do. Oh wait. Yes, I can!
% echo "ácütê" | perl -CS -d -S leo
Enter h or `h h' for help, or `man perldebug' for more help.
main::(/Users/tomchristiansen/scripts/leo:38):
See, it's still garbage! What am I supposed to do? And watch this:
DB<3> b main::uÊ�opÉ�pá´�ƨdn
That was entered as
b main::<TAB>
and it completed to that CRAP. Heck, even when I type
b main::uʍopəpᴉƨdn
it ignores me and displays
b main::uÊ�opÉ�pá´�ƨdn
and then again bitches about
Subroutine main::u not found.
To add injury to insult, that's illegal UTF-8 up there in its output!
It's just totally bollocksed, is what it is. :(
--tom |
From [email protected]
Let me be more clear. Perl works perfectly well. So does
--tom |
From @cpansprout
On Sun Nov 14 14:32:33 2010, tom christiansen wrote:
Could you send your previous message to perlbug@perl.org without the |
Migrated from rt.perl.org#74022 (status was 'resolved')
Searchable as RT74022$