From 512c0f5a351dd399dbb069e222741582044f88cc Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Fri, 11 Sep 2015 10:08:50 -0600
Subject: [PATCH] PATCH: [perl #125990] panic: reg_node overrun
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This is a result of a design flaw that I introduced in earlier releases
when attempting to fix earlier design flaws in dealing with the outlier
character ß, LATIN SMALL LETTER SHARP S.  The uppercase of this letter
is SS, so that when comparing case-insensitively, it should match 'ss',
and hence, in Unicode terminology, it folds to 'ss'.  This character is
the only one representable without using UTF-8 whose fold is longer than
1 byte, and so has to have special treatment.  Similarly, the sequence
'ss' can match caselessly the single byte ß, and this is the only such
sequence that can match something shorter than it, unless UTF-8 is
involved.  The matter is complicated by the fact that under /di rules,
the ß and 'ss' don't match each other, unless the target string is in
UTF-8.  The solution I used earlier (and continue to use) was to create
a special regnode EXACTFU_SS under /ui rules, in which any ß is folded
to 'ss'.  But under /di rules, a regular EXACTF regnode is used, and any
ß is retained as-is.

The problem reported here arises when something during the sizing pass
tells perl to use /ui rules rather than the /di rules that were in
effect at the beginning.  Recall that perl uses /d rules, for backward
compatibility, unless something overrides them.  This can be a 'use'
declaration, an explicit character set pattern modifier, or something in
the pattern.  This bug happens only with the final case.  There are
several Unicode-defined constructs that can occur in patterns; if one is
found, the perl interpreter infers that Unicode is desired, and switches
from /d to /u for the whole pattern.  Two such constructs are a Unicode
property, \p{}, and a Unicode named character, \N{}.  The
problem-reproducing code for this ticket uses the latter.

The problem was that the switch from /di to /ui was deferred until AFTER
the sizing pass.  (A flag was set when one of these constructs was
encountered to tell the parser to later do the switch.)

During the second pass, the code realizes it is under /ui, so creates an
EXACTFU_SS node and folds the ß into 'ss'.  But the first pass thought
it was under /di, so it sized for just the ß, i.e., for 1 byte, so we
exceed the allocated space and do a wild write.  This may not cause a
problem if the malloc'd space had rounded-up and there were only a few
of these ß characters.

One solution I considered was just keeping a global count of the ß
characters in EXACTF nodes.  One could just add these to the space
reserved if /ui rules ended up being used.  The problem with this is that
nodes that are near their maximum size without the extra space could
exceed it with, and thus have to be split into 2 nodes, and the extra
node would have an unplanned-for header, taking up more unaccounted-for
space.  So that doesn't work.  One could also just reserve two bytes for
every ß in an EXACTF node, thus wasting space unless /ui ends up being
used.  But the bigger problem is that the code that splits nodes would
have to be made more complicated.  It has to find a suitable splitting
spot, by searching through the text of the node, and now it would have
to deal with some of that space not being set.

Instead, I opted to change the code so that when it finds one of these
Unicode-defined constructs, it switches to /u immediately during the
sizing pass.  That means that the parse afterwards knows that it is /u
and allocates the correct space.  (We now have to remain in /u for the
remainder of the pass, so some code had to change that reverted this.)
This fixes the test case in the ticket.  But there remains a problem if
the sizing has happened earlier in the parse before the construct that
changes from /d to /u is encountered.  Like:

    qr/.....ß....\N{}/di

The incorrect sizing has already happened by the time the \N{} is
encountered.  One could solve this by restarting the parse whenever the
/d goes to /u (under /i, as this issue isn't a problem except when
folding ß).  That slows things down.  Instead, I opted to set a global
flag whenever a ß is found in an EXACTF node.  If that flag isn't set at
the time of the /d to /u switch, there's no need to restart the parse.

A 'use utf8' or 'use 5.012' or higher selects /u over /d, so the problem
did not happen with them, nor if the pattern has to be converted to
UTF-8, which restarts the sizing pass, and it only happens with the
sharp s character.  And probably unless there a several ß characters,
the rounding-up of malloc space, would cause this to not be an issue.
These explain why this hasn't been reported from the field.
---
 regcomp.c           | 103 +++++++++++++++++++++++++++++++++-----------
 t/re/pat_advanced.t |  22 ++++++++++
 2 files changed, 101 insertions(+), 24 deletions(-)

diff --git a/regcomp.c b/regcomp.c
index ef33ddb04caf..47b451c6f1ae 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -214,6 +214,7 @@ struct RExC_state_t {
 #define RExC_mysv2	(pRExC_state->mysv2)
 
 #endif
+    bool        seen_unfolded_sharp_s;
 };
 
 #define RExC_flags	(pRExC_state->flags)
@@ -226,6 +227,17 @@ struct RExC_state_t {
 #define RExC_end	(pRExC_state->end)
 #define RExC_parse	(pRExC_state->parse)
 #define RExC_whilem_seen	(pRExC_state->whilem_seen)
+
+/* Set during the sizing pass when there is a LATIN SMALL LETTER SHARP S in any
+ * EXACTF node, hence was parsed under /di rules.  If later in the parse,
+ * something forces the pattern into using /ui rules, the sharp s should be
+ * folded into the sequence 'ss', which takes up more space than previously
+ * calculated.  This means that the sizing pass needs to be restarted.  (The
+ * node also becomes an EXACTFU_SS.)  For all other characters, an EXACTF node
+ * that gets converted to /ui (and EXACTFU) occupies the same amount of space,
+ * so there is no need to resize [perl #125990]. */
+#define RExC_seen_unfolded_sharp_s (pRExC_state->seen_unfolded_sharp_s)
+
 #ifdef RE_TRACK_PATTERN_OFFSETS
 #define RExC_offsets	(pRExC_state->rxi->u.offsets) /* I am not like the
                                                          others */
@@ -328,6 +340,23 @@ struct RExC_state_t {
                                      }                                     \
                              } STMT_END
 
+/* Change from /d into /u rules, and restart the parse if we've already seen
+ * something whose size would increase as a result, by setting *flagp and
+ * returning 'restart_retval'.  RExC_uni_semantics is a flag that indicates
+ * we've change to /u during the parse.  */
+#define REQUIRE_UNI_RULES(flagp, restart_retval)                            \
+    STMT_START {                                                            \
+            if (DEPENDS_SEMANTICS) {                                        \
+                assert(PASS1);                                              \
+                set_regex_charset(&RExC_flags, REGEX_UNICODE_CHARSET);      \
+                RExC_uni_semantics = 1;                                     \
+                if (RExC_seen_unfolded_sharp_s) {                           \
+                    *flagp |= RESTART_PASS1;                                \
+                    return restart_retval;                                  \
+                }                                                           \
+            }                                                               \
+    } STMT_END
+
 /* This converts the named class defined in regcomp.h to its equivalent class
  * number defined in handy.h. */
 #define namedclass_to_classnum(class)  ((int) ((class) / 2))
@@ -6570,7 +6599,9 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
 
     /* ignore the utf8ness if the pattern is 0 length */
     RExC_utf8 = RExC_orig_utf8 = (plen == 0 || IN_BYTES) ? 0 : SvUTF8(pat);
+
     RExC_uni_semantics = 0;
+    RExC_seen_unfolded_sharp_s = 0;
     RExC_contains_locale = 0;
     RExC_contains_i = 0;
     RExC_strict = cBOOL(pm_flags & RXf_PMf_STRICT);
@@ -6591,8 +6622,8 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
         });
 
   redo_first_pass:
-    /* we jump here if we upgrade the pattern to utf8 and have to
-     * recompile */
+    /* we jump here if we have to recompile, e.g., from upgrading the pattern
+     * to utf8 */
 
     if ((pm_flags & PMf_USE_RE_EVAL)
 		/* this second condition covers the non-regex literal case,
@@ -6626,7 +6657,9 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
     if (rx_flags & PMf_FOLD) {
         RExC_contains_i = 1;
     }
-    if (RExC_utf8 && initial_charset == REGEX_DEPENDS_CHARSET) {
+    if (   initial_charset == REGEX_DEPENDS_CHARSET
+        && (RExC_utf8 ||RExC_uni_semantics))
+    {
 
 	/* Set to use unicode semantics if the pattern is in utf8 and has the
 	 * 'depends' charset specified, as it means unicode when utf8  */
@@ -6717,6 +6750,11 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
                 S_pat_upgrade_to_utf8(aTHX_ pRExC_state, &exp, &plen,
                                     pRExC_state->num_code_blocks);
             }
+            else {
+                DEBUG_PARSE_r(PerlIO_printf(Perl_debug_log,
+                "Need to redo pass 1\n"));
+            }
+
             goto redo_first_pass;
         }
         Perl_croak(aTHX_ "panic: reg returned NULL to re_op_compile for sizing pass, flags=%#"UVxf"", (UV) flags);
@@ -10690,8 +10728,12 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
 
     /* Check for proper termination. */
     if (paren) {
-        /* restore original flags, but keep (?p) */
+        /* restore original flags, but keep (?p) and, if we've changed from /d
+         * rules to /u, keep the /u */
 	RExC_flags = oregflags | (RExC_flags & RXf_PMf_KEEPCOPY);
+        if (DEPENDS_SEMANTICS && RExC_uni_semantics) {
+            set_regex_charset(&RExC_flags, REGEX_UNICODE_CHARSET);
+        }
 	if (RExC_parse >= RExC_end || *nextchar(pRExC_state) != ')') {
 	    RExC_parse = oregcomp_parse;
 	    vFAIL("Unmatched (");
@@ -11111,15 +11153,17 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state,
   * sequence. *node_p * will be set to a generated node returned by this
   * function calling S_reg().
   *
-  * The final possibility, which happens only when the fourth one would
-  * otherwise be in effect, is that one of those code points requires the
+  * The final possibility, which happens is that it is premature to be calling
+  * this function; that pass1 needs to be restarted.  This can happen when this
+  * changes from /d to /u rules, or when the pattern needs to be upgraded to
+  * UTF-8.  The latter occurs only when the fourth possibility would otherwise
+  * be in effect, and is because one of those code points requires the
   * pattern to be recompiled as UTF-8.  The function returns FALSE, and sets
-  * the RESTART_PASS1 and NEED_UTF8 flags in *flagp.  When this happens, the
-  * caller needs to desist from continuing parsing, and return this information
-  * to its caller.  This is not set for when there is only one code point, as
-  * this can be called as part of an ANYOF node, and they can store
-  * above-Latin1 code points without the pattern having to be in UTF-8.
-  * XXX
+  * the RESTART_PASS1 and NEED_UTF8 flags in *flagp, as appropriate.  When this
+  * happens, the caller needs to desist from continuing parsing, and return
+  * this information to its caller.  This is not set for when there is only one
+  * code point, as this can be called as part of an ANYOF node, and they can
+  * store above-Latin1 code points without the pattern having to be in UTF-8.
   *
   * For non-single-quoted regexes, the tokenizer has resolved character and
   * sequence names inside \N{...} into their Unicode values, normalizing the
@@ -11208,7 +11252,8 @@ S_grok_bslash_N(pTHX_ RExC_state_t *pRExC_state,
 	vFAIL("\\N{NAME} must be resolved by the lexer");
     }
 
-    RExC_uni_semantics = 1; /* Unicode named chars imply Unicode semantics */
+    REQUIRE_UNI_RULES(flagp, FALSE); /* Unicode named chars imply Unicode
+                                        semantics */
 
     if (endbrace == RExC_parse) {   /* empty: \N{} */
         if (cp_count) {
@@ -11933,7 +11978,7 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                         NOT_REACHED; /*NOTREACHED*/
                 }
                 RExC_parse = endbrace;
-                RExC_uni_semantics = 1;
+                REQUIRE_UNI_RULES(flagp, NULL);
 
                 if (PASS2 && op >= BOUNDA) {  /* /aa is same as /a */
                     OP(ret) = BOUNDU;
@@ -12045,6 +12090,8 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                                (bool) RExC_strict,
                                TRUE, /* Allow an optimized regnode result */
                                NULL);
+                if (*flagp & RESTART_PASS1)
+                    return NULL;
                 /* regclass() can only return RESTART_PASS1 and NEED_UTF8 if
                  * multi-char folds are allowed.  */
                 if (!ret)
@@ -12410,6 +12457,8 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                         ) {
                             if (*flagp & NEED_UTF8)
                                 FAIL("panic: grok_bslash_N set NEED_UTF8");
+                            if (*flagp & RESTART_PASS1)
+                                return NULL;
 
                             /* Here, it wasn't a single code point.  Go close
                              * up this EXACTish node.  The switch() prior to
@@ -12711,15 +12760,18 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
                         /* See if the character's fold differs between /d and
                          * /u.  This includes the multi-char fold SHARP S to
                          * 'ss' */
-                        if (maybe_exactfu
+                        if (UNLIKELY(ender == LATIN_SMALL_LETTER_SHARP_S)) {
+                            RExC_seen_unfolded_sharp_s = 1;
+                            maybe_exactfu = FALSE;
+                        }
+                        else if (maybe_exactfu
                             && (PL_fold[ender] != PL_fold_latin1[ender]
 #if    UNICODE_MAJOR_VERSION > 3 /* no multifolds in early Unicode */   \
    || (UNICODE_MAJOR_VERSION == 3 && (   UNICODE_DOT_VERSION > 0)       \
                                       || UNICODE_DOT_DOT_VERSION > 0)
-                                || ender == LATIN_SMALL_LETTER_SHARP_S
-                                || (len > 0
-                                   && isALPHA_FOLD_EQ(ender, 's')
-                                   && isALPHA_FOLD_EQ(*(s-1), 's'))
+                                || (   len > 0
+                                    && isALPHA_FOLD_EQ(ender, 's')
+                                    && isALPHA_FOLD_EQ(*(s-1), 's'))
 #endif
                         )) {
                             maybe_exactfu = FALSE;
@@ -13369,9 +13421,10 @@ S_handle_regex_sets(pTHX_ RExC_state_t *pRExC_state, SV** return_invlist,
         set_regex_charset(&RExC_flags, REGEX_UNICODE_CHARSET);
     }
 
-    RExC_uni_semantics = 1;     /* The use of this operator implies /u.  This
-                                   is required so that the compile time values
-                                   are valid in all runtime cases */
+    REQUIRE_UNI_RULES(flagp, NULL);   /* The use of this operator implies /u.
+                                         This is required so that the compile
+                                         time values are valid in all runtime
+                                         cases */
 
     /* This will return only an ANYOF regnode, or (unlikely) something smaller
      * (such as EXACT).  Thus we can skip most everything if just sizing.  We
@@ -14482,6 +14535,8 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth,
 
                         if (*flagp & NEED_UTF8)
                             FAIL("panic: grok_bslash_N set NEED_UTF8");
+                        if (*flagp & RESTART_PASS1)
+                            return NULL;
 
                         if (cp_count < 0) {
                             vFAIL("\\N in a character class must be a named character: \\N{...}");
@@ -14702,7 +14757,7 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth,
                                                 named */
 
 		/* \p means they want Unicode semantics */
-		RExC_uni_semantics = 1;
+		REQUIRE_UNI_RULES(flagp, NULL);
 		}
 		break;
 	    case 'n':	value = '\n';			break;
@@ -15088,7 +15143,7 @@ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth,
 	/* non-Latin1 code point implies unicode semantics.  Must be set in
 	 * pass1 so is there for the whole of pass 2 */
 	if (value > 255) {
-	    RExC_uni_semantics = 1;
+            REQUIRE_UNI_RULES(flagp, NULL);
 	}
 
         /* Ready to process either the single value, or the completed range.
diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t
index 33647f321c1c..e221ececc051 100644
--- a/t/re/pat_advanced.t
+++ b/t/re/pat_advanced.t
@@ -2411,6 +2411,27 @@ EOF
                             "Overlapping ranges in user-defined properties");
     }
 
+    { # [perl #125990], the final 2 tests below each caused a panic.
+        # The \0's are not necessary; it could be a printable character
+        # instead, but were in the ticket, so using them.
+        my $sharp_s = chr utf8::unicode_to_native(0xdf);
+        my $string        = ("\0" x 8)
+                          . ($sharp_s x 3)
+                          . ("\0" x 42)
+                          .  "ý";
+        my $folded_string = ("\0" x 8)
+                          . ("ss" x 3)
+                          . ("\0" x 42)
+                          .  "ý";
+        utf8::downgrade($string);
+        utf8::downgrade($folded_string);
+
+        like($string, qr/$string/i, "LATIN SMALL SHARP S matches itself under /id");
+        unlike($folded_string, qr/$string/i, "LATIN SMALL SHARP S doesn't match 'ss' under /di");
+        like($folded_string, qr/\N{}$string/i, "\\N{} earlier than LATIN SMALL SHARP S transforms /di into /ui, matches 'ss'");
+        like($folded_string, qr/$string\N{}/i, "\\N{} after LATIN SMALL SHARP S transforms /di into /ui, matches 'ss'");
+    }
+
     { # Regexp:Grammars was broken:
   # http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2013-06/msg01290.html
         fresh_perl_like('use warnings; "abc" =~ qr{(?&foo){0}abc(?<foo>)}',
@@ -2428,6 +2449,7 @@ EOF
 		      "buffer overflow in TRIE_STORE_REVCHAR");
     }
 
+
     # !!! NOTE that tests that aren't at all likely to crash perl should go
     # a ways above, above these last ones.