avoid inconsistency between \d and [:digit:] when using /a

Since a608946 (Additional PCRE2_EXTRA_ASCII_xxx code, 2023-02-01),
PCRE2_EXTRA_ASCII_BSD can be used to restrict \d to ASCII, which causes
the following inconsistent behaviour in UCP mode:

    PCRE2 version 10.43-DEV 2023-01-15
      re> /\d/utf,ucp,ascii_bsd
    data> ٣
    No match
    data>
      re> /[[:digit:]]/utf,ucp,ascii_bsd
    data> ٣
     0: \x{663}

It has been suggested[1] that making [:digit:] match \p{Nd} when Unicode
is enabled might have been unintentional and a bug, since [:digit:] should
be able to be POSIX compatible. Add a new flag, PCRE2_EXTRA_ASCII_DIGIT,
so that its definition does not have to change in UCP mode.

[1] https://lore.kernel.org/git/CANgJU+U+xXsh9psd0z5Xjr+Se5QgdKkjQ7LUQ-PdUULSN3n4+g@mail.gmail.com/
PCRE2Project · Apr 7, 2023 · 4503b22
1 parent 512be06
commit 4503b22

Showing 19 changed files with 105 additions and 35 deletions.
diff --git a/ChangeLog b/ChangeLog
@@ -55,23 +55,25 @@ change needed for 9(a) above; (b) fix bugs in ucptest,
 
 12. Integer overflow testing is now centralized in a new function.
 
-13. Made PCRE2_UCP the default in UTF mode in pcre2grep, and added new options 
+13. Made PCRE2_UCP the default in UTF mode in pcre2grep, and added new options
 --case-restrict and --no-ucp.
 
-14. In the debugging printint module (which is normally only linked into 
-pcre2test), avoid the use of a variable called "not" because that's deprecated 
-in C and forbidden in C++. Also rewrite some code to avoid a goto into a block 
+14. In the debugging printint module (which is normally only linked into
+pcre2test), avoid the use of a variable called "not" because that's deprecated
+in C and forbidden in C++. Also rewrite some code to avoid a goto into a block
 that bypassed its initialization (though it didn't actually matter).
 
-15. More minor code adjustments to avoid using reserved C++ words as variable 
-names ("new" and "typename") and another jump that bypassed an (irrelevant) 
+15. More minor code adjustments to avoid using reserved C++ words as variable
+names ("new" and "typename") and another jump that bypassed an (irrelevant)
 initialization.
 
-16. Merged a pull request that removed pcre2_ucptables.c from the list of files 
-to compile in NON-AUTOTOOLS-BUILD because it is #included in pcre2_tables.c. 
-Also adjusted the BUILD.bazel and build.zig files, which had the same issue. At 
+16. Merged a pull request that removed pcre2_ucptables.c from the list of files
+to compile in NON-AUTOTOOLS-BUILD because it is #included in pcre2_tables.c.
+Also adjusted the BUILD.bazel and build.zig files, which had the same issue. At
 the same time, fixed a typo in the Bazel file.
 
+17. Add PCRE2_EXTRA_ASCII_DIGIT to allow [:digit:] to be kept in sync with \d
+even in UCP mode.
 
 Version 10.42 11-December-2022
 ------------------------------

diff --git a/doc/html/pcre2_set_compile_extra_options.html b/doc/html/pcre2_set_compile_extra_options.html
@@ -35,10 +35,11 @@ <h1>pcre2_set_compile_extra_options man page</h1>
   PCRE2_EXTRA_ALT_BSUX                 Extended alternate \u, \U, and \x handling
   PCRE2_EXTRA_ASCII_BSD                \d remains ASCII in UCP mode
   PCRE2_EXTRA_ASCII_BSS                \s remains ASCII in UCP mode
-  PCRE2_EXTRA_ASCII_BSW                \w remains ASFII in UCP mode
-  PCRE2_EXTRA_ASCII_POSIX              POSIX classes remain ASCII in UCP mode 
+  PCRE2_EXTRA_ASCII_BSW                \w remains ASCII in UCP mode
+  PCRE2_EXTRA_ASCII_DIGIT              [:digit:] POSIX class remains ASCII in UCP mode
+  PCRE2_EXTRA_ASCII_POSIX              POSIX classes remain ASCII in UCP mode
   PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL    Treat all invalid escapes as a literal following character
-  PCRE2_EXTRA_CASELESS_RESTRICT        Disable mixed ASCII/non-ASCII  case folding
+  PCRE2_EXTRA_CASELESS_RESTRICT        Disable mixed ASCII/non-ASCII case folding
   PCRE2_EXTRA_ESCAPED_CR_IS_LF         Interpret \r as \n
   PCRE2_EXTRA_MATCH_LINE               Pattern matches whole lines
   PCRE2_EXTRA_MATCH_WORD               Pattern matches "words"

diff --git a/doc/html/pcre2api.html b/doc/html/pcre2api.html
@@ -1540,7 +1540,7 @@ <h1>pcre2api man page</h1>
 one other case, and for all characters whose code points are greater than
 U+007F. Note that there are two ASCII characters, K and S, that, in addition to
 their lower case ASCII equivalents, are case-equivalent with U+212A (Kelvin
-sign) and U+017F (long S) respectively. If you do not want this case 
+sign) and U+017F (long S) respectively. If you do not want this case
 equivalence, you can suppress it by setting PCRE2_EXTRA_CASELESS_RESTRICT.
 </P>
 <P>
@@ -1887,7 +1887,7 @@ <h1>pcre2api man page</h1>
 This option has two effects. Firstly, it change the way PCRE2 processes \B,
 \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By
 default, only ASCII characters are recognized, but if PCRE2_UCP is set, Unicode
-properties are used to classify characters. There are some PCRE2_EXTRA 
+properties are used to classify characters. There are some PCRE2_EXTRA
 options (see below) that add finer control to this behaviour. More details are
 given in the section on
 <a href="pcre2pattern.html#genericchartypes">generic character types</a>
@@ -1994,6 +1994,11 @@ <h1>pcre2api man page</h1>
 This option forces \w to match only ASCII word characters, even when PCRE2_UCP
 is set. It can be changed within a pattern by means of the (?aW) option
 setting.
+<pre>
+  PCRE2_EXTRA_ASCII_DIGIT
+</pre>
+This option forces the POSIX character class [:digit:] to match only ASCII
+digits, even when PCRE2_UCP is set.
 <pre>
   PCRE2_EXTRA_ASCII_POSIX
 </pre>
@@ -2029,8 +2034,8 @@ <h1>pcre2api man page</h1>
 case-equivalent character sets that contain both ASCII and non-ASCII
 characters. The ASCII letter S is case-equivalent to U+017f (long S) and the
 ASCII letter K is case-equivalent to U+212a (Kelvin sign). This option disables
-recognition of case-equivalences that cross the ASCII/non-ASCII boundary. In a 
-caseless match, both characters must either be ASCII or non-ASCII. The option 
+recognition of case-equivalences that cross the ASCII/non-ASCII boundary. In a
+caseless match, both characters must either be ASCII or non-ASCII. The option
 can be changed with a pattern by the (?r) option setting.
 <pre>
   PCRE2_EXTRA_ESCAPED_CR_IS_LF

diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html
@@ -1526,7 +1526,7 @@ <h1>pcre2pattern man page</h1>
   [:alpha:]  becomes  \p{L}
   [:blank:]  becomes  \h
   [:cntrl:]  becomes  \p{Cc}
-  [:digit:]  becomes  \p{Nd}
+  [:digit:]  becomes  \p{Nd}  unless PCRE2_EXTRA_ASCII_DIGIT is set
   [:lower:]  becomes  \p{Ll}
   [:space:]  becomes  \p{Xps}
   [:upper:]  becomes  \p{Lu}

diff --git a/doc/html/pcre2test.html b/doc/html/pcre2test.html
@@ -631,6 +631,7 @@ <h1>pcre2test man page</h1>
       ascii_bsd                 set PCRE2_EXTRA_ASCII_BSD
       ascii_bss                 set PCRE2_EXTRA_ASCII_BSS
       ascii_bsw                 set PCRE2_EXTRA_ASCII_BSW
+      ascii_digit               set PCRE2_EXTRA_ASCII_DIGIT
       ascii_posix               set PCRE2_EXTRA_ASCII_POSIX
       auto_callout              set PCRE2_AUTO_CALLOUT
       bad_escape_is_literal     set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL

diff --git a/doc/pcre2.txt b/doc/pcre2.txt
@@ -1953,6 +1953,11 @@ COMPILING A PATTERN
        PCRE2_UCP  is  set.  It can be changed within a pattern by means of the
        (?aW) option setting.
 
+         PCRE2_EXTRA_ASCII_DIGIT
+
+       This option forces the POSIX character class [:digit:]  to  match  only
+       ASCII digits, even when PCRE2_UCP is set.
+
          PCRE2_EXTRA_ASCII_POSIX
 
        This option forces the POSIX character  classes  to  match  only  ASCII
@@ -7688,7 +7693,7 @@ POSIX CHARACTER CLASSES
          [:alpha:]  becomes  \p{L}
          [:blank:]  becomes  \h
          [:cntrl:]  becomes  \p{Cc}
-         [:digit:]  becomes  \p{Nd}
+         [:digit:]  becomes  \p{Nd}  unless PCRE2_EXTRA_ASCII_DIGIT is set
          [:lower:]  becomes  \p{Ll}
          [:space:]  becomes  \p{Xps}
          [:upper:]  becomes  \p{Lu}

diff --git a/doc/pcre2_set_compile_extra_options.3 b/doc/pcre2_set_compile_extra_options.3
@@ -27,15 +27,18 @@ options are:
                                          \ex handling
   PCRE2_EXTRA_ASCII_BSD                \ed remains ASCII in UCP mode
   PCRE2_EXTRA_ASCII_BSS                \es remains ASCII in UCP mode
-  PCRE2_EXTRA_ASCII_BSW                \ew remains ASFII in UCP mode
+  PCRE2_EXTRA_ASCII_BSW                \ew remains ASCII in UCP mode
+.\" JOIN
+  PCRE2_EXTRA_ASCII_DIGIT              [:digit:] POSIX class remains ASCII
+                                         in UCP mode
 .\" JOIN
   PCRE2_EXTRA_ASCII_POSIX              POSIX classes remain ASCII in
-                                         UCP mode 
+                                         UCP mode
 .\" JOIN
   PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL    Treat all invalid escapes as
                                          a literal following character
 .\" JOIN
-  PCRE2_EXTRA_CASELESS_RESTRICT        Disable mixed ASCII/non-ASCII 
+  PCRE2_EXTRA_CASELESS_RESTRICT        Disable mixed ASCII/non-ASCII
                                          case folding
   PCRE2_EXTRA_ESCAPED_CR_IS_LF         Interpret \er as \en
   PCRE2_EXTRA_MATCH_LINE               Pattern matches whole lines

diff --git a/doc/pcre2api.3 b/doc/pcre2api.3
@@ -1482,7 +1482,7 @@ PCRE2_UCP is set, Unicode properties are used for all characters with more than
 one other case, and for all characters whose code points are greater than
 U+007F. Note that there are two ASCII characters, K and S, that, in addition to
 their lower case ASCII equivalents, are case-equivalent with U+212A (Kelvin
-sign) and U+017F (long S) respectively. If you do not want this case 
+sign) and U+017F (long S) respectively. If you do not want this case
 equivalence, you can suppress it by setting PCRE2_EXTRA_CASELESS_RESTRICT.
 .P
 For lower valued characters with only one other case, a lookup table is used
@@ -1838,7 +1838,7 @@ are not representable in UTF-16.
 This option has two effects. Firstly, it change the way PCRE2 processes \eB,
 \eb, \eD, \ed, \eS, \es, \eW, \ew, and some of the POSIX character classes. By
 default, only ASCII characters are recognized, but if PCRE2_UCP is set, Unicode
-properties are used to classify characters. There are some PCRE2_EXTRA 
+properties are used to classify characters. There are some PCRE2_EXTRA
 options (see below) that add finer control to this behaviour. More details are
 given in the section on
 .\" HTML <a href="pcre2pattern.html#genericchartypes">
@@ -1953,6 +1953,11 @@ option setting.
 This option forces \ew to match only ASCII word characters, even when PCRE2_UCP
 is set. It can be changed within a pattern by means of the (?aW) option
 setting.
+.sp
+  PCRE2_EXTRA_ASCII_DIGIT
+.sp
+This option forces the POSIX character class [:digit:] to match only ASCII
+digits, even when PCRE2_UCP is set.
 .sp
   PCRE2_EXTRA_ASCII_POSIX
 .sp
@@ -1987,8 +1992,8 @@ rules, which allow for more than two cases per character. There are two
 case-equivalent character sets that contain both ASCII and non-ASCII
 characters. The ASCII letter S is case-equivalent to U+017f (long S) and the
 ASCII letter K is case-equivalent to U+212a (Kelvin sign). This option disables
-recognition of case-equivalences that cross the ASCII/non-ASCII boundary. In a 
-caseless match, both characters must either be ASCII or non-ASCII. The option 
+recognition of case-equivalences that cross the ASCII/non-ASCII boundary. In a
+caseless match, both characters must either be ASCII or non-ASCII. The option
 can be changed with a pattern by the (?r) option setting.
 .sp
   PCRE2_EXTRA_ESCAPED_CR_IS_LF

diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3
@@ -1522,7 +1522,7 @@ classes with other sequences, as follows:
   [:alpha:]  becomes  \ep{L}
   [:blank:]  becomes  \eh
   [:cntrl:]  becomes  \ep{Cc}
-  [:digit:]  becomes  \ep{Nd}
+  [:digit:]  becomes  \ep{Nd}  unless PCRE2_EXTRA_ASCII_DIGIT is set
   [:lower:]  becomes  \ep{Ll}
   [:space:]  becomes  \ep{Xps}
   [:upper:]  becomes  \ep{Lu}

diff --git a/doc/pcre2test.1 b/doc/pcre2test.1
@@ -586,6 +586,7 @@ for a description of the effects of these options.
       ascii_bsd                 set PCRE2_EXTRA_ASCII_BSD
       ascii_bss                 set PCRE2_EXTRA_ASCII_BSS
       ascii_bsw                 set PCRE2_EXTRA_ASCII_BSW
+      ascii_digit               set PCRE2_EXTRA_ASCII_DIGIT
       ascii_posix               set PCRE2_EXTRA_ASCII_POSIX
       auto_callout              set PCRE2_AUTO_CALLOUT
       bad_escape_is_literal     set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL

diff --git a/doc/pcre2test.txt b/doc/pcre2test.txt
@@ -566,6 +566,7 @@ PATTERN MODIFIERS
              ascii_bsd                 set PCRE2_EXTRA_ASCII_BSD
              ascii_bss                 set PCRE2_EXTRA_ASCII_BSS
              ascii_bsw                 set PCRE2_EXTRA_ASCII_BSW
+             ascii_digit               set PCRE2_EXTRA_ASCII_DIGIT
              ascii_posix               set PCRE2_EXTRA_ASCII_POSIX
              auto_callout              set PCRE2_AUTO_CALLOUT
              bad_escape_is_literal     set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL

diff --git a/src/pcre2.h.generic b/src/pcre2.h.generic
@@ -158,6 +158,7 @@ D   is inspected during pcre2_dfa_match() execution
 #define PCRE2_EXTRA_ASCII_BSS                0x00000200u  /* C */
 #define PCRE2_EXTRA_ASCII_BSW                0x00000400u  /* C */
 #define PCRE2_EXTRA_ASCII_POSIX              0x00000800u  /* C */
+#define PCRE2_EXTRA_ASCII_DIGIT              0x00001000u  /* C */
 
 /* These are for pcre2_jit_compile(). */
 

diff --git a/src/pcre2.h.in b/src/pcre2.h.in
@@ -158,6 +158,7 @@ D   is inspected during pcre2_dfa_match() execution
 #define PCRE2_EXTRA_ASCII_BSS                0x00000200u  /* C */
 #define PCRE2_EXTRA_ASCII_BSW                0x00000400u  /* C */
 #define PCRE2_EXTRA_ASCII_POSIX              0x00000800u  /* C */
+#define PCRE2_EXTRA_ASCII_DIGIT              0x00001000u  /* C */
 
 /* These are for pcre2_jit_compile(). */
 

diff --git a/src/pcre2_compile.c b/src/pcre2_compile.c
@@ -786,7 +786,8 @@ are allowed. */
     PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES|PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL| \
     PCRE2_EXTRA_ESCAPED_CR_IS_LF|PCRE2_EXTRA_ALT_BSUX| \
     PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK|PCRE2_EXTRA_ASCII_BSD| \
-    PCRE2_EXTRA_ASCII_BSS|PCRE2_EXTRA_ASCII_BSW|PCRE2_EXTRA_ASCII_POSIX)
+    PCRE2_EXTRA_ASCII_BSS|PCRE2_EXTRA_ASCII_BSW|PCRE2_EXTRA_ASCII_POSIX| \
+    PCRE2_EXTRA_ASCII_DIGIT)
 
 /* Compile time error code numbers. They are given names so that they can more
 easily be tracked. When a new number is added, the tables called eint1 and
@@ -3581,7 +3582,8 @@ while (ptr < ptrend)
 
 #ifdef SUPPORT_UNICODE
         if ((options & PCRE2_UCP) != 0 &&
-            (xoptions & PCRE2_EXTRA_ASCII_POSIX) == 0)
+            (xoptions & PCRE2_EXTRA_ASCII_POSIX) == 0 &&
+            !(posix_class == 7 && (xoptions & PCRE2_EXTRA_ASCII_DIGIT) != 0))
           {
           int ptype = posix_substitutes[2*posix_class];
           int pvalue = posix_substitutes[2*posix_class + 1];
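
The guard added in the hunk above can be read as a small predicate: a POSIX class is swapped for its Unicode property only when UCP is on, ASCII_POSIX is off, and the class is not [:digit:] with ASCII_DIGIT set. The following is a minimal standalone sketch of that condition, not the library code. The `DEMO_` names and the function are hypothetical; the two EXTRA bit values are copied from the pcre2.h hunks in this commit, the class index 7 is taken from the guard itself, and the value used for `DEMO_UCP` is an assumption made only so the sketch compiles on its own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in constants for this sketch (hypothetical names).
   The two EXTRA values are copied from the pcre2.h hunks in this commit.
   DEMO_UCP is an assumed placeholder bit for the PCRE2_UCP option.
   Index 7 for [:digit:] is taken from the guard in pcre2_compile.c. */
#define DEMO_UCP                0x00020000u
#define DEMO_EXTRA_ASCII_POSIX  0x00000800u
#define DEMO_EXTRA_ASCII_DIGIT  0x00001000u
#define DEMO_POSIX_CLASS_DIGIT  7

/* Returns true when a POSIX class would be replaced by its Unicode
   property substitute, mirroring the three-part condition in the hunk. */
bool demo_uses_unicode_substitute(uint32_t options, uint32_t xoptions,
                                  int posix_class)
{
  return (options & DEMO_UCP) != 0 &&
         (xoptions & DEMO_EXTRA_ASCII_POSIX) == 0 &&
         !(posix_class == DEMO_POSIX_CLASS_DIGIT &&
           (xoptions & DEMO_EXTRA_ASCII_DIGIT) != 0);
}
```

Under these assumptions, `demo_uses_unicode_substitute(DEMO_UCP, DEMO_EXTRA_ASCII_DIGIT, 7)` is false, while the same call with any other class index stays true, so only [:digit:] is affected by the new flag.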

diff --git a/src/pcre2test.c b/src/pcre2test.c
@@ -651,6 +651,7 @@ static modstruct modlist[] = {
   { "ascii_bsd",                   MOD_CTC,  MOD_OPT, PCRE2_EXTRA_ASCII_BSD,      CO(extra_options) },
   { "ascii_bss",                   MOD_CTC,  MOD_OPT, PCRE2_EXTRA_ASCII_BSS,      CO(extra_options) },
   { "ascii_bsw",                   MOD_CTC,  MOD_OPT, PCRE2_EXTRA_ASCII_BSW,      CO(extra_options) },
+  { "ascii_digit",                 MOD_CTC,  MOD_OPT, PCRE2_EXTRA_ASCII_DIGIT,    CO(extra_options) },
   { "ascii_posix",                 MOD_CTC,  MOD_OPT, PCRE2_EXTRA_ASCII_POSIX,    CO(extra_options) },
   { "auto_callout",                MOD_PAT,  MOD_OPT, PCRE2_AUTO_CALLOUT,         PO(options) },
   { "bad_escape_is_literal",       MOD_CTC,  MOD_OPT, PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL, CO(extra_options) },
@@ -4294,13 +4295,14 @@ show_compile_extra_options(uint32_t options, const char *before,
   const char *after)
 {
 if (options == 0) fprintf(outfile, "%s <none>%s", before, after);
-else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s",
+else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
   before,
   ((options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) != 0)? " allow_surrogate_escapes" : "",
   ((options & PCRE2_EXTRA_ALT_BSUX) != 0)? " alt_bsux" : "",
   ((options & PCRE2_EXTRA_ASCII_BSD) != 0)? " ascii_bsd" : "",
   ((options & PCRE2_EXTRA_ASCII_BSS) != 0)? " ascii_bss" : "",
   ((options & PCRE2_EXTRA_ASCII_BSW) != 0)? " ascii_bsw" : "",
+  ((options & PCRE2_EXTRA_ASCII_DIGIT) != 0)? " ascii_digit" : "",
   ((options & PCRE2_EXTRA_ASCII_POSIX) != 0)? " ascii_posix" : "",
   ((options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) != 0)? " bad_escape_is_literal" : "",
   ((options & PCRE2_EXTRA_CASELESS_RESTRICT) != 0)? " caseless_restrict" : "",

diff --git a/testdata/testinput5 b/testdata/testinput5
@@ -1215,6 +1215,8 @@
 
 /[[:digit:]]/B,ucp
 
+/[[:digit:]]/B,ucp,ascii_digit
+
 /[[:graph:]]/B,ucp
 
 /[[:print:]]/B,ucp
@@ -1227,7 +1229,7 @@
 
 /[[:xdigit:]]/B,ucp
 
-# Unicode properties for \b abd \B
+# Unicode properties for \b and \B
 
 /\b...\B/utf,ucp
     abc_
@@ -2431,6 +2433,12 @@
 /[[:digit:]]+/utf,ucp
     123\x{660}456
 
+/[[:digit:]]+/utf,ucp,ascii_digit
+    123\x{660}456
+
+/[[:digit:]]+/g,utf,ucp,ascii_digit
+    123\x{660}456
+
 /[[:digit:]]+/utf,ucp,ascii_posix
     123\x{660}456
 

diff --git a/testdata/testinput7 b/testdata/testinput7
@@ -1657,7 +1657,7 @@
 /^[\p{Xwd}]+/utf
     ABCD1234\x{6ca}\x{a6c}\x{10a7}_
 
-# Unicode properties for \b abd \B 
+# Unicode properties for \b and \B
 
 /\b...\B/utf,ucp
     abc_
@@ -2435,9 +2435,15 @@
 /[[:digit:]]+/utf,ucp
     123\x{660}456
 
+/[[:digit:]]+/utf,ucp,ascii_digit
+    123\x{660}456
+
+/[[:digit:]]+/g,utf,ucp,ascii_digit
+    123\x{660}456
+
 /[[:digit:]]+/utf,ucp,ascii_posix
     123\x{660}456
-    
+
 />[[:space:]]+</utf,ucp
     >\x{a0} \x{a0}<
     >\x{a0}\x{a0}\x{a0}<

diff --git a/testdata/testoutput5 b/testdata/testoutput5
@@ -2520,6 +2520,14 @@ No match
         End
 ------------------------------------------------------------------
 
+/[[:digit:]]/B,ucp,ascii_digit
+------------------------------------------------------------------
+        Bra
+        [0-9]
+        Ket
+        End
+------------------------------------------------------------------
+
 /[[:graph:]]/B,ucp
 ------------------------------------------------------------------
         Bra
@@ -2568,7 +2576,7 @@ No match
         End
 ------------------------------------------------------------------
 
-# Unicode properties for \b abd \B
+# Unicode properties for \b and \B
 
 /\b...\B/utf,ucp
     abc_
@@ -5359,6 +5367,15 @@ No match
     123\x{660}456
  0: 123\x{660}456
 
+/[[:digit:]]+/utf,ucp,ascii_digit
+    123\x{660}456
+ 0: 123
+
+/[[:digit:]]+/g,utf,ucp,ascii_digit
+    123\x{660}456
+ 0: 123
+ 0: 456
+
 /[[:digit:]]+/utf,ucp,ascii_posix
     123\x{660}456
  0: 123
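
The expected results above for `ascii_digit` (a first match of `123`, then `456` under `/g`) follow from treating only `0`-`9` as digits, so the Arabic-Indic digit U+0660 splits the subject into two runs. A hand-rolled scanner can illustrate this; it is a sketch that does not use PCRE2 at all, and the function name is hypothetical.

```c
#include <stddef.h>

/* Finds the next run of ASCII digits ('0'..'9') in s[start..len).
   On success stores the run's byte offset and length and returns 1.
   Returns 0 when no ASCII digit remains. Bytes outside '0'..'9',
   including the UTF-8 bytes of U+0660, are never treated as digits. */
int demo_next_ascii_digit_run(const char *s, size_t len, size_t start,
                              size_t *run_off, size_t *run_len)
{
  size_t i = start;
  size_t j;

  /* Skip anything that is not an ASCII digit. */
  while (i < len && !(s[i] >= '0' && s[i] <= '9')) i++;
  if (i >= len) return 0;

  /* Extend the run for as long as ASCII digits continue. */
  j = i;
  while (j < len && s[j] >= '0' && s[j] <= '9') j++;

  *run_off = i;
  *run_len = j - i;
  return 1;
}
```

With the subject written as the 8-byte UTF-8 string `"123\xD9\xA0" "456"`, the first call from offset 0 yields the run at offset 0 with length 3, and a second call from offset 3 yields the run at offset 5 with length 3, matching the two `0:` lines in the `/g` case.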