Commit

avoid inconsistency between \d and [:digit:] when using /a
Since a608946 (Additional PCRE2_EXTRA_ASCII_xxx code, 2023-02-01),
PCRE2_EXTRA_ASCII_BSD can be used to restrict \d to ASCII, causing
the following inconsistent behaviour in UCP mode:

  PCRE2 version 10.43-DEV 2023-01-15
    re> /\d/utf,ucp,ascii_bsd
  data> ٣
  No match
  data>
    re> /[[:digit:]]/utf,ucp,ascii_bsd
  data> ٣
    0: \x{663}

It has been suggested[1] that making [:digit:] match \p{Nd} when
Unicode is enabled might have been unintentional and a bug, as
[:digit:] should be able to stay POSIX compatible, so add a new flag
PCRE2_EXTRA_ASCII_DIGIT to avoid changing its definition in UCP mode.

[1] https://lore.kernel.org/git/CANgJU+U+xXsh9psd0z5Xjr+Se5QgdKkjQ7LUQ-PdUULSN3n4+g@mail.gmail.com/
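
To make the difference concrete, here is a minimal, self-contained C sketch of the two definitions of "digit". It does not call PCRE2 at all: the helper names (is_ascii_digit, is_nd_sketch, digit_run) are hypothetical, and is_nd_sketch is a deliberately tiny stand-in for \p{Nd} that only knows ASCII digits and U+0660..U+0669.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASCII-only definition: roughly what [:digit:] is meant to be when
   PCRE2_EXTRA_ASCII_DIGIT is set. */
int is_ascii_digit(uint32_t cp)
{
  return cp >= 0x30 && cp <= 0x39;
}

/* Illustrative stand-in for \p{Nd}: ASCII digits plus ARABIC-INDIC DIGIT
   ZERO..NINE (U+0660..U+0669). The real property covers many more
   scripts; this reduced table is an assumption made for the sketch. */
int is_nd_sketch(uint32_t cp)
{
  return is_ascii_digit(cp) || (cp >= 0x0660 && cp <= 0x0669);
}

/* Length of the leading run of digits in s[0..len), i.e. what an anchored
   [[:digit:]]+ would consume under the chosen definition. */
size_t digit_run(const uint32_t *s, size_t len, int ascii_only)
{
  size_t n = 0;
  assert(s != NULL || len == 0);
  while (n < len && (ascii_only ? is_ascii_digit(s[n]) : is_nd_sketch(s[n])))
    n++;
  return n;
}
```

For the subject 123\x{660}456 used by the tests in this commit, digit_run consumes all seven code points under the Unicode definition and only the first three under the ASCII one, which mirrors the "123\x{660}456" versus "123" results in testoutput5.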
carenas committed Apr 7, 2023
1 parent 512be06 commit 4503b22
Showing 19 changed files with 105 additions and 35 deletions.
20 changes: 11 additions & 9 deletions ChangeLog
@@ -55,23 +55,25 @@ change needed for 9(a) above; (b) fix bugs in ucptest,

12. Integer overflow testing is now centralized in a new function.

13. Made PCRE2_UCP the default in UTF mode in pcre2grep, and added new options
13. Made PCRE2_UCP the default in UTF mode in pcre2grep, and added new options
--case-restrict and --no-ucp.

14. In the debugging printint module (which is normally only linked into
pcre2test), avoid the use of a variable called "not" because that's deprecated
in C and forbidden in C++. Also rewrite some code to avoid a goto into a block
14. In the debugging printint module (which is normally only linked into
pcre2test), avoid the use of a variable called "not" because that's deprecated
in C and forbidden in C++. Also rewrite some code to avoid a goto into a block
that bypassed its initialization (though it didn't actually matter).

15. More minor code adjustments to avoid using reserved C++ words as variable
names ("new" and "typename") and another jump that bypassed an (irrelevant)
15. More minor code adjustments to avoid using reserved C++ words as variable
names ("new" and "typename") and another jump that bypassed an (irrelevant)
initialization.

16. Merged a pull request that removed pcre2_ucptables.c from the list of files
to compile in NON-AUTOTOOLS-BUILD because it is #included in pcre2_tables.c.
Also adjusted the BUILD.bazel and build.zig files, which had the same issue. At
16. Merged a pull request that removed pcre2_ucptables.c from the list of files
to compile in NON-AUTOTOOLS-BUILD because it is #included in pcre2_tables.c.
Also adjusted the BUILD.bazel and build.zig files, which had the same issue. At
the same time, fixed a typo in the Bazel file.

17. Add PCRE2_EXTRA_ASCII_DIGIT to allow [:digit:] to be kept in sync with \d
even in UCP mode.

Version 10.42 11-December-2022
------------------------------
7 changes: 4 additions & 3 deletions doc/html/pcre2_set_compile_extra_options.html
@@ -35,10 +35,11 @@ <h1>pcre2_set_compile_extra_options man page</h1>
PCRE2_EXTRA_ALT_BSUX Extended alternate \u, \U, and \x handling
PCRE2_EXTRA_ASCII_BSD \d remains ASCII in UCP mode
PCRE2_EXTRA_ASCII_BSS \s remains ASCII in UCP mode
PCRE2_EXTRA_ASCII_BSW \w remains ASFII in UCP mode
PCRE2_EXTRA_ASCII_POSIX POSIX classes remain ASCII in UCP mode
PCRE2_EXTRA_ASCII_BSW \w remains ASCII in UCP mode
PCRE2_EXTRA_ASCII_DIGIT [:digit:] POSIX class remains ASCII in UCP mode
PCRE2_EXTRA_ASCII_POSIX POSIX classes remain ASCII in UCP mode
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as a literal following character
PCRE2_EXTRA_CASELESS_RESTRICT Disable mixed ASCII/non-ASCII case folding
PCRE2_EXTRA_CASELESS_RESTRICT Disable mixed ASCII/non-ASCII case folding
PCRE2_EXTRA_ESCAPED_CR_IS_LF Interpret \r as \n
PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines
PCRE2_EXTRA_MATCH_WORD Pattern matches "words"
13 changes: 9 additions & 4 deletions doc/html/pcre2api.html
@@ -1540,7 +1540,7 @@ <h1>pcre2api man page</h1>
one other case, and for all characters whose code points are greater than
U+007F. Note that there are two ASCII characters, K and S, that, in addition to
their lower case ASCII equivalents, are case-equivalent with U+212A (Kelvin
sign) and U+017F (long S) respectively. If you do not want this case
sign) and U+017F (long S) respectively. If you do not want this case
equivalence, you can suppress it by setting PCRE2_EXTRA_CASELESS_RESTRICT.
</P>
<P>
@@ -1887,7 +1887,7 @@ <h1>pcre2api man page</h1>
This option has two effects. Firstly, it changes the way PCRE2 processes \B,
\b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By
default, only ASCII characters are recognized, but if PCRE2_UCP is set, Unicode
properties are used to classify characters. There are some PCRE2_EXTRA
properties are used to classify characters. There are some PCRE2_EXTRA
options (see below) that add finer control to this behaviour. More details are
given in the section on
<a href="pcre2pattern.html#genericchartypes">generic character types</a>
@@ -1994,6 +1994,11 @@ <h1>pcre2api man page</h1>
This option forces \w to match only ASCII word characters, even when PCRE2_UCP
is set. It can be changed within a pattern by means of the (?aW) option
setting.
<pre>
PCRE2_EXTRA_ASCII_DIGIT
</pre>
This option forces the POSIX character class [:digit:] to match only ASCII
digits, even when PCRE2_UCP is set.
<pre>
PCRE2_EXTRA_ASCII_POSIX
</pre>
@@ -2029,8 +2034,8 @@ <h1>pcre2api man page</h1>
case-equivalent character sets that contain both ASCII and non-ASCII
characters. The ASCII letter S is case-equivalent to U+017f (long S) and the
ASCII letter K is case-equivalent to U+212a (Kelvin sign). This option disables
recognition of case-equivalences that cross the ASCII/non-ASCII boundary. In a
caseless match, both characters must either be ASCII or non-ASCII. The option
recognition of case-equivalences that cross the ASCII/non-ASCII boundary. In a
caseless match, both characters must either be ASCII or non-ASCII. The option
can be changed with a pattern by the (?r) option setting.
<pre>
PCRE2_EXTRA_ESCAPED_CR_IS_LF
2 changes: 1 addition & 1 deletion doc/html/pcre2pattern.html
@@ -1526,7 +1526,7 @@ <h1>pcre2pattern man page</h1>
[:alpha:] becomes \p{L}
[:blank:] becomes \h
[:cntrl:] becomes \p{Cc}
[:digit:] becomes \p{Nd}
[:digit:] becomes \p{Nd} unless PCRE2_EXTRA_ASCII_DIGIT is set
[:lower:] becomes \p{Ll}
[:space:] becomes \p{Xps}
[:upper:] becomes \p{Lu}
1 change: 1 addition & 0 deletions doc/html/pcre2test.html
@@ -631,6 +631,7 @@ <h1>pcre2test man page</h1>
ascii_bsd set PCRE2_EXTRA_ASCII_BSD
ascii_bss set PCRE2_EXTRA_ASCII_BSS
ascii_bsw set PCRE2_EXTRA_ASCII_BSW
ascii_digit set PCRE2_EXTRA_ASCII_DIGIT
ascii_posix set PCRE2_EXTRA_ASCII_POSIX
auto_callout set PCRE2_AUTO_CALLOUT
bad_escape_is_literal set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
7 changes: 6 additions & 1 deletion doc/pcre2.txt
@@ -1953,6 +1953,11 @@ COMPILING A PATTERN
PCRE2_UCP is set. It can be changed within a pattern by means of the
(?aW) option setting.

PCRE2_EXTRA_ASCII_DIGIT

This option forces the POSIX character class [:digit:] to match only
ASCII digits, even when PCRE2_UCP is set.

PCRE2_EXTRA_ASCII_POSIX

This option forces the POSIX character classes to match only ASCII
@@ -7688,7 +7693,7 @@ POSIX CHARACTER CLASSES
[:alpha:] becomes \p{L}
[:blank:] becomes \h
[:cntrl:] becomes \p{Cc}
[:digit:] becomes \p{Nd}
[:digit:] becomes \p{Nd} unless PCRE2_EXTRA_ASCII_DIGIT is set
[:lower:] becomes \p{Ll}
[:space:] becomes \p{Xps}
[:upper:] becomes \p{Lu}
9 changes: 6 additions & 3 deletions doc/pcre2_set_compile_extra_options.3
@@ -27,15 +27,18 @@ options are:
\ex handling
PCRE2_EXTRA_ASCII_BSD \ed remains ASCII in UCP mode
PCRE2_EXTRA_ASCII_BSS \es remains ASCII in UCP mode
PCRE2_EXTRA_ASCII_BSW \ew remains ASFII in UCP mode
PCRE2_EXTRA_ASCII_BSW \ew remains ASCII in UCP mode
.\" JOIN
PCRE2_EXTRA_ASCII_DIGIT [:digit:] POSIX class remains ASCII
in UCP mode
.\" JOIN
PCRE2_EXTRA_ASCII_POSIX POSIX classes remain ASCII in
UCP mode
UCP mode
.\" JOIN
PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL Treat all invalid escapes as
a literal following character
.\" JOIN
PCRE2_EXTRA_CASELESS_RESTRICT Disable mixed ASCII/non-ASCII
PCRE2_EXTRA_CASELESS_RESTRICT Disable mixed ASCII/non-ASCII
case folding
PCRE2_EXTRA_ESCAPED_CR_IS_LF Interpret \er as \en
PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines
13 changes: 9 additions & 4 deletions doc/pcre2api.3
@@ -1482,7 +1482,7 @@ PCRE2_UCP is set, Unicode properties are used for all characters with more than
one other case, and for all characters whose code points are greater than
U+007F. Note that there are two ASCII characters, K and S, that, in addition to
their lower case ASCII equivalents, are case-equivalent with U+212A (Kelvin
sign) and U+017F (long S) respectively. If you do not want this case
sign) and U+017F (long S) respectively. If you do not want this case
equivalence, you can suppress it by setting PCRE2_EXTRA_CASELESS_RESTRICT.
.P
For lower valued characters with only one other case, a lookup table is used
@@ -1838,7 +1838,7 @@ are not representable in UTF-16.
This option has two effects. Firstly, it changes the way PCRE2 processes \eB,
\eb, \eD, \ed, \eS, \es, \eW, \ew, and some of the POSIX character classes. By
default, only ASCII characters are recognized, but if PCRE2_UCP is set, Unicode
properties are used to classify characters. There are some PCRE2_EXTRA
properties are used to classify characters. There are some PCRE2_EXTRA
options (see below) that add finer control to this behaviour. More details are
given in the section on
.\" HTML <a href="pcre2pattern.html#genericchartypes">
@@ -1953,6 +1953,11 @@ option setting.
This option forces \ew to match only ASCII word characters, even when PCRE2_UCP
is set. It can be changed within a pattern by means of the (?aW) option
setting.
.sp
PCRE2_EXTRA_ASCII_DIGIT
.sp
This option forces the POSIX character class [:digit:] to match only ASCII
digits, even when PCRE2_UCP is set.
.sp
PCRE2_EXTRA_ASCII_POSIX
.sp
@@ -1987,8 +1992,8 @@ rules, which allow for more than two cases per character. There are two
case-equivalent character sets that contain both ASCII and non-ASCII
characters. The ASCII letter S is case-equivalent to U+017f (long S) and the
ASCII letter K is case-equivalent to U+212a (Kelvin sign). This option disables
recognition of case-equivalences that cross the ASCII/non-ASCII boundary. In a
caseless match, both characters must either be ASCII or non-ASCII. The option
recognition of case-equivalences that cross the ASCII/non-ASCII boundary. In a
caseless match, both characters must either be ASCII or non-ASCII. The option
can be changed with a pattern by the (?r) option setting.
.sp
PCRE2_EXTRA_ESCAPED_CR_IS_LF
2 changes: 1 addition & 1 deletion doc/pcre2pattern.3
@@ -1522,7 +1522,7 @@ classes with other sequences, as follows:
[:alpha:] becomes \ep{L}
[:blank:] becomes \eh
[:cntrl:] becomes \ep{Cc}
[:digit:] becomes \ep{Nd}
[:digit:] becomes \ep{Nd} unless PCRE2_EXTRA_ASCII_DIGIT is set
[:lower:] becomes \ep{Ll}
[:space:] becomes \ep{Xps}
[:upper:] becomes \ep{Lu}
1 change: 1 addition & 0 deletions doc/pcre2test.1
@@ -586,6 +586,7 @@ for a description of the effects of these options.
ascii_bsd set PCRE2_EXTRA_ASCII_BSD
ascii_bss set PCRE2_EXTRA_ASCII_BSS
ascii_bsw set PCRE2_EXTRA_ASCII_BSW
ascii_digit set PCRE2_EXTRA_ASCII_DIGIT
ascii_posix set PCRE2_EXTRA_ASCII_POSIX
auto_callout set PCRE2_AUTO_CALLOUT
bad_escape_is_literal set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
1 change: 1 addition & 0 deletions doc/pcre2test.txt
@@ -566,6 +566,7 @@ PATTERN MODIFIERS
ascii_bsd set PCRE2_EXTRA_ASCII_BSD
ascii_bss set PCRE2_EXTRA_ASCII_BSS
ascii_bsw set PCRE2_EXTRA_ASCII_BSW
ascii_digit set PCRE2_EXTRA_ASCII_DIGIT
ascii_posix set PCRE2_EXTRA_ASCII_POSIX
auto_callout set PCRE2_AUTO_CALLOUT
bad_escape_is_literal set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
1 change: 1 addition & 0 deletions src/pcre2.h.generic
@@ -158,6 +158,7 @@ D is inspected during pcre2_dfa_match() execution
#define PCRE2_EXTRA_ASCII_BSS 0x00000200u /* C */
#define PCRE2_EXTRA_ASCII_BSW 0x00000400u /* C */
#define PCRE2_EXTRA_ASCII_POSIX 0x00000800u /* C */
#define PCRE2_EXTRA_ASCII_DIGIT 0x00001000u /* C */

/* These are for pcre2_jit_compile(). */

1 change: 1 addition & 0 deletions src/pcre2.h.in
@@ -158,6 +158,7 @@ D is inspected during pcre2_dfa_match() execution
#define PCRE2_EXTRA_ASCII_BSS 0x00000200u /* C */
#define PCRE2_EXTRA_ASCII_BSW 0x00000400u /* C */
#define PCRE2_EXTRA_ASCII_POSIX 0x00000800u /* C */
#define PCRE2_EXTRA_ASCII_DIGIT 0x00001000u /* C */

/* These are for pcre2_jit_compile(). */

6 changes: 4 additions & 2 deletions src/pcre2_compile.c
@@ -786,7 +786,8 @@ are allowed. */
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES|PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL| \
PCRE2_EXTRA_ESCAPED_CR_IS_LF|PCRE2_EXTRA_ALT_BSUX| \
PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK|PCRE2_EXTRA_ASCII_BSD| \
PCRE2_EXTRA_ASCII_BSS|PCRE2_EXTRA_ASCII_BSW|PCRE2_EXTRA_ASCII_POSIX)
PCRE2_EXTRA_ASCII_BSS|PCRE2_EXTRA_ASCII_BSW|PCRE2_EXTRA_ASCII_POSIX| \
PCRE2_EXTRA_ASCII_DIGIT)

/* Compile time error code numbers. They are given names so that they can more
easily be tracked. When a new number is added, the tables called eint1 and
@@ -3581,7 +3582,8 @@ while (ptr < ptrend)

#ifdef SUPPORT_UNICODE
if ((options & PCRE2_UCP) != 0 &&
(xoptions & PCRE2_EXTRA_ASCII_POSIX) == 0)
(xoptions & PCRE2_EXTRA_ASCII_POSIX) == 0 &&
!(posix_class == 7 && (xoptions & PCRE2_EXTRA_ASCII_DIGIT) != 0))
{
int ptype = posix_substitutes[2*posix_class];
int pvalue = posix_substitutes[2*posix_class + 1];
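
Read as a standalone predicate, the condition in this hunk could be sketched as below. This is an illustration, not the library's code: the SK_ constants are local stand-ins so the sketch compiles without pcre2.h (the two extra-option values are copied from the pcre2.h hunks in this commit; the SK_UCP value is an assumption), and uses_unicode_substitute is a hypothetical name. The class index 7 for [:digit:] is taken from the hunk itself.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the option bits (see the lead-in for their origin). */
#define SK_UCP               0x00020000u  /* assumed value of PCRE2_UCP */
#define SK_EXTRA_ASCII_POSIX 0x00000800u  /* from the pcre2.h hunk */
#define SK_EXTRA_ASCII_DIGIT 0x00001000u  /* from the pcre2.h hunk */
#define SK_POSIX_CLASS_DIGIT 7            /* index tested in the hunk */

/* Returns 1 when a POSIX class would be replaced by its Unicode property
   substitute, 0 when it keeps its ASCII definition. */
int uses_unicode_substitute(uint32_t options, uint32_t xoptions,
                            int posix_class)
{
  assert(posix_class >= 0);

  /* Without UCP nothing is substituted. */
  if ((options & SK_UCP) == 0) return 0;

  /* ASCII_POSIX keeps every POSIX class ASCII-only. */
  if ((xoptions & SK_EXTRA_ASCII_POSIX) != 0) return 0;

  /* ASCII_DIGIT keeps only [:digit:] ASCII-only. */
  if (posix_class == SK_POSIX_CLASS_DIGIT &&
      (xoptions & SK_EXTRA_ASCII_DIGIT) != 0) return 0;

  return 1;
}
```

Splitting the test into early returns keeps each option's effect visible on its own line; the single compound condition in the hunk expresses the same logic.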
4 changes: 3 additions & 1 deletion src/pcre2test.c
@@ -651,6 +651,7 @@ static modstruct modlist[] = {
{ "ascii_bsd", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ASCII_BSD, CO(extra_options) },
{ "ascii_bss", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ASCII_BSS, CO(extra_options) },
{ "ascii_bsw", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ASCII_BSW, CO(extra_options) },
{ "ascii_digit", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ASCII_DIGIT, CO(extra_options) },
{ "ascii_posix", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ASCII_POSIX, CO(extra_options) },
{ "auto_callout", MOD_PAT, MOD_OPT, PCRE2_AUTO_CALLOUT, PO(options) },
{ "bad_escape_is_literal", MOD_CTC, MOD_OPT, PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL, CO(extra_options) },
@@ -4294,13 +4295,14 @@ show_compile_extra_options(uint32_t options, const char *before,
const char *after)
{
if (options == 0) fprintf(outfile, "%s <none>%s", before, after);
else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s",
else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
before,
((options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) != 0)? " allow_surrogate_escapes" : "",
((options & PCRE2_EXTRA_ALT_BSUX) != 0)? " alt_bsux" : "",
((options & PCRE2_EXTRA_ASCII_BSD) != 0)? " ascii_bsd" : "",
((options & PCRE2_EXTRA_ASCII_BSS) != 0)? " ascii_bss" : "",
((options & PCRE2_EXTRA_ASCII_BSW) != 0)? " ascii_bsw" : "",
((options & PCRE2_EXTRA_ASCII_DIGIT) != 0)? " ascii_digit" : "",
((options & PCRE2_EXTRA_ASCII_POSIX) != 0)? " ascii_posix" : "",
((options & PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL) != 0)? " bad_escape_is_literal" : "",
((options & PCRE2_EXTRA_CASELESS_RESTRICT) != 0)? " caseless_restrict" : "",
10 changes: 9 additions & 1 deletion testdata/testinput5
@@ -1215,6 +1215,8 @@

/[[:digit:]]/B,ucp

/[[:digit:]]/B,ucp,ascii_digit

/[[:graph:]]/B,ucp

/[[:print:]]/B,ucp
@@ -1227,7 +1229,7 @@

/[[:xdigit:]]/B,ucp

# Unicode properties for \b abd \B
# Unicode properties for \b and \B

/\b...\B/utf,ucp
abc_
@@ -2431,6 +2433,12 @@
/[[:digit:]]+/utf,ucp
123\x{660}456

/[[:digit:]]+/utf,ucp,ascii_digit
123\x{660}456

/[[:digit:]]+/g,utf,ucp,ascii_digit
123\x{660}456

/[[:digit:]]+/utf,ucp,ascii_posix
123\x{660}456

10 changes: 8 additions & 2 deletions testdata/testinput7
@@ -1657,7 +1657,7 @@
/^[\p{Xwd}]+/utf
ABCD1234\x{6ca}\x{a6c}\x{10a7}_

# Unicode properties for \b abd \B
# Unicode properties for \b and \B

/\b...\B/utf,ucp
abc_
@@ -2435,9 +2435,15 @@
/[[:digit:]]+/utf,ucp
123\x{660}456

/[[:digit:]]+/utf,ucp,ascii_digit
123\x{660}456

/[[:digit:]]+/g,utf,ucp,ascii_digit
123\x{660}456

/[[:digit:]]+/utf,ucp,ascii_posix
123\x{660}456

/>[[:space:]]+</utf,ucp
>\x{a0} \x{a0}<
>\x{a0}\x{a0}\x{a0}<
19 changes: 18 additions & 1 deletion testdata/testoutput5
@@ -2520,6 +2520,14 @@ No match
End
------------------------------------------------------------------

/[[:digit:]]/B,ucp,ascii_digit
------------------------------------------------------------------
Bra
[0-9]
Ket
End
------------------------------------------------------------------

/[[:graph:]]/B,ucp
------------------------------------------------------------------
Bra
@@ -2568,7 +2576,7 @@ No match
End
------------------------------------------------------------------

# Unicode properties for \b abd \B
# Unicode properties for \b and \B

/\b...\B/utf,ucp
abc_
@@ -5359,6 +5367,15 @@ No match
123\x{660}456
0: 123\x{660}456

/[[:digit:]]+/utf,ucp,ascii_digit
123\x{660}456
0: 123

/[[:digit:]]+/g,utf,ucp,ascii_digit
123\x{660}456
0: 123
0: 456

/[[:digit:]]+/utf,ucp,ascii_posix
123\x{660}456
0: 123