Determine if there are further sets to special-case in RegexCompiler / source generator #67056

stephentoub · 2022-03-23T17:33:42Z

In RegexCompiler and the RegexGenerator source generator, we try to make matching as efficient as possible, and that includes generating the most efficient matching of character classes we can. We special-case various kinds of character classes, e.g. those that contain a single range (e.g. [a-z]), those that contain two characters that are just cased versions of each other (e.g. [Aa]), those that contain just two or three characters (e.g. ['"]), those that represent known built-in sets (e.g. \w, \d, \s), those that represent a single Unicode category, etc. For everything else, we fall back to a general scheme where we emit a bitmap in which to look up the character. The fallback is generally fast, but the customized approaches are typically faster. For .NET 7, we should spend a little more time determining whether there are any additional categories of sets that would be worth special-casing.
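To make the kinds of checks described above concrete, here is a minimal sketch in C of what a single-range test, a case-pair test, a two-character test, and an ASCII bitmap fallback could look like. The function names and the two-word bitmap layout are assumptions made for illustration; they are not the code RegexCompiler or the source generator actually emits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Single range [a-z]: one unsigned subtraction and one compare.
   Characters below 'a' wrap around to a huge unsigned value and fail. */
static bool in_range_a_z(uint16_t c) {
    return (uint32_t)(c - 'a') <= (uint32_t)('z' - 'a');
}

/* Case pair [Aa]: 'A' and 'a' differ only in bit 0x20, so set it and compare. */
static bool is_A_or_a(uint16_t c) {
    return (c | 0x20) == 'a';
}

/* Two characters ['"]: two direct compares. */
static bool is_quote(uint16_t c) {
    return c == '\'' || c == '"';
}

/* Fallback for a set such as [aeio]: a 128-bit ASCII bitmap, stored here
   (hypothetically) as two 64-bit words. All four members are >= 64. */
static const uint64_t AEIO_BITMAP[2] = {
    0,
    (1ULL << ('a' - 64)) | (1ULL << ('e' - 64)) |
    (1ULL << ('i' - 64)) | (1ULL << ('o' - 64)),
};

static bool in_aeio(uint16_t c) {
    /* Range check first, then look up the bit for c. */
    return c < 128 && ((AEIO_BITMAP[c >> 6] >> (c & 63)) & 1) != 0;
}
```

The first three checks need no memory access at all, which is one plausible reason the customized paths tend to beat the bitmap lookup.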

Looking at our corpus of regexes, here are the most popular sets we don't currently special-case (the leading number represents how many times they occur):

767:   [0-9A-Fa-f]
583:   [A-Za-z]
510:   [0-9A-Za-z]
304:   [0-9a-z]
245:   [0-9A-Z_a-z]
241:   [A-Z_a-z]
219:   [\s\S]
199:   [-\w]
195:   [-0-9A-Z_a-z]
183:   [^<>]
168:   [-0-9A-Za-z]
140:   [.\w]
115:   [^\n\r]
108:   [0-9a-f]
89 :   [^[]]
84 :   [aeio]
82 :   [i\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]
82 :   [-.0-9A-Z_a-z]
81 :   [-.0-9A-Za-z]
78 :   [-.\w]
78 :   [AOao]
78 :   [a-f\d]
77 :   ['aeo]
72 :   [-0-9a-z]
66 :   [0-9A-Z]
63 :   [^>\s]
51 :   [0-9A-F]
51 :   [-0-9_a-z]
51 :   [mnrs]
49 :   [^{}]
48 :   [A-Fa-f\d]
48 :   [.0-9A-Z_a-z]
46 :   [.\d]
45 :   [a-z\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]
44 :   [--/_]
43 :   [AEae]
43 :   [\p{L}]
41 :   [_\w]
39 :   [^A-Za-z]
38 :   [0-9_a-z]
38 :   [^"\\]
37 :   [-_\w]
36 :   [-._~\d]
35 :   [.0-9]
35 :   [-A-Za-z]
35 :   [^"']
33 :   [-\s]
30 :   [^*/]
28 :   [;\s]
27 :   ['aeio]
25 :   [<\w]
25 :   [\t \S]
24 :   [\w\W]
24 :   [\u4E03\u4E09\u4E5D\u4E8C\u4E94\u516B\u516D\u56DB]
24 :   [+/-9A-Za-z]
24 :   [\u4E00\u4E09\u4E8C\u4E94\u516D\u56DB\u5929\u65E5]
24 :   [ \w]
23 :   [^|\s]
23 :   [^\n\S]
22 :   [^0-9A-Za-z]
22 :   [^"'>]
22 :   [,\s]
22 :   [:ACEPS[aceps]
21 :   [.0-9A-Za-z]
21 :   [\S]
21 :   [!#-'*+-/=?^`{-~\w]
21 :   [NRSnrs]
21 :   [^\t ,;]
21 :   [^\t ,/]
20 :   [!#-'*+-/-9=?^-~]
20 :   [\S\s]
20 :   [^ ).;_]
20 :   [\W]
20 :   [-\w\d]
20 :   [^\n\r\S]
20 :   [0-9A-FXa-fx]
19 :   [^@\w\s]
19 :   [-0-9]
19 :   [(\s]
19 :   [--/\s]
19 :   [\u2E80-\u2FDF\u3040-\u30FA\u30FC-\u312F\u3200-\u32FF\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF]
19 :   [2468]
18 :   [^\t ]
18 :   [_\p{L}]
18 :   [EUeu]
18 :   ["([`\s]
18 :   [^"(),.;[]`\s]
18 :   [")]`\s]
18 :   [^#*,/?[]{}\s]
17 :   [0-9A-z]
16 :   [-.0-9_a-z]
16 :   [\t\n\r ]
16 :   [^=|]
16 :   [^@|\s]
16 :   [\s\p{P}\p{S}]
16 :   [^\n\r;^]
15 :   [-.0-9a-z]
15 :   [!$&-,:;=@]
15 :   [\w\d]
15 :   [a-z\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF\d]
15 :   [\w\s]
15 :   [/\s]
14 :   [a-z\d]
14 :   [_a-z]
14 :   [_\W]
14 :   [-.0-9]
14 :   [^/?]
14 :   [A-Za-z\d]
14 :   [^"\s]
14 :   [-A-Z_a-z]
14 :   [.\s]
14 :   [\u4EDF\u4F70\u5341\u5343\u62FE\u767E]
13 :   [^;\s]
13 :   [\t-\r ]
13 :   [!#-'*+-/-9=?A-Z^-~]
13 :   [^/:?[]]
12 :   [!#-'*+-/=?^-`{-~\d]
12 :   [^()<>\s]
12 :   [$0-9A-Z_a-z]
12 :   [^=\s]
12 :   [0-35-9]
12 :   [13579]
11 :   [\u0001-\t\v-\u007F\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]
11 :   [^0-9A-Z_a-z]
11 :   [%+-.0-9A-Z_a-z]
11 :   [^">]
11 :   [:\w]
11 :   [13578]
11 :   [^#/?]
11 :   [^#?]
11 :   [%+-|\w\s]

Quickly skimming the list:

Can we do something more efficient for hex? It'd be nice if the proposal "Create APIs to deal with processing ASCII text (as bytes)" (#28230) could include an optimal IsHexDigit that regex could just call.
Can we do something more efficient for letters or for letters and digits? As above, it'd be nice if there were an Ascii.IsLetter and Ascii.IsLetterOrDigit that were as efficient as possible.
Some of the sets are strange, e.g. [\s\S] is just a strange way of writing a set that matches everything. We already special-case the "everything" set when it's in its canonical form. Should we convert such sets to be canonical?
We currently support special-casing sets that contain just two or three items, but not negated sets with just two or three items (meaning they match everything other than those items). That should be an easy way to cover some more of these.
Can we interpret a set like [AOao] to be two case-insensitive chars and emit something like ((c |= 0x20) == 'a') | (c == 'o') (and would that be faster than the bitmap)?
Our current bitmap scheme generates a bitmap for ASCII. That covers almost everything in the list, but even then we might be looking things up unnecessarily. We do a check like if (c < 128 && bitmap[c]), but we can tell from the set whether a bound tighter than 128 applies: if we determine the largest possible character value is less than 128, we can both reduce how many characters reach the bitmap lookup and shrink the bitmap. For non-ASCII, should we special-case any Unicode ranges based on the pattern to generate a lookup table there?
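The ideas in the list above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: every function name is hypothetical, the table layout for the bounded bitmap is invented for the example, and nothing here is claimed to match what ended up in the linked PRs.

```c
#include <stdbool.h>
#include <stdint.h>

/* [0-9A-Fa-f]: a digit range check, plus a letter range check with the
   case bit 0x20 forced on so 'A'-'F' and 'a'-'f' share one compare. */
static bool is_hex_digit(uint16_t c) {
    return (uint32_t)(c - '0') <= 9u || (uint32_t)((c | 0x20) - 'a') <= 5u;
}

/* [A-Za-z]: same folding trick over the full letter range. */
static bool is_ascii_letter(uint16_t c) {
    return (uint32_t)((c | 0x20) - 'a') <= 25u;
}

/* [0-9A-Za-z] */
static bool is_ascii_letter_or_digit(uint16_t c) {
    return is_ascii_letter(c) || (uint32_t)(c - '0') <= 9u;
}

/* [AOao] treated as two case-insensitive characters. */
static bool is_a_or_o_ignore_case(uint16_t c) {
    uint16_t lower = (uint16_t)(c | 0x20);
    return lower == 'a' || lower == 'o';
}

/* Negated two-item set [^<>]: everything except the two listed characters. */
static bool is_not_angle_bracket(uint16_t c) {
    return c != '<' && c != '>';
}

/* [mnrs] with a tighter bound: the largest member is 's', so the table
   only needs 's' + 1 entries and larger characters never reach it. */
static const bool MNRS_TABLE['s' + 1] = {
    ['m'] = true, ['n'] = true, ['r'] = true, ['s'] = true,
};

static bool in_mnrs(uint16_t c) {
    return c <= 's' && MNRS_TABLE[c];
}
```

Whether any of these beats the general bitmap in practice would need measuring; the sketch only shows that each set reduces to a handful of compares or a smaller table.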

cc: @joperezr, @GrabYourPitchforks


ghost · 2022-03-23T17:34:32Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`, `untriaged`
Milestone:	7.0.0

danmoseley · 2022-03-23T23:51:39Z

Looking at the options for these in your corpus, some of them effectively collapse together. eg I see that a bunch of uses of [0-9a-f] and [0-9A-F] are often used in patterns that have IgnoreCase, so they're equivalent to [0-9A-Fa-f], and conversely the latter is often used in patterns that would be equivalent if IgnoreCase was set. After the "ignore case work" they will all fold together during parsing. That makes them collectively more interesting to optimize for.

[0-9A-Z_a-z] (and [0-9A-Z_] [0-9_a-z] case insensitively) ought to be treated as \w already, which we already special case?

Some of these correspond to Unicode blocks eg \p{Nd} == [0-9] -- the parser lowers them to the same form, right?

stephentoub · 2022-03-24T00:55:52Z

> [0-9A-Z_a-z] (and [0-9A-Z_] [0-9_a-z] case insensitively) ought to be treated as \w already, which we already special case?

Only if ECMAScript is set, which is very rare. Otherwise by default \w is much more than ASCII.

> Some of these correspond to Unicode blocks eg \p{Nd} == [0-9]

As with \w, the Unicode categories contain, well, Unicode :-) Nd is much more than 0-9. But we already special-case a single Unicode category, anyway.

stephentoub · 2022-03-24T14:28:04Z

> eg I see that a bunch of uses of [0-9a-f] and [0-9A-F] are often used in patterns that have IgnoreCase, so they're equivalent to [0-9A-Fa-f]

This will only be true with IgnoreCase | CultureInvariant. With other cultures, case folding around i, k, and S makes it so that's not the mapping (e.g. k maps not only to upper-case K but also to the Kelvin sign, U+212A).
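A small C sketch of why the Kelvin sign gets in the way of the simple ASCII trick: OR-ing in 0x20 folds K onto k, but it cannot fold U+212A onto k, so a set that must also accept the Kelvin sign needs an extra compare (or a different strategy). The function names are hypothetical, and which cultures actually trigger this mapping is taken from the comment above rather than verified here.

```c
#include <stdbool.h>
#include <stdint.h>

#define KELVIN_SIGN 0x212A /* U+212A KELVIN SIGN */

/* ASCII-only folding: accepts exactly 'k' and 'K'. */
static bool matches_k_ascii_fold(uint16_t c) {
    return (c | 0x20) == 'k';
}

/* Folding that also has to accept the Kelvin sign: one extra compare. */
static bool matches_k_with_kelvin(uint16_t c) {
    return (c | 0x20) == 'k' || c == KELVIN_SIGN;
}
```

This is consistent with sets such as [KMNRTUkmnrtu\u212A] in the regenerated table, where \u212A appears alongside both cases of k.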

It's also many fewer than one might think. Turns out these patterns are often case-sensitive. See below...

> After the "ignore case work" they will all fold together during parsing.

Earlier in .NET 7 I already put in place a limited version of the folding work, limited to just ASCII, to just small ranges (but large enough to handle ASCII letters and digits), etc. That's already enough to get many of these cases, such that the results you're seeing above already factor that in (since it's all done at parsing time). It's just hindered by the special-casing (e.g. for k, which is of course in the a-z range) previously cited.

I regenerated the above table to include whether the set is IgnoreCase / CultureInvariant:

767:   [0-9A-Fa-f]
486:   [A-Za-z]
406:   [0-9A-Za-z]
223:   [A-Z_a-z]
206:   [0-9A-Z_a-z]
175:   [0-9a-z] (IgnoreCase)
168:   [-0-9A-Z_a-z]
156:   [^<>]
147:   [-\w]
139:   [-0-9A-Za-z]
133:   [\s\S]
114:   [0-9a-z]
108:   [0-9a-f]
106:   [^\n\r]
88 :   [0-9A-Za-z] (IgnoreCase)
86 :   [.\w] (IgnoreCase)
86 :   [\s\S] (IgnoreCase)
84 :   [aeio]
83 :   [A-Za-z] (IgnoreCase)
78 :   [AOao]
77 :   ['aeo]
67 :   [-.0-9A-Za-z]
66 :   [0-9A-Z]
65 :   [i\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF] (IgnoreCase)
62 :   [^[]]
56 :   [-.\w]
54 :   [.\w]
51 :   [0-9A-F]
51 :   [-\w] (IgnoreCase)
51 :   [mnrs]
47 :   [^{}]
44 :   [-.0-9A-Z_a-z]
44 :   [--/_]
43 :   [AEae]
41 :   [\p{L}]
40 :   [.0-9A-Z_a-z]
38 :   [-0-9a-z]
38 :   [^A-Za-z]
38 :   [^"\\]
37 :   [-.0-9A-Z_a-z] (IgnoreCase)
36 :   [-._~\d]
36 :   [-0-9_a-z] (IgnoreCase)
35 :   [.0-9]
34 :   [0-9A-Z_a-z] (IgnoreCase)
33 :   [_\w]
33 :   [a-f\d] (IgnoreCase)
33 :   [^>\s] (IgnoreCase)
33 :   [a-f\d] (IgnoreCase, CultureInvariant)
32 :   [-A-Za-z]
32 :   [-\s]
30 :   [^>\s]
29 :   [.\d]
27 :   [-0-9A-Za-z] (IgnoreCase)
27 :   [-0-9A-Z_a-z] (IgnoreCase)
27 :   ['aeio]
25 :   [-0-9a-z] (IgnoreCase)
25 :   [-_\w]
25 :   [A-Fa-f\d]
25 :   [<\w]
25 :   [\t \S] (IgnoreCase)
24 :   [a-z\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF] (IgnoreCase)
24 :   [^<>] (IgnoreCase)
24 :   [+/-9A-Za-z]
23 :   [^|\s]
23 :   [^\n\S]
22 :   [^[]] (IgnoreCase)
22 :   [-.\w] (IgnoreCase)
22 :   [:ACEPS[aceps]
21 :   [^0-9A-Za-z]
21 :   [NRSnrs]
21 :   [ \w]
21 :   [^\t ,;]
21 :   [^\t ,/]
20 :   [\u4E03\u4E09\u4E5D\u4E8C\u4E94\u516B\u516D\u56DB]
20 :   [^ ).;_]
20 :   [-\w\d]
20 :   [A-Fa-f\d] (IgnoreCase, CultureInvariant)
20 :   [^\n\r\S]
20 :   [0-9A-FXa-fx]
19 :   [;\s]
19 :   [.0-9A-Za-z]
19 :   [^"']
19 :   [\S\s]
19 :   [-0-9]
19 :   [,\s]
19 :   [\u2E80-\u2FDF\u3040-\u30FA\u30FC-\u312F\u3200-\u32FF\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF]
19 :   [2468]
18 :   [^*/] (IgnoreCase)
18 :   [EUeu]
18 :   ["([`\s] (IgnoreCase)
18 :   [^"(),.;[]`\s] (IgnoreCase)
18 :   [")]`\s] (IgnoreCase)
18 :   [^#*,/?[]{}\s]
17 :   [\w\W]
17 :   [0-9_a-z] (IgnoreCase)
17 :   [A-Z_a-z] (IgnoreCase)
17 :   [i\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF] (IgnoreCase, CultureInvariant)
17 :   [(\s]
16 :   [\t\n\r ]
16 :   [^"'] (IgnoreCase)
16 :   [0-9A-Za-z] (IgnoreCase, CultureInvariant)
16 :   [^=|]
16 :   [^@|\s]
16 :   [.\d] (IgnoreCase)
16 :   [\s\p{P}\p{S}]
16 :   [^\n\r;^]
15 :   [0-9a-z] (IgnoreCase, CultureInvariant)
15 :   [!$&-,:;=@]
15 :   [\u4E00\u4E09\u4E8C\u4E94\u516D\u56DB\u5929\u65E5]
14 :   [A-Za-z] (IgnoreCase, CultureInvariant)
14 :   [^@\w\s]
14 :   [^"'>]
14 :   [_\W]
14 :   [-.0-9]
14 :   [_\p{L}]
14 :   [-A-Z_a-z]
14 :   [-0-9_a-z]
14 :   [\W] (IgnoreCase)
14 :   [\S] (IgnoreCase)
14 :   [/\s] (IgnoreCase)
13 :   [0-9_a-z] (IgnoreCase, CultureInvariant)
13 :   [!#-'*+-/=?^`{-~\w] (IgnoreCase)
13 :   [\t-\r ]
13 :   [\w\d]
13 :   [\w\s]
13 :   [^/:?[]]
12 :   [!#-'*+-/=?^-`{-~\d]
12 :   [^*/]
12 :   [^/?] (IgnoreCase)
12 :   [0-35-9]
12 :   [\u4EDF\u4F70\u5341\u5343\u62FE\u767E]
12 :   [a-z\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF] (IgnoreCase, CultureInvariant)
12 :   [13579]
12 :   [a-f\d]
11 :   [-.0-9A-Za-z] (IgnoreCase)
11 :   [^0-9A-Z_a-z]
11 :   [-.0-9a-z] (IgnoreCase)
11 :   [^\t ] (IgnoreCase, CultureInvariant)
11 :   [$0-9A-Z_a-z]
11 :   [--/\s]
11 :   [-_\w] (IgnoreCase)
11 :   [13578]
11 :   [A-Za-z\d]
11 :   [%+-|\w\s]
10 :   [0-9A-z]
10 :   [-.0-9_a-z] (IgnoreCase)
10 :   [^"\s]
10 :   [^=\s]
10 :   ['+-.]
10 :   [!#-'*+-/-9=?A-Z^-~]
10 :   [\d\w]
10 :   [c-gil-oq-uwxz] (IgnoreCase)
10 :   [abd-jm-oq-tvwyz] (IgnoreCase)
10 :   [acdf-ik-orsu-z] (IgnoreCase)
10 :   [DEJKMOZdejkmoz\u212A]
10 :   [CEGHR-Uceghr-u]
10 :   [i-kmor] (IgnoreCase)
10 :   [abd-il-np-uwy] (IgnoreCase)
10 :   [KMNRTUkmnrtu\u212A]
10 :   [DEL-OQ-Tdel-oq-t]
10 :   [EMOPemop]
10 :   [eg-imnprwyz] (IgnoreCase)
10 :   [a-cikr-vy] (IgnoreCase)
10 :   [AC-EGHK-Zac-eghk-z\u212A]
10 :   [ace-gilopruz] (IgnoreCase)
10 :   [AE-HK-NR-TWYae-hk-nr-twy\u212A]
10 :   [EOSUWeosuw]
10 :   [a-eg-or-vx-z] (IgnoreCase)
10 :   [CDF-HJ-PRTVWZcdf-hj-prtvwz\u212A]
10 :   [AGKSYZagksyz\u212A]
10 :   [aceginu] (IgnoreCase)
10 :   [ETUetu]
10 :   [EFS-Uefs-u]
10 :   [AMRWamrw]
10 :   [^\n\r"*/:<>?\\|] (IgnoreCase)

stephentoub · 2022-03-24T14:41:45Z

(As an aside, the number of patterns that specify RegexOptions.CultureInvariant without also specifying RegexOptions.IgnoreCase or using (?i) is non-trivial, about 5% of our corpus. We should look at improving the docs here to highlight that doing so is a nop.)

joperezr · 2022-03-24T14:52:14Z

As part of the IgnoreCase work I was planning to update the docs we have for the relevant RegexOptions (IgnoreCase, CultureInvariant, ECMAScript) to better explain how they are involved in matching and how they interact with each other, so that the behavior people get is closer to what they expect.

joperezr · 2022-03-24T17:18:45Z

> We should look at improving the docs here to highlight that doing so is a nop

I wonder if the confusion here is that people expect us to do linguistic comparisons when setting CultureInvariant, which is why they set that without also setting IgnoreCase. Anyway, I do agree that better documentation is in order here, so as I said earlier, I'll improve this as part of the IgnoreCase work.

stephentoub added the area-System.Text.RegularExpressions label Mar 23, 2022

stephentoub added this to the 7.0.0 milestone Mar 23, 2022

stephentoub changed the title ~~Determine if there are further sets to special-cased in RegexCompiler / source generator~~ Determine if there are further sets to special-case in RegexCompiler / source generator Mar 23, 2022

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Mar 23, 2022

stephentoub mentioned this issue Mar 25, 2022

Add tighter bound to range check for matching Regex char classes #67133

Merged

jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Mar 28, 2022

stephentoub self-assigned this Mar 30, 2022

stephentoub mentioned this issue Mar 30, 2022

Improve handling of common Regex sets #67365

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 5, 2022

stephentoub closed this as completed in #67365 Apr 6, 2022

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 6, 2022

ghost locked as resolved and limited conversation to collaborators May 6, 2022
