Determine if there are further sets to special-case in RegexCompiler / source generator #67056

Closed
stephentoub opened this issue Mar 23, 2022 · 7 comments · Fixed by #67365

@stephentoub
Member

stephentoub commented Mar 23, 2022

In RegexCompiler and the RegexGenerator source generator, we try to make matching as efficient as possible, and that includes trying to generate the most efficient matching of character classes we can. We special-case various kinds of character classes, e.g. those that contain a single range (e.g. [a-z]), those that contain two characters that are just cased versions of each other (e.g. [Aa]), those that contain just two or three characters (e.g. ['"]), those that represent known built-in sets (e.g. \w, \d, \s), those that represent a single Unicode category, etc. For everything else, we fall back to a general scheme where we emit a bitmap in which to look up the character. The fallback is generally fast, but the customized approaches are typically faster. For .NET 7, we should spend a little more time determining whether there are any additional categories of sets that would be worth special-casing.
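To make the difference between the customized checks and the bitmap fallback concrete, here is a minimal hand-written sketch. The class and method names (`SetChecks`, `IsInLowerAsciiRange`, `IsInAeio`, and so on) are hypothetical, and the code that RegexCompiler or the source generator actually emits may well look different; this only illustrates the general shape of each approach.

```csharp
using System;

// Hypothetical hand-written equivalents of the kinds of checks described above.
// This is NOT the code RegexCompiler or the source generator actually emits.
public static class SetChecks
{
    // [a-z]: a single range becomes one subtraction and one unsigned comparison.
    public static bool IsInLowerAsciiRange(char c) =>
        (uint)(c - 'a') <= (uint)('z' - 'a');

    // [Aa]: two characters that differ only by the ASCII case bit (0x20).
    public static bool IsUpperOrLowerA(char c) => (c | 0x20) == 'a';

    // ['"]: two or three characters become direct comparisons.
    public static bool IsQuote(char c) => c == '\'' || c == '"';

    // Fallback for an arbitrary set, here [aeio]: one bit per ASCII character.
    private static readonly ulong[] s_aeioBitmap = BuildAsciiBitmap("aeio");

    public static bool IsInAeio(char c) =>
        c < 128 && (s_aeioBitmap[c >> 6] & (1UL << (c & 63))) != 0;

    // Builds a 128-bit bitmap (two 64-bit words) with a bit set for each character.
    private static ulong[] BuildAsciiBitmap(string chars)
    {
        var bitmap = new ulong[2];
        foreach (char ch in chars)
        {
            if (ch >= 128) throw new ArgumentException("ASCII only", nameof(chars));
            bitmap[ch >> 6] |= 1UL << (ch & 63);
        }
        return bitmap;
    }
}
```

The first three checks need no memory access at all, whereas the fallback needs a bounds comparison plus a table lookup, which is the gap the rest of this issue tries to narrow.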

Looking at our corpus of regexes, here are the most popular sets we don't currently special-case (the leading number represents how many times they occur):

767:   [0-9A-Fa-f]
583:   [A-Za-z]
510:   [0-9A-Za-z]
304:   [0-9a-z]
245:   [0-9A-Z_a-z]
241:   [A-Z_a-z]
219:   [\s\S]
199:   [-\w]
195:   [-0-9A-Z_a-z]
183:   [^<>]
168:   [-0-9A-Za-z]
140:   [.\w]
115:   [^\n\r]
108:   [0-9a-f]
89 :   [^[]]
84 :   [aeio]
82 :   [i\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]
82 :   [-.0-9A-Z_a-z]
81 :   [-.0-9A-Za-z]
78 :   [-.\w]
78 :   [AOao]
78 :   [a-f\d]
77 :   ['aeo]
72 :   [-0-9a-z]
66 :   [0-9A-Z]
63 :   [^>\s]
51 :   [0-9A-F]
51 :   [-0-9_a-z]
51 :   [mnrs]
49 :   [^{}]
48 :   [A-Fa-f\d]
48 :   [.0-9A-Z_a-z]
46 :   [.\d]
45 :   [a-z\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]
44 :   [--/_]
43 :   [AEae]
43 :   [\p{L}]
41 :   [_\w]
39 :   [^A-Za-z]
38 :   [0-9_a-z]
38 :   [^"\\]
37 :   [-_\w]
36 :   [-._~\d]
35 :   [.0-9]
35 :   [-A-Za-z]
35 :   [^"']
33 :   [-\s]
30 :   [^*/]
28 :   [;\s]
27 :   ['aeio]
25 :   [<\w]
25 :   [\t \S]
24 :   [\w\W]
24 :   [\u4E03\u4E09\u4E5D\u4E8C\u4E94\u516B\u516D\u56DB]
24 :   [+/-9A-Za-z]
24 :   [\u4E00\u4E09\u4E8C\u4E94\u516D\u56DB\u5929\u65E5]
24 :   [ \w]
23 :   [^|\s]
23 :   [^\n\S]
22 :   [^0-9A-Za-z]
22 :   [^"'>]
22 :   [,\s]
22 :   [:ACEPS[aceps]
21 :   [.0-9A-Za-z]
21 :   [\S]
21 :   [!#-'*+-/=?^`{-~\w]
21 :   [NRSnrs]
21 :   [^\t ,;]
21 :   [^\t ,/]
20 :   [!#-'*+-/-9=?^-~]
20 :   [\S\s]
20 :   [^ ).;_]
20 :   [\W]
20 :   [-\w\d]
20 :   [^\n\r\S]
20 :   [0-9A-FXa-fx]
19 :   [^@\w\s]
19 :   [-0-9]
19 :   [(\s]
19 :   [--/\s]
19 :   [\u2E80-\u2FDF\u3040-\u30FA\u30FC-\u312F\u3200-\u32FF\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF]
19 :   [2468]
18 :   [^\t ]
18 :   [_\p{L}]
18 :   [EUeu]
18 :   ["([`\s]
18 :   [^"(),.;[]`\s]
18 :   [")]`\s]
18 :   [^#*,/?[]{}\s]
17 :   [0-9A-z]
16 :   [-.0-9_a-z]
16 :   [\t\n\r ]
16 :   [^=|]
16 :   [^@|\s]
16 :   [\s\p{P}\p{S}]
16 :   [^\n\r;^]
15 :   [-.0-9a-z]
15 :   [!$&-,:;=@]
15 :   [\w\d]
15 :   [a-z\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF\d]
15 :   [\w\s]
15 :   [/\s]
14 :   [a-z\d]
14 :   [_a-z]
14 :   [_\W]
14 :   [-.0-9]
14 :   [^/?]
14 :   [A-Za-z\d]
14 :   [^"\s]
14 :   [-A-Z_a-z]
14 :   [.\s]
14 :   [\u4EDF\u4F70\u5341\u5343\u62FE\u767E]
13 :   [^;\s]
13 :   [\t-\r ]
13 :   [!#-'*+-/-9=?A-Z^-~]
13 :   [^/:?[]]
12 :   [!#-'*+-/=?^-`{-~\d]
12 :   [^()<>\s]
12 :   [$0-9A-Z_a-z]
12 :   [^=\s]
12 :   [0-35-9]
12 :   [13579]
11 :   [\u0001-\t\v-\u007F\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]
11 :   [^0-9A-Z_a-z]
11 :   [%+-.0-9A-Z_a-z]
11 :   [^">]
11 :   [:\w]
11 :   [13578]
11 :   [^#/?]
11 :   [^#?]
11 :   [%+-|\w\s]

Quickly skimming the list:

  • Can we do something more efficient for hex? It'd be nice if the APIs proposed in "Create APIs to deal with processing ASCII text (as bytes)" (#28230) could include an IsHexDigit that regex could just call and that was optimal.
  • Can we do something more efficient for letters or for letters and digits? As above, it'd be nice if there were an Ascii.IsLetter and Ascii.IsLetterOrDigit that were as efficient as possible.
  • Some of the sets are strange, e.g. [\s\S] is just a strange way of writing a set that matches everything. We already special-case the "everything" set when it's in its canonical form. Should we convert such sets to be canonical?
  • We currently support special-casing sets that contain just two or three items, but not sets that are negated with just two or three items (meaning they support everything other than those). That should be an easy way to get some more of these.
  • Can we interpret a set like [AOao] to be two case-insensitive chars and emit something like ((c |= 0x20) == 'a') | (c == 'o') (and would that be faster than the bitmap)?
  • Our current bitmap scheme generates a bitmap for ASCII. That covers almost everything in the list, but even then we might be looking things up unnecessarily. We do a check like if (c < 128 && bitmap[c]), but we can tell from the set whether the bound could actually be lower than 128: if we determine that the largest possible character value is less than 128, we can both narrow the range of characters for which we hit the bitmap and decrease the size of the bitmap. For non-ASCII, should we special-case any Unicode ranges based on the pattern to generate a lookup table there?
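Several of the ideas in this list can be sketched as small branch-light checks. The following is only an illustration under assumptions: the class `ProposedSetChecks` and all of its members are hypothetical names, not APIs from the regex implementation or from #28230, and whether any of these beats the existing bitmap would need to be measured.

```csharp
using System;

// Hypothetical sketches of the ideas in the list above; none of these names
// come from the regex implementation.
public static class ProposedSetChecks
{
    // [0-9A-Fa-f]: a digit range check plus a case-folded letter range check.
    public static bool IsHexDigit(char c) =>
        (uint)(c - '0') <= 9u || (uint)((c | 0x20) - 'a') <= (uint)('f' - 'a');

    // [A-Za-z]: fold the ASCII case bit, then do a single range check.
    public static bool IsAsciiLetter(char c) =>
        (uint)((c | 0x20) - 'a') <= (uint)('z' - 'a');

    // [0-9A-Za-z]: letters or digits.
    public static bool IsAsciiLetterOrDigit(char c) =>
        IsAsciiLetter(c) || (uint)(c - '0') <= 9u;

    // [^<>]: a negated set with two items is the inverse of two comparisons.
    public static bool IsNotAngleBracket(char c) => c != '<' && c != '>';

    // [AOao]: fold the case bit once, then compare against both lowercase letters.
    public static bool IsAOao(char c)
    {
        int lower = c | 0x20;
        return lower == 'a' | lower == 'o';
    }

    // [2468]: the largest member is '8' (0x38), so the bound can be '8' rather
    // than 128 and the whole bitmap fits in a single 64-bit constant.
    private const ulong EvenDigitsBitmap =
        (1UL << '2') | (1UL << '4') | (1UL << '6') | (1UL << '8');

    public static bool IsIn2468(char c) =>
        c <= '8' && ((EvenDigitsBitmap >> c) & 1) != 0;
}
```

The case-folding trick relies on `c | 0x20` mapping only the upper- and lower-case ASCII letter onto the lowercase value, so non-ASCII characters never produce a false positive in these checks.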

cc: @joperezr, @GrabYourPitchforks

@stephentoub stephentoub added this to the 7.0.0 milestone Mar 23, 2022
@stephentoub stephentoub changed the title from "Determine if there are further sets to special-cased in RegexCompiler / source generator" to "Determine if there are further sets to special-case in RegexCompiler / source generator" Mar 23, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged label Mar 23, 2022
@ghost

ghost commented Mar 23, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.


@danmoseley
Member

danmoseley commented Mar 23, 2022

Looking at the options for these in your corpus, some of them effectively collapse together. eg I see that a bunch of uses of [0-9a-f] and [0-9A-F] are often used in patterns that have IgnoreCase, so they're equivalent to [0-9A-Fa-f], and conversely the latter is often used in patterns that would be equivalent if IgnoreCase was set. After the "ignore case work" they will all fold together during parsing. That makes them collectively more interesting to optimize for.

[0-9A-Z_a-z] (and [0-9A-Z_] [0-9_a-z] case insensitively) ought to be treated as \w already, which we already special case?

Some of these correspond to Unicode blocks eg \p{Nd} == [0-9] -- the parser lowers them to the same form, right?

@stephentoub
Member Author

[0-9A-Z_a-z] (and [0-9A-Z_] [0-9_a-z] case insensitively) ought to be treated as \w already, which we already special case?

Only if ECMAScript is set, which is very rare. Otherwise by default \w is much more than ASCII.

Some of these correspond to Unicode blocks eg \p{Nd} == [0-9]

As with \w, the Unicode categories contain, well, Unicode :-) Nd is much more than 0-9. But we already special-case a single Unicode category, anyway.
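The difference between the default and ECMAScript meanings of \w and \d can be checked with a short sketch like the one below. The class name `EcmaScriptDemo` and its method names are hypothetical; the sample characters are 'é' (U+00E9, a non-ASCII letter) and U+0663 (ARABIC-INDIC DIGIT THREE, which is in category Nd).

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class; only Regex.IsMatch and RegexOptions are real APIs here.
public static class EcmaScriptDemo
{
    // By default \w follows Unicode, so a non-ASCII letter matches.
    public static bool WordDefault() => Regex.IsMatch("\u00E9", @"^\w$");

    // With ECMAScript, \w is restricted to ASCII letters, digits, and '_'.
    public static bool WordEcma() =>
        Regex.IsMatch("\u00E9", @"^\w$", RegexOptions.ECMAScript);

    // By default \d matches any Unicode decimal digit.
    public static bool DigitDefault() => Regex.IsMatch("\u0663", @"^\d$");

    // With ECMAScript, \d is just [0-9].
    public static bool DigitEcma() =>
        Regex.IsMatch("\u0663", @"^\d$", RegexOptions.ECMAScript);

    // \p{Nd} is likewise broader than [0-9].
    public static bool NdCategory() => Regex.IsMatch("\u0663", @"^\p{Nd}$");
}
```

Under these inputs the default and \p{Nd} variants match while the ECMAScript variants do not, which is why [0-9A-Z_a-z] cannot simply be treated as \w in the common case.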

@stephentoub
Member Author

eg I see that a bunch of uses of [0-9a-f] and [0-9A-F] are often used in patterns that have IgnoreCase, so they're equivalent to [0-9A-Fa-f]

This will only be true with IgnoreCase | CultureInvariant. With other cultures, case folding around i, k, and S makes it so that's not the mapping (e.g. k not only maps to upper-case K but also to the Kelvin symbol).
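For anyone who wants to probe this behavior on their own runtime, a small sketch follows. The class `CaseFoldingProbe` and its members are hypothetical names, and no results are asserted for the IgnoreCase cases because they depend on the runtime version and on the culture data available on the machine.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical probe; it reports behavior rather than assuming it.
public static class CaseFoldingProbe
{
    // U+212A KELVIN SIGN, the character mentioned above.
    public const char KelvinSign = '\u212A';

    // Does the pattern "k" match the Kelvin sign under the given options?
    public static bool LowerKMatchesKelvin(RegexOptions options) =>
        Regex.IsMatch(KelvinSign.ToString(), "^k$", options);

    // How does a given culture lower-case 'I'? Cultures with special casing
    // rules for i (for example Turkish) may not return 'i'.
    public static char LowerI(CultureInfo culture) => char.ToLower('I', culture);
}
```

Calling `CaseFoldingProbe.LowerKMatchesKelvin(RegexOptions.IgnoreCase)` and `CaseFoldingProbe.LowerKMatchesKelvin(RegexOptions.IgnoreCase | RegexOptions.CultureInvariant)` and comparing the results is one way to see how the current culture affects the mapping.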

It's also many fewer than one might think. Turns out these patterns are often case-sensitive. See below...

After the "ignore case work" they will all fold together during parsing.

Earlier in .NET 7 I already put in place a limited version of the folding work, limited to just ASCII, to just small ranges (but large enough to handle ASCII letters and digits), etc. That's already enough to get many of these cases, such that the results you're seeing above already factor that in (since it's all done at parsing time). It's just hindered by the special-casing (e.g. for k, which is of course in the a-z range) previously cited.

I regenerated the above table to include whether the set is IgnoreCase / CultureInvariant:

767:   [0-9A-Fa-f]
486:   [A-Za-z]
406:   [0-9A-Za-z]
223:   [A-Z_a-z]
206:   [0-9A-Z_a-z]
175:   [0-9a-z] (IgnoreCase)
168:   [-0-9A-Z_a-z]
156:   [^<>]
147:   [-\w]
139:   [-0-9A-Za-z]
133:   [\s\S]
114:   [0-9a-z]
108:   [0-9a-f]
106:   [^\n\r]
88 :   [0-9A-Za-z] (IgnoreCase)
86 :   [.\w] (IgnoreCase)
86 :   [\s\S] (IgnoreCase)
84 :   [aeio]
83 :   [A-Za-z] (IgnoreCase)
78 :   [AOao]
77 :   ['aeo]
67 :   [-.0-9A-Za-z]
66 :   [0-9A-Z]
65 :   [i\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF] (IgnoreCase)
62 :   [^[]]
56 :   [-.\w]
54 :   [.\w]
51 :   [0-9A-F]
51 :   [-\w] (IgnoreCase)
51 :   [mnrs]
47 :   [^{}]
44 :   [-.0-9A-Z_a-z]
44 :   [--/_]
43 :   [AEae]
41 :   [\p{L}]
40 :   [.0-9A-Z_a-z]
38 :   [-0-9a-z]
38 :   [^A-Za-z]
38 :   [^"\\]
37 :   [-.0-9A-Z_a-z] (IgnoreCase)
36 :   [-._~\d]
36 :   [-0-9_a-z] (IgnoreCase)
35 :   [.0-9]
34 :   [0-9A-Z_a-z] (IgnoreCase)
33 :   [_\w]
33 :   [a-f\d] (IgnoreCase)
33 :   [^>\s] (IgnoreCase)
33 :   [a-f\d] (IgnoreCase, CultureInvariant)
32 :   [-A-Za-z]
32 :   [-\s]
30 :   [^>\s]
29 :   [.\d]
27 :   [-0-9A-Za-z] (IgnoreCase)
27 :   [-0-9A-Z_a-z] (IgnoreCase)
27 :   ['aeio]
25 :   [-0-9a-z] (IgnoreCase)
25 :   [-_\w]
25 :   [A-Fa-f\d]
25 :   [<\w]
25 :   [\t \S] (IgnoreCase)
24 :   [a-z\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF] (IgnoreCase)
24 :   [^<>] (IgnoreCase)
24 :   [+/-9A-Za-z]
23 :   [^|\s]
23 :   [^\n\S]
22 :   [^[]] (IgnoreCase)
22 :   [-.\w] (IgnoreCase)
22 :   [:ACEPS[aceps]
21 :   [^0-9A-Za-z]
21 :   [NRSnrs]
21 :   [ \w]
21 :   [^\t ,;]
21 :   [^\t ,/]
20 :   [\u4E03\u4E09\u4E5D\u4E8C\u4E94\u516B\u516D\u56DB]
20 :   [^ ).;_]
20 :   [-\w\d]
20 :   [A-Fa-f\d] (IgnoreCase, CultureInvariant)
20 :   [^\n\r\S]
20 :   [0-9A-FXa-fx]
19 :   [;\s]
19 :   [.0-9A-Za-z]
19 :   [^"']
19 :   [\S\s]
19 :   [-0-9]
19 :   [,\s]
19 :   [\u2E80-\u2FDF\u3040-\u30FA\u30FC-\u312F\u3200-\u32FF\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF]
19 :   [2468]
18 :   [^*/] (IgnoreCase)
18 :   [EUeu]
18 :   ["([`\s] (IgnoreCase)
18 :   [^"(),.;[]`\s] (IgnoreCase)
18 :   [")]`\s] (IgnoreCase)
18 :   [^#*,/?[]{}\s]
17 :   [\w\W]
17 :   [0-9_a-z] (IgnoreCase)
17 :   [A-Z_a-z] (IgnoreCase)
17 :   [i\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF] (IgnoreCase, CultureInvariant)
17 :   [(\s]
16 :   [\t\n\r ]
16 :   [^"'] (IgnoreCase)
16 :   [0-9A-Za-z] (IgnoreCase, CultureInvariant)
16 :   [^=|]
16 :   [^@|\s]
16 :   [.\d] (IgnoreCase)
16 :   [\s\p{P}\p{S}]
16 :   [^\n\r;^]
15 :   [0-9a-z] (IgnoreCase, CultureInvariant)
15 :   [!$&-,:;=@]
15 :   [\u4E00\u4E09\u4E8C\u4E94\u516D\u56DB\u5929\u65E5]
14 :   [A-Za-z] (IgnoreCase, CultureInvariant)
14 :   [^@\w\s]
14 :   [^"'>]
14 :   [_\W]
14 :   [-.0-9]
14 :   [_\p{L}]
14 :   [-A-Z_a-z]
14 :   [-0-9_a-z]
14 :   [\W] (IgnoreCase)
14 :   [\S] (IgnoreCase)
14 :   [/\s] (IgnoreCase)
13 :   [0-9_a-z] (IgnoreCase, CultureInvariant)
13 :   [!#-'*+-/=?^`{-~\w] (IgnoreCase)
13 :   [\t-\r ]
13 :   [\w\d]
13 :   [\w\s]
13 :   [^/:?[]]
12 :   [!#-'*+-/=?^-`{-~\d]
12 :   [^*/]
12 :   [^/?] (IgnoreCase)
12 :   [0-35-9]
12 :   [\u4EDF\u4F70\u5341\u5343\u62FE\u767E]
12 :   [a-z\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF] (IgnoreCase, CultureInvariant)
12 :   [13579]
12 :   [a-f\d]
11 :   [-.0-9A-Za-z] (IgnoreCase)
11 :   [^0-9A-Z_a-z]
11 :   [-.0-9a-z] (IgnoreCase)
11 :   [^\t ] (IgnoreCase, CultureInvariant)
11 :   [$0-9A-Z_a-z]
11 :   [--/\s]
11 :   [-_\w] (IgnoreCase)
11 :   [13578]
11 :   [A-Za-z\d]
11 :   [%+-|\w\s]
10 :   [0-9A-z]
10 :   [-.0-9_a-z] (IgnoreCase)
10 :   [^"\s]
10 :   [^=\s]
10 :   ['+-.]
10 :   [!#-'*+-/-9=?A-Z^-~]
10 :   [\d\w]
10 :   [c-gil-oq-uwxz] (IgnoreCase)
10 :   [abd-jm-oq-tvwyz] (IgnoreCase)
10 :   [acdf-ik-orsu-z] (IgnoreCase)
10 :   [DEJKMOZdejkmoz\u212A]
10 :   [CEGHR-Uceghr-u]
10 :   [i-kmor] (IgnoreCase)
10 :   [abd-il-np-uwy] (IgnoreCase)
10 :   [KMNRTUkmnrtu\u212A]
10 :   [DEL-OQ-Tdel-oq-t]
10 :   [EMOPemop]
10 :   [eg-imnprwyz] (IgnoreCase)
10 :   [a-cikr-vy] (IgnoreCase)
10 :   [AC-EGHK-Zac-eghk-z\u212A]
10 :   [ace-gilopruz] (IgnoreCase)
10 :   [AE-HK-NR-TWYae-hk-nr-twy\u212A]
10 :   [EOSUWeosuw]
10 :   [a-eg-or-vx-z] (IgnoreCase)
10 :   [CDF-HJ-PRTVWZcdf-hj-prtvwz\u212A]
10 :   [AGKSYZagksyz\u212A]
10 :   [aceginu] (IgnoreCase)
10 :   [ETUetu]
10 :   [EFS-Uefs-u]
10 :   [AMRWamrw]
10 :   [^\n\r"*/:<>?\\|] (IgnoreCase)

@stephentoub
Member Author

stephentoub commented Mar 24, 2022

(As an aside, the number of patterns that specify RegexOptions.CultureInvariant without also specifying RegexOptions.IgnoreCase or using (?i) is non-trivial, about 5% of our corpus. We should look at improving the docs here to highlight that doing so is a nop.)
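A minimal sketch of why that combination is a nop: CultureInvariant only influences how case-insensitive comparisons are performed, so on its own it leaves a case-sensitive match unchanged. The class `CultureInvariantDemo` and its method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class; only Regex.IsMatch and RegexOptions are real APIs here.
public static class CultureInvariantDemo
{
    // CultureInvariant alone: matching stays case-sensitive, so "ABC" != "abc".
    public static bool InvariantOnly() =>
        Regex.IsMatch("ABC", "^abc$", RegexOptions.CultureInvariant);

    // Combined with IgnoreCase, the option has something to act on.
    public static bool IgnoreCaseAndInvariant() =>
        Regex.IsMatch("ABC", "^abc$",
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // The inline (?i) construct also enables case-insensitive matching.
    public static bool InlineIgnoreCase() =>
        Regex.IsMatch("ABC", "(?i)^abc$", RegexOptions.CultureInvariant);
}
```

Here `InvariantOnly()` does not match, while the other two do.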

@joperezr
Member

As part of the IgnoreCase work, I was planning to update the docs for the relevant RegexOptions (IgnoreCase, CultureInvariant, ECMAScript) to better explain how they are involved in matching and how they interact with each other, so that the behavior people get is closer to what they expect.

@joperezr
Member

We should look at improving the docs here to highlight that doing so is a nop

I wonder if the confusion here is that people expect us to do linguistic comparisons when CultureInvariant is set, which is why they set it without also setting IgnoreCase. Anyway, I agree that better documentation is in order here, so as I said earlier, I'll improve this as part of the IgnoreCase work.

@jeffschwMSFT jeffschwMSFT removed the untriaged label Mar 28, 2022
@stephentoub stephentoub self-assigned this Mar 30, 2022
@ghost ghost added the in-pr label Apr 5, 2022
@ghost ghost removed the in-pr label Apr 6, 2022
@ghost ghost locked as resolved and limited conversation to collaborators May 6, 2022