-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Determine if there are further sets to special-case in RegexCompiler / source generator #67056
Comments
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsIn RegexCompiler and the RegexGenerator source generator, we try to make matching as efficient as possible, and that includes trying to generate the most efficient matching of character classes we can. We special-case various kinds of character classes, e.g. those that contain a single range (e.g. Looking at our corpus of regexes, here are the most popular sets we don't currently special-case (the leading number represents how many times they occur):
Quickly skimming the list:
cc: @joperezr, @GrabYourPitchforks
|
Looking at the options for these in your corpus, some of them effectively collapse together. eg I see that a bunch of uses of
Some of these correspond to Unicode blocks eg |
Only if ECMAScript is set, which is very rare. Otherwise by default \w is much more than ASCII.
As with \w, the Unicode categories contain, well, Unicode :-) Nd is much more than 0-9. But we already special-case a single Unicode category, anyway. |
This will only be true with IgnoreCase | CultureInvariant. With other cultures, case folding around i, k, and S make it so that's not the mapping (e.g. k not only maps to upper-case K but also to the Kelvin symbol). It's also many fewer than one might think. Turns out these patterns are often case-sensitive. See below...
Earlier in .NET 7 I already put in place a limited version of the folding work, limited to just ASCII, to just small ranges (but large enough to handle ASCII letters and digits), etc. That's already enough to get many of these cases, such that the results you're seeing above already factor that in (since it's all done at parsing time). It's just hindered by the special-casing (e.g. for k, which is of course in the a-z range) previously cited. I regenerated the above table to include whether the set is IgnoreCase / InvariantCulture:
|
(As an aside, the number of patterns that specify RegexOptions.InvariantCulture without also specifying RegexOptions.IgnoreCase or using |
As part of the IgnoreCase work I was planning to update the docs we have for the relevant RegexOptions (IgnoreCase, CultureInvariant, ECMAScript) to better explain how are they involved when matching and how they interact to each other so the behavior people get is more expected. |
I wonder if the confusion here is that people expect us to do linguistic comparisons when setting CultureInvariant, which is why they set that but not set IgnoreCase as well. Anyway, I do agree that better documentation is in order here so as said earlier, I'll improve this as part of the IgnoreCase work. |
In RegexCompiler and the RegexGenerator source generator, we try to make matching as efficient as possible, and that includes trying to generate the most efficient matching of character classes we can. We special-case various kinds of character classes, e.g. those that contain a single range (e.g.
[a-z]
), those that contain two characters that are just cased versions of each other (e.g.[Aa]
), those that contain just two or three characters (e.g.['"]
), those that represent known built-in sets (e.g.\w
,\d
,\s
), those that represent a single Unicode category, etc. For everything else, we fall back to a general scheme where we emit a bitmap in which to look up the character. The fallback is generally fast, but the customized approaches are typically faster. For .NET 7, we should spend a little more time determining whether there are any additional category of sets it'd be worth special-casing.Looking at our corpus of regexes, here are the most popular sets we don't currently special-case (the leading number represents how many times they occur):
Quickly skimming the list:
[\s\S]
is just a strange way of writing a set that matches everything. We already special-case the "everything" set when it's in its canonical form. Should we convert such sets to be canonical?[AOao]
to be two case-insensitive chars and emit something like((c |= 0x20) == a) | (c == 'o')
(and would that be faster than the bitmap)?if (c < 128 && bitmap[c])
, but we can tell from the set whether we should actually have a lower bound than 128... if we determine the largest possible character value is less than 128, we can both narrow for how many characters we'll hit the bitmap and also decrease the size of the bitmap. For non-ASCII, should we special-case any Unicode ranges based on the pattern to generate a lookup table there?cc: @joperezr, @GrabYourPitchforks
The text was updated successfully, but these errors were encountered: