-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use IndexOf(..., span) in Regex FindFirstChar #60888
Conversation
Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions Issue DetailsWe currently use Boyer-Moore to find a multi-character prefix in FindFirstChar. That's good, and when it was originally added it was very likely better than IndexOf(..., string), but since then the latter has been vectorized and its throughput has improved significantly (we also plan to improve it further). While there are situations where Boyer-Moore does better, for general text searching and typical cases, IndexOf wins. So, this commit adds the ability for FindFirstChar to use IndexOf to search for a span, and prefers that over Boyer-Moore when it's applicable, namely when we're case-sensitive and left-to-right processing.
|
We currently use Boyer-Moore to find a multi-character prefix in FindFirstChar. That's good, and when it was originally added it was very likely better than IndexOf(..., string), but since then the latter has been vectorized and its throughput has improved significantly (we also plan to improve it further). While there are situations where Boyer-Moore does better, for general text searching and typical cases, IndexOf wins. So, this commit adds the ability for FindFirstChar to use IndexOf to search for a span, and prefers that over Boyer-Moore when it's applicable, namely when we're case-sensitive and left-to-right processing.
b812054
to
68e92a7
Compare
Any perf numbers? |
With this: [Params(RegexOptions.None, RegexOptions.Compiled)]
public RegexOptions Options { get; set; }
[GlobalSetup]
public void Setup() => _regexes = File.ReadAllLines(@"words.txt").Select(word => new Regex(word, Options)).ToArray();
private readonly string _haystack = File.ReadAllText(@"i386.txt");
private Regex[] _regexes;
[Benchmark]
public int SearchAll()
{
int sum = 0;
foreach (Regex r in _regexes)
{
sum += Count(r, _haystack);
}
return sum;
}
private static int Count(Regex r, string input)
{
int count = 0;
Match m = r.Match(input);
while (m.Success)
{
count++;
m = m.NextMatch();
}
return count;
} where those two words.txt and i386.txt files are the ones from https://github.com/cloudflare/sliceslice-rs/tree/a27b76c8527d44d5b3534c84b878d8289eacb7ff/data, I get these numbers:
Now obviously there are cases where this would do worse, in particular cases where the IndexOf overhead doesn't pay off (like a string of all the same character), but on the whole this seems a better approach. The resulting code is simpler and any improvements we make to IndexOf will implicitly benefit here. Plus we can subsequently change how this is all structured to avoid even creating the Boyer-Moore tables if they're not going to be used, and decrease the ctor costs a bit. |
We currently use Boyer-Moore to find a multi-character prefix in FindFirstChar. That's good, and when it was originally added it was very likely better than IndexOf(..., string), but since then the latter has been vectorized and its throughput has improved significantly (we also plan to improve it further). While there are situations where Boyer-Moore does better, for general text searching and typical cases, IndexOf wins. So, this commit adds the ability for FindFirstChar to use IndexOf to search for a span, and prefers that over Boyer-Moore when it's applicable, namely when we're case-sensitive and left-to-right processing.