Use IndexOf(..., span) in Regex FindFirstChar #60888

stephentoub · 2021-10-26T21:23:51Z

We currently use Boyer-Moore to find a multi-character prefix in FindFirstChar. That's good, and when it was originally added it was very likely better than IndexOf(..., string), but since then the latter has been vectorized and its throughput has improved significantly (we also plan to improve it further). While there are situations where Boyer-Moore does better, for general text searching and typical cases, IndexOf wins. So, this commit adds the ability for FindFirstChar to use IndexOf to search for a span, and prefers that over Boyer-Moore when it's applicable, namely when we're case-sensitive and left-to-right processing.

ghost · 2021-10-26T21:23:55Z

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

We currently use Boyer-Moore to find a multi-character prefix in FindFirstChar. That's good, and when it was originally added it was very likely better than IndexOf(..., string), but since then the latter has been vectorized and its throughput has improved significantly (we also plan to improve it further). While there are situations where Boyer-Moore does better, for general text searching and typical cases, IndexOf wins. So, this commit adds the ability for FindFirstChar to use IndexOf to search for a span, and prefers that over Boyer-Moore when it's applicable, namely when we're case-sensitive and left-to-right processing.

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`
Milestone:	7.0.0

We currently use Boyer-Moore to find a multi-character prefix in FindFirstChar. That's good, and when it was originally added it was very likely better than IndexOf(..., string), but since then the latter has been vectorized and its throughput has improved significantly (we also plan to improve it further). While there are situations where Boyer-Moore does better, for general text searching and typical cases, IndexOf wins. So, this commit adds the ability for FindFirstChar to use IndexOf to search for a span, and prefers that over Boyer-Moore when it's applicable, namely when we're case-sensitive and left-to-right processing.

tannergooding · 2021-10-27T15:19:13Z

Any perf numbers?

stephentoub · 2021-10-27T15:59:47Z

Any perf numbers?

With this:

[Params(RegexOptions.None, RegexOptions.Compiled)]
public RegexOptions Options { get; set; }

[GlobalSetup]
public void Setup() => _regexes = File.ReadAllLines(@"words.txt").Select(word => new Regex(word, Options)).ToArray();

private readonly string _haystack = File.ReadAllText(@"i386.txt");
private Regex[] _regexes;

[Benchmark]
public int SearchAll()
{
    int sum = 0;
    foreach (Regex r in _regexes)
    {
        sum += Count(r, _haystack);
    }
    return sum;
}

private static int Count(Regex r, string input)
{
    int count = 0;
    Match m = r.Match(input);
    while (m.Success)
    {
        count++;
        m = m.NextMatch();
    }
    return count;
}

where those two words.txt and i386.txt files are the ones from https://github.com/cloudflare/sliceslice-rs/tree/a27b76c8527d44d5b3534c84b878d8289eacb7ff/data, I get these numbers:

Method	Toolchain	Options	Mean	Error	StdDev	Ratio
SearchAll	main\corerun.exe	None	3.211 s	0.0237 s	0.0221 s	1.00
SearchAll	pr\corerun.exe	None	1.934 s	0.0105 s	0.0098 s	0.60

SearchAll	main\corerun.exe	Compiled	2.602 s	0.0194 s	0.0182 s	1.00
SearchAll	pr\corerun.exe	Compiled	1.963 s	0.0282 s	0.0264 s	0.75

Now obviously there are cases where this would do worse, in particular cases where the IndexOf overhead doesn't pay off (like a string of all the same character), but on the whole this seems a better approach. The resulting code is simpler and any improvements we make to IndexOf will implicitly benefit here. Plus we can subsequently change how this is all structured to avoid even creating the Boyer-Moore tables if they're not going to be used, and decrease the ctor costs a bit.

stephentoub added the area-System.Text.RegularExpressions label Oct 26, 2021

stephentoub added this to the 7.0.0 milestone Oct 26, 2021

stephentoub added 2 commits October 26, 2021 17:45

Centralize check for whether IndexOf is supported

68e92a7

stephentoub force-pushed the ffcindexof branch from b812054 to 68e92a7 Compare October 26, 2021 22:04

runfoapp bot mentioned this pull request Oct 26, 2021

System.Net.Sockets.Tests.SendPacketsAsync.SendPacketsElement_FileZeroCount_Success sometimes fails #60017

Closed

tannergooding approved these changes Oct 27, 2021

View reviewed changes

stephentoub merged commit 034d5e2 into dotnet:main Oct 27, 2021

stephentoub deleted the ffcindexof branch October 27, 2021 16:44

ghost locked as resolved and limited conversation to collaborators Nov 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use IndexOf(..., span) in Regex FindFirstChar #60888

Use IndexOf(..., span) in Regex FindFirstChar #60888

stephentoub commented Oct 26, 2021

ghost commented Oct 26, 2021

tannergooding commented Oct 27, 2021

stephentoub commented Oct 27, 2021

Use IndexOf(..., span) in Regex FindFirstChar #60888

Use IndexOf(..., span) in Regex FindFirstChar #60888

Conversation

stephentoub commented Oct 26, 2021

ghost commented Oct 26, 2021

tannergooding commented Oct 27, 2021

stephentoub commented Oct 27, 2021