Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use IndexOf(..., span) in Regex FindFirstChar #60888

Merged
merged 2 commits into from
Oct 27, 2021

Conversation

stephentoub
Copy link
Member

We currently use Boyer-Moore to find a multi-character prefix in FindFirstChar. That's good, and when it was originally added it was very likely better than IndexOf(..., string), but since then the latter has been vectorized and its throughput has improved significantly (we also plan to improve it further). While there are situations where Boyer-Moore does better, for general text searching and typical cases, IndexOf wins. So, this commit adds the ability for FindFirstChar to use IndexOf to search for a span, and prefers that over Boyer-Moore when it's applicable, namely when we're case-sensitive and left-to-right processing.

@ghost
Copy link

ghost commented Oct 26, 2021

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

We currently use Boyer-Moore to find a multi-character prefix in FindFirstChar. That's good, and when it was originally added it was very likely better than IndexOf(..., string), but since then the latter has been vectorized and its throughput has improved significantly (we also plan to improve it further). While there are situations where Boyer-Moore does better, for general text searching and typical cases, IndexOf wins. So, this commit adds the ability for FindFirstChar to use IndexOf to search for a span, and prefers that over Boyer-Moore when it's applicable, namely when we're case-sensitive and left-to-right processing.

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: 7.0.0

We currently use Boyer-Moore to find a multi-character prefix in FindFirstChar.  That's good, and when it was originally added it was very likely better than IndexOf(..., string), but since then the latter has been vectorized and its throughput has improved significantly (we also plan to improve it further).  While there are situations where Boyer-Moore does better, for general text searching and typical cases, IndexOf wins.  So, this commit adds the ability for FindFirstChar to use IndexOf to search for a span, and prefers that over Boyer-Moore when it's applicable, namely when we're case-sensitive and left-to-right processing.
@tannergooding
Copy link
Member

Any perf numbers?

@stephentoub
Copy link
Member Author

Any perf numbers?

With this:

[Params(RegexOptions.None, RegexOptions.Compiled)]
public RegexOptions Options { get; set; }

[GlobalSetup]
public void Setup() => _regexes = File.ReadAllLines(@"words.txt").Select(word => new Regex(word, Options)).ToArray();

private readonly string _haystack = File.ReadAllText(@"i386.txt");
private Regex[] _regexes;

[Benchmark]
public int SearchAll()
{
    int sum = 0;
    foreach (Regex r in _regexes)
    {
        sum += Count(r, _haystack);
    }
    return sum;
}

private static int Count(Regex r, string input)
{
    int count = 0;
    Match m = r.Match(input);
    while (m.Success)
    {
        count++;
        m = m.NextMatch();
    }
    return count;
}

where those two words.txt and i386.txt files are the ones from https://github.com/cloudflare/sliceslice-rs/tree/a27b76c8527d44d5b3534c84b878d8289eacb7ff/data, I get these numbers:

Method Toolchain Options Mean Error StdDev Ratio
SearchAll main\corerun.exe None 3.211 s 0.0237 s 0.0221 s 1.00
SearchAll pr\corerun.exe None 1.934 s 0.0105 s 0.0098 s 0.60
SearchAll main\corerun.exe Compiled 2.602 s 0.0194 s 0.0182 s 1.00
SearchAll pr\corerun.exe Compiled 1.963 s 0.0282 s 0.0264 s 0.75

Now obviously there are cases where this would do worse, in particular cases where the IndexOf overhead doesn't pay off (like a string of all the same character), but on the whole this seems a better approach. The resulting code is simpler and any improvements we make to IndexOf will implicitly benefit here. Plus we can subsequently change how this is all structured to avoid even creating the Boyer-Moore tables if they're not going to be used, and decrease the ctor costs a bit.

@stephentoub stephentoub merged commit 034d5e2 into dotnet:main Oct 27, 2021
@stephentoub stephentoub deleted the ffcindexof branch October 27, 2021 16:44
@ghost ghost locked as resolved and limited conversation to collaborators Nov 26, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants