Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

performance issue in ReusableStringBuilder.cs with large string and many appends #1593

Closed
kfritz-msft opened this issue Jan 20, 2017 · 10 comments

Comments

@kfritz-msft
Copy link

I’ve hit a bottleneck related to the stringbuilder usage in msbuild.

The problem is with this code (starting at line 86):

// Backwards because the end of the string is (by observation of Australian Government build) more likely to be different earlier in the loop.
// For example, C:\project1, C:\project2
for (int i = borrowedBuilder.Length - 1; i >= 0; --i)
{
    if (borrowedBuilder[i] != other[i])
    {
        return false;
    }
}

In my scenario, the borrowedBuilder was created by reading a file with about 300K lines (so at least that many Append() calls), and total file size is about 45MB (file is admittedly larger than it needs to be, and will be addressed independently).

I had built a quick demo app that simulates this, and the problem is with indexing into the string builder to get the character. It takes forever to run this loop (maybe an hour?). In my demo app, if I were to create a temporary string from the string builder, the comparison loop is really fast (maybe a second?). I didn’t do actual measurements since it is very obviously different.

I also just noticed that if a StringBuilder is initialized with the file size as capacity, the problem also goes away. I didn't trace throughout all of the code, so maybe that isn't trivial to get? In my hacky demo app, I just did this to mitigate it:

FileInfo fi = new FileInfo(filepath);

StringBuilder sb = new StringBuilder((int)fi.Length);

But even if that isn't possible, calling ToString() on the string builder and using that for comparison is a huge improvement when the size is large.

@danmoseley
Copy link
Member

danmoseley commented Mar 2, 2017

@kfritz-msft was it really going through this loop only once? I did a trivial test below and it takes 0.3sec.

Which isn't to say it can't be improved.

namespace ConsoleApp1
{
    class Program
    {
        static int Main(string[] args)
        {
            var str1 = new string('x', 45000000); 
            var str2 = new string('x', 45000000); 
            var sb1 = new StringBuilder(str1);
            var sb2 = new StringBuilder(str2);

            var sw = new Stopwatch();
            sw.Start();            

            for (int i = sb1.Length - 1; i >= 0; --i)
            {
                if (sb1[i] != sb2[i])
                {
                    Console.WriteLine("not equal");
                    return 1;
                }
            }

            sw.Stop();

            Console.WriteLine("equal");

            Console.WriteLine($"Elapsed {sw.Elapsed}");

            return 0;
        }
    }
}

@kfritz-msft
Copy link
Author

Thanks for taking a look. Yes, that version is fast as the StringBuilder is initialized with the right size and there are no appends. I should have included a full repro app, sorry. The problem only happens when indexing into a StringBuilder that was filled up with a lot of appends. If you pre-init the size, it is fine. I think the StringBuilder was created with default size, and the file I had was really big, and each line got appended into the StringBuilder.

I didn't debug actual MSBuild code, but this should repro the behavior I saw when running MSBuild (it will appear to hang when the last loop is executing):

using System;
using System.Text;

namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
StringBuilder sb = new StringBuilder();

        // simulate filling it up with appends from a file that has lots of files listed in it
        for(int i = 0; i < 300000; i++)
        {
            sb.Append(new string('x', 50));
            sb.Append(";");
        }

        string a = sb.ToString();

        if(a == sb.ToString())
        {
            Console.WriteLine("call ToString() first, and fast");
        }

        // repro loop
        for (int i = a.Length - 1; i >= 0; i--)
        {
            if (a[i] != sb[i])
            {
                break;
            }
        }
        Console.WriteLine("done, and really slow");
    }
}

}

@danmoseley
Copy link
Member

I realized I never opened an issue for this. I now opened one. Any fix would not be in Desktop for some time, so it would probably be reasonable for MSBuild to work around this (in both places here and here) with a simple check of the length eg., over 500 or 1000 then do ToString().

Perhaps you'd like to offer such a PR @kfritz-msft ? I am not on the MSBuild team but I expect they'd take it if you included some simple measurements.

@danmoseley
Copy link
Member

danmoseley commented Dec 12, 2017

@Alois-xx I would imagine the change would be something like

            if (length > 40_000)
            {
                return String.Equals(_borrowedBuilder.ToString(), other, StringComparison.Ordinal);
            }

in the two places linked, plus explanatory comment. Note: one of those puts the length in a local, both should probably do that, as it improves the JITted code.

Plus a test that puts >40K into a string to hit both places.

And MSBuild folks may want to ask for some perf measurement because of the extra branch in this "hot" path.

@davkean maybe you could assign to @Alois-xx if you agree with this change

@Alois-xx
Copy link
Contributor

Alois-xx commented Jan 1, 2018

@danmosemsft ,@davkean: I have create a PR for that 1593_ReusableStringBuilderPerf.

Local profilling with a sample msbuild target that appends all dlls from Windows\System32*.dll up to 5 times and then compares it against a string shows an improvement of a factor two.

grafik

Other issues talking about an exponential perf hit are not due to the StringBuilder issue but could be due to msbuild batching where a target is call n^x times which could indeed lead to dramatic slowdowns. The Stringbuilder issue a "simple" O(n) problem which can be improved by a factor two, but if the target is called many times one should rework the msbuild script to create smaller lists.
The PR commit message also contains also the msbuild target file which I did use to test the performance before/after the change.

@davkean
Copy link
Member

davkean commented Jan 1, 2018

Nice, don't see the PR?

@Alois-xx
Copy link
Contributor

Alois-xx commented Jan 1, 2018

#2834

AndyGerlicher pushed a commit that referenced this issue Jan 4, 2018
…#2834)

Fix for issue #1593 where MSBuild spends a long time checking if a string is equal char by char of a large StringBuilder instance.
* Profiling showed that StringBuilder[xx] character access performance is much slower compared to create the string and compare the two string directly.
* A second optimization is to cache the created string until the StringBuilder is modified again which gives another 15%.

This issue pops up when large directories with many files are stored in MSBuild lists which are then compared for equality which will internally cause MSBuild to stringify and compare the large lists.
@danmoseley
Copy link
Member

Thanks @Alois-xx ! When MSBuild can depend on .NET Core 2.1, your code can be replaced with the new Equals overload and become more efficient.

@danmoseley
Copy link
Member

And thanks @kfritz-msft for flagging this so we could fix it.

@danmoseley
Copy link
Member

dotnet/coreclr#15759

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants