
Perf: InvariantCultureIgnoreCase 2x to 6x times slower on Linux #29565

Closed
sebastienros opened this issue May 16, 2019 · 49 comments · Fixed by dotnet/coreclr#24889
Assignees
Labels
area-System.Globalization os-linux Linux OS (any supported distro) tenet-performance Performance related issue
Milestone

Comments

@sebastienros
Member

While investigating why the Data Updates TechEmpower benchmark was much slower on Linux than on Windows, we identified that the usage of InvariantCultureIgnoreCase in the Npgsql driver was responsible.

Seems worse on ASP.NET (6 times slower) than BenchmarkDotNet (2 times slower), where there are no concurrent calls.

Here is the code that was used in BDN and ASP.NET to repro the numbers:

[Benchmark]
public object TryAddInvariantCultureIgnoreCase()
{
    var data = new Dictionary<string, int>(StringComparer.InvariantCultureIgnoreCase);
    for (var i = 0; i < N; i++)
    {
        data.TryAdd("Id_", i);
    }

    return data;
}

On ASP.NET, with 100 inserts, we have 80K RPS on Windows and 15K RPS on Linux.

The trace doesn't go further than this

[screenshot: profiler trace]

Linux

BenchmarkDotNet=v0.11.5, OS=debian 9
Intel Xeon CPU E5-1650 v3 3.50GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview5-011568
  [Host] : .NET Core ? (CoreCLR 3.0.19.26071, CoreFX 4.700.19.26308), 64bit RyuJIT

Toolchain=InProcessEmitToolchain  IterationCount=3  LaunchCount=1
WarmupCount=3
| Method | N | Mean | Error | StdDev |
|--------------------------------- |----:|------------:|-----------:|----------:|
| TryAddOrdinalIgnoreCase | 10 | 275.4 ns | 440.5 ns | 24.15 ns |
| TryAddInvariantCultureIgnoreCase | 10 | 3,442.1 ns | 484.8 ns | 26.57 ns |
| TryAddOrdinalIgnoreCase | 50 | 1,238.5 ns | 301.0 ns | 16.50 ns |
| TryAddInvariantCultureIgnoreCase | 50 | 16,930.4 ns | 2,142.9 ns | 117.46 ns |
| TryAddOrdinalIgnoreCase | 100 | 2,401.3 ns | 426.6 ns | 23.38 ns |
| TryAddInvariantCultureIgnoreCase | 100 | 33,979.7 ns | 6,911.2 ns | 378.83 ns |

Windows

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.107 (1809/October2018Update/Redstone5)
Intel Xeon CPU E5-1650 v3 3.50GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview5-011568
  [Host] : .NET Core 3.0.0-preview6-27713-13 (CoreCLR 3.0.19.26071, CoreFX 4.700.19.26308), 64bit RyuJIT

Toolchain=InProcessEmitToolchain  IterationCount=3  LaunchCount=1
WarmupCount=3
| Method | N | Mean | Error | StdDev |
|--------------------------------- |----:|------------:|------------:|-----------:|
| TryAddOrdinalIgnoreCase | 10 | 239.3 ns | 2.354 ns | 0.1290 ns |
| TryAddInvariantCultureIgnoreCase | 10 | 1,694.5 ns | 23.130 ns | 1.2679 ns |
| TryAddOrdinalIgnoreCase | 50 | 1,145.2 ns | 7.520 ns | 0.4122 ns |
| TryAddInvariantCultureIgnoreCase | 50 | 8,332.2 ns | 166.891 ns | 9.1479 ns |
| TryAddOrdinalIgnoreCase | 100 | 2,241.1 ns | 379.392 ns | 20.7958 ns |
| TryAddInvariantCultureIgnoreCase | 100 | 16,191.5 ns | 12.769 ns | 0.6999 ns |
@tarekgh
Member

tarekgh commented May 16, 2019

Seems worst on ASP.NET (6 times slower) than benchmarkdotnet (2 times slower), where there are no concurrent calls.

It would be good for the Dictionary owner to investigate this first, to check whether locking on the collection side could explain why it is so much slower in the ASP.NET scenario.

@sebastienros it would be nice if you could attach the ASP.NET sample repro you used when seeing the 6x degradation.

@sebastienros
Member Author

Here is the controller to add to a new project from the default API template:

    [Route("api/[controller]")]
    [ApiController]
    public class ValuesController : ControllerBase
    {
        [HttpGet]
        [Route("InvariantCultureIgnoreCase/{count}")]
        public ActionResult InvariantCultureIgnoreCase(int count)
        {
            var data = new Dictionary<string, int>(StringComparer.InvariantCultureIgnoreCase);
            for (var i = 0; i < count; i++)
            {
                data.TryAdd("Id_", i);
            }

            return Ok();
        }

        [HttpGet]
        [Route("OrdinalIgnoreCase/{count}")]
        public ActionResult OrdinalIgnoreCase(int count)
        {
            var data = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
            for (var i = 0; i < count; i++)
            {
                data.TryAdd("Id_", i);
            }

            return Ok();
        }
    }

@danmoseley
Member

@sebastienros do we know whether this was a regression from a past release? Also, is it critical for meeting 3.0 perf goals?
If not, we should move it to a future milestone please.

@tarekgh
Member

tarekgh commented May 30, 2019

I don't think this is a regression.

@sebastienros
Member Author

This is not a regression, I already filed a related issue a year ago.

@mjsabby
Contributor

mjsabby commented May 30, 2019

We should fix this right? 2x-6x for such a critical operation seems bad.

@tarekgh
Member

tarekgh commented May 30, 2019

Yes we should look at it but I don't think this is urgent for 3.0 as it is not a regression. I am going to move it to the future milestone.

@terrajobst
Contributor

Tagging @adamsitnik as Dan assigned him to take a look

@danmoseley
Member

@mjsabby is right, it's significant enough we should investigate.

@adamsitnik
Member

From a quick look at the profiles and the code, I can tell that we are using different APIs to get the sort key when computing the hash and when comparing strings (Windows offers us something nice here). I will try to find a way to make the Linux implementation faster.
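To make the asymmetry concrete: an ordinal-ignore-case comparer can hash a string directly after folding case, while the invariant-culture comparer must first obtain a collation sort key from the native library (ICU on Linux) and hash that, which is why the choice of native API dominates the cost. A toy C sketch of the cheap ordinal path only (ASCII-only folding and FNV-1a, purely illustrative; this is not the runtime's code):

```c
#include <stdint.h>

/* Ordinal-ignore-case hashing needs no native call: fold case, hash the
 * code units. The culture-aware path instead has to ask the collation
 * library for a sort key first, paying a P/Invoke per string. */
static uint32_t ordinal_ignore_case_hash(const char *s)
{
    uint32_t h = 2166136261u;           /* FNV-1a offset basis */
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (c >= 'A' && c <= 'Z')
            c += 'a' - 'A';             /* ASCII case fold only */
        h = (h ^ c) * 16777619u;        /* FNV-1a prime */
    }
    return h;
}
```

Folded variants hash identically, e.g. `ordinal_ignore_case_hash("Id_") == ordinal_ignore_case_hash("id_")`, without ever leaving managed-equivalent code.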

@tarekgh
Member

tarekgh commented May 31, 2019

@adamsitnik from @sebastienros's investigation, I believe he mentioned not seeing much difference when executing in a single-threaded scenario, while the regression shows up in the multi-threaded ASP.NET scenario. In other words, @sebastienros's data show that when measuring the difference between Linux and Windows only in a console app calling InvariantCultureIgnoreCase, the regression is not nearly as large.

@sebastienros feel free to correct me if I stated anything wrong here.

I can tell that we are using different APIs to get the sortKey when computing the hash when comparing strings

This is expected as we depend on different components on different OS's.

@adamsitnik
Member

This is expected as we depend on different components on different OS's.

I just want to make sure we use the best available method in the right way according to spec.

@tarekgh
Member

tarekgh commented May 31, 2019

I just want to make sure we use the best available method in the right way according to spec.

Yes, that is fine. I would suggest measuring this scenario in isolation first before jumping in to optimize it. As I mentioned, there could be another factor in the ASP.NET scenario (something like locking involving the collections, for instance) causing this difference.

@adamsitnik
Member

I would suggest to measure this scenario in isolation first before jumping in trying to optimize it.

When I wrote "From a quick look at the profiles" I meant that I have already profiled that.

Anyway, thanks for the hint. The more people repeat "measure, measure, measure" the better!

@sebastienros
Member Author

FYI this issue has probably the same root cause as this one: https://github.com/dotnet/corefx/issues/29462

@tarekgh
Member

tarekgh commented May 31, 2019

FYI this issue has probably the same root cause as this one: #26054

I doubt that, because the string comparisons don't use the sort keys. But it may be a similar issue.

@adamsitnik
Member

I have used the recommendations provided in the ICU User Guide and made it two times faster dotnet/coreclr#24889
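The ICU User Guide recommendation in question is the sort key buffer pattern: pass a preallocated buffer in the first call and retry only when it turns out to be too small, rather than always making one call to query the length and a second to fill. A C sketch of the pattern, with `fake_sort_key` as a stub standing in for `ucol_getSortKey` (illustrative, not the actual coreclr patch):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stub in place of ICU's ucol_getSortKey(): returns the required key
 * length and fills dst only when the capacity suffices. Here the "key"
 * is just the input bytes plus a terminator; a real collator would emit
 * collation weights. */
static int32_t fake_sort_key(const char *src, uint8_t *dst, int32_t cap)
{
    int32_t needed = (int32_t)strlen(src) + 1;
    if (dst != NULL && needed <= cap)
        memcpy(dst, src, (size_t)needed);
    return needed;
}

/* Buffer pattern: one native call with a caller-provided buffer, and a
 * second call only when that buffer was too small -- halving the native
 * calls in the common (short string) case. native_calls counts calls. */
static uint8_t *get_sort_key(const char *src, uint8_t *stackbuf, int32_t cap,
                             int32_t *len, int *native_calls)
{
    int32_t needed = fake_sort_key(src, stackbuf, cap);
    ++*native_calls;
    if (needed <= cap) {                     /* common case: one call */
        *len = needed;
        return stackbuf;
    }
    uint8_t *heap = malloc((size_t)needed);  /* rare: retry, exact size */
    fake_sort_key(src, heap, needed);
    ++*native_calls;
    *len = needed;
    return heap;
}
```

For a short key like `"Id_"` against a 64-byte buffer this makes exactly one native call, which matches the "two times faster" observation for strings that fit the rented buffer.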

@billwert
Member

billwert commented Jun 2, 2019

Great that we were able to optimize this! @sebastienros what was the overall impact of this on the techempower scenario?

@brianrob

@tarekgh
Member

tarekgh commented Jun 2, 2019

Yes, @billwert's point is important, as I expect there are other places contributing to the perf differences.

@sebastienros
Member Author

sebastienros commented Jun 3, 2019

@billwert For the DataUpdates scenario this was terrible when used with Npgsql and Dapper, which I assume is a very common combination. We'll see how much of an issue this was for the other scenarios when the fix is in, and also how much this change fixes the DataUpdates scenario. As an example, ADO.NET is 10 times faster than Dapper in this scenario; I would expect a difference of 10% instead.

@brianrob
Member

brianrob commented Jun 3, 2019

@sebastienros do you have a trace that shows the overall cost of the slowdown on the scenario? I'd like for us to prioritize this bug based on the potential benefit we'd get from a fix.

@sebastienros
Member Author

@brianrob I pasted a screenshot of the best details I could get from the trace. I can show you the command line I used, maybe you would know how to get a better one, or get better data out of it.

@tarekgh
Member

tarekgh commented Jun 3, 2019

Just to note, @jkotas made an important comment (dotnet/coreclr#24889 (comment)) which is most likely contributing to the delay in the multi-threading scenario.

@danmoseley
Member

danmoseley commented Jun 4, 2019

@adamsitnik maybe you still need this bug to track further investigation

@adamsitnik
Member

adamsitnik commented Jun 4, 2019

maybe you still need this bug to track further investigation

Sure. I did not know that GH would close the issue in the other repo out of the box when I used the "fixes" keyword.

I am still looking at the traces with the lock. It looks like the problem is a different lock than the one Jan pointed out (I removed that one and changed the implementation to [ThreadStatic], and it did not help).

@jkotas jkotas reopened this Jun 4, 2019
@wangzhaoguan

(Translated from Chinese:) We ran into the same problem: the performance of .NET Core on Linux is several times worse than on Windows, with very high kernel-mode CPU. I downloaded the 3.0 Preview today to verify, and the problem still exists. Our real-world scenario is bulk-inserting data into Oracle: ODP internally makes frequent calls to String.Compare(s1, s2, true), which triggers calls to System.Globalization.Native.so!pthread_mutex_lock. The lldb stack dump is shown below.


Thread 233
TEB information is not available so a stack size of 0xFFFF is assumed
Current frame: libpthread.so.0!__lll_lock_wait + 0x1d
Child-SP RetAddr Caller, Callee
00007FDEC67F7600 00007ff89d019709 (MethodDesc 00007ff89ca66428 + 0xa9 DomainNeutralILStubClass.IL_STUB_PInvoke(SafeSortHandle, Char*, Int32, Char*, Int32, System.Globalization.CompareOptions))
00007FDEC67F7608 00007ff917d6cdcb libpthread.so.0!_L_lock_883 + 0xf, calling libpthread.so.0!__lll_lock_wait
00007FDEC67F7640 00007fe359f3ceb9 libicui18n.so.50!ucol_strcoll_50 + 0x479, calling libicui18n.so.50!___lldb_unnamed_symbol90$$libicui18n.so.50
00007FDEC67F7670 00007ff916538ec1 libcoreclr.so!MetaSig::SkipArg() + 0x31 [/root/coreclr/src/vm/siginfo.cpp:866], calling libcoreclr.so!SigParser::SkipExactlyOne() [/root/coreclr/src/utilcode/sigparser.cpp:16]
00007FDEC67F7690 00007ff89c30149c System.Globalization.Native.so!GetCollatorFromSortHandle(_sort_handle*, int, UErrorCode*) + 0x3c [/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../include/c++/4.8.5/bits/stl_tree.h:502], calling System.Globalization.Native.so!pthread_mutex_lock
00007FDEC67F76D0 00007ff89c3015db System.Globalization.Native.so!GlobalizationNative_CompareString + 0x3b [/root/coreclr/src/corefx/System.Globalization.Native/collation.cpp:437], calling System.Globalization.Native.so!GetCollatorFromSortHandle(_sort_handle*, int, UErrorCode*)
00007FDEC67F7710 00007ff89d019709 (MethodDesc 00007ff89ca66428 + 0xa9 DomainNeutralILStubClass.IL_STUB_PInvoke(SafeSortHandle, Char*, Int32, Char*, Int32, System.Globalization.CompareOptions))
00007FDEC67F7750 00007ff89d019709 (MethodDesc 00007ff89ca66428 + 0xa9 DomainNeutralILStubClass.IL_STUB_PInvoke(SafeSortHandle, Char*, Int32, Char*, Int32, System.Globalization.CompareOptions))
00007FDEC67F77B0 00007ff89cee3bb3 (MethodDesc 00007ff89ca79198 + 0x53 System.Globalization.CompareInfo.CompareString(System.ReadOnlySpan1<Char>, System.ReadOnlySpan1, System.Globalization.CompareOptions)), calling 00007ff89cba6dc0 (stub for Interop+Globalization.CompareString(SafeSortHandle, Char*, Int32, Char*, Int32, System.Globalization.CompareOptions))
00007FDEC67F77F0 00007ff89cee15e2 (MethodDesc 00007ff89ca78e50 + 0x152 System.Globalization.CompareInfo.Compare(System.String, System.String, System.Globalization.CompareOptions)), calling (MethodDesc 00007ff89ca79198 + 0 System.Globalization.CompareInfo.CompareString(System.ReadOnlySpan1<Char>, System.ReadOnlySpan1, System.Globalization.CompareOptions))
00007FDEC67F7850 00007ff89ce9bff6 (MethodDesc 00007ff89ca6f6b0 + 0xf6 System.String.Compare(System.String, System.String, System.StringComparison))
00007FDEC67F78A0 00007ff8aab77938 (MethodDesc 00007ff8a4c97e20 + 0x1d8 Oracle.ManagedDataAccess.Client.OracleParameterCollection.FindParamByName(System.String)), calling (MethodDesc 00007ff89ca6f6b0 + 0 System.String.Compare(System.String, System.String, System.StringComparison))
00007FDEC67F78B0 00007ff8aab97d7d (MethodDesc 00007ff89f0c1258 + 0x9d System.Collections.ArrayList.Contains(System.Object))
00007FDEC67F78F0 00007ff8ac448748 (MethodDesc 00007ff8a4ccba40 + 0x168 OracleInternal.ServiceObjects.OracleCommandImpl.GetBindByPositionBasedParameterCollection(Oracle.ManagedDataAccess.Client.OracleParameterCollection, System.Collections.ArrayList, Boolean)), calling (MethodDesc 00007ff8a4c97e20 + 0 Oracle.ManagedDataAccess.Client.OracleParameterCollection.FindParamByName(System.String))
00007FDEC67F7940 00007ff8ac452f59 (MethodDesc 00007ff8a4ccb8a8 + 0x4e9 OracleInternal.ServiceObjects.OracleCommandImpl.ExecuteNonQuery(System.String, Oracle.ManagedDataAccess.Client.OracleParameterCollection, System.Data.CommandType, OracleInternal.ServiceObjects.OracleConnectionImpl, Int32, Int64, OracleInternal.ServiceObjects.OracleDependencyImpl, Int64[] ByRef, Oracle.ManagedDataAccess.Client.OracleParameterCollection ByRef, Boolean ByRef, Oracle.ManagedDataAccess.Client.OracleException ByRef, Oracle.ManagedDataAccess.Client.OracleConnection, Oracle.ManagedDataAccess.Client.OracleLogicalTransaction ByRef, Boolean)), calling (MethodDesc 00007ff8a4ccba40 + 0 OracleInternal.ServiceObjects.OracleCommandImpl.GetBindByPositionBasedParameterCollection(Oracle.ManagedDataAccess.Client.OracleParameterCollection, System.Collections.ArrayList, Boolean))
00007FDEC67F79F0 00007ff89cec2f24 (MethodDesc 00007ff89ca74700 + 0x94 System.Number.TryFormatInt32(Int32, System.ReadOnlySpan1<Char>, System.IFormatProvider, System.Span1, Int32 ByRef)), calling (MethodDesc 00007ff89ca74800 + 0 System.Number.TryUInt32ToDecStr(UInt32, Int32, System.Span1<Char>, Int32 ByRef))
00007FDEC67F7A00 00007ff916746fd0 libcoreclr.so!SVR::gc_heap::adjust_limit_clr(unsigned char*, unsigned long, alloc_context*, SVR::heap_segment*, int, int) + 0x3b0 [/root/coreclr/src/gc/gc.cpp:11474], calling libcoreclr.so!memset
00007FDEC67F7A70 00007ff916749127 libcoreclr.so!SVR::gc_heap::a_fit_segment_end_p(int, SVR::heap_segment*, unsigned long, alloc_context*, int, int*) + 0x427 [/root/coreclr/src/gc/gc.cpp:12171], calling libcoreclr.so!SVR::gc_heap::adjust_limit_clr(unsigned char*, unsigned long, alloc_context*, SVR::heap_segment*, int, int) [/root/coreclr/src/gc/gc.cpp:11364]
00007FDEC67F7B00 00007ff916749b16 libcoreclr.so!SVR::gc_heap::allocate_small(int, unsigned long, alloc_context*, int) + 0x4a6 [/root/coreclr/src/gc/gc.cpp:12371], calling libcoreclr.so!SVR::gc_heap::a_fit_segment_end_p(int, SVR::heap_segment*, unsigned long, alloc_context*, int, int*) [/root/coreclr/src/gc/gc.cpp:12085]
00007FDEC67F7B90 00007ff91674ad83 libcoreclr.so!SVR::gc_heap::try_allocate_more_space(alloc_context*, unsigned long, int) + 0x1f3 [/root/coreclr/src/gc/gc.cpp:13208], calling libcoreclr.so!SVR::gc_heap::allocate_small(int, unsigned long, alloc_context*, int) [/root/coreclr/src/gc/gc.cpp:12315]
00007FDEC67F7E60 00007ff8ac45509d (MethodDesc 00007ff8a4c90580 + 0x1fd Oracle.ManagedDataAccess.Client.OracleCommand.ExecuteNonQuery()), calling (MethodDesc 00007ff8a4ccb8a8 + 0 OracleInternal.ServiceObjects.OracleCommandImpl.ExecuteNonQuery(System.String, Oracle.ManagedDataAccess.Client.OracleParameterCollection, System.Data.CommandType, OracleInternal.ServiceObjects.OracleConnectionImpl, Int32, Int64, OracleInternal.ServiceObjects.OracleDependencyImpl, Int64[] ByRef, Oracle.ManagedDataAccess.Client.OracleParameterCollection ByRef, Boolean ByRef, Oracle.ManagedDataAccess.Client.OracleException ByRef, Oracle.ManagedDataAccess.Client.OracleConnection, Oracle.ManagedDataAccess.Client.OracleLogicalTransaction ByRef, Boolean))
00007FDEC67F7F40 00007ff8abf70887 (MethodDesc 00007ff8a435d610 + 0x247 Ralms.EntityFrameworkCore.Oracle.Storage.Internal.OracleRelationalCommandBuilderFactory+OracleRelationalCommandBuilder+OracleRelationalCommand.Execute(Microsoft.EntityFrameworkCore.Storage.IRelationalConnection, Microsoft.EntityFrameworkCore.Diagnostics.DbCommandMethod, System.Collections.Generic.IReadOnlyDictionary2<System.String,System.Object>))
00007FDEC67F7F70 00007ff89cef2743 (MethodDesc 00007ff89ca7c0b0 + 0xf3 System.Collections.Generic.Dictionary2[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]].FindEntry(System.__Canon))
00007FDEC67F8000 00007ff9165cc3f1 libcoreclr.so!ObjIsInstanceOfNoGC + 0xe1 [/root/coreclr/src/vm/methodtable.inl:1692], calling libcoreclr.so!MethodTable::CanCastToInterfaceNoGC(MethodTable*) [/root/coreclr/src/vm/methodtable.cpp:1739]
00007FDEC67F80A0 00007ff8ac4556a0 (MethodDesc 00007ff8a435d428 + 0x40 Microsoft.EntityFrameworkCore.Storage.Internal.RelationalCommand.ExecuteNonQuery(Microsoft.EntityFrameworkCore.Storage.IRelationalConnection, System.Collections.Generic.IReadOnlyDictionary2<System.String,System.Object>))
00007FDEC67F80C0 00007ff8ac51413f (MethodDesc 00007ff8a47f5ea8 + 0x13f Microsoft.EntityFrameworkCore.RelationalDatabaseFacadeExtensions.ExecuteSqlCommand(Microsoft.EntityFrameworkCore.Infrastructure.DatabaseFacade, Microsoft.EntityFrameworkCore.RawSqlString, System.Collections.Generic.IEnumerable`1<System.Object>))

@tarekgh
Member

tarekgh commented Jun 6, 2019

@wangzhaoguan thanks for sending the stack. I posted a couple of questions on the PR regarding the version you are targeting. Can your app target 3.0?

@adamsitnik
Member

The lock has been removed in dotnet/coreclr#24973 and the issue is now solved for good.

Tomorrow I am going to file the servicing request to merge my backport to 2.1: dotnet/coreclr#25042
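The contention visible in the stack dump above came from a process-wide mutex guarding the collator lookup (GetCollatorFromSortHandle taking pthread_mutex_lock on every comparison). A common remedy, sketched below in C with stub types (this is an illustration of the technique, not the actual dotnet/coreclr#24973 patch), is a per-thread cache: each thread clones the collator once, so the hot path never takes a lock. Here `collator` stands in for ICU's UCollator and `clone_collator` for something like `ucol_safeClone`:

```c
#include <stddef.h>

typedef struct { int id; } collator;        /* stub for ICU's UCollator */

static _Thread_local collator t_instance;   /* per-thread clone storage */
static _Thread_local collator *t_cached;    /* NULL until first use */
static _Thread_local int t_clones;          /* clones made by this thread */

static collator *clone_collator(void)
{
    t_instance.id = 42;                     /* pretend: expensive clone */
    return &t_instance;
}

static collator *get_collator(void)
{
    if (t_cached == NULL) {                 /* slow path: first use only */
        t_cached = clone_collator();
        t_clones++;
    }
    return t_cached;                        /* lock-free hot path */
}
```

Repeated calls on the same thread return the same cached clone without touching any mutex, which is exactly the property the lock-contention traces were missing.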

@tarekgh tarekgh closed this as completed Jun 9, 2019
@tarekgh tarekgh reopened this Jun 9, 2019
@tarekgh
Member

tarekgh commented Jun 9, 2019

sorry closed this by mistake. @adamsitnik can close it though.

@adamsitnik
Member

sorry closed this by mistake

np, I wanted to close it once I ported the fix to 2.1, but since we are not going to do that, I can close it right now.

Btw, I've also added benchmarks to track the performance of string.GetHashCode and string.Compare for different comparisons and sizes in dotnet/performance#535

@sebastienros
Member Author

I don't think the issue has been fixed, running on ASP.NET I still get 271,737 RPS for Windows and 39,740 RPS for Linux. I checked as the benchmark that triggered the issue (data updates on Npgsql) didn't get any significant improvement.

@sebastienros
Member Author

FYI, here is the command to run the benchmarks (if you know how to use our service):

dotnet run -- @C:\temp\.environments\aspnet-lin `
--repository https://github.com/sebastienros/invariantcultureperf --project-file InvariantCulture.csproj --path /api/values/InvariantCultureIgnoreCase/100 --warmup 1 --duration 5 --runtime 3.0.0-*

@sebastienros
Member Author

And some event counters:

| | Windows | Linux |
|--------------------------|--------:|--------:|
| ThreadPool Threads Count | 48 | 76 |
| Lock Contention (#/s) | 77 | 1,281 |
| ThreadPool Queue Length | 222 | 106 |
| ThreadPool Items (#/s) | 89,701 | 581,628 |

@tarekgh
Member

tarekgh commented Jun 11, 2019

@sebastienros just double checking, are you sure you were running with @adamsitnik fix?

@sebastienros
Member Author

Trace here https://microsoft-my.sharepoint.com/:u:/p/sebros/EbPBKcbM8ZtOg8oY1wF5GT4B4uOXzCM3-BJ5AwvdLFBYfw?e=n8rtb1

Yes, I verified the build I am using has it (currently the latest 3.0.0-preview7-27811-09)

@sebastienros
Member Author

I can see two jumps in perf due to the two commits you both made, but nothing close to making the two platforms comparable:

[screenshot: RPS chart]

@adamsitnik
Member

I still get 271,737 RPS for Windows and 39,740 RPS for Linux

This is weird; I get something around 90k RPS for Windows and 60k for Linux.

dotnet run -- --server http://asp-perf-win:5001 --client http://asp-perf-load:5002 --repository https://github.com/sebastienros/invariantcultureperf --project-file InvariantCulture.csproj --path /api/values/InvariantCultureIgnoreCase/100 --warmup 1 --duration 5 --runtime 3.0.0-*
RequestsPerSecond:           92 141
Max CPU (%):                 90
WorkingSet (MB):             162
Avg. Latency (ms):           3,82
Startup (ms):                346
First Request (ms):          105,74
Latency (ms):                0,65
Total Requests:              469 856
Duration: (ms)               5 100
Socket Errors:               0
Bad Responses:               0
SDK:                         3.0.100-preview7-012367
Runtime:                     3.0.0-preview7-27811-14
ASP.NET Core:                3.0.0-preview7.19311.5
dotnet run -- --server http://asp-perf-lin:5001 --client http://asp-perf-load:5002 --repository https://github.com/sebastienros/invariantcultureperf --project-file InvariantCulture.csproj --path /api/values/InvariantCultureIgnoreCase/100 --warmup 1 --duration 5 --runtime 3.0.0-*
RequestsPerSecond:           60 645
Max CPU (%):                 98
WorkingSet (MB):             176
Avg. Latency (ms):           4,22
Startup (ms):                267
First Request (ms):          113,33
Latency (ms):                0,66
Total Requests:              308 079
Duration: (ms)               5 080
Socket Errors:               0
Bad Responses:               0
SDK:                         3.0.100-preview7-012367
Runtime:                     3.0.0-preview7-27811-14
ASP.NET Core:                3.0.0-preview7.19311.5

@adamsitnik
Member

I ran the oldest preview7 that was available on the dotnet-core feed (BTW, why are there no preview6 packages there?) to make sure it did not contain dotnet/coreclr#24973:

--runtimeVersion 3.0.0-preview7-27807-12 --collect-trace
[09:34:32.664] Downloading trace: trace.06-12-09-34-32.RPS-44K.trace.zip
RequestsPerSecond:           43 902
Max CPU (%):                 96
WorkingSet (MB):             178
Avg. Latency (ms):           5,73
Startup (ms):                311
First Request (ms):          133,19
Latency (ms):                0,85
Total Requests:              223 424
Duration: (ms)               5 090
Socket Errors:               0
Bad Responses:               0
SDK:                         3.0.100-preview7-012367
Runtime:                     3.0.0-preview7-27807-12
ASP.NET Core:                3.0.0-preview7.19311.5

The locks were all over the place:

[screenshot: profile showing lock contention]

With the fix:

--runtime 3.0.0-* --collect-trace
[08:59:29.204] Downloading trace: trace.06-12-08-59-29.RPS-60K.trace.zip
RequestsPerSecond:           59 524
Max CPU (%):                 99
WorkingSet (MB):             176
Avg. Latency (ms):           4,1
Startup (ms):                304
First Request (ms):          132,98
Latency (ms):                0,77
Total Requests:              301 716
Duration: (ms)               5 070
Socket Errors:               0
Bad Responses:               0
SDK:                         3.0.100-preview7-012367
Runtime:                     3.0.0-preview7-27811-13
ASP.NET Core:                3.0.0-preview7.19311.5

The locks are gone

[screenshot: profile with the locks gone]

@adamsitnik
Member

@sebastienros could you please take a look at the commands I ran? The results we get for Windows are very different: 271k vs 90k RPS.

In the meantime, I am going to take a look at the two hottest methods from the stack trace:

System.Private.CoreLib!System.Runtime.InteropServices.SafeHandle::InternalRelease(bool)[OptimizedTier1] 
System.Private.CoreLib!System.Runtime.InteropServices.SafeHandle::DangerousAddRef(bool&)[OptimizedTier1]

@adamsitnik adamsitnik reopened this Jun 12, 2019
@adamsitnik
Member

Ok, it looks like CriticalHandle is what we need.
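For context on why those two methods are hot: SafeHandle keeps an internal use count, and the interop marshaler calls DangerousAddRef before and DangerousRelease after every P/Invoke, so every string comparison pays for two interlocked operations on a counter shared across all threads, ping-ponging its cache line between cores. A minimal C sketch of the pattern using C11 atomics (stub types; this mirrors the mechanism, not the runtime's actual code):

```c
#include <stdatomic.h>

/* Mimics SafeHandle's per-call lifetime tracking: one interlocked
 * increment on entry and one interlocked decrement on exit. Replacing
 * the handle with a cached raw pointer (possible here because sort
 * handles are never freed) removes both operations from the hot path. */
typedef struct {
    void *handle;
    atomic_int refcount;            /* shared across all threads */
} safe_handle;

static void *safe_handle_enter(safe_handle *h)   /* ~DangerousAddRef */
{
    atomic_fetch_add(&h->refcount, 1);           /* interlocked op #1 */
    return h->handle;
}

static void safe_handle_exit(safe_handle *h)     /* ~DangerousRelease */
{
    atomic_fetch_sub(&h->refcount, 1);           /* interlocked op #2 */
}
```

On a 28-logical-processor machine hammering the same comparer, those two atomic writes per call are enough to show up at the top of a profile, which matches the SafeHandle frames above.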

@adamsitnik
Member

Ok, I was able to get 110k RPS using my local build of System.Private.CoreLib.dll passed via --outputFile.

RequestsPerSecond:           110 245
Max CPU (%):                 94
WorkingSet (MB):             177
Avg. Latency (ms):           2,65
Startup (ms):                270
First Request (ms):          110,62
Latency (ms):                0,91
Total Requests:              562 104
Duration: (ms)               5 100
Socket Errors:               0
Bad Responses:               0
SDK:                         3.0.100-preview7-012367
Runtime:                     3.0.0-preview7-27812-01
ASP.NET Core:                3.0.0-preview7.19311.5

[screenshot: profile after the change]

I have created a PR in CoreCLR: dotnet/coreclr#25117 I hope it's not a very bad idea.

@benaadams
Member

And some event counters:

  Windows Linux
ThreadPool Items (#/s) 89,701 581,628

That's an interesting difference...

@stephentoub
Member

That's an interesting difference

Presumably coming from asynchronously completing socket operations that need to queue from the epoll thread back to the thread pool to run user code.

@sebastienros
Member Author

Ok, I was able to get 110k RPS using my local build of System.Private.CoreLib.dll passed via --outputFile.

This is exactly how a fix should be validated: testing your local changes on the scenario before sending the PR. That's awesome!

could you please take a look at the commands I run

Yeah, they are perfect. I was just using the TechEmpower machines, which are beefier and probably more prone to locking since they handle more load. The fact that you got faster results on Linux than Windows means it's really fixed. You can also try --collect-counters to verify that the locks/s are much lower.

@sebastienros
Member Author

FYI for everyone: I added support for Event Counters to the benchmarking service (--collect-counters), and they are now tracked with every ASP.NET scenario. We'll also get continuous charts for these numbers.

Sneak peek:

CPU Usage (%):               99
Working Set (MB):            526
GC Heap Size (MB):           281
Gen 1 GC (#/s):              1
Gen 2 GC (#/s):              0
Time in GC (%):              0
Gen 0 Size (B):              135,464
Gen 1 Size (B):              3,033,544
Gen 2 Size (B):              2,872,120
LOH Size (B):                1,278,288
Allocation Rate (B/sec):     227,329,432
# of Assemblies Loaded:      110
Exceptions (#/s):            985
ThreadPool Threads Count:    48
Lock Contention (#/s):       77
ThreadPool Queue Length:     222
ThreadPool Items (#/s):      89,701

@adamsitnik
Member

Ok, I was able to use the same machine as @sebastienros and confirm that the problem is gone.

Changes

dotnet/coreclr#24889 - I have reduced the number of calls to an expensive native API by half, which made StringComparison.InvariantCulture.GetHashCode() two times faster for strings shorter than 262 144 characters (the biggest buffer rentable from ArrayPool.Shared without allocation / 4)

dotnet/coreclr#24973 - I have removed the lock that was part of the native implementation of CompareString, IndexOf, LastIndexOf, StartsWith, EndsWith, GetHashCode

dotnet/coreclr#25117 - I have removed the expensive ref counting of SafeSortHandle usages and replaced SafeSortHandle with a cached IntPtr (the number of cultures and sort options is always low so it's ok), the affected methods are CompareString, IndexOf, LastIndexOf, StartsWith, EndsWith, GetHashCode

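The third change, caching raw pointers instead of SafeHandles, works because the set of (culture, options) pairs is small and sort handles are never freed. A hypothetical C sketch of such a cache (names and table size are illustrative, not the actual coreclr code; a real version would also need to make the miss path thread-safe, e.g. by inserting under a lock):

```c
#include <string.h>

/* Tiny append-only cache mapping (culture, options) to a raw handle.
 * Lookups are lock-free reads with a linear scan -- fine because the
 * table stays small -- and nothing is ever evicted or freed, so no ref
 * counting is needed on the hot path. */
#define MAX_SORT_HANDLES 64

typedef struct {
    char name[16];       /* culture name, e.g. "" or "en-US" */
    int options;         /* comparison options flags */
    void *handle;        /* the cached native sort handle */
} entry;

static entry g_cache[MAX_SORT_HANDLES];
static int g_count;

static void *get_sort_handle(const char *culture, int options)
{
    for (int i = 0; i < g_count; i++)       /* linear scan: table is tiny */
        if (g_cache[i].options == options &&
            strcmp(g_cache[i].name, culture) == 0)
            return g_cache[i].handle;

    entry *e = &g_cache[g_count++];         /* miss: create and cache */
    strncpy(e->name, culture, sizeof e->name - 1);
    e->options = options;
    e->handle = e;                          /* stub: any stable pointer */
    return e->handle;
}
```

Repeated lookups for the same (culture, options) pair return the identical pointer with no interlocked operations, which is what replaced the SafeHandle ref counting.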

The most recent change had the biggest impact on the beefy Citrine Physical Machines (14 Core(s), 28 Logical Processor(s))

before

RequestsPerSecond:           40 811
Max CPU (%):                 100
WorkingSet (MB):             496
Avg. Latency (ms):           6,63
Startup (ms):                298
First Request (ms):          102,54
Latency (ms):                0,86
Total Requests:              208 136
Duration: (ms)               5 100
Socket Errors:               0
Bad Responses:               0
SDK:                         3.0.100-preview7-012412
Runtime:                     3.0.0-preview7-27815-04
ASP.NET Core:                3.0.0-preview7.19315.2

after

RequestsPerSecond:           240 711
Max CPU (%):                 94
WorkingSet (MB):             523
Avg. Latency (ms):           1,77
Startup (ms):                338
First Request (ms):          121,31
Latency (ms):                1,04
Total Requests:              1 227 445
Duration: (ms)               5 100
Socket Errors:               0
Bad Responses:               0

Linux vs Windows after all changes

Linux vs Windows after all fixes for the ASP.NET perf lab machines

Local Physical servers (E5-1650 v3 @ 3.50 GHz w/ 15MB Cache 6 Cores / 12 Threads 32GB of RAM.)

110k RPS for Linux and 90k for Windows

Citrine Physical Machines (Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz, 2195 Mhz, 14 Core(s), 28 Logical Processor(s) 32GB of RAM.)

240k RPS for Linux and 265k RPS for Windows

Azure VMs (D3_v2 4 virtual cores. 4GB of RAM)

28k RPS for both Linux and Windows

Summary

For the Azure VMs both OSes are even, for the Local Physical servers Linux is now 20% faster, and for the Citrine Physical Machines Windows is still 10% faster.

I have currently run out of ideas for closing the 10% gap on the Citrine Physical Machines; the trace does not leave much room for optimization.

[screenshot: final profile]

@sebastienros please close the issue after you verify that the problem is gone

big thanks to @jkotas and @tarekgh for all the help!

@tarekgh
Member

tarekgh commented Jun 19, 2019

Ok, I was able to use the same machine as @sebastienros and confirm that the problem is gone.

@adamsitnik can we close this issue now?

@adamsitnik
Member

Yes, we can.

@sebastienros I verified that the fix works with the latest coreclr packages that were pushed to core-setup in the last 24h.

@sebastienros
Member Author

Before and after the changes from @adamsitnik

[screenshot: before/after RPS comparison]

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 3.0 milestone Feb 1, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 13, 2020