
Perf: InvariantCultureIgnoreCase 2x to 6x times slower on Linux #29565

Closed
sebastienros opened this issue May 16, 2019 · 49 comments · Fixed by dotnet/coreclr#24889
Assignees
Labels
area-System.Globalization os-linux Linux OS (any supported distro) tenet-performance Performance related issue
Milestone

Comments

@sebastienros
Member

While investigating why the Data Updates TechEmpower benchmark was much slower on Linux than on Windows, we identified that the usage of InvariantCultureIgnoreCase in the Npgsql driver was responsible.

Seems worse on ASP.NET (6 times slower) than BenchmarkDotNet (2 times slower), where there are no concurrent calls.

Here is the code that was used in BDN and ASP.NET to repro the numbers:

[Benchmark]
public object TryAddInvariantCultureIgnoreCase()
{
    var data = new Dictionary<string, int>(StringComparer.InvariantCultureIgnoreCase);
    for (var i = 0; i < N; i++)
    {
        data.TryAdd("Id_", i);
    }

    return data;
}

On ASP.NET, with 100 inserts, we have 80K RPS on Windows and 15K RPS on Linux.

The trace doesn't go further than this

[screenshot: profiler trace]

Linux

BenchmarkDotNet=v0.11.5, OS=debian 9
Intel Xeon CPU E5-1650 v3 3.50GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview5-011568
  [Host] : .NET Core ? (CoreCLR 3.0.19.26071, CoreFX 4.700.19.26308), 64bit RyuJIT

Toolchain=InProcessEmitToolchain  IterationCount=3  LaunchCount=1
WarmupCount=3
| Method | N | Mean | Error | StdDev |
|--------------------------------- |----:|------------:|-----------:|----------:|
| TryAddOrdinalIgnoreCase | 10 | 275.4 ns | 440.5 ns | 24.15 ns |
| TryAddInvariantCultureIgnoreCase | 10 | 3,442.1 ns | 484.8 ns | 26.57 ns |
| TryAddOrdinalIgnoreCase | 50 | 1,238.5 ns | 301.0 ns | 16.50 ns |
| TryAddInvariantCultureIgnoreCase | 50 | 16,930.4 ns | 2,142.9 ns | 117.46 ns |
| TryAddOrdinalIgnoreCase | 100 | 2,401.3 ns | 426.6 ns | 23.38 ns |
| TryAddInvariantCultureIgnoreCase | 100 | 33,979.7 ns | 6,911.2 ns | 378.83 ns |

Windows

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.107 (1809/October2018Update/Redstone5)
Intel Xeon CPU E5-1650 v3 3.50GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview5-011568
  [Host] : .NET Core 3.0.0-preview6-27713-13 (CoreCLR 3.0.19.26071, CoreFX 4.700.19.26308), 64bit RyuJIT

Toolchain=InProcessEmitToolchain  IterationCount=3  LaunchCount=1
WarmupCount=3
| Method | N | Mean | Error | StdDev |
|--------------------------------- |----:|------------:|------------:|-----------:|
| TryAddOrdinalIgnoreCase | 10 | 239.3 ns | 2.354 ns | 0.1290 ns |
| TryAddInvariantCultureIgnoreCase | 10 | 1,694.5 ns | 23.130 ns | 1.2679 ns |
| TryAddOrdinalIgnoreCase | 50 | 1,145.2 ns | 7.520 ns | 0.4122 ns |
| TryAddInvariantCultureIgnoreCase | 50 | 8,332.2 ns | 166.891 ns | 9.1479 ns |
| TryAddOrdinalIgnoreCase | 100 | 2,241.1 ns | 379.392 ns | 20.7958 ns |
| TryAddInvariantCultureIgnoreCase | 100 | 16,191.5 ns | 12.769 ns | 0.6999 ns |
@tarekgh
Member

tarekgh commented May 16, 2019

Seems worst on ASP.NET (6 times slower) than benchmarkdotnet (2 times slower), where there are no concurrent calls.

It would be good for the Dictionary owner to investigate this first, to check whether locking on the collection side could explain why it is so much slower in the ASP.NET scenario.

@sebastienros it would be nice if you could attach the ASP.NET sample repro you used when seeing the 6x degradation.

@sebastienros
Member Author

Here is the controller to add to a new project from the default API template:

    [Route("api/[controller]")]
    [ApiController]
    public class ValuesController : ControllerBase
    {
        [HttpGet]
        [Route("InvariantCultureIgnoreCase/{count}")]
        public ActionResult InvariantCultureIgnoreCase(int count)
        {
            var data = new Dictionary<string, int>(StringComparer.InvariantCultureIgnoreCase);
            for (var i = 0; i < count; i++)
            {
                data.TryAdd("Id_", i);
            }

            return Ok();
        }

        [HttpGet]
        [Route("OrdinalIgnoreCase/{count}")]
        public ActionResult OrdinalIgnoreCase(int count)
        {
            var data = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
            for (var i = 0; i < count; i++)
            {
                data.TryAdd("Id_", i);
            }

            return Ok();
        }
    }

@danmoseley
Member

@sebastienros do we know whether this was a regression from a past release? Also, is it critical for meeting 3.0 perf goals?
If not, we should move it to a future milestone please.

@tarekgh
Member

tarekgh commented May 30, 2019

I don't think this is a regression.

@sebastienros
Member Author

This is not a regression, I already filed a related issue a year ago.

@mjsabby
Contributor

mjsabby commented May 30, 2019

We should fix this right? 2x-6x for such a critical operation seems bad.

@tarekgh
Member

tarekgh commented May 30, 2019

Yes we should look at it but I don't think this is urgent for 3.0 as it is not a regression. I am going to move it to the future milestone.

@terrajobst
Contributor

Tagging @adamsitnik as Dan assigned him to take a look

@danmoseley
Member

@mjsabby is right, it's significant enough we should investigate.

@adamsitnik
Member

From a quick look at the profiles and the code, I can tell that we are using different APIs to get the sort key when computing the hash and when comparing strings (Windows offers us something nice here). I will try to find a way to make the Linux implementation faster.
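To make the asymmetry concrete: an ordinal-ignore-case comparer can hash a string directly after folding case, while the invariant-culture comparer must first obtain a collation sort key from the native library (ICU on Linux) and hash that, which is why the choice of native API dominates the cost. A toy C sketch of the cheap ordinal path only (ASCII-only folding and FNV-1a, purely illustrative; this is not the runtime's code):

```c
#include <stdint.h>

/* Ordinal-ignore-case hashing needs no native call: fold case, hash the
 * code units. The culture-aware path instead has to ask the collation
 * library for a sort key first, paying a P/Invoke per string. */
static uint32_t ordinal_ignore_case_hash(const char *s)
{
    uint32_t h = 2166136261u;           /* FNV-1a offset basis */
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (c >= 'A' && c <= 'Z')
            c += 'a' - 'A';             /* ASCII case fold only */
        h = (h ^ c) * 16777619u;        /* FNV-1a prime */
    }
    return h;
}
```

Folded variants hash identically, e.g. `ordinal_ignore_case_hash("Id_") == ordinal_ignore_case_hash("id_")`, without ever leaving managed-equivalent code.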

@tarekgh
Member

tarekgh commented May 31, 2019

@adamsitnik from @sebastienros's investigation, I believe he mentioned not seeing much difference when executing in a single-threaded scenario, while the regression shows up in the multi-threaded ASP.NET scenario. In other words, @sebastienros's data show that when measuring the difference between Linux and Windows only in a console app calling InvariantCultureIgnoreCase, the regression is not nearly as large.

@sebastienros feel free to correct me if I stated anything wrong here.

I can tell that we are using different APIs to get the sortKey when computing the hash when comparing strings

This is expected as we depend on different components on different OS's.

@adamsitnik
Member

This is expected as we depend on different components on different OS's.

I just want to make sure we use the best available method in the right way according to spec.

@tarekgh
Member

tarekgh commented May 31, 2019

I just want to make sure we use the best available method in the right way according to spec.

Yes, that is fine. I would suggest measuring this scenario in isolation first before jumping in to optimize it. As I mentioned, there could be another factor in the ASP.NET scenario (something like locking involving the collections, for instance) causing this difference.

@adamsitnik
Member

I would suggest to measure this scenario in isolation first before jumping in trying to optimize it.

When I wrote "From a quick look at the profiles" I meant that I have already profiled that.

Anyway, thanks for the hint. The more people repeat "measure, measure, measure" the better!

@sebastienros
Member Author

FYI this issue has probably the same root cause as this one: https://github.com/dotnet/corefx/issues/29462

@tarekgh
Member

tarekgh commented May 31, 2019

FYI this issue has probably the same root cause as this one: #26054

I doubt that, because the string comparisons don't use the sort keys. But it may be a similar issue.

@adamsitnik
Member

I have used the recommendations provided in the ICU User Guide and made it two times faster dotnet/coreclr#24889
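The ICU User Guide recommendation in question is the sort key buffer pattern: pass a preallocated buffer in the first call and retry only when it turns out to be too small, rather than always making one call to query the length and a second to fill. A C sketch of the pattern, with `fake_sort_key` as a stub standing in for `ucol_getSortKey` (illustrative, not the actual coreclr patch):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stub in place of ICU's ucol_getSortKey(): returns the required key
 * length and fills dst only when the capacity suffices. Here the "key"
 * is just the input bytes plus a terminator; a real collator would emit
 * collation weights. */
static int32_t fake_sort_key(const char *src, uint8_t *dst, int32_t cap)
{
    int32_t needed = (int32_t)strlen(src) + 1;
    if (dst != NULL && needed <= cap)
        memcpy(dst, src, (size_t)needed);
    return needed;
}

/* Buffer pattern: one native call with a caller-provided buffer, and a
 * second call only when that buffer was too small -- halving the native
 * calls in the common (short string) case. native_calls counts calls. */
static uint8_t *get_sort_key(const char *src, uint8_t *stackbuf, int32_t cap,
                             int32_t *len, int *native_calls)
{
    int32_t needed = fake_sort_key(src, stackbuf, cap);
    ++*native_calls;
    if (needed <= cap) {                     /* common case: one call */
        *len = needed;
        return stackbuf;
    }
    uint8_t *heap = malloc((size_t)needed);  /* rare: retry, exact size */
    fake_sort_key(src, heap, needed);
    ++*native_calls;
    *len = needed;
    return heap;
}
```

For a short key like `"Id_"` against a 64-byte buffer this makes exactly one native call, which matches the "two times faster" observation for strings that fit the rented buffer.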

@billwert
Member

billwert commented Jun 2, 2019

Great that we were able to optimize this! @sebastienros what was the overall impact of this on the techempower scenario?

@brianrob

@tarekgh
Member

tarekgh commented Jun 2, 2019

Yes, @billwert's point is important, as I expect there are other places contributing to the perf differences.

@sebastienros
Member Author

sebastienros commented Jun 3, 2019

@billwert For the DataUpdates scenario this was terrible when used with Npgsql and Dapper, which I assume is a very common combination. We'll see how much of an issue this was for the other scenarios when the fix is in, and also how much this change fixes the DataUpdates scenario. As an example, ADO.NET is 10 times faster than Dapper in this scenario; I would expect a difference of 10% instead.

@brianrob
Member

brianrob commented Jun 3, 2019

@sebastienros do you have a trace that shows the overall cost of the slowdown on the scenario? I'd like for us to prioritize this bug based on the potential benefit we'd get from a fix.

@sebastienros
Member Author

@brianrob I pasted a screenshot of the best details I could get from the trace. I can show you the command line I used, maybe you would know how to get a better one, or get better data out of it.

@tarekgh
Member

tarekgh commented Jun 3, 2019

Just to note, @jkotas made an important comment (dotnet/coreclr#24889 (comment)) which is most likely contributing to the delay in the multi-threading scenario.

@danmoseley
Member

danmoseley commented Jun 4, 2019

@adamsitnik maybe you still need this bug to track further investigation

@adamsitnik
Member

adamsitnik commented Jun 4, 2019

maybe you still need this bug to track further investigation

Sure. I did not know that GH would close the issue in the other repo out of the box when I used the "fixes" keyword.

I am still looking at the traces with the lock. It looks like the problem is a different lock than the one Jan pointed out (I removed that one and changed the implementation to [ThreadStatic], and it did not help).

@jkotas jkotas reopened this Jun 4, 2019
@wangzhaoguan

(Translated from Chinese:) We ran into the same problem: the performance of .NET Core on Linux is several times worse than on Windows, with very high kernel-mode CPU. I downloaded the 3.0 Preview today to verify, and the problem still exists. Our real-world scenario is bulk-inserting data into Oracle: ODP internally makes frequent calls to String.Compare(s1, s2, true), which triggers calls to System.Globalization.Native.so!pthread_mutex_lock. The lldb stack dump is shown below.


Thread 233
TEB information is not available so a stack size of 0xFFFF is assumed
Current frame: libpthread.so.0!__lll_lock_wait + 0x1d
Child-SP RetAddr Caller, Callee
00007FDEC67F7600 00007ff89d019709 (MethodDesc 00007ff89ca66428 + 0xa9 DomainNeutralILStubClass.IL_STUB_PInvoke(SafeSortHandle, Char*, Int32, Char*, Int32, System.Globalization.CompareOptions))
00007FDEC67F7608 00007ff917d6cdcb libpthread.so.0!_L_lock_883 + 0xf, calling libpthread.so.0!__lll_lock_wait
00007FDEC67F7640 00007fe359f3ceb9 libicui18n.so.50!ucol_strcoll_50 + 0x479, calling libicui18n.so.50!___lldb_unnamed_symbol90$$libicui18n.so.50
00007FDEC67F7670 00007ff916538ec1 libcoreclr.so!MetaSig::SkipArg() + 0x31 [/root/coreclr/src/vm/siginfo.cpp:866], calling libcoreclr.so!SigParser::SkipExactlyOne() [/root/coreclr/src/utilcode/sigparser.cpp:16]
00007FDEC67F7690 00007ff89c30149c System.Globalization.Native.so!GetCollatorFromSortHandle(_sort_handle*, int, UErrorCode*) + 0x3c [/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../include/c++/4.8.5/bits/stl_tree.h:502], calling System.Globalization.Native.so!pthread_mutex_lock
00007FDEC67F76D0 00007ff89c3015db System.Globalization.Native.so!GlobalizationNative_CompareString + 0x3b [/root/coreclr/src/corefx/System.Globalization.Native/collation.cpp:437], calling System.Globalization.Native.so!GetCollatorFromSortHandle(_sort_handle*, int, UErrorCode*)
00007FDEC67F7710 00007ff89d019709 (MethodDesc 00007ff89ca66428 + 0xa9 DomainNeutralILStubClass.IL_STUB_PInvoke(SafeSortHandle, Char*, Int32, Char*, Int32, System.Globalization.CompareOptions))
00007FDEC67F7750 00007ff89d019709 (MethodDesc 00007ff89ca66428 + 0xa9 DomainNeutralILStubClass.IL_STUB_PInvoke(SafeSortHandle, Char*, Int32, Char*, Int32, System.Globalization.CompareOptions))
00007FDEC67F77B0 00007ff89cee3bb3 (MethodDesc 00007ff89ca79198 + 0x53 System.Globalization.CompareInfo.CompareString(System.ReadOnlySpan1<Char>, System.ReadOnlySpan1, System.Globalization.CompareOptions)), calling 00007ff89cba6dc0 (stub for Interop+Globalization.CompareString(SafeSortHandle, Char*, Int32, Char*, Int32, System.Globalization.CompareOptions))
00007FDEC67F77F0 00007ff89cee15e2 (MethodDesc 00007ff89ca78e50 + 0x152 System.Globalization.CompareInfo.Compare(System.String, System.String, System.Globalization.CompareOptions)), calling (MethodDesc 00007ff89ca79198 + 0 System.Globalization.CompareInfo.CompareString(System.ReadOnlySpan1<Char>, System.ReadOnlySpan1, System.Globalization.CompareOptions))
00007FDEC67F7850 00007ff89ce9bff6 (MethodDesc 00007ff89ca6f6b0 + 0xf6 System.String.Compare(System.String, System.String, System.StringComparison))
00007FDEC67F78A0 00007ff8aab77938 (MethodDesc 00007ff8a4c97e20 + 0x1d8 Oracle.ManagedDataAccess.Client.OracleParameterCollection.FindParamByName(System.String)), calling (MethodDesc 00007ff89ca6f6b0 + 0 System.String.Compare(System.String, System.String, System.StringComparison))
00007FDEC67F78B0 00007ff8aab97d7d (MethodDesc 00007ff89f0c1258 + 0x9d System.Collections.ArrayList.Contains(System.Object))
00007FDEC67F78F0 00007ff8ac448748 (MethodDesc 00007ff8a4ccba40 + 0x168 OracleInternal.ServiceObjects.OracleCommandImpl.GetBindByPositionBasedParameterCollection(Oracle.ManagedDataAccess.Client.OracleParameterCollection, System.Collections.ArrayList, Boolean)), calling (MethodDesc 00007ff8a4c97e20 + 0 Oracle.ManagedDataAccess.Client.OracleParameterCollection.FindParamByName(System.String))
00007FDEC67F7940 00007ff8ac452f59 (MethodDesc 00007ff8a4ccb8a8 + 0x4e9 OracleInternal.ServiceObjects.OracleCommandImpl.ExecuteNonQuery(System.String, Oracle.ManagedDataAccess.Client.OracleParameterCollection, System.Data.CommandType, OracleInternal.ServiceObjects.OracleConnectionImpl, Int32, Int64, OracleInternal.ServiceObjects.OracleDependencyImpl, Int64[] ByRef, Oracle.ManagedDataAccess.Client.OracleParameterCollection ByRef, Boolean ByRef, Oracle.ManagedDataAccess.Client.OracleException ByRef, Oracle.ManagedDataAccess.Client.OracleConnection, Oracle.ManagedDataAccess.Client.OracleLogicalTransaction ByRef, Boolean)), calling (MethodDesc 00007ff8a4ccba40 + 0 OracleInternal.ServiceObjects.OracleCommandImpl.GetBindByPositionBasedParameterCollection(Oracle.ManagedDataAccess.Client.OracleParameterCollection, System.Collections.ArrayList, Boolean))
00007FDEC67F79F0 00007ff89cec2f24 (MethodDesc 00007ff89ca74700 + 0x94 System.Number.TryFormatInt32(Int32, System.ReadOnlySpan1<Char>, System.IFormatProvider, System.Span1, Int32 ByRef)), calling (MethodDesc 00007ff89ca74800 + 0 System.Number.TryUInt32ToDecStr(UInt32, Int32, System.Span1<Char>, Int32 ByRef))
00007FDEC67F7A00 00007ff916746fd0 libcoreclr.so!SVR::gc_heap::adjust_limit_clr(unsigned char*, unsigned long, alloc_context*, SVR::heap_segment*, int, int) + 0x3b0 [/root/coreclr/src/gc/gc.cpp:11474], calling libcoreclr.so!memset
00007FDEC67F7A70 00007ff916749127 libcoreclr.so!SVR::gc_heap::a_fit_segment_end_p(int, SVR::heap_segment*, unsigned long, alloc_context*, int, int*) + 0x427 [/root/coreclr/src/gc/gc.cpp:12171], calling libcoreclr.so!SVR::gc_heap::adjust_limit_clr(unsigned char*, unsigned long, alloc_context*, SVR::heap_segment*, int, int) [/root/coreclr/src/gc/gc.cpp:11364]
00007FDEC67F7B00 00007ff916749b16 libcoreclr.so!SVR::gc_heap::allocate_small(int, unsigned long, alloc_context*, int) + 0x4a6 [/root/coreclr/src/gc/gc.cpp:12371], calling libcoreclr.so!SVR::gc_heap::a_fit_segment_end_p(int, SVR::heap_segment*, unsigned long, alloc_context*, int, int*) [/root/coreclr/src/gc/gc.cpp:12085]
00007FDEC67F7B90 00007ff91674ad83 libcoreclr.so!SVR::gc_heap::try_allocate_more_space(alloc_context*, unsigned long, int) + 0x1f3 [/root/coreclr/src/gc/gc.cpp:13208], calling libcoreclr.so!SVR::gc_heap::allocate_small(int, unsigned long, alloc_context*, int) [/root/coreclr/src/gc/gc.cpp:12315]
00007FDEC67F7E60 00007ff8ac45509d (MethodDesc 00007ff8a4c90580 + 0x1fd Oracle.ManagedDataAccess.Client.OracleCommand.ExecuteNonQuery()), calling (MethodDesc 00007ff8a4ccb8a8 + 0 OracleInternal.ServiceObjects.OracleCommandImpl.ExecuteNonQuery(System.String, Oracle.ManagedDataAccess.Client.OracleParameterCollection, System.Data.CommandType, OracleInternal.ServiceObjects.OracleConnectionImpl, Int32, Int64, OracleInternal.ServiceObjects.OracleDependencyImpl, Int64[] ByRef, Oracle.ManagedDataAccess.Client.OracleParameterCollection ByRef, Boolean ByRef, Oracle.ManagedDataAccess.Client.OracleException ByRef, Oracle.ManagedDataAccess.Client.OracleConnection, Oracle.ManagedDataAccess.Client.OracleLogicalTransaction ByRef, Boolean))
00007FDEC67F7F40 00007ff8abf70887 (MethodDesc 00007ff8a435d610 + 0x247 Ralms.EntityFrameworkCore.Oracle.Storage.Internal.OracleRelationalCommandBuilderFactory+OracleRelationalCommandBuilder+OracleRelationalCommand.Execute(Microsoft.EntityFrameworkCore.Storage.IRelationalConnection, Microsoft.EntityFrameworkCore.Diagnostics.DbCommandMethod, System.Collections.Generic.IReadOnlyDictionary2<System.String,System.Object>))
00007FDEC67F7F70 00007ff89cef2743 (MethodDesc 00007ff89ca7c0b0 + 0xf3 System.Collections.Generic.Dictionary2[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]].FindEntry(System.__Canon))
00007FDEC67F8000 00007ff9165cc3f1 libcoreclr.so!ObjIsInstanceOfNoGC + 0xe1 [/root/coreclr/src/vm/methodtable.inl:1692], calling libcoreclr.so!MethodTable::CanCastToInterfaceNoGC(MethodTable*) [/root/coreclr/src/vm/methodtable.cpp:1739]
00007FDEC67F80A0 00007ff8ac4556a0 (MethodDesc 00007ff8a435d428 + 0x40 Microsoft.EntityFrameworkCore.Storage.Internal.RelationalCommand.ExecuteNonQuery(Microsoft.EntityFrameworkCore.Storage.IRelationalConnection, System.Collections.Generic.IReadOnlyDictionary2<System.String,System.Object>))
00007FDEC67F80C0 00007ff8ac51413f (MethodDesc 00007ff8a47f5ea8 + 0x13f Microsoft.EntityFrameworkCore.RelationalDatabaseFacadeExtensions.ExecuteSqlCommand(Microsoft.EntityFrameworkCore.Infrastructure.DatabaseFacade, Microsoft.EntityFrameworkCore.RawSqlString, System.Collections.Generic.IEnumerable`1<System.Object>))

@tarekgh
Member

tarekgh commented Jun 6, 2019

@wangzhaoguan thanks for sending the stack. I posted a couple of questions on the PR regarding the version you are targeting. Can your app target 3.0?

@adamsitnik
Member

The lock has been removed in dotnet/coreclr#24973 and the issue is now solved for good.

Tomorrow I am going to file the servicing request to merge my backport to 2.1: dotnet/coreclr#25042
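The contention visible in the stack dump above came from a process-wide mutex guarding the collator lookup (GetCollatorFromSortHandle taking pthread_mutex_lock on every comparison). A common remedy, sketched below in C with stub types (this is an illustration of the technique, not the actual dotnet/coreclr#24973 patch), is a per-thread cache: each thread clones the collator once, so the hot path never takes a lock. Here `collator` stands in for ICU's UCollator and `clone_collator` for something like `ucol_safeClone`:

```c
#include <stddef.h>

typedef struct { int id; } collator;        /* stub for ICU's UCollator */

static _Thread_local collator t_instance;   /* per-thread clone storage */
static _Thread_local collator *t_cached;    /* NULL until first use */
static _Thread_local int t_clones;          /* clones made by this thread */

static collator *clone_collator(void)
{
    t_instance.id = 42;                     /* pretend: expensive clone */
    return &t_instance;
}

static collator *get_collator(void)
{
    if (t_cached == NULL) {                 /* slow path: first use only */
        t_cached = clone_collator();
        t_clones++;
    }
    return t_cached;                        /* lock-free hot path */
}
```

Repeated calls on the same thread return the same cached clone without touching any mutex, which is exactly the property the lock-contention traces were missing.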

@tarekgh tarekgh closed this as completed Jun 9, 2019
@tarekgh tarekgh reopened this Jun 9, 2019
@tarekgh
Member

tarekgh commented Jun 9, 2019

sorry closed this by mistake. @adamsitnik can close it though.

@adamsitnik
Member

sorry closed this by mistake

np, I wanted to close it once I ported the fix to 2.1, but since we are not going to do that, I can close it right now.

Btw, I've also added benchmarks to track the performance of string.GetHashCode and string.Compare for different comparisons and sizes in dotnet/performance#535

@sebastienros
Member Author

I don't think the issue has been fixed, running on ASP.NET I still get 271,737 RPS for Windows and 39,740 RPS for Linux. I checked as the benchmark that triggered the issue (data updates on Npgsql) didn't get any significant improvement.

@sebastienros
Member Author

FYI, here is the command to run the benchmarks (if you know how to use our service):

dotnet run -- @C:\temp\.environments\aspnet-lin `
--repository https://github.com/sebastienros/invariantcultureperf --project-file InvariantCulture.csproj --path /api/values/InvariantCultureIgnoreCase/100 --warmup 1 --duration 5 --runtime 3.0.0-*

@sebastienros
Member Author

And some event counters:

| | Windows | Linux |
|--------------------------|--------:|--------:|
| ThreadPool Threads Count | 48 | 76 |
| Lock Contention (#/s) | 77 | 1,281 |
| ThreadPool Queue Length | 222 | 106 |
| ThreadPool Items (#/s) | 89,701 | 581,628 |

@tarekgh
Member

tarekgh commented Jun 11, 2019

@sebastienros just double checking, are you sure you were running with @adamsitnik fix?

@sebastienros
Member Author

Trace here https://microsoft-my.sharepoint.com/:u:/p/sebros/EbPBKcbM8ZtOg8oY1wF5GT4B4uOXzCM3-BJ5AwvdLFBYfw?e=n8rtb1

Yes, I verified the build I am using has it (currently the latest 3.0.0-preview7-27811-09)

@sebastienros
Member Author

I can see two jumps in perf due to the two commits you both made, but nothing close to making the two platforms comparable:

[screenshot: RPS chart]

@adamsitnik
Member

I still get 271,737 RPS for Windows and 39,740 RPS for Linux

This is weird; I get something around 90k RPS for Windows and 60k for Linux.

dotnet run -- --server http://asp-perf-win:5001 --client http://asp-perf-load:5002 --repository https://github.com/sebastienros/invariantcultureperf --project-file InvariantCulture.csproj --path /api/values/InvariantCultureIgnoreCase/100 --warmup 1 --duration 5 --runtime 3.0.0-*
RequestsPerSecond:           92 141
Max CPU (%):                 90
WorkingSet (MB):             162
Avg. Latency (ms):           3,82
Startup (ms):                346
First Request (ms):          105,74
Latency (ms):                0,65
Total Requests:              469 856
Duration: (ms)               5 100
Socket Errors:               0
Bad Responses:               0
SDK:                         3.0.100-preview7-012367
Runtime:                     3.0.0-preview7-27811-14
ASP.NET Core:                3.0.0-preview7.19311.5
dotnet run -- --server http://asp-perf-lin:5001 --client http://asp-perf-load:5002 --repository https://github.com/sebastienros/invariantcultureperf --project-file InvariantCulture.csproj --path /api/values/InvariantCultureIgnoreCase/100 --warmup 1 --duration 5 --runtime 3.0.0-*
RequestsPerSecond:           60 645
Max CPU (%):                 98
WorkingSet (MB):             176
Avg. Latency (ms):           4,22
Startup (ms):                267
First Request (ms):          113,33
Latency (ms):                0,66
Total Requests:              308 079
Duration: (ms)               5 080
Socket Errors:               0
Bad Responses:               0
SDK:                         3.0.100-preview7-012367
Runtime:                     3.0.0-preview7-27811-14
ASP.NET Core:                3.0.0-preview7.19311.5

@adamsitnik
Member

I ran the oldest preview7 that was available on the dotnet-core feed (BTW, why are there no preview6 packages there?) to make sure it did not contain dotnet/coreclr#24973:

--runtimeVersion 3.0.0-preview7-27807-12 --collect-trace
[09:34:32.664] Downloading trace: trace.06-12-09-34-32.RPS-44K.trace.zip
RequestsPerSecond:           43 902
Max CPU (%):                 96
WorkingSet (MB):             178
Avg. Latency (ms):           5,73
Startup (ms):                311
First Request (ms):          133,19
Latency (ms):                0,85
Total Requests:              223 424
Duration: (ms)               5 090
Socket Errors:               0
Bad Responses:               0
SDK:                         3.0.100-preview7-012367
Runtime:                     3.0.0-preview7-27807-12
ASP.NET Core:                3.0.0-preview7.19311.5

The locks were all over the place:

[screenshot: profile showing lock contention]

With the fix:

--runtime 3.0.0-* --collect-trace
[08:59:29.204] Downloading trace: trace.06-12-08-59-29.RPS-60K.trace.zip
RequestsPerSecond:           59 524
Max CPU (%):                 99
WorkingSet (MB):             176
Avg. Latency (ms):           4,1
Startup (ms):                304
First Request (ms):          132,98
Latency (ms):                0,77
Total Requests:              301 716
Duration: (ms)               5 070
Socket Errors:               0
Bad Responses:               0
SDK:                         3.0.100-preview7-012367
Runtime:                     3.0.0-preview7-27811-13
ASP.NET Core:                3.0.0-preview7.19311.5

The locks are gone

[screenshot: profile with the locks gone]

@adamsitnik
Member

@sebastienros could you please take a look at the commands I ran? The results we get for Windows are very different: 271k vs 90k RPS.

In the meantime, I am going to take a look at the two hottest methods from the stack trace:

System.Private.CoreLib!System.Runtime.InteropServices.SafeHandle::InternalRelease(bool)[OptimizedTier1] 
System.Private.CoreLib!System.Runtime.InteropServices.SafeHandle::DangerousAddRef(bool&)[OptimizedTier1]

@adamsitnik adamsitnik reopened this Jun 12, 2019
@adamsitnik
Member

Ok, it looks like CriticalHandle is what we need.
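For context on why those two methods are hot: SafeHandle keeps an internal use count, and the interop marshaler calls DangerousAddRef before and DangerousRelease after every P/Invoke, so every string comparison pays for two interlocked operations on a counter shared across all threads, ping-ponging its cache line between cores. A minimal C sketch of the pattern using C11 atomics (stub types; this mirrors the mechanism, not the runtime's actual code):

```c
#include <stdatomic.h>

/* Mimics SafeHandle's per-call lifetime tracking: one interlocked
 * increment on entry and one interlocked decrement on exit. Replacing
 * the handle with a cached raw pointer (possible here because sort
 * handles are never freed) removes both operations from the hot path. */
typedef struct {
    void *handle;
    atomic_int refcount;            /* shared across all threads */
} safe_handle;

static void *safe_handle_enter(safe_handle *h)   /* ~DangerousAddRef */
{
    atomic_fetch_add(&h->refcount, 1);           /* interlocked op #1 */
    return h->handle;
}

static void safe_handle_exit(safe_handle *h)     /* ~DangerousRelease */
{
    atomic_fetch_sub(&h->refcount, 1);           /* interlocked op #2 */
}
```

On a 28-logical-processor machine hammering the same comparer, those two atomic writes per call are enough to show up at the top of a profile, which matches the SafeHandle frames above.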

@adamsitnik
Member

Ok, I was able to get 110k RPS using my local build of System.Private.CoreLib.dll passed via --outputFile.

RequestsPerSecond:           110 245
Max CPU (%):                 94
WorkingSet (MB):             177
Avg. Latency (ms):           2,65
Startup (ms):                270
First Request (ms):          110,62
Latency (ms):                0,91
Total Requests:              562 104
Duration: (ms)               5 100
Socket Errors:               0
Bad Responses:               0
SDK:                         3.0.100-preview7-012367
Runtime:                     3.0.0-preview7-27812-01
ASP.NET Core:                3.0.0-preview7.19311.5

[screenshot: profile after the change]

I have created a PR in CoreCLR: dotnet/coreclr#25117 I hope it's not a very bad idea.

@benaadams
Member

And some event counters:

  Windows Linux
ThreadPool Items (#/s) 89,701 581,628

That's an interesting difference...

@stephentoub
Member

That's an interesting difference

Presumably coming from asynchronously completing socket operations that need to queue from the epoll thread back to the thread pool to run user code.

@sebastienros
Member Author

Ok, I was able to get 110k RPS using my local build of System.Private.CoreLib.dll passed via --outputFile.

This is exactly how a fix should be validated: testing your local changes on the scenario before sending the PR. That's awesome!

could you please take a look at the commands I run

Yeah, they are perfect. I was just using the TechEmpower machines, which are beefier and probably more prone to locking since they handle more load. The fact that you got faster results on Linux than Windows means it's really fixed. You can also try --collect-counters to verify that the locks/s are much lower.

@sebastienros
Member Author

FYI for everyone: I added support for Event Counters to the benchmarking service (--collect-counters), and they are now tracked with every ASP.NET scenario. We'll also get continuous charts for these numbers.

Sneak peek:

CPU Usage (%):               99
Working Set (MB):            526
GC Heap Size (MB):           281
Gen 1 GC (#/s):              1
Gen 2 GC (#/s):              0
Time in GC (%):              0
Gen 0 Size (B):              135,464
Gen 1 Size (B):              3,033,544
Gen 2 Size (B):              2,872,120
LOH Size (B):                1,278,288
Allocation Rate (B/sec):     227,329,432
# of Assemblies Loaded:      110
Exceptions (#/s):            985
ThreadPool Threads Count:    48
Lock Contention (#/s):       77
ThreadPool Queue Length:     222
ThreadPool Items (#/s):      89,701

@adamsitnik
Member

Ok, I was able to use the same machine as @sebastienros and confirm that the problem is gone.

Changes

dotnet/coreclr#24889 - I have reduced the number of calls to an expensive native API by half, which made StringComparison.InvariantCulture.GetHashCode() two times faster for strings shorter than 262 144 characters (the biggest buffer rentable from ArrayPool.Shared without allocation / 4)

dotnet/coreclr#24973 - I have removed the lock that was part of the native implementation of CompareString, IndexOf, LastIndexOf, StartsWith, EndsWith, GetHashCode

dotnet/coreclr#25117 - I have removed the expensive ref counting of SafeSortHandle usages and replaced SafeSortHandle with a cached IntPtr (the number of cultures and sort options is always low so it's ok), the affected methods are CompareString, IndexOf, LastIndexOf, StartsWith, EndsWith, GetHashCode

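The third change, caching raw pointers instead of SafeHandles, works because the set of (culture, options) pairs is small and sort handles are never freed. A hypothetical C sketch of such a cache (names and table size are illustrative, not the actual coreclr code; a real version would also need to make the miss path thread-safe, e.g. by inserting under a lock):

```c
#include <string.h>

/* Tiny append-only cache mapping (culture, options) to a raw handle.
 * Lookups are lock-free reads with a linear scan -- fine because the
 * table stays small -- and nothing is ever evicted or freed, so no ref
 * counting is needed on the hot path. */
#define MAX_SORT_HANDLES 64

typedef struct {
    char name[16];       /* culture name, e.g. "" or "en-US" */
    int options;         /* comparison options flags */
    void *handle;        /* the cached native sort handle */
} entry;

static entry g_cache[MAX_SORT_HANDLES];
static int g_count;

static void *get_sort_handle(const char *culture, int options)
{
    for (int i = 0; i < g_count; i++)       /* linear scan: table is tiny */
        if (g_cache[i].options == options &&
            strcmp(g_cache[i].name, culture) == 0)
            return g_cache[i].handle;

    entry *e = &g_cache[g_count++];         /* miss: create and cache */
    strncpy(e->name, culture, sizeof e->name - 1);
    e->options = options;
    e->handle = e;                          /* stub: any stable pointer */
    return e->handle;
}
```

Repeated lookups for the same (culture, options) pair return the identical pointer with no interlocked operations, which is what replaced the SafeHandle ref counting.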

The most recent change had the biggest impact on the beefy Citrine Physical Machines (14 Core(s), 28 Logical Processor(s))

before

RequestsPerSecond:           40 811
Max CPU (%):                 100
WorkingSet (MB):             496
Avg. Latency (ms):           6,63
Startup (ms):                298
First Request (ms):          102,54
Latency (ms):                0,86
Total Requests:              208 136
Duration: (ms)               5 100
Socket Errors:               0
Bad Responses:               0
SDK:                         3.0.100-preview7-012412
Runtime:                     3.0.0-preview7-27815-04
ASP.NET Core:                3.0.0-preview7.19315.2

after

RequestsPerSecond:           240 711
Max CPU (%):                 94
WorkingSet (MB):             523
Avg. Latency (ms):           1,77
Startup (ms):                338
First Request (ms):          121,31
Latency (ms):                1,04
Total Requests:              1 227 445
Duration: (ms)               5 100
Socket Errors:               0
Bad Responses:               0

Linux vs Windows after all changes

Linux vs Windows after all fixes for the ASP.NET perf lab machines

Local Physical servers (E5-1650 v3 @ 3.50 GHz w/ 15MB Cache 6 Cores / 12 Threads 32GB of RAM.)

110k RPS for Linux and 90k for Windows

Citrine Physical Machines (Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz, 2195 Mhz, 14 Core(s), 28 Logical Processor(s) 32GB of RAM.)

240k RPS for Linux and 265k RPS for Windows

Azure VMs (D3_v2 4 virtual cores. 4GB of RAM)

28k RPS for both Linux and Windows

Summary

For the Azure VMs both OSes are even, for the Local Physical servers Linux is now 20% faster, and for the Citrine Physical Machines Windows is still 10% faster.

I have currently run out of ideas for closing the 10% gap on the Citrine Physical Machines; the trace does not leave much room for optimization.

[screenshot: final profile]

@sebastienros please close the issue after you verify that the problem is gone

big thanks to @jkotas and @tarekgh for all the help!

@tarekgh
Member

tarekgh commented Jun 19, 2019

Ok, I was able to use the same machine as @sebastienros and confirm that the problem is gone.

@adamsitnik can we close this issue now?

@adamsitnik
Member

Yes, we can.

@sebastienros I verified that the fix works with the latest coreclr packages that were pushed to core-setup in the last 24h.

@sebastienros
Member Author

Before and after the changes from @adamsitnik

[screenshot: before/after RPS comparison]

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 3.0 milestone Feb 1, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 13, 2020