Possible Critical Bug/Issue with SqlConnection in .NET Core 2.0 when running on Linux or Linux Containers and having multiple concurrent users like in a Load Test #24301
Comments
@CESARDELATORRE Do you know at what user capacity the application breaks down?
With just around 15 concurrent HTTP users (configured with the VS Load Test project), the application breaks down after a minute or so of running the load test in VS.
I could isolate the issue to slow DNS resolution of the SQL Server endpoint. When the requests for opening a socket to SQL Server pile up, DNS lookup times increase as well. The endpoint to use for the socket-based repro is …
For a single TCP connection to Azure SQL DB:
Using wget to send multiple requests to the sqlnet endpoint:
The DNS resolution time keeps increasing with every request. As a result, the SqlConnection timeout has already expired by the time a connection to SQL Server is established.
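To make that observation concrete, here is a minimal sketch (not the repro's actual code) of how one might time concurrent DNS lookups; the host name is the one used in this thread, everything else is illustrative:

```csharp
// Minimal sketch: time 20 concurrent DNS lookups to see whether resolution
// latency grows as requests pile up (illustrative, not the repro's code).
using System;
using System.Diagnostics;
using System.Linq;
using System.Net;
using System.Threading.Tasks;

class DnsTiming
{
    static async Task Main()
    {
        const string host = "eshopsql-e3h5ug5xd3rfs.database.windows.net";

        var lookups = Enumerable.Range(0, 20).Select(async i =>
        {
            var sw = Stopwatch.StartNew();
            await Dns.GetHostAddressesAsync(host);
            Console.WriteLine($"Lookup {i}: {sw.ElapsedMilliseconds} ms");
        });

        await Task.WhenAll(lookups);
    }
}
```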
I am using the following script on Ubuntu to emulate multiple concurrent connections:

```bash
#!/bin/bash
# Fire 20 requests at the sqlnet endpoint in parallel, then wait for them all.
for number in {1..20}
do
    wget http://localhost:22354/api/values/sqlnet &
done
wait
exit 0
```

The author of the issue has reported the repro on Docker; however, I can repro this issue even on a host Ubuntu VM. To find out whether DNS resolution on the OS itself is slow, I tried `nslookup eshopsql-e3h5ug5xd3rfs.database.windows.net`, and the results for the NS lookup were almost instantaneous. To run the repro, navigate to sqltestapp/sqltestwebapi and …
Copying folks from networking for guidance.
Do you have sample code for the Dns.GetHostAddressesAsync calls, @saurabh500?
I have posted the link to the repro.
I've tried the repro in a container and I get worse results.
That is not that surprising. I do see the SQL demo but not a pure DNS test. As for the base OS on a VM: does that mean a full Linux VM in, let's say, Hyper-V on a Windows desktop?
In the repro there is an endpoint called sqlnet which only calls the networking API. The OS is hosted on Hyper-V.
Have you tried changing from connection.Open to connection.OpenAsync, e.g. …
as well as …
I have noticed that Linux (with ASP.NET Core) is far more susceptible to thread starvation. Also, a Docker machine normally has a very low core count, so it dies off quickly and you start to hit timeouts. Try changing to the above and check the results. (Also, was your DNS test done using blocking or async code?)
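For reference, here is a sketch of that kind of change, assuming a controller action similar to the repro's (the route, the `ConnectionString` field, and the query are placeholders, not the actual repro code):

```csharp
// Await the async ADO.NET calls instead of blocking the request thread.
// ConnectionString, route, and query are placeholder assumptions.
[HttpGet("nativeasync")]
public async Task<IActionResult> GetCountAsync()
{
    using (var connection = new SqlConnection(ConnectionString))
    using (var command = new SqlCommand("SELECT COUNT(*) FROM dbo.Catalog", connection))
    {
        await connection.OpenAsync();                         // was: connection.Open();
        var count = (int)await command.ExecuteScalarAsync();  // was: command.ExecuteScalar();
        return Ok(count);
    }
}
```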
@Drawaes I am trying to understand what is causing the slowdown of subsequent DNS lookups from the DNS APIs, which both the SqlConnection Open and OpenAsync code paths will use. The degree of parallelism for which this bug was reported was only 15 concurrent users.
From what I am reading, Docker doesn't impose any limitations on the resources of a container. I upgraded my VM to 4 cores and I could still repro the issue. After this I changed the code to add an … I then went ahead and modified all the SqlClient calls to be async and saw a big improvement in the behavior; I could scale the requests to 10,000 concurrent users in VS (thanks @Drawaes). @wfurt does it still make sense to investigate why blocking calls slow down under high concurrent load?
With blocking calls, everything will be serialized. The part I don't understand is why the time would increase over time, unless we are building a backlog, e.g. we are asking for lookups faster than we are processing them.
Yes, that's weird. I don't know about the internal implementation of … An experiment based on your comment above: if I change all calls to be blocking, e.g. … I would imagine that if I am waiting for the DNS resolution and socket connect async tasks to finish in a blocking context, that should take the same amount of time as calling the sync counterparts, but that doesn't seem to be the case here. I will evaluate whether the blocking APIs of SqlClient should be changed to use blocking network calls if that yields better performance.
Imagine you have four threads in the thread pool, and on those four thread pool threads, you initiate a synchronous operation; each completes synchronously and the operations go on their merry way.

Now imagine you have four threads in the thread pool, and on those four thread pool threads, you initiate an asynchronous operation that will queue a work item back to the thread pool at some point during its invocation, and then you block on the result of that asynchronous operation. You're now blocking all four threads in the pool waiting on those four asynchronous operations to complete, but those four asynchronous operations that were initiated are themselves queueing work items to the pool that require threads to run, and those work items need to complete before the blocking is ended.

If the limit on the pool were four threads, you've essentially deadlocked. Assuming the limit is not, eventually the thread pool will detect the starvation and will inject more threads, but it does so relatively slowly. That's very likely contributing to the lag you're seeing.
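A standalone illustration of that starvation pattern (not taken from SqlClient): every thread-pool thread blocks on a task whose continuation itself needs a pool thread, so progress is gated on how quickly the pool injects new threads.

```csharp
// Sync-over-async starvation demo (illustrative only).
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class StarvationDemo
{
    static void Main()
    {
        ThreadPool.GetMinThreads(out int workerThreads, out _);
        Console.WriteLine($"Min worker threads: {workerThreads}");

        var sw = Stopwatch.StartNew();
        var tasks = new Task[workerThreads * 4];
        for (int i = 0; i < tasks.Length; i++)
        {
            tasks[i] = Task.Run(() =>
            {
                // Block a pool thread on work whose continuation also needs a pool thread.
                Task.Delay(100).ContinueWith(_ => { }).Wait();
            });
        }

        Task.WaitAll(tasks);
        Console.WriteLine($"Elapsed: {sw.ElapsedMilliseconds} ms (far more than 100 ms, because threads are injected slowly)");
    }
}
```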
:) It's something I see over and over. It's due to the context of ASP.NET: it uses the same thread pool to try to service requests. You get a limited number of thread-pool threads (I forget the core-count multiplier off the top of my head). There is a hill-climbing algorithm that adds roughly one thread every 500 ms, so if CPU use is low and all threads are in use, you get a very slow ramp-up. So you block on a network call for, say, 200 ms; in the meantime there are x requests waiting for ASP.NET. You finally release that thread and then you have a backlog of requests, and it takes longer and longer. You can't block on IO in ASP.NET if you don't want to bring down your server. We showed in the aspnet repo that a small number like 10 on a low-CPU-count Docker container could bring it to its knees.
@stephentoub beat me with a better explanation.
I can understand the improvements when using asynchronous calls, but with sync we can reproduce the problem with a low number of users and, above all, we cannot repro it with a similar or higher number of users when load testing the same code running on Windows or Windows Containers.
@stephentoub and @Drawaes Thanks for the insights into what is going on with this issue. Currently SqlConnection.Open() calls the async networking APIs (e.g. Dns.GetHostAddressesAsync) and blocks on their results. SqlConnection.Open() should call the blocking versions of those methods instead (i.e. Dns.GetHostAddresses and the synchronous socket Connect). SqlConnection.OpenAsync(), which returns a task, should also use the non-async versions of the APIs. I just checked the Windows implementation of TCP socket connections, which always establishes connections synchronously in case of …
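A rough sketch of the two connect paths being contrasted here (illustrative only, not the actual SqlClient/SNI code): the first mirrors the sync-over-async pattern discussed in this thread, the second uses the blocking networking APIs directly.

```csharp
// Illustrative sketch, not the real SqlClient implementation.
using System.Net;
using System.Net.Sockets;

static class ConnectSketch
{
    // Sync-over-async: call the async APIs and block on their results.
    public static Socket ConnectSyncOverAsync(string host, int port)
    {
        IPAddress[] addresses = Dns.GetHostAddressesAsync(host).GetAwaiter().GetResult();
        var socket = new Socket(SocketType.Stream, ProtocolType.Tcp);
        socket.ConnectAsync(addresses, port).GetAwaiter().GetResult();
        return socket;
    }

    // Fully synchronous: call the blocking APIs directly.
    public static Socket ConnectSync(string host, int port)
    {
        IPAddress[] addresses = Dns.GetHostAddresses(host);
        var socket = new Socket(SocketType.Stream, ProtocolType.Tcp);
        socket.Connect(addresses, port);
        return socket;
    }
}
```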
@saurabh500 Is there going to be a related fix for this issue? Where, and in what timeframe, can we expect it?
Yes
I don't have the answer to this question right now.
By the way, can you confirm whether the issue/fix also applies to SqlConnection running on plain Linux, in addition to running in a Docker Linux container?
I have seen the issue on Ubuntu 16.04 as well as in a hosted Docker image.
OK, thanks for helping on this issue and jumping on it so quickly! 👍
I was telling @saurabh500 offline that this part of the plan sounded very counterintuitive to me:

> SqlConnection.OpenAsync(), which returns a task, should also use the non-async versions of the APIs.

I.e. shouldn't we be striving to make SqlConnection.OpenAsync() completely non-blocking? But then he explained to me that currently we end up blocking on the result of the async methods we call, in places like https://github.com/dotnet/corefx/blob/master/src/System.Data.SqlClient/src/System/Data/SqlClient/SNI/SNITcpHandle.cs#L130 and … Those lines are part of the SNITcpHandle constructor. Without that change, I agree with @saurabh500 that it makes sense to switch to the sync versions of the methods in the short term, but I think we should pursue removing blocking calls from the implementation longer term. @saurabh500 @geleems I will create an issue for this.
Be warned that adding blocking back in badly affects ASP.NET's performance. If a connection is being opened and closed rapidly, it can bring a web server to its knees. Is there no other way around this?
@Drawaes the problem is that the code is already blocking, only it calls the async API and then blocks on Task.Result. That seems to be worse than just calling the sync version, unless I missed something. Also, it seems the actual way around this would be to implement what I describe in https://github.com/dotnet/corefx/issues/25742. That doesn't seem too complex to me, but I don't know if @saurabh500 agrees with that assessment.
I guess my point would be: why not fix it now? All that code in the constructor seems like it could be moved now to a method that is called from ConnectAsync or ParallelConnectAsync; there aren't any blocking connects in there.
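A hypothetical sketch of that shape of refactoring; the type and member names here are invented for illustration and do not match the real SNITcpHandle API:

```csharp
// Do no I/O in the constructor; expose an async factory instead, so neither
// the sync nor async open path has to block on a Task. Names are hypothetical.
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

sealed class TcpHandleSketch
{
    private readonly Socket _socket;

    private TcpHandleSketch(Socket socket)
    {
        _socket = socket;
    }

    public static async Task<TcpHandleSketch> ConnectAsync(string host, int port)
    {
        // DNS resolution and connect happen here, fully asynchronously.
        IPAddress[] addresses = await Dns.GetHostAddressesAsync(host).ConfigureAwait(false);
        var socket = new Socket(SocketType.Stream, ProtocolType.Tcp);
        await socket.ConnectAsync(addresses, port).ConfigureAwait(false);
        return new TcpHandleSketch(socket);
    }
}
```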
I won't comment on the complexity, but I do know that this effort needs a change from the top-level API down to the networking layer, and this might creep into the native SNI as well if done right. So, in my view, to do this right the number of areas impacted would be many. @Drawaes if you look at the code calls that lead to the DNS lookup and parallel connect, you will see that we block in quite a few places.
@saurabh500 Agreed that async is a virus like that (in the nicest possible way). My point here is that unless this going back to blocking is for a backport fix, aren't you better off trying to eliminate the blocking calls one by one? At the moment they are far easier to spot (search for .Result, .Wait, etc.) than a temp hack that won't ship yet anyway. I have a couple of things to finish up on SslStream, hopefully this week; then I can try to take a look at what I can do as quick wins. From scanning through, there are these areas: …
+1 on removing blocking code from connection pooling for the async path. FWIW, this has come up in our discussions for https://github.com/aspnet/dataaccessperformance. @Drawaes could you create a separate issue for this if one does not exist? It is better to track those improvements separately because it is likely we will address them separately. E.g. we are considering taking the minimal fix for this issue in a 2.0.x patch, while the rest of the improvements could go in 2.1 or later.
The fix was merged to the master branch. Closing this issue.
Is this likely to be patched in 2.0 at all?
A PR to patch 2.0.x has been merged at dotnet/corefx#26247
Can I assume that this is also fixed in Microsoft.Data.SqlClient, and (correspondingly) EF Core 3? We are experiencing a similar issue.
POSSIBLE ISSUE/BUG
When load testing with not that many concurrent users (just 20, 50, or somewhat more) we've found a Possible Critical Bug/Issue with SqlConnection in .NET Core 2.0 when running on Linux or Linux Containers.
We might be missing something, but I don't see it.
This is happening only when running SqlClient in .NET Core 2.0 on Linux or Linux-containers.
When running on Windows or Windows Containers there are no issues, so that is a bad symptom...
REPRO:
We have created a very simple project to repro the issue.
It is an ASP.NET Core Web API with two methods to run: one at a URL ending in "/native", which uses SqlConnection directly, and one ending in "/dapper", which uses Dapper.
Basically, the issue is related to SqlConnection itself, not to higher-level frameworks like EF Core or Dapper, as it happens whenever the .NET Core 2.0 SqlConnection is used on Linux.
You can get the repro code from here (Simple ASP.NET Core 2.0 Web API with SqlConnection):
https://github.com/CESARDELATORRE/SQLConnNetCore2IssueRepro
This is the Web API method when simply using a SqlConnection:
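The original snippet is not reproduced above; the following is a hedged sketch of what such a method might look like, i.e. a synchronous open and query on the request thread (the connection string, table name, and route are placeholders):

```csharp
// Sketch of a controller action using SqlConnection synchronously.
// Connection string, table, and route are placeholder assumptions.
using Microsoft.AspNetCore.Mvc;
using System.Data.SqlClient;

[Route("api/values")]
public class ValuesController : Controller
{
    private const string ConnectionString =
        "Server=tcp:yourserver.database.windows.net,1433;Database=yourdb;User ID=youruser;Password=yourpassword;";

    [HttpGet("native")]
    public IActionResult GetCount()
    {
        using (var connection = new SqlConnection(ConnectionString))
        using (var command = new SqlCommand("SELECT COUNT(*) FROM dbo.Catalog", connection))
        {
            connection.Open();                        // blocking open: this is where the
            var count = (int)command.ExecuteScalar(); // errors under load show up on Linux
            return Ok(count);
        }
    }
}
```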
The database is actually running as an Azure SQL Database, so we test against a hypothetical "production" database. But this issue happens the same way if the SQL database is running in a SQL Server container.
If I deploy the container/solution from VS into Docker for Windows (running on the Linux VM), the Web API runs OK, as in the following screenshot:
Note that the URL has to end with "/native".
The tables in SQL currently have no rows, hence "0" as the return value. But the query is being executed:
If you want to try with Dapper, use the same URL but ending in "/dapper", which will execute a different method.
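For completeness, a hypothetical sketch of the Dapper variant (assuming `using Dapper;` and the same placeholder controller as the sketch above); Dapper still uses the same underlying SqlConnection, which is why the issue reproduces there as well:

```csharp
// Dapper variant of the placeholder action above (illustrative only).
[HttpGet("dapper")]
public IActionResult GetCountWithDapper()
{
    using (var connection = new SqlConnection(ConnectionString))
    {
        // Dapper opens the closed connection itself for the duration of the call.
        var count = connection.ExecuteScalar<int>("SELECT COUNT(*) FROM dbo.Catalog");
        return Ok(count);
    }
}
```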
You can see the Web API container running in the local Docker Linux VM:
![image](https://user-images.githubusercontent.com/1712635/33460652-f50e2e94-d5e3-11e7-99fe-4b4595aadc03.png)
Then, when we do load testing and simulate concurrent users, we start getting a lot of errors from SqlConnection, like this:
```
System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 40 - Could not open a connection to SQL Server)
   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.ProviderBase.DbConnectionClosed.TryOpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.Open()
   at …
```
Contact me at cesardl at microsoft.com for further info about the issue.