
Possible Critical Bug/Issue with SqlConnection in .NET Core 2.0 when running on Linux or Linux Containers and having multiple concurrent users like in a Load Test #24301

Closed
CESARDELATORRE opened this issue Nov 30, 2017 · 39 comments

@CESARDELATORRE

POSSIBLE ISSUE/BUG
When load testing with a modest number of concurrent users (around 20, 50 or more), we've found a possible critical bug/issue with SqlConnection in .NET Core 2.0 when running on Linux or Linux containers.
We might be missing something, but I don't see it.
This happens only when running SqlClient in .NET Core 2.0 on Linux or in Linux containers.
When running on Windows or Windows containers there are no issues, so that is a bad symptom...

Error when load testing:
System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred
while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that 
the instance name is correct and that SQL Server is configured to allow remote connections. (provider: 
TCP Provider, error: 40 - Could not open a connection to SQL Server) at 
System.Data.ProviderBase.DbConnectionPool.TryGetConnection(...

REPRO:
We have created a very simple project to repro the issue.
It is an ASP.NET Core Web API with two methods to run:

  1. A method running a query over a plain native SqlConnection.
  2. A method running a query with Dapper.

The same should happen when using EF Core, but for the repro we preferred the simplest case, which is just using a SqlConnection.

Basically, the issue is in SqlConnection itself, not in higher-level frameworks like EF Core or Dapper, as it happens whenever .NET Core 2.0 SqlConnection is used on Linux.

You can get the repro code from here (Simple ASP.NET Core 2.0 Web API with SqlConnection):

https://github.com/CESARDELATORRE/SQLConnNetCore2IssueRepro

This is the Web API method when simply using a SqlConnection:

[HttpGet("native")]
public async Task<IActionResult> GetNative()
{
    var connectionString = "Server=tcp:eshopsql-e3h5ug5xd3rfs.database.windows.net,1433;Initial Catalog=orderingdb;Persist Security Info=False;User ID=eshop;Password=MY-PASSWORD;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Max Pool Size=400;Connection Timeout=30;";
    using (var connection = new SqlConnection(connectionString))
    {
        try
        {
            connection.Open();
            var command = connection.CreateCommand();
            command.CommandText = @"SELECT o.[Id] as ordernumber,o.[OrderDate] as [date],os.[Name] as [status],SUM(oi.units*oi.unitprice) as total
                FROM [ordering].[Orders] o
                LEFT JOIN[ordering].[orderitems] oi ON  o.Id = oi.orderid 
                LEFT JOIN[ordering].[orderstatus] os on o.OrderStatusId = os.Id                     
                GROUP BY o.[Id], o.[OrderDate], os.[Name] 
                ORDER BY o.[Id]";
            var reader = command.ExecuteReader(System.Data.CommandBehavior.CloseConnection);
            var count = 0;
            while (reader.Read())
            {
                count += 1;
            }
            return Ok(count);
        }
        catch (Exception x)
        {
            throw new InvalidOperationException(x.ToString());
        }
    }
}

The database is actually an Azure SQL Database, so we test against a hypothetical "production" database. But this issue happens the same way if the SQL database is running in a SQL Server container.

If I deploy the container/solution from VS into Docker for Windows (running on the Linux VM), the Web API runs fine, as in the following screenshot:
NOTE that the URL has to end with "/native".

The tables in SQL currently have no rows, hence the "0" return value. But the query is being executed:

[screenshot of the Web API response]

If you want to try with Dapper, use the same URL but ending in "/dapper", which will execute a different code path/method.

You can see the Web API container running in the local Docker Linux VM:
[screenshot of the running container]

Then, when we run load tests simulating concurrent users, we start getting a lot of errors from SqlConnection, like this:

System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 40 - Could not open a connection to SQL Server)
   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.ProviderBase.DbConnectionClosed.TryOpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.Open()
   at

Contact me at cesardl at microsoft.com for further info about the issue.

@saurabh500
Contributor

@CESARDELATORRE Do you know at what user capacity the application breaks down?

@CESARDELATORRE
Author

CESARDELATORRE commented Nov 30, 2017

With just around 15 concurrent HTTP users (configured with the VS LoadTest project), the application breaks down after a minute or so of running the load test with VS.
On the other hand, the Max Pool Size in the connection string is configured to 400, which should be high enough.
It looks like connections are not being disposed/freed after they are used?
I say this because even after stopping the load test, with just a single user from a browser, the app either doesn't respond or keeps throwing SqlConnection errors, and you need to restart the app/service/container (re-deploy).

@CESARDELATORRE CESARDELATORRE changed the title Possible Critical Bug/Issue with SqlClient in .NET Core 2.0 when running on Linux or Linux Containers and having multiple concurrent users like in a Load Test Possible Critical Bug/Issue with SqlConnection in .NET Core 2.0 when running on Linux or Linux Containers and having multiple concurrent users like in a Load Test Dec 1, 2017
@saurabh500
Contributor

I could isolate the issue to slow DNS resolution of the SQL Server endpoint.

When the requests for opening a socket to SQL Server pile up, there is an increase in DNS lookup times as well.
I have the application from @CESARDELATORRE cloned at https://github.com/saurabh500/SQLConnNetCore2IssueRepro

The endpoint to use for the socket-based repro is http://localhost:22354/api/values/sqlnet
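
For reference, a minimal sketch of what such a socket-only endpoint could look like (this is an assumption about the repro, not the actual code in the linked repository; the hostname is a placeholder, and the method deliberately blocks on the async network tasks, as discussed further down):

// Assumes: using System; using System.Diagnostics; using System.Net;
// using System.Net.Sockets; using Microsoft.AspNetCore.Mvc;
[HttpGet("sqlnet")]
public IActionResult GetSqlNet()
{
    const string serverName = "eshopsql-e3h5ug5xd3rfs.database.windows.net"; // placeholder host
    const int port = 1433;

    var sw = Stopwatch.StartNew();

    // Resolve the SQL Server host name, blocking on the async task (sync-over-async).
    IPAddress[] addresses = Dns.GetHostAddressesAsync(serverName).GetAwaiter().GetResult();
    Console.WriteLine($"DNS Resolution Time {sw.ElapsedMilliseconds}");

    // Open a TCP socket to the resolved address, again blocking on the async task.
    using (var socket = new Socket(SocketType.Stream, ProtocolType.Tcp))
    {
        socket.ConnectAsync(addresses, port).GetAwaiter().GetResult();
        Console.WriteLine($"Socket Connected {sw.ElapsedMilliseconds}");
    }

    return Ok();
}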

@saurabh500
Contributor

For a single TCP connection to Azure SQL DB (elapsed times in ms):

DNS Resolution Time 425
Socket Connected 476

Using wget to send multiple requests to the sqlnet endpoint:

DNS Resolution Time 3009
DNS Resolution Time 9520
DNS Resolution Time 20142
DNS Resolution Time 2031
DNS Resolution Time 3540
DNS Resolution Time 20163
DNS Resolution Time 1054
Socket Connected 3055
DNS Resolution Time 20181
Socket Connected 9569
Socket Connected 20185
DNS Resolution Time 5571
DNS Resolution Time 6576
DNS Resolution Time 20191
Socket Connected 2075
DNS Resolution Time 20193
DNS Resolution Time 20193
DNS Resolution Time 13585
DNS Resolution Time 11587
Socket Connected 3584
DNS Resolution Time 18592
DNS Resolution Time 15592
DNS Resolution Time 17594
DNS Resolution Time 20205
Socket Connected 20206
DNS Resolution Time 16598
DNS Resolution Time 19602
DNS Resolution Time 14600
DNS Resolution Time 20212
DNS Resolution Time 20217
DNS Resolution Time 8605
Socket Connected 1102
DNS Resolution Time 10610
DNS Resolution Time 7609
DNS Resolution Time 12606
Socket Connected 20223
DNS Resolution Time 4610
DNS Resolution Time 20226
Socket Connected 5613
Socket Connected 6620
Socket Connected 20236
Socket Connected 20239
Socket Connected 20239
Socket Connected 13628
Socket Connected 11631
Socket Connected 18638
Socket Connected 15637
Socket Connected 20248
Socket Connected 16641
Socket Connected 19644
Socket Connected 17645
Socket Connected 14642
Socket Connected 20255
Socket Connected 8647
Socket Connected 20260
Socket Connected 12654
Socket Connected 10654
Socket Connected 7653
Socket Connected 20268
Socket Connected 4653

The DNS resolution time keeps increasing with every request. As a result, the SqlConnection timeout has already expired by the time a connection to SQL Server is established.

@saurabh500
Contributor

saurabh500 commented Dec 5, 2017

I am using the following script on Ubuntu to emulate multiple concurrent connections.

#!/bin/bash
for number in {1..20}
do
    wget http://localhost:22354/api/values/sqlnet &
done
# wait for the background wget processes before exiting
wait
exit 0

The author of the issue reported the repro on Docker; however, I can repro this issue even on a host Ubuntu VM.

To find out if the DNS resolution on the OS itself is slow, I tried

nslookup eshopsql-e3h5ug5xd3rfs.database.windows.net

and the results for nslookup were almost instantaneous.
However, programmatic DNS resolution using Dns.GetHostAddressesAsync is very slow to execute, and performance deteriorates with the number of parallel connections in ASP.NET.

To run the repro, navigate to sqltestapp/sqltestwebapi and run dotnet run

@saurabh500
Contributor

Copying folks from networking for guidance
cc @davidsh @stephentoub @wfurt

@wfurt
Member

wfurt commented Dec 5, 2017

Do you have sample code for the Dns.GetHostAddressesAsync call, @saurabh500?
Did you run it in a container or just on the base OS? I assume your tests are on an Azure instance?

@saurabh500
Contributor

I have posted the link to the repro.
I am running on the base OS on a VM in the office.

@saurabh500
Contributor

I've tried the repro in a container and I get worse results

@wfurt
Member

wfurt commented Dec 5, 2017

That is not that surprising. I do see the SQL demo but not a pure DNS test.
If I understood your post, you had a test with Dns.GetHostAddressesAsync without any SQL, right?

As for "base OS on a VM" -> does that mean a full Linux VM in, let's say, Hyper-V on a Windows desktop?

@saurabh500
Contributor

In the repro there is an endpoint called sqlnet which only calls the networking APIs.
It tries to do something close to what SqlClient does to establish a network connection.

The OS is hosted on Hyper-V.

@Drawaes
Contributor

Drawaes commented Dec 5, 2017

Have you tried changing from connection.Open to connection.OpenAsync, e.g.

using (var connection = new SqlConnection(connectionString))
{
    try
    {
        await connection.OpenAsync();

as well as

var reader = await command.ExecuteReaderAsync(System.Data.CommandBehavior.CloseConnection);

I have noticed that Linux (with ASP.NET Core) is far more susceptible to thread starvation. Also, a Docker machine normally has a very low core count, so it dies off quickly and you start to hit timeouts.

Try changing to the above and check the results (also, was your DNS test done using blocking or async code?)
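
For illustration, a fully asynchronous variant of the GetNative method quoted earlier might look roughly like this (a sketch, not a verified drop-in replacement; the route name and abbreviated query are illustrative only):

// Assumes: using System.Data.SqlClient; using System.Threading.Tasks;
// using Microsoft.AspNetCore.Mvc;
[HttpGet("native-async")]   // illustrative route name
public async Task<IActionResult> GetNativeAsync()
{
    var connectionString = "Server=tcp:...;Initial Catalog=orderingdb;..."; // same connection string as the original repro
    using (var connection = new SqlConnection(connectionString))
    {
        // Asynchronous open: no thread-pool thread is blocked while the connection is established.
        await connection.OpenAsync();

        var command = connection.CreateCommand();
        command.CommandText = "SELECT o.[Id] FROM [ordering].[Orders] o"; // abbreviated; the repro uses the full GROUP BY query

        using (var reader = await command.ExecuteReaderAsync())
        {
            var count = 0;
            while (await reader.ReadAsync())
            {
                count += 1;
            }
            return Ok(count);
        }
    }
}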

@saurabh500
Contributor

@Drawaes I am trying to understand what is causing the slowdown of subsequent DNS lookups from the DNS APIs, which both the SqlConnection Open and OpenAsync code paths use.
The result is that the DNS lookup takes a long time and eventually causes timeouts in SqlConnection.Open.
I will give it a try with SqlConnection.OpenAsync to see what the degradation is there. However, I have narrowed the repro down to Dns.GetHostAddressesAsync, which is taking a long time even without SqlClient.

The degree of parallelism for which this bug was reported was only 15 concurrent users.
Based on your experience with ASP.NET Core, do you have any recommendations on Docker container configurations that I could use to see if this issue is alleviated?

@saurabh500
Contributor

From what I am reading, Docker doesn't impose any limitations on the resources of a container. I upgraded my VM to 4 cores and I could still repro the issue.

After this I changed the code to await the task which waits on the network socket connection, and I saw a tremendous improvement in DNS resolution. I think this points in the direction of @Drawaes' theory of thread starvation.

I then went ahead and modified all the SqlClient calls to be async, saw a big improvement in performance, and could scale the requests to 10000 concurrent users in VS (thanks @Drawaes).

@wfurt does it still make sense to investigate why blocking calls slow down under high concurrent load?

@wfurt
Member

wfurt commented Dec 5, 2017

With blocking calls, everything will be serialized. The part I don't understand is why the time would increase over time - unless we are building a backlog, e.g. we are asking for lookups faster than we are processing them.

@saurabh500
Contributor

saurabh500 commented Dec 5, 2017

Yes, that's weird. I don't know enough about the internal implementation of Dns.GetHostAddressesAsync to make a guess.

As an experiment based on your comment above: if I change all calls in the repro to be blocking, e.g. Dns.GetHostAddresses(serverName); and socket.Connect(targetAddr, port);, the calls to the network APIs are very quick!! Sub-100 ms response.

I would imagine that if I am waiting for the DNS resolution and socket connect async tasks to finish in a blocking context, that should take the same amount of time as calling the sync counterparts, but that doesn't seem to be the case here.

I will evaluate whether the blocking APIs of SqlClient should be changed to use blocking network calls, if that yields better performance.

@stephentoub
Member

I would imagine that if I am waiting for the DNS resolution and socket connect async tasks to finish in a blocking context, that should take the same amount of time as calling the sync counterparts, but that doesn't seem to be the case here.

Imagine you have four threads in the thread pool, and on those four thread pool threads, you initiate a synchronous operation; each completes synchronously and the operations go on their merry way.

Now imagine you have four threads in the thread pool, and on those four thread pool threads, you initiate an asynchronous operation that will queue a work item back to the thread pool at some point during its invocation, and then you block on the result of that asynchronous operation. You're now blocking all four threads in the pool waiting on those four asynchronous operations to complete, but those four asynchronous operations that were initiated are themselves queueing work items to the pool that require threads to run, and those work items need to complete before the blocking is ended. If the limit on the pool were four threads, you've essentially deadlocked. Assuming the limit is not, eventually the thread pool will detect the starvation and will inject more threads, but it does so relatively slowly. That's very likely contributing to the lag you're seeing.
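
To make this concrete, here is a small self-contained console sketch of that sync-over-async starvation pattern (purely illustrative, not code from this repro or thread; ThreadPool.SetMinThreads is used only to make the effect visible quickly):

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class StarvationDemo
{
    static async Task SimulatedLookupAsync()
    {
        // The continuation after this await needs a free thread-pool thread to run.
        await Task.Delay(100);
    }

    static void Main()
    {
        // Keep the pool small so the starvation shows up quickly.
        ThreadPool.SetMinThreads(4, 4);

        var sw = Stopwatch.StartNew();
        var requests = Enumerable.Range(0, 20).Select(i => Task.Run(() =>
        {
            // Sync-over-async: this pool thread blocks on a task whose
            // continuation itself needs a pool thread, so requests queue up
            // and completion times grow as the pool slowly injects threads.
            SimulatedLookupAsync().GetAwaiter().GetResult();
            Console.WriteLine($"request {i} done after {sw.ElapsedMilliseconds} ms");
        })).ToArray();

        Task.WaitAll(requests);
    }
}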

@Drawaes
Contributor

Drawaes commented Dec 5, 2017

:) It's something I see over and over. It's due to the context of ASP.NET: it uses the same thread pool to service requests. You get a limited number of thread-pool threads (I forget the core count multiplier off the top of my head). There is a hill-climbing algorithm that adds roughly one thread every 500ms, so if the CPU use is low and all threads are busy you get a very slow ramp-up.

So you block on a network call for, say, 200ms; in the meantime there are x requests waiting for ASP.NET. You finally release that thread and then you have a backlog of requests, and then it takes longer and longer.

You can't block on IO in ASP.NET if you don't want to bring down your server. We showed in the aspnet repo that a small number of requests, like 10, on a low-CPU-count Docker container could bring it to its knees.

@Drawaes
Contributor

Drawaes commented Dec 5, 2017

@stephentoub beat me with a better explanation.

@CESARDELATORRE
Author

I can understand the improvements when using asynchronous calls, but with synchronous calls we can reproduce the problem with a low number of users and, most of all, we cannot repro it with a similar or higher number of users when load testing the same code on Windows or Windows containers.
Shouldn't it behave and perform in a comparable way to running the same C# code on Windows?
Also, there's a lot of code out there from customers who are not using asynchronous calls.
I believe that using asynchronous calls improves the results but hides the real issue/cause.

@saurabh500
Contributor

@stephentoub and @Drawaes Thanks for the insights into what is going on with this issue.
Based on @stephentoub's explanation, I understand that DNS resolution is slow because the DNS resolution work is waiting to be scheduled, not because the actual resolution is taking time.

Currently SqlConnection.Open() calls Dns.GetHostAddressesAsync(serverName); and socket.ConnectAsync(targetAddr, port); and blocks waiting for these two asynchronous operations to complete. Their continuations may not get scheduled quickly enough, depending on the ThreadPool size.

SqlConnection.Open() should instead call the blocking versions of the methods above, i.e. Dns.GetHostAddresses(serverName); and socket.Connect(targetAddr, port);, so that the connectivity operation happens on the same thread.

SqlConnection.OpenAsync(), which returns a task, should also use the non-async versions of the APIs, Dns.GetHostAddresses(serverName); and socket.Connect(targetAddr, port);, so that it completes the connection establishment on a single thread and the task that SqlConnection.OpenAsync() returns can be awaited, instead of awaiting the underlying networking tasks, which need more threads to be spun up.

I just checked the Windows implementation of TCP socket connections, and it always establishes connections synchronously for both Open() and OpenAsync().
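
To illustrate the proposed change (a hedged sketch only, not the actual SqlClient/SNITcpHandle code), the blocking path could resolve and connect on the calling thread like this:

// Assumes: using System.Net; using System.Net.Sockets;
static Socket ConnectBlocking(string serverName, int port)
{
    // Resolve and connect on the calling thread: no work item is queued back to the
    // thread pool, so the operation cannot be delayed by thread-pool starvation.
    IPAddress[] addresses = Dns.GetHostAddresses(serverName);
    var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
    socket.Connect(addresses, port);
    return socket;
}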

@CESARDELATORRE
Author

@saurabh500 Is there going to be a related fix for this issue? - Where and what timeframe can we expect?

@saurabh500
Contributor

Is there going to be a related fix for this issue?

Yes

Where and what timeframe can we expect?

I don't have the answer to this question right now.

@CESARDELATORRE
Author

Btw, can you confirm whether the issue/fix also applies to SqlConnection running on plain Linux, in addition to running in a Docker Linux container?

@saurabh500
Contributor

I have seen the issue on Ubuntu 16.04 as well as in a hosted Docker image.

@CESARDELATORRE
Author

OK, thanks for helping on this issue and jumping on it so quickly! 👍

@divega
Contributor

divega commented Dec 6, 2017

I was telling @saurabh500 offline that this part of the plan sounded very counterintuitive to me:

SqlConnection.OpenAsync(), which returns a task, should also use the non-async versions of the APIs, Dns.GetHostAddresses(serverName); and socket.Connect(targetAddr, port);, so that it completes the connection establishment on a single thread and the task that SqlConnection.OpenAsync() returns can be awaited, instead of awaiting the underlying networking tasks, which need more threads to be spun up.

I.e. shouldn't we be striving to make SqlConnection.OpenAsync() non-blocking throughout?

But then he explained to me that currently we end up blocking on the result of the async methods we call, in places like https://github.com/dotnet/corefx/blob/master/src/System.Data.SqlClient/src/System/Data/SqlClient/SNI/SNITcpHandle.cs#L130 and
https://github.com/dotnet/corefx/blob/master/src/System.Data.SqlClient/src/System/Data/SqlClient/SNI/SNITcpHandle.cs#L152

Those lines are part of the SNITcpHandle constructor, so it can't be async unless we move code around to defer the DNS resolution and the socket connection, which would require some additional work.

Without that change, I agree with @saurabh500 that it makes sense to switch to the sync versions of the methods in the short term, but I think we should pursue removing blocking calls from the implementation longer term. @saurabh500 @geleems I will create an issue for this.
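
A hedged sketch of that longer-term restructuring, i.e. moving the blocking work out of the constructor into an async factory method (names are illustrative only; this is not the corefx SNITcpHandle code):

// Assumes: using System.Net; using System.Net.Sockets; using System.Threading.Tasks;
sealed class TcpHandle
{
    private readonly Socket _socket;

    private TcpHandle(Socket socket) => _socket = socket;

    // The constructor stays trivial; DNS resolution and the socket connect happen
    // in an awaited factory method, so no thread ever blocks on .Result/.Wait().
    public static async Task<TcpHandle> ConnectAsync(string serverName, int port)
    {
        IPAddress[] addresses = await Dns.GetHostAddressesAsync(serverName);
        var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
        await socket.ConnectAsync(addresses, port);
        return new TcpHandle(socket);
    }
}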

@Drawaes
Contributor

Drawaes commented Dec 6, 2017

Be warned that adding blocking back in badly affects ASP.NET's performance. If a connection is being opened and closed rapidly it can bring a web server to its knees. Is there no other way around this?

@divega
Contributor

divega commented Dec 6, 2017

Be warned that adding blocking back in badly affects ASP.NET's performance

@Drawaes the problem is that the code is already blocking; it just calls the async API and then blocks on Task.Result. That seems to be worse than just calling the sync version, unless I missed something.

Also, it seems the actual way around this would be to implement what I describe in https://github.com/dotnet/corefx/issues/25742. That doesn't seem too complex to me, but I don't know if @saurabh500 agrees with that assessment.

@Drawaes
Contributor

Drawaes commented Dec 6, 2017

I guess my point would be: why not fix it now? All that code in the constructor seems like it could be moved to a method that is called from ConnectAsync or ParallelConnectAsync; there aren't any blocking connects in there.

@saurabh500
Contributor

saurabh500 commented Dec 6, 2017

@divega

That doesn't seem too complex to me

I won't comment on the complexity, but I do know that this effort needs changes from the top-level API down to the networking layer, and it might creep into native SNI as well if done right. So in my view, doing this right would impact many areas.
I think we should take any discussion about making OpenAsync truly async to the issue that you have opened.

@Drawaes if you look at the call paths that lead to the DNS lookup and parallel connect, you will see that we block in quite a few places.

@Drawaes
Contributor

Drawaes commented Dec 6, 2017

@saurabh500 Agreed that async is a virus like that (in the nicest possible way). My point here is that unless this switch back to blocking is for a backport fix, aren't you better off trying to eliminate the blocking calls one by one? At the moment they're far easier to spot (search for .Result, .Wait, etc.) than a temporary hack that won't ship yet anyway.

I have a couple of things to finish up on SslStream, hopefully this week. Then I can try to take a look at what I can do as quick wins.... From scanning through, there are these areas:

  1. The constructor you identified
  2. The pool.... it blocks in there even if you call "OpenAsync"
  3. Same as 2, but there is even a Thread.Sleep .... 😢

@stephentoub
Member

And https://github.com/dotnet/corefx/issues/19836 😦

@divega
Contributor

divega commented Dec 6, 2017

+1 on removing blocking code from connection pooling for the async path. FWIW, this has come up in our discussions for https://github.com/aspnet/dataaccessperformance.

@Drawaes could you create a separate issue for this if one does not exist?

It is better to track those improvements separately because it is likely we will address them separately. E.g. we are considering taking the minimal fix for this issue in a 2.0.x patch, while the rest of the improvements could go in 2.1 or later.

@geleems

geleems commented Jan 10, 2018

The fix was merged to the master branch. Closing this issue.

@geleems geleems closed this as completed Jan 10, 2018
@TheRubble

Is this likely to be patched in 2.0 at all?

@saurabh500
Contributor

A PR to patch 2.0.x has been merged at dotnet/corefx#26247

@witua

witua commented Mar 23, 2020

Can I assume that this is also fixed in Microsoft.Data.SqlClient, and (correspondingly) EF Core 3? We are experiencing a similar issue.
