Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SqlConnection.Open() is not finished with success to specified Failover Partner on linux. #583

Closed
alexanderinochkin opened this issue May 29, 2020 · 12 comments

Comments

@alexanderinochkin
Copy link

The main problem is:

We use sql connection string with Failover Partner server and Connect Timeout = 30.
The program runs on linux (Ubuntu 19.10).
When Data Source server is not available (server turned off or blocked by firewall),
SqlConnection.Open() does not connect to Failover Partner server despite it is available and act as SQL principal.

The reason description:

As it was discovered, the problem is at SNITCPHandle.Connect(...) procedure: It does not terminate connection at specified timeout.
https://github.com/dotnet/SqlClient/blob/3ca39848735143055d9d7d4864d5f1bfd1976b99/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SNI/SNITcpHandle.cs

It uses CancellationTokenSource with Cancel() callback, but it does not really cancel socket.Connect() operation.

Here is the part of code with additional debug logging:

private static Socket Connect(string serverName, int port, TimeSpan timeout, bool isInfiniteTimeout)
{
	var sw = Stopwatch.StartNew();

	...

	CancellationTokenSource cts = null;

	void Cancel()
	{
		for (int i = 0; i < sockets.Length; ++i)
		{
			try
			{
				if (sockets[i] != null && !sockets[i].Connected)
				{
					Console.WriteLine($"{sw.Elapsed} Disposing socket {i}...");
					sockets[i].Dispose();
					Console.WriteLine($"{sw.Elapsed} Disposed {i}");
					sockets[i] = null;
				}
			}
			catch { }
		}
	}

	if (!isInfiniteTimeout)
	{
		cts = new CancellationTokenSource(timeout);
		cts.Token.Register(Cancel);
	}

	Socket availableSocket = null;
	try
	{
		for (int i = 0; i < sockets.Length; ++i)
		{
			try
			{
				if (ipAddresses[i] != null)
				{
					sockets[i] = new Socket(ipAddresses[i].AddressFamily, SocketType.Stream, ProtocolType.Tcp);
					Console.WriteLine($"{sw.Elapsed} Connecting socket {i}...");
					sockets[i].Connect(ipAddresses[i], port);
					Console.WriteLine($"{sw.Elapsed} Connect socket finished {i}.");
					
					...
				}
			}
			catch (Exception ex) { Console.WriteLine($"{sw.Elapsed} Exception on connect socket {i}: {ex.Message}"); }
		}
	}
	finally
	{
		...
	}

	return availableSocket;
} 

Below is the console output when run this methhod with timeout = TimeSpan.FromSeconds(10):

00:00:00.0018383 Connecting socket 0...
00:00:10.0036811 Disposing socket 0...
00:02:10.7964222 Disposed 0
00:02:10.8004390 Exception on connect socket 0: Connection timed out 10.0.0.16:1433
00:02:10.8006636 Finished.

As you can see, despite Cancel() callback is occured at spesified timeout, the Socket.Connect() is continue processing and finished with 130 sec timeout.
The reason of such behavior is well described at Marek Majkowski's article "When TCP sockets refuse to die": https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/

"..the operating system sends a SYN packet. Since it didn't get any response the OS will by default retry sending it 6 times.
The retries are staggered at 1s, 3s, 7s, 15s, 31s, 63s marks (the inter-retry time starts at 2s and then doubles each time). By default the whole process takes 130 seconds, until the kernel gives up with the ETIMEDOUT errno."

To solve the problem:

As a result: Socket.Dispose() is not proper method to cancel Socket.Connect() process.
To be able to finish connect in specified timeout, this should be refactored to use Socket.BeginConnect() call.

Where to reproduce:

It is reproduced on all versions of System.Data.SqlClient.SqlConnection and Microsoft.Data.SqlClient.SqlConnection.

@karinazhou
Copy link
Member

Hi @alexanderinochkin ,

Thank you for the discovery. I appreciate the detailed description and explanation for this problem. I will go through the reference article you attached and test it locally.

@karinazhou
Copy link
Member

karinazhou commented Jun 16, 2020

Hi @alexanderinochkin ,

I am trying to reproduce this issue locally but it actually can connect to the failover partner from my Ubuntu 18.10 after I stop the Data Source server service. For my testing, I have ServerA and ServerB which are configured as data mirroring to each other.

  1. I stop SQLSERVER service on ServerA and it cannot be accessible from Ubuntu

  2. ServerB is accessible and Database has the status (Principal, Disconnected) from SSMS
    Before I stop ServerA, the status of the database on ServerB is (Mirror, Synchronized / Restoring...)

  3. I provide the following connection string in my test app:
    Data Source=ServerA;Failover Partner=ServerB;UID=uid; PWD=pwd;Database=testdb;Connect Timeout=30;Pooling=False

  4. I have a test app doing the following simple connection open:

            using (SqlConnection conn = new SqlConnection(connString))
            {
                conn.Open();

                Console.WriteLine($"Connected to SQL Server v{conn.ServerVersion} from {Environment.OSVersion.VersionString}");
            }

It always connects to the failover partner ServerB for me...
Have you tried with Socket.BeginConnect() with your failover server and does it work for you?


Another thing I would like to check with you is whether you are connecting to a SQL Server availability group listener or SQL Server Failover Cluster Instance instead of data mirroring servers? And do you have the MultiSubnetFailover option set to be true? According to this docs, When MultiSubnetFailover=True is specified for a connection, the client retries TCP connection attempts faster than the operating system’s default TCP retransmit intervals. There are some limitations when MultiSubnetFailover is on when you are not connecting to a SQL Server availability group listener or SQL Server Failover Cluster Instance.

The Socket.Connect() establishes a network connection synchronously while Socket.BeginConnect() does it in an asynchronous mode. Could you explain more why we should use the asynchronous connection for general synchronous cases?

@alexanderinochkin
Copy link
Author

alexanderinochkin commented Jul 1, 2020

Hi @karinazhou,

I stop SQLSERVER service on ServerA and it cannot be accessible from Ubuntu

It is not enough to stop SQLSERVER service.
To reproduce described problem you should disable ServerA IP address on Ubuntu firewall (with drop packets) to imitate server down.
To check that testing environment is configured correctly, you can run "telnet ServerA 1433" on Ubuntu. It should fail with long timeout (about 130 sec).

Could you explain more why we should use the asynchronous connection for general synchronous cases?

From the code of "private static Socket Connect(..)" function it is expected that call of "void Cancel()" will terminate sockets[i].Connect(ipAddresses[i], port) process. Otherwise no need to register it with cancellation token "cts.Token.Register(Cancel)". Am I right?

But in real "sockets[i].Dispose()" call does not return immediatelly and does not terminate "sockets[i].Connect(ipAddresses[i], port)" call until described timeout (130 sec).
Hereby if you want to terminate "sockets[i].Connect(ipAddresses[i], port)" after specified timeout, you should use some another mechanism but not Dispose() call.

@karinazhou
Copy link
Member

Hi @alexanderinochkin ,

I tried your suggestion to block the ServerA IP on Ubuntu and I can now reproduce the hanging behavior. If the connection timeout is increased to more than 130s such as 180s, it can connect to the failover partner in the end. In fact, even without giving the failover partner in the connection string, the connection will fail after about 130s as you have discovered.

Another thing I found is that if you try something like
Data Source=ServerA;UID=uid; PWD=pwd;Database=db;MultiSubnetFailover=True;Connect Timeout=15;Pooling=False,
the driver will go into TryConnectParallel() instead of Connect(). The connection will fail after the expected timeout. However, you cannot use MultiSubnetFailover=True with Failover Partner at the same time.

When testing the managed SNI on Windows by setting the following two lines of code in your application :

string ManagedNetworkingAppContextSwitch = "Switch.Microsoft.Data.SqlClient.UseManagedNetworkingOnWindows";
AppContext.SetSwitch(ManagedNetworkingAppContextSwitch, true);

the Socket.Dispose() can work as expected. However, it doesn't work as expected on Ubuntu which I agree with.
We will investigate more to see what possible mechanism we can use to actually dispose of the socket.

@karinazhou
Copy link
Member

karinazhou commented Jul 13, 2020

Hi @alexanderinochkin ,

I tried a standalone application to simulate the behavior we have in driver's SNITCPHandle: Connect(). Instead of calling the Socket.Connect, it calls BeginConnect() and EndConnect() which are for Async request. However, when I test this standalone app on Ubuntu, it still needs about 130s to kill the TCP request. From the debug message I added, it is hanging at EndConnect() which is the same as Dispose() before.

Debug : ts 00:00:10
Enter testConnect...
testConnect : 00:00:00.0047853 Connecting socket 0...
testConnect : 00:00:10.0133238 Timeout 0.
testConnect: 00:00:10.0135108  Before EndConnect...
Enter ConnectCallback...
Leave ConnectCallback...
testConnect : 00:02:09.5913178 Exception on connect socket 0: Connection timed out 
Exit testConnect...

You can find my testing program attached here. Since I am new to Async coding, please feel free to point out any mistakes I may have in the code.

@karinazhou
Copy link
Member

While investigating another issue related to SNITCPHandle.cs, we find the following change which goes back in 2018:
dotnet/corefx#26247

In this PR, the original ConnectAsync() implementation was replaced by a sequential Socket.Connect() logic. This change is to fix the issue when there are multiple concurrent users connecting to the server. The detailed background can be found here:
dotnet/runtime#24301

@alexanderinochkin
Copy link
Author

HI @karinazhou

I think this implementation will be optimal. It should not do multiple concurrent connections, but allow to accurately measure connect timeout. Here no any operation which unexpectedly block thread in the code.

private static Socket Connect(string serverName, int port, TimeSpan timeout, bool isInfiniteTimeout)
{
	var sw = Stopwatch.StartNew();

	IPAddress[] ipAddresses = Dns.GetHostAddresses(serverName);
	IPAddress serverIPv4 = null;
	IPAddress serverIPv6 = null;
	foreach (IPAddress ipAddress in ipAddresses)
	{
		if (ipAddress.AddressFamily == AddressFamily.InterNetwork)
		{
			serverIPv4 = ipAddress;
		}
		else if (ipAddress.AddressFamily == AddressFamily.InterNetworkV6)
		{
			serverIPv6 = ipAddress;
		}
	}
	ipAddresses = new IPAddress[] { serverIPv4, serverIPv6 };

	Socket availableSocket = null;
	for (int i = 0; i < ipAddresses.Length; ++i)
	{
		try
		{
			if (ipAddresses[i] != null)
			{
				Socket socket = new Socket(ipAddresses[i].AddressFamily, SocketType.Stream, ProtocolType.Tcp);
				Console.WriteLine($"{sw.Elapsed} Connecting socket {i}...");

				IAsyncResult result = socket.BeginConnect(ipAddresses[i], port, null, null);
				result.AsyncWaitHandle.WaitOne(timeout, true);

				Console.WriteLine($"{sw.Elapsed} Connect socket finished {i}");

				if (socket.Connected)
				{
					availableSocket = socket;
					break;
				}
				else
				{
					Console.WriteLine($"{sw.Elapsed} Before socket.close() {i}.");
					socket.Close();
					Console.WriteLine($"{sw.Elapsed} After socket.close() {i}.");
				}
			}
		}
		catch { }
	}

	return availableSocket;
}

@karinazhou
Copy link
Member

Hi @alexanderinochkin ,

I tried your proposed code snippet and it didn't hang the application after a pre-defined timeout which is good. However, I still have some concerns about this potential fix.

After calling Socket.BeginConnect(), there is no corresponding EndConnect() called afterwards. Accroding to MDSN Socket.BeginConnect and Socket.EndConnect, EndConnect is supposed to be called after BeginConnect. Though the Socket.Close() is called when socket is not connected, is there any unexcepted consequence without calling EndConnect ?

@karinazhou
Copy link
Member

Hi @alexanderinochkin ,

I tested your suggested change with one of the old issues we have involving multiple concurrent users:
dotnet/runtime#24301

If we change the socket connection to async calls (Socket.BeginConnect), it will bring back the old SqlException.

I have attached the repro application which is originally from
dotnet/runtime#24301 (comment)

To test it, you can replace 'ConnectionString' with your accessible server IP and login credentials in appsettings.json. By default, it uses the custom Microsoft.Data.SqlClient nuget package which I tested with your Socket.BeginConnect fix. I have also attached the custom driver's nuget package for you to test. You will need to copy this file to your .nuget cache on your ubuntu machine:

/home/yourUserName/.nuget/packages/microsoft.data.sqlclient

After coping with the whole project to your Ubuntu machine, go to \sqltestapp\sqltestwebapi and type dotnet run. This will start the service. Now, you can run the concurrentTest.sh which simulates around 50 multiple concurrent connections to the endpoint. You will get the following SqlException with this custom MDS driver:

Microsoft.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 40 - Could not open a connection to SQL Server)

If you change the MDS version to 2.0.0 in sqltestwebapi.csproj, for example, and test it again, you won't see the SqlException this time.

2.1.0-devgh.583.zip
SQLConnNetCore2IssueRepro.zip

@alexanderinochkin
Copy link
Author

Hi @karinazhou,
Thanks for detailed explanation why BeginConnect() cannot be used at this place.
Based on your wide knowledge you can fix original bug any proper way.

@karinazhou
Copy link
Member

Hi @alexanderinochkin ,
I understand and agree that the async implementation can fix the limitation of Socket.Connect on Linux platform when the target resource is inaccessible by unblocking the working thread. However, we cannot roll back to the original async implementation for this due to its regression in another issue I have mentioned above.

We will keep this in our backlog for further investigation until we find a proper way to fix it. For now, I am afraid that we have to keep using Socket.Connect in the driver.

@arellegue
Copy link
Contributor

This issue is no longer reproducible in the current version of the driver.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants