-
Notifications
You must be signed in to change notification settings - Fork 292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Possible lock contention/thread pool starvation when acquiring access token causing timeout exceptions #2152
Comments
My colleague @MarcWils noticed: If we put the following line of code before Task.Run in our test project, the application will run just fine: ThreadPool.SetMinThreads(100, 100);
var tasks = Enumerable ... It seems when there is enough threads available to handle the tasks, then the timeout exceptions doesn't occur in our reproduction project. While debugging he also caught this in his attention: There are a lot of threads waiting when trying to get a SqlConnection. See the printscreen This seems like a bad practice. |
@JRahnama thanks for your reply. I've changed our test project so MARS is explicit disabled in connectionstring & running on Windows Machine, but the exception still occur. |
that sounds like thread starvation issue and the only workaround for now would be limiting threads as you have done. We have noticed if you set the maxpoolsize to 20 that will help too. Also there has been some improvements on Net7 and after version. We will go through the repro and see if there is anything new beside thread starvation issue. |
Hello, I am a colleague of Ka-Wai. We looked at the issue together last week. I believe there's a conceptual design flaw in SqlnternalConnectionTds calling The problem we're facing is a burst scenario where we have many incoming connections as soon as the applications starts. Debugging this scenario lead us to Restricting the max. pool size to a lower value may be a workaround for this issue. However, this may cause other issues in the application. Setting the
|
@MarcWils, have you tried opening one connection first as pool warming and try the rest after? |
@JRahnama we tried that. But when the Access Token expired (after 24 hours+) we observe same issue with thread starvation. |
You are correct here. However, MSAL doesn't offer any sync APIs and itself only uses async HTTP client APIs. So, unfortunately, SqlClient is eventually limited to calling an async API from within the sync Connection.Open() flow. I was looking for better ways of handling the unfortunate pattern (I've read the common "solutions", which all have at least one pitfall) and thinking about using a single thread synchronization context that would dedicate a thread to handle all the async token acquisition tasks to keep them off the system thread pool. I got the idea looking at how NpgSql was handling a similar situation. |
Seems like a creative solution which might work. And if I understand it correctly, the authentication token is cached at level of the connection pool. If a cached token is found, no new call will be made to the authentication token provider. We use managed identities where tokens have a lifetime of 24 hours. In this scenario, the SingleThreadSynchronizationContext would create a new thread when the first SQL connection is opened to get the token. Subsequent SQL connections can use the cached token and the SingleThreadSynchronizationContext would let the thread terminate. After 24 hours, a new thread is started to refresh the token. Which after a short while will also terminate. The overhead seems very limited. At least for the sync For our use case, @kwlin is working very hard to refactor all SQL calls to async. That should provide better performance overall. |
@vonzshik thanks for the headsup :) |
We had exactly same issues with a major outage today using Web Apps, user assigned identity and SQL Managed instance. We've changed back to using SQL logins for now. In our conn string we have min pool size 5 and max 300. Any suggestion what to try or wait for a fix? We are working on .net 6->7 upgrade would this help? |
@kwlin I gathered you fixed your issues with async methods? Is that considered a workaround? Seems like safest option would be to wait a fix for this? |
Adding some investigation results: I tried the Single Threaded Context path and that got me through the async tasks within MDS itself. But as soon as execution goes into MSAL or Azure.Identity, their async tasks are all created with ConfigureAwait(false) (standard/best practice for libraries), which moves those tasks off my Single Threaded Context and subsequently hangs with the same problem - no available threads in the system thread pool to service their tasks. So seems we might be stuck. We need sync-only APIs in MSAL and Azure Identity in order to avoid this issue. 😞 |
Describe the bug
We are experiencing connection pool problems during high load and when the Azure Access Token is expired. The following exception will be thrown:
This exception will be thrown for about 20 minutes and after that duration our operations of our component will proceed normal as expected. It seems that exception occur during startup of our application or when the Azure Access Token is expired.
To reproduce
I've included a Visual Studio solution with a reproduction of the issue with similar exceptions.
HighLoadAndAzureActiveDirectory.zip
Expected behavior
We expect that when getting a SQL Connection, the SQL Connection Pool shouldn't throw InvalidOperationException during high load or when the access token is expired.
Further technical details
Our system is a .NET 6 & CoreWCF application hosted in Service Fabric. We are using Managed Identity to connect to an Azure SQL database. We should handle more than 100 concurrenct SOAP requests during normal operations (and all the concurrent SOAP requests must connect to our Azure SQL Database with the Managed Identity). We don't have special configurations for Pool Size; just default values.
We are using the following libraries:
Additional context
We are seeing lock contentions/thread pool starvation when acquiring an access token for a SQL Connection. I've provided a reproduction project with similar effect when multiple tasks are opening a SQL connection to Azure SQL Database with a Azure Active Directory.
Following are printscreens of our debugging session:
It seems there are two threads are blocked in GetFedAuthToken, while 49 threads trying to get a connection from DbConnectionPool (TryGetConnection),
Browsing in the Microsoft.Data.SqlClient github repository, we noticed the following code:
https://github.com/dotnet/SqlClient/blob/7b2e2fc9f002e8b270894595f5f92670670053d7/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlInternalConnectionTds.cs#L2416C50-L2416C50
Could this code create a lock contention/thread pool starvation? We've noticed before this construction can cause lock contention/thread pool starvation. In some places this construction is used to go from async to sync: https://github.com/aspnet/AspNetIdentity/blob/main/src/Microsoft.AspNet.Identity.Core/AsyncHelper.cs
PS: I'm not sure if I'm using the correct terminlogy lock contention vs thread pool starvation.
The text was updated successfully, but these errors were encountered: