-
Notifications
You must be signed in to change notification settings - Fork 292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add experimental mars IO task scheduler #1543
Conversation
@Wraith2 is this ultimately needed because the managed MARS implementation does sync-over-async internally? If so, it's important to make sure everyone understands this isn't a real fix (as you've mentioned above), and I do hope there'll be a plan for properly fixing this (i.e. by modifying the managed MARS implementation to stop doing symc-over-async). I understand why you'd do this (and not criticizing at all) - it's just that adding complex temporary workarounds ultimately makes the driver even more complicated to maintain etc. FWIW Npgsql also does sync-over-async internally in some specific scenario, and therefore also moves completions to executer in a dedicated pool - so similar to what you've done here. Though in the Npgsql case this is done because of an actual constraint in the communication with PostgreSQL, rather than a driver implementation issue which could be fixed (as I'm assuming is the case here). |
Yup.
I don't really see a way to avoid that it because you have to keep moving messages on a muxed connection or you can block one mux with another. If you have a sync connection that isn't moving because it's waiting on something it would block all the other sync connections on the same mux, that would be a breaking change in behaviour. This is the minimal workaround. I took a sample which is known to work and has been thoroughly audited by people in runtime familiar with it and added the smallest set of changes possible to use it. I don't really like using this approach which is why I've not pushed it to a PR before but i don't see another way to change the current behaviour. A lot of other work on the driver is stuck behind this support issue so i'd like to see this out of the way even if it is temporarily so that we can get on with the work of modernising the code and get to a situation where rewriting the code is possible. |
I don't know anything about how MARS is implemented, but I'd assume that a muxed physical connection would involve buffering data for the various commands. If a read comes in for data that has already been buffered, it wouldn't need to do I/O at all; and if the data hasn't yet been buffered, then in any case we need to wait for the new data to come in after the currently-blocking read is done., But again, this is a completely external point of view and I may be misunderstanding the whole thing. More importantly, unless I'm mistaken, the Windows non-managed implementation doesn't seem to be suffering from this issue. If that's so, so it may be worth looking into what it does (if it can do it, maybe the managed implementation can do the same). |
It wouldn't use a managed thread pool thread so it wouldn't have this problem. It can use a dedicated thread or another method of completion notification. The code is closed so I can't look but the MS team can and haven't pulled any inspiration from however it works so I assume it isn't applicable to the managed version. As you say this is a workaround. It's not a great one but it does seem to work reasonably well. In testing I got a single trapped connection throw an exception with 200 parallel threads over 16 cores. The single one is the on that hits the threadpool just before it starts to have a queue length>0. Overall this keeps things moving so it's better than nothing, but only just. |
You are spot on for the native SNI code. Completions don't rely on the managed threadpool so they don't get blocked. I've heard the term CPIO thrown around relative to the native socket calls. I wish there was a way to detect that first blocked thread. Is there a place we can use our own threadpool for all async network read completions so that no reads are ever blocked? |
We can simply force all mars IO onto our own threads, that's as simple as changing: if (LocalAppContextSwitches.UseExperimentalMARSThreading
-#if NETCOREAPP31_AND_ABOVE
- && ThreadPool.PendingWorkItemCount>0
-#endif To take out the ThreadPool work item count, just use our ThreadPool if the switch is set. I did this in testing and it works. The problem is that it won't scale because I've just forced the number of threads to an arbitrary number, 3. If someone generates enough IO to saturate those 3 threads we're back in the same position but with a worse ThreadPool that won't add new threads. If we want it to scale we end up with our own ThreadPool implementation and that's very hard and a big resource burden for maintenance. I may be too pessimistic though. I'd like to at some feedback from people experiencing this problem to see if much better is good enough. Much better paired with retry could entirely mask the problem. |
Any feedback on this? I thought it was a fairly major issue that wanted solving. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with this PR. It would be nice to get some feedback from users on #422. Even if no one responds there, this seems like it would be pretty safe to merge for the next preview. If it causes issues, we can easily revert it.
src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SNI/SNIMarsConnection.cs
Outdated
Show resolved
Hide resolved
That's what I was hoping for. People will need to turn it on to try it though. |
I'm ready to try this on sandbox, running stress tests. Problem is my API's are 2.2, not 3.1. |
It'll work on 2.1 you'll just get all mars io on the dedicated scheduler. |
If you have 20 threadpool threads and try to start 100 connections the threadpool threads are quickly consumed trying to do open tasks and there is little time left for them to work on the async task needed to keep mars moving. If you look at the image in the top post in this thread you'll see that there are 3 lanes for mars io dedicated threads and then a large number of existing thread pool threads, then towards the bottom you see new threads being added in a stepped pattern at a rate of roughly 1 per second. This is the thread pool hill climbing algorithm adding threads slowly under pressure. All this change does it make it so that there is always some amount of thread time available for mars async receive operations by detecting task pressure (in an admittedly crude way) and diverting the mars task activity onto separate threads which can't be consumed by users' tasks. I'm very surprised it makes no difference for you. That makes me wonder if it's turned on. The difference for me is clear. Instead of a large delay at the start of the test run tasks immediately start running, Without this there is a 10 second pause while a number of connections start to open and then hang because the messages aren't being pumped and then a number of them timeout after which newly opened connections start making progress. With this change I immediately see progress being made and sometimes I'll see a timeout. |
That happens occasionally with async things that are delayed. Essentially a packet is waiting to be scheduled but the connection is closed first meaning the stream is now null, we could quite easily just null check and quietly exit in that case, It is something on my list of things to get around to and would be handled if I get the chance to extend nullability annotations down into the managed sni implementation. |
It works on Windows. By running on Linux I found the following exceptions:
Check the output from Linux: *************** Running issue #422 **************Waiting time: 00:00:00.0172121 The original message is: COM Interop is not supported on this platform.. exception type is: System.PlatformNotSupportedException. It happened at Number: 7 for iteration number of: 0. ---> (Inner Exception #2) System.NullReferenceException: Object reference not set to an instance of an object. ---> (Inner Exception #3) System.NullReferenceException: Object reference not set to an instance of an object. ---> (Inner Exception #4) System.NullReferenceException: Object reference not set to an instance of an object. ---> (Inner Exception #5) System.NullReferenceException: Object reference not set to an instance of an object. ---> (Inner Exception #6) System.NullReferenceException: Object reference not set to an instance of an object. ---> (Inner Exception #7) System.NullReferenceException: Object reference not set to an instance of an object. Aborted |
Is there any way you could add in a first chance exception handler and see if you can get some sort of coherent stack trace information so I can identify what's using COM? Async stack traces like that are virtually useless and adding in the xunit spam just makes them worse. |
I ran it on command mode without attached debugger. If you add event source traces in your code I'll collect those logs. |
I've added a first chance handler that will be initialized before any tasks are put onto the new scheduler. If triggered it should dump the stack trace to the console and sni trace log. |
@DavoudEshtehari can I get those traces? |
Doh, of course setting |
The exception is fixed. But I don't see any improvement on Linux. Is this what you expected? |
It working is an improvement of sorts but no that isn't what I expect. For both @David-Engel and I enabling the switch Given that I haven't changed the unmanaged implementation it shows that we were both using the managed implementation when testing on windows when we saw the improvement. I don't know why it wouldn't repro on linux. |
I can see the improvement in Windows. Apparently, thread management is not similar on Windows and Linux. I'm wondering if it's something that should be added to the dotnet/runtime list. |
Unfortunately I just don't have the skills required to track this into linux land. @roji is there any chance of getting a little of your time and expertise on this? |
I checked the unix library by decompiling it to make sure all the required functionality is there, and it is. Thread scheduling is not vastly different on *nix to windows. The abstractions provided by the framework would take care of that or we, and everyone else, would be aware that threading worked differently. If we assume that you can set an appcontext switch properly by copy pasting then that only leaves one thing that can be differently, the threadpool queued count. So I've taken out the check for that and made it so that if the switch is set it will always use the dedicated threads. Can you retest please @DavoudEshtehari ? it'd be really good to get this into the latest preview. |
Unfortunately, still no improvements. Please find the event source logs in the attached file. |
It's working. Sort of. I can only see one exception in the trace even though it's hard to work with. It looks like the IO is all being performed on the 3 IO threads but continuations of some kind are being queued back onto the threadpool and that's what you're seeing blocking the throughput at the start. So you're seeing a wait but you aren't getting many failures. If we can work out what that continuation is and see if it's safe to do on the IO pool threads we can push it back onto the dedicated IO threads. If it could execute user code it can't go on our pool. Can you tell me what framework version you're using? How do you work with these traces? Visual studio will open them but the options are very limited, perfview will let me poke around in them but it's got a UI that only just beats The gimp in terms of usability. |
I can see the Timeout exceptions. And the target framework is Output:
|
Take a look at the flame graph if you load he file into perfview, you'll see the IO is happening where it's supposed to so the scheduler component is working. It's just that you still get threadpool exhaustion because of the waiting on the non-io completion. Can you give me the project you're using to repro and the commands to gather the traces? Nothing about this should be unix specific so I want to see if I can see the same pattern in windows. |
You should pass the running process Id using dotnet-trace tool. I'd suggest using "Perfview" If you liked to investigate on Windows.
Find my repro here based on issue #422. e.g: Waits 8 seconds, enables the UseExperimentalMARSThreading switch and runs 100 iterations.
|
I built a vm and set it up with everything needed to test this against linux using wsl as @DavoudEshtehari has been and I can reproduce the issue. It isn't clear what is causing the problem but it looks like the networking implementation in the runtime just stops calling back when data arrives, if that's the case and it's caused by the runtime using the threadpool internally then there is no way we can fix it at this layer. I will continue to investigate. |
I've been looking into what is going on here and the results are discouraging. I started by adding public void ReadFromStreamAsync(Stream stream, SNIAsyncCallback callback)
{
Debug.WriteLine($"ReadFromStreamAsync {Thread.CurrentThread.Name}");
stream.ReadAsync(_data, 0, _dataCapacity, CancellationToken.None)
.ContinueWith(
continuationAction: _readCallback,
state: callback,
CancellationToken.None,
TaskContinuationOptions.DenyChildAttach,
TaskScheduler.Default
);
} and the method that callback delegate refers to: private void ReadFromStreamAsyncContinuation(Task<int> t, object state)
{
Debug.WriteLine($"ReadFromStreamAsyncContinuation {Thread.CurrentThread.Name}");
SNIAsyncCallback callback = (SNIAsyncCallback)state;
bool error = false;
Exception e = t.Exception?.InnerException;
if (e != null)
{
... This is where we call into the
You can see that we get 3 callbacks and then they stop and we see only new reads queuing up. Eventually the driver code will timeout and error because we're not seeing packets come back. So why 3? Unknown. If we enable the mars scheduler from this PR we'll see a working pattern on windows:
But doing the same on linux doesn't have the same effect:
You can see that the mars IO threads are being used but that we're not getting callbacks from the system to tell us that a read has completed. On windows the limiting factor seems to be that a high level async task was being blocked by the threadpool being unavailable and the mars scheduler fixes that problem. On Linux there is also a lower level problem somewhere in the runtime/networking implementation which means that when we hit 3 waiting callbacks we stop getting callbacks. I went looking for the threadpool implementation in the runtime to see if there's anything interesting in there that explains what's going on. It looks like for .net 7 the was a PR last month to add the ability to use IOCompletionPorts to the threadpool on windows but before that there was just the standard ThreadPool.CoreCLR.cs implementation which is the portable version. So there is no managed difference between the threadpools on <6 which means that if there is a difference it would need to be in the os abstraction layer. To see what patterns of behaviour we have i ran the test program using 4 differnent configuration options.
Hyper-V VM with 4 cores (to keep default thread counts down) and 8Gib memory. Windows is win10 latest. Linux is ubuntu 20.04 running in wsl2 with mode set to 2 so it's using a real linux kernel. All test run through visual studio in debug mode so i could capture the output and see if any odd things happened. The database is an azure hosted empty database, the internet latency can be useful to make async operations go async but all of this can be replicated locally. Mars IO thread count is 10 when enabled. The results are:
Takeaways from this are:
So the scheduler is useful if and only if we want a way to alleviate this issue on windows. However, there's one more set of results. In
So if you simply prepare the environment for the workload you're about to throw at it then you'll get good results. So the problem we have is what we see in this trace:
Why do we stop receiving callback notifications from the system once 3 have been queued? is the 3 co-incidental/temporal in nature or is there some limit of 3 in the runtime somewhere? We rely on the |
For the record, sharing my findings with
I found a similar result by downgrading the thread's priority: |
Update over the previous comment: Running the test without enabling |
It is the workaround I proposed over 2 years ago now #422 (comment) but people didn't want to accept it. Unless the framework changes underneath us then imo it remains the only viable option. |
@Wraith2 it seems that in #1543 (comment) you suspect an issue with the BCL socket implementation; if so the best way forward is to reproduce it in a code sample outside of SqlClient. Doing that may not be trivial, but otherwise there's always the chance something problematic happening in SqlClient somehow triggers the problematic behavior. |
I'm curious whether you've tried the same thing with the postgres driver and if it would also show the same behaviour. If the runtime is using sync over async internally then you should even though you aren't using that pattern in the library. |
I certainly haven't seen issues like this in Npgsql (except when we messed up ourselves in the driver). Npgsql only does sync-over-async itself in a very specific scenario where we can't aviod it - when batching and sync APIs are used together - though that's probably common enough that we'd have heard some complaints if it were failing as above... |
Regarding The main issue here is NetworkStream.ReadAsync, which by calling SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SNI/SNIMarsHandle.cs Line 519 in a6625b3
We've faced the same kind of problem in Npgsql with ValueTask.AsTask , @roji should remember that.
|
Thanks for the investigation @vonzshik - yes, we faced this issue with AsTask in Npgsql's sync-over-async hack - see this comment and the workaround. Since AsTask is being used inside NetworkStream, it's likely to be more complicated to work around it. More generally, the problem here is in the assumption that an async BCL API (NetworkStream.ReadAsync) won't itself internally depend on the thread pool - this simply isn't a guarantee that the API makes for you. If fixing this issue is important, then I really urge everyone involved in SqlClient to come up with a plan where sync-over-async isn't used. I understand that's a big thing to do, but it really is the only way to ensure the driver works correctly in all scenarios etc. |
That fits with what I think is going on. It also makes clear that even on windows this PR isn't a fix, all it does is divert some pressure from the threadpool onto our dedicated threads it doesn't prevent the runtime below us from hitting the same problem we've just been lucky in testing. Thanks @vonzshik!
There's some difference of opinion on that one. For the SQL team fixing it is a high priority. |
@Wraith2 can we close this PR? |
Addresses #422 but does not "fix" it, the fix is not to overload the system thread pool starving asynchronous operations of the cpu resources needed to do the requested work.
This adds a netcore3.1 or later specific task scheduler which uses dedicated threads to handle mars io when the system thread pool has queued tasks waiting. The only work done on these dedicated threads io work required to keep mars functioning and once data has been received any other async work will proceed on user threads as normal. No user code will execute on these threads.
The behaviour observed is that as soon as the system thread pool becomes overloaded the mars io threads take over and continue to do mars io work until the system thread pool becomes less saturated, A demand changes the io is diverted to the dedicated threads as needed but we prefer the system thread pool. This is a simple scheduler taken from the dotnet example code and so it does not attempt to grow or shrink the number of threads, it is hardcoded at 3 for this PR but could be changed.
To use this requires two appsettings, preferably in
Main