Possible hang with Async_detach #569
Comments
It's worth mentioning that all the fds I'm using are blocking. Switching to non-blocking fds would probably also help. That might also explain why I don't see other reports of this behavior.
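For readers hitting the same thing, here is a minimal sketch of wrapping a pipe as non-blocking for Lwt. It only uses `Lwt_unix.of_unix_file_descr`, the same function the repro program later in this thread uses with `~blocking:true`; the surrounding setup is illustrative.

```ocaml
(* Wrap a pipe's read end as non-blocking, so Lwt multiplexes reads on the
   main thread's event loop instead of parking a worker thread on them.
   ~set_flags:true also sets O_NONBLOCK on the underlying fd. *)
let () =
  let read_fd, write_fd = Unix.pipe () in
  let read_fd =
    Lwt_unix.of_unix_file_descr ~blocking:false ~set_flags:true read_fd in
  (* Make the fd readable so the example terminates. *)
  ignore (Unix.write write_fd (Bytes.of_string "x") 0 1);
  let buffer = Bytes.create 1 in
  Lwt_main.run (Lwt_unix.read read_fd buffer 0 1) |> ignore
```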
I haven't fully understood the issue yet, but I want to immediately note that: …
Unfortunately, my project has been incrementally adopting Lwt, so the process which creates the pipe and spawns the Lwt-using server doesn't use Lwt. So we have a blocking pipe. (Which I can fix with ….)
Yep, you're right! Interestingly enough, the …
If I understood correctly, at the time when the binary gets stuck, you would expect to see the main thread and two system threads, not one, with one system thread waiting in …
Yep! That's exactly what I think is happening!
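As a side note, the thread accounting under discussion can be observed directly. This tiny probe uses only `Lwt_unix.thread_count` and `Lwt_unix.thread_waiting_count`, the same counters exercised by the repro program later in the thread:

```ocaml
(* Run one blocking job so Lwt spawns a worker, then print the counters. *)
let () =
  let _ = Lwt_main.run (Lwt_unix.stat ".") in
  Printf.printf "thread_count = %d, thread_waiting_count = %d\n"
    (Lwt_unix.thread_count ()) (Lwt_unix.thread_waiting_count ())
```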
No problem. You may want to set the fd to non-blocking mode as a quick workaround, but if this is accurate, then this is a classic type of bug that definitely needs to be fixed on the Lwt side as well :) Off the top of my head, I don't remember why that was commented out; I'll have to look into it. Given that I have a rule never to leave any code commented out without a comment explaining why, I suspect it was supposed to be temporary for debugging, and I missed restoring it during self-review.
Also, to make it non-blocking from Lwt's point of view, you should use the functions in …
...and this suggests another potential workaround: try Lwt 3.1.0 for the time being. It should have this bug and/or the even more incorrect thread count management Lwt had last year, but Lwt actually doing …. I will look into this issue in detail shortly :)
...and, confirmed, that call shouldn't be commented out. I remember clearly now that I did that in order to be able to debug the other issue (linked to the commit). However, Lwt should work whether …
@gabelevi Your analysis was correct. See the attached commit. Can you please pin Lwt to …?

The short description of the problem is that worker threads were decrementing the idle thread count after being awakened by the main thread. The main thread should instead decrement the idle worker count itself, to ensure that thread wakeup and counting are done atomically.

With the previous code, when the main thread saw one worker idle, it would awaken it. The worker would then be racing with the rest of the main thread, which was going on to submit another job to the thread pool. If this next job submission happened before the worker thread started running, the main thread would still see a count saying that one worker thread was idle, even though that was no longer the case.

I was able to reproduce this (or a similar) issue on my end with this program:

```ocaml
open Lwt.Infix

let () =
  (* This blocking pipe is just a way to get Lwt_unix to make a worker thread
     block indefinitely. Without wait_read, Lwt_unix hands a worker thread the
     job of reading from a pipe that will never become readable. *)
  let read_from, _ = Unix.pipe () in
  let read_from =
    read_from |> Lwt_unix.of_unix_file_descr ~blocking:true ~set_flags:false in

  (* A writable blocking pipe. *)
  let _, write_to = Unix.pipe () in
  let write_to =
    write_to |> Lwt_unix.of_unix_file_descr ~blocking:true ~set_flags:false in

  (* Perform any blocking operation, so that Lwt spawns one worker thread. *)
  let _ = Lwt_unix.stat "." |> Lwt_main.run in

  (* Check we have one worker thread, and it is waiting for work. *)
  assert (Lwt_unix.thread_count () = 1);
  assert (Lwt_unix.thread_waiting_count () = 1);

  (* We want the highest chance that the write is queued before the read worker
     starts running, so do both allocations first to reduce the amount of code
     that has to run between Lwt_unix.read and Lwt_unix.write. *)
  let read_buffer = Bytes.create 1 in
  let write_buffer = Bytes.create 1 in

  Lwt_unix.read read_from read_buffer 0 1
  |> ignore;

  Lwt_unix.write write_to write_buffer 0 1
  |> Lwt_main.run
  |> ignore

(* ocamlfind opt -linkpkg -package lwt.unix race.ml && ./a.out *)
```

This sometimes exits after a successful …, and sometimes hangs. It's difficult to make this into a permanent test because …. I am going to fix the commented-out …. I didn't fully fix the code for ….
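To make the fix described above concrete, here is a small OCaml model of the corrected accounting. It is an illustrative sketch, not Lwt's actual implementation (which is C code in `lwt_unix_stubs.c`); all names, other than the role played by `thread_waiting_count`, are invented for the sketch.

```ocaml
(* Compile with: ocamlfind opt -linkpkg -package threads.posix pool.ml *)

type pool = {
  mutex : Mutex.t;
  work_available : Condition.t;
  mutable queue : (unit -> unit) list;
  mutable idle : int;  (* plays the role of thread_waiting_count *)
}

let make () = {
  mutex = Mutex.create ();
  work_available = Condition.create ();
  queue = [];
  idle = 0;
}

let rec worker pool =
  Mutex.lock pool.mutex;
  let job =
    match pool.queue with
    | job :: rest ->
      pool.queue <- rest;
      job
    | [] ->
      (* Nothing to do: count ourselves idle and wait to be signaled. *)
      pool.idle <- pool.idle + 1;
      while pool.queue = [] do
        Condition.wait pool.work_available pool.mutex
      done;
      (* No decrement here: the submitter that signaled us already
         decremented [idle] on our behalf. That is the essence of the fix. *)
      let job = List.hd pool.queue in
      pool.queue <- List.tl pool.queue;
      job
  in
  Mutex.unlock pool.mutex;
  job ();
  worker pool

let submit pool job =
  Mutex.lock pool.mutex;
  pool.queue <- pool.queue @ [job];
  if pool.idle > 0 then begin
    (* Wakeup and accounting happen atomically, under the lock, in the
       submitting thread. The buggy version deferred this decrement to the
       awakened worker, so a second submit racing with the wakeup could
       read a stale idle count and skip spawning a needed thread. *)
    pool.idle <- pool.idle - 1;
    Condition.signal pool.work_available
  end else
    ignore (Thread.create worker pool);
  Mutex.unlock pool.mutex

let () =
  let pool = make () in
  submit pool (fun () -> print_endline "job ran");
  Thread.delay 0.1
```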
Summary: This is the result of a week-long investigation into Flow occasionally hanging. Here's what you should know:

* A read/write of a blocking fd will block until the fd is readable/writable
* A read/write of a non-blocking fd will return -1 immediately if the fd is not readable/writable
* Lwt super extra hates running blocking system calls. It will create system threads and run the blocking system call on them
* There's a bug in Lwt's scheduling of these system threads: ocsigen/lwt#569

From our point of view, there's very little difference between using blocking and non-blocking fds. We had just been using blocking fds, since I couldn't really tell if one was better than the other. However, using non-blocking fds keeps us from needing system threads and works around the bug, so let's use them!

Reviewed By: samwgoldman
Differential Revision: D7464034
fbshipit-source-id: e0ba602381a8bef7dd374ee1cd5fb0fdef9ad7d9
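An aside on the second bullet: "returns -1 immediately" is the C-level view; from OCaml, the failed read surfaces as a `Unix_error` exception. A minimal sketch:

```ocaml
(* A read from an empty non-blocking pipe does not block: the underlying
   syscall returns -1 with errno EAGAIN/EWOULDBLOCK, which OCaml raises
   as Unix.Unix_error. *)
let () =
  let read_fd, _write_fd = Unix.pipe () in
  Unix.set_nonblock read_fd;
  let buffer = Bytes.create 1 in
  match Unix.read read_fd buffer 0 1 with
  | _ -> assert false  (* nothing was written, so the read cannot succeed *)
  | exception Unix.Unix_error ((Unix.EAGAIN | Unix.EWOULDBLOCK), _, _) ->
    print_endline "fd not readable yet; retry once select says it's ready"
```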
Summary: I reported the Lwt issue here: ocsigen/lwt#574. Long story short, reading from non-blocking fds on Windows doesn't seem to yield to other threads, even when the fd isn't ready to read. I'm not sure of the cause. While we could just revert back to using blocking fds, we'd need to either work around ocsigen/lwt#569 or ask for an Lwt release and update Lwt.

Reviewed By: ljw1004
Differential Revision: D7603653
fbshipit-source-id: d6f9b4ed256cfa5b1fb286f9bd41c75748c28ddc
I have been debugging a hang in my project. I'll try to describe the symptoms and what I think is happening.
Symptoms
- My program hangs in an `Lwt_unix.write` call to a pipe. I have confirmed that the fd is ready to be written and someone is listening on the other side.
- When I `strace` a stuck binary, I observe the main thread calling `select`, waiting for various fds to be readable. I also observe a single system thread worker, performing a blocking `read` call.
- I never see the `write` call.

My guess at a cause
(I'm assuming async_detach since that's the default and I don't ever specify an async method)
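For reference, a sketch of checking which async method is in effect, assuming an Lwt version from this era (3.x/4.x) where `Lwt_unix.default_async_method` is available:

```ocaml
(* Print the async method Lwt_unix will use for blocking system calls. *)
let () =
  match Lwt_unix.default_async_method () with
  | Lwt_unix.Async_detach -> print_endline "Async_detach (the default)"
  | Lwt_unix.Async_switch -> print_endline "Async_switch"
  | Lwt_unix.Async_none -> print_endline "Async_none"
```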
I poked around in `lwt_unix_stubs.c` a bit. I'm guessing I'm hitting a deadlock that looks like this:

1. The main thread, submitting a `read` system call, either A) sees 0 system threads and creates a new one, or B) sees 1 free system thread and adds itself to the pool.
2. Before the system thread can start the `read` and decrement `thread_waiting_count`, the main thread adds the `write` to the pool too.
3. The system thread picks up the `read` job and blocks indefinitely.
4. The `write` job is stuck in the pool.

Eventually, something triggers another `write` command, which creates a second system thread. In my strace, I see two `write` system calls on the new system thread: one to an fd and one to an eventfd. The new thread performs the `write`, and the binary is now unhung.

Things to do
Avoid long blocking system calls
Before Lwt, I would use the `select & read` pattern. With Lwt, I've just been using `read`. Perhaps I should always use `wait_read` and `wait_write` before a `read` or `write` (as in the sketch below), to avoid tying up a system thread for a long time. If this is a best practice, it might be worth documenting too.
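A minimal sketch of that pattern, using only the documented `Lwt_unix.wait_read` and `Lwt_unix.read`; the `read_when_ready` helper name is hypothetical:

```ocaml
(* Block on the main thread's event loop until the fd is readable, and only
   then issue the read, so any worker thread involved finishes quickly
   instead of sitting in a blocking read. *)
let read_when_ready fd buffer offset length =
  let open Lwt.Infix in
  Lwt_unix.wait_read fd >>= fun () ->
  Lwt_unix.read fd buffer offset length
```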
Fix the deadlock
Perhaps the number of jobs in the `pool_queue` should never exceed `thread_waiting_count`. If we assume that every job might take a Very Long Time, then putting N+1 jobs in the queue when there are only N free system workers will block that last job for a Very Long Time.

So we could do something like the following (sketched below):

- Keep track of `pool_queue_size`
- When `pool_queue_size >= thread_waiting_count || thread_waiting_count == 0`, create a new system thread
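A minimal sketch of that policy; the names come from the proposal above, not from Lwt's actual internals:

```ocaml
(* Hypothetical predicate: spawn a worker whenever queued jobs would
   otherwise outnumber idle workers, or no worker is idle at all. *)
let should_spawn_thread ~pool_queue_size ~thread_waiting_count =
  pool_queue_size >= thread_waiting_count || thread_waiting_count = 0

(* With as many queued jobs as idle workers, queuing one more should
   trigger a spawn rather than risk stranding it behind long-running jobs. *)
let () = assert (should_spawn_thread ~pool_queue_size:1 ~thread_waiting_count:1)
```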