Wait for daemon shutdown before testing NodeStrategy #620
Conversation
Signed-off-by: Michel Hidalgo <[email protected]>
Windows still looks unhappy :(
That's interesting. I'm not sure if the daemon is taking really long to shut down or if it's deadlocking (?)
Signed-off-by: Michel Hidalgo <[email protected]>
I suspect it's the socket lingering; that's how this detects whether the daemon is running or not (arguably, not a great way of doing so). I don't see any resource mismanagement in the daemon nor in
Alright, I think I know what the problem is. There is a TOCTOU race in the test, and ultimately in this whole idea of checking for a failing
The test fails because more than one daemon gets spawned and shutdown requests get routed to only one of them. This occurs because
I can tweak the test to not run into this situation, but the status check for the daemon is fundamentally flawed IMHO. It has to be changed (perhaps using file locking?).
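A minimal illustration of the check-then-spawn race being described; the names, port, and spawned module below are hypothetical, not the actual ros2cli code:

```python
import socket
import subprocess
import sys


def daemon_seems_running(port=11511):  # hypothetical port
    """Probe the daemon by trying to connect to its XMLRPC socket."""
    try:
        with socket.create_connection(('127.0.0.1', port), timeout=1.0):
            return True
    except OSError:
        return False


def ensure_daemon():
    # TOCTOU window: between the check below and the spawn, another process
    # can run the same check, also see "not running", and spawn a second daemon.
    if not daemon_seems_running():
        subprocess.Popen([sys.executable, '-m', 'some_daemon_module'])  # hypothetical entry point
```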
Can we disable address reuse?
Yes, we can. But it doesn't solve the problem IMHO; it hides it. Spurious daemons can still be spawned, they just won't live long enough to become an issue.
Sorry, I don't quite follow the issue. I'm also not sure how file locking could help; could you expand?
Multiple daemons can be executed because there's a race in the check for whether a daemon is already running or not (because we use that check to gate daemon execution). Even if we disable socket address reuse, that race is still there. Two daemons can still be spawned in quick succession. One will exit or crash though. That's what I meant by hiding the issue.
If we had some form of synchronization, we could avoid the races. But it's tricky. You'd have to first attempt to lock the file; if you can, no daemon is running. Then fork to run the daemon while preserving the file descriptor that holds the lock. The lock would get released when the daemon dies. On Windows... (shivers) you'd need an extra outer lock to check for the inner lock and spawn, plus an IPC channel for the child process to signal when it has acquired the inner lock. Spurious daemons don't sound that bad now 😅
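A rough, Unix-only sketch of that file-locking idea, assuming an advisory `flock` on a lock file; the lock path and daemon entry point are placeholders, not the actual ros2cli implementation:

```python
import fcntl
import os
import subprocess
import sys

LOCK_PATH = '/tmp/ros2_daemon.lock'  # hypothetical location


def try_spawn_daemon():
    """Spawn the daemon only if no other process currently holds the lock."""
    lock_fd = os.open(LOCK_PATH, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Non-blocking exclusive lock: failure means a daemon already holds it.
        fcntl.flock(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(lock_fd)
        return False  # a daemon is already running
    # Hand the locked descriptor to the daemon; flock locks live on the open
    # file description, so the lock is released only when every descriptor
    # referring to it is closed, i.e. when the daemon dies.
    subprocess.Popen(
        [sys.executable, '-m', 'some_daemon_module'],  # hypothetical entry point
        pass_fds=(lock_fd,),
    )
    os.close(lock_fd)  # drop the parent's copy; the child keeps its own
    return True
```

As noted in the comment above, Windows would need an additional mechanism, since descriptors and locks are not inherited the same way there.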
IMO spawning two daemons in succession is not a problem if one crashes. e.g.
The important thing is that processes A and B should know that the daemon they spawned might crash and that they might find one spawned by another process, but that seems irrelevant. @hidmic does that sound reasonable to you?
I'm not against adding another form of synchronization, but it makes the code more complex.
It's not a problem, but it's not ideal either. Spawning processes on an OS like Windows is somewhat expensive.
Yeah, it does get complex, particularly on Windows. I had another idea: the process that attempts to spawn a daemon could first bind the socket, fetch its underlying handle (using
In the interest of time, I'll do the following: I'll disable address reuse and handle bind failure in the daemon gracefully. Then I'll open an issue or a draft PR with the more sophisticated approach.
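A sketch of what the "bind first, then hand the socket to the daemon" idea could look like on Unix; the port, module name, and `--listen-fd` flag are assumptions for illustration only:

```python
import socket
import subprocess
import sys


def spawn_daemon_with_prebound_socket(addr=('127.0.0.1', 11511)):  # port is illustrative
    """Bind the daemon's listening socket up front, then hand it to the daemon."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(addr)  # fails right away if another daemon already owns the address
    except OSError:
        sock.close()
        return None
    sock.listen()
    proc = subprocess.Popen(
        # Hypothetical CLI: the child would rebuild the socket from the fd,
        # e.g. with socket.socket(fileno=...).
        [sys.executable, '-m', 'some_daemon_module', '--listen-fd', str(sock.fileno())],
        pass_fds=(sock.fileno(),),
    )
    sock.close()  # the daemon keeps its inherited copy
    return proc
```

On Windows, descriptors are not inherited this way; something like `socket.share()` / `socket.fromshare()` would presumably be needed instead, which is part of why the approach was deferred.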
Sounds great!!
I'm happy if that works, but I'm not sure whether file descriptors are inherited when spawning a process, and I'm not sure all OSs behave in the same way.
On Unix OSes, they can be. On Windows, they cannot, but they can be shared through an IPC. I'm sure there's a booby trap somewhere, so I'll defer that POC.
Oh boy. We need
The other option is just to be aware of the "long time" it can take to spawn a daemon if one was recently closed, but that isn't very nice.
I have to override a few. I believe the problem stems from the active closing of accepted connections. Setting
Yeah, no, that's not nice. That timeout is about 4 minutes on Linux by default.
Yes, the problem is that we're kind of doing the "opposite" of what's recommended. AFAIK resetting the connection isn't too bad; worst case, it will cause a ConnectionResetError.
Indeed.
All XMLRPC requests end up like that -- the server closes the accepted connection after serving the HTTP request 😃. I suspect that's why
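For reference, the hard-close behavior being discussed comes down to a zero linger timeout on the socket (the approach #622 ended up taking, per the summary further down). A minimal sketch, assuming the Linux layout of the option value (two C ints):

```python
import socket
import struct

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# l_onoff=1, l_linger=0: close() aborts the connection with an RST instead of
# the normal FIN handshake, so the port skips TIME_WAIT at the cost of
# surfacing ConnectionResetError on the peer.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))
```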
Superseded by #622.
The original problem here is that running the test_strategy.py test in the nightly repeated jobs "sometimes" fails. There have been a few attempts to fix flakiness in the ros2 daemon in the past, including #620, #622, and #652. These all changed things in various ways, but the key PR was #652, which made spawning the daemon a reliable operation. #622 changed the sockets to add SO_LINGER with a zero timeout. That improved the situation but did not totally solve it, and it has its own downsides, as SO_LINGER doesn't gracefully terminate connections and instead just sends RST on the socket and terminates it.

Fixing this for real requires 3 parts in this commit, though one of the parts is platform-dependent:

1. When the daemon is exiting cleanly, it should explicitly shut down the socket that it was using for the XMLRPC server. That cleanly shuts down the socket and tells the kernel it can start the cleanup. On its own, this does not completely solve the problem, but it reduces the amount of time that things hang around waiting for the Python interpreter and/or the kernel to implicitly clean things up.
2. We should not specify SO_LINGER on the daemon sockets. As mentioned above, this is actually something of an anti-pattern and does not properly terminate connections with FIN (it just sends RST).
3. We should specify SO_REUSEADDR, but only on Unix. On Unix, SO_REUSEADDR essentially means "allow binding to an address/port that is in TCP TIME_WAIT (but not one that is otherwise in use)". This is exactly the behavior we want. On Windows, SO_REUSEADDR causes undefined behavior, as it can cause a socket to bind even if something else is already bound. Because of that, we want to set SO_REUSEADDR on Unix, but not on Windows.

Finally, while testing here I had to add one bugfix to make things reliable on Windows, which is to also catch ConnectionResetError. That arises because we can attempt to "connect" to a daemon that is in the process of shutting down. In that case, we should also consider the daemon not "connected".

Signed-off-by: Chris Lalancette <[email protected]>
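A condensed sketch of those pieces, not the literal ros2cli code (the port, the ping call, and the class names are placeholders):

```python
import platform
import socket
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


class DaemonServer(SimpleXMLRPCServer):
    # Part 3: allow rebinding an address stuck in TCP TIME_WAIT, but only on
    # Unix, where SO_REUSEADDR does not permit stealing an actively bound port.
    allow_reuse_address = platform.system() != 'Windows'
    # Part 2: no SO_LINGER override; close() performs the normal FIN handshake.

    def shutdown_cleanly(self):
        # Part 1: explicitly shut down the listening socket on clean exit so
        # the kernel can start releasing connection state right away.
        try:
            self.socket.shutdown(socket.SHUT_RDWR)
        except OSError:
            pass  # some platforms reject shutdown() on a never-connected socket
        self.server_close()


def daemon_connected(url='http://127.0.0.1:11511'):  # placeholder URL
    # The Windows bugfix: a daemon that is mid-shutdown can reset the
    # connection, which should count as "not connected" rather than an error.
    try:
        xmlrpc.client.ServerProxy(url).system.listMethods()
        return True
    except (ConnectionRefusedError, ConnectionResetError):
        return False
```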
Hopefully this solves the sporadic flakes on Windows (e.g. https://ci.ros2.org/view/nightly/job/nightly_win_rep/lastCompletedBuild/testReport/ros2cli.ros2cli.test/test_strategy/test_with_no_daemon_running_3_3_/)
CI up to ros2cli: