
Wait for daemon shutdown before testing NodeStrategy #620

Closed
hidmic wants to merge 2 commits

Conversation

@hidmic (Contributor) commented Apr 9, 2021

@hidmic requested review from audrow and ivanpauno on April 9, 2021
@clalancette (Contributor)

Windows still looks unhappy :(.

@ivanpauno (Member)

Windows still looks unhappy :(.

That's interesting. I'm not sure if the daemon is taking really long to shut down, or if it's deadlocking (?)

Signed-off-by: Michel Hidalgo <[email protected]>
@hidmic (Contributor Author) commented Apr 9, 2021

That's interesting. I'm not sure if the daemon is taking really long to shut down, or if it's deadlocking (?)

I suspect it's the socket lingering; that's how this code detects whether the daemon is running (arguably, not a great way of doing so).

I don't see any resource mismanagement in the daemon or in xmlrpc. The socket is explicitly closed. Maybe that's not enough on Windows?

@hidmic (Contributor Author) commented Apr 12, 2021

Alright, I think I know what the problem is. There is a TOCTOU race in the test, and ultimately in this whole idea of checking for a failing socket.bind in the is_daemon_running implementation.

The test fails because more than one daemon gets spawned and shutdown requests get routed to only one of them. This occurs because SimpleXMLRPCServer allows address reuse by default (see here and here), and since socket.bind calls start to fail only once the first socket.listen has succeeded, is_daemon_running() keeps returning False even while a daemon is spawning, potentially letting more daemons be spawned.

I can tweak the test to not run into this situation, but the status check for the daemon is fundamentally flawed IMHO. It has to be changed (perhaps using file locking?).
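For illustration, the kind of bind-based check being described looks roughly like this (a simplified sketch with a hypothetical port, not the actual ros2cli code):

```python
import socket

def is_daemon_running(addr=('127.0.0.1', 11511)):  # port is illustrative
    """Probe with bind(): assume a daemon is up iff the address is taken."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    probe.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        probe.bind(addr)
    except OSError:
        return True   # address in use: assume a daemon is running
    else:
        return False  # bind succeeded: assume no daemon is running
    finally:
        probe.close()

# TOCTOU window: with SO_REUSEADDR set on both sockets, the probe's bind()
# only starts failing after the spawning daemon's listen() has succeeded,
# so a second process can see "not running" and spawn another daemon.
```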

@ivanpauno (Member)

The test fails because more than one daemon gets spawned and shutdown requests get routed to only one of them. This occurs because SimpleXMLRPCServer allows address reuse by default

Can we disable allow_reuse_address?
I don't see why we need it; we only want one daemon.

@hidmic (Contributor Author) commented Apr 12, 2021

Can we disable allow_reuse_address?

Yes, we can. But it doesn't solve the problem IMHO. It hides it. Spurious daemons can still be spawned; they just won't live long enough to become an issue.

@ivanpauno (Member)

Yes, we can. But it doesn't solve the problem IMHO. It hides it. Spurious daemons can still be spawned; they just won't live long enough to become an issue.

Sorry, I don't quite follow the issue.
What do you mean by "spurious daemons"? Why won't they live long enough to become an issue?

I'm also not sure how file locking could help; could you expand?

@hidmic (Contributor Author) commented Apr 12, 2021

What do you mean by "spurious daemons"? Why won't they live long enough to become an issue?

Multiple daemons can end up running because there's a race in the check for whether a daemon is already running or not (we use that check to gate daemon execution). Even if we disable socket address reuse, that race is still there. Two daemons can still be spawned in quick succession. One of them will exit or crash, though. That's what I meant by hiding the issue.

I'm also not sure how file locking could help; could you expand?

If we had some form of synchronization, we could avoid the races. But it's tricky. You'd have to first attempt to lock the file: if you can, no daemon is running. Then fork to run the daemon while preserving the file descriptor that holds the lock. The lock would get released when the daemon dies. On Windows... (shivers) you'd need an extra outer lock to check for the inner lock and spawn, plus an IPC mechanism for the child process to signal when it has acquired the inner lock.

Spurious daemons don't sound that bad now 😅
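
A rough sketch of the Unix half of that locking scheme, assuming a hypothetical lock path (the Windows side would need msvcrt.locking plus the outer lock and IPC described above):

```python
import fcntl

LOCK_PATH = '/tmp/ros2-daemon.lock'  # hypothetical location

def acquire_daemon_lock():
    """Return a held lock file if no daemon is running, else None."""
    lock_file = open(LOCK_PATH, 'w')
    try:
        # Non-blocking exclusive lock: fails if a daemon already holds it.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None  # a daemon is already running
    # The spawning process would keep this descriptor open across fork/exec
    # so the daemon inherits the lock; the kernel releases it automatically
    # when the daemon process dies.
    return lock_file
```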

@ivanpauno (Member)

Multiple daemons can end up running because there's a race in the check for whether a daemon is already running or not (we use that check to gate daemon execution). Even if we disable socket address reuse, that race is still there. Two daemons can still be spawned in quick succession. One of them will exit or crash, though. That's what I meant by hiding the issue.

IMO spawning two daemons in succession is not a problem if one crashes.

e.g.

1. process A spawns a daemon
2. process B doesn't see a daemon running and spawns a new daemon
3. either the daemon spawned by process A or the one spawned by B starts serving, and the other crashes
4. process A waits until the daemon is running and continues
5. process B waits until the daemon is running and continues

The important thing process A and B should know is that the daemon they spawned might crash and they might find one spawned by another process instead, but that seems irrelevant.

@hidmic does that sound reasonable to you?

If we had some form of synchronization, we could avoid the races

I'm not against adding another form of synchronization, but it makes the code more complex.

@hidmic (Contributor Author) commented Apr 13, 2021

IMO spawning two daemons in succession is not a problem if one crashes.

It's not a problem, but it's not ideal either. Spawning processes on an OS like Windows is somewhat expensive.

I'm not against adding another form of synchronization, but it makes the code more complex.

Yeah, it does get complex, particularly on Windows.

I had another idea. The process that attempts to spawn a daemon could first bind the socket, fetch its underlying handle (using socket.fileno() or socket.share()), and forward it to the daemon through stdin. The daemon can reconstruct the socket (using socket.fromfd() or socket.fromshare()) and let its parent know through stdout, so that the parent can release its own copy. In theory, it works 😅
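
A minimal sketch of the Unix variant of that handover (the script name and address are hypothetical; on Windows, socket.share()/socket.fromshare() would replace plain fd inheritance):

```python
import socket
import subprocess
import sys

# Parent: bind first, then hand the file descriptor to the daemon.
server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_sock.bind(('127.0.0.1', 11511))  # hypothetical daemon address
server_sock.set_inheritable(True)       # let the fd survive exec on Unix
daemon = subprocess.Popen(
    [sys.executable, 'daemon.py'],      # hypothetical daemon script
    stdin=subprocess.PIPE, stdout=subprocess.PIPE,
    close_fds=False,                    # keep inheritable fds open in the child
)
daemon.stdin.write(b'%d\n' % server_sock.fileno())  # forward the handle
daemon.stdin.flush()
daemon.stdout.readline()                # wait for the daemon's acknowledgment
server_sock.close()                     # parent releases its copy

# Daemon side (daemon.py), roughly:
#   fd = int(sys.stdin.readline())
#   sock = socket.socket(fileno=fd)     # reconstruct; socket.fromfd() also works
#   sock.listen()
#   print('ready', flush=True)          # tell the parent to release its copy
```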


In the interest of time, I'll do the following: disable address reuse and handle bind failures in the daemon gracefully. Then I'll open an issue or a draft PR with the more sophisticated approach.

@ivanpauno (Member)

In the interest of time, I'll do the following: disable address reuse and handle bind failures in the daemon gracefully. Then I'll open an issue or a draft PR with the more sophisticated approach.

Sounds great!!

I had another idea. The process that attempts to spawn a daemon could first bind the socket, fetch its underlying handle (using socket.fileno() or socket.share()), and forward it to the daemon through stdin. The daemon can reconstruct the socket (using socket.fromfd() or socket.fromshare()) and let its parent know through stdout, so that the parent can release its own copy. In theory, it works 😅

I'm happy if that works, but I'm not sure whether file descriptors are inherited when spawning a process, nor whether all OSs behave the same way.

@hidmic (Contributor Author) commented Apr 13, 2021

whether file descriptors are inherited when spawning a process

On Unix OSes, they can be. On Windows, they cannot, but they can be shared through IPC. I'm sure there's a booby trap somewhere, so I'll defer that POC.

@hidmic (Contributor Author) commented Apr 13, 2021

Oh boy. We need SO_REUSEADDR because Python's XMLRPC server likes leaving accepted connection sockets lying around in the TIME_WAIT state. And then Linux won't allow rebinding, i.e. I cannot stop a daemon and immediately start another one, unless SO_REUSEADDR is set 🙈

@ivanpauno (Member)

Oh boy. We need SO_REUSEADDR because Python's XMLRPC server likes leaving accepted connection sockets lying around in the TIME_WAIT state. And then Linux won't allow rebinding, i.e. I cannot stop a daemon and immediately start another one, unless SO_REUSEADDR is set

The socket object is a "public" member; maybe you can set SO_LINGER to 0 before binding (?).
https://github.com/python/cpython/blob/b5711c940f70af89f2b4cf081a3fcd83924f3ae7/Lib/socketserver.py#L464-L465.
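
A sketch of what that override could look like (an illustrative subclass, not the actual ros2cli change; the packed struct shown is the common Linux encoding of struct linger):

```python
import socket
import struct
from xmlrpc.server import SimpleXMLRPCServer

ZERO_LINGER = struct.pack('ii', 1, 0)  # linger on, zero timeout: close() sends RST

class NoLingerXMLRPCServer(SimpleXMLRPCServer):
    def server_bind(self):
        # Set SO_LINGER on the listening socket before binding, as suggested.
        self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, ZERO_LINGER)
        super().server_bind()

    def get_request(self):
        # The accepted connection is what the server actively close()s after
        # each HTTP request, so it likely needs the same treatment.
        request, client_address = super().get_request()
        request.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, ZERO_LINGER)
        return request, client_address
```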

The other option is just to live with the long time it can take to spawn a daemon if one was recently closed, but that isn't very nice.

@hidmic (Contributor Author) commented Apr 14, 2021

The socket object is a "public" member; maybe you can set SO_LINGER to 0 before binding (?).

I have to override a few things; I believe the problem stems from actively closing accepted connections. Setting SO_LINGER to 0 resets the connection. Not sure if clients will be happy about it, but I can try.

The other option is just to live with the long time it can take to spawn a daemon if one was recently closed, but that isn't very nice.

Yeah, no, that's not nice. That timeout is about 4 minutes on Linux by default.

@ivanpauno (Member)

I have to override a few things; I believe the problem stems from actively closing accepted connections.

Yes, the problem is that we're kind of doing the "opposite" of what's recommended.
The idea is to "avoid" an active close() by the server; for that, the server typically sends a "shutdown" message to the client, which then actively close()s.
We're doing the opposite: creating a client connection to send a "shutdown" message to the server, which actively close()s, and thus the server enters the TIME_WAIT state (probably because the connection of the client that sent the "shutdown" message is still active).

AFAIK resetting the connection isn't too bad; worst case, it will cause a ConnectionResetError.
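
On the client side, tolerating the reset might look like this (a sketch; the address and RPC method name are hypothetical):

```python
import xmlrpc.client

def shutdown_daemon(url='http://127.0.0.1:11511/'):  # hypothetical address
    proxy = xmlrpc.client.ServerProxy(url)
    try:
        proxy.shutdown()  # hypothetical RPC method name
    except ConnectionResetError:
        # A zero-timeout SO_LINGER close() sends RST mid-exchange; the
        # daemon is going down anyway, so treat the reset as success.
        pass
```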

@hidmic (Contributor Author) commented Apr 14, 2021

The idea is to "avoid" an active close() by the server; for that, the server typically sends a "shutdown" message to the client, which then actively close()s.

Indeed.

thus the server enters the TIME_WAIT state (probably because the connection of the client that sent the "shutdown" message is still active).

All XMLRPC requests end up like that: the server closes the accepted connection after serving the HTTP request 😃. I suspect that's why allow_reuse_address is True by default.

@hidmic (Contributor Author) commented Apr 14, 2021

SO_LINGER worked like a charm 🎉 I prototyped sharing sockets through IPC too, but, unsurprisingly, it gets messy.

@hidmic (Contributor Author) commented Apr 14, 2021

Superseded by #622.

@hidmic closed this on Apr 14, 2021
@hidmic deleted the hidmic/wait-for-daemon-shutdown branch on April 14, 2021
clalancette added a commit that referenced this pull request Nov 20, 2024
The original problem here is that running the test_strategy.py test
in the nightly repeated jobs "sometimes" fails.

There have been a few attempts to fix flakiness in the ros2 daemon in the past.
These include #620, #622, and #652.
These all changed things in various ways, but the key PR was #652, which made
spawning the daemon a reliable operation.

#622 made some changes to the sockets to add SO_LINGER with a zero timeout.
That improved the situation, but did not totally solve it. It also has its own downsides, as
SO_LINGER doesn't gracefully terminate connections and instead just sends RST on the
socket and terminates it.

To fix this for real requires 3 parts in this commit, though one of the parts is platform-dependent:

1.  When the daemon is exiting cleanly, it should explicitly shut down the socket that it was using
for the XMLRPC server. That will cleanly shut down the socket, and tell the kernel it can start the
cleanup. On its own, this does not completely solve the problem, but it reduces the amount of time
that things are hanging about waiting for the Python interpreter and/or the kernel to implicitly
clean things up.
2.  We should not specify SO_LINGER on the daemon sockets. As mentioned above, this is actually
something of an anti-pattern and does not properly terminate connections with FIN (it just sends RST).
3.  We should specify SO_REUSEADDR, but only on Unix. On Unix, SO_REUSEADDR essentially means
"allow binding to an address/port that is in TCP TIME_WAIT (but not that is otherwise in use)".
This is exactly the behavior we want. On Windows, SO_REUSEADDR causes undefined behavior, as
it can cause a socket to bind even if there is something else bound already. Because of that, we want
to set SO_REUSEADDR on Unix, but not Windows.
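
A sketch of the platform split described in point 3 (illustrative, not the literal patch):

```python
import platform
from xmlrpc.server import SimpleXMLRPCServer

class DaemonXMLRPCServer(SimpleXMLRPCServer):
    # On Unix, SO_REUSEADDR only allows rebinding an address stuck in
    # TCP TIME_WAIT, which is the behavior we want. On Windows it can let
    # two live sockets bind the same address, so leave it off there.
    allow_reuse_address = platform.system() != 'Windows'
```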

Finally, while testing here I had to add one bugfix to make things reliable on Windows, which is to
also catch ConnectionResetError. That arises because we can attempt to "connect" to a daemon that
is in the process of shutting down. In that case, we should also consider the daemon not "connected".

Signed-off-by: Chris Lalancette <[email protected]>