
Wait for daemon shutdown before testing NodeStrategy #620

Closed
hidmic wants to merge 2 commits

Conversation

@hidmic (Contributor) commented Apr 9, 2021

@hidmic requested review from audrow and ivanpauno on April 9, 2021
@clalancette (Contributor)

Windows still looks unhappy :(.

@ivanpauno (Member)

Windows still looks unhappy :(.

That's interesting. I'm not sure if the daemon is taking really long to shut down, or if it's deadlocking (?)

Signed-off-by: Michel Hidalgo <[email protected]>
@hidmic (Contributor Author) commented Apr 9, 2021

That's interesting. I'm not sure if the daemon is taking really long to shut down, or if it's deadlocking (?)

I suspect it's the socket lingering; that's how this code detects whether the daemon is running (arguably, not a great way of doing so).

I don't see any resource mismanagement in the daemon or in xmlrpc. The socket is explicitly closed. Maybe that's not enough on Windows?

@hidmic (Contributor Author) commented Apr 12, 2021

Alright, I think I know what the problem is. There is a TOCTOU race in the test, and ultimately in this whole idea of checking for a failing socket.bind in the is_daemon_running implementation.

The test fails because more than one daemon gets spawned and shutdown requests get routed to only one of them. This occurs because SimpleXMLRPCServer allows address reuse by default (see here and here), and since socket.bind calls start to fail only once the first socket.listen has succeeded, is_daemon_running() keeps returning False even while a daemon is spawning, potentially letting more daemons be spawned.

I can tweak the test to not run into this situation, but the status check for the daemon is fundamentally flawed IMHO. It has to be changed (perhaps using file locking?).
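For illustration, the kind of bind-based check being described looks roughly like this (a simplified sketch with a hypothetical port, not the actual ros2cli code):

```python
import socket

def is_daemon_running(addr=('127.0.0.1', 11511)):  # port is illustrative
    """Probe with bind(): assume a daemon is up iff the address is taken."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    probe.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        probe.bind(addr)
    except OSError:
        return True   # address in use: assume a daemon is running
    else:
        return False  # bind succeeded: assume no daemon is running
    finally:
        probe.close()

# TOCTOU window: with SO_REUSEADDR set on both sockets, the probe's bind()
# only starts failing after the spawning daemon's listen() has succeeded,
# so a second process can see "not running" and spawn another daemon.
```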

@ivanpauno (Member)

The test fails because more than one daemon gets spawned and shutdown requests get routed to only one of them. This occurs because SimpleXMLRPCServer allows address reuse by default

Can we disable allow_reuse_address?
I don't see why we need it; we only want one daemon.

@hidmic (Contributor Author) commented Apr 12, 2021

Can we disable allow_reuse_address?

Yes, we can. But it doesn't solve the problem IMHO. It hides it. Spurious daemons can still be spawned; they just won't live long enough to become an issue.

@ivanpauno (Member)

Yes, we can. But it doesn't solve the problem IMHO. It hides it. Spurious daemons can still be spawned; they just won't live long enough to become an issue.

Sorry, I don't quite follow the issue.
What do you mean by "spurious daemons"? Why won't they live long enough to become an issue?

I'm also not sure how file locking could help; could you expand?

@hidmic (Contributor Author) commented Apr 12, 2021

What do you mean by "spurious daemons"? Why won't they live long enough to become an issue?

Multiple daemons can end up running because there's a race in the check for whether a daemon is already running or not (we use that check to gate daemon execution). Even if we disable socket address reuse, that race is still there. Two daemons can still be spawned in quick succession. One of them will exit or crash, though. That's what I meant by hiding the issue.

I'm also not sure how file locking could help; could you expand?

If we had some form of synchronization, we could avoid the races. But it's tricky. You'd have to first attempt to lock the file: if you can, no daemon is running. Then fork to run the daemon while preserving the file descriptor that holds the lock. The lock would get released when the daemon dies. On Windows... (shivers) you'd need an extra outer lock to check for the inner lock and spawn, plus an IPC mechanism for the child process to signal when it has acquired the inner lock.

Spurious daemons don't sound that bad now 😅
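
A rough sketch of the Unix half of that locking scheme, assuming a hypothetical lock path (the Windows side would need msvcrt.locking plus the outer lock and IPC described above):

```python
import fcntl

LOCK_PATH = '/tmp/ros2-daemon.lock'  # hypothetical location

def acquire_daemon_lock():
    """Return a held lock file if no daemon is running, else None."""
    lock_file = open(LOCK_PATH, 'w')
    try:
        # Non-blocking exclusive lock: fails if a daemon already holds it.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None  # a daemon is already running
    # The spawning process would keep this descriptor open across fork/exec
    # so the daemon inherits the lock; the kernel releases it automatically
    # when the daemon process dies.
    return lock_file
```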

@ivanpauno (Member)

Multiple daemons can end up running because there's a race in the check for whether a daemon is already running or not (we use that check to gate daemon execution). Even if we disable socket address reuse, that race is still there. Two daemons can still be spawned in quick succession. One of them will exit or crash, though. That's what I meant by hiding the issue.

IMO spawning two daemons in succession is not a problem if one crashes.

e.g.

1. process A spawns a daemon
2. process B doesn't see a daemon running and spawns a new daemon
3. either the daemon spawned by process A or the one spawned by B starts serving, and the other crashes
4. process A waits until the daemon is running and continues
5. process B waits until the daemon is running and continues

The important thing process A and B should know is that the daemon they spawned might crash and they might find one spawned by another process instead, but that seems irrelevant.

@hidmic does that sound reasonable to you?

If we had some form of synchronization, we could avoid the races

I'm not against adding another form of synchronization, but it makes the code more complex.

@hidmic (Contributor Author) commented Apr 13, 2021

IMO spawning two daemons in succession is not a problem if one crashes.

It's not a problem, but it's not ideal either. Spawning processes on an OS like Windows is somewhat expensive.

I'm not against adding another form of synchronization, but it makes the code more complex.

Yeah, it does get complex, particularly on Windows.

I had another idea. The process that attempts to spawn a daemon could first bind the socket, fetch its underlying handle (using socket.fileno() or socket.share()), and forward it to the daemon through stdin. The daemon can reconstruct the socket (using socket.fromfd() or socket.fromshare()) and let its parent know through stdout, so that the parent can release its own copy. In theory, it works 😅
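
A minimal sketch of the Unix variant of that handover (the script name and address are hypothetical; on Windows, socket.share()/socket.fromshare() would replace plain fd inheritance):

```python
import socket
import subprocess
import sys

# Parent: bind first, then hand the file descriptor to the daemon.
server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_sock.bind(('127.0.0.1', 11511))  # hypothetical daemon address
server_sock.set_inheritable(True)       # let the fd survive exec on Unix
daemon = subprocess.Popen(
    [sys.executable, 'daemon.py'],      # hypothetical daemon script
    stdin=subprocess.PIPE, stdout=subprocess.PIPE,
    close_fds=False,                    # keep inheritable fds open in the child
)
daemon.stdin.write(b'%d\n' % server_sock.fileno())  # forward the handle
daemon.stdin.flush()
daemon.stdout.readline()                # wait for the daemon's acknowledgment
server_sock.close()                     # parent releases its copy

# Daemon side (daemon.py), roughly:
#   fd = int(sys.stdin.readline())
#   sock = socket.socket(fileno=fd)     # reconstruct; socket.fromfd() also works
#   sock.listen()
#   print('ready', flush=True)          # tell the parent to release its copy
```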


In the interest of time, I'll do the following: disable address reuse and handle bind failures in the daemon gracefully. Then I'll open an issue or a draft PR with the more sophisticated approach.

@ivanpauno (Member)

In the interest of time, I'll do the following: disable address reuse and handle bind failures in the daemon gracefully. Then I'll open an issue or a draft PR with the more sophisticated approach.

Sounds great!!

I had another idea. The process that attempts to spawn a daemon could first bind the socket, fetch its underlying handle (using socket.fileno() or socket.share()), and forward it to the daemon through stdin. The daemon can reconstruct the socket (using socket.fromfd() or socket.fromshare()) and let its parent know through stdout, so that the parent can release its own copy. In theory, it works 😅

I'm happy if that works, but I'm not sure whether file descriptors are inherited when spawning a process, nor whether all OSs behave the same way.

@hidmic (Contributor Author) commented Apr 13, 2021

whether file descriptors are inherited when spawning a process

On Unix OSes, they can be. On Windows, they cannot, but they can be shared through IPC. I'm sure there's a booby trap somewhere, so I'll defer that POC.

@hidmic (Contributor Author) commented Apr 13, 2021

Oh boy. We need SO_REUSEADDR because Python's XMLRPC server likes leaving accepted connection sockets lying around in the TIME_WAIT state. And then Linux won't allow rebinding, i.e. I cannot stop a daemon and immediately start another one, unless SO_REUSEADDR is set 🙈

@ivanpauno (Member)

Oh boy. We need SO_REUSEADDR because Python's XMLRPC server likes leaving accepted connection sockets lying around in the TIME_WAIT state. And then Linux won't allow rebinding, i.e. I cannot stop a daemon and immediately start another one, unless SO_REUSEADDR is set

The socket object is a "public" member; maybe you can set SO_LINGER to 0 before binding (?).
https://github.com/python/cpython/blob/b5711c940f70af89f2b4cf081a3fcd83924f3ae7/Lib/socketserver.py#L464-L465.
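
A sketch of what that override could look like (an illustrative subclass, not the actual ros2cli change; the packed struct shown is the common Linux encoding of struct linger):

```python
import socket
import struct
from xmlrpc.server import SimpleXMLRPCServer

ZERO_LINGER = struct.pack('ii', 1, 0)  # linger on, zero timeout: close() sends RST

class NoLingerXMLRPCServer(SimpleXMLRPCServer):
    def server_bind(self):
        # Set SO_LINGER on the listening socket before binding, as suggested.
        self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, ZERO_LINGER)
        super().server_bind()

    def get_request(self):
        # The accepted connection is what the server actively close()s after
        # each HTTP request, so it likely needs the same treatment.
        request, client_address = super().get_request()
        request.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, ZERO_LINGER)
        return request, client_address
```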

The other option is just to live with the long time it can take to spawn a daemon if one was recently closed, but that isn't very nice.

@hidmic (Contributor Author) commented Apr 14, 2021

The socket object is a "public" member; maybe you can set SO_LINGER to 0 before binding (?).

I have to override a few things; I believe the problem stems from actively closing accepted connections. Setting SO_LINGER to 0 resets the connection. Not sure if clients will be happy about it, but I can try.

The other option is just to live with the long time it can take to spawn a daemon if one was recently closed, but that isn't very nice.

Yeah, no, that's not nice. That timeout is about 4 minutes on Linux by default.

@ivanpauno (Member)

I have to override a few things; I believe the problem stems from actively closing accepted connections.

Yes, the problem is that we're kind of doing the "opposite" of what's recommended.
The idea is to "avoid" an active close() by the server; for that, the server typically sends a "shutdown" message to the client, which then actively close()s.
We're doing the opposite: creating a client connection to send a "shutdown" message to the server, which actively close()s, and thus the server enters the TIME_WAIT state (probably because the connection of the client that sent the "shutdown" message is still active).

AFAIK resetting the connection isn't too bad; worst case, it will cause a ConnectionResetError.
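
On the client side, tolerating the reset might look like this (a sketch; the address and RPC method name are hypothetical):

```python
import xmlrpc.client

def shutdown_daemon(url='http://127.0.0.1:11511/'):  # hypothetical address
    proxy = xmlrpc.client.ServerProxy(url)
    try:
        proxy.shutdown()  # hypothetical RPC method name
    except ConnectionResetError:
        # A zero-timeout SO_LINGER close() sends RST mid-exchange; the
        # daemon is going down anyway, so treat the reset as success.
        pass
```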

@hidmic (Contributor Author) commented Apr 14, 2021

The idea is to "avoid" an active close() by the server; for that, the server typically sends a "shutdown" message to the client, which then actively close()s.

Indeed.

thus the server enters the TIME_WAIT state (probably because the connection of the client that sent the "shutdown" message is still active).

All XMLRPC requests end up like that: the server closes the accepted connection after serving the HTTP request 😃. I suspect that's why allow_reuse_address is True by default.

@hidmic (Contributor Author) commented Apr 14, 2021

SO_LINGER worked like a charm 🎉 I prototyped sharing sockets through IPC too, but, unsurprisingly, it gets messy.

@hidmic (Contributor Author) commented Apr 14, 2021

Superseded by #622.

@hidmic closed this on Apr 14, 2021
@hidmic deleted the hidmic/wait-for-daemon-shutdown branch on April 14, 2021
clalancette added a commit that referenced this pull request Nov 20, 2024
The original problem here is that running the test_strategy.py test
in the nightly repeated jobs "sometimes" fails.

There have been a few attempts to fix flakiness in the ros2 daemon in the past.
These include #620, #622, and #652.
These all changed things in various ways, but the key PR was #652, which made
spawning the daemon a reliable operation.

#622 made some changes to the sockets to add SO_LINGER with a zero timeout.
That improved the situation, but did not totally solve it. It also has its own downsides, as
SO_LINGER doesn't gracefully terminate connections and instead just sends RST on the
socket and terminates it.

To fix this for real requires 3 parts in this commit, though one of the parts is platform-dependent:

1.  When the daemon is exiting cleanly, it should explicitly shut down the socket that it was using
for the XMLRPC server. That will cleanly shut down the socket, and tell the kernel it can start the
cleanup. On its own, this does not completely solve the problem, but it reduces the amount of time
that things are hanging about waiting for the Python interpreter and/or the kernel to implicitly
clean things up.
2.  We should not specify SO_LINGER on the daemon sockets. As mentioned above, this is actually
something of an anti-pattern and does not properly terminate connections with FIN (it just sends RST).
3.  We should specify SO_REUSEADDR, but only on Unix. On Unix, SO_REUSEADDR essentially means
"allow binding to an address/port that is in TCP TIME_WAIT (but not that is otherwise in use)".
This is exactly the behavior we want. On Windows, SO_REUSEADDR causes undefined behavior, as
it can cause a socket to bind even if there is something else bound already. Because of that, we want
to set SO_REUSEADDR on Unix, but not Windows.
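
A sketch of the platform split described in point 3 (illustrative, not the literal patch):

```python
import platform
from xmlrpc.server import SimpleXMLRPCServer

class DaemonXMLRPCServer(SimpleXMLRPCServer):
    # On Unix, SO_REUSEADDR only allows rebinding an address stuck in
    # TCP TIME_WAIT, which is the behavior we want. On Windows it can let
    # two live sockets bind the same address, so leave it off there.
    allow_reuse_address = platform.system() != 'Windows'
```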

Finally, while testing here I had to add one bugfix to make things reliable on Windows, which is to
also catch ConnectionResetError. That arises because we can attempt to "connect" to a daemon that
is in the process of shutting down. In that case, we should also consider the daemon not "connected".

Signed-off-by: Chris Lalancette <[email protected]>