ponyc segfault during runtime shutdown #1608

nisanharamati · 2017-09-29T17:52:44Z

##tldr;

come up with a standalone reproduction code block (in pony, i.e. without wallaroo)
open an issue against mainline pony

Background

While addressing a flip-flopping test failure in wallaroo, where the DataChannel's shutdown sequence had a race condition, we set up a shell script to run the data_channel unit tests in a loop until they failed because of that error. This resulted in the unit test failure being reproducible within a few seconds if you ran that test loop in parallel on multiple shells.
The fix involved applying two patches from mainlaine ponyc's TCPConnection to all of our copies of TCP Connection: ponylang/ponyc#1578, ponylang/ponyc#1626.

While this helped us fix the original issue, it also exposed a much less frequent segfault that seems to occur while running these tests.

@dipinhora investigated this

The evidence i have here leads me to believe that the runtime shuts down the asio thread (and the other scheduler threads) prior to an pony_asio_event_subscribe being processed resulting in the segfault.. what i don't understand is why that would happen since the runtime is supposed to wait to shutdown if there are any noisy actors (and it shows 2 of them).. or maybe this is before all the threads have been started? either way.. i'm confused 8*/

thanks to the magic of printf, i've confirmed that the ASIO backend is shut down prematurely.. now just have to figure out why

alrighty... so.. i mostly know what's going on and why it's a segfault... there's a race condition in the pony runtime shutdown code due to how the CNF/ACK cycle works where the runtime shuts down the ASIO backend because it thinks it has reached quiescence but then runs another CNF/ACK cycle which could potentially result in the runtime realizing that it hasn't reached quiescence after all... but by this time the ASIO backend has already been shut down and so we get a segfault... this seems to me to be something more likely to occur with a lower # of threads where the CNF/ACK cycles can be/are quick and can happen prior to messages being processed (due to how the so the runtime thinks there's a false quiescence that's been reached.. it's are to think of the exact sequence of operations that occurred leading to this situation but it's what the evidence is pointing to

Conclusion

There appears to be a race condition in the pony runtime shutdown code handling the ASIO backend. This results in a segfault during runtime shutdown.
We want to come up with code to reproduce this that's independent of Wallaroo, and open an issue to fix it against mainline Ponyc.

The text was updated successfully, but these errors were encountered:

Prior to this commit, the shutdown process could potentially cause the runtime to end up in an invalid state where the ASIO backend had been stopped but the shutdown of the runtime was aborted. If this occurred and any actor subsequently tried to subscribe to an ASIO event, it would cause a segfault because the ASIO backend would be `NULL` instead of an actual `asio_backend_t*`. This commit changes things so that if the shutdown is aborted due to the receipt of a new `UNBLOCK` message andthe ASIO has already been stopped, we try and restart the ASIO backend. This commit also adds some extra assertions around the use of the ASIO backend to ensure it is not `NULL` when it shouldn't be. This should resolve WallarooLabs/wally/issues/1608.

Prior to this commit, the shutdown process could potentially cause the runtime to end up in an invalid state where the ASIO backend had been stopped but the shutdown of the runtime was aborted. If this occurred and any actor subsequently tried to subscribe to an ASIO event, it would cause a segfault because the ASIO backend would be `NULL` instead of an actual `asio_backend_t*`. This commit changes things so that if the shutdown is aborted due to the receipt of a new `UNBLOCK` message and the ASIO backend has already been stopped, we try and restart the ASIO backend. This commit also adds some extra assertions around the use of the ASIO backend to ensure it is not `NULL` when it shouldn't be. This should resolve WallarooLabs/wally/issues/1608.

nisanharamati · 2017-11-28T18:38:19Z

👏

nisanharamati added the test label Sep 29, 2017

dipinhora mentioned this issue Oct 25, 2017

Standard Library tests occassionally fails to exit ponylang/ponyc#2298

Closed

dipinhora mentioned this issue Nov 25, 2017

Restart ASIO if needed while runtime is attempting to terminate. ponylang/ponyc#2373

Merged

dipinhora closed this as completed in dipinhora/ponyc@c50bf31 Nov 25, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ponyc segfault during runtime shutdown #1608

ponyc segfault during runtime shutdown #1608

nisanharamati commented Sep 29, 2017 •

edited

Loading

nisanharamati commented Nov 28, 2017

ponyc segfault during runtime shutdown #1608

ponyc segfault during runtime shutdown #1608

Comments

nisanharamati commented Sep 29, 2017 • edited Loading

Background

Conclusion

nisanharamati commented Nov 28, 2017

nisanharamati commented Sep 29, 2017 •

edited

Loading