Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ponyc segfault during runtime shutdown #1608

Closed
nisanharamati opened this issue Sep 29, 2017 · 1 comment · Fixed by ponylang/ponyc#2373
Closed

ponyc segfault during runtime shutdown #1608

nisanharamati opened this issue Sep 29, 2017 · 1 comment · Fixed by ponylang/ponyc#2373
Labels

Comments

@nisanharamati
Copy link
Contributor

nisanharamati commented Sep 29, 2017

##tldr;

  1. come up with a standalone reproduction code block (in pony, i.e. without wallaroo)
  2. open an issue against mainline pony

Background

While addressing a flip-flopping test failure in wallaroo, where the DataChannel's shutdown sequence had a race condition, we set up a shell script to run the data_channel unit tests in a loop until they failed because of that error. This resulted in the unit test failure being reproducible within a few seconds if you ran that test loop in parallel on multiple shells.
The fix involved applying two patches from mainlaine ponyc's TCPConnection to all of our copies of TCP Connection: ponylang/ponyc#1578, ponylang/ponyc#1626.

While this helped us fix the original issue, it also exposed a much less frequent segfault that seems to occur while running these tests.

@dipinhora investigated this

The evidence i have here leads me to believe that the runtime shuts down the asio thread (and the other scheduler threads) prior to an pony_asio_event_subscribe being processed resulting in the segfault.. what i don't understand is why that would happen since the runtime is supposed to wait to shutdown if there are any noisy actors (and it shows 2 of them).. or maybe this is before all the threads have been started? either way.. i'm confused 8*/

thanks to the magic of printf, i've confirmed that the ASIO backend is shut down prematurely.. now just have to figure out why

alrighty... so.. i mostly know what's going on and why it's a segfault... there's a race condition in the pony runtime shutdown code due to how the CNF/ACK cycle works where the runtime shuts down the ASIO backend because it thinks it has reached quiescence but then runs another CNF/ACK cycle which could potentially result in the runtime realizing that it hasn't reached quiescence after all... but by this time the ASIO backend has already been shut down and so we get a segfault... this seems to me to be something more likely to occur with a lower # of threads where the CNF/ACK cycles can be/are quick and can happen prior to messages being processed (due to how the so the runtime thinks there's a false quiescence that's been reached.. it's are to think of the exact sequence of operations that occurred leading to this situation but it's what the evidence is pointing to

Conclusion

There appears to be a race condition in the pony runtime shutdown code handling the ASIO backend. This results in a segfault during runtime shutdown.
We want to come up with code to reproduce this that's independent of Wallaroo, and open an issue to fix it against mainline Ponyc.

dipinhora added a commit to dipinhora/ponyc that referenced this issue Nov 25, 2017
Prior to this commit, the shutdown process could potentially cause
the runtime to end up in an invalid state where the ASIO backend
had been stopped but the shutdown of the runtime was aborted. If
this occurred and any actor subsequently tried to subscribe to
an ASIO event, it would cause a segfault because the ASIO backend
would be `NULL` instead of an actual `asio_backend_t*`.

This commit changes things so that if the shutdown is aborted due
to the receipt of a new `UNBLOCK` message andthe ASIO has already
been stopped, we try and restart the ASIO backend.

This commit also adds some extra assertions around the use of the
ASIO backend to ensure it is not `NULL` when it shouldn't be.

This should resolve WallarooLabs/wally/issues/1608.
dipinhora added a commit to dipinhora/ponyc that referenced this issue Nov 25, 2017
Prior to this commit, the shutdown process could potentially cause
the runtime to end up in an invalid state where the ASIO backend
had been stopped but the shutdown of the runtime was aborted. If
this occurred and any actor subsequently tried to subscribe to
an ASIO event, it would cause a segfault because the ASIO backend
would be `NULL` instead of an actual `asio_backend_t*`.

This commit changes things so that if the shutdown is aborted due
to the receipt of a new `UNBLOCK` message and the ASIO backend
has already been stopped, we try and restart the ASIO backend.

This commit also adds some extra assertions around the use of the
ASIO backend to ensure it is not `NULL` when it shouldn't be.

This should resolve WallarooLabs/wally/issues/1608.
dipinhora added a commit to dipinhora/ponyc that referenced this issue Nov 25, 2017
Prior to this commit, the shutdown process could potentially cause
the runtime to end up in an invalid state where the ASIO backend
had been stopped but the shutdown of the runtime was aborted. If
this occurred and any actor subsequently tried to subscribe to
an ASIO event, it would cause a segfault because the ASIO backend
would be `NULL` instead of an actual `asio_backend_t*`.

This commit changes things so that if the shutdown is aborted due
to the receipt of a new `UNBLOCK` message and the ASIO backend
has already been stopped, we try and restart the ASIO backend.

This commit also adds some extra assertions around the use of the
ASIO backend to ensure it is not `NULL` when it shouldn't be.

This should resolve WallarooLabs/wally/issues/1608.
@nisanharamati
Copy link
Contributor Author

👏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant