-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
No more use of unordered reactions in federated programs #191
Conversation
- Place unsafe reactions onto a hashset, probe hashset as signaled to remove reactions to be safely enqueued if upstream port statuses are known
This reverts commit 318299a.
We are now able to get past startup and print the number 42, but federate 1 hangs after requesting shutdown. Also, this may break things for the other programs.
727fa8e
to
687bf89
Compare
These are all specific to decentralized coordination.
This also removes two other fields of the reaction_t struct which are not necessary.
After this change I was able to run Python's StopAtShutdown 95 times without observing any deadlock, whereas before I could consistently reproduce the deadlock (when using the macOS runner from CI) by running it several times. It is not entirely clear what this deadlock has to do with the changes in this branch. Although I have not reproduced it in master (I have tried, and after running it 95 times I did not see deadlock), it does seem like a programming mistake to allow the stop tag to increase. Besides, the error was very flaky, non-reproducible on Ubuntu, and quite possibly present even if it is not observed.
So much for failing fast. It seems like this is the behavior that is expected by our tests, so for now maybe it is what we will do.
Thanks to Jacky Kwok for finding this.
The ability to do this is the initial motivation for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking very good! Only some comments missing, but other than that, this looks ready to merge...
Co-authored-by: Marten Lohstroh <[email protected]>
I think the timeout in Edit: Failed to reproduce locally on Ubuntu after running 348 times with log level LOG. Edit: Also failed to reproduce locally on Ubuntu after running 348 times with log level DEBUG. Edit: Also failed to reproduce on a macOS GH actions runner after running 265 times with log level DEBUG. |
Could we maybe pretend that everything is fine? I have run For the record, this is all the information we got from CI about the apparent deadlock:
It looks like the program never completed startup. It is possible to imagine that it is somehow not a true deadlock (some resource not available on GH's server, etc.) but I cannot say with confidence that it is not a true deadlock. |
Sounds fair to me. Should this reappear we'll deal with it then. |
The current runtime support for federated execution relies on a set of blocking unordered "control reactions" for each input port that to block the handling of reactions until the required port statuses are known. The drawbacks of this design are discussed in Rengarajan 2023, along with a solution that no longer relies on the use of blocking unordered reactions. This PR implements that solution.
This work was initiated by @arengarajan99 and completed by @petervdonovan.