-
-
Notifications
You must be signed in to change notification settings - Fork 419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arm stress test failure #3901
Comments
I've discussed this with @Theodus and he agrees with my assessment: I think I found a logical flaw in my backpressure implementation from years ago. //
// For backpressure related muting and unmuting to work correctly, the following
// rules have to be maintained.
//
// 1. Across schedulers, an actor should never been seen as muted when it is not
// in fact muted.
// 2. It's ok for a muted actor to be seen as unmuted in a transient fashion
// across actors I think #2 is incorrect because...
AND another actor sends to it and see that the queue was empty we end up in this code: if(!has_flag(to, FLAG_UNSCHEDULED) && !ponyint_is_muted(to))
{
// if the receiving actor is currently not unscheduled AND it's not
// muted, schedule it.
ponyint_sched_add(ctx, to);
} IF we see this muted actor as unmuted (which rule 2 says is ok)... then this muted actor gets scheduled... bool ponyint_actor_run(pony_ctx_t* ctx, pony_actor_t* actor, bool polling)
{
pony_assert(!ponyint_is_muted(actor)); And the assert will fail. Before I work through the ramifications of running a muted actor which might be ok, or might do really weird things (i suspect there's a decent chance it is ok), I want to go over with someone knowledgeable if what I detail about is reasonable. I believe so. The fix might be as simple as removing that assert and probably replacing it with an immediate return false so short circuit. In which case it we still don't run the muted actor. OR the fix could be to change rule #2, but that would probably be worse performance-wise than the check and bail. The full conversation is here: https://ponylang.zulipchat.com/#narrow/stream/190365-runtime/topic/A.20bug.20in.20backpressure/near/256839385 |
I believe that this could if lots of things go "wrong" to be a very very bad bug that could result in an actor being run by two different schedulers and other not good things related to that. However, I need to review a bit more. However, this would be an exceedingly exceedingly unlilkely thing to happen. If I am correct, then probably the best fix for correctness and performance would be to change the assert into into a check that returns false in the rare occurrence that this happens. |
So we can for at least 1 case, change the assert and still basically keep the same constraints with the one caveat. However, I need to go through all the usage of mute and unmuted and see where they are used in relation to "is muted" checks (and triggers muting checks) to see if there are time of check, time of use violations that might potentially be a problem. |
A small logical flaw was discovered in the Pony runtime backpressure system that could lead to an actor that has been muted to prevent it from overloading other actors to be run despite a rule that says muted actors shouldn't be run. The series of events that need to happen are exceedingly unlikely but we do seem them from time to time in our Arm64 runtime stress tests. In the event that a muted actor was run, if an even more unlikely confluence of events was to occur, then "very bad things" could happen in the Pony runtime where "very bad things" means "we the Pony developers are unable to reason about what might happen". Fixes ponylang#3901
A small logical flaw was discovered in the Pony runtime backpressure system that could lead to an actor that has been muted to prevent it from overloading other actors to be run despite a rule that says muted actors shouldn't be run. The series of events that need to happen are exceedingly unlikely but we do seem them from time to time in our Arm64 runtime stress tests. In the event that a muted actor was run, if an even more unlikely confluence of events was to occur, then "very bad things" could happen in the Pony runtime where "very bad things" means "we the Pony developers are unable to reason about what might happen". Fixes ponylang#3901
This needs to be looked into more. First, verify that the assertion is valid. I'm pretty sure it is, but it should be revisited. Next, if it is valid, start figuring out how this might happened. We might need to figure out how to store the core dump when the assertion is hit to be able to investigate further.
stress_test.log.txt
It appear from history etc that this is happening is very unlikely on Arm (once in billions) and even way more rare on x86 where I have been able in the recent past to run for 72 hours on x86 without triggering this issue (I ended the test after 72 hours). I was also able on the Raspberry Pis to run for 24 hours without having this happen.
So far, this has happened once out of 7 Arm stress test runs on Graviton.
Each stress test has billions of operations passing through the overall code path that code possibly cause this.
The text was updated successfully, but these errors were encountered: