Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arm stress test failure #3901

Closed
SeanTAllen opened this issue Oct 9, 2021 · 3 comments · Fixed by #3974
Closed

Arm stress test failure #3901

SeanTAllen opened this issue Oct 9, 2021 · 3 comments · Fixed by #3974
Assignees
Milestone

Comments

@SeanTAllen
Copy link
Member

SeanTAllen commented Oct 9, 2021

This needs to be looked into more. First, verify that the assertion is valid. I'm pretty sure it is, but it should be revisited. Next, if it is valid, start figuring out how this might happened. We might need to figure out how to store the core dump when the assertion is hit to be able to investigate further.

stress_test.log.txt

# pingers 320, report-interval 300, report-count 40, initial-pings 5
time,run-ns,rate
1633744979.384644639,29999107986,121726
1633745009.384589159,29994999103,78168
1633745039.385096865,29997583820,81089
1633745069.383519421,29995676631,134263
1633745099.383482434,29997357952,86157
1633745129.383486906,29994808373,103566
1633745159.383501672,29997518447,104940
1633745189.383529013,29987221155,86130
1633745219.383530791,29997367048,105026
1633745249.383499948,29997443628,253888
1633745279.383498926,29997540641,183572
1633745309.383482997,29997372430,81921
1633745339.383634755,29995461586,80653
1633745369.383524043,29997225028,148541
1633745399.383554970,29997360660,539549
1633745429.383577009,29997636482,87572
1633745459.383763907,29982525165,97255
1633745489.384085025,29997419205,103003
1633745519.383571714,29996706556,80995
1633745549.383570660,29997613470,101745
1633745579.383544874,29996627889,75164
1633745609.383691482,29997434307,79045
1633745639.383542224,29995551837,98205
1633745669.383492419,29987871013,115666
1633745699.383575414,29996335325,77488
1633745729.383550266,29997101304,80457
1633745759.383564689,29997011390,79572
1633745789.383539997,29997228594,83638
1633745819.383510996,29996973907,90360
1633745849.383514323,29997525445,95659
1633745879.383479880,29997142151,77119
1633745909.383506238,29997182471,201210
1633745939.383579102,29997542593,102110
1633745969.383493879,29997261462,102501
1633745999.384491719,29998538042,108283
1633746029.383519051,29985314556,129800
/tmp/cirrus-ci-build/src/libponyrt/actor/actor.c:353: ponyint_actor_run: Assertion `!ponyint_is_muted(actor)` failed.

Backtrace:
  ./ubench(ponyint_assert_fail+0xe0) [0x43ffcc]
  ./ubench(ponyint_actor_run+0x5c) [0x4315c8]
  ./ubench() [0x4419fc]
  ./ubench() [0x440e70]
  /lib/aarch64-linux-gnu/libpthread.so.0(+0x84fc) [0xffff835ed4fc]
  /lib/aarch64-linux-gnu/libc.so.6(+0xd467c) [0xffff8342e67c]
Aborted (core dumped)
make: *** [Makefile:209: test-stress-release] Error 134

It appear from history etc that this is happening is very unlikely on Arm (once in billions) and even way more rare on x86 where I have been able in the recent past to run for 72 hours on x86 without triggering this issue (I ended the test after 72 hours). I was also able on the Raspberry Pis to run for 24 hours without having this happen.

So far, this has happened once out of 7 Arm stress test runs on Graviton.

Each stress test has billions of operations passing through the overall code path that code possibly cause this.

@SeanTAllen SeanTAllen added bug needs investigation This needs to be looked into before its "ready for work" labels Oct 9, 2021
@SeanTAllen SeanTAllen added this to the Arm support milestone Oct 9, 2021
@SeanTAllen SeanTAllen self-assigned this Oct 9, 2021
@SeanTAllen
Copy link
Member Author

SeanTAllen commented Oct 9, 2021

I've discussed this with @Theodus and he agrees with my assessment:

I think I found a logical flaw in my backpressure implementation from years ago.

//
// For backpressure related muting and unmuting to work correctly, the following
// rules have to be maintained.
//
// 1. Across schedulers, an actor should never been seen as muted when it is not
// in fact muted.
// 2. It's ok for a muted actor to be seen as unmuted in a transient fashion
// across actors

I think #2 is incorrect because...

  • IF an actor has 1 last message in its queue.
  • And it gets muted on the last message send
  • So that it is muted and empty queue

AND

another actor sends to it and see that the queue was empty we end up in this code:

    if(!has_flag(to, FLAG_UNSCHEDULED) && !ponyint_is_muted(to))
    {
      // if the receiving actor is currently not unscheduled AND it's not
      // muted, schedule it.
      ponyint_sched_add(ctx, to);
    }

IF we see this muted actor as unmuted (which rule 2 says is ok)... then this muted actor gets scheduled...
and in debug mode... when it gets run we have...

bool ponyint_actor_run(pony_ctx_t* ctx, pony_actor_t* actor, bool polling)
{
  pony_assert(!ponyint_is_muted(actor));

And the assert will fail.

Before I work through the ramifications of running a muted actor which might be ok, or might do really weird things (i suspect there's a decent chance it is ok), I want to go over with someone knowledgeable if what I detail about is reasonable.

I believe so. The fix might be as simple as removing that assert and probably replacing it with an immediate return false so short circuit. In which case it we still don't run the muted actor.

OR the fix could be to change rule #2, but that would probably be worse performance-wise than the check and bail.


The full conversation is here: https://ponylang.zulipchat.com/#narrow/stream/190365-runtime/topic/A.20bug.20in.20backpressure/near/256839385

@SeanTAllen
Copy link
Member Author

SeanTAllen commented Oct 9, 2021

I believe that this could if lots of things go "wrong" to be a very very bad bug that could result in an actor being run by two different schedulers and other not good things related to that. However, I need to review a bit more. However, this would be an exceedingly exceedingly unlilkely thing to happen.

If I am correct, then probably the best fix for correctness and performance would be to change the assert into into a check that returns false in the rare occurrence that this happens.

@SeanTAllen
Copy link
Member Author

So we can for at least 1 case, change the assert and still basically keep the same constraints with the one caveat.
The assert would become a return if muted.

However, I need to go through all the usage of mute and unmuted and see where they are used in relation to "is muted" checks (and triggers muting checks) to see if there are time of check, time of use violations that might potentially be a problem.

@SeanTAllen SeanTAllen removed the needs investigation This needs to be looked into before its "ready for work" label Jan 29, 2022
ergl pushed a commit to ergl/ponyc that referenced this issue Jan 31, 2022
A small logical flaw was discovered in the Pony runtime backpressure system that could lead to an actor that has been muted to prevent it from overloading other actors to be run despite a rule that says muted actors shouldn't be run.

The series of events that need to happen are exceedingly unlikely but we do seem them from time to time in our Arm64 runtime stress tests. In the event that a muted actor was run, if an even more unlikely confluence of events was to occur, then "very bad things" could happen in the Pony runtime where "very bad things" means "we the Pony developers are unable to reason about what might happen".

Fixes ponylang#3901
ergl pushed a commit to ergl/ponyc that referenced this issue Jan 31, 2022
A small logical flaw was discovered in the Pony runtime backpressure system that could lead to an actor that has been muted to prevent it from overloading other actors to be run despite a rule that says muted actors shouldn't be run.

The series of events that need to happen are exceedingly unlikely but we do seem them from time to time in our Arm64 runtime stress tests. In the event that a muted actor was run, if an even more unlikely confluence of events was to occur, then "very bad things" could happen in the Pony runtime where "very bad things" means "we the Pony developers are unable to reason about what might happen".

Fixes ponylang#3901
@SeanTAllen SeanTAllen removed the bug label Jan 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant