
Crash in runtime due to ponyint_actor_run() executing the same actor at same time on two threads #3455

Open
KittyMac opened this issue Feb 1, 2020 · 4 comments

Comments

@KittyMac
Contributor

KittyMac commented Feb 1, 2020

Full details are here:

https://github.com/KittyMac/pony.problems/tree/master/crash_due_to_parallel_actor_execution

Running the example with --ponynoblock enabled is key to getting the crash. But the crash itself isn't the issue; the issue is that ponyint_actor_run() runs for the same pony_actor_t at the same time on two different schedulers. When that happens, one of them reaches the end of ponyint_actor_run() first and deletes the actor while the other is still executing, causing the crash.

To easily see when the same actor is being run in two places at the same time, you can use the patched actor.h and actor.c included in the repo above. They add a simple `bool running` to the actor struct in actor.h, set it to true on entry to ponyint_actor_run(), and set it back to false at each exit point. If ponyint_actor_run() is entered for an actor whose running flag is already true, the patch prints a '.' to the console and returns true (avoiding the parallel run).
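For reference, here is a minimal sketch of what that patch looks like. This is my reconstruction from the description above, not the exact diff; the real ponyint_actor_run() signature and body are abbreviated:

  // actor.h: add a flag to pony_actor_t (other fields omitted)
  //   bool running;

  // actor.c: at the top of ponyint_actor_run()
  if(actor->running)
  {
    // Another scheduler is already inside ponyint_actor_run() for this
    // actor: report it and decline the parallel run.
    printf(".");
    return true;
  }

  actor->running = true;

  // ... normal message processing ...

  // At every exit point, clear the flag before returning.
  actor->running = false;

A plain bool is itself racy, but as a diagnostic it is enough to make the overlapping runs visible.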

I spent a couple of days looking, but I did not find the cause of the issue. My investigation led me to think it's somewhere in the work stealing.

@SeanTAllen
Member

I haven't looked closely at this, but I would think the most likely culprit would be a bug in the backpressure implementation code.

@KittyMac
Contributor Author

KittyMac commented Feb 6, 2020

During experimentation, I discovered that if I add a pthread mutex around ponyint_mpmcq_pop(), the problem goes away.
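For illustration, the experiment was roughly of this shape (diagnostic only, not a proposed fix; the exact ponyint_mpmcq_pop() signature and the name of the original pop body are assumptions here):

  #include <pthread.h>

  static pthread_mutex_t pop_mut = PTHREAD_MUTEX_INITIALIZER;

  // sched/mpmcq.c: serialize the pop path. If the crash disappears with
  // this in place, the race involves concurrent pops from the queue.
  void* ponyint_mpmcq_pop(mpmcq_t* q)
  {
    pthread_mutex_lock(&pop_mut);
    void* item = mpmcq_pop_unlocked(q); // hypothetical name for the original lock-free body
    pthread_mutex_unlock(&pop_mut);
    return item;
  }

Serializing the pop defeats the point of the lock-free MPMC queue, so this only narrows down where the race lives.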

@KittyMac
Contributor Author

KittyMac commented Feb 8, 2020

OK, I understand this a bit more now.

Parts of ponyint_actor_run() do indeed run in parallel, and it looks like that is by design. So the issue is no longer "there's a danger of messages being processed out of order on two different schedulers"; it's back in the realm of "Pony is crashing when --ponynoblock is set and it's under heavy message load".

The synchronization point in Pony actors that prevents multiple schedulers from processing the same actor's message queue at the same time is in the scheduling logic in sendv. When a message is sent to an actor (via pony_sendv from a different thread), and the messageq has been marked as empty, the sending thread reschedules the actor with ponyint_sched_add (so that the new work gets processed in the future). This means that, while we can be in ponyint_actor_run() for the same actor on two different threads, the first thread must have called ponyint_messageq_markempty() before the second one starts.
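To make that handoff concrete, here is a simplified sketch of the sender side (the function names are the runtime's, but the call shape and the push return convention are my reading of it; the receiver side, ponyint_messageq_markempty(), appears in the snippet below):

  // Simplified from pony_sendv(): push the message onto the actor's queue.
  // The push reports whether the queue had been marked empty; only in that
  // case is the actor handed back to a scheduler, so normally exactly one
  // scheduler "owns" an actor at a time.
  if(ponyint_actor_messageq_push(&to->q, msg, msg))
    ponyint_sched_add(ctx, to);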

Knowing this, here is the relevant code, followed by the sequence that causes the crash:

  bool empty = ponyint_messageq_markempty(&actor->q);
  if (empty && actor_noblock && (actor->gc.rc == 0))
  {
    // when `actor_noblock` is true, the cycle detector isn't running.
    // this means actors won't be garbage collected unless we take special
    // action. Here, we know that:
    // - the actor has no messages in its queue
    // - there's no references to this actor
    // therefore if `noblock` is on, we should garbage collect the actor.

    ponyint_actor_setpendingdestroy(actor);
    ponyint_actor_final(ctx, actor);
    ponyint_actor_destroy(actor);

    return false;
  }

  // Return true (i.e. reschedule immediately) if our queue isn't empty.
  actor->running = 0;
  return !empty;

  1. Thread A calls ponyint_messageq_markempty(); empty is set to true.
  2. Thread B calls sendv, sees the queue marked empty, and reschedules the actor.
  3. Thread A then continues execution, dropping into the conditional and destroying the actor.
  4. Thread X gets the rescheduled-yet-destroyed actor and tries to process it, resulting in bad things.

Under this theory, the next missing piece of the puzzle is: why didn't the (actor->gc.rc == 0) check prevent entering the conditional? As per the comment, if no one is referencing this actor, how did someone manage to send it a message in step 2 of the theory?

I don't have a full answer to that yet. I do know that when step 4 happens, actor->gc.rc is 256 at that point. So it's possible that the ordering is more like:

  1. Thread A calls ponyint_messageq_markempty(); empty is set to true. It gets past the (actor->gc.rc == 0) check and into the conditional.
  2. Thread B is executing, gains a reference to the actor, calls sendv, sees the queue marked empty, and reschedules the actor.
  3. Thread A then continues execution, dropping into the conditional and destroying the actor.
  4. Thread X gets the rescheduled-yet-destroyed actor and tries to process it, resulting in bad things.

@SeanTAllen
Member

SeanTAllen commented Sep 20, 2020

@dipinhora found a bug in the scheduler message queue that would be a likely source of this issue. I need to write it up, but it could (very rarely) result in an actor being in 2 scheduler queues at the same time and getting run in parallel, which would result in bad things.

We are working on a fix.
