
Crash in runtime due to ponyint_actor_run() executing the same actor at same time on two threads #3455

Open
KittyMac opened this issue Feb 1, 2020 · 4 comments

Comments

@KittyMac
Contributor

KittyMac commented Feb 1, 2020

Full details are here:

https://github.com/KittyMac/pony.problems/tree/master/crash_due_to_parallel_actor_execution

Running the example with --ponynoblock enabled is key to getting the crash. But the crash itself isn't the issue; the issue is that ponyint_actor_run() runs for the same pony_actor_t at the same time on two different schedulers. When that happens, one of them reaches the end of ponyint_actor_run() first and deletes the actor while the other is still executing, causing the crash.

To easily see when the same actor is being run in two places at the same time, you can use the patched actor.h and actor.c included in the repo above. They add a simple `bool running` to the actor struct in actor.h, set it to true on entry to ponyint_actor_run(), and set it back to false at each exit point. If ponyint_actor_run() is entered for an actor whose running flag is already true, the patch prints a '.' to the console and returns true (avoiding the parallel run).
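For reference, here is a minimal sketch of what that patch looks like. This is my reconstruction from the description above, not the exact diff; the real ponyint_actor_run() signature and body are abbreviated:

  // actor.h: add a flag to pony_actor_t (other fields omitted)
  //   bool running;

  // actor.c: at the top of ponyint_actor_run()
  if(actor->running)
  {
    // Another scheduler is already inside ponyint_actor_run() for this
    // actor: report it and decline the parallel run.
    printf(".");
    return true;
  }

  actor->running = true;

  // ... normal message processing ...

  // At every exit point, clear the flag before returning.
  actor->running = false;

A plain bool is itself racy, but as a diagnostic it is enough to make the overlapping runs visible.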

I spent a couple of days looking, but I did not find the cause of the issue. My investigation led me to think it's somewhere in the work stealing.

@SeanTAllen
Member

I haven't looked closely at this, but I would think the most likely culprit would be a bug in the backpressure implementation code.

@KittyMac
Contributor Author

KittyMac commented Feb 6, 2020

During experimentation, I discovered that if I add a pthread mutex around ponyint_mpmcq_pop(), the problem goes away.
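For illustration, the experiment was roughly of this shape (diagnostic only, not a proposed fix; the exact ponyint_mpmcq_pop() signature and the name of the original pop body are assumptions here):

  #include <pthread.h>

  static pthread_mutex_t pop_mut = PTHREAD_MUTEX_INITIALIZER;

  // sched/mpmcq.c: serialize the pop path. If the crash disappears with
  // this in place, the race involves concurrent pops from the queue.
  void* ponyint_mpmcq_pop(mpmcq_t* q)
  {
    pthread_mutex_lock(&pop_mut);
    void* item = mpmcq_pop_unlocked(q); // hypothetical name for the original lock-free body
    pthread_mutex_unlock(&pop_mut);
    return item;
  }

Serializing the pop defeats the point of the lock-free MPMC queue, so this only narrows down where the race lives.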

@KittyMac
Contributor Author

KittyMac commented Feb 8, 2020

OK, I understand this a bit more now.

Parts of ponyint_actor_run() do indeed run in parallel, and it looks like that is by design. So the issue is no longer "there's a danger of messages being processed out of order on two different schedulers"; it's back in the realm of "Pony is crashing when --ponynoblock is set and it's under heavy message load".

The synchronization point in Pony actors that prevents multiple schedulers from processing the same actor's message queue at the same time is in the scheduling logic in sendv. When a message is sent to an actor (via pony_sendv from a different thread), and the messageq has been marked as empty, the sending thread reschedules the actor with ponyint_sched_add (so that the new work gets processed in the future). This means that, while we can be in ponyint_actor_run() for the same actor on two different threads, the first thread must have called ponyint_messageq_markempty() before the second one starts.
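To make that handoff concrete, here is a simplified sketch of the sender side (the function names are the runtime's, but the call shape and the push return convention are my reading of it; the receiver side, ponyint_messageq_markempty(), appears in the snippet below):

  // Simplified from pony_sendv(): push the message onto the actor's queue.
  // The push reports whether the queue had been marked empty; only in that
  // case is the actor handed back to a scheduler, so normally exactly one
  // scheduler "owns" an actor at a time.
  if(ponyint_actor_messageq_push(&to->q, msg, msg))
    ponyint_sched_add(ctx, to);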

Knowing this, here is the relevant code, followed by the sequence that causes the crash:

  bool empty = ponyint_messageq_markempty(&actor->q);
  if (empty && actor_noblock && (actor->gc.rc == 0))
  {
    // when `actor_noblock` is true, the cycle detector isn't running.
    // this means actors won't be garbage collected unless we take special
    // action. Here, we know that:
    // - the actor has no messages in its queue
    // - there's no references to this actor
    // therefore if `noblock` is on, we should garbage collect the actor.

    ponyint_actor_setpendingdestroy(actor);
    ponyint_actor_final(ctx, actor);
    ponyint_actor_destroy(actor);

    return false;
  }

  // Return true (i.e. reschedule immediately) if our queue isn't empty.
  actor->running = 0;
  return !empty;

  1. Thread A calls ponyint_messageq_markempty(); empty is set to true.
  2. Thread B calls sendv, sees the queue marked empty, and reschedules the actor.
  3. Thread A then continues execution, dropping into the conditional and destroying the actor.
  4. Thread X gets the rescheduled-yet-destroyed actor and tries to process it, resulting in bad things.

Under this theory, the next missing piece of the puzzle is: why didn't the (actor->gc.rc == 0) check prevent entering the conditional? As per the comment, if no one is referencing this actor, how did someone manage to send it a message in step 2 of the theory?

I don't have a full answer to that yet. I do know that when step 4 happens, actor->gc.rc is 256 at that point. So it's possible that the ordering is more like:

  1. Thread A calls ponyint_messageq_markempty(); empty is set to true. It gets past the (actor->gc.rc == 0) check and into the conditional.
  2. Thread B is executing, gains a reference to the actor, calls sendv, sees the queue marked empty, and reschedules the actor.
  3. Thread A then continues execution, dropping into the conditional and destroying the actor.
  4. Thread X gets the rescheduled-yet-destroyed actor and tries to process it, resulting in bad things.

@SeanTAllen
Member

SeanTAllen commented Sep 20, 2020

@dipinhora found a bug in the scheduler message queue that would be a likely source of this issue. I need to write it up, but it could (very rarely) result in an actor being in 2 scheduler queues at the same time and getting run in parallel, which would result in bad things.

We are working on a fix.
