Pony runtime goes out of memory under high CPU load / concurrency #517
Comments
This looks suspiciously like issue #494. Running on Windows I don't see the segfault, but I do terminate (which I shouldn't) after between a few and a hundred.
To be sure the out-of-memory behavior still occurs, I tried to reproduce this issue again with version . I have also tried to reproduce the issue on several virtual machines (all Ubuntu servers with various kernels, cloud hosted) and was able to reproduce it when the virtual machine had more than one core. I did reuse the binary from my own machine in these tests. Furthermore, #ponylang user ponysaurus tried to reproduce the issue on Debian and OS X but did not observe the same behavior. Unfortunately I am not aware of the number of cores that were available during his test runs. If I find some time I'll try to build the reproduction case on one of the VMs where I was able to reproduce the issue, using a locally compiled binary, and see whether the same behavior still occurs.
I can reproduce this problem on build 0.2.1-700-g827a243. With the default number of threads (5), the process slowly starts to leak and continues leaking until all the free memory is used, and then starts to swap. With --ponythreads 8, as mentioned above by @hakvroot, the leak is greatly accelerated.
@sparseinference how many physical CPUs do you have in your machine, 4 or 2?
@SeanTAllen I have 1 physical CPU: Intel(R) Core(TM)2 Quad CPU Q8200 @ 2.33GHz, according to /proc/cpuinfo
I tested this just now; no matter what I do, it eventually segfaults.
I tried this after upgrading from LLVM 3.8 to LLVM 3.8.1 and get the same results.
@SeanTAllen any chance the segfault was cleared up by #1321?
Nope. Still segfaults.
When I compile this with a debug build of the compiler, I get a more interesting and familiar-looking result:
@SeanTAllen Could you get a stack trace for that?
Interesting...
@Praetonus - is it possible that our atomics aren't actually atomic on @SeanTAllen's platform?
OS X El Capitan 10.11.6. It's a fairly standard MacBook Pro, so the atomics not actually being atomic would be concerning.
No, that's not possible. Even if the code generation for the C atomics was bugged, which is unlikely, memory operations on x86 are atomic with acquire/release semantics by default.
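For readers following along, here is a minimal C11 sketch of the acquire/release pattern being discussed. It is illustrative only, not code from the runtime; on x86 the release store and acquire load below compile to ordinary MOV instructions, which is why the hardware ordering alone already gives these guarantees:

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload = 0;
static _Atomic bool ready = false;

void producer(void)
{
  payload = 42;                                              // plain store
  atomic_store_explicit(&ready, true, memory_order_release); // publishes payload
}

bool consumer(int* out)
{
  // The acquire load synchronizes with the release store above, so once
  // `ready` is observed as true, `payload` is guaranteed to read 42.
  if(atomic_load_explicit(&ready, memory_order_acquire))
  {
    *out = payload;
    return true;
  }

  return false;
}
```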
@Praetonus I'll keep looking. Re: the patch, do you want to PR that or shall I? I have it in place already and can easily commit and open. If I do it, though, I could use a little help with what to put in the CHANGELOG beyond "fix segfault bug introduced by 83bc2aa".
I ran the test with Interestingly, at My computer has an i7-5930K CPU @ 3.50GHz - 6 hyperthreading cores.
Thanks @sparseinference.
@SeanTAllen The code must be adapted to the pool allocator, so I'll do it if you're not sure.
@Praetonus I'm not sure how it would be adapted, so have at it.
This and #647 are the same problem. In general what happens is: with a higher number of schedulers, when a scheduler goes to steal work or otherwise get work to do, it finds none available. This causes it to send a block message to the cycle detector (and the message causes a memory allocation). Schedulers do this every time they try to run, which results in a ton of block messages being sent. Combine this with our intentionally not running the cycle detector often, and you get a memory explosion.

The short-term fix for this is to run with --ponynoblock. Running with --ponynoblock means your application isn't going to exit. Ever. But it gets you around this problem. (Or you can run with fewer ponythreads.)

I had an idea for how to handle cycle detection that wouldn't cause this problem. After talking to @sylvanc, it turns out he had already been entertaining a similar and more fully fleshed out idea. We need to prove it will work, but the general idea is: the cycle detector observes the state of the world and, if it finds what it believes to be a cycle, it can send a message to the members of that cycle saying "this is my view of the world, I believe you are a cycle". Those actors can contradict the cycle detector and let it know its view is incorrect, or they can agree that they are in a cycle and exit. By following a system of this general sort, no messages are sent to the cycle detector without it first sending a message. There are a lot of details that would need to be fleshed out, and @sylvanc's idea is further along than mine, but this explanation should give you a vague handwavy idea of the long-term fix.
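As a concrete illustration of those workarounds, using the reproduction binary from the original report and the flags mentioned above:

./test --ponythreads 16 --ponynoblock
./test --ponythreads 2

The first disables the block messages entirely, at the cost of the program never exiting on its own; the second simply runs with fewer scheduler threads.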
I've created a backpressure branch that addresses this by applying the in-progress backpressure system to the cycle detector:
It appears that the cycle detector is a bit of a red herring; it's not the real issue. This is the same bug as #647 and #2317. The problem is with the memory allocator. Block messages, if changed to be sent at the smaller pony_msgp_t size, result in no memory growth, I believe because unblock messages are also the same size and the allocator is more likely to find the space required. In #2317, @slfritchie points to the issue. I did some testing where I changed the block message allocation to:
void ponyint_cycle_block(pony_ctx_t* ctx, pony_actor_t* actor, gc_t* gc)
{
  pony_assert(ctx->current == actor);
  pony_assert(&actor->gc == gc);

  // Experimental change described above: allocate the block message at the
  // smaller pony_msgp_t size class instead of the full block message size.
  pony_msgp_t* m = (pony_msgp_t*)pony_alloc_msg(
    POOL_INDEX(sizeof(pony_msgp_t)), ACTORMSG_BLOCK);

  pony_sendv(ctx, cycle_detector, &m->msg, &m->msg, false);
}

This results in what is usually very stable memory usage after a period of time, with rare occasional jumps. I'm fairly sure the allocator is what needs to be addressed; I can see that the cycle detector is not getting overwhelmed with messages. I believe the part the backpressure change seemed to play in addressing this was merely changing the pattern of block message allocations, thereby changing the memory usage pattern.
I found an interesting possible "fix" for this. The real fix is that we need a coalescing allocator. However, runaway memory can be contained by making all block messages allocate the same amount of memory. This results in less fragmentation and less overall memory being allocated than when messages are of different sizes. To do this, we would make the block, unblock and ack messages all the same size. It "wastes" memory on unblock and ack messages but tamps down on the runaway memory growth edge case. Even if it's not a perfect solution, I think we should make it part of a larger solution, given how often these messages are paired together (block/unblock).
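A minimal sketch of that idea, assuming the runtime has an ACTORMSG_UNBLOCK message ID alongside the ACTORMSG_BLOCK seen in the snippet above; the block_msg_t layout and the helper function names here are hypothetical, not the actual patch:

```c
// Hypothetical sketch: allocate every cycle-detector protocol message from
// one pool size class (the size of the largest message), so a freed slot
// from one kind of message can be reused for any of the others.
typedef struct block_msg_t
{
  pony_msg_t msg;
  pony_actor_t* actor;
  size_t rc; // hypothetical extra payload that makes block the largest message
} block_msg_t;

#define CYCLE_MSG_INDEX POOL_INDEX(sizeof(block_msg_t))

static void send_block(pony_ctx_t* ctx, pony_actor_t* actor)
{
  block_msg_t* m = (block_msg_t*)pony_alloc_msg(CYCLE_MSG_INDEX, ACTORMSG_BLOCK);
  m->actor = actor;
  m->rc = 0; // placeholder; the real message would carry the actor's reference count
  pony_sendv(ctx, cycle_detector, &m->msg, &m->msg, false);
}

static void send_unblock(pony_ctx_t* ctx, pony_actor_t* actor)
{
  // Same size class as block even though only a pointer is carried; the
  // "wasted" bytes buy allocator slot reuse and less fragmentation.
  // An ack message would be allocated the same way with its own message ID.
  pony_msgp_t* m = (pony_msgp_t*)pony_alloc_msg(CYCLE_MSG_INDEX, ACTORMSG_UNBLOCK);
  m->p = actor;
  pony_sendv(ctx, cycle_detector, &m->msg, &m->msg, false);
}
```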
Prior to this commit, we sent actor block and unblock messages each time we entered and left `steal`. Every instance of work stealing resulted in a block/unblock message pair being sent, even if stealing was immediately successful. This was wasteful in a number of ways:

1. extra memory allocations
2. extra message sends
3. extra handling and processing of pointless block/unblock messages

This commit changes the block/unblock message sending logic. Hat tip to Scott Fritchie for pointing out to me how bad the issue was. He spent some time with DTrace and came up with some truly terrifying numbers for how much extra work was being done. Dipin Hora and I independently came up with what was effectively the same solution for this problem. This commit melds the best of his implementation with the best of mine.

With this commit applied, work stealing will only result in a block/unblock message pair being sent if:

1) the scheduler in question has attempted to steal from every other scheduler (new behavior)
2) the scheduler in question has tried to steal for at least 10 billion clock cycles (about 5 seconds on most machines) (new behavior)
3) the scheduler in question has no unscheduled actors in its mutemap (existing behavior)

Item 2 is the biggest change. What we are doing is increasing program shutdown time by at least 5 seconds (perhaps slightly more due to cross-scheduler timing issues) in return for much better application performance while running.

Issue #2317 is mostly fixed by this commit (although there is still a small amount of memory growth due to another issue). Issue #517 is changed by this commit: it has memory growth that is much slower than before but still quite noticeable. On my machine #517 will no longer OOM, as it eventually gets to around 8 gigs of memory usage and is able to keep up with freeing memory ahead of new memory allocations. Given that there is still an underlying problem with memory allocation patterns (the same as #2317), I think it's possible that the example program in #517 would still OOM on some test machines.

Fixes #647
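A rough sketch of the gating the commit message describes; the function and its parameters are illustrative assumptions rather than the actual scheduler code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Roughly 5 seconds of clock cycles on most machines, per the commit message.
#define STEAL_BLOCK_THRESHOLD 10000000000ULL

// Only send a block (and later unblock) message to the cycle detector when
// all three conditions listed above hold.
static bool should_send_block(uint32_t schedulers_tried, uint32_t scheduler_count,
  uint64_t cycles_spent_stealing, size_t unscheduled_in_mutemap)
{
  return (schedulers_tried >= scheduler_count) &&        // 1) tried every other scheduler
    (cycles_spent_stealing >= STEAL_BLOCK_THRESHOLD) &&  // 2) stuck stealing for ~10 billion cycles
    (unscheduled_in_mutemap == 0);                       // 3) nothing unscheduled in the mutemap
}
```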
PR #2355 helps address this issue, and it is unrelated to the aforementioned allocator twiddling.
The memory situation has improved quite a lot with 0.21.0, but it's still in need of improvement. Before, it was a very rapid OOM; now memory grows but it takes quite a bit longer to get there. However, #2386 solves the problem. Memory usage with --ponythreads=16 is a stable 2976 because the number of threads actually in use is really low (1), so all the block/unblock etc. messages are vastly reduced. I still think we should be looking at the "block, unblock and ack" messages and how they interact with the allocator, but... I would say that once #2386 is merged, we can close this.
Closed by #2386 |
So I ran into a somewhat strange issue where a Pony program with a fixed set of actors can eventually run out of memory. The program in question is https://gist.github.com/hakvroot/fbf734017260d99f5a61.
I expected to be able to run this program indefinitely, and am in fact able to do so with a low number of pony threads (1-3). However, when I fire up the program with at least 4 pony threads and manage to get my CPU load over 8 (#cores * 2 for HT), I observe that the program keeps on allocating memory without releasing it.
So for example

./test --ponythreads 16

is a sure way to go OOM, but 3 times

./test --ponythreads 4

will do the trick for each instance just as well. Once the program starts to allocate more memory it will also continue to do so until it is killed by the OS, even if the CPU load falls below (in my case) 8.

Tested with ponyc 0.2.1-478-gebb7b36, built with config=release. I am running Linux Mint 17.1 with LLVM 3.6.2. If any more details are required, please let me know!
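For anyone trying to reproduce this, the two scenarios above look like the following on the command line; the binary name test is taken from the commands in the report, and the three --ponythreads 4 instances are assumed to run concurrently (for example backgrounded from one shell) so the CPU load gets pushed past the threshold:

./test --ponythreads 16

./test --ponythreads 4 &
./test --ponythreads 4 &
./test --ponythreads 4 &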