Pony runtime goes out of memory under high CPU load / concurrency #517
Comments
This looks suspiciously like issue #494. Running on Windows I don't see the segfault, but I do terminate (which I shouldn't) after between a few and a hundred.
To be sure the out-of-memory behavior still occurs, I tried to reproduce this issue again with version . I have also tried to reproduce the issue on several virtual machines (all Ubuntu servers with various kernels, cloud hosted) and was able to reproduce it when the virtual machine had more than one core. I did reuse the binary from my own machine in these tests. Furthermore, #ponylang user ponysaurus tried to reproduce the issue on Debian and OS X but did not observe the same behavior. Unfortunately I am not aware of the number of cores that were available during his test runs. If I find some time I'll try to build the reproduction case on one of the VMs where I was able to reproduce the issue, using a locally compiled binary, and see whether the same behavior still occurs.
I can reproduce this problem on build 0.2.1-700-g827a243. With the default number of threads (5), the process slowly starts to leak and continues leaking until all the free memory is used, and then starts to swap. With --ponythreads 8, as mentioned above by @hakvroot, the leak is greatly accelerated.
@sparseinference how many physical CPUs do you have in your machine, 4 or 2?
@SeanTAllen I have 1 physical CPU: Intel(R) Core(TM)2 Quad CPU Q8200 @ 2.33GHz, according to /proc/cpuinfo
I tested this just now; no matter what I do, it eventually segfaults.
I tried this after upgrading from LLVM 3.8 to LLVM 3.8.1 and get the same results.
@SeanTAllen any chance the segfault was cleared up by #1321?
Nope. Still segfaults.
When I compile this with a debug build of the compiler, I get a more interesting and familiar-looking result:
@SeanTAllen Could you get a stack trace for that?
Interesting...
@Praetonus - is it possible that our atomics aren't actually atomic on @SeanTAllen's platform?
OS X El Capitan 10.11.6. It's a fairly standard MacBook Pro, so the atomics not actually being atomic would be concerning.
No, that's not possible. Even if the code generation for the C atomics was bugged, which is unlikely, memory operations on x86 are atomic with acquire/release semantics by default.
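For readers following along, here is a minimal C11 sketch of the acquire/release pattern being discussed. It is illustrative only, not code from the runtime; on x86 the release store and acquire load below compile to ordinary MOV instructions, which is why the hardware ordering alone already gives these guarantees:

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload = 0;
static _Atomic bool ready = false;

void producer(void)
{
  payload = 42;                                              // plain store
  atomic_store_explicit(&ready, true, memory_order_release); // publishes payload
}

bool consumer(int* out)
{
  // The acquire load synchronizes with the release store above, so once
  // `ready` is observed as true, `payload` is guaranteed to read 42.
  if(atomic_load_explicit(&ready, memory_order_acquire))
  {
    *out = payload;
    return true;
  }

  return false;
}
```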
@Praetonus I'll keep looking. Re: the patch, do you want to PR that or shall I? I have it in place already and can easily commit and open. If I do it, though, I could use a little help with what to put in the CHANGELOG beyond "fix segfault bug introduced by 83bc2aa".
I ran the test with Interestingly, at My computer has an i7-5930K CPU @ 3.50GHz - 6 hyperthreading cores.
Thanks @sparseinference.
@SeanTAllen The code must be adapted to the pool allocator, so I'll do it if you're not sure.
@Praetonus I'm not sure how it would be adapted, so have at it.
This and #647 are the same problem. In general what happens is: with a higher number of schedulers, when a scheduler goes to steal work or otherwise get work to do, it finds none available. This causes it to send a block message to the cycle detector (and the message causes a memory allocation). Schedulers do this every time they try to run, which results in a ton of block messages being sent. Combine this with our intentionally not running the cycle detector often, and you get a memory explosion.

The short-term fix for this is to run with --ponynoblock. Running with --ponynoblock means your application isn't going to exit. Ever. But it gets you around this problem. (Or you can run with fewer ponythreads.)

I had an idea for how to handle cycle detection that wouldn't cause this problem. After talking to @sylvanc, it turns out he had already been entertaining a similar and more fully fleshed out idea. We need to prove it will work, but the general idea is: the cycle detector observes the state of the world and, if it finds what it believes to be a cycle, it can send a message to the members of that cycle saying "this is my view of the world, I believe you are a cycle". Those actors can contradict the cycle detector and let it know its view is incorrect, or they can agree that they are in a cycle and exit. By following a system of this general sort, no messages are sent to the cycle detector without it first sending a message. There are a lot of details that would need to be fleshed out, and @sylvanc's idea is further along than mine, but this explanation should give you a vague handwavy idea of the long-term fix.
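As a concrete illustration of those workarounds, using the reproduction binary from the original report and the flags mentioned above:

./test --ponythreads 16 --ponynoblock
./test --ponythreads 2

The first disables the block messages entirely, at the cost of the program never exiting on its own; the second simply runs with fewer scheduler threads.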
I've created a backpressure branch that addresses this by applying the in-progress backpressure system to the cycle detector:
It appears that the cycle detector is a bit of a red herring; it's not the real issue. This is the same bug as #647 and #2317. The problem is with the memory allocator. Block messages, if changed to be sent at the smaller pony_msgp_t size, result in no memory growth, I believe because unblock messages are also the same size and the allocator is more likely to find the space required. In #2317, @slfritchie points to the issue. I did some testing where I changed the block message allocation to:
void ponyint_cycle_block(pony_ctx_t* ctx, pony_actor_t* actor, gc_t* gc)
{
  pony_assert(ctx->current == actor);
  pony_assert(&actor->gc == gc);

  // Experimental change described above: allocate the block message at the
  // smaller pony_msgp_t size class instead of the full block message size.
  pony_msgp_t* m = (pony_msgp_t*)pony_alloc_msg(
    POOL_INDEX(sizeof(pony_msgp_t)), ACTORMSG_BLOCK);

  pony_sendv(ctx, cycle_detector, &m->msg, &m->msg, false);
}

This results in what is usually very stable memory usage after a period of time, with rare occasional jumps. I'm fairly sure the allocator is what needs to be addressed; I can see that the cycle detector is not getting overwhelmed with messages. I believe the part the backpressure change seemed to play in addressing this was merely changing the pattern of block message allocations, thereby changing the memory usage pattern.
I found an interesting possible "fix" for this. The real fix is that we need a coalescing allocator. However, runaway memory can be contained by making all block messages allocate the same amount of memory. This results in less fragmentation and less overall memory being allocated than when messages are of different sizes. To do this, we would make the block, unblock and ack messages all the same size. It "wastes" memory on unblock and ack messages but tamps down on the runaway memory growth edge case. Even if it's not a perfect solution, I think we should make it part of a larger solution, given how often these messages are paired together (block/unblock).
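A minimal sketch of that idea, assuming the runtime has an ACTORMSG_UNBLOCK message ID alongside the ACTORMSG_BLOCK seen in the snippet above; the block_msg_t layout and the helper function names here are hypothetical, not the actual patch:

```c
// Hypothetical sketch: allocate every cycle-detector protocol message from
// one pool size class (the size of the largest message), so a freed slot
// from one kind of message can be reused for any of the others.
typedef struct block_msg_t
{
  pony_msg_t msg;
  pony_actor_t* actor;
  size_t rc; // hypothetical extra payload that makes block the largest message
} block_msg_t;

#define CYCLE_MSG_INDEX POOL_INDEX(sizeof(block_msg_t))

static void send_block(pony_ctx_t* ctx, pony_actor_t* actor)
{
  block_msg_t* m = (block_msg_t*)pony_alloc_msg(CYCLE_MSG_INDEX, ACTORMSG_BLOCK);
  m->actor = actor;
  m->rc = 0; // placeholder; the real message would carry the actor's reference count
  pony_sendv(ctx, cycle_detector, &m->msg, &m->msg, false);
}

static void send_unblock(pony_ctx_t* ctx, pony_actor_t* actor)
{
  // Same size class as block even though only a pointer is carried; the
  // "wasted" bytes buy allocator slot reuse and less fragmentation.
  // An ack message would be allocated the same way with its own message ID.
  pony_msgp_t* m = (pony_msgp_t*)pony_alloc_msg(CYCLE_MSG_INDEX, ACTORMSG_UNBLOCK);
  m->p = actor;
  pony_sendv(ctx, cycle_detector, &m->msg, &m->msg, false);
}
```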
Prior to this commit, we sent actor block and unblock messages each time we entered and left `steal`. Every instance of work stealing resulted in a block/unblock message pair being sent, even if stealing was immediately successful. This was wasteful in a number of ways:

1. extra memory allocations
2. extra message sends
3. extra handling and processing of pointless block/unblock messages

This commit changes the block/unblock message sending logic. Hat tip to Scott Fritchie for pointing out to me how bad the issue was. He spent some time with DTrace and came up with some truly terrifying numbers for how much extra work was being done. Dipin Hora and I independently came up with what was effectively the same solution for this problem. This commit melds the best of his implementation with the best of mine.

With this commit applied, work stealing will only result in a block/unblock message pair being sent if:

1) the scheduler in question has attempted to steal from every other scheduler (new behavior)
2) the scheduler in question has tried to steal for at least 10 billion clock cycles (about 5 seconds on most machines) (new behavior)
3) the scheduler in question has no unscheduled actors in its mutemap (existing behavior)

Item 2 is the biggest change. What we are doing is increasing program shutdown time by at least 5 seconds (perhaps slightly more due to cross-scheduler timing issues) in return for much better application performance while running.

Issue #2317 is mostly fixed by this commit (although there is still a small amount of memory growth due to another issue). Issue #517 is changed by this commit: it has memory growth that is much slower than before but still quite noticeable. On my machine #517 will no longer OOM, as it eventually gets to around 8 gigs of memory usage and is able to keep up with freeing memory ahead of new memory allocations. Given that there is still an underlying problem with memory allocation patterns (the same as #2317), I think it's possible that the example program in #517 would still OOM on some test machines.

Fixes #647
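A rough sketch of the gating the commit message describes; the function and its parameters are illustrative assumptions rather than the actual scheduler code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Roughly 5 seconds of clock cycles on most machines, per the commit message.
#define STEAL_BLOCK_THRESHOLD 10000000000ULL

// Only send a block (and later unblock) message to the cycle detector when
// all three conditions listed above hold.
static bool should_send_block(uint32_t schedulers_tried, uint32_t scheduler_count,
  uint64_t cycles_spent_stealing, size_t unscheduled_in_mutemap)
{
  return (schedulers_tried >= scheduler_count) &&        // 1) tried every other scheduler
    (cycles_spent_stealing >= STEAL_BLOCK_THRESHOLD) &&  // 2) stuck stealing for ~10 billion cycles
    (unscheduled_in_mutemap == 0);                       // 3) nothing unscheduled in the mutemap
}
```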
PR #2355 helps address this issue, and it is unrelated to the aforementioned allocator twiddling.
The memory situation has improved quite a lot with 0.21.0, but it's still in need of improvement. Before, it was a very rapid OOM; now memory grows but it takes quite a bit longer to get there. However, #2386 solves the problem. Memory usage with --ponythreads=16 is a stable 2976 because the number of threads actually in use is really low (1), so all the block/unblock etc. messages are vastly reduced. I still think we should be looking at the "block, unblock and ack" messages and how they interact with the allocator, but... I would say that once #2386 is merged, we can close this.
Closed by #2386 |
So I ran into a somewhat strange issue where a Pony program with a fixed set of actors can eventually run out of memory. The program in question is https://gist.github.com/hakvroot/fbf734017260d99f5a61.
I expected to be able to run this program indefinitely, and am in fact able to do so with a low number of pony threads (1-3). However, when I fire up the program with at least 4 pony threads and manage to get my CPU load over 8 (#cores * 2 for HT), I observe that the program keeps on allocating memory without releasing it.
So for example

./test --ponythreads 16

is a sure way to go OOM, but 3 times

./test --ponythreads 4

will do the trick for each instance just as well. Once the program starts to allocate more memory it will also continue to do so until it is killed by the OS, even if the CPU load falls below (in my case) 8.

Tested with ponyc 0.2.1-478-gebb7b36, built with config=release. I am running Linux Mint 17.1 with LLVM 3.6.2. If any more details are required, please let me know!
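For anyone trying to reproduce this, the two scenarios above look like the following on the command line; the binary name test is taken from the commands in the report, and the three --ponythreads 4 instances are assumed to run concurrently (for example backgrounded from one shell) so the CPU load gets pushed past the threshold:

./test --ponythreads 16

./test --ponythreads 4 &
./test --ponythreads 4 &
./test --ponythreads 4 &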