Occasional segfault soon-after-startup for UDP client #1872

Closed
Cloven opened this issue Apr 27, 2017 · 23 comments · Fixed by #1886
Cloven commented Apr 27, 2017

UDP Server: https://gist.github.com/992d3e599fdf4cae164296d2aabcf5a5
UDP Client: https://gist.github.com/1fd3260168f28212e8867c161e75d01e

OSX 10.11.6, pony version 0.13.1-874c0f8 [release] compiled with: llvm 3.9.1 -- Apple LLVM version 8.0.0 (clang-800.0.42.1) (but this also occurred under pony 0.11.x)

When you start the server (a simple UDP reflector) and then the client (a simple UDP spammer), you occasionally get a segmentation fault. This generally seems to occur 'early' (after only a few thousand packets have been sent); if it doesn't occur early, the process appears to be able to run forever. There appears to be a long pause right before the crash, although that may be due to the core file being written (or it could be a threading deadlock). The output of the client at that time looks like this:

[...]
3854
3855
3856
[1] 40929 segmentation fault (core dumped)

and lldb says:

(lldb) bt
* thread #1: tid = 0x0000, 0x00007fffa17c210a libsystem_kernel.dylib`__semwait_signal + 10, stop reason = signal SIGSTOP
  * frame #0: 0x00007fffa17c210a libsystem_kernel.dylib`__semwait_signal + 10
    frame #1: 0x00007fff8e235787 libsystem_pthread.dylib`pthread_join + 444
    frame #2: 0x0000000100013ddc boltthrower`pony_start + 572
    frame #3: 0x00007fffa1c695ad libdyld.dylib`start + 1

I'm not familiar enough with debugging under lldb/osx to be able to be much help further, but I'll happily help however I can.

Cloven commented Apr 27, 2017

Adding to this: if the printed counter advances beyond around 30k, I haven't seen it crash after that (I have run it to 600k+). However, if you ^C the process and restart it, it fails about 5-10% of the time. One doesn't need to restart or otherwise touch the reflection server once it's running; the bug is entirely triggered by the client.

@SeanTAllen

@Cloven which segfaults, the client or the server?

Cloven commented Apr 27, 2017

Client only.

@SeanTAllen

@Cloven how many CPUs does your machine have?

@SeanTAllen

@Cloven also what type of CPU? Info about the type of Apple computer might be helpful.

@SeanTAllen

@Cloven the pause is most likely OSX giving something else access to the CPU. I get it without a crash from time to time, and it appears to be the client being momentarily paused by OSX (from what I can see).

I'm on 10.11.6
Mid-2015 MacBook Pro
LLVM 3.9.1
Same ponyc version you are using
I have 4 CPUs, so both the server and the client are running with 4 ponythreads.

I was about to hit comment on this, as I had run it about 100 times with no segfault, but I just got one. No pause this time, though, and it was up to 67843 when it happened.

@SeanTAllen

@Cloven can you build the client using --debug and see if you still get the problem?

@SeanTAllen

I was able to reproduce this in the debugger. This is going to look familiar to some folks:

Process 29460 stopped
* thread #2: tid = 0x2d9c10, 0x000000010001232d client`ponyint_gc_release + 157, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x000000010001232d client`ponyint_gc_release + 157
client`ponyint_gc_release:
->  0x10001232d <+157>: subq   %rcx, 0x8(%rax)
    0x100012331 <+161>: cmpq   $0x0, 0x18(%r14)
    0x100012336 <+166>: je     0x1000123ab               ; <+283>
    0x100012338 <+168>: movq   0x20(%r14), %rcx
(lldb) bt
* thread #2: tid = 0x2d9c10, 0x000000010001232d client`ponyint_gc_release + 157, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x000000010001232d client`ponyint_gc_release + 157
    frame #1: 0x000000010000b5a7 client`handle_message + 263
    frame #2: 0x0000000100016450 client`run_thread + 720
    frame #3: 0x00007fff9249099d libsystem_pthread.dylib`_pthread_body + 131
    frame #4: 0x00007fff9249091a libsystem_pthread.dylib`_pthread_start + 168
    frame #5: 0x00007fff9248e351 libsystem_pthread.dylib`thread_start + 13
(lldb)

Cloven commented Apr 27, 2017 via email

SeanTAllen commented Apr 27, 2017

I seem to be able to make this happen far more consistently with 2 threads rather than 4.

SeanTAllen commented Apr 27, 2017

Different segfault this time, and yeah, running with 2 ponythreads seems to be the key to reproducing this easily for me:

Process 32252 stopped
* thread #2: tid = 0x2ddce5, 0x0000000100015709 client`ponyint_gc_release(gc=0x0000000120fffca8, aref=0x0000000120f44e80) + 201 at gc.c:683, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x0000000100015709 client`ponyint_gc_release(gc=0x0000000120fffca8, aref=0x0000000120f44e80) + 201 at gc.c:683
   680 	    void* p = obj->address;
   681 	    object_t* obj_local = ponyint_objectmap_getobject(&gc->local, p, &index);
   682
-> 683 	    pony_assert(obj_local->rc >= obj->rc);
   684 	    obj_local->rc -= obj->rc;
   685 	  }
   686
(lldb) bt
warning: could not load any Objective-C class information. This will significantly reduce the quality of type information available.
* thread #2: tid = 0x2ddce5, 0x0000000100015709 client`ponyint_gc_release(gc=0x0000000120fffca8, aref=0x0000000120f44e80) + 201 at gc.c:683, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x0000000100015709 client`ponyint_gc_release(gc=0x0000000120fffca8, aref=0x0000000120f44e80) + 201 at gc.c:683
    frame #1: 0x000000010000c0ce client`handle_message(ctx=0x0000000108fff9c8, actor=0x0000000120fffc00, msg=0x0000000120fbba80) + 222 at actor.c:68
    frame #2: 0x000000010000be7b client`ponyint_actor_run(ctx=0x0000000108fff9c8, actor=0x0000000120fffc00, batch=100) + 283 at actor.c:158
    frame #3: 0x000000010001e1d8 client`run(sched=0x0000000108fff980) + 168 at scheduler.c:287
    frame #4: 0x000000010001df49 client`run_thread(arg=0x0000000108fff980) + 57 at scheduler.c:338
    frame #5: 0x00007fff9249099d libsystem_pthread.dylib`_pthread_body + 131
    frame #6: 0x00007fff9249091a libsystem_pthread.dylib`_pthread_start + 168
    frame #7: 0x00007fff9248e351 libsystem_pthread.dylib`thread_start + 13
(lldb) p obj->rc
(size_t) $0 = 1
(lldb) p obj_local->rc
error: Couldn't apply expression side effects : Couldn't dematerialize a result variable: couldn't read its memory

I added an assert to verify and obj_local is indeed NULL.

@SeanTAllen

And another, slightly different one:

src/libponyrt/gc/gc.c:163: recv_local_object: Assertion `obj != NULL` failed.

Backtrace:
  0   client                              0x000000010001d421 ponyint_assert_fail + 161
  1   client                              0x00000001000143ad recv_local_object + 125
  2   client                              0x00000001000142f9 ponyint_gc_recvobject + 137
  3   client                              0x0000000100017140 pony_traceunknown + 96
  4   client                              0x0000000100001bde net_UDPSocket_Dispatch + 286
  5   client                              0x000000010000bdfb ponyint_actor_run + 283
  6   client                              0x000000010001e188 run + 168
  7   client                              0x000000010001def9 run_thread + 57
  8   libsystem_pthread.dylib             0x00007fff9249099d _pthread_body + 131
  9   libsystem_pthread.dylib             0x00007fff9249091a _pthread_body + 0
  10  libsystem_pthread.dylib             0x00007fff9248e351 thread_start + 13
Process 32707 stopped
* thread #2: tid = 0x2df049, 0x00007fff9a257f06 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGABRT
    frame #0: 0x00007fff9a257f06 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fff9a257f06 <+10>: jae    0x7fff9a257f10            ; <+20>
    0x7fff9a257f08 <+12>: movq   %rax, %rdi
    0x7fff9a257f0b <+15>: jmp    0x7fff9a2527cd            ; cerror_nocancel
    0x7fff9a257f10 <+20>: retq
(lldb) bt
warning: could not load any Objective-C class information. This will significantly reduce the quality of type information available.
* thread #2: tid = 0x2df049, 0x00007fff9a257f06 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGABRT
  * frame #0: 0x00007fff9a257f06 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff924934ec libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fff889db6df libsystem_c.dylib`abort + 129
    frame #3: 0x000000010001d513 client`ponyint_assert_fail(expr="obj != NULL", file="src/libponyrt/gc/gc.c", line=163, func="recv_local_object") + 403 at ponyassert.c:60
    frame #4: 0x00000001000143ad client`recv_local_object(ctx=0x0000000108fff848, p=0x0000000120e35580, t=0x000000010002c690, mutability=1) + 125 at gc.c:163
    frame #5: 0x00000001000142f9 client`ponyint_gc_recvobject(ctx=0x0000000108fff848, p=0x0000000120e35580, t=0x000000010002c690, mutability=1) + 137 at gc.c:457
    frame #6: 0x0000000100017140 client`pony_traceunknown(ctx=0x0000000108fff848, p=0x0000000120e35580, m=1) + 96 at trace.c:117
    frame #7: 0x0000000100001bde client`net_UDPSocket_Dispatch + 286
    frame #8: 0x000000010000c139 client`handle_message(ctx=0x0000000108fff848, actor=0x0000000118fffc00, msg=0x0000000120efaec0) + 457 at actor.c:103
    frame #9: 0x000000010000bdfb client`ponyint_actor_run(ctx=0x0000000108fff848, actor=0x0000000118fffc00, batch=100) + 283 at actor.c:158
    frame #10: 0x000000010001e188 client`run(sched=0x0000000108fff800) + 168 at scheduler.c:287
    frame #11: 0x000000010001def9 client`run_thread(arg=0x0000000108fff800) + 57 at scheduler.c:338
    frame #12: 0x00007fff9249099d libsystem_pthread.dylib`_pthread_body + 131
    frame #13: 0x00007fff9249091a libsystem_pthread.dylib`_pthread_start + 168
    frame #14: 0x00007fff9248e351 libsystem_pthread.dylib`thread_start + 13

@SeanTAllen

With 2 ponythreads, a debug compiler, and a debug client, I can seemingly reproduce this every time.

SeanTAllen commented Apr 28, 2017

It appears that any time this happens, we've recently gone through this branch in hash.c's search:

    if(p_length >
        (oi_p_length = get_probe_length(map, elem_hash, index, mask))) {
      // our probe length is greater than the elements probe length
      // we would normally have swapped so return this position
      *pos = index;
      *probe_length = p_length;
      *oi_probe_length = oi_p_length;
      return NULL;
    }
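
For reference, here's a minimal sketch of how a Robin Hood probe length is typically derived. This is not necessarily the runtime's actual get_probe_length; it assumes a power-of-two bucket count so that mask == size - 1:

    /* Distance from the element's "natural" bucket (elem_hash & mask) to the
       bucket it currently occupies, wrapping around the end of the array. */
    static size_t probe_length_sketch(size_t elem_hash, size_t index, size_t mask)
    {
      return (index - (elem_hash & mask)) & mask;
    }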

@SeanTAllen

Big question: is there a bug in the hash implementation, are we doing a double free, or are we deleting from the object map early?

@SeanTAllen

There's a bug in hash.c.

SeanTAllen commented Apr 28, 2017

So here is what I am regularly seeing, and where I need @dipinhora's assistance:

The map size is always 64.
The index we start from is always near the map size (60, 61, etc.).
The item we want is in the map. It always seems to be 2 away from the index we start at.

For example: the index is 61, and the value we want (for example, 8726618367696014013) is at bucket 63.

get_probe_length(map, elem_hash, index, mask) returns a number lower than p_length before we reach the bucket holding the value (in the case above, after the lookup at index 62), so we never find the item we are looking for.

The elem_hash at index 62 that triggers our issue is 10452750943628629886.
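
To illustrate why the search bails out with those numbers, here's a worked check using the standard Robin Hood distance formula sketched above (the exact helper in the runtime may differ, and I'm assuming the search for our key starts at its natural bucket, 61):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
      uint64_t mask = 63;                              /* map size 64 */
      uint64_t start = 61;                             /* natural index of the key we want */
      uint64_t elem_hash_62 = 10452750943628629886ULL; /* occupant of bucket 62 */

      /* Our key's probe length when the search reaches index 62. */
      uint64_t our_probe_at_62 = (62 - start) & mask;                      /* = 1 */

      /* The occupant of bucket 62 hashes to 62 itself, so its probe length is 0. */
      uint64_t occupant_probe_at_62 = (62 - (elem_hash_62 & mask)) & mask; /* = 0 */

      /* 1 > 0, so the branch above returns NULL at index 62 and the search
         never reaches bucket 63, where the value we want actually lives. */
      printf("ours=%llu occupant=%llu\n",
             (unsigned long long)our_probe_at_62,
             (unsigned long long)occupant_probe_at_62);
      return 0;
    }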

@SeanTAllen

A couple of updates. The map size has always been 64 whenever I see this.

The item is always in the hash, but it sits just after an element whose probe length is lower than ours at that position, which should not be possible. That is:

the item is at index 63,

but there's an item at index 62,

and when you compute the probe lengths, our item, checked at index 62, has a higher probe length than the item actually stored at 62.
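
For anyone trying to catch this state earlier, here is a hypothetical debug-only scan (not runtime code) for the invariant being violated: within a run of consecutive occupied buckets, an element's probe length can exceed its predecessor's by at most one. The state above (probe length 0 at bucket 62 followed by probe length 2 at bucket 63) fails that check.

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* hashes[i] is the hash of the element in bucket i; occupied[i] says
       whether bucket i holds a live element. Assumes size is a power of two. */
    static void check_probe_invariant(const uint64_t* hashes,
                                      const bool* occupied, size_t size)
    {
      size_t mask = size - 1;

      for(size_t i = 0; i + 1 < size; i++)
      {
        if(!occupied[i] || !occupied[i + 1])
          continue;

        size_t d_here = (i - (size_t)(hashes[i] & mask)) & mask;
        size_t d_next = ((i + 1) - (size_t)(hashes[i + 1] & mask)) & mask;

        /* If a later element's probe length jumps by more than one, a lookup
           for it can terminate early and report it missing. */
        assert(d_next <= d_here + 1);
      }
    }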

@SeanTAllen

@dipinhora thinks it's a memory clobbering issue, as we can't yet reproduce with 1 thread, and if you get rid of the actor that does the printing, the problem also goes away. It's odd because the object map that hits the error looks fine except for being out of order.

Very odd.

@SeanTAllen

I picked this up again this morning while talking to Sylvan. Same binaries as the other night, and I can't get a consistent crash anymore.

@SeanTAllen

AND I just determined that if Zoom isn't running, it happens.

@SeanTAllen

With some teamwork amongst @sylvanc, @dipinhora and myself, we appear to have found the source of the problem. Working on a fix.

SeanTAllen added a commit that referenced this issue Apr 29, 2017
I'm going to update this commit if it holds true.
I'd like some additional testing.

@dipinhora @sylvanc can you test #1872 for this?
@Cloven can you see if this fixes the issue for you?

@agarman @Theodus I don't think this will fix #1483, but
can you check and see, as there is a chance it might.
This was referenced Apr 29, 2017
SeanTAllen added a commit that referenced this issue May 4, 2017
If you were being facetious, you could describe the Pony runtime as
a series of hashmaps that are held together by some code. Hash
performance and correctness can have a great impact on everything
else in the runtime because they are at the basis of most everything
else in the runtime.

This change fixes a number of issues that appeared to be garbage
collection bugs but were, in fact, invariant violations in the
underlying hash implementation. It should be noted that while the
rest of this comment discusses invariant violations that exist in
our Robin Hood hash implementation, some of the bugs that this closes
predate the Robin Hood implementation. This leads me to believe that
the previous implementation had some subtle problem that could occur
under some rare interleaving of operations. How this occurred is
unknown at this time and probably always will be, unless someone
wants to go back to the previous version and use what we learned
here to diagnose the state of the code at that time.

This patch closes issues #1781, #1872, and #1483. It's the result of
teamwork amongst myself, Sylvan Clebsch, and Dipin Hora. History
should show that we were all involved in this resolution.

The skinny:

When garbage collecting items from our hash, that is, removing
deleted items to free up space, we can end up violating hash
invariants. Previously, one of these violations was correctly fixed;
however, the fix incorrectly assumed that another invariant held,
which is not the case.

Post garbage collection, if any items have been deleted from our
hash, we do an "optimize" operation on each hash item. We check to
see if the location the item would hash to is now an empty bucket.
If it is, we move the item to that location, thereby restoring the
expected chaining. There is, however, a problem with doing this.
It's possible, over time, to violate another invariant when fixing
the first violation.

Each item at a given location in the hash has a probe value. An
invariant of our data structure is that items at earlier locations
in the hash will always have an equal or lower probe value for that
location than items that come later.

For example, take two items, "foo" and "bar", and a hashmap whose
size is 8, where "foo" would map to index 1 and "bar" would map to
index 2. When looking at the probe values for "foo" and "bar" at
index 1, "foo" would have a probe value of 0, as it is at the
location it hashes to, whereas "bar" would have a probe value of 7.
The value is the number of indexes away from our "natural" hash
index that the item is.

When searching the hash, we can use this probe value to avoid doing
a linear search of all indexes for a given key. Once we find an item
whose probe value for a given index is higher than ours, we know
that the key can't be in the map past that index.

Except, of course, when we are restoring invariants after a delete.
It's possible, due to the sequential nature of our "optimize" repair
step, to violate this "always lower probe value" invariant.

The previous implementation of "optimize_item" assumed that this
invariant held true. By not detecting the invariant violation and
fixing it, we could end up with maps where a key existed but
wouldn't be found. When the map in question was an object map used
to hold gc'able items, this would result in an error that appears
to be a GC error. See #1781, #1872, and #1483.

It should be noted that, because of the complex chain of events that
needs to occur to trigger this problem, we were unable to devise a
unit test to catch it. If we had property-based testing for the Pony
runtime, this most likely would have been caught. Hopefully, PR
#1840 to add rapidcheck into Pony happens soon.

Closes #1781
Closes #1872
Closes #1483
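
As a quick illustration of the probe values in the "foo"/"bar" example from the commit message above (a hypothetical calculation using the wrapped-distance formula; the runtime's actual helper may compute this differently):

    #include <stdio.h>

    /* Size-8 map, so mask = 7. Probe value at an index is the wrapped
       distance from the item's natural bucket to that index. */
    int main(void)
    {
      unsigned mask = 7;
      unsigned foo_probe = (1u - 1u) & mask; /* "foo": natural index 1, checked at index 1 -> 0 */
      unsigned bar_probe = (1u - 2u) & mask; /* "bar": natural index 2, checked at index 1 -> 7 */
      printf("foo: %u, bar: %u\n", foo_probe, bar_probe);
      return 0;
    }
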
SeanTAllen added a commit that referenced this issue May 4, 2017
SeanTAllen added a commit that referenced this issue May 5, 2017
@SeanTAllen

The fix for this has been released as part of 0.14.0.
