Occasional segfault soon-after-startup for UDP client #1872

Closed
Cloven opened this issue Apr 27, 2017 · 23 comments · Fixed by #1886
Cloven commented Apr 27, 2017

UDP Server: https://gist.github.com/992d3e599fdf4cae164296d2aabcf5a5
UDP Client: https://gist.github.com/1fd3260168f28212e8867c161e75d01e

OSX 10.11.6, pony version 0.13.1-874c0f8 [release] compiled with: llvm 3.9.1 -- Apple LLVM version 8.0.0 (clang-800.0.42.1) (but this also occurred under pony 0.11.x)

When you start the server (a simple UDP reflector) and then the client (a simple UDP spammer), you occasionally get a segmentation fault. This generally seems to occur 'early' (after only a few thousand packets have been sent); if it doesn't occur early, the process appears to be able to run forever. There appears to be a long pause right before the crash, although that may be due to the core file being written (or it could be a threading deadlock). The output of the client at that time looks like this:

[...]
3854
3855
3856
[1] 40929 segmentation fault (core dumped)

and lldb says:

(lldb) bt
* thread #1: tid = 0x0000, 0x00007fffa17c210a libsystem_kernel.dylib`__semwait_signal + 10, stop reason = signal SIGSTOP
  * frame #0: 0x00007fffa17c210a libsystem_kernel.dylib`__semwait_signal + 10
    frame #1: 0x00007fff8e235787 libsystem_pthread.dylib`pthread_join + 444
    frame #2: 0x0000000100013ddc boltthrower`pony_start + 572
    frame #3: 0x00007fffa1c695ad libdyld.dylib`start + 1

I'm not familiar enough with debugging under lldb/osx to be able to be much help further, but I'll happily help however I can.

Cloven commented Apr 27, 2017

Adding to this: if the printed counter advances beyond around 30k, I haven't seen it crash after that (I have run it to 600k+). However, if you ^C the process and restart it, it fails about 5-10% of the time. One doesn't need to restart or otherwise touch the reflection server once it's running; the bug is entirely triggered by the client.

@SeanTAllen

@Cloven which segfaults, the client or the server?

Cloven commented Apr 27, 2017

Client only.

@SeanTAllen

@Cloven how many CPUs does your machine have?

@SeanTAllen

@Cloven also what type of CPU? Info about the type of Apple computer might be helpful.

@SeanTAllen

@Cloven the pause is most likely OSX giving something else access to the CPU. I get it without a crash from time to time, and it appears to be the client being momentarily paused by OSX (from what I can see).

I'm on 10.11.6
Mid-2015 MacBook Pro
LLVM 3.9.1
Same ponyc version you are using
I have 4 CPUs, so both the server and the client are running with 4 ponythreads.

I was about to hit comment on this, as I had run it about 100 times with no segfault, but I just got one. No pause this time, though, and it was up to 67843 when it happened.

@SeanTAllen

@Cloven can you build the client using --debug and see if you still get the problem?

@SeanTAllen

I was able to reproduce this in the debugger. This is going to look familiar to some folks:

Process 29460 stopped
* thread #2: tid = 0x2d9c10, 0x000000010001232d client`ponyint_gc_release + 157, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x000000010001232d client`ponyint_gc_release + 157
client`ponyint_gc_release:
->  0x10001232d <+157>: subq   %rcx, 0x8(%rax)
    0x100012331 <+161>: cmpq   $0x0, 0x18(%r14)
    0x100012336 <+166>: je     0x1000123ab               ; <+283>
    0x100012338 <+168>: movq   0x20(%r14), %rcx
(lldb) bt
* thread #2: tid = 0x2d9c10, 0x000000010001232d client`ponyint_gc_release + 157, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x000000010001232d client`ponyint_gc_release + 157
    frame #1: 0x000000010000b5a7 client`handle_message + 263
    frame #2: 0x0000000100016450 client`run_thread + 720
    frame #3: 0x00007fff9249099d libsystem_pthread.dylib`_pthread_body + 131
    frame #4: 0x00007fff9249091a libsystem_pthread.dylib`_pthread_start + 168
    frame #5: 0x00007fff9248e351 libsystem_pthread.dylib`thread_start + 13
(lldb)

Cloven commented Apr 27, 2017 via email

SeanTAllen commented Apr 27, 2017

I seem to be able to make this happen far more consistently with 2 threads rather than 4.

SeanTAllen commented Apr 27, 2017

Different segfault this time, and yeah, running with 2 ponythreads seems to be the key to reproducing this easily for me:

Process 32252 stopped
* thread #2: tid = 0x2ddce5, 0x0000000100015709 client`ponyint_gc_release(gc=0x0000000120fffca8, aref=0x0000000120f44e80) + 201 at gc.c:683, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x0000000100015709 client`ponyint_gc_release(gc=0x0000000120fffca8, aref=0x0000000120f44e80) + 201 at gc.c:683
   680 	    void* p = obj->address;
   681 	    object_t* obj_local = ponyint_objectmap_getobject(&gc->local, p, &index);
   682
-> 683 	    pony_assert(obj_local->rc >= obj->rc);
   684 	    obj_local->rc -= obj->rc;
   685 	  }
   686
(lldb) bt
warning: could not load any Objective-C class information. This will significantly reduce the quality of type information available.
* thread #2: tid = 0x2ddce5, 0x0000000100015709 client`ponyint_gc_release(gc=0x0000000120fffca8, aref=0x0000000120f44e80) + 201 at gc.c:683, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x0000000100015709 client`ponyint_gc_release(gc=0x0000000120fffca8, aref=0x0000000120f44e80) + 201 at gc.c:683
    frame #1: 0x000000010000c0ce client`handle_message(ctx=0x0000000108fff9c8, actor=0x0000000120fffc00, msg=0x0000000120fbba80) + 222 at actor.c:68
    frame #2: 0x000000010000be7b client`ponyint_actor_run(ctx=0x0000000108fff9c8, actor=0x0000000120fffc00, batch=100) + 283 at actor.c:158
    frame #3: 0x000000010001e1d8 client`run(sched=0x0000000108fff980) + 168 at scheduler.c:287
    frame #4: 0x000000010001df49 client`run_thread(arg=0x0000000108fff980) + 57 at scheduler.c:338
    frame #5: 0x00007fff9249099d libsystem_pthread.dylib`_pthread_body + 131
    frame #6: 0x00007fff9249091a libsystem_pthread.dylib`_pthread_start + 168
    frame #7: 0x00007fff9248e351 libsystem_pthread.dylib`thread_start + 13
(lldb) p obj->rc
(size_t) $0 = 1
(lldb) p obj_local->rc
error: Couldn't apply expression side effects : Couldn't dematerialize a result variable: couldn't read its memory

I added an assert to verify and obj_local is indeed NULL.

@SeanTAllen

And another, slightly different one:

src/libponyrt/gc/gc.c:163: recv_local_object: Assertion `obj != NULL` failed.

Backtrace:
  0   client                              0x000000010001d421 ponyint_assert_fail + 161
  1   client                              0x00000001000143ad recv_local_object + 125
  2   client                              0x00000001000142f9 ponyint_gc_recvobject + 137
  3   client                              0x0000000100017140 pony_traceunknown + 96
  4   client                              0x0000000100001bde net_UDPSocket_Dispatch + 286
  5   client                              0x000000010000bdfb ponyint_actor_run + 283
  6   client                              0x000000010001e188 run + 168
  7   client                              0x000000010001def9 run_thread + 57
  8   libsystem_pthread.dylib             0x00007fff9249099d _pthread_body + 131
  9   libsystem_pthread.dylib             0x00007fff9249091a _pthread_body + 0
  10  libsystem_pthread.dylib             0x00007fff9248e351 thread_start + 13
Process 32707 stopped
* thread #2: tid = 0x2df049, 0x00007fff9a257f06 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGABRT
    frame #0: 0x00007fff9a257f06 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fff9a257f06 <+10>: jae    0x7fff9a257f10            ; <+20>
    0x7fff9a257f08 <+12>: movq   %rax, %rdi
    0x7fff9a257f0b <+15>: jmp    0x7fff9a2527cd            ; cerror_nocancel
    0x7fff9a257f10 <+20>: retq
(lldb) bt
warning: could not load any Objective-C class information. This will significantly reduce the quality of type information available.
* thread #2: tid = 0x2df049, 0x00007fff9a257f06 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGABRT
  * frame #0: 0x00007fff9a257f06 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff924934ec libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fff889db6df libsystem_c.dylib`abort + 129
    frame #3: 0x000000010001d513 client`ponyint_assert_fail(expr="obj != NULL", file="src/libponyrt/gc/gc.c", line=163, func="recv_local_object") + 403 at ponyassert.c:60
    frame #4: 0x00000001000143ad client`recv_local_object(ctx=0x0000000108fff848, p=0x0000000120e35580, t=0x000000010002c690, mutability=1) + 125 at gc.c:163
    frame #5: 0x00000001000142f9 client`ponyint_gc_recvobject(ctx=0x0000000108fff848, p=0x0000000120e35580, t=0x000000010002c690, mutability=1) + 137 at gc.c:457
    frame #6: 0x0000000100017140 client`pony_traceunknown(ctx=0x0000000108fff848, p=0x0000000120e35580, m=1) + 96 at trace.c:117
    frame #7: 0x0000000100001bde client`net_UDPSocket_Dispatch + 286
    frame #8: 0x000000010000c139 client`handle_message(ctx=0x0000000108fff848, actor=0x0000000118fffc00, msg=0x0000000120efaec0) + 457 at actor.c:103
    frame #9: 0x000000010000bdfb client`ponyint_actor_run(ctx=0x0000000108fff848, actor=0x0000000118fffc00, batch=100) + 283 at actor.c:158
    frame #10: 0x000000010001e188 client`run(sched=0x0000000108fff800) + 168 at scheduler.c:287
    frame #11: 0x000000010001def9 client`run_thread(arg=0x0000000108fff800) + 57 at scheduler.c:338
    frame #12: 0x00007fff9249099d libsystem_pthread.dylib`_pthread_body + 131
    frame #13: 0x00007fff9249091a libsystem_pthread.dylib`_pthread_start + 168
    frame #14: 0x00007fff9248e351 libsystem_pthread.dylib`thread_start + 13

@SeanTAllen

With 2 ponythreads, a debug compiler, and a debug client, I can seemingly reproduce this every time.

SeanTAllen commented Apr 28, 2017

It appears that any time this happens, we've recently gone through this branch in hash.c's search:

    if(p_length >
        (oi_p_length = get_probe_length(map, elem_hash, index, mask))) {
      // our probe length is greater than the elements probe length
      // we would normally have swapped so return this position
      *pos = index;
      *probe_length = p_length;
      *oi_probe_length = oi_p_length;
      return NULL;
    }
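
For reference, here's a minimal sketch of how a Robin Hood probe length is typically derived. This is not necessarily the runtime's actual get_probe_length; it assumes a power-of-two bucket count so that mask == size - 1:

    /* Distance from the element's "natural" bucket (elem_hash & mask) to the
       bucket it currently occupies, wrapping around the end of the array. */
    static size_t probe_length_sketch(size_t elem_hash, size_t index, size_t mask)
    {
      return (index - (elem_hash & mask)) & mask;
    }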

@SeanTAllen

Big question: is there a bug in the hash implementation, are we doing a double free, or are we deleting from the object map early?

@SeanTAllen

There's a bug in hash.c.

SeanTAllen commented Apr 28, 2017

So here is what I am regularly seeing, and where I need @dipinhora's assistance:

The map size is always 64.
The index we start from is always near the map size (60, 61, etc.).
The item we want is in the map. It always seems to be 2 away from the index we start at.

For example: the index is 61, and the value we want (for example, 8726618367696014013) is at bucket 63.

get_probe_length(map, elem_hash, index, mask) returns a number lower than p_length before we reach the bucket holding the value (in the case above, after the lookup at index 62), so we never find the item we are looking for.

The elem_hash at index 62 that triggers our issue is 10452750943628629886.
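
To illustrate why the search bails out with those numbers, here's a worked check using the standard Robin Hood distance formula sketched above (the exact helper in the runtime may differ, and I'm assuming the search for our key starts at its natural bucket, 61):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
      uint64_t mask = 63;                              /* map size 64 */
      uint64_t start = 61;                             /* natural index of the key we want */
      uint64_t elem_hash_62 = 10452750943628629886ULL; /* occupant of bucket 62 */

      /* Our key's probe length when the search reaches index 62. */
      uint64_t our_probe_at_62 = (62 - start) & mask;                      /* = 1 */

      /* The occupant of bucket 62 hashes to 62 itself, so its probe length is 0. */
      uint64_t occupant_probe_at_62 = (62 - (elem_hash_62 & mask)) & mask; /* = 0 */

      /* 1 > 0, so the branch above returns NULL at index 62 and the search
         never reaches bucket 63, where the value we want actually lives. */
      printf("ours=%llu occupant=%llu\n",
             (unsigned long long)our_probe_at_62,
             (unsigned long long)occupant_probe_at_62);
      return 0;
    }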

@SeanTAllen

A couple of updates. The map size has always been 64 whenever I see this.

The item is always in the hash, but it sits just after an element whose probe length is lower than ours at that position, which should not be possible. That is:

the item is at index 63,

but there's an item at index 62,

and when you compute the probe lengths, our item, checked at index 62, has a higher probe length than the item actually stored at 62.
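
For anyone trying to catch this state earlier, here is a hypothetical debug-only scan (not runtime code) for the invariant being violated: within a run of consecutive occupied buckets, an element's probe length can exceed its predecessor's by at most one. The state above (probe length 0 at bucket 62 followed by probe length 2 at bucket 63) fails that check.

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* hashes[i] is the hash of the element in bucket i; occupied[i] says
       whether bucket i holds a live element. Assumes size is a power of two. */
    static void check_probe_invariant(const uint64_t* hashes,
                                      const bool* occupied, size_t size)
    {
      size_t mask = size - 1;

      for(size_t i = 0; i + 1 < size; i++)
      {
        if(!occupied[i] || !occupied[i + 1])
          continue;

        size_t d_here = (i - (size_t)(hashes[i] & mask)) & mask;
        size_t d_next = ((i + 1) - (size_t)(hashes[i + 1] & mask)) & mask;

        /* If a later element's probe length jumps by more than one, a lookup
           for it can terminate early and report it missing. */
        assert(d_next <= d_here + 1);
      }
    }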

@SeanTAllen

@dipinhora thinks it's a memory clobbering issue, as we can't yet reproduce with 1 thread, and if you get rid of the actor that does the printing, the problem also goes away. It's odd because the object map that hits the error looks fine except for being out of order.

Very odd.

@SeanTAllen

I picked this up again this morning while talking to Sylvan. Same binaries as the other night, and I can't get a consistent crash anymore.

@SeanTAllen

AND I just determined that if Zoom isn't running, it happens.

@SeanTAllen

With some teamwork amongst @sylvanc, @dipinhora and myself, we appear to have found the source of the problem. Working on a fix.

SeanTAllen added a commit that referenced this issue Apr 29, 2017
I'm going to update this commit if it holds true.
I'd like some additional testing.

@dipinhora @sylvanc can you test #1872 for this?
@Cloven can you see if this fixes the issue for you?

@agarman @Theodus I don't think this will fix #1483, but
can you check and see, as there is a chance it might.
This was referenced Apr 29, 2017
SeanTAllen added a commit that referenced this issue May 4, 2017
If you were being facetious, you could describe the Pony runtime as
a series of hashmaps that are held together by some code. Hash
performance and correctness can have a great impact on everything
else in the runtime because they are at the basis of most everything
else in the runtime.

This change fixes a number of issues that appeared to be garbage
collection bugs but were, in fact, invariant violations in the
underlying hash implementation. It should be noted that while the
rest of this comment discusses invariant violations that exist in
our Robin Hood hash implementation, some of the bugs that this closes
predate the Robin Hood implementation. This leads me to believe that
the previous implementation had some subtle problem that could occur
under some rare interleaving of operations. How this occurred is
unknown at this time and probably always will be, unless someone
wants to go back to the previous version and use what we learned
here to diagnose the state of the code at that time.

This patch closes issues #1781, #1872, and #1483. It's the result of
teamwork amongst myself, Sylvan Clebsch, and Dipin Hora. History
should show that we were all involved in this resolution.

The skinny:

When garbage collecting items from our hash, that is, removing
deleted items to free up space, we can end up violating hash
invariants. Previously, one of these violations was correctly fixed;
however, the fix incorrectly assumed that another invariant held,
which is not the case.

Post garbage collection, if any items have been deleted from our
hash, we do an "optimize" operation on each hash item. We check to
see if the location the item would hash to is now an empty bucket.
If it is, we move the item to that location, thereby restoring the
expected chaining. There is, however, a problem with doing this.
It's possible, over time, to violate another invariant when fixing
the first violation.

Each item at a given location in the hash has a probe value. An
invariant of our data structure is that items at earlier locations
in the hash will always have an equal or lower probe value for that
location than items that come later.

For example, take two items, "foo" and "bar", and a hashmap whose
size is 8, where "foo" would map to index 1 and "bar" would map to
index 2. When looking at the probe values for "foo" and "bar" at
index 1, "foo" would have a probe value of 0, as it is at the
location it hashes to, whereas "bar" would have a probe value of 7.
The value is the number of indexes away from our "natural" hash
index that the item is.

When searching the hash, we can use this probe value to avoid doing
a linear search of all indexes for a given key. Once we find an item
whose probe value for a given index is higher than ours, we know
that the key can't be in the map past that index.

Except, of course, when we are restoring invariants after a delete.
It's possible, due to the sequential nature of our "optimize" repair
step, to violate this "always lower probe value" invariant.

The previous implementation of "optimize_item" assumed that this
invariant held true. By not detecting the invariant violation and
fixing it, we could end up with maps where a key existed but
wouldn't be found. When the map in question was an object map used
to hold gc'able items, this would result in an error that appears
to be a GC error. See #1781, #1872, and #1483.

It should be noted that, because of the complex chain of events that
needs to occur to trigger this problem, we were unable to devise a
unit test to catch it. If we had property-based testing for the Pony
runtime, this most likely would have been caught. Hopefully, PR
#1840 to add rapidcheck into Pony happens soon.

Closes #1781
Closes #1872
Closes #1483
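
As a quick illustration of the probe values in the "foo"/"bar" example from the commit message above (a hypothetical calculation using the wrapped-distance formula; the runtime's actual helper may compute this differently):

    #include <stdio.h>

    /* Size-8 map, so mask = 7. Probe value at an index is the wrapped
       distance from the item's natural bucket to that index. */
    int main(void)
    {
      unsigned mask = 7;
      unsigned foo_probe = (1u - 1u) & mask; /* "foo": natural index 1, checked at index 1 -> 0 */
      unsigned bar_probe = (1u - 2u) & mask; /* "bar": natural index 2, checked at index 1 -> 7 */
      printf("foo: %u, bar: %u\n", foo_probe, bar_probe);
      return 0;
    }
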
SeanTAllen added a commit that referenced this issue May 4, 2017
SeanTAllen added a commit that referenced this issue May 5, 2017
@SeanTAllen

The fix for this has been released as part of 0.14.0.
