Occasional segfault soon-after-startup for UDP client #1872
Comments
Adding to this: if the printed counter advances beyond around 30k, I haven't seen it crash after that (and I have run it to 600k+). However, if you ^C the process and restart it, it fails about 5-10% of the time. You don't need to restart or otherwise touch the reflection server once it's running; the bug is triggered entirely by the client.

@Cloven which segfaults, the client or the server?

Client only.

@Cloven how many CPUs does your machine have?

@Cloven also, what type of CPU? Info about the type of Apple computer might be helpful.

@Cloven the pause is most likely OS X giving something else access to the CPU; I get it without a crash from time to time, and it appears to be the client being momentarily paused by OS X (from what I can see). I'm on 10.11.6. I was about to hit comment on this, as I had run about 100 times with no segfault, but I just got one. No pause though, and it was up to 67843 when it happened.

@Cloven can you build the client using --debug and see if you still get the problem?
I was able to recreate in the debugger. This is going to look familiar to some folks:
That box is a 2012 MacBook Air (dual-core 2.0GHz Intel Core i7, Turbo Boost
up to 3.2GHz, with 4MB shared L3 cache) and 8GB of RAM. I'll build with
debug when I can. It's interesting that you're getting the bug more rarely
than I am.
…On Thu, Apr 27, 2017 at 4:14 PM, Sean T Allen ***@***.***> wrote:
@Cloven <https://github.com/Cloven> can you build the client using --debug
and see if you still get the problem?
I seem to be able to make this happen far more consistently with 2 threads rather than 4.

Different segfault this time, and yeah, running with 2 ponythreads seems to be the key to easily reproducing for me:

I added an assert to verify, and

And another, slightly different:

With 2 ponythreads, a debug compiler, and a debug client, I can seemingly reproduce all the time.
It appears that any time this happens, we've recently gone through this branch:

```c
if(p_length > (oi_p_length = get_probe_length(map, elem_hash, index, mask))) {
  // our probe length is greater than the element's probe length
  // we would normally have swapped, so return this position
  *pos = index;
  *probe_length = p_length;
  *oi_probe_length = oi_p_length;
  return NULL;
}
```
Big question: is there a bug in the hash implementation, are we doing a double free, or are we deleting from the object map early?

There's a bug in hash.c.

So, what I am regularly seeing (and what I need @dipinhora's assistance with): the map size is always 64. For example, index is 61; get_probe_length(map, elem_hash, index, mask) returns a number lower than p_length before we get to the value in the bucket (in the above case, after the index 62 lookup), so we never get the item we are looking for. The elem_hash at 62 that triggers our issue is 10452750943628629886.

A couple of updates. The map size has always been 64 whenever I see this. The item is always in the hash, but it's just after an element that should be higher than it based on probe length. I.e., the item is at index 63, but there's an item at 62 such that, when you compute the probe lengths, our item checked at 62 has a higher probe length than the item actually at 62.
@dipinhora thinks it's a memory-clobbering issue, as we can't yet reproduce with 1 thread, and if you get rid of the actor that does the printing, the problem also goes away. It's odd because the object map that hits the error looks fine, except for being out of order. Very odd.

I picked this up this morning while talking to Sylvan. Same binaries as the other night, and I can't get a consistent crash anymore.

AND I just determined that if Zoom isn't running, it happens.

With some teamwork amongst @sylvanc, @dipinhora, and myself, we appear to have found the source of the problem. Working on a fix.
If you were being facetious, you could describe the Pony runtime as a series of hashmaps held together by some code. Hash performance and correctness can have a great impact on everything else in the runtime, because hashmaps are at the basis of almost everything else in the runtime.

This change fixes a number of issues that appeared to be garbage collection bugs but were, in fact, invariant violations in the underlying hash implementation. It should be noted that while the rest of this comment discusses invariant violations in our Robin Hood hash implementation, some of the bugs this closes predate the Robin Hood implementation. This leads me to believe that the previous implementation had some subtle problem that could occur under some rare interleaving of operations. How this occurred is unknown at this time and probably always will be, unless someone wants to go back to the previous version and use what we learned here to diagnose the state of the code at that time.

This patch closes issues #1781, #1872, and #1483. It's the result of teamwork amongst myself, Sylvan Clebsch, and Dipin Hora. History should show that we were all involved in this resolution.

The skinny: when garbage collecting items from our hash, that is, removing deleted items to free up space, we can end up violating hash invariants. Previously, one of these invariants was correctly fixed; however, the fix incorrectly assumed that another invariant held, which is not the case.

Post garbage collection, if any items have been deleted from our hash, we do an "optimize" operation on each hash item. We check whether the bucket the item hashes to is now empty. If it is, we move the item to that location, thereby restoring the expected chaining. There is, however, a problem with doing this: over time, it's possible to violate another invariant while fixing the first one. For a given location in the hash, each item has a probe value.

An invariant of our data structure is that an item at an earlier location in the hash will always have an equal or higher probe value for that location than items that come later. For example, take items "foo" and "bar" in a hashmap whose size is 8, where "foo" maps to index 1 and "bar" maps to index 2. Looking at the probe values for "foo" and "bar" at index 1: "foo" would have a probe value of 0, as it is at the location it hashes to, whereas "bar" would have a probe value of 7. The probe value is the number of indexes away from the item's "natural" hash index. When searching the hash, we can use this probe value to avoid a linear search of all indexes for a given key: once we find an item whose probe value for a given index is lower than ours, we know the key can't be in the map past that index.

Except, of course, when we are restoring invariants after a delete. Due to the sequential nature of our "optimize" repair step, it's possible to violate this ordering invariant. The previous implementation of "optimize_item" assumed that invariant held true. By not detecting the violation and fixing it, we could end up with maps where a key existed but couldn't be found. When the map in question was an object map used to hold gc'able items, this resulted in an error that appears to be a gc error. See #1781, #1872, and #1483.

It should be noted that, because of the complex chain of events needed to trigger this problem, we were unable to devise a unit test to catch it. If we had property-based testing for the Pony runtime, this most likely would have been caught. Hopefully, PR #1840 to add rapidcheck into Pony happens soon.

Closes #1781
Closes #1872
Closes #1483
The fix for this has been released as part of 0.14.0.
UDP Server: https://gist.github.com/992d3e599fdf4cae164296d2aabcf5a5
UDP Client: https://gist.github.com/1fd3260168f28212e8867c161e75d01e
OSX 10.11.6, pony version 0.13.1-874c0f8 [release] compiled with: llvm 3.9.1 -- Apple LLVM version 8.0.0 (clang-800.0.42.1) (but this also occurred under pony 0.11.x)
When you start a server (a simple UDP reflector) and then the client (a simple UDP spammer), you occasionally get a segmentation fault. This seems to generally occur 'early' (after only a few thousand packets have been sent); if it doesn't occur early, the process appears to be able to run forever. There is a long pause right before the crash, although that may be due to the core file being written (or it could be a threading deadlock). The output of the client at that time looks like this:
[...]
3854
3855
3856
[1] 40929 segmentation fault (core dumped)
and lldb says:

```
(lldb) bt
frame #0: __semwait_signal + 10
frame #1: 0x00007fff8e235787 libsystem_pthread.dylib pthread_join + 444
frame #2: 0x0000000100013ddc boltthrower pony_start + 572
frame #3: 0x00007fffa1c695ad libdyld.dylib start + 1
```

I'm not familiar enough with debugging under lldb/OS X to be much further help, but I'll happily help however I can.