Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible #1872 issue fix #1876

Closed
wants to merge 1 commit into from
Closed

Possible #1872 issue fix #1876

wants to merge 1 commit into from

Conversation

SeanTAllen
Copy link
Member

I'm going to update this commit if it holds true.
I'd like some additional testing.

@dipinhora @sylvanc can you test #1872 for this.
@Cloven can you see if this fixes the issue for you.

@agarman @Theodus I don't think this will fix #1483 but
can you check and see as there is a chance it might.

@SeanTAllen SeanTAllen added the do not merge This PR should not be merged at this time label Apr 29, 2017
@SeanTAllen SeanTAllen requested review from sylvanc and dipinhora April 29, 2017 22:01
@SeanTAllen
Copy link
Member Author

Coincidentally, property based testing of the runtime could find this. @sylvanc and I wrote a test to attempt to create this condition using all the elements in a known bad case and were unable to do so because the order of insert of the 17 different items probably impacted as well as who knows what else.

Looking forward to @WojciechMigda work in #1840. It's going to be the basis of a great correctness of implementation boon.

@SeanTAllen
Copy link
Member Author

Btw, for anyone watching on the internets, this possible fix is a group effort. I narrowed down a basic cause. @sylvanc and I narrowed that down to a specific method and @dipinhora helped with the final bug finding. TEAMWORK!

@SeanTAllen
Copy link
Member Author

If this turns out to be a fix for #1872 (and possibly other issues), I'm going to add comments explaining what is going on in optimize_item as well as a really good commit comment for this.

@agarman
Copy link

agarman commented Apr 30, 2017

#1483 may be fixed by this. My previous test script hasn't been able to reproduce the bug. I'll do more tests later.

@SeanTAllen
Copy link
Member Author

@agarman that would be incredibly awesome if this fixed #1483 as well.

@agarman
Copy link

agarman commented Apr 30, 2017

@SeanTAllen looks like it fixes #1483
I ran tests with different settings across two hours. Everything was solid.

@SeanTAllen
Copy link
Member Author

thank you for testing @agarman.

@agarman
Copy link

agarman commented Apr 30, 2017

Thank you for fixing.

@dipinhora
Copy link
Contributor

Finally had a chance to test this. This change fixed both #1872 and #1781 in my limited testing.

@albertnetymk and @shuppyloh would you be able to test to confirm if #1781 is fixed for you?

@SeanTAllen
Copy link
Member Author

I'm going to assume that thisfixes #1781, #1872, and #1483 based on feedback.

I'll work on turning this into a final, mergeable PR this week.

@albertnetymk
Copy link

#1781 seems to work for me now. Thank you very much. OOC, some explanation on why this mysterious else branch fixes the issue?

If you were being facetious, you could describe the Pony runtime as
a  series of hashmaps that are held together by some code. Hash
performance  and correctness can have a great impact on everything
else in the runtime because they are at the basis of most everything
else in the runtime.

This change fixes a number of issues that appears to be garbage
collection bugs but were in fact, problems with invariant violation
in the underlying hash implementation. It should be noted that while
the rest of this comment discuss invariant violations that exist in
our Robin Hood hash implementation, some of the bugs that this closes
predate the Robin Hood implementation. This leads me to believe that
the previous implementation had some subtle problem that could occur
under some rare interleaving of operations. How this occurred is
unknown at this time and probably always will be unless someone wants
to go back to the previous version and use what we learned here to
diagnose the state of the code at that time.

This patch closes issues #1781, #1872, and #1483. It's the result of
teamwork amongst myself, Sylvan Clebch and Dipin Hora. History should
show that we were all involved in this resolution.

The skinny:

When garbage collecting items from our hash, that is, removing
deleted  items to free up space, we can end up violating hash
invariants. Previously, one of these invariants was correctly fixed,
however, it incorrectly assumed another invariant held but that is
not the case.

Post garbage collection, if any items have been deleted from our
hash, we do an "optimize" operation on each hash item. We check to
see if the location the item would hash to is now an empty bucket.
If it is, we move the item to that location thereby restoring the
expected chaining. There is, however, a problem with doing this.
It's possible over time to violate another invariant when fixing the
first violation.

For a given item at a given location in the hash, each item has a
probe value. An invariant of our data structure is that items at
earlier locations in the  hash will always have an equal or lower
probe value for that location than items that come later.

For example, items: "foo" and "bar". Given a hashmap whose size is
8, where  "foo" would made to index 1 and "bar" would map to index
"2". When looking at  the probe values for "foo" and "bar" at index
1, "foo" would have a probe  value of "0" as it is at the location
it hashes to whereas "bar" would have a  probe value of "7". The
value is the number of indexes away from our "natural" hash index
that the item is.

When search the hash, we can use this probe value to not do a linear
search of all indexes for the a given key. Once we find an item
whose probe value for a given index is higher than ours, we know
that the key can't be in the map past that index.

Except our course for when we are restoring invariants after a
delete. It's possible, due to the sequential nature of our
"optimize" repair step, to  violate this "always lower probe
value" invariant.

The previous implementation of "optimize_item" assumed that in
invariant held true. By not detecting the invariant violation and
fixing it, we could end up with maps where a key existed in it but
it wouldn't be found. When the map in question was an object map
used to hold gc'able items, this would result in an error that
appears to be a gc error. See #1781, #1872, and #1483.

Closes #1781
Closes #1872
Closes #1483
@SeanTAllen
Copy link
Member Author

SeanTAllen commented May 4, 2017

closing this an opening a new one to make for an easier reference from changelog.

see #1886

@SeanTAllen SeanTAllen closed this May 4, 2017
@SeanTAllen
Copy link
Member Author

@albertnetymk see: #1886

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do not merge This PR should not be merged at this time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

EXC_BAD_ACCESS crash in try_gc when under heavy load
4 participants