Fix heap corruption issue #68443. #69106
Conversation
This was first seen with regions, but should be an issue with segments as well.

What happened was that in revisit_written_pages, we determined the highest allocated address in a region, then obtained the dirty pages up to that high address. In parallel, another thread allocated a new object after the high address and wrote to it.

The background GC thread saw the dirty page, but didn't explore the object because it was after its stale value for the high address. That caused some objects pointed at from the beginning of the new object to be seen as dead by background GC. Because the allocating thread also set the card table entries, the next ephemeral GC crashed because the references from the new object were now invalid.

The fix is to refetch the high address after we have obtained the dirty pages. That way, we'll explore new objects allocated while we were getting the dirty pages. New objects allocated and written to after we obtained the dirty pages will cause more dirty pages and will thus be explored later; a sketch of this ordering follows below.

My repro case, which used to cause a crash about every two hours, has run overnight with this fix without issues.
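The actual change is in revisit_written_pages in gc.cpp; purely to illustrate the ordering the fix relies on, here is a minimal, self-contained C++ sketch. The names region, get_dirty_pages, explore_page, and revisit_region are hypothetical stand-ins (stubbed so the file compiles), not the runtime's real identifiers; the only point is that the high address is re-read after the dirty-page snapshot is taken.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical region descriptor; the real one in gc.cpp is richer.
struct region
{
    uint8_t* start;
    // Advanced by allocating threads as new objects are placed.
    std::atomic<uint8_t*> allocated;
};

// Hypothetical stand-in for the write-watch query (GetWriteWatch on
// Windows, or the software write watch): fills `pages` with the dirty
// pages in [low, high) and returns how many it found. Stubbed here.
static size_t get_dirty_pages(uint8_t* low, uint8_t* high,
                              uint8_t** pages, size_t max_pages)
{
    (void)low; (void)high; (void)pages; (void)max_pages;
    return 0;
}

// Hypothetical per-page walk; never walks past `high`.
static void explore_page(uint8_t* page, uint8_t* high)
{
    (void)page; (void)high;
}

void revisit_region(region* r)
{
    // First read of the high address, as in the code before the fix.
    uint8_t* high = r->allocated.load(std::memory_order_acquire);

    uint8_t* dirty[256];
    size_t n = get_dirty_pages(r->start, high, dirty, 256);

    // The fix: refetch the high address *after* the dirty-page
    // snapshot. An object allocated past the old `high` and written
    // before the snapshot dirtied a page we now hold, and falls
    // within the refetched bound, so it gets explored. Writes after
    // the snapshot dirty fresh pages and are picked up on a later
    // revisit instead.
    high = r->allocated.load(std::memory_order_acquire);

    for (size_t i = 0; i < n; i++)
        explore_page(dirty[i], high);
}
```

Read this way, the fix needs no extra synchronization: every new object is either already covered by the refetched bound or will redirty a page and be revisited later.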
Tagging subscribers to this area: @dotnet/gc
@PeterSolMS Thank you!
No problem - thank you for getting the investigation going and merging!
Thanks for investigating and fixing this, Peter!