Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix heap corruption issue #68443. #69106

Merged
merged 2 commits into from
May 10, 2022
Merged

Conversation

PeterSolMS
Copy link
Contributor

This has first seen with regions, but should be an issue with segments as well.

What happened was that in revisit_written_pages, we determined the highest allocated address in a region, then obtained the dirty pages up to that high address.

In parallel, another thread allocated a new object after the high address and wrote to it.

The background GC thread saw the dirty page, but didn't explore the object because it was after its stale value for the high address. That caused some objects pointed at from the beginning of the new object to be seen as dead by background GC.

Because the allocating thread also set the card table entries, the next ephemeral GC crashed because the references from the new object were now invalid.

The fix is to refetch the high address after we have obtained the dirty pages. That way, we'll explore new objects allocated while we were getting the dirty pages. New objects allocated and written to after we obtained the dirty pages will cause more dirty pages and will thus be explored later.

My repro case that used to cause a crash every two hours or so has run overnight with this fix without issues.

This has first seen with regions, but should be an issue with segments as well.

What happened was that in revisit_written_pages, we determined the highest allocated address in a region, then obtained the dirty pages up to that high address.

In parallel, another thread allocated a new object after the high address and wrote to it.

The background GC thread saw the dirty page, but didn't explore the object because it was after its stale value for the high address. That caused some objects pointed at from the beginning of the new object to be seen as dead by background GC.

Because the allocating thread also set the card table entries, the next ephemeral GC crashed because the references from the new object were now invalid.

The fix is to refetch the high address after we have obtained the dirty pages. That way, we'll explore new objects allocated while we were getting the dirty pages. New objects allocated and written to after we obtained the dirty pages will cause more dirty pages and will thus be explored later.

My repro case that caused about a crash every two hours or so has run overnight with this fix without issues.
@ghost
Copy link

ghost commented May 10, 2022

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

This has first seen with regions, but should be an issue with segments as well.

What happened was that in revisit_written_pages, we determined the highest allocated address in a region, then obtained the dirty pages up to that high address.

In parallel, another thread allocated a new object after the high address and wrote to it.

The background GC thread saw the dirty page, but didn't explore the object because it was after its stale value for the high address. That caused some objects pointed at from the beginning of the new object to be seen as dead by background GC.

Because the allocating thread also set the card table entries, the next ephemeral GC crashed because the references from the new object were now invalid.

The fix is to refetch the high address after we have obtained the dirty pages. That way, we'll explore new objects allocated while we were getting the dirty pages. New objects allocated and written to after we obtained the dirty pages will cause more dirty pages and will thus be explored later.

My repro case that used to cause a crash every two hours or so has run overnight with this fix without issues.

Author: PeterSolMS
Assignees: PeterSolMS
Labels:

area-GC-coreclr

Milestone: -

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PeterSolMS Thank you!

@PeterSolMS
Copy link
Contributor Author

No problem - thank you for getting the investigation going and merging!

@mangod9
Copy link
Member

mangod9 commented May 12, 2022

Thanks for investigating and fixing this Peter!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants