Fix heap corruption issue #68443. #69106

PeterSolMS · 2022-05-10T07:05:11Z

This was first seen with regions, but should be an issue with segments as well.

What happened was that in revisit_written_pages, we determined the highest allocated address in a region, then obtained the dirty pages up to that high address.

In parallel, another thread allocated a new object after the high address and wrote to it.

The background GC thread saw the dirty page, but didn't explore the object because it was after its stale value for the high address. That caused some objects pointed at from the beginning of the new object to be seen as dead by background GC.

Because the allocating thread also set the card table entries, the next ephemeral GC crashed: the references from the new object were now invalid.

The fix is to refetch the high address after we have obtained the dirty pages. That way, we'll explore new objects allocated while we were getting the dirty pages. New objects allocated and written to after we obtained the dirty pages will cause more dirty pages and will thus be explored later.
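The ordering problem and the fix can be illustrated with a small single-threaded model. This is only a sketch under simplifying assumptions, not the actual GC code: `ToyRegion`, `allocate_and_write`, `revisit`, and `kPageSlots` are hypothetical names, a "page" is just a group of four slots, and the racing allocator is modeled as a callback that runs between reading the high address and obtaining the dirty pages.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <vector>

// Toy model only: a region is a bump-allocated run of slots,
// and each "page" covers kPageSlots consecutive slots.
constexpr std::size_t kPageSlots = 4;

struct ToyRegion {
    std::size_t allocated = 0;          // "high address": one past the last allocated slot
    std::set<std::size_t> dirty_pages;  // pages written to since the last reset

    // Models an allocating thread: bump-allocate one slot and write to it,
    // which dirties the page containing that slot.
    std::size_t allocate_and_write() {
        std::size_t slot = allocated++;
        dirty_pages.insert(slot / kPageSlots);
        return slot;
    }
};

// Returns the slots the background GC would explore in one revisit pass.
// racing_mutator runs after the high address is read but before the dirty
// pages are obtained, which is the interleaving described above.
std::vector<std::size_t> revisit(ToyRegion& region,
                                 bool refetch_high,
                                 const std::function<void()>& racing_mutator) {
    std::size_t high = region.allocated;   // read the high address first
    racing_mutator();                      // another thread allocates and writes

    // Obtain the dirty pages and reset them, so a page seen here
    // is not reported again unless it is written to again.
    std::set<std::size_t> pages = region.dirty_pages;
    region.dirty_pages.clear();

    if (refetch_high) {
        high = region.allocated;           // the fix: refetch after obtaining dirty pages
    }
    assert(high <= region.allocated);

    std::vector<std::size_t> explored;
    for (std::size_t page : pages) {
        std::size_t begin = page * kPageSlots;
        std::size_t end = begin + kPageSlots;
        for (std::size_t slot = begin; slot < end && slot < high; ++slot) {
            explored.push_back(slot);      // slots at or past `high` are skipped
        }
    }
    return explored;
}
```

With two slots already allocated and a racing allocation of a third slot on the same page, `revisit(region, false, ...)` explores only the first two slots even though the page is dirty, while `revisit(region, true, ...)` also explores the new slot.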

My repro case, which used to crash about once every two hours, has run overnight with this fix without issues.

ghost · 2022-05-10T07:05:20Z

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	PeterSolMS
Assignees:	PeterSolMS
Labels:	`area-GC-coreclr`
Milestone:	-

jkotas

@PeterSolMS Thank you!

PeterSolMS · 2022-05-10T16:21:51Z

No problem - thank you for getting the investigation going and merging!

mangod9 · 2022-05-12T16:02:11Z

Thanks for investigating and fixing this Peter!

PeterSolMS requested review from cshung, Maoni0, mangod9 and mrsharm May 10, 2022 07:05

ghost assigned PeterSolMS May 10, 2022

dotnet-issue-labeler bot added the area-GC-coreclr label May 10, 2022

Maoni0 approved these changes May 10, 2022

Merge branch 'main' into Fix_issue_68443

39216ae

jkotas approved these changes May 10, 2022

jkotas merged commit a2bcd9b into dotnet:main May 10, 2022

jkotas mentioned this pull request May 10, 2022

Segmentation fault in LibraryImportGenerator.Unit.Tests #68443

Closed

MichalStrehovsky mentioned this pull request May 19, 2022

Try enabling regions for native AOT. #69108

Closed

ghost locked as resolved and limited conversation to collaborators Jun 11, 2022
