Assert failure: m_alignpad == 0 in libraries tests #70231
Comments
This is failing in the next-object validation.
The most likely explanation is that there is a race condition between
Tagging subscribers to this area: @dotnet/gc
Issue Details: Hit in #70226. Stacktrace:
Same assert here: #68511
The same assert happened for jitminopts in System.Text.RegularExpressions.Unit.Tests on win-x64:
Same assert failed on System.Buffers.Tests https://dev.azure.com/dnceng/public/_build/results?buildId=1809197&view=ms.vss-test-web.build-test-results-tab&runId=48130508&resultId=197594&paneView=debug in this unrelated PR #70194
This is still failing quite a lot
these tests' logs all say "windows.10.amd64.open.rt". what is "open.rt"? I've been trying to run the System.IO.Pipes tests for a few hours with no repro.
so both failures I looked at (System.IO.Pipes and System.Text.Json.Tests) were running on "windows.10.amd64.open.rt". @hoyosjs tells me this is a type of queue that's "Windows Server 2016-Datacenter running on 2 cores". that's a pretty old OS. I've been running these on win10 overnight and haven't seen a repro. is it possible to check whether we only ever saw these failures on this particular type of queue?
The 'System.Text.RegularExpressions.Unit.Tests' run that failed also ran on that particular queue. I would imagine we certainly have these same tests running on newer OSes.
We run libraries tests with checked coreclr on Windows in the windows.10.amd64.open.rt queue only. I do not see any other queues running this combination on Windows. Here is a query you can use to find all libraries tests that failed with runtime asserts in the last 10 days (some of these failed with a different assert): https://dataexplorer.azure.com/clusters/engsrvprod/databases/engineeringdata
this is to fix #70231. for regions we could run into this situation - the object being validated is the last object before heap_segment_allocated (hs):

T0 calls NextObj and gets the next obj, which starts at heap_segment_allocated (hs).
T1 changes ephemeral_heap_segment to hs.
T0 does these comparisons: ((nextobj >= heap_segment_allocated(hs)) && (hs != hp->ephemeral_heap_segment)) || (nextobj >= hp->alloc_allocated). Both are still false, because alloc_allocated hasn't been changed just yet (and the old alloc_allocated is larger than nextobj).
T0 validates the next obj, concludes its m_alignpad is not 0, and asserts.
T1 forms an allocation context starting at heap_segment_allocated and clears the memory, so by the time the dump is taken, m_alignpad is already cleared (we actually clear it in a_fit_segment_end).

I'm fixing this by saving ephemeral_heap_segment and alloc_allocated and bailing if nextobj is not on the saved eph seg or if those 2 saved values are no longer in sync.
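Here is a minimal, self-contained C++ sketch of the pattern described above - snapshot ephemeral_heap_segment and alloc_allocated once, and skip validation when the snapshot is inconsistent or nextobj falls outside it. It is not the actual CoreCLR code: the Heap and HeapSegment types, the std::atomic fields, and can_validate_next_obj are simplified stand-ins used only to illustrate the idea.

#include <atomic>
#include <cstdint>

// Simplified stand-in for a GC heap segment.
struct HeapSegment
{
    uint8_t* mem;        // first object on the segment
    uint8_t* allocated;  // stands in for heap_segment_allocated(hs)
    uint8_t* reserved;   // end of the segment's address range
};

// Simplified stand-in for a per-heap structure holding the two racing fields.
struct Heap
{
    std::atomic<HeapSegment*> ephemeral_heap_segment;
    std::atomic<uint8_t*>     alloc_allocated;
};

// Returns true when nextObj may be validated; false means "possibly racing with an
// allocating thread, skip validation", which is the conservative choice.
bool can_validate_next_obj(Heap* hp, HeapSegment* hs, uint8_t* nextObj)
{
    // Snapshot both fields once instead of re-reading hp->... in each comparison.
    HeapSegment* saved_eph_seg = hp->ephemeral_heap_segment.load();
    uint8_t*     saved_alloc   = hp->alloc_allocated.load();

    // If the snapshot is internally inconsistent (the saved allocation pointer does
    // not lie inside the saved ephemeral segment), another thread is mid-update: bail.
    if ((saved_alloc < saved_eph_seg->mem) || (saved_alloc > saved_eph_seg->reserved))
        return false;

    // Same comparisons as before, but against the saved values: an address at or past
    // heap_segment_allocated is only plausible on the (saved) ephemeral segment, and
    // nothing at or past the saved allocation pointer is a fully formed object yet.
    if (((nextObj >= hs->allocated) && (hs != saved_eph_seg)) ||
        (nextObj >= saved_alloc))
        return false;

    return true;
}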
Potentially still hitting this issue.
I have looked at the dump. Yes, it is still the same issue.
The case that @jkotas has just closed as a dup happened on Linux arm64. I am in the process of attempting to repro it on Linux at the moment. Maybe I'll have more luck with the repro there.
so I just started looking at some dumps that @mrsharm gathered. from an initial look nothing seems unusual. it's not hitting what I fixed, so either that issue was no longer reproing (which is good, as it was definitely an issue) or it was never due to that reason. nextObj looks correct (as in, it is indeed the next obj from the object that's being validated) and is always in gen0, so it just got allocated. in one case I see that it was at the end of an alloc context, but in the other 2 cases it was in the middle of an alloc context. it's too bad that there's a stress log entry that actually doesn't get logged (even though it seems like the intention is to, since it calls
some updates on this (and I will be OOF tomorrow) - @PeterSolMS continued the investigation and made a lot of progress on it - his theory is that the nextObj was allocated from a free list item but is escaping the demotion check (I didn't write the code in NextObj to begin with, but my guess is the same as Peter's - it's there to skip verification when nextObj is allocated from the free list), and that the syncblock and method table were cleared out of order. we have a potential fix. @janvorli could actually repro this, so he will run with the fix tomorrow and see if our theory is correct.
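A minimal sketch (not the actual runtime code, and ignoring the syncblock for brevity) of the kind of guard this theory implies - the same shape as the snippet quoted a few comments below: if the next object's method table has not been published yet (null) or is still the free-object method table, treat it as "racing with an allocator" and skip validation instead of asserting. MethodTableModel, ObjectModel and g_free_object_method_table are hypothetical stand-ins.

#include <atomic>

struct MethodTableModel {};                          // stand-in for a method table
static MethodTableModel g_free_object_method_table;  // stand-in for g_pFreeObjectMethodTable

// Stand-in for an object header whose method table slot is written by the allocating
// thread and read racily by the validating thread.
struct ObjectModel
{
    std::atomic<MethodTableModel*> method_table;
};

// Only validate nextObj once it already looks like a fully formed, non-free object.
bool should_validate(ObjectModel* nextObj)
{
    if (nextObj == nullptr)
        return false;

    MethodTableModel* mt = nextObj->method_table.load(std::memory_order_acquire);

    // Null: the allocator has cleared the free-list item's header but has not yet
    // published the real method table. Free-object MT: the memory is (or was just)
    // a free-list item. Either way, skip validation rather than assert.
    return (mt != nullptr) && (mt != &g_free_object_method_table);
}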
@Maoni0 @janvorli A sort of "stable" repro on Windows-x64:
^ eventually fails with:
I also observe the behaviour @jkotas noticed, that m_alignpad is quickly cleared; I even put multiple
note: doesn't repro when I change the optimization level from
@Maoni0 has a fix already. I also found a reasonably frequent repro last week - the readytorun\coreroot_determinism\coreroot_determinism coreclr test - and ran it with her fix over the weekend. Before the fix, it reproed about every 80 iterations. With the fix, 1000 iterations have passed without any repro.
and thanks @EgorBo for finding another test that repros consistently.
Thank you very much, @EgorBo! Based on your list of instructions, I was able to fairly quickly repro the assertion failure on my Windows desktop and have confirmed that after:

// if ((g == 0) && hp->settings.demotion)
//     return NULL;//could be racing with another core allocating
Object * nextObj = GCHeapUtilities::GetGCHeap ()->NextObj (this);
// only go on to validate nextObj when it looks fully formed: its method table must be
// published and must not be the free-object method table
if ((nextObj != NULL) &&
    (nextObj->GetGCSafeMethodTable() != nullptr) &&
    (nextObj->GetGCSafeMethodTable() != g_pFreeObjectMethodTable))

the assertion failure doesn't occur (based on running the tests overnight). The main question I had was: how do I find out which test the assertion failure occurred for? Right now, I get a pretty general message that there was an assertion failure for a System.Collections.Test but no indication as to which test it failed on:

If we know the exact test, we'll have a quicker repro with a more targeted run.
@mrsharm in my repro you can append
This is still failing in the libraries-pgo leg on linux-arm: https://dev.azure.com/dnceng/public/_build/results?buildId=1927773&view=results
This assert can be hit for many different GC holes, GC heap corruptions, and GC bugs. It is important to at least capture the stack trace where the assert is hit, and to track different stack traces as separate issues. I think it is unlikely that the Linux arm crash you have seen has the same root cause as this issue; I expect it will have a different stack trace. Unfortunately, we cannot tell for sure since no dump was collected. I am closing this as non-actionable. If you see a test failing with this assert, do not re-activate this issue. Instead, open a new issue and capture the stack trace where the assert is hit in the issue description.
Hit in #70226
Dump and logs: https://dev.azure.com/dnceng/public/_build/results?buildId=1806089&view=ms.vss-test-web.build-test-results-tab&runId=48090090&resultId=197319&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab
Stacktrace: