tracing\eventpipe\providervalidation\providervalidation\providervalidation.cmd fails in gcstress3, Windows x86 #2230

BruceForstall · 2020-01-27T17:08:39Z

This test fails in the CI in the last 3 runs of Windows x86 GCStress=0x3:

\r\nAssert failure(PID 4216 [0x00001078], Thread: 7224 [0x1c38]): Consistency check failed: Invalid transition into managed code!\r\n\r\nWe're walking this thread's stack and we've reached a managed frame at Esp=0x00F7D758. (The method is Advapi32::EventWriteTransfer_PInvoke) The very next FS:0 record (0x00F7D760) up from this point on the stack should be one of our 'unmanaged to managed SEH handlers', but its not... its something else, and that's very bad. It indicates that someone managed to call into managed code without setting up the proper exception handling.\r\n\r\nGet a good unmanaged stack trace for this thread. All FS:0 records are on the stack, so you can see who installed the last handler. Somewhere between that function and where the thread is now is where the bad transition occurred.\r\n\r\nA little extra info: FS:0 = 0x00F7D370, pEHR->Handler = 0x72E381BD\r\nFAILED: IsUnmanagedToManagedSEHHandler(pEHR)\r\n\r\nCORECLR! CHECK::Trigger + 0x2FB (0x727675be)\r\nCORECLR! VerifyValidTransitionFromManagedCode + 0x165 (0x7283af20)\r\nCORECLR! StackFrameIterator::NextRaw + 0x887 (0x72806395)\r\nCORECLR! StackFrameIterator::Next + 0x46 (0x72805aca)\r\nCORECLR! Thread::StackWalkFramesEx + 0x180 (0x7280705c)\r\nCORECLR! Thread::StackWalkFrames + 0x159 (0x72806e5c)\r\nCORECLR! ScanStackRoots + 0x198 (0x72cbc639)\r\nCORECLR! GCToEEInterface::GcScanRoots + 0x104 (0x72cbb9b6)\r\nCORECLR! WKS::gc_heap::mark_phase + 0x1AD (0x72c71931)\r\nCORECLR! WKS::gc_heap::gc1 + 0x167 (0x72c6bb08)\r\n File: F:\workspace\_work\1\s\src\coreclr\src\vm\i386\excepx86.cpp Line: 329\r\n Image: C:\h\w\A86D08D2\p\CoreRun.exe\r\n\r\n\r\nReturn code: 1\r\nRaw output file: C:\h\w\A86D08D2\w\9D9C0906\e\tracing\eventpipe\Reports\tracing.eventpipe\providervalidation\providervalidation\providervalidation.output.txt\r\nRaw output:\r\nBEGIN EXECUTION\r\n "C:\h\w\A86D08D2\p\corerun.exe" providervalidation.dll \r\n 0.0s: ==TEST STARTING==\r\n 5.6s: Started sending sentinel events...\r\n 6.0s: Connecting to EventPipe...\r\n 10.5s: Connected to EventPipe with sessionID '0x7633a78'\r\n 10.5s: Creating EventPipeEventSource...\r\n 22.7s: EventPipeEventSource created\r\n 26.4s: Dynamic.All callback registered\r\n 26.4s: Starting stream processing...\r\n 33.4s: Saw new provider 'Microsoft-DotNETCore-SampleProfiler'\r\n 50.9s: Saw sentinel event\r\n 51.0s: Stopped sending sentinel events\r\n 51.3s: Starting event generating action...\r\n 66.5s: Fired MyEvent 0/100,000 times...\r\n 70.3s: Saw new provider 'MyEventSource'\r\n 74.1s: Fired MyEvent 10,000/100,000 times...\r\n 75.0s: Fired MyEvent 20,000/100,000 times...\r\nExpected: 100\r\nActual: -1073740286\r\nEND EXECUTION - FAILED\r\nFAILED\r\nTest Harness Exitcode is : 1\r\nTo run the test:\r\n> set CORE_ROOT=C:\h\w\A86D08D2\p\r\n> C:\h\w\A86D08D2\w\9D9C0906\e\tracing\eventpipe\providervalidation\providervalidation\providervalidation.cmd\r\nExpected: True\r\nActual: False

https://dev.azure.com/dnceng/public/_build/results?buildId=497294&view=ms.vss-test-web.build-test-results-tab&runId=15824298&resultId=110450&paneView=debug

The text was updated successfully, but these errors were encountered:

BruceForstall · 2020-01-27T17:08:55Z

@josalem

josalem · 2020-01-27T19:37:52Z

Taking a look today

josalem · 2020-01-28T01:43:58Z

Took a look at the dump that got generated and picked out a few things. I think the linked list of exception handlers might be corrupted. I took a look at 0x00F7D760 which points at the top level FS:0 entry and the address it has for a handler is 0x72e381bd. This seems to point to a seemingly random spot in the middle of jitted code (PAGE_EXECUTE_WRITECOPY owned by coreclr.dll) that doesn't resolve to a symbol (the closest symbol that windbg can resolve is 155,101 bytes away). The last managed->native transition happened in what I believe is generated P_Invoke code (Advapi32::EventWriteTransfer_PInvoke). When the GC walks back down the stack and hits this native->managed step, the validation check on the handler list fails this check:

runtime/src/coreclr/src/vm/i386/excepx86.cpp

Line 153 in 2cc926e

    
           BOOL IsUnmanagedToManagedSEHHandler(EXCEPTION_REGISTRATION_RECORD *pEstablisherFrame)

which is what causes this assertion to trigger.

I'm trying to run this test locally so I can figure out what is setting this top FS:0 entry into the list, but I can't reproduce the failure locally. I'll keep poking away at this.

@BruceForstall, any idea what might cause the exception handler list to get corrupted inside P_Invoke code? It looks like The P_Invoke code ended up in coreclr!JIT_PInvokeEndRarePath if that matters.

BruceForstall · 2020-01-28T01:58:03Z

I don't have an easy idea for why this corruption would occur. My first suggestion would be to run with COMPlus_GCStress=0xF and see if it reliably repros in a different location. If that occurs, and the issue is within JIT'ed code, then perhaps that indicates a JIT compiler GC reporting issue/hole. If not, it's possible that some VM stub/etc. is incorrectly preserving GC objects.

@janvorli Any advice here?

josalem · 2020-01-28T02:48:15Z

Could this be related to #2215?

janvorli · 2020-01-28T21:00:20Z

I'll take a look.

janvorli · 2020-01-28T21:04:35Z

@josalem I cannot find the dump - could you please tell me where to get it?

josalem · 2020-01-28T21:12:54Z

It got uploaded as part of the Helix workitem. Here's a direct link.

The build artifacts can be downloaded from AzDO for symbols.

janvorli · 2020-01-29T01:41:42Z

I have figured out what's going on. It is not a corruption. The check that fails is actually wrong in this state where we have just returned from a PInvoke and the stack walk was triggered from the JIT_PInvokeEndRarePath.

The problem is actually quite interesting and it has to do with the calling convention of the function that was pinvoked - namely the fact that the callee pops the arguments from stack. So here is what happens:

The managed code pinvokes the EventWriteTransfer_PInvoke, which is EventWriteTransfer in advapi32.dll, which in turn is just an import going to the ntdll!EventWriteTransfer. That function has quite a lot of parameters. The ret instruction at its end is ret 1Ch, so that it pops the arguments from the stack.
This is an inlined pinvoke and the sequence of operations is as follows:
- Push the pinvoke parameters to the stack
- Call PINVOKE_BEGIN helper that fills in an InlinedCallFrame on the stack. The interesting part here is that it sets the m_pCallSiteSP member to the current SP (which means SP with all the arguments already pushed)
- Call the actual pinvoke, which at its return pops all the arguments from the stack
- Call PINVOKE_END which ends up jumping to the JIT_PInvokeEndRarePath.
- The JIT_PInvokeEndRarePath, as a C++ function, installs a record into the FS:0 chain. The entry is a local in the frame of JIT_PInvokeEndRarePath. The interesting part here is that this location is actually at an address that is larger the m_pCallSiteSP stored in the InlinedCallFrame. That's due to the fact that the arguments of the pinvoke were popped before.
Now the thread->RareDisablePreemptiveGC(); in the JIT_PInvokeEndRarePath triggers stack walk and when it hits the InlinedCallFrame, it uses the SP that it thinks the managed caller of the pinvoke has (m_pCallSiteSP). But that is no longer the current SP in the managed function due to the argument popping. Then it calls VerifyValidTransitionFromManagedCode and it looks for the first record in the fs:0 chain that is at address that's >= the SP. And it finds the one from JIT_PInvokeEndRarePath due to the fact that it stored the fs:0 chain record in stack slot that contained the pinvoke arguments before. And it asserts thinking that things got corrupted.

Now the question is why we were not hitting this before. I think that the likely reason could be that the JIT was not pushing the pinvoke arguments to the stack before calling the PINVOKE_BEGIN.
@BruceForstall is this possible?

cc: @jkotas

jkotas · 2020-01-29T05:46:19Z

think that the likely reason could be that the JIT was not pushing the pinvoke arguments to the stack before calling the PINVOKE_BEGIN.

I do not think that this is the problem. The contract for m_pCallSiteSP always was that it ESP after pushing the arguments. It is documented in the doc - look for:

    // X86: ESP after pushing the outgoing arguments, and just before calling
    // out to unmanaged code.

I think the problem is https://github.com/dotnet/runtime/blob/master/src/coreclr/src/vm/i386/cgenx86.cpp#L617 in combination with dotnet/coreclr#22560 (comment) .

The ReadyToRun helper does not set the MethodDesc for the PInvoke frame. It means that the unwinder is not able to adjust the stack for popped arguments and we get this problem.

jkotas · 2020-01-29T05:47:08Z

cc @fadimounir

janvorli · 2020-01-29T10:00:39Z

@jkotas thank you, I've missed the comment on the m_pCallSiteSP.

jkotas · 2020-01-29T17:11:55Z

Thoughts about how to fix this?

fadimounir · 2020-01-29T17:56:13Z

Can we make some changes to the JIT to make it pass the PInvoke MethodDesc to JIT_PInvokeBegin helper? The MethodDesc can be loaded by a R2R helper call

jkotas · 2020-01-29T18:01:37Z

Yes, that may be one option. Another option is to make the JIT to pass the size of the stack-arguments to JIT_PInvokeBegin helper, on x86 only.

fadimounir · 2020-01-29T18:18:46Z

#if !defined(_TARGET_64BIT_)
        // On 32-bit targets, indirect calls need the size of the stack args in InlinedCallFrame.m_Datum.
        const unsigned numStkArgBytes = call->fgArgInfo->GetNextSlotNum() * TARGET_POINTER_SIZE;

        src = comp->gtNewIconNode(numStkArgBytes, TYP_INT);
#else

Is that applicable also to arm32?

jkotas · 2020-01-29T18:25:53Z

I do not see why this would be needed for anything but Windows x86.

fadimounir · 2020-01-29T18:34:37Z

I do not see why this would be needed for anything but Windows x86.

Interesting... probably legacy code before the time of arm/arm64 then. @BruceForstall

@josalem did you try to repro this with complus_gcstress=0xf? If so, did it start to fail deterministically?

AriNuer · 2020-03-03T09:58:41Z

Same test tracing\\eventpipe\\providervalidation\\providervalidation\\providervalidation.cmd failed here :runtime-coreclr crossgen2 outerloop 20200302.1

Details:
https://dev.azure.com/dnceng/public/_build/results?buildId=544222&view=ms.vss-test-web.build-test-results-tab&runId=17227512&resultId=107009&paneView=debug

Error message:
\r\nAssert failure(PID 3448 [0x00000d78], Thread: 552 [0x0228]): Consistency check failed: AV in clr at this callstack:\r\n------\r\nCORECLR! memcmp + 0x12 (0x00007ffb25268c72)\r\nCORECLR! Thread::VirtualUnwindCallFrame + 0x404 (0x00007ffb24a5c908)\r\nCORECLR! EECodeManager::EnsureCallerContextIsValid + 0x1D6 (0x00007ffb24d25542)\r\nCORECLR! StackFrameIterator::CheckForSkippedFrames + 0x6D (0x00007ffb24a56d79)\r\nCORECLR! StackFrameIterator::ProcessCurrentFrame + 0x178 (0x00007ffb24a5b588)\r\nCORECLR! StackFrameIterator::Init + 0x213 (0x00007ffb24a593af)\r\nCORECLR! Thread::StackWalkFramesEx + 0x155 (0x00007ffb24a5c195)\r\nCORECLR! Thread::HandledJITCase + 0x23F (0x00007ffb24af456b)\r\nCORECLR! ThreadSuspend::SuspendRuntime + 0x444 (0x00007ffb24af9edc)\r\nCORECLR! ThreadSuspend::SuspendEE + 0x2D7 (0x00007ffb24af97bb)\r\nCORECLR! SampleProfiler::ThreadProc + 0x272 (0x00007ffb25036652)\r\nKERNEL32! BaseThreadInitThunk + 0x14 (0x00007ffb52d284d4)\r\nNTDLL! RtlUserThreadStart + 0x21 (0x00007ffb5316e8b1)\r\n-----\r\n.AV on tid=0x228 (552), cxr=00000049D5FFD040, exr=00000049D5FFD530\r\nFAILED: false\r\n\r\nCORECLR! CHECK::Trigger + 0x2CC (0x00007ffb2494c654)\r\nCORECLR! CLRVectoredExceptionHandlerPhase3 + 0x364 (0x00007ffb24ab6e80)\r\nCORECLR! CLRVectoredExceptionHandlerPhase2 + 0x69 (0x00007ffb24ab6901)\r\nCORECLR! CLRVectoredExceptionHandler + 0x1C6 (0x00007ffb24ab687a)\r\nCORECLR! CLRVectoredExceptionHandlerShim + 0x17A (0x00007ffb24ab70da)\r\nNTDLL! RtlInitializeCriticalSection + 0x190 (0x00007ffb53165ef0)\r\nNTDLL! RtlImageNtHeaderEx + 0x17B (0x00007ffb5313fa7b)\r\nNTDLL! KiUserExceptionDispatcher + 0x3A (0x00007ffb531a964a)\r\nCORECLR! memcmp + 0x12 (0x00007ffb25268c72)\r\nCORECLR! Thread::VirtualUnwindCallFrame + 0x404 (0x00007ffb24a5c908)\r\n File: F:\workspace_work\1\s\src\coreclr\src\vm\excep.cpp Line: 7558\r\n Image: C:\h\w\A9460959\p\CoreRun.exe\r\n\r\n\r\nReturn code: 1\r\nRaw output file: C:\h\w\A9460959\w\A17608F1\e\tracing\eventpipe\Reports\tracing.eventpipe\providervalidation\providervalidation\providervalidation.output.txt\r\nRaw output:\r\nBEGIN EXECUTION\r\n 1 file(s) copied.\r\n" "C:\h\w\A9460959\p\crossgen2\crossgen2" C:\h\w\A9460959\w\A17608F1\e\tracing\eventpipe\providervalidation\providervalidation\providervalidation.org -o:C:\h\w\A9460959\w\A17608F1\e\tracing\eventpipe\providervalidation\providervalidation\providervalidation.dll --targetarch:x64 -O --inputbubble -r:"C:\h\w\A9460959\p\System..dll" -r:"C:\h\w\A9460959\p\Microsoft..dll" -r:"C:\h\w\A9460959\p\mscorlib.dll" -r:C:\h\w\A9460959\w\A17608F1\e\tracing\eventpipe\providervalidation\providervalidation*.dll"\r\nEmitting R2R PE file: C:\h\w\A9460959\w\A17608F1\e\tracing\eventpipe\providervalidation\providervalidation\providervalidation.dll\r\n "C:\h\w\A9460959\p\corerun.exe" providervalidation.dll \r\n 0.0s: ==TEST STARTING==\r\n 0.6s: Started sending sentinel events...\r\n 0.6s: Connecting to EventPipe...\r\n 0.8s: Connected to EventPipe with sessionID '0x21aa213c650'\r\n 0.8s: Creating EventPipeEventSource...\r\nExpected: 100\r\nActual: -1073740286\r\nEND EXECUTION - FAILED\r\nFAILED\r\nTest Harness Exitcode is : 1\r\nTo run the test:\r\n> set CORE_ROOT=C:\h\w\A9460959\p\r\n> C:\h\w\A9460959\w\A17608F1\e\tracing\eventpipe\providervalidation\providervalidation\providervalidation.cmd\r\nExpected: True\r\nActual: False
`

Stack trace:
at tracing_eventpipe._providervalidation_providervalidation_providervalidation_._providervalidation_providervalidation_providervalidation_cmd() in F:\workspace\_work\1\s\artifacts\tests\coreclr\Windows_NT.x64.Checked\TestWrappers\tracing.eventpipe\tracing.eventpipe.XUnitWrapper.cs:line 528

BruceForstall · 2020-03-07T03:04:56Z

@fadimounir Looks like #26834 fixed this? Can we close this?

The "new failure" above is crossgen2, and is unrelated.

fadimounir · 2020-03-09T23:37:30Z

@BruceForstall The fix has been reverted due to #31809.
@jkotas and I couldn't repro the second issue, so right now I have a new PR (#33249), and running CI testing for a couple of loops to see if some tests will fail again due to the fix (so far, nothing failed)

I'll close the issue once I merge the new PR

BruceForstall added arch-x86 EventPipe GCStress labels Jan 27, 2020

Dotnet-GitSync-Bot added area-Tracing-coreclr untriaged New issue has not been triaged by the area owner labels Jan 27, 2020

jkotas added area-ReadyToRun-coreclr and removed area-Tracing-coreclr labels Jan 29, 2020

jkotas added this to the 5.0 milestone Jan 29, 2020

jkotas added the reliability label Jan 29, 2020

fadimounir self-assigned this Jan 29, 2020

jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Jan 30, 2020

fadimounir mentioned this issue Jan 31, 2020

Save stack arguments size in InlinedCallFrame.m_Datum on x86 to handle callee's popping of arguments #26834

Merged

jkotas added tenet-reliability Reliability/stability related issue (stress, load problems, etc.) and removed reliability labels Feb 4, 2020

fadimounir closed this as completed Mar 10, 2020

ghost locked as resolved and limited conversation to collaborators Dec 11, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tracing\eventpipe\providervalidation\providervalidation\providervalidation.cmd fails in gcstress3, Windows x86 #2230

tracing\eventpipe\providervalidation\providervalidation\providervalidation.cmd fails in gcstress3, Windows x86 #2230

BruceForstall commented Jan 27, 2020

BruceForstall commented Jan 27, 2020

josalem commented Jan 27, 2020

josalem commented Jan 28, 2020

BruceForstall commented Jan 28, 2020

josalem commented Jan 28, 2020

janvorli commented Jan 28, 2020

janvorli commented Jan 28, 2020

josalem commented Jan 28, 2020

janvorli commented Jan 29, 2020

jkotas commented Jan 29, 2020

jkotas commented Jan 29, 2020

janvorli commented Jan 29, 2020

jkotas commented Jan 29, 2020

fadimounir commented Jan 29, 2020

jkotas commented Jan 29, 2020

fadimounir commented Jan 29, 2020

jkotas commented Jan 29, 2020

fadimounir commented Jan 29, 2020

AriNuer commented Mar 3, 2020

BruceForstall commented Mar 7, 2020

fadimounir commented Mar 9, 2020 •

edited

Loading

tracing\eventpipe\providervalidation\providervalidation\providervalidation.cmd fails in gcstress3, Windows x86 #2230

tracing\eventpipe\providervalidation\providervalidation\providervalidation.cmd fails in gcstress3, Windows x86 #2230

Comments

BruceForstall commented Jan 27, 2020

BruceForstall commented Jan 27, 2020

josalem commented Jan 27, 2020

josalem commented Jan 28, 2020

BruceForstall commented Jan 28, 2020

josalem commented Jan 28, 2020

janvorli commented Jan 28, 2020

janvorli commented Jan 28, 2020

josalem commented Jan 28, 2020

janvorli commented Jan 29, 2020

jkotas commented Jan 29, 2020

jkotas commented Jan 29, 2020

janvorli commented Jan 29, 2020

jkotas commented Jan 29, 2020

fadimounir commented Jan 29, 2020

jkotas commented Jan 29, 2020

fadimounir commented Jan 29, 2020

jkotas commented Jan 29, 2020

fadimounir commented Jan 29, 2020

AriNuer commented Mar 3, 2020

BruceForstall commented Mar 7, 2020

fadimounir commented Mar 9, 2020 • edited Loading

fadimounir commented Mar 9, 2020 •

edited

Loading