Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tracing\eventpipe\providervalidation\providervalidation\providervalidation.cmd fails in gcstress3, Windows x86 #2230

Closed
BruceForstall opened this issue Jan 27, 2020 · 21 comments
Assignees
Labels
arch-x86 area-ReadyToRun-coreclr EventPipe GCStress tenet-reliability Reliability/stability related issue (stress, load problems, etc.)
Milestone

Comments

@BruceForstall
Copy link
Member

This test fails in the CI in the last 3 runs of Windows x86 GCStress=0x3:

\r\nAssert failure(PID 4216 [0x00001078], Thread: 7224 [0x1c38]): Consistency check failed: Invalid transition into managed code!\r\n\r\nWe're walking this thread's stack and we've reached a managed frame at Esp=0x00F7D758. (The method is Advapi32::EventWriteTransfer_PInvoke) The very next FS:0 record (0x00F7D760) up from this point on the stack should be one of our 'unmanaged to managed SEH handlers', but its not... its something else, and that's very bad. It indicates that someone managed to call into managed code without setting up the proper exception handling.\r\n\r\nGet a good unmanaged stack trace for this thread. All FS:0 records are on the stack, so you can see who installed the last handler. Somewhere between that function and where the thread is now is where the bad transition occurred.\r\n\r\nA little extra info: FS:0 = 0x00F7D370, pEHR->Handler = 0x72E381BD\r\nFAILED: IsUnmanagedToManagedSEHHandler(pEHR)\r\n\r\nCORECLR! CHECK::Trigger + 0x2FB (0x727675be)\r\nCORECLR! VerifyValidTransitionFromManagedCode + 0x165 (0x7283af20)\r\nCORECLR! StackFrameIterator::NextRaw + 0x887 (0x72806395)\r\nCORECLR! StackFrameIterator::Next + 0x46 (0x72805aca)\r\nCORECLR! Thread::StackWalkFramesEx + 0x180 (0x7280705c)\r\nCORECLR! Thread::StackWalkFrames + 0x159 (0x72806e5c)\r\nCORECLR! ScanStackRoots + 0x198 (0x72cbc639)\r\nCORECLR! GCToEEInterface::GcScanRoots + 0x104 (0x72cbb9b6)\r\nCORECLR! WKS::gc_heap::mark_phase + 0x1AD (0x72c71931)\r\nCORECLR! WKS::gc_heap::gc1 + 0x167 (0x72c6bb08)\r\n File: F:\workspace\_work\1\s\src\coreclr\src\vm\i386\excepx86.cpp Line: 329\r\n Image: C:\h\w\A86D08D2\p\CoreRun.exe\r\n\r\n\r\nReturn code: 1\r\nRaw output file: C:\h\w\A86D08D2\w\9D9C0906\e\tracing\eventpipe\Reports\tracing.eventpipe\providervalidation\providervalidation\providervalidation.output.txt\r\nRaw output:\r\nBEGIN EXECUTION\r\n "C:\h\w\A86D08D2\p\corerun.exe" providervalidation.dll \r\n 0.0s: ==TEST STARTING==\r\n 5.6s: Started sending sentinel events...\r\n 6.0s: Connecting to EventPipe...\r\n 10.5s: Connected to EventPipe with sessionID '0x7633a78'\r\n 10.5s: Creating EventPipeEventSource...\r\n 22.7s: EventPipeEventSource created\r\n 26.4s: Dynamic.All callback registered\r\n 26.4s: Starting stream processing...\r\n 33.4s: Saw new provider 'Microsoft-DotNETCore-SampleProfiler'\r\n 50.9s: Saw sentinel event\r\n 51.0s: Stopped sending sentinel events\r\n 51.3s: Starting event generating action...\r\n 66.5s: Fired MyEvent 0/100,000 times...\r\n 70.3s: Saw new provider 'MyEventSource'\r\n 74.1s: Fired MyEvent 10,000/100,000 times...\r\n 75.0s: Fired MyEvent 20,000/100,000 times...\r\nExpected: 100\r\nActual: -1073740286\r\nEND EXECUTION - FAILED\r\nFAILED\r\nTest Harness Exitcode is : 1\r\nTo run the test:\r\n> set CORE_ROOT=C:\h\w\A86D08D2\p\r\n> C:\h\w\A86D08D2\w\9D9C0906\e\tracing\eventpipe\providervalidation\providervalidation\providervalidation.cmd\r\nExpected: True\r\nActual: False

https://dev.azure.com/dnceng/public/_build/results?buildId=497294&view=ms.vss-test-web.build-test-results-tab&runId=15824298&resultId=110450&paneView=debug

@BruceForstall
Copy link
Member Author

@josalem

@josalem
Copy link
Contributor

josalem commented Jan 27, 2020

Taking a look today

@josalem
Copy link
Contributor

josalem commented Jan 28, 2020

Took a look at the dump that got generated and picked out a few things. I think the linked list of exception handlers might be corrupted. I took a look at 0x00F7D760 which points at the top level FS:0 entry and the address it has for a handler is 0x72e381bd. This seems to point to a seemingly random spot in the middle of jitted code (PAGE_EXECUTE_WRITECOPY owned by coreclr.dll) that doesn't resolve to a symbol (the closest symbol that windbg can resolve is 155,101 bytes away). The last managed->native transition happened in what I believe is generated P_Invoke code (Advapi32::EventWriteTransfer_PInvoke). When the GC walks back down the stack and hits this native->managed step, the validation check on the handler list fails this check:

BOOL IsUnmanagedToManagedSEHHandler(EXCEPTION_REGISTRATION_RECORD *pEstablisherFrame)
which is what causes this assertion to trigger.

I'm trying to run this test locally so I can figure out what is setting this top FS:0 entry into the list, but I can't reproduce the failure locally. I'll keep poking away at this.

@BruceForstall, any idea what might cause the exception handler list to get corrupted inside P_Invoke code? It looks like The P_Invoke code ended up in coreclr!JIT_PInvokeEndRarePath if that matters.

@BruceForstall
Copy link
Member Author

I don't have an easy idea for why this corruption would occur. My first suggestion would be to run with COMPlus_GCStress=0xF and see if it reliably repros in a different location. If that occurs, and the issue is within JIT'ed code, then perhaps that indicates a JIT compiler GC reporting issue/hole. If not, it's possible that some VM stub/etc. is incorrectly preserving GC objects.

@janvorli Any advice here?

@josalem
Copy link
Contributor

josalem commented Jan 28, 2020

Could this be related to #2215?

@janvorli
Copy link
Member

I'll take a look.

@janvorli
Copy link
Member

@josalem I cannot find the dump - could you please tell me where to get it?

@josalem
Copy link
Contributor

josalem commented Jan 28, 2020

It got uploaded as part of the Helix workitem. Here's a direct link.

The build artifacts can be downloaded from AzDO for symbols.

@janvorli
Copy link
Member

I have figured out what's going on. It is not a corruption. The check that fails is actually wrong in this state where we have just returned from a PInvoke and the stack walk was triggered from the JIT_PInvokeEndRarePath.

The problem is actually quite interesting and it has to do with the calling convention of the function that was pinvoked - namely the fact that the callee pops the arguments from stack. So here is what happens:

  • The managed code pinvokes the EventWriteTransfer_PInvoke, which is EventWriteTransfer in advapi32.dll, which in turn is just an import going to the ntdll!EventWriteTransfer. That function has quite a lot of parameters. The ret instruction at its end is ret 1Ch, so that it pops the arguments from the stack.
  • This is an inlined pinvoke and the sequence of operations is as follows:
    • Push the pinvoke parameters to the stack
    • Call PINVOKE_BEGIN helper that fills in an InlinedCallFrame on the stack. The interesting part here is that it sets the m_pCallSiteSP member to the current SP (which means SP with all the arguments already pushed)
    • Call the actual pinvoke, which at its return pops all the arguments from the stack
    • Call PINVOKE_END which ends up jumping to the JIT_PInvokeEndRarePath.
    • The JIT_PInvokeEndRarePath, as a C++ function, installs a record into the FS:0 chain. The entry is a local in the frame of JIT_PInvokeEndRarePath. The interesting part here is that this location is actually at an address that is larger the m_pCallSiteSP stored in the InlinedCallFrame. That's due to the fact that the arguments of the pinvoke were popped before.
  • Now the thread->RareDisablePreemptiveGC(); in the JIT_PInvokeEndRarePath triggers stack walk and when it hits the InlinedCallFrame, it uses the SP that it thinks the managed caller of the pinvoke has (m_pCallSiteSP). But that is no longer the current SP in the managed function due to the argument popping. Then it calls VerifyValidTransitionFromManagedCode and it looks for the first record in the fs:0 chain that is at address that's >= the SP. And it finds the one from JIT_PInvokeEndRarePath due to the fact that it stored the fs:0 chain record in stack slot that contained the pinvoke arguments before. And it asserts thinking that things got corrupted.

Now the question is why we were not hitting this before. I think that the likely reason could be that the JIT was not pushing the pinvoke arguments to the stack before calling the PINVOKE_BEGIN.
@BruceForstall is this possible?

cc: @jkotas

@jkotas
Copy link
Member

jkotas commented Jan 29, 2020

think that the likely reason could be that the JIT was not pushing the pinvoke arguments to the stack before calling the PINVOKE_BEGIN.

I do not think that this is the problem. The contract for m_pCallSiteSP always was that it ESP after pushing the arguments. It is documented in the doc - look for:

    // X86: ESP after pushing the outgoing arguments, and just before calling
    // out to unmanaged code.

I think the problem is https://github.com/dotnet/runtime/blob/master/src/coreclr/src/vm/i386/cgenx86.cpp#L617 in combination with dotnet/coreclr#22560 (comment) .

The ReadyToRun helper does not set the MethodDesc for the PInvoke frame. It means that the unwinder is not able to adjust the stack for popped arguments and we get this problem.

@jkotas
Copy link
Member

jkotas commented Jan 29, 2020

cc @fadimounir

@janvorli
Copy link
Member

@jkotas thank you, I've missed the comment on the m_pCallSiteSP.

@jkotas
Copy link
Member

jkotas commented Jan 29, 2020

Thoughts about how to fix this?

@jkotas jkotas added this to the 5.0 milestone Jan 29, 2020
@fadimounir
Copy link
Contributor

Can we make some changes to the JIT to make it pass the PInvoke MethodDesc to JIT_PInvokeBegin helper? The MethodDesc can be loaded by a R2R helper call

@jkotas
Copy link
Member

jkotas commented Jan 29, 2020

Yes, that may be one option. Another option is to make the JIT to pass the size of the stack-arguments to JIT_PInvokeBegin helper, on x86 only.

@fadimounir
Copy link
Contributor

#if !defined(_TARGET_64BIT_)
        // On 32-bit targets, indirect calls need the size of the stack args in InlinedCallFrame.m_Datum.
        const unsigned numStkArgBytes = call->fgArgInfo->GetNextSlotNum() * TARGET_POINTER_SIZE;

        src = comp->gtNewIconNode(numStkArgBytes, TYP_INT);
#else

Is that applicable also to arm32?

@jkotas
Copy link
Member

jkotas commented Jan 29, 2020

I do not see why this would be needed for anything but Windows x86.

@fadimounir fadimounir self-assigned this Jan 29, 2020
@fadimounir
Copy link
Contributor

I do not see why this would be needed for anything but Windows x86.

Interesting... probably legacy code before the time of arm/arm64 then. @BruceForstall

@josalem did you try to repro this with complus_gcstress=0xf? If so, did it start to fail deterministically?

@jkotas jkotas added tenet-reliability Reliability/stability related issue (stress, load problems, etc.) and removed reliability labels Feb 4, 2020
@AriNuer
Copy link

AriNuer commented Mar 3, 2020

Same test tracing\\eventpipe\\providervalidation\\providervalidation\\providervalidation.cmd failed here :runtime-coreclr crossgen2 outerloop 20200302.1

Details:
https://dev.azure.com/dnceng/public/_build/results?buildId=544222&view=ms.vss-test-web.build-test-results-tab&runId=17227512&resultId=107009&paneView=debug

Error message:
\r\nAssert failure(PID 3448 [0x00000d78], Thread: 552 [0x0228]): Consistency check failed: AV in clr at this callstack:\r\n------\r\nCORECLR! memcmp + 0x12 (0x00007ffb25268c72)\r\nCORECLR! Thread::VirtualUnwindCallFrame + 0x404 (0x00007ffb24a5c908)\r\nCORECLR! EECodeManager::EnsureCallerContextIsValid + 0x1D6 (0x00007ffb24d25542)\r\nCORECLR! StackFrameIterator::CheckForSkippedFrames + 0x6D (0x00007ffb24a56d79)\r\nCORECLR! StackFrameIterator::ProcessCurrentFrame + 0x178 (0x00007ffb24a5b588)\r\nCORECLR! StackFrameIterator::Init + 0x213 (0x00007ffb24a593af)\r\nCORECLR! Thread::StackWalkFramesEx + 0x155 (0x00007ffb24a5c195)\r\nCORECLR! Thread::HandledJITCase + 0x23F (0x00007ffb24af456b)\r\nCORECLR! ThreadSuspend::SuspendRuntime + 0x444 (0x00007ffb24af9edc)\r\nCORECLR! ThreadSuspend::SuspendEE + 0x2D7 (0x00007ffb24af97bb)\r\nCORECLR! SampleProfiler::ThreadProc + 0x272 (0x00007ffb25036652)\r\nKERNEL32! BaseThreadInitThunk + 0x14 (0x00007ffb52d284d4)\r\nNTDLL! RtlUserThreadStart + 0x21 (0x00007ffb5316e8b1)\r\n-----\r\n.AV on tid=0x228 (552), cxr=00000049D5FFD040, exr=00000049D5FFD530\r\nFAILED: false\r\n\r\nCORECLR! CHECK::Trigger + 0x2CC (0x00007ffb2494c654)\r\nCORECLR! CLRVectoredExceptionHandlerPhase3 + 0x364 (0x00007ffb24ab6e80)\r\nCORECLR! CLRVectoredExceptionHandlerPhase2 + 0x69 (0x00007ffb24ab6901)\r\nCORECLR! CLRVectoredExceptionHandler + 0x1C6 (0x00007ffb24ab687a)\r\nCORECLR! CLRVectoredExceptionHandlerShim + 0x17A (0x00007ffb24ab70da)\r\nNTDLL! RtlInitializeCriticalSection + 0x190 (0x00007ffb53165ef0)\r\nNTDLL! RtlImageNtHeaderEx + 0x17B (0x00007ffb5313fa7b)\r\nNTDLL! KiUserExceptionDispatcher + 0x3A (0x00007ffb531a964a)\r\nCORECLR! memcmp + 0x12 (0x00007ffb25268c72)\r\nCORECLR! Thread::VirtualUnwindCallFrame + 0x404 (0x00007ffb24a5c908)\r\n File: F:\workspace_work\1\s\src\coreclr\src\vm\excep.cpp Line: 7558\r\n Image: C:\h\w\A9460959\p\CoreRun.exe\r\n\r\n\r\nReturn code: 1\r\nRaw output file: C:\h\w\A9460959\w\A17608F1\e\tracing\eventpipe\Reports\tracing.eventpipe\providervalidation\providervalidation\providervalidation.output.txt\r\nRaw output:\r\nBEGIN EXECUTION\r\n 1 file(s) copied.\r\n" "C:\h\w\A9460959\p\crossgen2\crossgen2" C:\h\w\A9460959\w\A17608F1\e\tracing\eventpipe\providervalidation\providervalidation\providervalidation.org -o:C:\h\w\A9460959\w\A17608F1\e\tracing\eventpipe\providervalidation\providervalidation\providervalidation.dll --targetarch:x64 -O --inputbubble -r:"C:\h\w\A9460959\p\System..dll" -r:"C:\h\w\A9460959\p\Microsoft..dll" -r:"C:\h\w\A9460959\p\mscorlib.dll" -r:C:\h\w\A9460959\w\A17608F1\e\tracing\eventpipe\providervalidation\providervalidation*.dll"\r\nEmitting R2R PE file: C:\h\w\A9460959\w\A17608F1\e\tracing\eventpipe\providervalidation\providervalidation\providervalidation.dll\r\n "C:\h\w\A9460959\p\corerun.exe" providervalidation.dll \r\n 0.0s: ==TEST STARTING==\r\n 0.6s: Started sending sentinel events...\r\n 0.6s: Connecting to EventPipe...\r\n 0.8s: Connected to EventPipe with sessionID '0x21aa213c650'\r\n 0.8s: Creating EventPipeEventSource...\r\nExpected: 100\r\nActual: -1073740286\r\nEND EXECUTION - FAILED\r\nFAILED\r\nTest Harness Exitcode is : 1\r\nTo run the test:\r\n> set CORE_ROOT=C:\h\w\A9460959\p\r\n> C:\h\w\A9460959\w\A17608F1\e\tracing\eventpipe\providervalidation\providervalidation\providervalidation.cmd\r\nExpected: True\r\nActual: False
`

Stack trace:
at tracing_eventpipe._providervalidation_providervalidation_providervalidation_._providervalidation_providervalidation_providervalidation_cmd() in F:\workspace\_work\1\s\artifacts\tests\coreclr\Windows_NT.x64.Checked\TestWrappers\tracing.eventpipe\tracing.eventpipe.XUnitWrapper.cs:line 528

@BruceForstall
Copy link
Member Author

@fadimounir Looks like #26834 fixed this? Can we close this?

The "new failure" above is crossgen2, and is unrelated.

@fadimounir
Copy link
Contributor

fadimounir commented Mar 9, 2020

@BruceForstall The fix has been reverted due to #31809.
@jkotas and I couldn't repro the second issue, so right now I have a new PR (#33249), and running CI testing for a couple of loops to see if some tests will fail again due to the fix (so far, nothing failed)

I'll close the issue once I merge the new PR

@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-x86 area-ReadyToRun-coreclr EventPipe GCStress tenet-reliability Reliability/stability related issue (stress, load problems, etc.)
Projects
None yet
Development

No branches or pull requests

8 participants