Test failure: NullReferenceException in Internal.JitInterface.CorInfoImpl._beginInlining #105441

Closed
v-wenyuxu opened this issue Jul 25, 2024 · 87 comments · Fixed by #105551 or #105832
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI blocking-outerloop Blocking the 'runtime-coreclr outerloop' and 'runtime-libraries-coreclr outerloop' runs blocking-release in-pr There is an active PR which will close this issue when it is merged Known Build Error Use this to report build issues in the .NET Helix tab os-linux Linux OS (any supported distro)

Comments

@v-wenyuxu

v-wenyuxu commented Jul 25, 2024

Failed in: runtime-coreclr outerloop 20240724.3

Failed tests:

R2R-CG2 linux arm64 Checked @ (Alpine.317.Arm64.Open)[email protected]/dotnet-buildtools/prereqs:alpine-3.17-helix-arm64v8
    - Regressions/coreclr/GitHub_87879/test87879/test87879.cmd

Error message:

 /root/helix/work/workitem/e/Regressions/Regressions/../coreclr/GitHub_87879/test87879/test87879.sh: line 315: -r:/root/helix/work/workitem/e/Regressions/coreclr/GitHub_87879/test87879/IL-CG2/*.dll: No such file or directory
Unhandled exception. ILCompiler.CodeGenerationFailedException: Code generation failed for method '[test87879]__GeneratedMainWrapper..ctor()'
 ---> System.NullReferenceException: Object reference not set to an instance of an object.
   at Internal.JitInterface.CorInfoImpl._beginInlining(IntPtr thisHandle, IntPtr* ppException, CORINFO_METHOD_STRUCT_* inlinerHnd, CORINFO_METHOD_STRUCT_* inlineeHnd) in /_/src/coreclr/tools/Common/JitInterface/CorInfoImpl_generated.cs:line 154
   --- End of inner exception stack trace ---
   at Internal.JitInterface.CorInfoImpl.CompileMethodInternal(IMethodNode methodCodeNodeNeedingCode, MethodIL methodIL) in /_/src/coreclr/tools/Common/JitInterface/CorInfoImpl.cs:line 381
   at Internal.JitInterface.CorInfoImpl.CompileMethod(MethodWithGCInfo methodCodeNodeNeedingCode, Logger logger) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/JitInterface/CorInfoImpl.ReadyToRun.cs:line 810
   at ILCompiler.ReadyToRunCodegenCompilation.<>c__DisplayClass50_0.<ComputeDependencyNodeDependencies>g__CompileOneMethod|5(DependencyNodeCore`1 dependency, Int32 compileThreadId) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/ReadyToRunCodegenCompilation.cs:line 898
   at ILCompiler.ReadyToRunCodegenCompilation.<>c__DisplayClass50_0.<ComputeDependencyNodeDependencies>g__CompileOnThread|4(Int32 compilationThreadId) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/ReadyToRunCodegenCompilation.cs:line 833
   at ILCompiler.ReadyToRunCodegenCompilation.<>c__DisplayClass50_0.<ComputeDependencyNodeDependencies>g__CompilationThread|3(Object objThreadId) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/ReadyToRunCodegenCompilation.cs:line 811
Unhandled exception. System.ArgumentNullException: Value cannot be null. (Parameter 'array')
   at System.Array.Clear(Array array, Int32 index, Int32 length)
   at Internal.JitInterface.CorInfoImpl.CompileMethodCleanup() in /_/src/coreclr/tools/Common/JitInterface/CorInfoImpl.cs:line 700
   at Internal.JitInterface.CorInfoImpl.CompileMethod(MethodWithGCInfo methodCodeNodeNeedingCode, Logger logger) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/JitInterface/CorInfoImpl.ReadyToRun.cs:line 826
   at ILCompiler.ReadyToRunCodegenCompilation.<>c__DisplayClass50_0.<ComputeDependencyNodeDependencies>g__CompileOneMethod|5(DependencyNodeCore`1 dependency, Int32 compileThreadId) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/ReadyToRunCodegenCompilation.cs:line 898
   at ILCompiler.ReadyToRunCodegenCompilation.<>c__DisplayClass50_0.<ComputeDependencyNodeDependencies>g__CompileOnThread|4(Int32 compilationThreadId) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/ReadyToRunCodegenCompilation.cs:line 833
   at ILCompiler.ReadyToRunCodegenCompilation.<>c__DisplayClass50_0.<ComputeDependencyNodeDependencies>g__CompilationThread|3(Object objThreadId) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/ReadyToRunCodegenCompilation.cs:line 811
/root/helix/work/workitem/e/Regressions/Regressions/../coreclr/GitHub_87879/test87879/test87879.sh: line 254:  4711 Aborted                 (core dumped) $__Command

Return code:      1
Raw output file:      /root/helix/work/workitem/uploads/coreclr/GitHub_87879/test87879/output.txt
Raw output:
BEGIN EXECUTION
in takeLock
/root/helix/work/workitem/e/Regressions/coreclr/GitHub_87879/test87879/IL-CG2/test87879.dll
19:50:09
Response file: /root/helix/work/workitem/e/Regressions/coreclr/GitHub_87879/test87879/test87879.dll.rsp
/root/helix/work/workitem/e/Regressions/coreclr/GitHub_87879/test87879/IL-CG2/test87879.dll
-o:/root/helix/work/workitem/e/Regressions/coreclr/GitHub_87879/test87879/test87879.dll
-r:/root/helix/work/correlation/System.*.dll
-r:/root/helix/work/correlation/Microsoft.*.

Stack trace:

   at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String assemblyPath, String testPathPrefix)
   at Program.<<Main>$>g__TestExecutor74|0_75(StreamWriter tempLogSw, StreamWriter statsCsvSw, <>c__DisplayClass0_0&)

Known Issue Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": ["NullReferenceException","_beginInlining"],
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=753860
Error message validated: [NullReferenceException _beginInlining]
Result validation: ✅ Known issue matched with the provided build.
Validation performed at: 7/27/2024 1:53:56 AM UTC

Report

| Build | Definition | Step Name | Console log | Pull Request |
|---|---|---|---|---|
| 2582592 | dotnet-runtime | Build managed CoreCLR and host components, all libraries, and packs | Log | |
| 2580533 | dotnet-runtime | Build managed CoreCLR and host components, all libraries, and packs | Log | |
| 2578580 | dotnet-runtime | Build managed CoreCLR and host components, all libraries, and packs | Log | |
| 2575423 | dotnet-runtime | Build managed CoreCLR and host components, all libraries, and packs | Log | |

Summary

| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
|---|---|---|
| 0 | 0 | 4 |

@jkotas
Member

jkotas commented Jul 25, 2024

@steveisok We have picked up an intermittent crossgen crash on arm64 in the _beginInlining method. It is causing a lot of noise (see duplicate issues). Could you please take a look?

@jkotas
Member

jkotas commented Jul 25, 2024

I think it was likely introduced by the #104696 toolset update, so it is not easy to roll back.

@steveisok
Member

> @steveisok We have picked up an intermittent crossgen crash on arm64 in the _beginInlining method. It is causing a lot of noise (see duplicate issues). Could you please take a look?

Yep, I can have someone take a closer look.

@v-wenyuxu
Author

Failed in: runtime-coreclr outerloop 20240725.5

Failed tests:

R2R-CG2 linux arm64 Checked @ (Ubuntu.2004.Arm64.Open)[email protected]/dotnet-buildtools/prereqs:ubuntu-20.04-helix-arm64v8
    - reflection/SetValue/TrySetReadonlyStaticField/TrySetReadonlyStaticField.cmd

Error message:

 /root/helix/work/workitem/e/reflection/reflection/../SetValue/TrySetReadonlyStaticField/TrySetReadonlyStaticField.sh: line 315: -r:/root/helix/work/workitem/e/reflection/SetValue/TrySetReadonlyStaticField/IL-CG2/*.dll: No such file or directory
Unhandled exception. ILCompiler.CodeGenerationFailedException: Code generation failed for method '[TrySetReadonlyStaticField]X.Set(string,bool)'
 ---> System.NullReferenceException: Object reference not set to an instance of an object.
   at Internal.JitInterface.CorInfoImpl._beginInlining(IntPtr thisHandle, IntPtr* ppException, CORINFO_METHOD_STRUCT_* inlinerHnd, CORINFO_METHOD_STRUCT_* inlineeHnd) in /_/src/coreclr/tools/Common/JitInterface/CorInfoImpl_generated.cs:line 154
   --- End of inner exception stack trace ---
   at Internal.JitInterface.CorInfoImpl.CompileMethodInternal(IMethodNode methodCodeNodeNeedingCode, MethodIL methodIL) in /_/src/coreclr/tools/Common/JitInterface/CorInfoImpl.cs:line 381
   at Internal.JitInterface.CorInfoImpl.CompileMethod(MethodWithGCInfo methodCodeNodeNeedingCode, Logger logger) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/JitInterface/CorInfoImpl.ReadyToRun.cs:line 810
   at ILCompiler.ReadyToRunCodegenCompilation.<>c__DisplayClass50_0.<ComputeDependencyNodeDependencies>g__CompileOneMethod|5(DependencyNodeCore`1 dependency, Int32 compileThreadId) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/ReadyToRunCodegenCompilation.cs:line 898
   at ILCompiler.ReadyToRunCodegenCompilation.<>c__DisplayClass50_0.<ComputeDependencyNodeDependencies>g__CompileOnThread|4(Int32 compilationThreadId) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/ReadyToRunCodegenCompilation.cs:line 833
   at ILCompiler.ReadyToRunCodegenCompilation.<>c__DisplayClass50_0.<ComputeDependencyNodeDependencies>g__CompilationThread|3(Object objThreadId) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/ReadyToRunCodegenCompilation.cs:line 811
Unhandled exception. System.ArgumentNullException: Value cannot be null. (Parameter 'array')
   at System.Array.Clear(Array array, Int32 index, Int32 length)
   at Internal.JitInterface.CorInfoImpl.CompileMethodCleanup() in /_/src/coreclr/tools/Common/JitInterface/CorInfoImpl.cs:line 700
   at Internal.JitInterface.CorInfoImpl.CompileMethod(MethodWithGCInfo methodCodeNodeNeedingCode, Logger logger) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/JitInterface/CorInfoImpl.ReadyToRun.cs:line 826
   at ILCompiler.ReadyToRunCodegenCompilation.<>c__DisplayClass50_0.<ComputeDependencyNodeDependencies>g__CompileOneMethod|5(DependencyNodeCore`1 dependency, Int32 compileThreadId) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/ReadyToRunCodegenCompilation.cs:line 898
   at ILCompiler.ReadyToRunCodegenCompilation.<>c__DisplayClass50_0.<ComputeDependencyNodeDependencies>g__CompileOnThread|4(Int32 compilationThreadId) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/ReadyToRunCodegenCompilation.cs:line 833
   at ILCompiler.ReadyToRunCodegenCompilation.<>c__DisplayClass50_0.<ComputeDependencyNodeDependencies>g__CompilationThread|3(Object objThreadId) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/ReadyToRunCodegenCompilation.cs:line 811
/root/helix/work/workitem/e/reflection/reflection/../SetValue/TrySetReadonlyStaticField/TrySetReadonlyStaticField.sh: line 254:   954 Aborted                 (core dumped) $__Command

Return code:      1
Raw output file:      /root/helix/work/workitem/uploads/SetValue/TrySetReadonlyStaticField/output.txt
Raw output:
BEGIN EXECUTION
in takeLock
/root/helix/work/workitem/e/reflection/SetValue/TrySetReadonlyStaticField/IL-CG2/TrySetReadonlyStaticField.dll
20:03:02
Response file: /root/helix/work/workitem/e/reflection/SetValue/TrySetReadonlyStaticField/TrySetReadonlyStaticField.dll.rsp
/root/helix/work/workitem/e/reflection/SetValue/TrySetReadonlyStaticField/IL-CG2/TrySetReadonlyStaticField.dll
-o:/root/helix/work/workitem/e/reflection/SetValue/TrySetReadonlyStaticFi

Stack trace:

   at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String assemblyPath, String testPathPrefix)
   at Program.<<Main>$>g__TestExecutor13|0_14(StreamWriter tempLogSw, StreamWriter statsCsvSw, <>c__DisplayClass0_0&)

@jakobbotsch
Member

This might be the same issue as #102370 and #104123.

jakobbotsch added a commit to jakobbotsch/runtime that referenced this issue Jul 26, 2024
…opy with write barrier calls

When the JIT generates code for a tailcall it must generate code to
write the arguments into the incoming parameter area. Since the GC ness
of the arguments of the tailcall may not match the GC ness of the
parameters, we have to disable GC before we start writing these. This is
done by finding the earliest `GT_PUTARG_STK` node and placing the start
of the NOGC region right before it.

In addition, there is logic to take care of potential overlap between
the arguments and parameters. For example, if the call has an operand
that uses one of the parameters, then we must take care that we do not
overwrite that parameter with the tailcall argument before the use of it.
To do so, we sometimes may need to introduce copies from the parameter
locals to locals on the stack frame.

This used to work fine, however, with dotnet#101761 we started transforming
block copies into managed calls in certain scenarios. It was possible
for the JIT to decide to introduce a copy to a local and for this
transformation to then kick in. This would cause us to end up with the
managed helper call after starting the nogc region. In checked builds
this would hit an assert during GC scan; in release builds, it would end
up with corrupted data.

The fix here is to make sure we insert the `GT_START_NOGC` after all the
potential temporary copies we may introduce as part of the tailcall
logic.

There was an additional assumption that the first `PUTARG_STK` operand
was the earliest one in execution order. That is not guaranteed, so this
change stops relying on that as well by introducing a new
`LIR::FirstNode` and using that to determine the earliest `PUTARG_STK`
node.

Fix dotnet#102370
Fix dotnet#104123
Fix dotnet#105441
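
The commit message above relies on finding the earliest `PUTARG_STK` node in execution order rather than assuming the first operand is earliest. As a rough illustration only, the idea can be sketched over a singly linked list of nodes kept in execution order. The `Node` type and both functions below are hypothetical stand-ins written for this note; the real `LIR::FirstNode` in the JIT may be implemented quite differently.

```cpp
#include <cassert>
#include <vector>

// Hypothetical node type: nodes are linked in execution order via `next`.
struct Node {
    Node* next;
    int   id;
};

// Returns whichever of `a` and `b` executes first, assuming both nodes
// belong to the same list. Walks forward from `a`; if `b` is reached,
// then `a` precedes `b`, otherwise `b` must precede `a`.
Node* FirstNode(Node* a, Node* b) {
    assert(a != nullptr && b != nullptr);
    for (Node* n = a; n != nullptr; n = n->next) {
        if (n == b) {
            return a;
        }
    }
    return b;
}

// Folds FirstNode over a non-empty set of candidates (for example, all the
// stack-argument nodes of one call) to find the earliest one.
Node* EarliestOf(const std::vector<Node*>& candidates) {
    assert(!candidates.empty());
    Node* earliest = candidates.front();
    for (Node* c : candidates) {
        earliest = FirstNode(earliest, c);
    }
    return earliest;
}
```

In this sketch the start of the no-GC region would be placed before the node returned by `EarliestOf`, and only after any temporary copies have been inserted.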
github-actions bot pushed a commit that referenced this issue Jul 26, 2024
@dotnet-policy-service dotnet-policy-service bot removed the untriaged New issue has not been triaged by the area owner label Jul 26, 2024
@jakobbotsch jakobbotsch reopened this Jul 26, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jul 26, 2024
@mangod9
Member

mangod9 commented Sep 24, 2024

Oh yeah, it's that one. I thought it had missed RC1, but maybe not.

@AndyAyersMS
Member

This (or something that looks like this) is still happening post RC1: https://dev.azure.com/dnceng-public/public/_build/results?buildId=816197&view=ms.vss-test-web.build-test-results-tab

Running CrossGen2:  dotnet /root/helix/work/correlation/crossgen2/crossgen2.dll @/root/helix/work/workitem/e/JIT/Regression/Regression_6/b37608.dll.rsp  
Unhandled exception. ILCompiler.CodeGenerationFailedException: Code generation failed for method '[b37608]Test.AA.TestEntryPoint()'
 ---> System.NullReferenceException: Object reference not set to an instance of an object.
   at Internal.JitInterface.CorInfoImpl._beginInlining(IntPtr thisHandle, IntPtr* ppException, CORINFO_METHOD_STRUCT_* inlinerHnd, CORINFO_METHOD_STRUCT_* inlineeHnd) in /_/src/coreclr/tools/Common/JitInterface/CorInfoImpl_generated.cs:line 154
   --- End of inner exception stack trace ---

Have not been able to make much headway with crash dumps yet, at least with LLDB.

Also downloaded the above and am trying to get a local repro on WSL2/Volterra like I did with the earlier failure.

@AndyAyersMS AndyAyersMS self-assigned this Sep 26, 2024
@AndyAyersMS
Member

AndyAyersMS commented Sep 26, 2024

I got a local repro after 95 runs of the full suite (~8 hours) but oddly the process is hung in SignalHandlerLoop. At any rate:

Running CrossGen2:  dotnet /home/andya/bugs/arm64-crossgen-crash2/Regression_6/correlation-payload/crossgen2/crossgen2.dll @/home/andya/bugs/arm64-crossgen-crash2/Regression_6/workitems/Regression_6/JIT/Regression/Regression_6/b40380.dll.rsp  
To repro, add following arguments to the command line:
--singlemethodtypename "ILGEN_0x37ae0554,b40380" --singlemethodname "Main" --singlemethodindex 1
Unhandled exception: System.ArgumentNullException: Value cannot be null. (Parameter 'array')
   at System.Array.Clear(Array array, Int32 index, Int32 length)
   at Internal.JitInterface.CorInfoImpl.CompileMethodCleanup() in /_/src/coreclr/tools/Common/JitInterface/CorInfoImpl.cs:line 700
   at Internal.JitInterface.CorInfoImpl.CompileMethod(MethodWithGCInfo methodCodeNodeNeedingCode, Logger logger) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/JitInterface/CorInfoImpl.ReadyToRun.cs:line 826
   at ILCompiler.ReadyToRunCodegenCompilation.<>c__DisplayClass50_0.<ComputeDependencyNodeDependencies>g__CompileOneMethod|5(DependencyNodeCore`1 dependency, Int32 compileThreadId) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/ReadyToRunCodegenCompilation.cs:line 898
   at ILCompiler.ReadyToRunCodegenCompilation.<>c__DisplayClass50_0.<ComputeDependencyNodeDependencies>g__CompileOnThread|4(Int32 compilationThreadId) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/ReadyToRunCodegenCompilation.cs:line 833
   at ILCompiler.ReadyToRunCodegenCompilation.<>c__DisplayClass50_0.<ComputeDependencyNodeDependencies>g__CompileMethodList|2(IEnumerable`1 methodList) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/ReadyToRunCodegenCompilation.cs:line 793
   at ILCompiler.ReadyToRunCodegenCompilation.ComputeDependencyNodeDependencies(List`1 obj) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/ReadyToRunCodegenCompilation.cs:line 717
   at ILCompiler.DependencyAnalysisFramework.DependencyAnalyzer`2.ComputeMarkedNodes() in /_/src/coreclr/tools/aot/ILCompiler.DependencyAnalysisFramework/DependencyAnalyzer.cs:line 316
   at ILCompiler.ReadyToRunCodegenCompilation.Compile(String outputFile) in /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/ReadyToRunCodegenCompilation.cs:line 387
   at ILCompiler.Program.RunSingleCompilation(Dictionary`2 inFilePaths, InstructionSetSupport instructionSetSupport, String compositeRootPath, Dictionary`2 unrootedInputFilePaths, HashSet`1 versionBubbleModulesHash, ReadyToRunCompilerContext typeSystemContext, Logger logger) in /_/src/coreclr/tools/aot/crossgen2/Program.cs:line 637
   at ILCompiler.Program.Run() in /_/src/coreclr/tools/aot/crossgen2/Program.cs:line 302
   at ILCompiler.Crossgen2RootCommand.<>c__DisplayClass205_0.<.ctor>b__0(ParseResult result) in /_/src/coreclr/tools/aot/crossgen2/Crossgen2RootCommand.cs:line 261
   at System.CommandLine.Invocation.InvocationPipeline.Invoke(ParseResult parseResult)

;; innermost exception

(lldb) pe 0000FF5818983580
Exception object: 0000ff5818983580
Exception type:   System.NullReferenceException
Message:          Object reference not set to an instance of an object.
InnerException:   <none>
StackTrace (generated):
    SP               IP               Function
    0000FFFFC7E75200 0000FF981EB58C70 ILCompiler.ReadyToRun.dll!Internal.JitInterface.CorInfoImpl._beginInlining(IntPtr, IntPtr*, Internal.JitInterface.CORINFO_METHOD_STRUCT_*, Internal.JitInterface.CORINFO_METHOD_STRUCT_*)+0x90

This is the same issue with the stack's internal array being null. And we are now using ldapr to test the state of the init flag:

/_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/JitInterface/CorInfoImpl.ReadyToRun.cs @ 3176:
0000ff981eb4bb58 80a20c91             add     x0, x20, #0x328
0000ff981eb4bb5c b1d8bb97             bl      0xff981da41e20
0000ff981eb4bb60 f50300aa             mov     x21, x0
0000ff981eb4bb64 0e4e94d2             mov     x14, #0xa270
0000ff981eb4bb68 ced8a3f2             movk    x14, #0x1ec6, lsl #16
0000ff981eb4bb6c 0ef3dff2             movk    x14, #0xff98, lsl #32
0000ff981eb4bb70 cec1bfb8             ldapr   w14, [x14]
0000ff981eb4bb74 ee160036             tbz     w14, #0x0, 0xff981eb4be50

(lldb) dumpobj 0000ff5818982850
Name:        System.Collections.Generic.Stack`1[[System.Collections.Generic.HashSet`1[[Internal.TypeSystem.MethodDesc, ILCompiler.TypeSystem]], System.Private.CoreLib]]
MethodTable: 0000ff981ec3fc28
EEClass:     0000ff981e3d2678
Tracked Type: false
Size:        32(0x20) bytes
File:        /home/andya/bugs/arm64-crossgen-crash2/Regression_6/correlation-payload/dotnet-cli/shared/Microsoft.NETCore.App/9.0.0-rc.1.24431.7/System.Collections.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
0000ff981dbe1ee0  40000a0        8     System.__Canon[]  0 instance 0000000000000000 _array
0000ff981d9d7160  40000a1       10         System.Int32  1 instance                0 _size
0000ff981d9d7160  40000a2       14         System.Int32  1 instance                0 _version

@EgorBo
Member

EgorBo commented Sep 26, 2024

I wonder whether the issue still repros under DOTNET_EnableArm64Rcpc=0 DOTNET_EnableArm64Rcpc2=0 (in that case, we will emit ldar instead of ldapr).

@EgorBo
Member

EgorBo commented Sep 26, 2024

A stable minimal repro on macos-arm64 (Apple M2 Max): https://gist.github.com/EgorBo/c636520798d2c2d79969cf5299d53c79

dotnet build -c Release && while true; do bin/Release/net9.0/myapp; done

on RC1 (fails on Main's corerun too). Reproduces with JitMinOpts too, so it means the issue is definitely not in the static cctor inlining in JIT

@AndyAyersMS
Member

@EgorBo's repro also fails on the Volterra under WSL2, pretty much every run. Seems like the failure is with statics for the shared generic case?

@EgorBo
Member

EgorBo commented Sep 26, 2024

I have confirmed locally that the regression appeared in eb8f54d, doesn't reproduce before this commit.

@AndyAyersMS
Member

AndyAyersMS commented Sep 26, 2024

@davidwrighton we have a simplified repro (see just above); it seems related to the class statics change #99183. Still digging into exactly how it fails. It seems to be arm64 linux/osx only and not related to jit codegen. Perhaps you can spot this faster than we can?

@EgorBo
Member

EgorBo commented Sep 26, 2024

Not sure if it's the issue, but the repro stops reproducing if I remove WithoutBarrier suffix from these:

bool GetIsInitedAndGCStaticsPointerIfInited(PTR_OBJECTREF *ptrResult)
{
    TADDR staticsVal = VolatileLoadWithoutBarrier(&m_pGCStatics);
    *ptrResult = dac_cast<PTR_OBJECTREF>(staticsVal);
    return !(staticsVal & ISCLASSNOTINITED);
}

bool GetIsInitedAndNonGCStaticsPointerIfInited(PTR_BYTE *ptrResult)
{
    TADDR staticsVal = VolatileLoadWithoutBarrier(&m_pNonGCStatics);
    *ptrResult = dac_cast<PTR_BYTE>(staticsVal);
    return !(staticsVal & ISCLASSNOTINITED);
}

@AndyAyersMS
Member

AndyAyersMS commented Sep 26, 2024

Repros if the runtime dynamic LSE detection is disabled. So not LSE related evidently.

Also repros with checked runtime.

Also fails on windows arm64.

@AndyAyersMS
Member

Say two threads try to initialize a class at nearly the same time.

Thread 1 sees the class is not initialized, kicks off initialization, allocates the static data, invokes the .cctor, then calls SetClassInited() on the method table. This calls SetClassInited on the dynamic statics, which makes the dynamic static address available, and then calls SetClassInited on the method table level flag, which does a load acquire / store release.

Thread 2 is lagging just a bit behind. It calls JIT_GetGCStaticBase_Portable, which calls GetIsInitedAndGCStaticsPointerIfInited, and this can return true before the SetClassInited() call on the method table finishes, so there is no store release yet, and thread 2 can read the uninitialized static field.

@jkotas
Member

jkotas commented Sep 27, 2024

SetClassInited on dynamic statics is done using InterlockedCompareExchange, which should be a full store release barrier:

oldValFromInterlockedOp = InterlockedCompareExchangeT(pAddr, oldVal & STATICSPOINTERMASK, oldVal);

I think we are missing a barrier on the reader side, as @EgorBo pointed out in #105441 (comment)
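
The writer/reader pairing discussed here can be sketched in portable C++. This is a minimal model under stated assumptions, not the runtime's code: `StaticsSlot`, `kClassNotInited`, and `kStaticsPointerMask` are hypothetical names standing in for the statics pointer with its `ISCLASSNOTINITED` tag bit, the compare-exchange stands in for `InterlockedCompareExchangeT`, and the two readers stand in for `VolatileLoadWithoutBarrier` versus a load with acquire semantics.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical constants: the low bit tags "class not yet initialized".
constexpr std::uintptr_t kClassNotInited     = 0x1;
constexpr std::uintptr_t kStaticsPointerMask = ~kClassNotInited;

class StaticsSlot {
public:
    // The storage address must leave the low bit free for the tag.
    explicit StaticsSlot(std::uintptr_t storage)
        : value_(storage | kClassNotInited) {
        assert((storage & kClassNotInited) == 0);
    }

    // Writer side: clear the tag bit with a sequentially consistent
    // compare-exchange, so writes made by the initializer are ordered
    // before the "inited" state becomes visible.
    void SetClassInited() {
        std::uintptr_t old = value_.load(std::memory_order_relaxed);
        while (!value_.compare_exchange_weak(old, old & kStaticsPointerMask)) {
            // `old` was refreshed by the failed exchange; retry.
        }
    }

    // Reader without a barrier: on a weakly ordered CPU, reads that follow
    // this call are not ordered after the flag check.
    bool TryGetRelaxed(std::uintptr_t* result) const {
        std::uintptr_t v = value_.load(std::memory_order_relaxed);
        *result = v;  // only meaningful when the function returns true
        return (v & kClassNotInited) == 0;
    }

    // Reader with acquire semantics: pairs with the writer so that later
    // reads observe everything written before SetClassInited().
    bool TryGetAcquire(std::uintptr_t* result) const {
        std::uintptr_t v = value_.load(std::memory_order_acquire);
        *result = v;  // only meaningful when the function returns true
        return (v & kClassNotInited) == 0;
    }

private:
    std::atomic<std::uintptr_t> value_;
};
```

Both readers return the same values in a single-threaded run; the difference only shows up as a reordering hazard when another thread is publishing concurrently on hardware with a weak memory model such as arm64.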

@AndyAyersMS
Member

Ah, ok. What's (a little) surprising is that even in Tier0 the JIT is baking the static address into codegen.

;; Tier0
;;          EmptyArray<MyClass998>.Value.Length == 0

            movz    x0, #104
            movk    x0, #0xCE30 LSL #16
            movk    x0, #0xFFAE LSL #32
            bl      CORINFO_HELP_GET_GCSTATIC_BASE
            movz    x0, #0x2A88      // data for EmptyArray`1[Program+MyClass998]:Value
            movk    x0, #0xB800 LSL #16
            movk    x0, #0xFF6E LSL #32
            ldr     x0, [x0]          // .Value
            ldr     w0, [x0, #0x08]   // .Length
            cbnz    w0, G_M9030_IG68

So the helper is just called for effect and the static field load afterwards is not data dependent on it.

Seems like Tier0 code would be more compact if we didn't do this.

@davidwrighton
Member

Ah, that's the problem. This really needs to either use the result of the helper, or have a barrier inserted. We could insert a barrier by adding a barrier to the code of the static base helper, or into the generated code stream any time after the call to GET_GCSTATIC_BASE. Any one of these 3 solutions would work. We could probably get the JIT to do the right thing with a simple change to the jit interface, where we simply refuse to provide a pre-computed address for a static without requiring that the class be initialized. I'll put together a proposed fix with that approach, so we can see what it looks like.
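
The options listed in this comment can be modeled in portable C++ as a rough sketch. All names here (`Statics`, `g_storage`, `g_published`, and the helper functions) are hypothetical and chosen for illustration; in particular, the real static base helper triggers class initialization instead of returning null, and the JIT emits machine code, not C++.

```cpp
#include <atomic>
#include <cassert>

struct Statics { int value; };

Statics g_storage{0};                        // address known ahead of time ("baked in")
std::atomic<Statics*> g_published{nullptr};  // set once initialization completes

// Writer: run the initializer, then publish with release semantics.
void InitializeClass() {
    g_storage.value = 42;
    g_published.store(&g_storage, std::memory_order_release);
}

// Helper with no barrier: models the problematic reader.
Statics* HelperNoBarrier() {
    return g_published.load(std::memory_order_relaxed);
}

// Option: put the barrier inside the helper itself.
Statics* HelperWithBarrier() {
    return g_published.load(std::memory_order_acquire);
}

// Option: use the helper's result as the base address, so the field read
// depends on the value the helper returned.
int ReadViaHelperResult() {
    Statics* base = HelperWithBarrier();
    return base != nullptr ? base->value : -1;
}

// Option: keep calling the helper "for effect" and reading through the
// baked-in address, but insert an acquire fence after the call.
int ReadViaBakedAddressWithFence() {
    Statics* base = HelperNoBarrier();
    if (base == nullptr) {
        return -1;  // not initialized yet in this simplified model
    }
    assert(base == &g_storage);
    std::atomic_thread_fence(std::memory_order_acquire);
    return g_storage.value;
}
```

Calling `InitializeClass()` on one thread and either reader on another gives the reader a guaranteed view of `value == 42` once it sees the published pointer. Dropping both the acquire load and the fence would reintroduce the hazard described in this thread.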

@davidwrighton
Member

@EgorBo I don't have a setup for actually running any arm64 code available tonight, but I think either of the two PRs I just linked to this issue should solve this problem. Could you try them out? The code to have the jit interface force the use of helpers is probably not a final change (we wouldn't need that logic on any platform with a TSO based memory model), but it should be enough to verify that it works.

@AndyAyersMS
Member

I can do some testing.

@EgorBo
Member

EgorBo commented Sep 27, 2024

#108309 didn't fix the issue for me, but #108311 did

@AndyAyersMS
Member

#108311 works for me too, didn't try the other one.

@davidwrighton
Member

Well, I'm going to want to investigate why #108309 didn't work later, since I'm kinda surprised it didn't fix that test case, so I'm not understanding some part of this system, which isn't good, but the barriers are strictly a more complete solution, so let's go with that.

@AndyAyersMS
Member

> Well, I'm going to want to investigate why #108309 didn't work

I think it also needs fixes in getFieldInfo and maybe other places.

Seems like giving the JIT static addresses before the class is initialized is asking for trouble. Also (as above) it is creating bigger code. So maybe we want both of these?

@jkotas
Member

jkotas commented Sep 27, 2024

> Seems like giving the JIT static addresses before the class is initialized is asking for trouble.

It can expose bugs elsewhere in the system as we have seen here, but I do not think there are fundamental problems with it.

> Also (as above) it is creating bigger code.

It is creating bigger code only when the JIT is not inlining the cctor check (e.g. optimizations are off).

@jakobbotsch
Member

@jkotas Do you think this is likely to still reproduce in CI even if the underlying issue turns out to not have been fixed? IIUC the testing strategy for crossgen2 has changed significantly recently.

@jkotas
Member

jkotas commented Oct 24, 2024

In the main branch, both crossgen2 and ilc run on the runtime specified in https://github.com/dotnet/runtime/blob/main/global.json#L8 . This version is currently 9.0.100-rc.2.24474.11, which does not contain the fix #108347. This failure should be gone once we update global.json to rtm.

@EgorBo
Member

EgorBo commented Nov 27, 2024

🎉

@github-actions github-actions bot locked and limited conversation to collaborators Dec 28, 2024