System.IO.Net5Compat.Tests and System.IO.Tests suddenly exiting with error 137 #100558

Open
carlossanlop opened this issue Apr 2, 2024 · 14 comments
Labels
arch-x64 area-System.IO Known Build Error Use this to report build issues in the .NET Helix tab os-linux Linux OS (any supported distro)

Comments

@carlossanlop
Member

carlossanlop commented Apr 2, 2024

The System.IO.Net5Compat.Tests and System.IO.Tests test processes are intermittently getting killed on Linux shortly after starting, with exit code 137.

Build Information

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=627407
Build error leg or test failing: System.IO.Net5Compat.Tests

Error Message

{
  "ErrorPattern": ["Starting:    System\\.IO\\.(Net5Compat\\.)?Tests", "exit code 137"],
  "BuildRetry" : true,
  "ExcludeConsoleLog" : false
}
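
To illustrate how a two-entry ErrorPattern like the one above could be checked against a console log, here is a minimal Python sketch. It assumes each entry is a regular expression and that the entries must match on successive lines, in order. That matching rule, the function name `matches_known_issue`, and the abbreviated `SAMPLE_LOG` are assumptions for illustration, not the actual known-issue matcher.

```python
import re

# Patterns copied from the issue's JSON (after JSON unescaping).
ERROR_PATTERNS = [
    r"Starting:    System\.IO\.(Net5Compat\.)?Tests",
    r"exit code 137",
]

# Abbreviated, illustrative log modeled on the examples in this issue.
SAMPLE_LOG = """\
  Starting:    System.IO.Tests (parallel test collections = on, max threads = 2)
./RunTests.sh: line 162:    25 Killed
----- end Tue Apr 2 20:20:10 UTC 2024 ----- exit code 137 -----
"""


def matches_known_issue(log_text, patterns=ERROR_PATTERNS):
    """Return True if every pattern matches some line, in order.

    Assumption: pattern N+1 must match on a line after the one matched
    by pattern N. The real matcher may behave differently.
    """
    remaining = list(patterns)
    for line in log_text.splitlines():
        if remaining and re.search(remaining[0], line):
            remaining.pop(0)
    return not remaining


if __name__ == "__main__":
    print(matches_known_issue(SAMPLE_LOG))
```

Requiring both patterns keeps the match narrow: a log that only contains "exit code 137" from some other test assembly would not be attributed to this issue.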

System.IO.Net5Compat.Tests example

===========================================================================================================
/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: System.IO.Net5Compat.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.IO.Net5Compat.Tests (found 679 of 685 test cases)
  Starting:    System.IO.Net5Compat.Tests (parallel test collections = on, max threads = 2)
./RunTests.sh: line 162:    25 Killed                  "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.IO.Net5Compat.Tests.runtimeconfig.json --depsfile System.IO.Net5Compat.Tests.deps.json xunit.console.dll System.IO.Net5Compat.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/root/helix/work/workitem/e
----- end Tue Apr 2 20:20:02 UTC 2024 ----- exit code 137 ----------------------------------------------------------

System.IO.Tests example

===========================================================================================================
/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: System.IO.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.IO.Tests (found 679 of 685 test cases)
  Starting:    System.IO.Tests (parallel test collections = on, max threads = 2)
./RunTests.sh: line 162:    25 Killed                  "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.IO.Tests.runtimeconfig.json --depsfile System.IO.Tests.deps.json xunit.console.dll System.IO.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/root/helix/work/workitem/e
----- end Tue Apr 2 20:20:10 UTC 2024 ----- exit code 137 ----------------------------------------------------------

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=627407
Error message validated: [Starting: System\.IO\.(Net5Compat\.)?Tests exit code 137]
Result validation: ✅ Known issue matched with the provided build.
Validation performed at: 4/2/2024 11:08:28 PM UTC

Report

Summary

24-Hour Hit Count | 7-Day Hit Count | 1-Month Count
0 | 0 | 0
@carlossanlop carlossanlop added area-System.IO arch-x64 runtime-coreclr specific to the CoreCLR runtime os-linux-musl Linux distributions using musl library. Known Build Error Use this to report build issues in the .NET Helix tab labels Apr 2, 2024
Contributor

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Apr 2, 2024
@carlossanlop carlossanlop changed the title from "System.IO.Net5Compat.Tests suddenly exiting with error 137" to "System.IO.Net5Compat.Tests and System.IO.Tests suddenly exiting with error 137" Apr 2, 2024
@carlossanlop carlossanlop added os-linux Linux OS (any supported distro) runtime-mono specific to the Mono runtime labels Apr 2, 2024
@ericstj ericstj removed os-linux Linux OS (any supported distro) os-linux-musl Linux distributions using musl library. runtime-mono specific to the Mono runtime runtime-coreclr specific to the CoreCLR runtime labels Apr 12, 2024
@ericstj
Member

ericstj commented Apr 12, 2024

@dotnet/area-system-io there are a lot of hits on this, and they are relatively recent. It seems to be happening across many configurations. I think it's worth having a look.

@adamsitnik
Member

The System.IO.Net5Compat.Tests and the System.IO.Tests test processes are intermittently getting killed on Linux shortly after starting, and the exit code is 137.

137 means out of memory. We have not made any changes to 6.0 in System.IO, so I expect that either there was some infra change (like less memory available) or a bug was introduced in the product itself. The bug would be specific to Linux.

@carlossanlop is it possible to perform some kind of binary search based on the merged PRs and when it started to fail?
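
For readers unfamiliar with the number: POSIX shells report 128 + N when a child process is terminated by signal N, so 137 corresponds to signal 9 (SIGKILL). The sketch below decodes that convention. The helper names `decode_exit_code` and `describe` are hypothetical, and a SIGKILL by itself is only suggestive of an out-of-memory kill, since other senders (for example `kill -9` or a container runtime) produce the same status, and a program can also return a value above 128 on its own.

```python
import signal


def decode_exit_code(code):
    """Split a shell exit status into (killed_by_signal, signal_number).

    Follows the 128 + N shell convention; it cannot distinguish a signal
    from a program that deliberately returned a status above 128.
    """
    if code > 128:
        return True, code - 128
    return False, None


def describe(code):
    """Return a short human-readable description of an exit status."""
    killed, signum = decode_exit_code(code)
    if not killed:
        return f"exited normally with status {code}"
    try:
        # Look up the platform's name for the signal number.
        name = signal.Signals(signum).name
    except ValueError:
        name = f"signal {signum}"
    return f"terminated by {name}"


if __name__ == "__main__":
    print(describe(137))
```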

@carlossanlop
Member Author

@adamsitnik @jozkee This is one of the most impactful failures in servicing. It only affects System.IO.Tests and System.IO.Net5Compat.Tests. Any chance you can take a look soon?

@adamsitnik
Member

@carlossanlop sure, but could you please answer the question I've asked in #100558 (comment) ?

@carlossanlop
Member Author

Sorry, I missed that question. Yes, you can use Kusto. David has used it many times in the past.

@carlossanlop
Member Author

This is the most basic Kusto query you can run when looking up failures by issue:

TestKnownIssues
| union KnownIssues
| where IssueId == ""

This database stores data from the last 4 months so hopefully there's still info from April.

This is the cluster where you would look for that info: https://dataexplorer.azure.com/clusters/dotnetperf.westus/databases/PerformanceData

Unfortunately it seems that failure data is not stored if it's not linked to an issue.

Thanks @AlitzelMendez for the above info.

@adamsitnik adamsitnik modified the milestones: 9.0.0, 10.0.0 Aug 21, 2024
@jeffhandley
Member

This test is failing a lot with 33 hits over the past 24 hours. We need to bring this back into 9.0.0, get it resolved, and plan to backport whatever change we make to the release/9.0 branch to clean up the failures there.

@vcsjones
Member

I suspect it is this test

[ConditionalTheory(typeof(PlatformDetection), nameof(PlatformDetection.Is64BitProcess))]
[MemberData(nameof(MemoryStream_PositionOverflow_Throws_MemberData))]
[SkipOnPlatform(TestPlatforms.iOS | TestPlatforms.tvOS, "https://github.com/dotnet/runtime/issues/92467")]
[ActiveIssue("https://github.com/dotnet/runtime/issues/100225", typeof(PlatformDetection), nameof(PlatformDetection.IsMonoRuntime), nameof(PlatformDetection.IsWindows), nameof(PlatformDetection.IsX64Process))]
public void MemoryStream_SeekOverflow_Throws(SeekMode mode, int bufferSize, int origin)

It is already disabled and noted to be problematic in certain environments. I don't know how much memory the ADO containers have, but this test does a couple of 2GB allocations.

I suspect you are just hitting the CoreCLR version of this Mono failure. #100225
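
One way to keep a test with multi-gigabyte allocations from running on a constrained machine is to check available memory first. The following is a rough Python sketch of that idea, not how the runtime test suite does it: the function names, the `headroom` factor, and the sample values are all hypothetical. It parses `/proc/meminfo`-style text, which inside a container usually reflects the host rather than the container's cgroup limit, so a real guard would also need to consult the cgroup limit.

```python
def available_memory_bytes(meminfo_text):
    """Parse the MemAvailable field from /proc/meminfo-style text.

    Returns the value in bytes, or None if the field is missing.
    """
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:   12345678 kB"
            return int(line.split()[1]) * 1024
    return None


def should_run_large_allocation_test(meminfo_text, required_bytes, headroom=1.5):
    """Decide whether a memory-hungry test is safe to run.

    headroom is an arbitrary safety factor chosen for illustration.
    """
    available = available_memory_bytes(meminfo_text)
    if available is None:
        # Unknown memory situation: be conservative and skip.
        return False
    return available >= required_bytes * headroom


if __name__ == "__main__":
    # On Linux, feed the real file to the guard.
    with open("/proc/meminfo") as f:
        print(should_run_large_allocation_test(f.read(), 2 * 1024**3))
```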

@vcsjones
Member

That is the only test in System.IO.Tests that does any significant memory allocation that I was able to observe.

@adamsitnik
Member

this test does a couple of 2GB allocations.

It's most likely one of the tests that causes the OOM 👍

But I am not sure that it's the only one:

@vcsjones
Member

this should manifest as a managed OOM that does not take the testing app down?

This test failure looks like the Linux OOM killer. The .NET process was able to allocate the memory, but shortly afterward the system ran out of memory. When that happens, Linux runs the OOM killer, which starts killing processes to free memory.

See https://www.kernel.org/doc/gorman/html/understand/understand016.html for more information.

The OOM killer decided that the .NET process was the right one to take down.
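
If the kernel log of the machine is available, an OOM kill can usually be confirmed there, because the kernel logs which process it killed. Below is a minimal Python sketch that scans kernel log text for such lines. The regular expression reflects the commonly seen "Killed process <pid> (<name>)" wording, which varies between kernel versions, and `SAMPLE_KERNEL_LOG` is a made-up line for illustration, not output captured from these builds. Reading the kernel log typically requires elevated privileges and may not be possible from inside a container.

```python
import re

# Assumed shape of the kernel's OOM-kill message; wording varies by version.
OOM_KILL_RE = re.compile(r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)")

# Hypothetical sample line, not taken from the failing builds.
SAMPLE_KERNEL_LOG = (
    "[12345.678901] Out of memory: Killed process 25 (dotnet) "
    "total-vm:4194304kB, anon-rss:2097152kB\n"
)


def find_oom_kills(kernel_log_text):
    """Return a list of (pid, process_name) for each OOM-kill line found."""
    kills = []
    for line in kernel_log_text.splitlines():
        match = OOM_KILL_RE.search(line)
        if match:
            kills.append((int(match.group("pid")), match.group("name")))
    return kills


if __name__ == "__main__":
    print(find_oom_kills(SAMPLE_KERNEL_LOG))
```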

@adamsitnik
Member

@vcsjones thanks, I was not aware of that! (BTW it sucks as in a way it hides quite important information like stacktrace of the method that caused OOM)

@vcsjones
Member

vcsjones commented Sep 1, 2024

I think that test was contributing to the problem. The issue is still occurring in the release/9.0 branch, which is why the 24-hour cell is not zero.

6 participants