Green CI pipeline on a failed run #85824
Comments
Tagging subscribers to this area: @dotnet/runtime-infrastructure
This is expected: runtime/eng/testing/RunnerTemplate.sh Lines 231 to 237 in 46693f2
Looks like we got unlucky and this case also returns exit code 1, which we then misinterpret as failing tests (which are intentionally a "passing" state as far as our Helix usage is concerned).
Interesting, so I guess we need to find a way to tell the difference between a normal run with some failing tests vs a run that actually didn't happen at all 🤔
In any case, we've dropped support for CentOS7/RHEL7, so we should update any legs that are testing on it to test on distros like CentOS Stream 8 or AlmaLinux 8.
I agree we need to move off unsupported distros, but I believe we should not ignore the issue at hand. This is literally unnoticeable from the outside; I wouldn't have discovered the problem if I hadn't gone all the way down to the console logs -- if this happened again for some other distro, we wouldn't know that tests are not running, possibly for quite some time.
I think an easy fix would be to add a check whether a testResults.xml was produced in the "exit code == 1" case. Maybe also check whether it is > 0 bytes but that might already be handled by the arcade test reporter.
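A minimal sketch of the check suggested above, not the actual RunnerTemplate.sh logic. The function name `resolve_exit_code` and its arguments are hypothetical. It assumes exit code 1 means "tests ran and some failed" only when a non-empty results file exists, and that any other non-zero exit code should be passed through unchanged.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the exit code the work item should report.
#   $1 = exit code returned by the test runner
#   $2 = path to the results file (e.g. testResults.xml)
resolve_exit_code() {
  local test_exit_code="$1"
  local results_file="$2"

  if [ "$test_exit_code" -eq 0 ]; then
    # Runner succeeded; nothing to reinterpret.
    echo 0
  elif [ "$test_exit_code" -eq 1 ] && [ -s "$results_file" ]; then
    # Exit code 1 and a non-empty results file: tests ran and some failed.
    # Per the thread, this is intentionally a "passing" work item, since
    # the failures are reported from the results file.
    echo 0
  else
    # Results file missing or empty, or an unexpected exit code:
    # the run did not really happen, so propagate the failure.
    echo "$test_exit_code"
  fi
}
```

Assuming the runner's exit code was saved in a variable such as `test_exit_code`, the script could then end with `exit "$(resolve_exit_code "$test_exit_code" testResults.xml)"`. `[ -s file ]` is true only when the file exists and is larger than 0 bytes, so it covers both checks mentioned above.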
That would make sense to me @akoeplinger. The fact that a failed run can be reported as a success is highly problematic IMHO, as it can make us blind to a whole range of issues; an obsolete distro is only one of them. And for now it is a great waste of resources, as we do runs that yield no useful results. We can test the presence of the result file with
For at least 2 Linux distros, the tests are NOT run because of some internal problems, BUT the test suite is reported as PASSED outside. I reported this to CoreEng and their reply was that we ignore the exit code in our scripts.
Below is the example of the problem:
Pipeline has "Libraries Test Run release coreclr linux x64 Release" as green Pipelines - Run 20230505.2 logs (azure.com)
CentOS 7 run, test suites have state "passed" e.g. https://helix.dot.net/api/jobs/0746b253-5b79-402b-b185-b63a8e921e92/workitems/System.Composition.Runtime.Tests?api-version=2019-06-17 (it's not only this test suite, it's all of them)
Console output shows exit code 1 on trying to run the tests: https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-heads-main-0746b2535b79402bb1/System.Composition.Runtime.Tests/1/console.c9dbb469.log?helixlogtype=result
The same is true for RHEL 7:
https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-heads-main-a3e4a99b995b4511b7/ComInterfaceGenerator.Tests/1/console.6f507810.log?helixlogtype=result
Below is comment from CoreEng team:
So it seems your Helix job script ignores the exit code from the dotnet exec command because then the whole script exits with 0 and Helix then assumes the job succeeded. I think you need to change the script you're sending to Helix to not ignore the exit code?
I don't believe there's anything else to do from our side here if I understand this correctly.
This is the payload that is getting executed, inside I can see the RunTests.sh script that runs this
Where the exit codes that it checks for are these: