[ARM] Potential codegen issues on Graviton3 #100705
Comments
It would be nice to have more details: does it fail on other ARM64 machines, which tests fail, and what are the expected and actual inputs/outputs? Is it floating-point related? For example, we do not guarantee that all floating-point operations (casts to integers, edge cases in Math.* calls) always produce exactly the same result for the same input on all architectures/platforms. It would also be good to confirm that the issue is indeed on the dotnet/runtime side. |
Some are binding errors at the runtime level, which point to heap corruption. One shows 2 floating-point numbers, which is clearly not a floating-point issue; it looks more like the data is being read from a different memory location. The only ARM64 hardware I have is Graviton3 from AWS, and it fails consistently on every machine provisioned. All the tests execute properly on Linux and Windows on 32-bit and 64-bit Intel. That's why I provided 3 different ways to trigger the behavior, each with a completely different error. But since they all trigger after a number of passes (and the number is consistent across runs), and they all run successfully under a Debug build, I suspect the errors start once a certain optimization is performed. |
Okay, I have a much smaller scope for reproducing this.
To make things easier, what this does is execute this function:
Over this input:
Will sometimes return:
Note that the number of iterations is different. I am assuming that this is the tiered JIT kicking in and a bug happening somewhere in there. I wasn't able to simplify this further, I'm afraid. When running this in Debug mode, it completes successfully each time.
Once this happens, it is entirely reproducible. |
We don't see the error immediately because of tiered compilation, but it fails on the very first iteration when tiered compilation is disabled.
This, together with a simplified reproduction that needs almost nothing, confirms the suspicion that the error is related to code generation on the ARM64 platform. When we disable quick JIT, even with tiered compilation on, it also fails on the first iteration.
However, when we disable only quick JIT for loops, it goes back to failing after a set number of iterations.
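For anyone wanting to retry the three configurations described above, here is a minimal sketch. The exact commands used in this thread were not preserved, so this is an assumption: it uses the documented `DOTNET_TieredCompilation`, `DOTNET_TC_QuickJit` and `DOTNET_TC_QuickJitForLoops` environment variables, and both the `jit_env` helper and `repro.dll` are hypothetical names.

```shell
# Hypothetical helper: prints the environment assignment for one JIT setup.
# The DOTNET_* names are the documented runtime knobs; that the thread used
# exactly these settings is an assumption.
jit_env() {
  case "$1" in
    no-tiering)        echo "DOTNET_TieredCompilation=0" ;;    # reported: fails on first iteration
    no-quickjit)       echo "DOTNET_TC_QuickJit=0" ;;          # reported: fails on first iteration
    no-quickjit-loops) echo "DOTNET_TC_QuickJitForLoops=0" ;;  # reported: fails after a set number of iterations
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Usage (repro.dll is a placeholder for the reproduction binary):
#   env "$(jit_env no-tiering)" dotnet repro.dll
```

Running each mode against a Release build should make it easier to tell whether the failure depends on when the optimized code is produced.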
|
@EgorBo can you see if the repro above works for us? If not I can look later this week. |
Some more information on this. A user is reporting that running RavenDB on a Mac M2 crashes with:
When running When setting
|
We tried with:
And the user is getting:
Is there any way for us to extract additional information here? |
I managed to repro it using PS: it doesn't even repro when I copy my locally built net8.0 bits (from release branch) to the .net 8.0 sdk's folder |
What is the component that contains this assert check? (There is no assert check like this in .NET runtime.) |
It's from the Rosetta emulation layer on macOS. |
PS: My comment about
instead of:
In Release, not for the Rosetta issue which I guess is a separate issue - I was running the repro on Linux-arm64 |
When running Note that I don't know if both of those issues are related. They are both on ARM64 and the env vars make it look like they have the same impact, but they may be two different items. |
I take it this doesn't (yet) work on Windows arm64?
I'll try it under WSL2. |
I can repro (I think) on my volterra under WSL2:
|
Above was with 8.0.4. @EgorBo this also no longer repros for me if I drop in a locally built checked libclrjit.so. Am going to see if release does likewise. It does. Going to try an SPMI collection and then diff replay using the retail jit and the ones I've built locally. |
Three methods with diffs, all are Impacted methods are
though without assembly info this is probably not enough to figure out where to disable opts (I can retry with a checked runtime and JIT, I suppose, if you all want to try crafting a workaround). Looks like this is #92201. The fix was ported to 8.0 in #100372 but hasn't made it into a release yet. It should appear in 8.0.5, due in the middle of May. |
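Since the fix is tied to a specific servicing version, a quick way to check whether a machine already has it is to compare the installed runtime version against 8.0.5. This is a minimal sketch, assuming a `sort` that supports `-V` (GNU coreutils); `version_ge` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
# Relies on `sort -V` (version sort), which is an assumption about the platform.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage: list installed Microsoft.NETCore.App runtimes and flag the ones
# that predate the servicing release mentioned above.
#   dotnet --list-runtimes | awk '$1 == "Microsoft.NETCore.App" { print $2 }' |
#   while read -r v; do
#     if version_ge "$v" 8.0.5; then echo "$v: has the fix"; else echo "$v: affected"; fi
#   done
```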
We have a customer and our own test servers impacted by this. Is there any way we could get the pre-release servicing build in order for those systems to be upgraded? |
We do not publish official pre-release servicing builds. The servicing builds contain security fixes that we cannot disclose publicly before the release. You may want to create your own build using the same Docker image that is used for official builds: https://github.com/dotnet/runtime/blob/release/8.0/docs/workflow/building/coreclr/linux-instructions.md#docker-images |
@redknightlois, please let us know if either your own build or an official servicing build fixes the problem. |
As soon as the official servicing build is out I will install and check. |
Confirmed: 8.0.5 solves the issue. |
Description
We have added Graviton3 chipsets to our testing matrix and found many different tests that fail consistently. However, those errors do not trigger when run on the following platforms:
Moreover, when run continuously, the errors only trigger once 2 conditions are met:
At first we suspected that we were doing something wrong on our end (it's possible we are corrupting the heap somehow), but the deterministic nature of the failure and its repeatability once it triggers for the first time make us suspect it is something more permanent than heap corruption.
I include several different ways to reproduce the errors.
Reproduction Steps
Download any of the following reproductions:
Run them in order to ensure that optimizations will kick in:
Observe the errors after a set number of loops.
Expected behavior
They should execute the 1000 loops without printing any exception.
Actual behavior
An exception triggers repeatedly after a set number of loops.
Regression?
No response
Known Workarounds
None
Configuration
Other information
No response