Mac crash on osx1011-x86-2 in array copy test #9782
@DanHeidinga fyi |
@gacholio can you take a look at this? |
Crash is here:
Assuming we just recently entered the interpreter, the reason is:
So it seems likely a bad object was returned from a JIT call. |
The stack is not useful (vmThread roots have not been updated). Looking through the registers, R9 maps to a bytecode PC:
|
My best guess at the stack (from the registers and hand-examining the memory) is:
|
So the String argument is fine, but the StringBuilder appears to be trash. |
I suspect jitted
|
@gacholio / @DanHeidinga do you have any failure rate information for this, or any history of when it might have started? I agree we should look at it, but I'm not sure if 0.21 is an appropriate target. @cathyzhyi could you help find someone to look at this? Note the 0.21 tag. |
It appears to be a new failure we haven't seen before. The first instance I can find is the one in the opening description.
I added the 0.21 tag to ensure there would be a conversation about whether this is stop-ship or not, given that it appears to be a newly introduced (exposed?) issue. |
@liqunl could you have a sniff? |
I looked at the disassembly of |
I've tried another 50 runs on my laptop with no reproduction. I'm trying to run grinders on the internal Jenkins, but I can't find a Mac uma build. @pshipton What is a Mac uma build? How is it different from a regular Mac build? |
A Mac uma build uses the old (uma) makefile system; any build before May 28th used this. Since ibmruntimes/openj9-openjdk-jdk11#308 was merged, the jdk11 Mac builds use the cmake build system, which is supposed to generate the same binaries. We continue to run a nightly build using the old uma makefile system so that if any problems occur we can compare between the two builds. Likely the problem isn't particular to the uma build; it's just intermittent and happened to occur in that build. We'll be switching the xlinux jdk11 build to cmake next, ideally before the end of next week. |
I tried 100 jobs on internal grinder, still can't reproduce the issue. So far the crash is not reproducible in 200 jobs. |
The StringBuilder object seems to be corrupted by the array zero-initialization instruction in the jitted code. The zero initialization is accomplished by

The bad StringBuilder object has everything right except the header: the header is zero. However, the header can't be zero when the object is passed to the jitted code, as the jitted code reads a flag on the object's j9class to check whether the object is finalizable. I double-checked the jitted code to see whether it's possible that we could write to the StringBuilder object.
So it seems to me that we didn't initialize the byte array. We didn't skip zero-initializing the byte array either. The following instruction is used to zero the array; rdi initially points to the first element of the array, and the counter rcx is 0x20.
I suspect that we did a backward memset. We didn't clear the direction flag DF before

The memory content seems to be in line with this theory. @andrewcraik @JamesKingdon Do you think this is the right cause of the crash? If so, I think the solution is to clear the DF flag before zeroing. |
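To make the suspected failure mode concrete, here is a minimal sketch in plain C++ with inline assembly (not OpenJ9 code; the heap layout, slot count, and fill values are invented for illustration, and rep stosq stands in for the elided zeroing instruction). It shows how a stale direction flag turns an intended forward zero-initialization into a backward walk that wipes out the header of whatever sits just before the array:

```cpp
// Minimal sketch (x86-64, GCC/Clang inline asm). Not OpenJ9 code.
#include <cstdint>
#include <cstdio>

int main() {
    // Pretend heap: slots 0..3 are a preceding object (header in slot 0),
    // slot 4 is the array header, slots 5..7 are the array payload that the
    // JIT intends to zero-initialize starting at slot 5.
    uint64_t heap[8];
    for (int i = 0; i < 8; ++i)
        heap[i] = 0x1111111111111111ULL * (i + 1);   // recognizable patterns

    uint64_t *dst   = &heap[5];  // RDI: first element of the array payload
    uint64_t  count = 6;         // RCX: number of 8-byte stores

    asm volatile(
        "std\n\t"        // simulate a stale DF left set by earlier code
        "rep stosq\n\t"  // intended forward zero-init now walks backwards
        "cld\n\t"        // restore the DF=0 state the ABI expects
        : "+D"(dst), "+c"(count)
        : "a"(0ULL)
        : "memory", "cc");

    for (int i = 0; i < 8; ++i)
        printf("heap[%d] = %#018llx\n", i, (unsigned long long)heap[i]);
    // With DF set, slots 5, 4, 3, 2, 1, 0 are zeroed (including the header
    // of the preceding object), while slots 6..7, the rest of the array
    // payload, are never initialized at all.
    return 0;
}
```

This lines up with the symptoms described above: the object preceding the array ends up with a zeroed header, while the tail of the new array never gets initialized.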
I was under the impression that DF is supposed to remain clear at all times (in C at least). It would be a simple matter to clear it on transition to JIT code (and maybe you would need to do it on return from JNI calls). |
@liqunl Great find! It certainly sounds possible. Given GAC's comments I need to check how DF can be set, but I'm leaning towards a conservative fix of making sure it has the correct value before executing the rep sto. @andrewcraik? Looks like it can only be changed directly using CLD/STD. I feel a little uneasy about relying on it being in the correct state before the rep sto, though. |
Looking at the Agner Fog instruction tables https://www.agner.org/optimize/instruction_tables.pdf, it seems that CLD/STD are not cheap: latency 4 on Skylake/Skylake-X and latency 3 on Ryzen. I think we need to be a bit careful about spraying CLD etc. everywhere. Now there is some curious-looking code in https://github.com/eclipse/openj9/blob/71bf16037a88e1116afdcc7516db040aa5365ab1/runtime/compiler/x/codegen/J9TreeEvaluator.cpp#L1622 We can consider the always-CLD/STD solution as a temporary fix, but we need to perf-evaluate that and also try to figure out how the flag got messed up, since protecting around natives and so on, as suggested by @gacholio, is probably the better long-term solution. |
oh, you mean potentially we set STD for a backward array copy and then never clear it? I'm still very nervous about the idea of relying on it having the right value over long periods of time. |
The rationale is that backwards copy is vastly less likely than forwards, so keeping the DF clear when not doing backwards copy will be a noticeable performance win. |
From what I can see, backwards copy always emits STD/REP/CLD, so it's still a mystery why it would ever be set (assuming I'm right about the C linkage keeping it clear). |
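For readers following along, the backward-copy pattern referred to here looks conceptually like the sketch below (again plain C++ with inline assembly, not the actual JIT-emitted code; the buffer contents and counts are invented). STD makes the string move run from high to low addresses, which is what an overlapping copy with the destination above the source requires, and CLD restores DF immediately afterwards:

```cpp
// Conceptual STD / REP MOVS / CLD backward copy. Not the JIT-emitted code.
#include <cstdint>
#include <cstdio>

int main() {
    // Overlapping copy of buf[0..3] into buf[2..5]; a forward copy would
    // overwrite source elements before reading them, so copy backwards.
    uint64_t buf[6] = {1, 2, 3, 4, 0, 0};
    const uint64_t *src = &buf[3];   // RSI: last source element
    uint64_t       *dst = &buf[5];   // RDI: last destination element
    uint64_t       count = 4;        // RCX: number of 8-byte moves

    asm volatile(
        "std\n\t"          // DF=1: RSI/RDI decrement after every element
        "rep movsq\n\t"    // copy 4 qwords from high to low addresses
        "cld\n\t"          // restore DF right away, as the emitted pattern does
        : "+S"(src), "+D"(dst), "+c"(count)
        :
        : "memory", "cc");

    for (int i = 0; i < 6; ++i)
        printf("buf[%d] = %llu\n", i, (unsigned long long)buf[i]);
    // Expected: 1 2 1 2 3 4
    return 0;
}
```

Because the CLD directly follows the REP MOVS, DF should only ever be set inside that short window, which is why a stale DF at the zero-initialization site is so surprising.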
DF can also be set via popf, which makes it a little harder to narrow down possible changes. Potentially DF might have been set as an accidental side effect of some other flag manipulation. A very cursory grep of the codebase shows some hits for popf*, but I haven't explored further to see what's being done. |
Assuming pushf/popf are matched, that shouldn't be an issue. If there are places using popf on other stacked values, that could certainly cause this (though it seems exceedingly unlikely). |
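As a small illustration of the popf concern (a sketch assuming a GCC/Clang x86-64 toolchain, not code from the OpenJ9 tree): popfq installs whatever flags image is on the stack, so if that image happens to have bit 10 (DF) set, DF is silently turned on with no STD in sight. A matched pushfq/popfq pair round-trips the original flags, so only a popf of some other stacked value could do this.

```cpp
// Demonstrates that popfq can set DF from a modified stacked flags image.
#include <cstdint>
#include <cstdio>

int main() {
    uint64_t flags_after = 0;

    asm volatile(
        "pushfq\n\t"
        "orq  $0x400, (%%rsp)\n\t"  // bit 10 of RFLAGS is DF
        "popfq\n\t"                 // installs DF=1 from the stacked image
        "pushfq\n\t"
        "popq %0\n\t"               // read the flags back to observe DF
        "cld\n\t"                   // restore the DF=0 the ABI expects
        : "=r"(flags_after)
        :
        : "cc", "memory");

    printf("DF after popfq: %llu\n",
           (unsigned long long)((flags_after >> 10) & 1));  // prints 1
    return 0;
}
```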
A second core reveals a similar failure. In both cases the source and destination addresses are valid and aligned heap addresses (and hence readable/writable). One curiosity in both cases is that the destination address is 4 bytes ahead of the source address. That shouldn't cause a problem, though. |
Is that allowed for movsd? It's doing 8-byte transfers, isn't it? |
No, movsd is moving 4 bytes. This code isn't specific to macOS, so I'm not sure why we're tripping over this instruction with valid source and dest addresses only on macOS. |
and only on one particular machine, osx1011-x86-2. |
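For reference (an illustration with invented buffers, in the same C++/inline-assembly style, not taken from the cores above): the operand-less string-move mnemonics encode the element width (MOVSB is 1 byte, MOVSW 2, MOVSD 4, MOVSQ 8), and the 8-byte reading is easy to reach because the SSE2 instruction also spelled MOVSD moves a 64-bit double. The string form discussed here moves 4 bytes per iteration, with RCX counting elements rather than bytes; AT&T syntax spells it movsl:

```cpp
// The string MOVSD (AT&T: movsl) copies one 4-byte doubleword per iteration.
#include <cstdint>
#include <cstdio>

int main() {
    uint32_t src[4] = {0x11111111, 0x22222222, 0x33333333, 0x44444444};
    uint32_t dst[4] = {0, 0, 0, 0};
    const uint32_t *s = src;
    uint32_t       *d = dst;
    uint64_t        count = 4;   // RCX counts 4-byte elements, not bytes

    asm volatile(
        "cld\n\t"          // forward copy
        "rep movsl\n\t"    // Intel mnemonic MOVSD: one doubleword per step
        : "+S"(s), "+D"(d), "+c"(count)
        :
        : "memory", "cc");

    for (int i = 0; i < 4; ++i)
        printf("dst[%d] = %#x\n", i, dst[i]);   // 4 elements x 4 bytes copied
    return 0;
}
```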
A similar failure at https://openj9-jenkins.osuosl.org/job/Test_openjdk8_j9_sanity.functional_x86-64_mac_OMR/42/tapResults/ (
Also seen at https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_sanity.functional_x86-64_mac_Nightly/45/tapResults/
|
Another failure seen here:
|
A similar
This was at

Edit: also occurred at the JDK8 nightly build https://openj9-jenkins.osuosl.org/job/Test_openjdk8_j9_sanity.functional_x86-64_mac_Nightly/52/tapResults/ |
Another failure on
https://openj9-jenkins.osuosl.org/job/Test_openjdk8_j9_sanity.functional_x86-64_mac_Personal/66/ |
@VermaSh it's actually on osx1011-x86-2, just like all the others. The top-level job you linked to ran on osx1014-x86-1, but it just starts a couple of child jobs. The actual failed job is https://openj9-jenkins.osuosl.org/job/Test_openjdk8_j9_sanity.functional_x86-64_mac_Personal_testList_0/64/ |
Curiously, this problem hasn't been seen for some time (or at least it hasn't been reported in this issue). It used to show up quite regularly in PR testing. Moving out nonetheless. |
@0xdaryl it's because I disabled the machine so tests don't run on it any more, so we don't have to see the failures and deal with failed acceptance builds due to this. We got 2 new Macs, although only one of them is working properly so far. |
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_sanity.functional_x86-64_mac_uma_Nightly_testList_0/5
TestArrayCopy_3
osx1011-x86-2