Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenJDK AIX jdk_security2 Agent communication error #19962

Open
pshipton opened this issue Aug 2, 2024 · 45 comments
Open

OpenJDK AIX jdk_security2 Agent communication error #19962

pshipton opened this issue Aug 2, 2024 · 45 comments

Comments

@pshipton
Copy link
Member

pshipton commented Aug 2, 2024

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_OMR_testList_1/123 - p8-java1-ibm10
jdk_security2_1
javax/crypto/Cipher/TestCipherMode.java

11:02:54  AES/KWP/NoPadding with UNWRAP
11:02:54  => expected ISE thrown for update()
11:02:54  => expected ISE thrown for doFinal()
11:02:54  AES/KWP/NoPadding with NONEXISTENT
11:02:54  => expected IPE thrown for init()
11:02:54  
11:02:54  TEST RESULT: Error. Agent communication error: java.io.EOFException; check console log for any additional details
11:02:54  --------------------------------------------------
11:05:46  Test results: passed: 223; error: 1
11:06:02  Report written to /home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_OMR_testList_1/aqa-tests/TKG/output_17226057497025/jdk_security2_1/report/html/report.html
11:06:02  Results written to /home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_OMR_testList_1/aqa-tests/TKG/output_17226057497025/jdk_security2_1/work
11:06:02  Error: Some tests failed or other problems occurred.
@pshipton
Copy link
Member Author

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_OpenJDK21_testList_1/29 - p8-java1-ibm04
jdk_security2_0
javax/crypto/Cipher/TestCipherMode.java

15:33:41  AES/KWP/NoPadding with UNWRAP
15:33:41  => expected ISE thrown for update()
15:33:41  => expected ISE thrown for doFinal()
15:33:41  AES/KWP/NoPadding with NONEXISTENT
15:33:41  => expected IPE thrown for init()
15:33:41  
15:33:41  TEST RESULT: Error. Agent communication error: java.io.EOFException; check console log for any additional details
15:33:41  --------------------------------------------------
15:37:18  Test results: passed: 239; error: 1

@pshipton
Copy link
Member Author

pshipton commented Oct 10, 2024

@pshipton
Copy link
Member Author

@pshipton
Copy link
Member Author

This is 0.48
https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Release_testList_1/28/
javax/crypto/Cipher/TestCipherMode.java

@pshipton
Copy link
Member Author

I suspect this is the same cause, although no test is named. I've seen a number of these.
https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/310
jdk_security2_0

00:51:01  Running test jdk_security2_0 ...
00:51:01  ===============================================
00:51:01  jdk_security2_0 Start Time: Wed Oct 23 04:34:08 2024 Epoch Time (ms): 1729658048837
00:51:01  variation: Mode150
00:51:01  JVM_OPTIONS:  -XX:+UseCompressedOops -Xverbosegclog 
00:51:01  { \
00:51:01  echo "";	echo "TEST SETUP:"; \
00:51:01  "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/jdkbinary/j2sdk-image/bin/java" -Xshareclasses:destroyAll; "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/jdkbinary/j2sdk-image/bin/java" -Xshareclasses:groupAccess,destroyAll; echo "cache cleanup done"; \
00:51:01  mkdir -p "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../TKG/output_17296532351357/jdk_security2_0"; \
00:51:01  cd "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../TKG/output_17296532351357/jdk_security2_0"; \
00:51:01  echo "";	echo "TESTING:"; \
00:51:01  "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/jdkbinary/j2sdk-image/bin/java" -Xmx512m -jar "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../../jvmtest/openjdk/jtreg/lib/jtreg.jar" \
00:51:01  -agentvm -a -ea -esa -v:fail,error,time,nopass -retain:fail,error,*.dmp,javacore.*,heapdump.*,*.trc -ignore:quiet -timeoutFactor:8 -xml:verify  -concurrency:1 -nativepath:"/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/jdkbinary/openjdk-test-image/jdk/jtreg/native" -vmoptions:"-Xmx512m  -XX:+UseCompressedOops -Xverbosegclog  " \
00:51:01  -timeoutHandler:jtreg.openj9.CoreDumpTimeoutHandler -timeoutHandlerDir:"/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/../../testDependency/lib/openj9jtregtimeouthandler.jar" \
00:51:01  -w ""/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../TKG/output_17296532351357/jdk_security2_0"/work" \
00:51:01  -r ""/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../TKG/output_17296532351357/jdk_security2_0"/report" \
00:51:01  -jdk:"/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/jdkbinary/j2sdk-image" \
00:51:01  -exclude:"/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../openjdk/openjdk-jdk/test/jdk/ProblemList.txt" \
00:51:01  -exclude:"/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../../jvmtest/openjdk/excludes/ProblemList_openjdk17-openj9.txt" \
00:51:01   \
00:51:01  -exclude:"/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../openjdk/excludes/vendors/eclipse/ProblemList_openjdk17.txt" \
00:51:01  "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../openjdk/openjdk-jdk/test/jdk:jdk_security2"; \
00:51:01  if [ $? -eq 0 ]; then echo "-----------------------------------"; echo "jdk_security2_0""_PASSED"; echo "-----------------------------------"; cd /home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/..;  else echo "-----------------------------------"; echo "jdk_security2_0""_FAILED"; echo "-----------------------------------"; fi; \
00:51:01  echo "";	echo "TEST TEARDOWN:"; \
00:51:01  "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/jdkbinary/j2sdk-image/bin/java" -Xshareclasses:destroyAll; "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/jdkbinary/j2sdk-image/bin/java" -Xshareclasses:groupAccess,destroyAll; echo "cache cleanup done"; \
00:51:01   } 2>&1 | tee -a "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../TKG/output_17296532351357/TestTargetResult";
00:51:01  
00:51:01  TEST SETUP:
00:51:01  JVMSHRC005I No shared class caches available
00:51:01  JVMSHRC005I No shared class caches available
00:51:01  cache cleanup done
00:51:01  
00:51:01  TESTING:
00:51:03  Directory "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../TKG/output_17296532351357/jdk_security2_0/work" not found: creating
00:51:03  Directory "/home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/../TKG/output_17296532351357/jdk_security2_0/report" not found: creating
00:51:11  XML output with verification to /home/jenkins/workspace/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/aqa-tests/TKG/output_17296532351357/jdk_security2_0/work
00:57:54  Cannot contact p8-java1-ibm08: java.lang.InterruptedException
01:04:08  gmake[4]: *** [autoGen.mk:271: jdk_security2_0] Killed

@pshipton
Copy link
Member Author

pshipton commented Oct 29, 2024

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/313/ - p8-java1-ibm06
00:02:49 Cannot contact p8-java1-ibm06: java.lang.InterruptedException

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/312 - p8-java1-ibm12
21:13:32 Cannot contact p8-java1-ibm12: java.lang.InterruptedException

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_OMR_testList_2/138 - p8-java1-ibm10
javax/crypto/Cipher/TestCipherMode.java

08:33:25  AES/KWP/NoPadding with NONEXISTENT
08:33:25  => expected IPE thrown for init()
08:33:25  
08:33:25  TEST RESULT: Error. Agent communication error: java.io.EOFException; check console log for any additional details

@pshipton
Copy link
Member Author

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/316
jdk_security2_0
javax/crypto/CryptoPermissions/TestExemption.java
TEST RESULT: Error. Agent communication error: java.io.EOFException; check console log for any additional details

@pshipton
Copy link
Member Author

pshipton commented Nov 1, 2024

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/317 - p8-java1-ibm10
jdk_security2_0
javax/crypto/CryptoPermissions/CryptoPolicyFallback.java
ACTION: main -- Failed. Unexpected exit from test [exit code: 137]

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/316
jdk_security2_1
Cannot contact p8-java1-ibm11: java.lang.InterruptedException

@pshipton pshipton added the os:aix label Nov 1, 2024
@pshipton pshipton changed the title OpenJDK javax/crypto/Cipher/TestCipherMode Agent communication error OpenJDK AIX jdk_security2 Agent communication error Nov 1, 2024
@pshipton
Copy link
Member Author

pshipton commented Nov 5, 2024

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/319 - p8-java1-ibm05
jdk_security2_0
javax/crypto/Cipher/TestCipherMode.java

04:41:21  AES/KWP/NoPadding with NONEXISTENT
04:41:21  => expected IPE thrown for init()
04:41:21  
04:41:21  TEST RESULT: Error. Agent communication error: java.io.EOFException; check console log for any additional details

@pshipton
Copy link
Member Author

pshipton commented Nov 8, 2024

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_OMR_testList_2/145 - p8-java1-ibm10
jdk_security2_1
javax/crypto/Cipher/ByteBuffers.java
ACTION: main -- Error. Agent communication error: java.io.EOFException; check console log for any additional details

@pshipton
Copy link
Member Author

pshipton commented Nov 11, 2024

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_OMR_testList_1/147 - p8-java1-ibm03
jdk_security2_0
javax/crypto/Cipher/TestCipherMode.java
TEST RESULT: Error. Agent communication error: java.io.EOFException; check console log for any additional details

@pshipton
Copy link
Member Author

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/323
jdk_security2_0
Cannot contact p8-java1-ibm09: java.lang.InterruptedException

@pshipton
Copy link
Member Author

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/324
jdk_security2_0
javax/crypto/Cipher/TestCipherMode.java
Cannot contact p8-java1-ibm04: java.lang.InterruptedException

@pshipton
Copy link
Member Author

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/325
Cannot contact p8-java1-ibm02: java.lang.InterruptedException

@pshipton
Copy link
Member Author

@jasonkatonica I've set this as a blocker.

@pshipton
Copy link
Member Author

pshipton commented Nov 14, 2024

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/325
javax/crypto/CipherSpi/TestGCMWithByteBuffer.java.TestGCMWithByteBuffer
javax/crypto/CryptoPermission/AllPermCheck.java.AllPermCheck
Cannot contact p8-java1-ibm02: java.lang.InterruptedException

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_OpenJDK17_testList_0/39
javax/crypto/Cipher/TestCipherMode.java
Cannot contact p8-java1-ibm04: java.lang.InterruptedException

@pshipton
Copy link
Member Author

Probably related to https://github.ibm.com/runtimes/infrastructure/issues/10081
It seems likely there is a huge consumption of memory, which causes swapping, perhaps running out of swap space, which may cause the machine to become unusable. Either a memory leak, or some big memory consumption.

@taoliult
Copy link
Contributor

Checking the failures, they occur in different tests. Investigating these tests to see if there are any similarities.

@zl-wang
Copy link
Contributor

zl-wang commented Dec 3, 2024

in almost all cases, the TestCipherMode test process is picked and killed, while no other test is affected.

no, it didn't sound like that. the machine was even rendered inaccessible, wasn't it? that meant sshd or some other critical process(es) were killed as well.

@zl-wang
Copy link
Contributor

zl-wang commented Dec 3, 2024

by the way, you can configure AIX to pro-actively allocate paging space for sure. i.e. paging space is used as if it were a copy of all active virtual memory, instead of being an extension to the physical memory. if you config like so, java process(es) just failed to start in the current situation. for this case, system had better have more paging space (a few times of) than physical memory (otherwise, machine is likely under-utilized).

@taoliult
Copy link
Contributor

taoliult commented Dec 3, 2024

in almost all cases, the TestCipherMode test process is picked and killed, while no other test is affected.

no, it didn't sound like that. the machine was even rendered inaccessible, wasn't it? that meant sshd or some other critical process(es) were killed as well.

From the failed case, https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/329/testReport/. It run the sanity.openjdk tests, and only TestCipherMode failed due to "Agent communication error", the other tests were still running and completed as passed.

@zl-wang
Copy link
Contributor

zl-wang commented Dec 3, 2024

From the failed case, https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/329/testReport/. It run the sanity.openjdk tests, and only TestCipherMode failed due to "Agent communication error", the other tests were still running and completed as passed.

it sounded to me a normal pick of process to be killed. as far as i know, the most likely pick is the "culprit" failing to allocate the paging space.

@pshipton
Copy link
Member Author

pshipton commented Dec 5, 2024

Since it's being looked at I stopped reporting the failures, but they are still occurring.
https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/340
https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/340/

Also, I don't know if it's related or not, but AIX machines keep going down and need to be resurrected.
https://github.ibm.com/runtimes/infrastructure/issues/10137

@taoliult
Copy link
Contributor

taoliult commented Dec 5, 2024

@pshipton I run the same test packages under the javax/crypto by login p8-java1-ibm08, using JTReg run 1000 iterations without seeing any issues, all test passed and paging space remain same before and after the tests. Not sure if this related to Jenkins or not, I heard Jenkins recently updated to use JDK21 which consume more memory than before.

@pshipton
Copy link
Member Author

pshipton commented Dec 5, 2024

Excluding the test.
adoptium/aqa-tests#5799

@pshipton
Copy link
Member Author

pshipton commented Dec 6, 2024

Still problems with the jdk_security2 test even with TestCipherMode excluded.

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/341
Cannot contact p8-java1-ibm12: java.lang.InterruptedException

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/340
javax/crypto/CryptoPermissions/TestUnlimited.java
timeout

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/346
Cannot contact p8-java1-ibm03: java.lang.InterruptedException

@taoliult
Copy link
Contributor

taoliult commented Dec 9, 2024

Summarize my updates until now:

  • I was unable to reproduce the issue when logging in the p8-java1-ibm01 and running the tests directly using JTReg, even after 2000 iterations. Therefore, I switched to using Jenkins to see if I could reproduce the issue.

  • In my previous update, I mentioned that I didn’t think the failure was related to the TestCipherMode test case. Also, as shown in the update above, even when the TestCipherMode test is disabled, the issue still happens when running the jdk_security2 tests.

  • Ran the TARGET: -f parallelList.mk testList_1 on p8-java1-ibm01 10 times using Jenkins. The "Cannot contact p8-java1-ibm01" issue reproduced during the first run, job link and exception below, but the other 9 runs completed successfully.

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/345/

[2024-12-06T08:14:46.805Z] TESTING:
[2024-12-06T08:14:47.747Z] Directory "/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/aqa-tests/TKG/../TKG/output_17334671735294/jdk_security2_0/work" not found: creating
[2024-12-06T08:14:47.747Z] Directory "/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/aqa-tests/TKG/../TKG/output_17334671735294/jdk_security2_0/report" not found: creating
[2024-12-06T08:14:57.730Z] XML output with verification to /home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/aqa-tests/TKG/output_17334671735294/jdk_security2_0/work
[2024-12-06T08:23:09.721Z] Cannot contact p8-java1-ibm01: java.lang.InterruptedException
[2024-12-06T10:44:21.329Z] gmake[4]: *** [autoGen.mk:195: jdk_security2_0] Killed
[2024-12-06T10:44:21.329Z] gmake[4]: Leaving directory '/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/aqa-tests/openjdk'
[2024-12-06T10:44:21.329Z] gmake[3]: *** [/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/aqa-tests/TKG/../TKG/settings.mk:361: testList-openjdk] Error 2
[2024-12-06T10:44:21.329Z] gmake[3]: Leaving directory '/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/aqa-tests'
[2024-12-06T10:44:21.329Z] gmake[2]: *** [settings.mk:361: testList-..] Error 2
[2024-12-06T10:44:21.329Z] gmake[2]: Leaving directory '/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/aqa-tests/TKG'
[2024-12-06T10:44:21.329Z] gmake[1]: *** [makefile:70: _testList] Error 2
[2024-12-06T10:44:21.329Z] gmake[1]: Leaving directory '/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/aqa-tests/TKG'
[2024-12-06T10:44:21.329Z] gmake: *** [parallelList.mk:11: testList_1] Error 2
[2024-12-06T10:49:32.535Z] wrapper script does not seem to be touching the log file in /home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1@tmp/durable-47a7e649
[2024-12-06T10:49:32.535Z] (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)

Next, my plan:

  • Run the TARGET: -f parallelList.mk testList_1 on p8-java1-ibm01 10 times. This will take approximately one and a half days. And since the issue has only been reported when running the jdk_security2 tests, so I will focus on running TARGET: jdk_security2 using Jenkins (This will be much faster than running all the tests in testList_1) to check if the issue can be reproduced or not.

  • If the issue is reproducible when running jdk_security2, I will then run the tests with native crypto disabled by using the flag -Djdk.nativeCrypto=false to check if the problem persists.

@pshipton
Copy link
Member Author

pshipton commented Dec 9, 2024

Restored TestCipherMode adoptium/aqa-tests#5805

@taoliult
Copy link
Contributor

  • Run tests under javax/crypto using Jenkins.
100 iterations, all passed without any issue.
https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/356/

50 iterations, all passed without any issue.
https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/354/

10 iterations, all passed without any issue.
https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/353/
  • Run tests “TARGET: jdk_security2” using Jenkins.
10 iterations,  reproduce the issue.
https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/349/

5 iterations,  reproduce the issue.
https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/357/

5 iterations,  with native crypto disabled, reproduce the issue.
https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/358/
  • Run tests “TARGET: -f parallelList.mk testList_1” using Jenkins.
10 iterations,  reproduce the issue.
https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/345/

5 iterations,  reproduce the issue.
https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/348/

Summary:

  • Run the tests under javax/crypto in Jenkins does not reproduce the issue.
  • Run TARGET: jdk_security2 in Jenkins reproduces the issue, regardless of whether native crypto is enabled or disabled (EXTRA_OPTIONS=-Djdk.nativeCrypto=false). This issue is unrelated to native crypto.

Next:

  • Run TARGET: jdk_security2 in Jenkins while monitoring the paging space status to try to identify any findings.

@taoliult
Copy link
Contributor

Run the “TARGET: jdk_security2” for 5 iterations using Jenkins. The issue “Cannot contact p8-java1-ibm01: java.lang.InterruptedException” was reported during the 2nd and 5th iterations.

Real-time memory and paging usage captured during the 2nd iteration found one process/JVM “java 6160518 0.7 32.4G jenkins”. This JVM consumes and continually occupies an extremely large amount of paging space in minutes, causing the Cannot contact p8-java1-ibm01: java.lang.InterruptedException issue and AIX machine to become very slow at 2nd iteration, and cause the machine crashed at the 5th iteration.

Topas Monitor for host:p8-java1-ibm01           EVENTS/QUEUES    FILE/TTY       
Tue Dec 10 20:05:16 2024   Interval:2           Cswitch    1719  Readch     5616
                                                Syscall     924  Writech     783
CPU     User% Kern% Wait% Idle%   Physc  Entc%  Reads         5  Rawin         0
Total     0.8   0.7   2.8  95.8    0.06   3.21  Writes        1  Ttyout      378
                                                Forks         0  Igets         0
Network    BPS  I-Pkts  O-Pkts    B-In   B-Out  Execs         1  Namei        35
Total    1.62K   23.00    0.50   1.20K   431.0  Runqueue   2.50  Dirblk        0
                                                Waitqueue  18.5                 
Disk    Busy%      BPS     TPS  B-Read  B-Writ                   MEMORY         
Total    100.0   14.8M   243.5    268K   14.5M  PAGING           Real,MB   30720
                                                Faults      587  % Comp     96  
FileSystem          BPS    TPS  B-Read  B-Writ  Steals      230  % Noncomp   2  
Total             5.48K   4.50   5.48K       0  PgspIn        2  % Client    2  
                                                PgspOut     232                 
Name           PID  CPU%  PgSp Owner            PageIn       21  PAGING SPACE   
java        6160518  0.7 32.4G jenkins          PageOut     232  Size,MB    8192
lrud         262152  0.2  704K root             Sios        249  % Used     93  
vtiol        786456  0.1 1.25M root                              % Free      7  
java       10879098  0.0 89.6M jenkins          NFS (calls/sec)  
topas      11337764  0.0 5.01M jenkins          SerV2         0  WPAR Activ    0
java        6488300  0.0 63.0M jenkins          CliV2         0  WPAR Total    0
java        9896152  0.0 46.5M jenkins          SerV3         0  Press: "h"-help
sh          4784160  0.0  156K jenkins          CliV3         0         "q"-quit
ps         12714140  0.0  428K jenkins          SerV4         0  
getty       8388680  0.0  600K root             CliV4         0  
gil         2359368  0.0  960K root            
aso         5308578  0.0 1.30M root            
sshd        5177586  0.0  892K jenkins         
clcomd      6357192  0.0 1.68M root            
rpc.lock   13107342  0.0 1.19M root            
sec_rcv     3866742  0.0  448K root            
lock_rcv    3014842  0.0  448K root            
xntpd       4915358  0.0  492K root            
cron        5898426  0.0 1.12M root            
sleep_64    7929986  0.0  888K jenkins  

After a while, the current iteration is killed, Jenkins output as follow. But Jenkins keeps running the other iterations.

15:06:19  TESTING:
15:06:21  Directory "/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/aqa-tests/TKG/../TKG/output_17338600892425/jdk_security2_0/work" not found: creating
15:06:21  Directory "/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/aqa-tests/TKG/../TKG/output_17338600892425/jdk_security2_0/report" not found: creating
15:06:29  XML output with verification to /home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/aqa-tests/TKG/output_17338600892425/jdk_security2_0/work
15:14:26  Cannot contact p8-java1-ibm01: java.lang.InterruptedException
15:24:25  gmake[3]: *** [autoGen.mk:33: jdk_security2_0] Killed
15:24:25  gmake[3]: Leaving directory '/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/aqa-tests/openjdk'
15:24:25  gmake[2]: *** [/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/aqa-tests/TKG/../TKG/settings.mk:361: jdk_security2-openjdk] Error 2
15:24:25  gmake[2]: Leaving directory '/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/aqa-tests'
15:24:25  gmake[1]: *** [settings.mk:361: jdk_security2-..] Error 2
15:24:25  gmake[1]: Leaving directory '/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/aqa-tests/TKG'
15:24:25  gmake: *** [makefile:70: _jdk_security2] Error 2
15:29:34  wrapper script does not seem to be touching the log file in /home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1@tmp/durable-670aac64
15:29:34  (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)

The process/JVM java 6160518 0.7 32.4G jenkins, which consumes an extremely large amount of paging space, is detailed below. It is the JTReg agent server process com.sun.javatest.regtest.agent.AgentServe.

jenkins@p8-java1-ibm01:[/home/jenkins/tao]cat output.txt 
 jenkins  6160518  9896152   4 19:54:40      -  0:35 /home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/jdkbinary/j2sdk-image/bin/java -classpath /home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/jvmtest/openjdk/jtreg/lib/javatest.jar:/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/jvmtest/openjdk/jtreg/lib/jtreg.jar:/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/jvmtest/openjdk/jtreg/lib/junit-platform-console-standalone-1.9.2.jar:/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/jvmtest/openjdk/jtreg/lib/testng-7.3.0.jar:/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/jvmtest/openjdk/jtreg/lib/guice-5.1.0.jar:/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/jvmtest/openjdk/jtreg/lib/jcommander-1.82.jar -ea -esa -Xmx512m -XX:+UseCompressedOops -Xverbosegclog -Djdk.nativeCrypto=false -Djava.library.path=/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/jdkbinary/openjdk-test-image/jdk/jtreg/native --patch-module java.base=/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/aqa-tests/TKG/output_17338600892425/jdk_security2_0/work/patches/java.base -Djava.security.policy=file:/home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/aqa-tests/TKG/../TKG/output_17338600892425/jdk_security2_0/work/jtreg.policy com.sun.javatest.regtest.agent.AgentServer -id 4 -logfile /home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_1/aqa-tests/TKG/output_17338600892425/jdk_security2_0/work/jtData/agentServer.4.trace -allowSetSecurityManager -port 33072 -timeoutFactor 8.0
 jenkins 13303972  7405608   0 20:05:10  pts/1  0:00 grep 6160518

I checked this AgentServer process and found that it runs in a dedicated JVM, separate from the test case JVMs. It functions as a manager and communication interface for the test cases.

@taoliult
Copy link
Contributor

After researching AgentServer's behaviour, I found that tests run without @run main/othervm or @run main/agentvm will execute in their own separate JVM by default. Therefore, the test cases TestCipherMode or ByteBuffers should be executed in their own JVM, separate from the AgentServer JVM.

As of now, I have not found any direct connection between the failures related to TestCipherMode, ByteBuffers, or the javax/crypto package, despite most of the failures being reported in these areas.

For the above finding, consuming an extremely large amount of paging space in the JTReg agent server process (com.sun.javatest.regtest.agent.AgentServer). Details in comment #19962 (comment). @pshipton Do you have any suggestions or know which team might be able to assist with this AgentServer issue?

@jasonkatonica FYI.

@pshipton
Copy link
Member Author

If we catch the process in action of consuming a large amount of memory and successfully get a core and maybe a javacore file, we can dig into these diagnostics to help figure out what is using all the memory. Assuming it's not the JVM itself, I don't have any experience with tracking memory usage on AIX. We can approach the service team to find out how best to track it.

@llxia might know more about the AgentServer or be able to refer to someone else who does.

@llxia
Copy link
Contributor

llxia commented Dec 13, 2024

  • jtreg doc about agentVM and otherVM: https://openjdk.org/jtreg/faq.html#what-are-the-agentvm-and-othervm-modes.

    • In otherVM mode, jtreg will start a new JVM to perform each action (such as @compile, @run main, and so on) in a test.
    • In agentVM mode, jtreg will maintain a pool of available JVMs, grouped according to the VM options specified when the JVM was created.
    • If no option is specified, -othervm is the default.
  • It looks like -Xmx512m did not take effect. However, I am not familiar with AgentServer. Regarding the JVM consuming a large amount of memory in jtreg tests, @smlambert @sophia-guo, do you have any insights? Other than -Xmx512m, is there any other option to limit it?

@pshipton
Copy link
Member Author

FYI -Xmx512m limits the size of the object heap but doesn't limit off heap (native memory) allocations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants