Ansible request for <AIX> x11 setup #2297

Closed
1 task
sophia-guo opened this issue Aug 18, 2021 · 24 comments
@sophia-guo

Please put the name of the software product (and affected platforms if relevant) in the title of this issue

  • x11 setup

Details:
java/beans/XMLEncoder/* failed on AIX jdk16 with java.awt.AWTError: Can't connect to X11 window server using ':0' as the value of the DISPLAY variable

Details adoptium/aqa-tests#2810

@sophia-guo
Author

@sxa

@sxa
Member

sxa commented Aug 18, 2021

As far as I can see, the log that showed this as https://ci.adoptopenjdk.net/job/Test_openjdk16_hs_extended.openjdk_ppc64_aix_testList_1/9/consoleFull says:

23:59:25  + nohup /usr/bin/X11/X -force -vfb -x abx -x dbe -x GLX :0
23:59:25  
23:59:25  Fatal server error:
23:59:25  Cannot establish any listening sockets - Make sure an X server isn't already running

I have logged onto the machine that showed the error - aix71-2 - and there is nothing stopping that process starting up properly. Has it been seen anywhere else i.e. is it reproducible, or could this have been a case where the machine had a leftover process, possibly from a previously terminated job, that was stopping it from starting up properly? I seem to be able to start an X -vfb server on that machine without problems.

@sxa sxa self-assigned this Aug 18, 2021
@sxa sxa added this to the August 2021 milestone Aug 18, 2021
@sophia-guo
Author

sophia-guo commented Aug 18, 2021

This is a consistent issue and I believe it happens on all AIX machines.
test-ibm-aix71-ppc64-1 https://ci.adoptopenjdk.net/job/Test_openjdk16_hs_extended.openjdk_ppc64_aix_testList_0/9/#showFailuresLink

test-osuosl-aix72-ppc64-2
https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/269/

@sxa
Member

sxa commented Aug 20, 2021

This is a consistent issue and I believe it happens on all AIX machines.
test-ibm-aix71-ppc64-1 https://ci.adoptopenjdk.net/job/Test_openjdk16_hs_extended.openjdk_ppc64_aix_testList_0/9/#showFailuresLink

That is the one I mentioned above from four weeks ago - I was interested to know if it had been seen at any other time

test-osuosl-aix72-ppc64-2
https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/269/

That was a run from your branch where you explicitly put in an override to set the DISPLAY to an incorrect value. (You can see from the line above your change that the virtual X server is started on :0 and you're setting your tests to run against a non-existent :1.)

@sxa
Member

sxa commented Aug 25, 2021

aix72-1 had a leftover process from August 6th which was stopping it from starting a new one. That has also now been cleared but we need the test suite modified to be able to handle this situation - it is NOT an infrastructure request for an installation on the machine :-)

@sophia-guo
Author

sophia-guo commented Aug 26, 2021

The issue happened on different machines, so it is definitely reproducible. What is the leftover process on aix72-1, could you confirm if it is a leftover process created by openjdk tests? As for Jenkins, DISPLAY is reset when the Jenkins job is done: https://github.com/adoptium/aqa-tests/pull/1835/files

@sophia-guo
Author

I reran java/beans/XMLEncoder/ on test-ibm-aix71-ppc64-1 and test-osuosl-aix72-ppc64-2, and both runs passed. A second rerun passed too, which means that if there is a leftover process, it was not created by the java/beans/XMLEncoder/ tests. We probably need to find out how the leftover process is created.
https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/285/
https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox_new/286/

@sxa
Member

sxa commented Aug 31, 2021

What is the leftover process on aix72-1, could you confirm if it is a leftover process created by openjdk tests?

It'll be the X -vfb process that the test suite starts up before running anything.

@sxa
Member

sxa commented Sep 23, 2021

@smlambert Has this been discussed in the AQAvit meetings? We'll need to find a way to ensure the X server is terminated at the end of the job, which it may not be at present. Do we have a post-test clean-up phase that we could add this to?

@Haroon-Khel It's possible this was introduced as a result of adoptium/aqa-tests#1835 although that was from over a year ago now, so I wonder if it's possible that the nohup is preventing this from being terminated once the jenkins job ends.

While adding a different port number would probably work around this issue, it would result in process leaks, so I'd be reluctant to implement the changes proposed in adoptium/aqa-tests#2831 for this.
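
One way to guarantee the cleanup discussed here is to tie the server's lifetime to the job script itself. The following is a minimal sketch, not the code in JenkinsfileBase: `run_with_server` is a hypothetical helper, the server command is passed in as a string (on AIX it would be the `X -force -vfb ...` line from the log), and it is deliberately started without `nohup` on the assumption that the server should not outlive the job.

```shell
#!/bin/sh
# Hypothetical helper: run_with_server "<server command>" <test command> [args...]
# The function body is a subshell, so the traps below only apply to this call.
run_with_server() (
    server_cmd=$1
    shift

    # Start the display server in the background and remember its PID.
    # "exec" makes the server replace the intermediate sh, so $! is the server.
    sh -c "exec $server_cmd" >/dev/null 2>&1 &
    server_pid=$!

    # Whatever happens next (success, failure, abort), stop the server and
    # reap it so that nothing is left listening on the display.
    trap 'kill "$server_pid" 2>/dev/null; wait "$server_pid" 2>/dev/null' EXIT
    trap 'exit 130' INT TERM HUP

    # Run the tests; their exit status becomes the status of the subshell.
    "$@"
)
```

Called as `run_with_server "/usr/bin/X11/X -force -vfb -x abx -x dbe -x GLX :0" <test command>`, the server is killed when the test command returns or when the script receives INT, TERM or HUP. A `kill -9` of the script itself would still bypass the trap, so a separate leftover-process check remains useful.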

@sxa sxa modified the milestones: August 2021, September 2021 Sep 23, 2021
@smlambert
Contributor

smlambert commented Sep 23, 2021

In the 'post' stage of a test pipeline, for platforms that use the xvfb plugin (all Linux platforms), the plugin closes/cleans up the process. For AIX, that plugin does not work, so Xvfb is manually launched, and I presume adoptium/aqa-tests#2831 is meant both to address the security scan issue of the process running and to clean up the process in the post stage for that platform.

@sxa
Member

sxa commented Sep 23, 2021

So to be clear, https://github.com/adoptium/aqa-tests/blob/dce1f080f4e7fb1b69b429982aa62e71f54d2a9d/buildenv/jenkins/JenkinsfileBase#L602 is definitely NOT used on Linux because it's started via the Jenkins plugin?

@smlambert
Contributor

@sxa
Member

sxa commented Sep 23, 2021

Gotcha - I hadn't read that syntax as being an invocation of stuff from the plugin. I don't believe that adoptium/aqa-tests#2831 does anything to address the cleanup; it only attempts to cycle the port number so it doesn't hit any leftover one (which is solving the wrong problem IMHO!)
If that post section you reference is executed after each test, would that be a valid place to attempt to kill off the X -vfb process on AIX?

@sxa
Member

sxa commented Sep 23, 2021

Possible solution in adoptium/aqa-tests#2892, but I think we need to determine if the current code is always leaving the process around or not

@sxa
Member

sxa commented Sep 23, 2021

Hmmm even without that change an aborted job still cleaned up the Xvfb process. I'm tempted to leave this, keep a regular eye on it, and try and see which jobs are causing any such processes to be left behind. We also have the option of trying to re-use any existing Xvfb and not just crashing if it can't launch a second on the same DISPLAY if it's owned by the originating user.
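
The re-use option could be sketched roughly as follows. This is an illustration under assumptions, not existing test-suite code: `vfb_action` is a hypothetical name, and it expects a process listing on stdin with one `user pid args...` line per process.

```shell
#!/bin/sh
# Hypothetical helper: vfb_action DISPLAY USER
# Reads "user pid args..." lines on stdin and prints one of:
#   start    - no virtual framebuffer server found on DISPLAY
#   reuse    - a server on DISPLAY is owned by USER
#   conflict - a server on DISPLAY exists but belongs to someone else
vfb_action() {
    awk -v disp="$1" -v me="$2" '
        # A line counts as a vfb server on this display when its arguments
        # contain -vfb and its last field is the display name (e.g. ":0").
        / -vfb( |$)/ && $NF == disp {
            found = 1
            if ($1 == me) mine = 1
        }
        END {
            if (!found)    print "start"
            else if (mine) print "reuse"
            else           print "conflict"
        }'
}
```

On a system whose `ps` accepts the POSIX `-o` syntax (assumed here, not verified on the AIX machines in this thread), the input could come from `ps -e -o user= -o pid= -o args= | vfb_action :0 "$(id -un)"`. Only the `start` case would launch a new server, and only `conflict` would need to fail the job.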

@aixtools
Contributor

aixtools commented Oct 5, 2021

I'll take a look again. iirc, what I saw is that (usually) the X VFB process stopped itself shortly after the job finished. When it did continue to run it took PID 1 as PPID.
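
The PPID observation suggests a simple check for servers that outlived their job. The sketch below is an assumption-laden illustration: `orphaned_vfb` is a hypothetical name, and it expects `pid ppid args...` lines on stdin.

```shell
#!/bin/sh
# Hypothetical helper: prints the PID of every -vfb server whose parent is
# PID 1, i.e. a server that has been re-parented after its job went away.
# Input: one "pid ppid args..." line per process on stdin.
orphaned_vfb() {
    awk '$2 == 1 && / -vfb( |$)/ { print $1 }'
}
```

Assuming a `ps` that accepts the POSIX `-o` syntax, `ps -e -o pid= -o ppid= -o args= | orphaned_vfb` would list the candidates. A server that was started directly by init on purpose would also match, so the output is a list to inspect rather than a list to kill blindly.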

@aixtools
Contributor

aixtools commented Oct 19, 2021

Just adding a comment - the scans done at OSUOSL are still picking up on port 6000 - so regardless of what has been done (or not done) - the issue is still active (as of 11 October 2021)

I'll go back to my PR - and undo the 'generic' code - ie, choosing a port other than 6000 (adoptium/aqa-tests#2831) - and only use the -secIP argument - and hopefully, the issue with the scan is gone (but not the hanging process).
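
For context on why the scans report port 6000: an X server that accepts TCP connections conventionally listens on port 6000 plus its display number, so `:0` maps to 6000. A small sketch of that mapping, with `display_port` as a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: display_port DISPLAY
# Prints the conventional TCP port for an X display such as ":0",
# ":1" or "host:10.0". Returns 1 if no display number can be extracted.
display_port() {
    d=${1##*:}      # drop an optional host part, keeping "N" or "N.S"
    d=${d%%.*}      # drop an optional screen number
    case $d in
        ''|*[!0-9]*) return 1 ;;   # not a plain decimal display number
    esac
    echo $((6000 + d))
}
```

This is also why choosing a different display number only moves the listening port rather than removing it.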

@aixtools
Contributor

FYI: about to kill process - but on ojdk05 this has been hanging since October 10th:

root@p9-aix1-ojdk05:[/root]ps auxwww | grep 34275466
jenkins  34275466  0.6  0.0 9596 8188      - A      Oct 10 1792:56 /usr/bin/X11/X -force -vfb -x abx -x dbe -x GLX :0

@sxa sxa modified the milestones: September 2021, December 2021 Dec 1, 2021
@sxa sxa modified the milestones: December 2021, 2022-01 (January) Jan 6, 2022
@sxa sxa modified the milestones: 2022-03 (March), 2022-05 (May) Apr 14, 2022
@sxa sxa modified the milestones: 2022-05 (May), 2022-07 (July) Jun 30, 2022
@aixtools
Contributor

a) This issue (Ansible request) - as is - can be closed, as it is not the problem (AIX X11 configuration).
b) perhaps a new issue needs to be opened to 'triple' verify there are no other X11 vfb processes running - and/or - adopt my earlier PR that randomizes the port number so that in principle multiple runs could be performed.

In any case - this is not related to ansible playbooks and the issue cannot be resolved via a playbook change.

@sxa
Member

sxa commented Jan 27, 2023

The processCheck job should pick up on incidents of the server process being left around so we should try and keep an eye on that to see if it occurs. I haven't heard of any issues with this recently though.

@sxa
Member

sxa commented Feb 6, 2023

Closing due to the lack of problems being highlighted recently.

@sxa sxa closed this as completed Feb 6, 2023
4 participants