Server hangs in `XSync()` #3503

mdavidsaver · 2022-03-25T21:07:02Z

Describe the bug

I'm seeing instances of a server hanging in XSync(). I don't yet have a useful stack trace with debug symbols. What I have is attached below, and looks somewhat similar to #475.

To Reproduce

tbd. I'm not yet sure how to trigger this issue. I'm going to rebuild with debug information and wait for another occurrence. Other suggestions for troubleshooting are very welcome.

System Information (please complete the following information):

Server OS: Debian 11 / amd64
Client OS: html5
Xpra repo: 6cda08d
Xpra html5 repo: 3d65c16439904b9fc6f80068226cb83ceb92b9bc

Additional context

instance1.txt
instance2.txt

totaam · 2022-03-26T04:56:12Z

Please specify more details about your environment, versions, etc. As per:
https://github.com/Xpra-org/xpra/wiki/Reporting-Bugs

I see this in your backtraces which tells me that this is not a standard setup: /opt/xpra/usr/b

It would help to have debug symbols and to know where in the python event loop code it is failing (not the cython .c generated file):

__pyx_f_4xpra_3x11_4gtk3_12gdk_bindings_parse_xevent (__pyx_v_e_gdk=0x7ffe20c75d70) at xpra/x11/gtk3/gdk_bindings.c:17627

I doubt this is the same problem as #475, but you could always try: XPRA_XSHM=0 xpra start ...
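One possible way to get the Python-level location without a debug build of the Cython module is the standard library's `faulthandler`, whose watchdog can dump the Python stack of every thread after a timeout. The sketch below is a hypothetical helper, not something xpra provides: the names `arm_hang_watchdog` and `disarm_hang_watchdog` and the 30 second default are assumptions for illustration.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout=30.0, logfile=None):
    """Dump every thread's Python stack if not re-armed within `timeout` seconds.

    Hypothetical helper: calling it again replaces the pending timer, so a
    healthy main loop can keep pushing the deadline back from a periodic callback.
    """
    target = logfile if logfile is not None else sys.stderr
    # The file must stay open and have a real file descriptor.
    faulthandler.dump_traceback_later(timeout, repeat=False, file=target)


def disarm_hang_watchdog():
    """Cancel the pending dump, for example on clean shutdown."""
    faulthandler.cancel_dump_traceback_later()
```

If the main loop stops re-arming the timer, the dump should show which Python frame each thread was in at that moment, which may be easier to read than the generated `.c` line numbers.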

mdavidsaver · 2022-03-26T20:31:47Z

Please specify more details ...

xpra showconfig

/opt/xpra/usr/bin/xpra start --daemon=no \
 --chdir=/home/mdavidsaver --start=/usr/local/bin/perpetual-xterm \
 --terminate-children=yes --mdns=no \
 --bind-tcp=0.0.0.0:14500 --tcp-auth=sys \
 :10

I'm seeing this issue in conjunction with the html5 client. I have a group of ~20 users, and only 4 report hangs, though each of these has had multiple occurrences. These users run a variety of browsers (Safari, Firefox, Chrome), and I don't yet see any commonality.

The first symptom is that the xpra server process "freezes": for example, new HTTP connections are no longer accept()ed. This, together with seeing other threads in PyThread_acquire_lock_timed(), suggests to me that the call to gdk_flush() is being made while the GIL is held.
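A minimal sketch of the effect described above, assuming a POSIX system where `ctypes.CDLL(None)` exposes libc's `usleep`. It is not xpra code: it only shows that a blocking foreign call made with the GIL held (via `ctypes.PyDLL`) starves other Python threads, while the same call made through `ctypes.CDLL` does not. The helper name `ticks_during` is made up for this example.

```python
import ctypes
import threading
import time


def ticks_during(call, delay_us=300_000):
    """Count how often a background Python thread runs while call(delay_us) blocks."""
    ticks = 0
    started = threading.Event()
    stop = threading.Event()

    def ticker():
        nonlocal ticks
        started.set()
        while not stop.is_set():
            ticks += 1
            time.sleep(0.001)

    worker = threading.Thread(target=ticker, daemon=True)
    worker.start()
    started.wait()
    before = ticks
    call(delay_us)  # blocks the calling thread for delay_us microseconds
    after = ticks
    stop.set()
    worker.join()
    return after - before


# CDLL releases the GIL around each foreign call, PyDLL keeps it held.
usleep_releasing_gil = ctypes.CDLL(None).usleep
usleep_holding_gil = ctypes.PyDLL(None).usleep
```

Comparing `ticks_during(usleep_releasing_gil)` with `ticks_during(usleep_holding_gil)` should show the background thread making far less progress in the second case, which is the same pattern as other threads sitting in PyThread_acquire_lock_timed().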

My searches for "xsync hang" and "gdk_flush hang" have not been helpful. Reading the man page for XSync() and the source makes it clear that this function will block without timeout until the X server replies (apparently to a GetInputFocus message). The fact that the thread is making a poll() as opposed to futex() suggests to me that this is not a deadlock in xpra, and that the X server is involved somehow.
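The poll() versus futex() distinction can also be checked without gdb on Linux by reading each thread's kernel wait channel from `/proc`. This is a rough sketch under that assumption; the function name is hypothetical, the symbol names reported in `wchan` vary with the kernel version, and some systems restrict the file so that it only shows `0`.

```python
import os


def _read(path):
    """Return the stripped contents of a /proc file, or '?' if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return "?"


def thread_wait_states(pid):
    """Map each thread id of `pid` to (thread name, kernel wait channel)."""
    task_dir = f"/proc/{pid}/task"
    states = {}
    for tid in os.listdir(task_dir):
        states[int(tid)] = (
            _read(f"{task_dir}/{tid}/comm"),
            _read(f"{task_dir}/{tid}/wchan"),
        )
    return states
```

A thread blocked in a poll-related symbol while its siblings wait in futex-related ones would be consistent with the reasoning above, though it does not by itself say what the X server is doing.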

I guess I can get stack traces from the Xvfb process next time. Maybe I'll get lucky and it will be obviously stalled.

Could this be triggered by a misbehaving X client application?

I'm working with a java/openjfx application, which I know to be troublesome wrt. gtk usage. I'm using xpra in part because the combination seems to have the fewest glitches.

So I'm not sure if a stack trace would show if eg. some client application has grabbed the server.

I see this in your backtraces which tells me that this is not a standard setup

I'm running a local build of the git revisions mentioned above against Debian packaged dependencies. The only local change is to xpra/platform/xposix/menu_helper.py. I'm having problems figuring out xdg menu files, so I changed load_xdg_menu_data() to return a static dict. (I still plan to get back to #3471)

It would help to have debug symbols ...

I'm planning to rebuild xpra, passing --with-debug. It looks like Debian 11 does provide debug symbols for X related things, either as dbgsym packages (cf. dbgsym section and find-dbgsym-packages) or via debuginfod (which can be really slow!).

I doubt this is the same problem as #475

I concur. I linked that issue because it is the only other mention of XSync().

totaam · 2022-03-27T10:52:47Z

try setting XPRA_X_SYNC=1 when starting the server.
This will enable XSynchronize.
xtrace / xtruss are very chatty but may be useful to show the last few exchanges before the hang.
long shot: swap Xvfb for Xdummy

I'm working with a java/openjfx application ..

Ah. Those are notoriously flaky.
Sometimes, simply updating the JDK solves the problem!

So I'm not sure if a stack trace would show if eg. some client application has grabbed the server.

It would not - it would look exactly the same as what we have here.
You would need to trace that specific application to see it.

mdavidsaver · 2022-04-03T18:15:02Z

I had one more occurrence, from which I am able to collect a little more information. I am able to leave things running in the hung state for the time being, so I could perform additional postmortem tests if any come to mind.

I was able to capture stack traces of all processes associated (by systemd) with this xpra instance. Unfortunately, while I did install some Debian debug info packages, it looks like I didn't point to a debug build of xpra (oops...).

This may be moot, as the Xvfb process appears to be idling normally. I also don't see anything abnormal in the 4 (of 71) threads in the java/jfx application making glib/gtk calls. (I'll continue looking at the java process as there is a reasonable chance I'm missing something)

I also checked (with netstat and ss) the state of the various socket buffers. The TX/RX queues for all of the unix domain connections are empty, including the X related ones. This is consistent with Xvfb idling normally. (maybe it could be inspected by some X client?)

# ss -xn
Netid State Recv-Q Send-Q                                          Local Address:Port    Peer Address:Port   Process
...
u_str ESTAB 0      0                                         @/tmp/.X11-unix/X10 5265897            * 5265896       
u_str ESTAB 0      0                                         @/tmp/.X11-unix/X10 5264770            * 5264769       
u_str ESTAB 0      0                                         @/tmp/.X11-unix/X10 5264774            * 5264773       
u_str ESTAB 0      0                                         @/tmp/.X11-unix/X10 4812931            * 4812930       
u_str ESTAB 0      0                                         @/tmp/.X11-unix/X10 2267950            * 2268372       
u_str ESTAB 0      0                                         @/tmp/.X11-unix/X10 5264776            * 5264775

The TCP connection queues are not empty, which is expected if the GIL is held for the duration of the XSync().

# netstat -tpn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
...
tcp      870      0 10.136.0.22:14500       training:39066          CLOSE_WAIT 
...
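For repeated checks, the queue inspection can be scripted. The following is a small sketch that assumes the `netstat -tpn` column layout shown above (Proto, Recv-Q, Send-Q, local, foreign, state); the function name is hypothetical.

```python
def nonempty_tcp_queues(netstat_lines):
    """Return (local, peer, recv_q, send_q, state) for TCP rows with queued bytes."""
    rows = []
    for line in netstat_lines:
        fields = line.split()
        # Data rows look like: tcp <Recv-Q> <Send-Q> <local> <peer> <state> ...
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        try:
            recv_q, send_q = int(fields[1]), int(fields[2])
        except ValueError:
            continue  # header or otherwise unparseable row
        if recv_q or send_q:
            rows.append((fields[3], fields[4], recv_q, send_q, fields[5]))
    return rows
```

Fed the output above, this would report only the CLOSE_WAIT connection with 870 bytes waiting in its receive queue, that is, data the server has not read.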

Also, it looks like we're running Sun JDK 11.0.2 atm. Which of course has no debug symbols... openjdk 17.0.2 is also installed, and I thought this was being used. sigh... maybe next time.

Finally, it is unlikely I'll be able to trigger this hang again in the near term. I haven't been able to do so myself, and the event which provided additional users (a training class) has ended. My suspicion atm. is that the xpra hang is somehow a side effect of misbehavior by OpenJFX. As you say, gtk support in jfx is notoriously buggy. (I've looked at the gtk2/3 binding code for both openjfx and SWT, and both are nightmarish rats' nests!) So this ticket could be closed if, as I expect, nothing further can be learned from the information I have provided.

totaam · 2022-04-04T14:39:04Z

I was wrong when I said:

It would not - it would look exactly the same as what we have here.
You would need to trace that specific application to see it.

You would not be able to connect to the X11 server until the lock is released.

As per my previous comment: #3503 (comment)
It could be useful to know which line corresponds to xpra/x11/gtk3/gdk_bindings.c:17627

Without that, I can only suggest running with:

XPRA_X11_DEBUG_EVENTS=all xpra start ...

Which is going to generate a huge amount of debug logging but may show us the event that's triggering the bug.
(or it could just turn it into a Heisenbug and make it disappear)
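Since only the last events before the hang are likely to matter, the log can be trimmed after the fact. A minimal sketch, assuming the debug output was written to a file; the helper name and the default of 200 lines are arbitrary choices for illustration.

```python
from collections import deque


def last_lines(lines, keep=200):
    """Return the final `keep` lines of an iterable without storing the rest."""
    # A bounded deque discards the oldest entry each time a new one arrives.
    return list(deque(lines, maxlen=keep))
```

For example, `last_lines(open(path), 50)` on the server log (whatever `path` is in a given setup) keeps memory use flat however large the log has grown.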

mdavidsaver · 2022-04-04T14:56:28Z

wrt. X server locking. Is there some way I can probe this without restarting the Xvfb process? How complete would this lockout be? eg. could something like xset be expected to succeed?

It could be useful to know which line corresponds to xpra/x11/gtk3/gdk_bindings.c:17627

Sorry, I didn't pick up on this. I have the full gdk_bindings.c. The first comment above gdk_bindings.c:17627 is:

/* "xpra/x11/gtk3/gdk_bindings.pyx":1035
 *         elif etype == PropertyNotify:
 *             pyev.window = _gw(d, e.xany.window)
 *             pyev.atom = trap.call_synced(_get_pyatom, d, e.xproperty.atom)             # <<<<<<<<<<<<<<
 *             pyev.time = e.xproperty.time
 *         elif etype == ConfigureNotify:
 */

also make things consistent and always use an X11 trap sync context so that X11 BadAtom errors will be caught here

totaam · 2022-04-04T15:22:23Z

could something like xset be expected to succeed?

Yes.
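A probe of this kind is safer with a timeout, so that the probing shell does not end up stuck as well. The sketch below wraps an arbitrary command; the function name and the 5 second default are assumptions, and reading a "timeout" result as a sign of a grabbed or unresponsive server is only a guess, not something confirmed in this thread.

```python
import subprocess


def probe(cmd, timeout=5.0):
    """Run cmd and classify the outcome as 'ok', 'failed', 'timeout' or 'missing'."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # the child is killed if it has not exited by then
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    except FileNotFoundError:
        return "missing"
    return "ok" if result.returncode == 0 else "failed"
```

Against the display used in this report it could be called as `probe(["xset", "-display", ":10", "q"])`.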

trap.call_synced(_get_pyatom, d, e.xproperty.atom)

Ah, now that is interesting!
IIRC, we did have a problem like this one before with Java applications and atoms that don't exist.
I am hoping that the commit above will fix that. It has been a while since I've had to touch this sensitive X11 / GDK glue, but the commit does look correct.

The PropertyNotify was one of a few places that was already using a trap.call_synced context, but perhaps this was still confusing GTK when the atom doesn't / no longer exists.
If not, --debug x11 may have what we're looking for.

totaam · 2022-04-26T12:15:18Z

Feel free to re-open if you can still reproduce with 1e56be6 or later.

mdavidsaver added the bug label Mar 25, 2022

totaam added a commit that referenced this issue Apr 4, 2022

#3503 don't use gtk for reading atom names

1e56be6

also make things consistent and always use an X11 trap sync context so that X11 BadAtom errors will be caught here

totaam closed this as completed Apr 26, 2022
