server deadlocks when resizing fast updating window (ie: glxgears or mplayer) #475

totaam · 2013-12-17T05:36:41Z

It doesn't crash it just seems to be stuck:

(gdb) bt
#0  0x00000034528eb7fd in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000003455409f42 in poll (__timeout=-1, __nfds=1, __fds=0x7fff3f1f6fb0) at /usr/include/bits/poll2.h:46
#2  _xcb_conn_wait (c=c@entry=0x1054260, cond=cond@entry=0x7fff3f1f7030, vector=vector@entry=0x0, count=count@entry=0x0) at xcb_conn.c:414
#3  0x000000345540b37f in wait_for_reply (c=c@entry=0x1054260, request=16876, e=e@entry=0x7fff3f1f70f8) at xcb_in.c:399
#4  0x000000345540b492 in xcb_wait_for_reply (c=c@entry=0x1054260, request=16876, e=e@entry=0x7fff3f1f70f8) at xcb_in.c:429
#5  0x0000003456041b47 in _XReply (dpy=dpy@entry=0xbe0c60, rep=rep@entry=0x7fff3f1f7140, extra=extra@entry=0, discard=discard@entry=1) at xcb_io.c:602
#6  0x000000345603d76d in XSync (dpy=0xbe0c60, discard=0) at Sync.c:44
#7  0x000000386365c17e in gdk_flush () from /lib64/libgdk-x11-2.0.so.0
#8  0x00007f85b5f69289 in _wrap_gdk_flush (self=<optimized out>) at gdk.c:17325
#9  0x00000033b2e49dd3 in PyObject_Call (func=<built-in function flush>, arg=<optimized out>, kw=kw@entry=0x0)
    at /usr/src/debug/Python-2.7.5/Objects/abstract.c:2529
#10 0x00007f85b4a627e9 in __pyx_pf_4xpra_3x11_7gtk_x11_12gdk_bindings_42_gw (__pyx_self=<optimized out>, __pyx_v_xwin=12582914L, 
    __pyx_v_display=<gtk.gdk.Display at remote 0xfb6780>) at xpra/x11/gtk_x11/gdk_bindings.c:8311
#11 __pyx_pw_4xpra_3x11_7gtk_x11_12gdk_bindings_43_gw (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
    at xpra/x11/gtk_x11/gdk_bindings.c:8147
#12 0x00000033b2e49dd3 in PyObject_Call (func=<built-in function _gw>, arg=<optimized out>, kw=kw@entry=0x0)
    at /usr/src/debug/Python-2.7.5/Objects/abstract.c:2529
#13 0x00007f85b4a66d0b in __pyx_f_4xpra_3x11_7gtk_x11_12gdk_bindings_x_event_filter (__pyx_v_e_gdk=0x7fff3f1f7530, __pyx_v_gdk_event=<optimized out>, 
    __pyx_v_userdata=<optimized out>) at xpra/x11/gtk_x11/gdk_bindings.c:9054
#14 0x000000386365c7b5 in gdk_event_translate () from /lib64/libgdk-x11-2.0.so.0
#15 0x000000386365e968 in _gdk_events_queue () from /lib64/libgdk-x11-2.0.so.0
#16 0x000000386365ea0e in gdk_event_dispatch () from /lib64/libgdk-x11-2.0.so.0
#17 0x00007f85b6f06df6 in g_main_dispatch (context=0x104fbf0) at gmain.c:3054
#18 g_main_context_dispatch (context=context@entry=0x104fbf0) at gmain.c:3630
#19 0x00007f85b6f07148 in g_main_context_iterate (context=0x104fbf0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3701
#20 0x00007f85b6f0754a in g_main_loop_run (loop=0x1e4aa30) at gmain.c:3895
#21 0x0000003862f3ed47 in gtk_main () from /lib64/libgtk-x11-2.0.so.0

Sanitized as:

wait_for_reply()
gtk.gdk.flush()
xpra.x11.gtk_x11.gdk_bindings._gw()
xpra.x11.gtk_x11.gdk_bindings.x_event_filter()
gdk_event_translate()
gdk_event_dispatch()
gtk_main()

And here is another one (with 0.10.x 4000), showing up in a different place:

wait_for_reply()
XGetGeometry()
gdk_window_get_geometry()
resize_corral_window()
xpra.x11.gtk_x11.error.call_synced()
xpra.x11.gtk_x11.window.may_resize_corral_window()

With the X11 server just waiting for something:

(gdb) bt
#0  0x00000034528ed533 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x000000000046692c in WaitForSomething ()
#2  0x0000000000436eb1 in Dispatch ()
#3  0x00000000004266da in main ()

Sometimes it also just kills the X11 server (no xpra backtrace in that case - no X11 backtrace yet..).

In the xpra backtrace case, it looks like we're stuck waiting for a reply from the X11 server, and none is coming...
Possibly because we accessed the server from a non-UI thread somewhere. (Ouch!)
Could also be some library problems (#422: the AMD OpenCL icd is known to cause weird problems)

The text was updated successfully, but these errors were encountered:

totaam · 2013-12-17T07:28:21Z

This is a regression, but not a recent one:

0.9.x does not seem to be affected (could just be harder to hit..)
0.10.x is affected (maybe a little harder to trigger than 0.11.x? races in resize code can be like that..)

(and it could still be related to lib injection from buggy OpenCL / Nvidia bits.)

Let the bisection "fun" begin (using rgb encoding for testing):

4000 (0.10.x RC): bad
r3800: bad
r3150 (0.9.0): ok?
r3475 (+ patch to fix PIL.Image import on F19..): ok?
r3632: bad
r3543 (+ similar PIL patch): ok?
r3588: ok?
r3610: ok?
r3621: bad
r3613 (stands out as a big change..): bad
r3612: ok

So this crash is caused by r3613 which enables XShm (#345) by default.
This can be confirmed by running normally crashing versions (including trunk) without problems by using:

XPRA_XSHM=0 xpra start ...

Problem is that XShm provides a major performance boost, so ideally we want to find a fix for the crash without turning it off completely..

totaam · 2013-12-17T08:42:25Z

Adding some debug, this is often the last thing that shows before we crash:

2013-12-17 15:29:36,198 resize_corral_window()
Xpra: Fatal IO error 11 (Resource temporarily unavailable) on X server :10.

Interestingly, I also got this error:

shmget failed: error 28 (No space left on device)

Which got me watching the shared memory segments with:

watch "ipcs | wc -l"

And this number goes up with every resize, but never seems to go back down again! Looks like something isn't freeing up the shared memory!

totaam · 2013-12-17T09:13:17Z

2013-12-17 09:13:17: totaam uploaded file `xshm-fixleak.patch` (1.6 KiB)

this seems to fix the leak

totaam · 2013-12-17T09:30:52Z

2013-12-17 09:30:52: totaam uploaded file `xshm-fixleak+debug.patch` (11.5 KiB)

fixes the leak and adds some debug

totaam · 2013-12-17T09:42:40Z

With the patch above I see:

2013-12-17 16:19:55,287 XShmWrapper.XShmCreateImage(740x416-24) True ID=203
2013-12-17 16:19:55,287 XShmWrapper.shmget(PRIVATE, 1234320 bytes, 1023) shmid=880476204
2013-12-17 16:19:55,287 XShmWrapper.shmat(880476204, NULL, 0) True
2013-12-17 16:19:55,287 XShmWrapper.XShmAttach(..) True
2013-12-17 16:19:55,287 XShmWrapper.get_image(4195268L, 0, 0, 740, 416)
2013-12-17 16:19:55,289 XShmWrapper.get_image(4195268L, 0, 0, 740, 416) ref_count=1, \
    returning XShmImageWrapper(BGRX: 0, 0, 740, 416, ID=718)
2013-12-17 16:19:55,289 XShmImageWrapper.get_image_pixels() self=XShmImageWrapper(BGRX: 0, 0, 740, 416, ID=718)
2013-12-17 16:19:55,291 WindowModel.do_xpra_configure_event(<AdHocStruct object, contents: \
    {'delivered_to': <gtk.gdk.Window object at 0x3c82910 (GdkWindow at 0x3c47360)>, 'send_event': 0, \
     'height': 418, 'width': 743, 'window': <gtk.gdk.Window object at 0x3c82910 (GdkWindow at 0x3c47360)>, \
     'y': 213, 'x': 94, 'serial': 25676L, 'border_width': 0, 'type': 22, \
     'display': <gtk.gdk.Display object at 0x29867d0 (GdkDisplayX11 at 0x2a18230)>}>) managed=True
2013-12-17 16:19:55,292 resize_corral_window()
2013-12-17 16:19:55,292 corral window geometry: (743, 418)
2013-12-17 16:19:55,292 client window geometry: (0, 0, 743, 418)
2013-12-17 16:19:55,292 WindowModel.do_xpra_configure_event(<AdHocStruct object, contents: \
    {'delivered_to': <gtk.gdk.Window object at 0x3c827d0 (GdkWindow at 0x3c47240)>, 'send_event': 0, \
     'height': 418, 'width': 743, 'window': <gtk.gdk.Window object at 0x3c827d0 (GdkWindow at 0x3c47240)>, \
     'y': 0, 'x': 0, 'serial': 25679L, 'border_width': 0, 'type': 22, \
     'display': <gtk.gdk.Display object at 0x29867d0 (GdkDisplayX11 at 0x2a18230)>}>) managed=True
2013-12-17 16:19:55,292 resize_corral_window()
2013-12-17 16:19:55,292 corral window geometry: (743, 418)
2013-12-17 16:19:55,293 client window geometry: (0, 0, 743, 418)
2013-12-17 16:19:55,310 XShmWrapper.get_image(4194552L, 17, 301, 480, 13)
2013-12-17 16:19:55,311 XShmWrapper.get_image(4194552L, 17, 301, 480, 13) ref_count=1, \
    returning XShmImageWrapper(BGRX: 17, 301, 480, 13, ID=719)
2013-12-17 16:19:55,312 XShmWrapper.cleanup() ID=203, ref_count=1
2013-12-17 16:19:55,313 XShmWrapper.XShmCreateImage(743x418-24) True ID=204
2013-12-17 16:19:55,313 XShmWrapper.shmget(PRIVATE, 1245268 bytes, 1023) shmid=880508973
2013-12-17 16:19:55,313 XShmWrapper.shmat(880508973, NULL, 0) True
2013-12-17 16:19:55,313 XShmWrapper.XShmAttach(..) True
2013-12-17 16:19:55,314 XShmWrapper.get_image(4195272L, 0, 0, 743, 418)
2013-12-17 16:19:55,315 XShmWrapper.get_image(4195272L, 0, 0, 743, 418) ref_count=1, \
    returning XShmImageWrapper(BGRX: 0, 0, 743, 418, ID=720)
2013-12-17 16:19:55,316 XShmImageWrapper.free() ID=718, \
    free_callback=<built-in method free_image of xpra.x11.bindings.ximage.XShmWrapper object at 0x3c16be8>
2013-12-17 16:19:55,317 XShmWrapper.free_image() ID=203, closed=1, new ref_count=0
2013-12-17 16:19:55,318 XShmWrapper.free() ID=203, has_shm=True
2013-12-17 16:19:55,318 XShmWrapper.free() ID=203, image=0x3d6f590L
Xpra: Fatal IO error 11 (Resource temporarily unavailable) on X server :10.

So, it looks to me like we still have an XShmWrapper live (ID=203, ref_count=1) when we call a new get_image following the window resize.
We create a new wrapper (ID=204) and hand out a new image (ID=720), then when we free() the older wrapper, things blow up.

Thoughts:

when we resize, we should probably close the wrapper (since it won't work)
waiting for the last picture to decrement the semaphore to zero may not be the best approach:

maybe we should weak reference the images and copy their pixels out of XShm when we want to free the wrapper
or just keep the wrapper around until it is properly freed (all images freed) and just fallback to the non-xshm code path
Both solutions require memory copying - I think the second one is cleaner.

totaam · 2013-12-17T15:57:12Z

2013-12-17 15:57:12: totaam uploaded file `xshm-single.patch` (2.8 KiB)

forces us to not use more than one XShm mapping per window

totaam · 2013-12-17T16:01:04Z

2013-12-17 16:01:04: totaam uploaded file `xshm-debugging.patch` (10.9 KiB)

improve xshm debugging and add unique ids and thread checking

totaam · 2013-12-17T16:09:50Z

The xshm-fixleak.patch is valid and important: applied in r4962.

The reference issue was a red herring: it is not illegal to have more than one XShm handle for the same pixmap, just a bit wasteful - but in our case they will be of different sizes, which is why we need them. The xshm-single.patch implements a one-mapping-per-window policy, this makes things slower for no real benefit.
Even with those fixes, I still got some ominous crashes, including one or two claiming that we were accessing the X11 server multi-threaded. Surely not, right? Then I added lots of debug: xshm-debugging.patch and eventually found that yes, we are doing something very naughty...
The real problem comes from how we free the image wrapper class, which in turn may free the xshm wrapper if it is waiting to be closed: this call may come from another thread - and non-UI threads are not allowed to talk to the X11 server!
Fixed in r4963.

0.10.x backports in: 4965 and 4966 - keeping ticket opened for testing.

totaam · 2013-12-19T11:20:59Z

0.10.10 released with the fix after testing ok - closing

totaam closed this as completed Dec 19, 2013

totaam added the v0.10.x label Jan 22, 2021

mdavidsaver mentioned this issue Mar 25, 2022

Server hangs in XSync() #3503

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server deadlocks when resizing fast updating window (ie: glxgears or mplayer) #475

server deadlocks when resizing fast updating window (ie: glxgears or mplayer) #475

totaam commented Dec 17, 2013 •

edited

Loading

totaam commented Dec 17, 2013 •

edited

Loading

totaam commented Dec 17, 2013 •

edited

Loading

totaam commented Dec 17, 2013

totaam commented Dec 17, 2013

totaam commented Dec 17, 2013 •

edited

Loading

totaam commented Dec 17, 2013

totaam commented Dec 17, 2013

totaam commented Dec 17, 2013 •

edited

Loading

totaam commented Dec 19, 2013 •

edited

Loading

server deadlocks when resizing fast updating window (ie: glxgears or mplayer) #475

server deadlocks when resizing fast updating window (ie: glxgears or mplayer) #475

Comments

totaam commented Dec 17, 2013 • edited Loading

totaam commented Dec 17, 2013 • edited Loading

totaam commented Dec 17, 2013 • edited Loading

totaam commented Dec 17, 2013

2013-12-17 09:13:17: totaam uploaded file xshm-fixleak.patch (1.6 KiB)

totaam commented Dec 17, 2013

2013-12-17 09:30:52: totaam uploaded file xshm-fixleak+debug.patch (11.5 KiB)

totaam commented Dec 17, 2013 • edited Loading

totaam commented Dec 17, 2013

2013-12-17 15:57:12: totaam uploaded file xshm-single.patch (2.8 KiB)

totaam commented Dec 17, 2013

2013-12-17 16:01:04: totaam uploaded file xshm-debugging.patch (10.9 KiB)

totaam commented Dec 17, 2013 • edited Loading

totaam commented Dec 19, 2013 • edited Loading

totaam commented Dec 17, 2013 •

edited

Loading

totaam commented Dec 17, 2013 •

edited

Loading

totaam commented Dec 17, 2013 •

edited

Loading

2013-12-17 09:13:17: totaam uploaded file `xshm-fixleak.patch` (1.6 KiB)

2013-12-17 09:30:52: totaam uploaded file `xshm-fixleak+debug.patch` (11.5 KiB)

totaam commented Dec 17, 2013 •

edited

Loading

2013-12-17 15:57:12: totaam uploaded file `xshm-single.patch` (2.8 KiB)

2013-12-17 16:01:04: totaam uploaded file `xshm-debugging.patch` (10.9 KiB)

totaam commented Dec 17, 2013 •

edited

Loading

totaam commented Dec 19, 2013 •

edited

Loading