Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server deadlocks when resizing fast updating window (ie: glxgears or mplayer) #475

Closed
totaam opened this issue Dec 17, 2013 · 9 comments
Closed

Comments

@totaam
Copy link
Collaborator

totaam commented Dec 17, 2013

It doesn't crash it just seems to be stuck:

(gdb) bt
#0  0x00000034528eb7fd in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000003455409f42 in poll (__timeout=-1, __nfds=1, __fds=0x7fff3f1f6fb0) at /usr/include/bits/poll2.h:46
#2  _xcb_conn_wait (c=c@entry=0x1054260, cond=cond@entry=0x7fff3f1f7030, vector=vector@entry=0x0, count=count@entry=0x0) at xcb_conn.c:414
#3  0x000000345540b37f in wait_for_reply (c=c@entry=0x1054260, request=16876, e=e@entry=0x7fff3f1f70f8) at xcb_in.c:399
#4  0x000000345540b492 in xcb_wait_for_reply (c=c@entry=0x1054260, request=16876, e=e@entry=0x7fff3f1f70f8) at xcb_in.c:429
#5  0x0000003456041b47 in _XReply (dpy=dpy@entry=0xbe0c60, rep=rep@entry=0x7fff3f1f7140, extra=extra@entry=0, discard=discard@entry=1) at xcb_io.c:602
#6  0x000000345603d76d in XSync (dpy=0xbe0c60, discard=0) at Sync.c:44
#7  0x000000386365c17e in gdk_flush () from /lib64/libgdk-x11-2.0.so.0
#8  0x00007f85b5f69289 in _wrap_gdk_flush (self=<optimized out>) at gdk.c:17325
#9  0x00000033b2e49dd3 in PyObject_Call (func=<built-in function flush>, arg=<optimized out>, kw=kw@entry=0x0)
    at /usr/src/debug/Python-2.7.5/Objects/abstract.c:2529
#10 0x00007f85b4a627e9 in __pyx_pf_4xpra_3x11_7gtk_x11_12gdk_bindings_42_gw (__pyx_self=<optimized out>, __pyx_v_xwin=12582914L, 
    __pyx_v_display=<gtk.gdk.Display at remote 0xfb6780>) at xpra/x11/gtk_x11/gdk_bindings.c:8311
#11 __pyx_pw_4xpra_3x11_7gtk_x11_12gdk_bindings_43_gw (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
    at xpra/x11/gtk_x11/gdk_bindings.c:8147
#12 0x00000033b2e49dd3 in PyObject_Call (func=<built-in function _gw>, arg=<optimized out>, kw=kw@entry=0x0)
    at /usr/src/debug/Python-2.7.5/Objects/abstract.c:2529
#13 0x00007f85b4a66d0b in __pyx_f_4xpra_3x11_7gtk_x11_12gdk_bindings_x_event_filter (__pyx_v_e_gdk=0x7fff3f1f7530, __pyx_v_gdk_event=<optimized out>, 
    __pyx_v_userdata=<optimized out>) at xpra/x11/gtk_x11/gdk_bindings.c:9054
#14 0x000000386365c7b5 in gdk_event_translate () from /lib64/libgdk-x11-2.0.so.0
#15 0x000000386365e968 in _gdk_events_queue () from /lib64/libgdk-x11-2.0.so.0
#16 0x000000386365ea0e in gdk_event_dispatch () from /lib64/libgdk-x11-2.0.so.0
#17 0x00007f85b6f06df6 in g_main_dispatch (context=0x104fbf0) at gmain.c:3054
#18 g_main_context_dispatch (context=context@entry=0x104fbf0) at gmain.c:3630
#19 0x00007f85b6f07148 in g_main_context_iterate (context=0x104fbf0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3701
#20 0x00007f85b6f0754a in g_main_loop_run (loop=0x1e4aa30) at gmain.c:3895
#21 0x0000003862f3ed47 in gtk_main () from /lib64/libgtk-x11-2.0.so.0

Sanitized as:

wait_for_reply()
gtk.gdk.flush()
xpra.x11.gtk_x11.gdk_bindings._gw()
xpra.x11.gtk_x11.gdk_bindings.x_event_filter()
gdk_event_translate()
gdk_event_dispatch()
gtk_main()

And here is another one (with 0.10.x 4000), showing up in a different place:

wait_for_reply()
XGetGeometry()
gdk_window_get_geometry()
resize_corral_window()
xpra.x11.gtk_x11.error.call_synced()
xpra.x11.gtk_x11.window.may_resize_corral_window()

With the X11 server just waiting for something:

(gdb) bt
#0  0x00000034528ed533 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x000000000046692c in WaitForSomething ()
#2  0x0000000000436eb1 in Dispatch ()
#3  0x00000000004266da in main ()

Sometimes it also just kills the X11 server (no xpra backtrace in that case - no X11 backtrace yet..).

In the xpra backtrace case, it looks like we're stuck waiting for a reply from the X11 server, and none is coming...
Possibly because we accessed the server from a non-UI thread somewhere. (Ouch!)
Could also be some library problems (#422: the AMD OpenCL icd is known to cause weird problems)

@totaam
Copy link
Collaborator Author

totaam commented Dec 17, 2013

This is a regression, but not a recent one:

  • 0.9.x does not seem to be affected (could just be harder to hit..)
  • 0.10.x is affected (maybe a little harder to trigger than 0.11.x? races in resize code can be like that..)

(and it could still be related to lib injection from buggy OpenCL / Nvidia bits.)

Let the bisection "fun" begin (using rgb encoding for testing):

So this crash is caused by r3613 which enables XShm (#345) by default.
This can be confirmed by running normally crashing versions (including trunk) without problems by using:

XPRA_XSHM=0 xpra start ...

Problem is that XShm provides a major performance boost, so ideally we want to find a fix for the crash without turning it off completely..

@totaam
Copy link
Collaborator Author

totaam commented Dec 17, 2013

Adding some debug, this is often the last thing that shows before we crash:

2013-12-17 15:29:36,198 resize_corral_window()
Xpra: Fatal IO error 11 (Resource temporarily unavailable) on X server :10.

Interestingly, I also got this error:

shmget failed: error 28 (No space left on device)

Which got me watching the shared memory segments with:

watch "ipcs | wc -l"

And this number goes up with every resize, but never seems to go back down again! Looks like something isn't freeing up the shared memory!

@totaam
Copy link
Collaborator Author

totaam commented Dec 17, 2013

2013-12-17 09:13:17: totaam uploaded file xshm-fixleak.patch (1.6 KiB)

this seems to fix the leak

@totaam
Copy link
Collaborator Author

totaam commented Dec 17, 2013

2013-12-17 09:30:52: totaam uploaded file xshm-fixleak+debug.patch (11.5 KiB)

fixes the leak and adds some debug

@totaam
Copy link
Collaborator Author

totaam commented Dec 17, 2013

With the patch above I see:

2013-12-17 16:19:55,287 XShmWrapper.XShmCreateImage(740x416-24) True ID=203
2013-12-17 16:19:55,287 XShmWrapper.shmget(PRIVATE, 1234320 bytes, 1023) shmid=880476204
2013-12-17 16:19:55,287 XShmWrapper.shmat(880476204, NULL, 0) True
2013-12-17 16:19:55,287 XShmWrapper.XShmAttach(..) True
2013-12-17 16:19:55,287 XShmWrapper.get_image(4195268L, 0, 0, 740, 416)
2013-12-17 16:19:55,289 XShmWrapper.get_image(4195268L, 0, 0, 740, 416) ref_count=1, \
    returning XShmImageWrapper(BGRX: 0, 0, 740, 416, ID=718)
2013-12-17 16:19:55,289 XShmImageWrapper.get_image_pixels() self=XShmImageWrapper(BGRX: 0, 0, 740, 416, ID=718)
2013-12-17 16:19:55,291 WindowModel.do_xpra_configure_event(<AdHocStruct object, contents: \
    {'delivered_to': <gtk.gdk.Window object at 0x3c82910 (GdkWindow at 0x3c47360)>, 'send_event': 0, \
     'height': 418, 'width': 743, 'window': <gtk.gdk.Window object at 0x3c82910 (GdkWindow at 0x3c47360)>, \
     'y': 213, 'x': 94, 'serial': 25676L, 'border_width': 0, 'type': 22, \
     'display': <gtk.gdk.Display object at 0x29867d0 (GdkDisplayX11 at 0x2a18230)>}>) managed=True
2013-12-17 16:19:55,292 resize_corral_window()
2013-12-17 16:19:55,292 corral window geometry: (743, 418)
2013-12-17 16:19:55,292 client window geometry: (0, 0, 743, 418)
2013-12-17 16:19:55,292 WindowModel.do_xpra_configure_event(<AdHocStruct object, contents: \
    {'delivered_to': <gtk.gdk.Window object at 0x3c827d0 (GdkWindow at 0x3c47240)>, 'send_event': 0, \
     'height': 418, 'width': 743, 'window': <gtk.gdk.Window object at 0x3c827d0 (GdkWindow at 0x3c47240)>, \
     'y': 0, 'x': 0, 'serial': 25679L, 'border_width': 0, 'type': 22, \
     'display': <gtk.gdk.Display object at 0x29867d0 (GdkDisplayX11 at 0x2a18230)>}>) managed=True
2013-12-17 16:19:55,292 resize_corral_window()
2013-12-17 16:19:55,292 corral window geometry: (743, 418)
2013-12-17 16:19:55,293 client window geometry: (0, 0, 743, 418)
2013-12-17 16:19:55,310 XShmWrapper.get_image(4194552L, 17, 301, 480, 13)
2013-12-17 16:19:55,311 XShmWrapper.get_image(4194552L, 17, 301, 480, 13) ref_count=1, \
    returning XShmImageWrapper(BGRX: 17, 301, 480, 13, ID=719)
2013-12-17 16:19:55,312 XShmWrapper.cleanup() ID=203, ref_count=1
2013-12-17 16:19:55,313 XShmWrapper.XShmCreateImage(743x418-24) True ID=204
2013-12-17 16:19:55,313 XShmWrapper.shmget(PRIVATE, 1245268 bytes, 1023) shmid=880508973
2013-12-17 16:19:55,313 XShmWrapper.shmat(880508973, NULL, 0) True
2013-12-17 16:19:55,313 XShmWrapper.XShmAttach(..) True
2013-12-17 16:19:55,314 XShmWrapper.get_image(4195272L, 0, 0, 743, 418)
2013-12-17 16:19:55,315 XShmWrapper.get_image(4195272L, 0, 0, 743, 418) ref_count=1, \
    returning XShmImageWrapper(BGRX: 0, 0, 743, 418, ID=720)
2013-12-17 16:19:55,316 XShmImageWrapper.free() ID=718, \
    free_callback=<built-in method free_image of xpra.x11.bindings.ximage.XShmWrapper object at 0x3c16be8>
2013-12-17 16:19:55,317 XShmWrapper.free_image() ID=203, closed=1, new ref_count=0
2013-12-17 16:19:55,318 XShmWrapper.free() ID=203, has_shm=True
2013-12-17 16:19:55,318 XShmWrapper.free() ID=203, image=0x3d6f590L
Xpra: Fatal IO error 11 (Resource temporarily unavailable) on X server :10.

So, it looks to me like we still have an XShmWrapper live (ID=203, ref_count=1) when we call a new get_image following the window resize.
We create a new wrapper (ID=204) and hand out a new image (ID=720), then when we free() the older wrapper, things blow up.

Thoughts:

  • when we resize, we should probably close the wrapper (since it won't work)
  • waiting for the last picture to decrement the semaphore to zero may not be the best approach:
  • maybe we should weak reference the images and copy their pixels out of XShm when we want to free the wrapper
  • or just keep the wrapper around until it is properly freed (all images freed) and just fallback to the non-xshm code path
    Both solutions require memory copying - I think the second one is cleaner.

@totaam
Copy link
Collaborator Author

totaam commented Dec 17, 2013

2013-12-17 15:57:12: totaam uploaded file xshm-single.patch (2.8 KiB)

forces us to not use more than one XShm mapping per window

@totaam
Copy link
Collaborator Author

totaam commented Dec 17, 2013

2013-12-17 16:01:04: totaam uploaded file xshm-debugging.patch (10.9 KiB)

improve xshm debugging and add unique ids and thread checking

@totaam
Copy link
Collaborator Author

totaam commented Dec 17, 2013

  • The xshm-fixleak.patch is valid and important: applied in r4962.
  • The reference issue was a red herring: it is not illegal to have more than one XShm handle for the same pixmap, just a bit wasteful - but in our case they will be of different sizes, which is why we need them. The xshm-single.patch implements a one-mapping-per-window policy, this makes things slower for no real benefit.

  • Even with those fixes, I still got some ominous crashes, including one or two claiming that we were accessing the X11 server multi-threaded. Surely not, right? Then I added lots of debug: xshm-debugging.patch and eventually found that yes, we are doing something very naughty...

  • The real problem comes from how we free the image wrapper class, which in turn may free the xshm wrapper if it is waiting to be closed: this call may come from another thread - and non-UI threads are not allowed to talk to the X11 server!
    Fixed in r4963.


0.10.x backports in: 4965 and 4966 - keeping ticket opened for testing.

@totaam
Copy link
Collaborator Author

totaam commented Dec 19, 2013

0.10.10 released with the fix after testing ok - closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant