Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delayed display output when using GLX backend and VSYNC on AMD #1345

Closed
christian-heusel opened this issue Sep 29, 2024 · 27 comments
Closed

Comments

@christian-heusel
Copy link

christian-heusel commented Sep 29, 2024

Platform

Arch Linux (Rolling) amd64

GPU, drivers, and screen setup

glxinfo -B
name of display: :0
display: :0  screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD Radeon Graphics (radeonsi, renoir, LLVM 18.1.8, DRM 3.58, 6.11.1-rc1-1home) (0x164c)
    Version: 24.2.3
    Accelerated: yes
    Video memory: 1024MB
    Unified memory: no
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
Memory info (GL_ATI_meminfo):
    VBO free memory - total: 384 MB, largest block: 384 MB
    VBO free aux. memory - total: 6947 MB, largest block: 6947 MB
    Texture free memory - total: 384 MB, largest block: 384 MB
    Texture free aux. memory - total: 6947 MB, largest block: 6947 MB
    Renderbuffer free memory - total: 384 MB, largest block: 384 MB
    Renderbuffer free aux. memory - total: 6947 MB, largest block: 6947 MB
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 1024 MB
    Total available memory: 8425 MB
    Currently available dedicated video memory: 384 MB
OpenGL vendor string: AMD
OpenGL renderer string: AMD Radeon Graphics (radeonsi, renoir, LLVM 18.1.8, DRM 3.58, 6.11.1-rc1-1home)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 24.2.3-arch1.1
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.6 (Compatibility Profile) Mesa 24.2.3-arch1.1
OpenGL shading language version string: 4.60
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.2 Mesa 24.2.3-arch1.1
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

Environment

i3 version 4.23 (which has gaps support merged)

picom version

$ picom --version                                  
v12.1 (/startdir/picom revision c321da4)
Diagnostics
$ picom --diagnostics --config ~/.config/i3/picom.conf 
[ 09/29/2024 12:56:52.693 parse_config_libconfig WARN ] The refresh-rate option has been deprecated. Please remove it from your configuration file. If you encounter any problems without this feature, please feel free to open a bug report
[ 09/29/2024 12:56:52.693 c2_parse_target WARN ] Type specifier is deprecated. Type "a" specified on target "_NET_WM_STATE" will be ignored, you can remove it.
[ 09/29/2024 12:56:52.693 c2_parse_target WARN ] Format specifier is deprecated. Format "32" specified on target "_NET_WM_STATE" will be ignored, you can remove it.
[ 09/29/2024 12:56:52.693 c2_parse_target WARN ] Type specifier is deprecated. Type "c" specified on target "_GTK_FRAME_EXTENTS" will be ignored, you can remove it.
[ 09/29/2024 12:56:52.693 c2_parse_target WARN ] Type specifier is deprecated. Type "c" specified on target "_GTK_FRAME_EXTENTS" will be ignored, you can remove it.
**Version:** v12.1 (/startdir/picom revision c321da4)

### Extensions:

* Shape: Yes
* RandR: Yes
* Present: Present

### Misc:

* Use Overlay: No

* Config file specified: /home/chris/.config/i3/picom.conf
* Config file used: /home/chris/.config/i3/picom.conf

### Drivers (inaccurate):

modesetting

### Backend: glx

* Driver vendors:
 * GLX: Mesa Project and SGI
 * GL: AMD
* GL renderer: AMD Radeon Graphics (radeonsi, renoir, LLVM 18.1.8, DRM 3.58, 6.11.1-rc1-1home)
* Accelerated: 1
[ 09/29/2024 12:56:52.933 egl_init WARN ] The egl backend is still experimental, use with care.

### Backend: egl

* Driver vendors:
 * EGL: Mesa Project
 * EGL driver: radeonsi
 * GL: AMD
* GL renderer: AMD Radeon Graphics (radeonsi, renoir, LLVM 18.1.8, DRM 3.58, 6.11.1-rc1-1home)

Configuration:

Configuration file
# Thank you code_nomad: http://9m.no/ꪯ鵞
# and Arch Wiki contributors: https://wiki.archlinux.org/index.php/Compton

#################################
# Backend
#################################

# Backend to use: "xrender" or "glx".
# GLX backend is typically much faster but depends on a sane driver.
backend = "glx";
# backend = "xrender"

#################################
# GLX backend
#################################

glx-no-stencil = true;

# GLX backend: Copy unmodified regions from front buffer instead of redrawing them all.
# My tests with nvidia-drivers show a 10% decrease in performance when the whole screen is modified,
# but a 20% increase when only 1/4 is.
# My tests on nouveau show terrible slowdown.
glx-copy-from-front = false;

# GLX backend: Use MESA_copy_sub_buffer to do partial screen update.
# My tests on nouveau shows a 200% performance boost when only 1/4 of the screen is updated.
# May break VSync and is not available on some drivers.
# Overrides --glx-copy-from-front.
# glx-use-copysubbuffermesa = true;

# GLX backend: Avoid rebinding pixmap on window damage.
# Probably could improve performance on rapid window content changes, but is known to break things on some drivers (LLVMpipe).
# Recommended if it works.
# glx-no-rebind-pixmap = true;

# GLX backend: GLX buffer swap method we assume.
# Could be undefined (0), copy (1), exchange (2), 3-6, or buffer-age (-1).
# undefined is the slowest and the safest, and the default value.
# copy is fastest, but may fail on some drivers,
# 2-6 are gradually slower but safer (6 is still faster than 0).
# Usually, double buffer means 2, triple buffer means 3.
# buffer-age means auto-detect using GLX_EXT_buffer_age, supported by some drivers.
# Useless with --glx-use-copysubbuffermesa.
# Partially breaks --resize-damage.
# Defaults to undefined.
#glx-swap-method = "undefined";

#################################
# Shadows
#################################

# Enabled client-side shadows on windows.
shadow = false;
# The blur radius for shadows. (default 12)
shadow-radius = 5;
# The left offset for shadows. (default -15)
shadow-offset-x = -5;
# The top offset for shadows. (default -15)
shadow-offset-y = -5;
# The translucency for shadows. (default .75)
shadow-opacity = 0.5;

log-level = "warn";
#change your username here
#log-file = "/home/erik/.config/compton.log";

# Set if you want different colour shadows
# shadow-red = 0.0;
# shadow-green = 0.0;
# shadow-blue = 0.0;

# The shadow exclude options are helpful if you have shadows enabled. Due to the way compton draws its shadows, certain applications will have visual glitches
# (most applications are fine, only apps that do weird things with xshapes or argb are affected).
# This list includes all the affected apps I found in my testing. The "! name~=''" part excludes shadows on any "Unknown" windows, this prevents a visual glitch with the XFWM alt tab switcher.
shadow-exclude = [
    "! name~=''",
    "name = 'Notification'",
    "name = 'Plank'",
    "name = 'Docky'",
    "name = 'Kupfer'",
    "name = 'xfce4-notifyd'",
    "name *= 'VLC'",
    "name *= 'compton'",
    "name *= 'Chromium'",
    "name *= 'Chrome'",
    "class_g = 'Firefox' && argb",
    "class_g = 'Conky'",
    "class_g = 'Kupfer'",
    "class_g = 'Synapse'",
    "class_g ?= 'Notify-osd'",
    "class_g ?= 'Cairo-dock'",
    "class_g = 'Cairo-clock'",
    "class_g ?= 'Xfce4-notifyd'",
    "class_g ?= 'Xfce4-power-manager'",
    "_GTK_FRAME_EXTENTS@:c",
    "_NET_WM_STATE@:32a *= '_NET_WM_STATE_HIDDEN'"
];
# Avoid drawing shadow on all shaped windows (see also: --detect-rounded-corners)
shadow-ignore-shaped = false;

#################################
# Opacity
#################################

inactive-opacity = 1;
active-opacity = 1;
frame-opacity = 1;
inactive-opacity-override = false;

# Dim inactive windows. (0.0 - 1.0)
# inactive-dim = 0.2;
# Do not let dimness adjust based on window opacity.
# inactive-dim-fixed = true;
# Blur background of transparent windows. Bad performance with X Render backend. GLX backend is preferred.
# blur-background = true;
# Blur background of opaque windows with transparent frames as well.
# blur-background-frame = true;
# Do not let blur radius adjust based on window opacity.
blur-background-fixed = false;
blur-background-exclude = [
    "window_type = 'dock'",
    "window_type = 'desktop'",
    "_GTK_FRAME_EXTENTS@:c"
];

#################################
# Fading
#################################

# Fade windows during opacity changes.
fading = true;
# The time between steps in a fade in milliseconds. (default 10).
fade-delta = 2;
# Opacity change between steps while fading in. (default 0.028).
fade-in-step = 0.03;
# Opacity change between steps while fading out. (default 0.03).
fade-out-step = 0.03;
# Fade windows in/out when opening/closing
# no-fading-openclose = true;

# Specify a list of conditions of windows that should not be faded.
fade-exclude = [ ];

#################################
# Other
#################################

# Try to detect WM windows and mark them as active.
mark-wmwin-focused = true;
# Mark all non-WM but override-redirect windows active (e.g. menus).
mark-ovredir-focused = true;
# Use EWMH _NET_WM_ACTIVE_WINDOW to determine which window is focused instead of using FocusIn/Out events.
# Usually more reliable but depends on a EWMH-compliant WM.
use-ewmh-active-win = true;
# Detect rounded corners and treat them as rectangular when --shadow-ignore-shaped is on.
detect-rounded-corners = true;

# Detect _NET_WM_OPACITY on client windows, useful for window managers not passing _NET_WM_OPACITY of client windows to frame windows.
# This prevents opacity being ignored for some apps.
# For example without this enabled my xfce4-notifyd is 100% opacity no matter what.
detect-client-opacity = true;

# Specify refresh rate of the screen.
# If not specified or 0, compton will try detecting this with X RandR extension.
refresh-rate = 0;

# Vertical synchronization: match the refresh rate of the monitor
# this breaks transparency in virtualbox - put a "#" before next line to fix that
vsync = true;

# Enable DBE painting mode, intended to use with VSync to (hopefully) eliminate tearing.
# Reported to have no effect, though.
dbe = false;

# Limit compton to repaint at most once every 1 / refresh_rate second to boost performance.
# This should not be used with --vsync drm/opengl/opengl-oml as they essentially does --sw-opti's job already,
# unless you wish to specify a lower refresh rate than the actual value.
#sw-opti = true;

# Unredirect all windows if a full-screen opaque window is detected, to maximize performance for full-screen windows, like games.
# Known to cause flickering when redirecting/unredirecting windows.
unredir-if-possible = false;

# Specify a list of conditions of windows that should always be considered focused.
focus-exclude = [ ];

# Use WM_TRANSIENT_FOR to group windows, and consider windows in the same group focused at the same time.
detect-transient = true;
# Use WM_CLIENT_LEADER to group windows, and consider windows in the same group focused at the same time.
# WM_TRANSIENT_FOR has higher priority if --detect-transient is enabled, too.
detect-client-leader = true;

#################################
# Window type settings
#################################

wintypes:
{
  tooltip = { fade = true; shadow = true; opacity = 0.9; focus = true;};
  dock = { shadow = false; }
  dnd = { shadow = false; }
  popup_menu = { opacity = 0.9; }
  dropdown_menu = { opacity = 0.9; }
};

######################
# XSync
# See: https://github.com/yshui/compton/commit/b18d46bcbdc35a3b5620d817dd46fbc76485c20d
######################

# Use X Sync fence to sync clients' draw calls. Needed on nvidia-drivers with GLX backend for some users.
xrender-sync-fence = true;

Steps of reproduction

  1. Use picom with the glx backend and vsync enabled
  2. Use some program (in my case neomutt)
  3. observe that it's lagging from time to time for normal usage

You can i.e. see in the following videos that the clicks of my keyboard usually result in an immediate change but sometimes it needs some time until something happens:

2024-09-29.12-07-48.mp4

Expected behavior

Everything feels snappy.

Current Behavior

Have output lags from time to time

Other details

See the video above.


cc @Antiz96

@yshui
Copy link
Owner

yshui commented Sep 30, 2024

does this still happen with glx backend?

(reason i am asking: if glx backend works fine, this is probably a driver problem with xrender; if it doesn't, this is more likely to be a damage tracking problem in picom).

@Antiz96
Copy link

Antiz96 commented Sep 30, 2024

does this still happen with glx backend?

(reason i am asking: if glx backend works fine, this is probably a driver problem with xrender; if it doesn't, this is more likely to be a damage tracking problem in picom).

Hi,

It only happens with the glx backend actually. Using the xrender backend works fine.

@yshui
Copy link
Owner

yshui commented Sep 30, 2024

@Antiz96 oh, ok. I saw backend = "xrender" in your config and thought you were using xrender 😅

in that case I am incline to think this is a driver problem then. does this happen with minimal config? i.e. picom --config /dev/null --backend glx ?

@Antiz96
Copy link

Antiz96 commented Sep 30, 2024

@Antiz96 oh, ok. I saw backend = "xrender" in your config and thought you were using xrender 😅

Ah, I assume this is a mistake, probably a relic of tests we made. Sorry about that. I am not the one that reported the issue, I'll let @christian-heusel confirm.

@christian-heusel
Copy link
Author

Oh yeah sorry for that, having the xrender back in there this was from debugging .. Seems like I grabbed the config at an unfortunate time! 😅 (will edit my post accordingly)

in that case I am incline to think this is a driver problem then. does this happen with minimal config? i.e. picom --config /dev/null --backend glx ?

No, the problem does not occur with the minimal config ...

@yshui
Copy link
Owner

yshui commented Sep 30, 2024

what about picom --config /dev/null --backend glx --vsync?

@Neko-Box-Coder
Copy link

I have just updated my system today. I can confirm the slow down when vsync is set to true. It seems like it happens after a while.

It happens on both glx and xrender backend, xrender is worse.

This is my picom --diagnostics.

**Version:** vgit-89c2c

### Extensions:

* Shape: Yes
* RandR: Yes
* Present: Present

### Misc:

* Use Overlay: No
  (Another compositor is already running)
* Config file used: /home/user/.config/picom.conf

### Drivers (inaccurate):

AMDGPU, Radeon

### Backend: glx

* Driver vendors:
 * GLX: Mesa Project and SGI
 * GL: AMD
* GL renderer: AMD Radeon 680M (radeonsi, rembrandt, LLVM 18.1.8, DRM 3.57, 6.10.10-arch1-1)
* Accelerated: 1

### Backend: egl

* Driver vendors:
 * EGL: Mesa Project
 * EGL driver: radeonsi
 * GL: AMD
* GL renderer: AMD Radeon 680M (radeonsi, rembrandt, LLVM 18.1.8, DRM 3.57, 6.10.10-arch1-1)
* ```

@christian-heusel
Copy link
Author

christian-heusel commented Oct 1, 2024

what about picom --config /dev/null --backend glx --vsync?

This also has the problem 😊

Contrary to what @Neko-Box-Coder has reported in #1345 (comment) I do not face the issue with the xrender backend (so maybe that's a different issue) 😊

@joaopauloalbq
Copy link

It is also happening on Intel when vsync is enabled.

@yshui
Copy link
Owner

yshui commented Oct 2, 2024

OK, now can you try picom --config /dev/null --backend glx --vsync --no-frame-pacing?

@joaopauloalbq
Copy link

OK, now can you try picom --config /dev/null --backend glx --vsync --no-frame-pacing?

That's it. No issues!

@christian-heusel
Copy link
Author

OK, now can you try picom --config /dev/null --backend glx --vsync --no-frame-pacing?

No issues for me either!

@yshui
Copy link
Owner

yshui commented Oct 3, 2024

hmm, interesting. can you run picom --config /dev/null --backend glx --vsync --log-level=trace --log-file=trace.log, trigger the delay, and upload trace.log here?

it would help if you can also record your screen with a clock showing milliseconds so i can relate log to the delay. but if you can't that's also fine.

@yshui
Copy link
Owner

yshui commented Oct 3, 2024

You can use this as the clock: https://codepen.io/yshui/pen/abeNvGv

I realize it is very probable that this delay will just go away if you have this clock running... But do try.

@joaopauloalbq
Copy link

I realize it is very probable that this delay will just go away if you have this clock running... But do try.

Yep, I couldn't reproduce the delay with the clock or by recording the screen :S

The smallest log I could capture: trace.log

@yshui
Copy link
Owner

yshui commented Oct 5, 2024

hmmm, without knowing exactly when the delay happened the log is not very useful.

the delay happens when you press a key to trigger a screen update, right? is there a way to maybe log your key presses with a millisecond timestamp? and then tell me which key press is correlated with the delay.

@joaopauloalbq
Copy link

joaopauloalbq commented Oct 7, 2024

the delay happens when you press a key to trigger a screen update, right?

Right. When I change tag/desktop/workspace

is there a way to maybe log your key presses with a millisecond timestamp?

Got it, here we go: trace.log

Delays in:

  • 11:05:28.490
  • 11:05:38.388
  • 11:06:04.894
  • 11:06:15.670
  • 11:06:25.247
  • 11:06:37.750

@yshui
Copy link
Owner

yshui commented Oct 8, 2024

huh, this might be why:

[ 07/10/2024 11:05:38.400 present_vblank_scheduler_schedule VERBOSE ] Requesting vblank event for window 0x00000243, msc 1039176
[ 07/10/2024 11:05:38.405 handle_present_complete_notify TRACE ] The end of this vblank is 781 us into the future
[ 07/10/2024 11:05:38.406 present_vblank_scheduler_schedule VERBOSE ] Requesting vblank event for window 0x00000243, msc 1039178

vblank scheduler skipped one msc... now we just need to figure out what caused it.

yshui added a commit that referenced this issue Oct 8, 2024
@yshui
Copy link
Owner

yshui commented Oct 8, 2024

I add a debug print and pushed it to debug-1345, can you build from that branch and upload trace log and timestamp again? thanks.

@joaopauloalbq @christian-heusel

@yshui
Copy link
Owner

yshui commented Oct 8, 2024

I also have a hunch what might be wrong here

yshui added a commit that referenced this issue Oct 8, 2024
@yshui
Copy link
Owner

yshui commented Oct 8, 2024

After you have got trace log from debug-1345, try fix-1345 see if that fixes the problem.

yshui added a commit that referenced this issue Oct 8, 2024
@christian-heusel
Copy link
Author

Here is the log from the debug-1345 branch:
trace.log

The fix-1345 branch seems to indeed fix the issue for me 😊

This was referenced Oct 10, 2024
yshui added a commit that referenced this issue Oct 10, 2024
Once again we found ourselves going into sleep while there are X events
in libxcb's buffer.

This time it's because both vblank_handle_x_events and x_poll_for_events
can read from the X connection. This is introduced as part of commit
d617d0b. Before that, we were only calling xcb_poll_for_queued_events
after vblank_handle_x_events.

By changing that to a xcb_poll_for_events (and later to
x_poll_for_events, which calls xcb_poll_for_events), we could read
additional vblank events into the buffer. These events will not be
handled because vblank_handle_x_events has already be called, so we will
be going into sleep with these events in the buffer. Since vblank events
driver the frame render loop, this causes indefinitely long delay
between frames.

Related-to: #1345 #1330
Signed-off-by: Yuxuan Shui <[email protected]>
yshui added a commit that referenced this issue Oct 10, 2024
Once again we found ourselves going into sleep while there are X events
in libxcb's buffer.

This time it's because both vblank_handle_x_events and x_poll_for_events
can read from the X connection. This is introduced as part of commit
d617d0b. Before that, we were only calling xcb_poll_for_queued_events
after vblank_handle_x_events.

By changing that to a xcb_poll_for_events (and later to
x_poll_for_events, which calls xcb_poll_for_events), we could read
additional vblank events into the buffer. These events will not be
handled because vblank_handle_x_events has already be called, so we will
be going into sleep with these events in the buffer. Since vblank events
driver the frame render loop, this causes indefinitely long delay
between frames.

Related-to: #1345 #1330
Signed-off-by: Yuxuan Shui <[email protected]>
@joaopauloalbq
Copy link

Here is the trace log from debug-1345: trace.log

Delays in:

  • 19:07:31.452
  • 19:07:42.496
  • 19:07:49.744
  • 19:08:02.008

The branch fix-1345 really fixed the issue 🤩

@yshui
Copy link
Owner

yshui commented Oct 10, 2024

thanks for the log, and thanks for confirming! i think this can be closed now.

@yshui yshui closed this as completed Oct 10, 2024
@yshui yshui reopened this Oct 12, 2024
@yshui
Copy link
Owner

yshui commented Oct 12, 2024

i am surprised that that commit actually worked, but this bug isn't fixed. there are more fundamental problems...

yshui added a commit that referenced this issue Oct 12, 2024
This partially reverts c57b705, the
previous fix for #1345

The previous fix falsely assumed if `x_poll_for_event` returned nothing,
it must have not read from the X connection. This is just not true. It
worked, probably because it's impossible to have vblank events close
together (they are only once per frame). But that's an assumption we
probably shouldn't make.

_Separately_, we have another problem. We assumed if `x_poll_for_event`
returned nothing, we don't need to flush again. This is not true either.
There could be pending requests being completed, whose callback function
might have sent new requests. This, for example, causes `picom-inspect`
to hang, because some requests are stuck in the outgoing buffer.

The explanation for the fix is going to be long, because this kind of
problems are never simple.

First of all, we need a version of `x_poll_for_event` that actually
guarantees to not read from X. The current version of `x_poll_for_event`
is already extremely complex, I don't want to pile more complexity on
top of it. So instead we are now using a different approach, one some
might consider a ugly hack.

The complexity of `x_poll_for_event` all stems from the lack of
`xcb_poll_for_queued_{reply,special_event}` API, so we had to use
complex to merge message from different streams (events, special events,
replies, and errors) all the while trying our best (and fail) to handle
all messages in the xcb buffer before going to sleep. There is a
`xcb_poll_for_queued_event` which in theory can be used for this and
will make things a lot simpler, the problem is we don't control events,
so if there is a large gap between event arrivals, picom would just be
sitting there doing nothing.

And that's exactly what we do in this commit, that is, controlling
events. Every time we wait for replies/errors for some requests, we send
a request that are guaranteed to _fail_. This is because unchecked
errors will be returned by `xcb_poll_for_queued_event` as an event. This
way, we can be sure an event will be received in a bounded amount of time,
so we won't hang.

This vastly simplifies the logic in `x_poll_for_event`, and made adding
a no-read version of it trivial. Now we have that, some care need to be
taken to make sure that we _do_ read from X sometimes, otherwise we
will go to sleep without reading anything, and be woken up again
immediately by the file descriptor. This will result in an infinite
loop. This has some ripple effects, for example, now we queue redraw
whenever the wm tree changes; before, redraws were only queued when the
wm tree becomes consistent.

Even with this, we still need the `while(true)` loop in
`handle_x_events`, this is because of the other problem we mentioned at
the beginning - flushing. The way we fix the flushing problem is by
tracking the completion of pending requests - if any requests were
completed, we flush conservatively (meaning even when not strictly
necessary) - simple enough. But flushing could read (yes, read) from X
too! So whenever we flush, we need to call `x_poll_for_event` again,
hence the while loop.

Fixes: #1345

Signed-off-by: Yuxuan Shui <[email protected]>
@yshui
Copy link
Owner

yshui commented Oct 12, 2024

sorry about this but i need someone to test the fix again. please try the fix-1345-2 branch and see if that still works. thanks.

yshui added a commit that referenced this issue Oct 12, 2024
This partially reverts c57b705, the
previous fix for #1345

The previous fix falsely assumed if `x_poll_for_event` returned nothing,
it must have not read from the X connection. This is just not true. It
worked, probably because it's impossible to have vblank events close
together (they are only once per frame). But that's an assumption we
probably shouldn't make.

_Separately_, we have another problem. We assumed if `x_poll_for_event`
returned nothing, we don't need to flush again. This is not true either.
There could be pending requests being completed, whose callback function
might have sent new requests. This, for example, causes `picom-inspect`
to hang, because some requests are stuck in the outgoing buffer.

The explanation for the fix is going to be long, because this kind of
problems are never simple.

First of all, we need a version of `x_poll_for_event` that actually
guarantees to not read from X. The current version of `x_poll_for_event`
is already extremely complex, I don't want to pile more complexity on
top of it. So instead we are now using a different approach, one some
might consider a ugly hack.

The complexity of `x_poll_for_event` all stems from the lack of
`xcb_poll_for_queued_{reply,special_event}` API, so we had to use
complex to merge message from different streams (events, special events,
replies, and errors) all the while trying our best (and fail) to handle
all messages in the xcb buffer before going to sleep. There is a
`xcb_poll_for_queued_event` which in theory can be used for this and
will make things a lot simpler, the problem is we don't control events,
so if there is a large gap between event arrivals, picom would just be
sitting there doing nothing.

And that's exactly what we do in this commit, that is, controlling
events. Every time we wait for replies/errors for some requests, we send
a request that are guaranteed to _fail_. This is because unchecked
errors will be returned by `xcb_poll_for_queued_event` as an event. This
way, we can be sure an event will be received in a bounded amount of time,
so we won't hang.

This vastly simplifies the logic in `x_poll_for_event`, and made adding
a no-read version of it trivial. Now we have that, some care need to be
taken to make sure that we _do_ read from X sometimes, otherwise we
will go to sleep without reading anything, and be woken up again
immediately by the file descriptor. This will result in an infinite
loop. This has some ripple effects, for example, now we queue redraw
whenever the wm tree changes; before, redraws were only queued when the
wm tree becomes consistent.

Even with this, we still need the `while(true)` loop in
`handle_x_events`, this is because of the other problem we mentioned at
the beginning - flushing. The way we fix the flushing problem is by
tracking the completion of pending requests - if any requests were
completed, we flush conservatively (meaning even when not strictly
necessary) - simple enough. But flushing could read (yes, read) from X
too! So whenever we flush, we need to call `x_poll_for_event` again,
hence the while loop.

Fixes: #1345

Signed-off-by: Yuxuan Shui <[email protected]>
yshui added a commit that referenced this issue Oct 12, 2024
This partially reverts c57b705, the
previous fix for #1345

The previous fix falsely assumed if `x_poll_for_event` returned nothing,
it must have not read from the X connection. This is just not true. It
worked, probably because it's impossible to have vblank events close
together (they are only once per frame). But that's an assumption we
probably shouldn't make.

_Separately_, we have another problem. We assumed if `x_poll_for_event`
returned nothing, we don't need to flush again. This is not true either.
There could be pending requests being completed, whose callback function
might have sent new requests. This, for example, causes `picom-inspect`
to hang, because some requests are stuck in the outgoing buffer.

The explanation for the fix is going to be long, because this kind of
problems are never simple.

First of all, we need a version of `x_poll_for_event` that actually
guarantees to not read from X. The current version of `x_poll_for_event`
is already extremely complex, I don't want to pile more complexity on
top of it. So instead we are now using a different approach, one some
might consider a ugly hack.

The complexity of `x_poll_for_event` all stems from the lack of
`xcb_poll_for_queued_{reply,special_event}` API, so we had to use
complex to merge message from different streams (events, special events,
replies, and errors) all the while trying our best (and fail) to handle
all messages in the xcb buffer before going to sleep. There is a
`xcb_poll_for_queued_event` which in theory can be used for this and
will make things a lot simpler, the problem is we don't control events,
so if there is a large gap between event arrivals, picom would just be
sitting there doing nothing.

And that's exactly what we do in this commit, that is, controlling
events. Every time we wait for replies/errors for some requests, we send
a request that are guaranteed to _fail_. This is because unchecked
errors will be returned by `xcb_poll_for_queued_event` as an event. This
way, we can be sure an event will be received in a bounded amount of time,
so we won't hang.

This vastly simplifies the logic in `x_poll_for_event`, and made adding
a no-read version of it trivial. Now we have that, some care need to be
taken to make sure that we _do_ read from X sometimes, otherwise we
will go to sleep without reading anything, and be woken up again
immediately by the file descriptor. This will result in an infinite
loop. This has some ripple effects, for example, now we queue redraw
whenever the wm tree changes; before, redraws were only queued when the
wm tree becomes consistent.

Even with this, we still need the `while(true)` loop in
`handle_x_events`, this is because of the other problem we mentioned at
the beginning - flushing. The way we fix the flushing problem is by
tracking the completion of pending requests - if any requests were
completed, we flush conservatively (meaning even when not strictly
necessary) - simple enough. But flushing could read (yes, read) from X
too! So whenever we flush, we need to call `x_poll_for_event` again,
hence the while loop.

Fixes: #1345

Signed-off-by: Yuxuan Shui <[email protected]>
@yshui yshui closed this as completed in 2f917fd Oct 13, 2024
@joaopauloalbq
Copy link

Sorry for the late feedback.

Everything is still working ✅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants