
Subdaemon heartbeat with modified libqb (async API for connect) #2588

Merged: 3 commits into ClusterLabs:main on Jan 19, 2022

Conversation

@wenningerk (Contributor) commented on Dec 9, 2021:

Have pacemakerd keep tracking its subdaemons for liveness, currently via qb-ipc-connect and the packets exchanged for authentication.
With current libqb, qb-ipc-connect blocks for an indefinite time if the subdaemon is unresponsive (e.g. SIGSTOPped or stuck in a busy mainloop).
Hence there is an experimental libqb API extension, ClusterLabs/libqb#450, that lets us handle this without ugly workarounds.
This is also why CI is expected to fail at this point: upstream libqb master is still missing the API extension.
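
For context, a rough and hypothetical sketch of the non-blocking probe flow this relies on is below. It assumes the API shape proposed in ClusterLabs/libqb#450 (qb_ipcc_connect_async() returning a connection plus a file descriptor to wait on, and qb_ipcc_connect_continue() finishing the handshake); the function name, buffer size, and error handling are illustrative and are not the actual pacemakerd code.

    /* Hypothetical liveness probe built on the API proposed in
     * ClusterLabs/libqb#450; signatures and semantics are assumed from that
     * proposal and may differ from the final libqb API. */
    #include <poll.h>
    #include <qb/qbipcc.h>

    static int
    probe_subdaemon_ipc(const char *ipc_name, int timeout_ms)
    {
        int fd = -1;

        /* Start the connection without blocking on an unresponsive server
         * (e.g. a SIGSTOPped subdaemon or one stuck in a busy mainloop). */
        qb_ipcc_connection_t *con = qb_ipcc_connect_async(ipc_name, 512, &fd);

        if (con == NULL) {
            return -1;              /* nothing is listening at all */
        }

        /* Wait a bounded time for the server's part of the handshake. */
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        if (poll(&pfd, 1, timeout_ms) <= 0) {
            qb_ipcc_disconnect(con);
            return -1;              /* timed out: treat as unresponsive */
        }

        /* Complete the handshake/authentication now that data has arrived
         * (cleanup on failure is simplified here). */
        if (qb_ipcc_connect_continue(con) != 0) {
            return -1;
        }

        qb_ipcc_disconnect(con);    /* liveness confirmed; drop the probe */
        return 0;
    }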

@kgaillot (Contributor) left a comment:

This approach seems reasonable to me.

@wenningerk force-pushed the subdaemon_heartbeat_mod_libqb branch 3 times, most recently from bd94be0 to f07e7b3 on January 14, 2022 at 20:05
@kgaillot (Contributor) left a comment:

Looks good. Are you planning to keep the old code if the new libqb API isn't available?

@wenningerk (Contributor, Author):

The last push also fixes a hanging shutdown when there were subdaemons that pacemakerd was not observing as its own children (via signals). I had seen a similar hang with the current main branch, where I suppose it was simply blocking in the libqb API.

@wenningerk (Contributor, Author):

The build issue is a repository problem with Tumbleweed.

@kgaillot (Contributor):

> The last push also fixes a hanging shutdown when there were subdaemons that pacemakerd was not observing as its own children (via signals). I had seen a similar hang with the current main branch, where I suppose it was simply blocking in the libqb API.

Do you know if that hang was a regression in a released version?

@kgaillot (Contributor) left a comment:

I think this is a good approach; we just need a fallback for when the libqb API isn't available.
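
As an illustration of that fallback (a minimal sketch, not the code that ended up in the PR), one could guard on the macro that configure's AC_CHECK_FUNCS(qb_ipcc_connect_async) defines; the helper names below are assumptions:

    /* Minimal compile-time fallback sketch; HAVE_QB_IPCC_CONNECT_ASYNC is the
     * macro AC_CHECK_FUNCS(qb_ipcc_connect_async) defines when the new API is
     * available.  The helper names are illustrative only. */
    #include <stdbool.h>
    #include <qb/qbipcc.h>

    #ifdef HAVE_QB_IPCC_CONNECT_ASYNC
    /* Non-blocking probe from the sketch earlier in this thread */
    int probe_subdaemon_ipc(const char *ipc_name, int timeout_ms);
    #endif

    static bool
    subdaemon_ipc_alive(const char *ipc_name)
    {
    #ifdef HAVE_QB_IPCC_CONNECT_ASYNC
        /* New path: asynchronous connect, so a stuck subdaemon cannot block
         * pacemakerd while it waits for the handshake. */
        return probe_subdaemon_ipc(ipc_name, 1000) == 0;
    #else
        /* Old path: blocking connect; an unresponsive subdaemon can stall
         * this check, which is exactly what the new API avoids. */
        qb_ipcc_connection_t *con = qb_ipcc_connect(ipc_name, 512);

        if (con == NULL) {
            return false;
        }
        qb_ipcc_disconnect(con);
        return true;
    #endif
    }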

pcmk_children[next_child].name,
(long long) PCMK__SPECIAL_PID_AS_0(
pcmk_children[next_child].pid),
(rc == pcmk_rc_ipc_pid_only)? " as IPC server" : "");
Contributor:

Now that we don't fall through, we don't need this check or the similar one below.

Contributor Author:

oops

@wenningerk (Contributor, Author):

> Do you know if that hang was a regression in a released version?

No, I never tested much with pre-existing daemons. Maybe it even works if you give it more time, and the issue only arises with sbd enabled. Now that I know it isn't the same thing as with my code, I should look at it once more.

@wenningerk force-pushed the subdaemon_heartbeat_mod_libqb branch from f07e7b3 to a84bd9b on January 17, 2022 at 23:51
@wenningerk changed the title from "[WIP] Subdaemon heartbeat with modified libqb (async API for connect)" to "Subdaemon heartbeat with modified libqb (async API for connect)" on Jan 18, 2022
configure.ac (outdated)
@@ -1316,7 +1316,7 @@ AC_CHECK_FUNCS(qb_ipcc_connect_async,

 dnl libqb 2.0.2+ (2020-10)
 AC_CHECK_FUNCS(qb_ipcc_auth_get,
-               AC_DEFINE(HAVE_IPCC_AUTH_GET, 1,
+               AC_DEFINE(HAVE_QB_IPCC_AUTH_GET, 1,
@kgaillot (Contributor), Jan 19, 2022:

Actually (thankfully) this shouldn't be necessary. AC_CHECK_FUNCS() will already define that, so we were just unnecessarily defining the alternate name. We can just drop the second argument (i.e. the AC_DEFINE) altogether.
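
For illustration only (not part of the PR): once the plain AC_CHECK_FUNCS(qb_ipcc_auth_get) check is in place, the canonical macro lands in config.h automatically, so the C side needs nothing more than a guard like this:

    #include <config.h>

    #ifdef HAVE_QB_IPCC_AUTH_GET
        /* qb_ipcc_auth_get() is available; no alternate HAVE_* name needed */
    #endif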

Contributor Author:

Then let's make it consistent, if we're sure it works the same on all platforms and versions we support.

@wenningerk force-pushed the subdaemon_heartbeat_mod_libqb branch from a84bd9b to f62b28f on January 19, 2022 at 20:22
@wenningerk force-pushed the subdaemon_heartbeat_mod_libqb branch from f62b28f to 8e8a4a3 on January 19, 2022 at 21:19
@kgaillot merged commit 2c937a4 into ClusterLabs:main on Jan 19, 2022
(long long) PCMK__SPECIAL_PID_AS_0(
pcmk_children[next_child].pid),
pcmk_children[next_child].check_count);
stop_child(&pcmk_children[next_child], SIGKILL);
Member:

In public clouds, it nowadays happens more often than it used to that subdaemons are unresponsive to IPC and get respawned.

As we know, if it's the controller that respawns, the node loses all of its transient attributes in the CIB status section, and they are not written again. Not only are resources that rely on those attributes impacted; the missing internal attribute #feature-set also results in a confusing MIXED-VERSION condition being shown by interfaces like crm_mon.

So far, PCMK_fail_fast=yes is probably the only workaround to bring the situation back to sanity, but of course at the cost of a node reboot.

While we've been trying to address this with ideas like #1699, I'm not sure whether it would make sense to increase the tolerance here, for example by raising PCMK_PROCESS_CHECK_RETRIES or making it configurable... Or should we say that 5 failures in a row are bad enough anyway to trigger a recovery?
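
Purely to illustrate the "make it configurable" idea above (nothing like this is in the PR), a retry limit could be overridden from the environment. The environment variable name below is invented; only the PCMK_PROCESS_CHECK_RETRIES constant and the default of 5 come from this thread.

    /* Hypothetical sketch: allow the environment to override the compiled-in
     * subdaemon check retry limit.  The variable name is invented for this
     * example; PCMK_PROCESS_CHECK_RETRIES and the default of 5 are taken from
     * the discussion above. */
    #include <stdlib.h>

    #ifndef PCMK_PROCESS_CHECK_RETRIES
    #define PCMK_PROCESS_CHECK_RETRIES 5
    #endif

    static int
    process_check_retries(void)
    {
        const char *s = getenv("PCMK_process_check_retries");  /* invented */

        if (s != NULL) {
            long v = strtol(s, NULL, 10);

            if ((v > 0) && (v <= 100)) {
                return (int) v;
            }
        }
        return PCMK_PROCESS_CHECK_RETRIES;
    }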

Contributor Author:

Sorry, I may be missing the reason for your comment here. Previously, IPC wasn't checked on a periodic basis for all subdaemons. The numbers are somewhat arbitrary: 1s is about the lowest retry interval that makes sense, and failing after 5 retries was an attempt to keep things as reactive as before in the cases where IPC was already being checked.

Member:

Nothing is wrong with the changes in this PR; I'm just bringing up the topic in this context :-)

Contributor:

Coincidentally I recently created https://projects.clusterlabs.org/T950 regarding this code, but it's not related unless you've only seen issues at cluster shutdown.

https://projects.clusterlabs.org/T73 is not directly related either but could affect the timing.

There is a 1s delay between checks of all subdaemons, so if they're all up, that's at least 6s between checks for any one subdaemon. 5 tries (30s) does seem plenty of time, so I wouldn't want to raise that. If a cloud host can't get enough cycles in 30s to respond to a check, it's probably unsuitable as an HA node.

Member:

Thanks for the info and opinion. I agree.
