
Subdaemon heartbeat with modified libqb (async API for connect) #2588

Merged: 3 commits into ClusterLabs:main on Jan 19, 2022

Conversation

@wenningerk (Contributor) commented on Dec 9, 2021:

Have pacemakerd keep tracking its subdaemons for liveness, currently via qb-ipc-connect and the packets exchanged for authentication.
With current libqb, qb-ipc-connect blocks for an indefinite time if the subdaemon is unresponsive (e.g. SIGSTOPped or stuck in a busy mainloop).
Hence there is an experimental libqb API extension, ClusterLabs/libqb#450, that lets us handle this without ugly workarounds.
This is also why CI is expected to fail at this point: upstream libqb master is still missing the API extension.
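
For context, a rough and hypothetical sketch of the non-blocking probe flow this relies on is below. It assumes the API shape proposed in ClusterLabs/libqb#450 (qb_ipcc_connect_async() returning a connection plus a file descriptor to wait on, and qb_ipcc_connect_continue() finishing the handshake); the function name, buffer size, and error handling are illustrative and are not the actual pacemakerd code.

    /* Hypothetical liveness probe built on the API proposed in
     * ClusterLabs/libqb#450; signatures and semantics are assumed from that
     * proposal and may differ from the final libqb API. */
    #include <poll.h>
    #include <qb/qbipcc.h>

    static int
    probe_subdaemon_ipc(const char *ipc_name, int timeout_ms)
    {
        int fd = -1;

        /* Start the connection without blocking on an unresponsive server
         * (e.g. a SIGSTOPped subdaemon or one stuck in a busy mainloop). */
        qb_ipcc_connection_t *con = qb_ipcc_connect_async(ipc_name, 512, &fd);

        if (con == NULL) {
            return -1;              /* nothing is listening at all */
        }

        /* Wait a bounded time for the server's part of the handshake. */
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        if (poll(&pfd, 1, timeout_ms) <= 0) {
            qb_ipcc_disconnect(con);
            return -1;              /* timed out: treat as unresponsive */
        }

        /* Complete the handshake/authentication now that data has arrived
         * (cleanup on failure is simplified here). */
        if (qb_ipcc_connect_continue(con) != 0) {
            return -1;
        }

        qb_ipcc_disconnect(con);    /* liveness confirmed; drop the probe */
        return 0;
    }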

@kgaillot (Contributor) left a comment:

This approach seems reasonable to me.

@wenningerk force-pushed the subdaemon_heartbeat_mod_libqb branch 3 times, most recently from bd94be0 to f07e7b3 on January 14, 2022 at 20:05
@kgaillot (Contributor) left a comment:

Looks good. Are you planning to keep the old code if the new libqb API isn't available?

@wenningerk (Contributor, Author):

The last push also fixes a hanging shutdown when there were subdaemons that pacemakerd was not observing as its own children (via signals). I had seen a similar hang with the current main branch, where I suppose it was simply blocking in the libqb API.

@wenningerk (Contributor, Author):

The build issue is a repository problem with Tumbleweed.

@kgaillot (Contributor):

> The last push also fixes a hanging shutdown when there were subdaemons that pacemakerd was not observing as its own children (via signals). I had seen a similar hang with the current main branch, where I suppose it was simply blocking in the libqb API.

Do you know if that hang was a regression in a released version?

@kgaillot (Contributor) left a comment:

I think this is a good approach; we just need a fallback for when the libqb API isn't available.
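
As an illustration of that fallback (a minimal sketch, not the code that ended up in the PR), one could guard on the macro that configure's AC_CHECK_FUNCS(qb_ipcc_connect_async) defines; the helper names below are assumptions:

    /* Minimal compile-time fallback sketch; HAVE_QB_IPCC_CONNECT_ASYNC is the
     * macro AC_CHECK_FUNCS(qb_ipcc_connect_async) defines when the new API is
     * available.  The helper names are illustrative only. */
    #include <stdbool.h>
    #include <qb/qbipcc.h>

    #ifdef HAVE_QB_IPCC_CONNECT_ASYNC
    /* Non-blocking probe from the sketch earlier in this thread */
    int probe_subdaemon_ipc(const char *ipc_name, int timeout_ms);
    #endif

    static bool
    subdaemon_ipc_alive(const char *ipc_name)
    {
    #ifdef HAVE_QB_IPCC_CONNECT_ASYNC
        /* New path: asynchronous connect, so a stuck subdaemon cannot block
         * pacemakerd while it waits for the handshake. */
        return probe_subdaemon_ipc(ipc_name, 1000) == 0;
    #else
        /* Old path: blocking connect; an unresponsive subdaemon can stall
         * this check, which is exactly what the new API avoids. */
        qb_ipcc_connection_t *con = qb_ipcc_connect(ipc_name, 512);

        if (con == NULL) {
            return false;
        }
        qb_ipcc_disconnect(con);
        return true;
    #endif
    }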

pcmk_children[next_child].name,
(long long) PCMK__SPECIAL_PID_AS_0(
pcmk_children[next_child].pid),
(rc == pcmk_rc_ipc_pid_only)? " as IPC server" : "");
Contributor:

Now that we don't fall through, we don't need this check or the similar one below.

Contributor Author:

oops

@wenningerk (Contributor, Author):

> Do you know if that hang was a regression in a released version?

No, I never tested much with pre-existing daemons. Maybe it even works if you give it more time, and the issue only arises with sbd enabled. Now that I know it isn't the same thing as with my code, I should look at it once more.

@wenningerk force-pushed the subdaemon_heartbeat_mod_libqb branch from f07e7b3 to a84bd9b on January 17, 2022 at 23:51
@wenningerk changed the title from "[WIP] Subdaemon heartbeat with modified libqb (async API for connect)" to "Subdaemon heartbeat with modified libqb (async API for connect)" on Jan 18, 2022
configure.ac (outdated)
@@ -1316,7 +1316,7 @@ AC_CHECK_FUNCS(qb_ipcc_connect_async,

 dnl libqb 2.0.2+ (2020-10)
 AC_CHECK_FUNCS(qb_ipcc_auth_get,
-               AC_DEFINE(HAVE_IPCC_AUTH_GET, 1,
+               AC_DEFINE(HAVE_QB_IPCC_AUTH_GET, 1,
@kgaillot (Contributor), Jan 19, 2022:

Actually (thankfully) this shouldn't be necessary. AC_CHECK_FUNCS() will already define that, so we were just unnecessarily defining the alternate name. We can just drop the second argument (i.e. the AC_DEFINE) altogether.
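
For illustration only (not part of the PR): once the plain AC_CHECK_FUNCS(qb_ipcc_auth_get) check is in place, the canonical macro lands in config.h automatically, so the C side needs nothing more than a guard like this:

    #include <config.h>

    #ifdef HAVE_QB_IPCC_AUTH_GET
        /* qb_ipcc_auth_get() is available; no alternate HAVE_* name needed */
    #endif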

Contributor Author:

Then let's make it consistent, if we're sure it works the same on all platforms and versions we support.

@wenningerk force-pushed the subdaemon_heartbeat_mod_libqb branch from a84bd9b to f62b28f on January 19, 2022 at 20:22
@wenningerk force-pushed the subdaemon_heartbeat_mod_libqb branch from f62b28f to 8e8a4a3 on January 19, 2022 at 21:19
@kgaillot merged commit 2c937a4 into ClusterLabs:main on Jan 19, 2022
(long long) PCMK__SPECIAL_PID_AS_0(
pcmk_children[next_child].pid),
pcmk_children[next_child].check_count);
stop_child(&pcmk_children[next_child], SIGKILL);
Member:

In public clouds, it nowadays happens more often than it used to that subdaemons are unresponsive to IPC and get respawned.

As we know, if it's the controller that respawns, the node loses all of its transient attributes in the CIB status section, and they are not written again. Not only are resources that rely on those attributes impacted; the missing internal attribute #feature-set also results in a confusing MIXED-VERSION condition being shown by interfaces like crm_mon.

So far, PCMK_fail_fast=yes is probably the only workaround to bring the situation back to sanity, but of course at the cost of a node reboot.

While we've been trying to address this with ideas like #1699, I'm not sure whether it would make sense to increase the tolerance here, for example by raising PCMK_PROCESS_CHECK_RETRIES or making it configurable... Or should we say that 5 failures in a row are bad enough anyway to trigger a recovery?
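
Purely to illustrate the "make it configurable" idea above (nothing like this is in the PR), a retry limit could be overridden from the environment. The environment variable name below is invented; only the PCMK_PROCESS_CHECK_RETRIES constant and the default of 5 come from this thread.

    /* Hypothetical sketch: allow the environment to override the compiled-in
     * subdaemon check retry limit.  The variable name is invented for this
     * example; PCMK_PROCESS_CHECK_RETRIES and the default of 5 are taken from
     * the discussion above. */
    #include <stdlib.h>

    #ifndef PCMK_PROCESS_CHECK_RETRIES
    #define PCMK_PROCESS_CHECK_RETRIES 5
    #endif

    static int
    process_check_retries(void)
    {
        const char *s = getenv("PCMK_process_check_retries");  /* invented */

        if (s != NULL) {
            long v = strtol(s, NULL, 10);

            if ((v > 0) && (v <= 100)) {
                return (int) v;
            }
        }
        return PCMK_PROCESS_CHECK_RETRIES;
    }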

Contributor Author:

Sorry, I may be missing the reason for your comment here. Previously, IPC wasn't checked on a periodic basis for all subdaemons. The numbers are somewhat arbitrary: 1s is about the lowest retry interval that makes sense, and failing after 5 retries was an attempt to keep things as reactive as before in the cases where IPC was already being checked.

Member:

Nothing is wrong with the changes in this PR; I'm just bringing up the topic in this context :-)

Contributor:

Coincidentally I recently created https://projects.clusterlabs.org/T950 regarding this code, but it's not related unless you've only seen issues at cluster shutdown.

https://projects.clusterlabs.org/T73 is not directly related either but could affect the timing.

There is a 1s delay between checks of all subdaemons, so if they're all up, that's at least 6s between checks for any one subdaemon. 5 tries (30s) does seem plenty of time, so I wouldn't want to raise that. If a cloud host can't get enough cycles in 30s to respond to a check, it's probably unsuitable as an HA node.

Member:

Thanks for the info and opinion. I agree.
