libmunge: Fix connect failure retry for full socket queue

libmunge retries transient errors when connecting to munged.  This
should handle errors arising from the listening socket's queue being
full.  However, PR #139 uncovered a bug: the retry logic did not
handle EAGAIN, which Linux returns when a nonblocking UNIX domain
socket connection cannot be completed immediately.

This commit fixes the while-loop in _m_msg_client_connect() that
retries connect() so both EAGAIN (Linux) and ECONNREFUSED (BSD) are
handled as transient errors that should be retried.

This was tested by setting "--listen-backlog=1", running munged with
the default 2 work threads, and running remunge with 64 threads.

First, the while-loop was altered so connect() errors would not be
retried.  With that change, remunge could reproduce EAGAIN on Linux,
and it was far easier to reproduce if vcpu > 1.  remunge could
reproduce ECONNREFUSED on NetBSD 9.3 with vcpu=1, and on FreeBSD 14.0
with vcpu=2.

Adding back the retry logic for EAGAIN and ECONNREFUSED made this
connect() failure difficult to reproduce even with
"--listen-backlog=1".

Tested:
- AlmaLinux 9.3, 8.9
- Arch Linux
- CentOS Linux Stream 9, Stream 8, 7.9.2009, 6.10
- Debian sid, 12.5, 11.9, 10.13, 9.13, 8.11, 7.11, 6.0.10, 5.0.10, 4.0
- Fedora 39, 38, 37
- FreeBSD 14.0, 13.3, 13.2
- NetBSD 9.3
- OpenBSD 7.4, 7.3
- openSUSE 15.5, 15.4
- Ubuntu 23.10, 22.04.4, 20.04.6, 18.04.6, 16.04.7, 14.04.6
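
For illustration only, the retry behavior described above might be
sketched as follows.  This is not the code from
_m_msg_client_connect(); the helper names, the attempt limit, and the
linear backoff are all assumptions made for this example.  It shows
only the general idea of treating EAGAIN and ECONNREFUSED from
connect() as transient while failing immediately on other errors.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <time.h>
#include <unistd.h>

/*  Hypothetical helper: connect() errors that may indicate a full
 *    listen queue and are therefore worth retrying.
 */
static bool
is_transient_connect_error (int err)
{
    return (err == EAGAIN)              /* seen on Linux */
        || (err == ECONNREFUSED);       /* seen on BSD */
}

/*  Hypothetical sketch: connect to the UNIX domain socket at [path],
 *    retrying transient errors up to [max_attempts] times.  The delay
 *    grows linearly with each attempt (an assumption, not munge's policy).
 *  Returns a connected nonblocking fd, or -1 with errno set.
 */
static int
connect_with_retry (const char *path, int max_attempts, long delay_msecs)
{
    struct sockaddr_un addr;
    size_t path_len = strlen (path);

    if (path_len >= sizeof (addr.sun_path)) {
        errno = ENAMETOOLONG;
        return -1;
    }
    memset (&addr, 0, sizeof (addr));
    addr.sun_family = AF_UNIX;
    memcpy (addr.sun_path, path, path_len + 1);

    for (int attempt = 1; ; attempt++) {
        int err;
        int flags;
        int fd = socket (AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0) {
            return -1;
        }
        /*  A nonblocking socket is what makes EAGAIN possible on Linux.
         */
        flags = fcntl (fd, F_GETFL, 0);
        if ((flags >= 0)
                && (fcntl (fd, F_SETFL, flags | O_NONBLOCK) == 0)
                && (connect (fd, (struct sockaddr *) &addr,
                        sizeof (addr)) == 0)) {
            return fd;
        }
        err = errno;
        close (fd);     /* start each attempt with a fresh socket */

        if (!is_transient_connect_error (err) || (attempt >= max_attempts)) {
            errno = err;
            return -1;
        }
        /*  Back off before retrying so the server can drain its queue.
         */
        long msecs = delay_msecs * attempt;
        struct timespec ts = { msecs / 1000, (msecs % 1000) * 1000000L };
        nanosleep (&ts, NULL);
    }
}
```

A caller might invoke connect_with_retry (socket_path, 10, 50) and
treat a return value of -1 as a hard failure.  The retry count is
bounded because ECONNREFUSED can also mean that no server is
listening on the socket at all.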