Description
Executions published from the master in batch mode hang indefinitely if:
minions are targeted by a list and any targeted minions are offline or do not respond during the initial gather_minions stage
minions fail to respond or go offline after the initial gather_minions stage and a job is published to the offline minion when other targeting methods are used (e.g. glob targeting)
Setup
Default salt-minion/salt-master installs
Steps to Reproduce the behavior
salt master plus some number of minions, default installs for both
scenario 1, batch mode using a list of 2 minions, which are both offline:
#salt -b 1 -L "minion1,minion2" test.ping
Executing run on ['minion1']
Minion '%s' failed to respond to job sent
Executing run on ['minion2']
Minion '%s' failed to respond to job sent
<terminal never returns>
scenario 2, glob targeting; minion1 is offline the entire time and minion2 goes offline (or otherwise does not return) after the gather_minions stage:
#salt -b 1 "minion*" test.ping
Minion minion1 did not respond. No job will be sent.
Executing run on ['minion2']
Minion '%s' failed to respond to job sent
<terminal never returns>
Orchestrations that utilize batch mode in these scenarios will also hang and never move to the next stage of the orchestration.
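For reference, the same batch code path can be exercised from Python as well; a minimal sketch, assuming salt.client.LocalClient on the master and the placeholder minion names above:

import salt.client

local = salt.client.LocalClient()

# cmd_batch returns a generator that yields each minion's return as the
# batches complete; with the hang described above, this loop would never
# finish when a targeted minion is offline.
for ret in local.cmd_batch(
    ["minion1", "minion2"], "test.ping", tgt_type="list", batch="1"
):
    print(ret)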
Expected behavior
no job is sent to minions detected as down during the initial stage of batch mode when a list is used
salt returns after all minions have either responded or the gather_job_timeout has been reached
Expected behavior in scenario 1 above (minion1 and minion2 are offline, list targeting used):
#salt -b 1 -L "minion1,minion2" test.ping
Minion minion1 did not respond. No job will be sent.
Minion minion2 did not respond. No job will be sent.
#
Expected behavior in scenario 2 above (glob targeting; minion1 is offline the entire time and minion2 goes offline or otherwise does not return after the gather_minions stage):
#salt -b 1 "minion*" test.ping
Minion minion1 did not respond. No job will be sent.
Executing run on ['minion2']
Minion 'minion2' failed to respond to job sent
# <back at terminal prompt>
Versions Report
salt --versions-report
Salt Version:
          Salt: 3006.7

Python Version:
        Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.3
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 2.3.1
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.15.1
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: ubuntu 20.04.6 focal
        locale: utf-8
       machine: x86_64
       release: 5.4.0-171-generic
        system: Linux
       version: Ubuntu 20.04.6 focal
Additional context
Seems to have started in at least 3006.5; we did not see this in 3003 (we skipped from 3003 to 3006.5, so we are unsure whether 3004/3005 behave this way).
The following patch resolves the issue in local testing:
--- /opt/saltstack/salt/lib/python3.10/site-packages/salt/cli/batch.py 2024-03-20 09:28:51.901051084 -0500
+++ /tmp/batch.py.new 2024-03-20 10:50:59.370974275 -0500
@@ -83,7 +83,10 @@
)
break
if m is not None:
- fret.add(m)
+ if "failed" in ret[m] and ret[m]["failed"] is True:
+ log.debug("minion {} failed test.ping - will be returned as a down minion".format(m))
+ else:
+ fret.add(m)
return (list(fret), ping_gen, nret.difference(fret))
def get_bnum(self):
@@ -289,11 +292,12 @@
# We already know some minions didn't respond to the ping, so inform
# inform user attempt to run a job failed
salt.utils.stringutils.print_cli(
- "Minion '%s' failed to respond to job sent", minion
+ "Minion '{}' failed to respond to job sent".format(minion)
)
if self.opts.get("failhard"):
failhard = True
+ ret[minion] = data
else:
# If we are executing multiple modules with the same cmd,
# We use the highest retcode.
The changes above in the gather_minions function keep jobs from being sent to minions that do not respond to the initial test.ping when a list is used for targeting.
The changes in the run function allow the CLI to return after a minion has timed out, and fix the console print statement to properly show the minion name.
Documenting this possible fix here until I have time for a proper PR or someone else beats me to it...
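As a plain-Python illustration of what the gather_minions hunk above does (the return shapes in the example call are hypothetical, not actual Salt event payloads):

# Minions whose ping return comes back flagged as failed are treated as
# down and excluded, so no batch job is published to them.
def filter_responding(ping_returns):
    responding = set()
    for minion, data in ping_returns.items():
        if isinstance(data, dict) and data.get("failed") is True:
            # Down minion: leave it out of the responding set.
            continue
        responding.add(minion)
    return responding

# Hypothetical return data for one down and one responding minion:
print(filter_responding({"minion1": {"failed": True}, "minion2": {"ret": True}}))
# -> {'minion2'}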
Thank you @lomeroe, +1 for this patch. It resolved both stagnant salt calls and the memory leak in the EventPublisher process.
System details:
salt-master 3006.8 on Ubuntu 22.04, installed via onedir
2.5k salt-minions, a mix of 3005.1/Ubuntu 18.04 and 3006.8/Ubuntu 22.04, installed via onedir
all are running on a variety of t3 instance types at AWS, across regions
Prior to the patch, our salt-master was unstable and required restarts to resolve the EventPublisher memory leak. We'd see memory usage start to increase shortly after the restart. With the patch, it's been stable for almost 24 hours now.
We don't use orchestration, but we do use batches in a number of tasks. Worth noting that one of the stagnant processes didn't use batch mode but used nodegroups calling a custom module:
$ salt --timeout=10 --out=json --static --out-file=out.json -N minion_group custommodule.customaction
The chart below shows EventPublisher's memory usage over a period of days. We applied the patch twice, to confirm that the previous memory growth behaviour returned when the patch was reverted.
salt-master 3006.8 on Ubuntu 20.04, onedir installation
The issue started happening after upgrading from 3004.2.
~1700 minions, all on 3006.8
We have a runner that executes states using batches and noticed that batch executions sometimes hang; when that happens, a memory leak occurs. With the patch, the runner returns results and no leak appears.
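A runner doing this kind of batched state execution might look roughly like the sketch below; the module path, function name, and use of cmd_batch are illustrative assumptions, not the reporter's actual runner:

# Hypothetical runner sketch, e.g. saved as _runners/batch_states.py on the master.
import salt.client


def apply(tgt, sls, batch="10%"):
    # Apply a state to the target in batches; before the patch, this loop
    # could hang if a targeted minion stopped responding mid-run.
    local = salt.client.LocalClient()
    results = {}
    for chunk in local.cmd_batch(tgt, "state.apply", arg=[sls], batch=batch):
        results.update(chunk)
    return results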