[BUG] 3006 executions in batch mode hang indefinitely in some scenarios #66249

Closed
lomeroe opened this issue Mar 20, 2024 · 4 comments

Labels: Bug (broken, incorrect, or confusing behavior)

@lomeroe (Contributor) commented Mar 20, 2024

Description
Executions published from the master in batch mode hang indefinitely if:

  • minions are targeted by a list and any targeted minions are offline or do not respond during the initial gather_minions stage
  • minions fail to respond or go offline after the initial gather_minions stage and a job is published to the offline minion when other targeting methods are used (e.g. glob targeting)

Setup
Default salt-minion/salt-master installs

Please be as specific as possible and give set-up details.

  • on-prem machine
  • VM (Virtualbox, KVM, etc. please specify)
  • VM running on a cloud service, please be explicit and add details
  • container (Kubernetes, Docker, containerd, etc. please specify)
  • or a combination, please be explicit (masters/minions on both VMware and AWS)
  • jails if it is FreeBSD
  • classic packaging
  • onedir packaging
  • used bootstrap to install

Steps to Reproduce the behavior
(Include debug logs if possible and relevant)
A salt master plus some number of minions, default installs for both.

Scenario 1: batch mode using a list of two minions, both of which are offline:

#salt -b 1 -L "minion1,minion2"  test.ping
Executing run on ['minion1']

Minion '%s' failed to respond to job sent

Executing run on ['minion2']

Minion '%s' failed to respond to job sent
<terminal never returns>

Scenario 2: glob targeting, where minion2 goes offline after the gather_minions stage (or does not return for some reason) and minion1 is offline the entire time:

#salt -b 1 "minion*" test.ping
Minion minion1 did not respond.  No job will be sent.

Executing run on ['minion2']

Minion '%s' failed to respond to job sent
<terminal never returns>

Orchestrations that use batch mode in these scenarios also hang and never move on to the next stage of the orchestration.
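
For reference, a minimal sketch of the kind of orchestration that exercises this code path. The file path, SLS name, and target are hypothetical; it uses the py renderer so the example stays in Python, and is equivalent to a YAML orchestration that applies a state with batch: 1:

#!py
# /srv/salt/orch/apply_batch.sls  (hypothetical path and state name)
# Equivalent to a YAML orchestration using salt.state with batch: 1.
def run():
    return {
        "apply_somestate_in_batches": {
            "salt.state": [
                {"tgt": "minion*"},
                {"sls": "somestate"},
                {"batch": 1},
            ]
        }
    }

An orchestration like this would be started with salt-run state.orchestrate orch.apply_batch and, in the scenarios above, never moves past this stage.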

Expected behavior
  • no job is sent to minions detected to be down during the initial stage of batch mode when a list is used
  • salt returns after all minions have either responded or the gather_job_timeout has been reached

Expected behavior in scenario 1 above (minion1 and minion2 are offline, list targeting used):

#salt -b 1 -L "minion1,minion2"  test.ping
Minion minion1 did not respond.  No job will be sent.
Minion minion2 did not respond.  No job will be sent.
#

Expected behavior in scenario 2 above (glob targeting; minion2 goes offline after the gather_minions stage or does not return for some reason, while minion1 is offline the entire time):

#salt -b 1 "minion*" test.ping
Minion minion1 did not respond.  No job will be sent.

Executing run on ['minion2']

Minion 'minion2' failed to respond to job sent
# <back at terminal prompt>

Versions Report

Output of salt --versions-report:
Salt Version:
          Salt: 3006.7

Python Version:
        Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.3
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 2.3.1
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.15.1
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: ubuntu 20.04.6 focal
        locale: utf-8
       machine: x86_64
       release: 5.4.0-171-generic
        system: Linux
       version: Ubuntu 20.04.6 focal

Additional context
This seems to have started in at least 3006.5; we did not see it in 3003 (we skipped from 3003 to 3006.5, so we are unsure whether 3004/3005 behave this way).

The following patch resolves the issue in local testing:

--- /opt/saltstack/salt/lib/python3.10/site-packages/salt/cli/batch.py  2024-03-20 09:28:51.901051084 -0500
+++ /tmp/batch.py.new   2024-03-20 10:50:59.370974275 -0500
@@ -83,7 +83,10 @@
                         )
                     break
                 if m is not None:
-                    fret.add(m)
+                    if "failed" in ret[m] and ret[m]["failed"] is True:
+                        log.debug("minion {} failed test.ping - will be returned as a down minion".format(m))
+                    else:
+                        fret.add(m)
         return (list(fret), ping_gen, nret.difference(fret))

     def get_bnum(self):
@@ -289,11 +292,12 @@
                         # We already know some minions didn't respond to the ping, so inform
                         # inform user attempt to run a job failed
                         salt.utils.stringutils.print_cli(
-                            "Minion '%s' failed to respond to job sent", minion
+                            "Minion '{}' failed to respond to job sent".format(minion)
                         )

                     if self.opts.get("failhard"):
                         failhard = True
+                    ret[minion] = data
                 else:
                     # If we are executing multiple modules with the same cmd,
                     # We use the highest retcode.

The changes above in the gather_minions function keep jobs from being sent to minions that do not respond to the initial test.ping when a list is used for targeting.
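
To make that first hunk concrete, here is a small, self-contained sketch of the filtering it adds (not the actual salt code; the ping results and minion names are made up): a ping result carrying "failed": True keeps the minion out of the responding set, so it is reported as down instead of receiving a job.

import logging

log = logging.getLogger(__name__)


def split_responding_minions(ping_results, targeted):
    """Mimic the patched gather_minions() filtering (illustrative only)."""
    responding = set()
    for minion, data in ping_results.items():
        if isinstance(data, dict) and data.get("failed") is True:
            # Patched behaviour: a failed test.ping marks the minion as down,
            # so it never enters the responding set and no job is sent to it.
            log.debug("minion %s failed test.ping - treated as down", minion)
            continue
        responding.add(minion)
    down = set(targeted) - responding
    return list(responding), down


# Hypothetical ping results: minion1 never answered, minion2 did.
results = {"minion1": {"failed": True}, "minion2": True}
print(split_responding_minions(results, ["minion1", "minion2"]))
# -> (['minion2'], {'minion1'})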

The changes in the run function allow the CLI to return after a minion has timed out and fix the console print statement to properly show the minion name (instead of a literal '%s').
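
To illustrate why the added ret[minion] = data matters, here is a heavily simplified sketch of the completion check; this is not the real batch loop, and it assumes the run loop keeps waiting until every minion it published a job to has an entry in the results:

published = {"minion1", "minion2"}
ret = {"minion2": {"ret": True}}  # minion1 timed out and was never recorded


def batch_complete(published, ret):
    # Simplified completion check: every published minion has reported something back.
    return published.issubset(ret)


print(batch_complete(published, ret))  # False -> the loop keeps waiting (the hang)
ret["minion1"] = {"failed": True}      # what the added `ret[minion] = data` records
print(batch_complete(published, ret))  # True -> the batch can finish and the CLI returns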

Documenting this possible fix here until I have time for a proper PR or someone else beats me to it...

lomeroe added the Bug (broken, incorrect, or confusing behavior) and needs-triage labels on Mar 20, 2024
@thinkst-marco

Thank you @lomeroe, +1 for this patch. It resolved both the stagnant salt calls and the memory leak in the EventPublisher process.

System details:

  • salt-master 3006.8 on Ubuntu 22.04, installed via onedir
  • 2.5k salt-minions a mix of 3005.1/Ubuntu 18.04, and 3006.8/Ubuntu 22.04, installed via onedir
  • all are running on a variety of t3 instance types at AWS, across regions

Prior to the patch, our salt-master was unstable and required restarts to resolve the EventPublisher memory leak. We'd see memory usage start to increase shortly after the restart. With the patch, it's been stable for almost 24 hours now.

We don't use orchestration but we do use batches in a number of tasks. Worth noting that one of the stagnant processes didn't use batch mode but uses nodegroups calling to a custom module:

$ salt --timeout=10 --out=json --static --out-file=out.json -N minion_group custommodule.customaction

The chart below shows EventPublisher's memory usage over a period of days. We applied the patch twice (to confirm that when reverted, the previous memory growth behaviour returned):
[image: chart of EventPublisher memory usage over several days]

@romanbakaleyko

romanbakaleyko commented Jul 27, 2024

salt-master 3006.8 on Ubuntu 20.04, onedir installation
The issue started happening after an upgrade from 3004.2.
~1700 minions, all on 3006.8.

We have a runner that executes states using batches and noticed that batch executions sometimes hang; when that happens, a memory leak occurs. With the patch, the runner returns results and no leak appears.
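
The commenter's runner is not shown; for illustration only, here is one way a custom runner might execute states in batches via LocalClient.cmd_batch (the module path, target, and state name are placeholders):

# /srv/salt/_runners/batch_state.py  (hypothetical runner module)
import salt.client


def apply(tgt="minion*", state="somestate", batch="1"):
    """Apply a state in batches and collect the results (illustrative sketch)."""
    client = salt.client.LocalClient()
    results = {}
    # cmd_batch yields a {minion: result} chunk as each batch completes.
    for chunk in client.cmd_batch(tgt, "state.apply", arg=[state], batch=batch):
        results.update(chunk)
    return results

A runner module like this would be invoked with salt-run batch_state.apply (given the hypothetical path above), and cmd_batch is backed by the same batch machinery patched earlier in this issue.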

dmurphy18 self-assigned this on Jul 29, 2024
dmurphy18 added this to the Sulfur v3006.10 milestone on Jul 29, 2024
@dmurphy18 (Contributor)

@lomeroe Thanks for that patch; trying the changes, with tests to come.

@dmurphy18 (Contributor)

@lomeroe Closing, since the associated PR #66758 was merged into the 3006.x branch and will be merged forward into the 3007.x and master branches.
