[BUG] 3006 executions in batch mode hang indefinitely in some scenarios #66249

Closed
lomeroe opened this issue Mar 20, 2024 · 4 comments

Labels: Bug (broken, incorrect, or confusing behavior)

@lomeroe (Contributor) commented Mar 20, 2024

Description
Executions published from the master in batch mode hang indefinitely if:

  • minions are targeted by a list and any targeted minions are offline or do not respond during the initial gather_minions stage
  • minions fail to respond or go offline after the initial gather_minions stage and a job is published to the offline minion when other targeting methods are used (e.g. glob targeting)

Setup
Default salt-minion/salt-master installs

Please be as specific as possible and give set-up details.

  • on-prem machine
  • VM (Virtualbox, KVM, etc. please specify)
  • VM running on a cloud service, please be explicit and add details
  • container (Kubernetes, Docker, containerd, etc. please specify)
  • or a combination, please be explicit (masters/minions on both VMware and AWS)
  • jails if it is FreeBSD
  • classic packaging
  • onedir packaging
  • used bootstrap to install

Steps to Reproduce the behavior
(Include debug logs if possible and relevant)
A salt master plus some number of minions, default installs for both.

Scenario 1: batch mode using a list of two minions, both of which are offline:

#salt -b 1 -L "minion1,minion2"  test.ping
Executing run on ['minion1']

Minion '%s' failed to respond to job sent

Executing run on ['minion2']

Minion '%s' failed to respond to job sent
<terminal never returns>

Scenario 2: glob targeting, where minion2 goes offline after the gather_minions stage (or does not return for some reason) and minion1 is offline the entire time:

#salt -b 1 "minion*" test.ping
Minion minion1 did not respond.  No job will be sent.

Executing run on ['minion2']

Minion '%s' failed to respond to job sent
<terminal never returns>

Orchestrations that use batch mode in these scenarios also hang and never move on to the next stage of the orchestration.
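
For reference, a minimal sketch of the kind of orchestration that exercises this code path. The file path, SLS name, and target are hypothetical; it uses the py renderer so the example stays in Python, and is equivalent to a YAML orchestration that applies a state with batch: 1:

#!py
# /srv/salt/orch/apply_batch.sls  (hypothetical path and state name)
# Equivalent to a YAML orchestration using salt.state with batch: 1.
def run():
    return {
        "apply_somestate_in_batches": {
            "salt.state": [
                {"tgt": "minion*"},
                {"sls": "somestate"},
                {"batch": 1},
            ]
        }
    }

An orchestration like this would be started with salt-run state.orchestrate orch.apply_batch and, in the scenarios above, never moves past this stage.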

Expected behavior
  • no job is sent to minions detected to be down during the initial stage of batch mode when a list is used
  • salt returns after all minions have either responded or the gather_job_timeout has been reached

Expected behavior in scenario 1 above (minion1 and minion2 are offline, list targeting used):

#salt -b 1 -L "minion1,minion2"  test.ping
Minion minion1 did not respond.  No job will be sent.
Minion minion2 did not respond.  No job will be sent.
#

Expected behavior in scenario 2 above (glob targeting; minion2 goes offline after the gather_minions stage or does not return for some reason, while minion1 is offline the entire time):

#salt -b 1 "minion*" test.ping
Minion minion1 did not respond.  No job will be sent.

Executing run on ['minion2']

Minion 'minion2' failed to respond to job sent
# <back at terminal prompt>

Versions Report

Output of salt --versions-report:
Salt Version:
          Salt: 3006.7

Python Version:
        Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.3
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 2.3.1
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.15.1
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: ubuntu 20.04.6 focal
        locale: utf-8
       machine: x86_64
       release: 5.4.0-171-generic
        system: Linux
       version: Ubuntu 20.04.6 focal

Additional context
This seems to have started in at least 3006.5; we did not see it in 3003 (we skipped from 3003 to 3006.5, so we are unsure whether 3004/3005 behave this way).

The following patch resolves the issue in local testing:

--- /opt/saltstack/salt/lib/python3.10/site-packages/salt/cli/batch.py  2024-03-20 09:28:51.901051084 -0500
+++ /tmp/batch.py.new   2024-03-20 10:50:59.370974275 -0500
@@ -83,7 +83,10 @@
                         )
                     break
                 if m is not None:
-                    fret.add(m)
+                    if "failed" in ret[m] and ret[m]["failed"] is True:
+                        log.debug("minion {} failed test.ping - will be returned as a down minion".format(m))
+                    else:
+                        fret.add(m)
         return (list(fret), ping_gen, nret.difference(fret))

     def get_bnum(self):
@@ -289,11 +292,12 @@
                         # We already know some minions didn't respond to the ping, so inform
                         # inform user attempt to run a job failed
                         salt.utils.stringutils.print_cli(
-                            "Minion '%s' failed to respond to job sent", minion
+                            "Minion '{}' failed to respond to job sent".format(minion)
                         )

                     if self.opts.get("failhard"):
                         failhard = True
+                    ret[minion] = data
                 else:
                     # If we are executing multiple modules with the same cmd,
                     # We use the highest retcode.

The changes above in the gather_minions function keep jobs from being sent to minions that do not respond to the initial test.ping when a list is used for targeting.
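
To make that first hunk concrete, here is a small, self-contained sketch of the filtering it adds (not the actual salt code; the ping results and minion names are made up): a ping result carrying "failed": True keeps the minion out of the responding set, so it is reported as down instead of receiving a job.

import logging

log = logging.getLogger(__name__)


def split_responding_minions(ping_results, targeted):
    """Mimic the patched gather_minions() filtering (illustrative only)."""
    responding = set()
    for minion, data in ping_results.items():
        if isinstance(data, dict) and data.get("failed") is True:
            # Patched behaviour: a failed test.ping marks the minion as down,
            # so it never enters the responding set and no job is sent to it.
            log.debug("minion %s failed test.ping - treated as down", minion)
            continue
        responding.add(minion)
    down = set(targeted) - responding
    return list(responding), down


# Hypothetical ping results: minion1 never answered, minion2 did.
results = {"minion1": {"failed": True}, "minion2": True}
print(split_responding_minions(results, ["minion1", "minion2"]))
# -> (['minion2'], {'minion1'})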

The changes in the run function allow the CLI to return after a minion has timed out and fix the console print statement to properly show the minion name (instead of a literal '%s').
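
To illustrate why the added ret[minion] = data matters, here is a heavily simplified sketch of the completion check; this is not the real batch loop, and it assumes the run loop keeps waiting until every minion it published a job to has an entry in the results:

published = {"minion1", "minion2"}
ret = {"minion2": {"ret": True}}  # minion1 timed out and was never recorded


def batch_complete(published, ret):
    # Simplified completion check: every published minion has reported something back.
    return published.issubset(ret)


print(batch_complete(published, ret))  # False -> the loop keeps waiting (the hang)
ret["minion1"] = {"failed": True}      # what the added `ret[minion] = data` records
print(batch_complete(published, ret))  # True -> the batch can finish and the CLI returns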

Documenting this possible fix here until I have time for a proper PR or someone else beats me to it...

lomeroe added the Bug (broken, incorrect, or confusing behavior) and needs-triage labels on Mar 20, 2024
@thinkst-marco

Thank you @lomeroe, +1 for this patch. It resolved both the stagnant salt calls and the memory leak in the EventPublisher process.

System details:

  • salt-master 3006.8 on Ubuntu 22.04, installed via onedir
  • 2.5k salt-minions a mix of 3005.1/Ubuntu 18.04, and 3006.8/Ubuntu 22.04, installed via onedir
  • all are running on a variety of t3 instance types at AWS, across regions

Prior to the patch, our salt-master was unstable and required restarts to resolve the EventPublisher memory leak. We'd see memory usage start to increase shortly after the restart. With the patch, it's been stable for almost 24 hours now.

We don't use orchestration but we do use batches in a number of tasks. Worth noting that one of the stagnant processes didn't use batch mode but uses nodegroups calling to a custom module:

$ salt --timeout=10 --out=json --static --out-file=out.json -N minion_group custommodule.customaction

The chart below shows EventPublisher's memory usage over a period of days. We applied the patch twice (to confirm that when reverted, the previous memory growth behaviour returned):
[image: chart of EventPublisher memory usage over several days]

@romanbakaleyko

romanbakaleyko commented Jul 27, 2024

salt-master 3006.8 on Ubuntu 20.04, onedir installation
The issue started happening after an upgrade from 3004.2.
~1700 minions, all on 3006.8.

We have a runner that executes states using batches and noticed that batch executions sometimes hang; when that happens, a memory leak occurs. With the patch, the runner returns results and no leak appears.
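
The commenter's runner is not shown; for illustration only, here is one way a custom runner might execute states in batches via LocalClient.cmd_batch (the module path, target, and state name are placeholders):

# /srv/salt/_runners/batch_state.py  (hypothetical runner module)
import salt.client


def apply(tgt="minion*", state="somestate", batch="1"):
    """Apply a state in batches and collect the results (illustrative sketch)."""
    client = salt.client.LocalClient()
    results = {}
    # cmd_batch yields a {minion: result} chunk as each batch completes.
    for chunk in client.cmd_batch(tgt, "state.apply", arg=[state], batch=batch):
        results.update(chunk)
    return results

A runner module like this would be invoked with salt-run batch_state.apply (given the hypothetical path above), and cmd_batch is backed by the same batch machinery patched earlier in this issue.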

dmurphy18 self-assigned this on Jul 29, 2024
dmurphy18 added this to the Sulfur v3006.10 milestone on Jul 29, 2024
@dmurphy18 (Contributor)

@lomeroe Thanks for that patch; trying the changes, with tests to come.

@dmurphy18 (Contributor)

@lomeroe Closing, since the associated PR #66758 was merged into the 3006.x branch and will be merged forward into the 3007.x and master branches.
