
[BUG] intermittent connection between master and minion #65265

Open
qianguih opened this issue Sep 21, 2023 · 72 comments
Labels: Bug (broken, incorrect, or confusing behavior), Transport
@qianguih

Description
I am seeing a weird connection issue in my Salt setup. There are ~30 minions registered with the master. For a few of them, the master couldn't connect to them anymore after a while. salt '*' test.ping failed with the following error message:

    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
    
    salt-run jobs.lookup_jid 20230920213139507242

Here are a few observations:

  • Restarting the salt-minion service helped, but the same minion lost connection again after a while.
  • salt-call test.ping works fine on the minion side. Other commands like salt-call state.apply also work fine. This indicates that minion-to-master communication is fine but master-to-minion communication is not.
  • Below is the error message I found in the minion log. I tried bumping up the timeout, e.g. salt '*' -t 600 test.ping, but it doesn't help:
2023-09-20 13:00:08,121 [salt.minion      :2733][ERROR   ][821760] Timeout encountered while sending {'cmd': '_return', 'id': 'minion', 'success': True, 'return': True, 'retcode': 0, 'jid': '20230920195941006337', 'fun': 'test.ping', 'fun_args': [], 'user': 'root', '_stamp': '2023-09-20T19:59:41.114944', 'nonce': '3b23a38761fc4e98a694448d36ac7f97'} request
Does anyone have any idea what's wrong here and how to debug this issue?
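
For anyone hitting the same symptom, a minimal way to check whether the job actually ran despite the "No response" is the jobs runner, and the master-side gather_job_timeout option gives slow minions longer to answer the "is this job still running?" check. A sketch (the jid is the one from the error above; the timeout value is just an example):

salt-run jobs.lookup_jid 20230920213139507242    # returns the master did collect for that jid, if any

# /etc/salt/master (sketch)
gather_job_timeout: 30    # seconds; default is 10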

Setup

  • The minion was installed with sudo ./bootstrap-salt.sh -A <master-ip-address> -i $(hostname) stable 3006.3. No custom config on the minion.
  • The master runs inside a container using the image saltstack/salt:3006.3. Master config:
nodegroups:
  prod-early-adopter: L@minion-hostname-1
  prod-general-population: L@minion-hostname-2
  release: L@minion-hostname-3
  custom: L@minion-hostname-4

file_roots:
  base:
    - <path/to/custom/state/file>

state file:

pull_state_job:
  schedule.present:
    - function: state.apply
    - maxrunning: 1
    - when: 8:00pm

deploy:
  cmd.run:
    - name: '<custom-command-here>'
    - runas: ubuntu


Versions Report

Output of salt --versions-report:
Salt Version:
          Salt: 3006.3
 
Python Version:
        Python: 3.10.4 (main, Apr 20 2022, 01:21:48) [GCC 10.3.1 20210424]
 
Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.2
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: Not Installed
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4
 
System Versions:
          dist: alpine 3.14.6 
        locale: utf-8
       machine: x86_64
       release: 5.11.0-1022-aws
        system: Linux
       version: Alpine Linux 3.14.6 


qianguih added the Bug and needs-triage labels on Sep 21, 2023

@brettgilmer

Hi - I am seeing this same issue.
I am also seeing it on the periodic pings configured by the "ping_interval" minion configuration parameter.
Running Salt 3006.5 on both the minion and the master.
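
For reference, a sketch of the setting being mentioned; ping_interval is a minion-side option and its value is in minutes (the value below is just an example):

# /etc/salt/minion (sketch)
ping_interval: 10    # minion pings the master every 10 minutes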

@darkpixel
Contributor

It seems worse with 3006.5 with Linux as the master when managing Windows minions.

@brettgilmer

This is affecting many of our endpoints. I can get them to re-establish communication by restarting the minion or the master, but they lose communication again.

@brettgilmer

Restarting the salt master seems to fix the issue for all minions for a while, but the issue will return after about 12 hours on a different, seemingly random selection of minions.

@raddessi
Contributor

I seem to have a very similar issue with 3006.x but in my case restarting the master does not have any effect and only a minion restart resolves the issue.

Another oddity is that I can see in the minion logs that the minion is still receiving commands from the master and is able to execute them just fine, but the master seemingly never receives the response data. If I issue a salt '*' service.restart salt-minion from the master, all of the minions receive the command, restart, and pop back up just fine, and then communication will work for probably another 12 hours or so.

I don't recall having this issue on 3005.x, but I have not downgraded that far yet. So far both 3006.5 and 3006.4 minions have the problem for me. I'll try to run a tcpdump if I have time.

@ReubenM

ReubenM commented Jan 29, 2024

I am encountering similar issues. Everything is 3006.5.

I've spent two days thinking I broke something in some recent changes I made, but I've found that the minions' jobs are succeeding; they just time out trying to communicate back to the master. I'm thinking this may be related to concurrency plus load. I use this for test-environment automation, and during tests I have concurrent jobs fired off by the scheduler for test data collection. That is where the issues start to show up in the logs. When this happens, the minions seem to try to re-send the data, which just compounds the problem. The logs on the master show that it is getting the messages, because it is flagging duplicate messages, but something seems to be getting lost processing the return data.

The traces all look the same and seem to indicate something is getting dropped in concurrency-related code:

2024-01-29 15:22:57,215 [salt.master      :1924][ERROR   ][115353] Error in function minion_pub:
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 1910, in pub
    payload = channel.send(payload_kwargs, timeout=timeout)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/asynchronous.py", line 125, in wrap
    raise exc_info[1].with_traceback(exc_info[2])
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/asynchronous.py", line 131, in _target
    result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
    return future_cell[0].result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 338, in send
    ret = yield self._uncrypted_transfer(load, timeout=timeout)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 309, in _uncrypted_transfer
    ret = yield self.transport.send(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 909, in send
    ret = yield self.message_client.send(load, timeout=timeout)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 589, in send
    recv = yield future
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 387, in run_job
    pub_data = self.pub(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 1913, in pub
    raise SaltReqTimeoutError(
salt.exceptions.SaltReqTimeoutError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1918, in run_func
    ret = getattr(self, func)(load)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1839, in minion_pub
    return self.masterapi.minion_pub(clear_load)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/daemons/masterapi.py", line 952, in minion_pub
    ret["jid"] = self.local.cmd_async(**pub_load)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 494, in cmd_async
    pub_data = self.run_job(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 409, in run_job
    raise SaltClientError(general_exception)

@darkpixel
Contributor

I just discovered something.

At any random time I might have 25-50 minions that don't appear to respond to jobs. They may or may not respond to test.ping, but definitely won't respond to something like state.sls somestate.

...buuuut they actually ARE listening to the master.

So my workflow is stupidly:

salt '*' minion.restart
# Wait for every minion to return that it failed to respond
salt '*' state.sls somestate
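
A quick way to enumerate which minions are in that half-dead state from the master's point of view is the stock manage runner (a sketch, not something from this thread; it only reflects what the master sees):

salt-run manage.status    # classifies minions as up or down based on a ping
salt-run manage.down      # lists only the minions that did not answer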

@raddessi
Contributor

raddessi commented Feb 13, 2024

@darkpixel Yes, I have found the same thing and have the same workflow. Something just gets stuck and responses get lost somewhere. They are always still receiving events, though, as you say, in my experience.

@raddessi
Contributor

This seems to still be an issue on 3006.7 when both minion and master are the same version

@darkpixel
Contributor

3007.0 is...worse?

Woke up to all ~600 minions in an environment being offline.

salt '*' test.ping failed for every minion.

The log showed returns from every minion, but the master spit out Minion did not return: [Not connected] for every single one.

Restarted the salt-master service, got distracted for ~15 minutes, ran another test.ping and several hundred failed to respond.

Used Cluster SSH to connect in to every machine I can reach across the internet and restarted the salt-minion service and I'm down to a mix of ~60 (Windows, Linux, and BSD) that don't respond and I can't reach. Maybe 10 of them are 3006.7.

I'd love to test/switch to a different transport like websockets that would probably be more stable, but it appears to be "all or nothing". If I switch to websockets on the master, it looks like every minion will disconnect unless I also update them to use websockets...and if I update them to use websockets and something breaks, I'm going to have to spend the next month trying to get access to hosts to fix salt-minion.
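
For context, the switch being described is controlled by a single transport option that, as noted above, has to be the same on the master and every minion. A sketch of what it looks like (zeromq is the default; tcp is shown here, and the newer websocket transport follows the same pattern):

# /etc/salt/master and /etc/salt/minion (sketch) - both sides must use the same value
transport: tcp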

@darkpixel
Contributor

It just happened on my master which is 3007.0...

I was running a highstate on a minion that involves certificate signing, and it refused to generate the certificate, with no error messages in the salt master log.

I tried restarting the salt master, no dice.

About 10 minutes later I decided to restart the salt master's minion...and suddenly certificate signing worked.

The minion on the master wasn't communicating with the master...locally...on the same box...

@gregorg

gregorg commented Mar 25, 2024

It just happened on my master which is 3007.0...

Try some ZMQ tuning. I did it on my 3006.4 (the latest really stable version):

# The number of salt-master worker threads that process commands
# and replies from minions and the Salt API
# Default: 5
# Recommendation: 1 worker thread per 200 minions, max 1.5x cpu cores
# 24x1.5 = 36, should handle 7200 minions
worker_threads: 96

# The listen queue size / backlog
# Default: 1000
# Recommendation: 1000-5000
zmq_backlog: 2000

# The publisher interface ZeroMQPubServerChannel
# Default: 1000
# Recommendation: 10000-100000
pub_hwm: 50000

# Default: 100
# Recommendation: 100-500
thread_pool: 200

max_open_files: 65535
salt_event_pub_hwm: 128000
event_publisher_pub_hwm: 64000

@sasidharjetb

Where do I need to add this @gregorg in the salt master, and how do we need to add it?

@gregorg

gregorg commented Mar 28, 2024

Where do i need to add this @gregorg in salt master and how we need to add this

Add this in /etc/salt/master and restart salt-master.
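
A sketch of one way to do that without editing the main config file directly; the master also reads drop-in files from master.d by default, and the file name below is just an example:

cat > /etc/salt/master.d/zmq-tuning.conf <<'EOF'
worker_threads: 96
zmq_backlog: 2000
pub_hwm: 50000
EOF
systemctl restart salt-master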

@sasidharjetb

[screenshot]

@sasidharjetb

We upgraded Salt to 3006.4 on the master and 20 minions, out of which 10 minions are not upgraded.
Will this solve my issue?

@gregorg

gregorg commented Mar 28, 2024

This is not a support ticket; look at the salt master logs.

@darkpixel
Contributor

I tried those settings @gregorg. It's been intermittent for the last three days...and this morning 100% of my minions are offline (even the local one on the salt master).

If I connect to a box with a minion, the service shows as running. I can run state.highstate locally and everything works properly.

Restarting the master brings everything online.

There's nothing that appears unusual in the master log. I can even see minions reporting their results if I do something like salt '*' test.ping, but all I get back is Minion did not return. [Not connected].

I'd love to switch to a potentially more reliable transport, but it looks like Salt can only have one transport active at a time...so if I enable something like websockets it looks like all my minions will be knocked offline until I reconfigure them.

@darkpixel
Contributor

I just noticed an interesting log entry on the master. A bunch of my minions weren't talking again, even though the log had a ton of lines of "Got return from..."

So I restarted salt-master and noticed this in the log:

2024-04-07 14:40:30,306 [salt.transport.zeromq:477 ][INFO    ][2140336] MWorkerQueue under PID 2140336 is closing
2024-04-07 14:40:30,306 [salt.transport.zeromq:477 ][INFO    ][2140337] MWorkerQueue under PID 2140337 is closing
2024-04-07 14:40:30,306 [salt.transport.zeromq:477 ][INFO    ][2140318] MWorkerQueue under PID 2140318 is closing
2024-04-07 14:40:30,307 [salt.transport.zeromq:477 ][INFO    ][2140341] MWorkerQueue under PID 2140341 is closing
2024-04-07 14:40:30,307 [salt.transport.zeromq:477 ][INFO    ][2140339] MWorkerQueue under PID 2140339 is closing
2024-04-07 14:40:30,310 [salt.transport.zeromq:477 ][INFO    ][2140335] MWorkerQueue under PID 2140335 is closing
2024-04-07 14:40:30,312 [salt.transport.zeromq:477 ][INFO    ][2140319] MWorkerQueue under PID 2140319 is closing
2024-04-07 14:40:30,312 [salt.transport.zeromq:477 ][INFO    ][2140338] MWorkerQueue under PID 2140338 is closing
2024-04-07 14:40:30,316 [salt.transport.zeromq:477 ][INFO    ][2140320] MWorkerQueue under PID 2140320 is closing
2024-04-07 14:40:30,335 [salt.transport.zeromq:477 ][INFO    ][2140343] MWorkerQueue under PID 2140343 is closing
2024-04-07 14:40:30,360 [salt.transport.zeromq:477 ][INFO    ][2140333] MWorkerQueue under PID 2140333 is closing
2024-04-07 14:40:31,307 [salt.utils.process:745 ][INFO    ][2140315] Some processes failed to respect the KILL signal: Process: <Process name='MWorkerQueue' pid=2140316 parent=2140315 started> (Pid: 2140316)
2024-04-07 14:40:31,308 [salt.utils.process:752 ][INFO    ][2140315] kill_children retries left: 3
2024-04-07 14:40:31,334 [salt.utils.parsers:1061][WARNING ][2140139] Master received a SIGTERM. Exiting.
2024-04-07 14:40:31,334 [salt.cli.daemons :99  ][INFO    ][2140139] The Salt Master is shut down
2024-04-07 14:40:32,241 [salt.cli.daemons :83  ][INFO    ][2407186] Setting up the Salt Master

Specifically this:

2024-04-07 14:40:31,307 [salt.utils.process:745 ][INFO    ][2140315] Some processes failed to respect the KILL signal: Process: <Process name='MWorkerQueue' pid=2140316 parent=2140315 started> (Pid: 2140316)

Maybe something's hanging the MWorkerQueue?

@amalaguti

(quoting gregorg's ZMQ tuning settings from above)

Any improvement with these settings ?

@darkpixel
Contributor

I didn't use those exact settings because my master is smaller and has fewer minions.
8 cores * 1.5 threads per core = 12 threads = 2,400 minions (I only have ~700 on this test box).

It's no longer dropping all the minions every few hours...it's more like once or twice a week.

@darkpixel
Contributor

Also, I'm not sure if this is related or not, but it seems to be in the same vein--communication between the minions and master is pretty unreliable.

root@salt:~# salt 'US*' state.sls win.apps
'str' object has no attribute 'pop'

root@salt:~# salt 'US*' state.sls win.apps
US-REDACTED-54:
    Data failed to compile:
----------
    The function "state.sls" is running as PID 7340 and was started at 2024, May 02 03:22:05.301817 with jid 20240502032205301817
'str' object has no attribute 'pop'

root@salt:~# service salt-master restart
root@salt:~# service salt-minion restart
# wait 30 seconds or so for minions to reconnect
root@salt:~# salt 'US*' state.sls win.apps
<snip lots of output>
-------------------------------------------
Summary
-------------------------------------------
# of minions targeted: 428
# of minions returned: 405
# of minions that did not return: 23
# of minions with errors: 6
-------------------------------------------
ERROR: Minions returned with non-zero exit code
root@salt:~# 

@raddessi
Contributor

raddessi commented May 2, 2024

Checking back in here, I think this is actually resolved for me once I got all my minions to 3007.0. I've removed all restart cron jobs and the minions appear to have been stable for days now. Is anyone else still having issues with 3007.0 minions?

communication between the minions and master is pretty unreliable

Yeah.. this may still be an issue for me as well. I'm not sure yet. I noticed some odd things last night in testing but it could be unrelated. I definitely don't have the 'str' object has no attribute 'pop' or any other error but sometimes minions do not return in time.

@Rosamaha1

Hi all,

I have the same problem on 3007.0.
Master-to-minion connectivity always fails.
The other way around works fine.
I did some tuning of the master conf but it didn't help!

@gregorg

gregorg commented Sep 19, 2024

I've never used this. What do you mean by disabled? Do you mean you used to be pulling it, and then you stopped? Or did you do something in configs?

Related to issue #66562
I stopped pulling it.

@dwoz
Contributor

dwoz commented Sep 25, 2024

Everyone,

At least one user who could not upgrade to 3007.x fixed their environment with this patch:

diff --git a/salt/transport/zeromq.py b/salt/transport/zeromq.py
index 8bc4d0dff26..d06d8976d1e 100644
--- a/salt/transport/zeromq.py
+++ b/salt/transport/zeromq.py
@@ -776,7 +776,7 @@ class ZeroMQSocketMonitor:
     async def consume(self):
         while self._running.is_set():
             try:
-                if self._monitor_socket.poll():
+                if await self._monitor_socket.poll():
                     msg = await self._monitor_socket.recv_multipart()
                     self.monitor_callback(msg)
                 else:

Can anyone in this thread who is willing try out 3007.1 with this patch and report their findings? It would be a huge help.
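
A sketch of applying that one-line change to an installed onedir master; the site-packages path is taken from the tracebacks earlier in this thread and the diff file name is hypothetical, so adjust for your install and keep a backup:

cd /opt/saltstack/salt/lib/python3.10/site-packages
cp salt/transport/zeromq.py salt/transport/zeromq.py.orig
patch -p1 < /tmp/zeromq-monitor-poll.diff    # the diff from the comment above, saved to a file
systemctl restart salt-master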

@amalaguti

@dwoz should it be applied on both the master and the minions in the environment, or is one or the other enough?

@mkhpalm

mkhpalm commented Sep 25, 2024

@dwoz should it be applied on both the master and the minions in the environment, or is one or the other enough?

Just applying it to the master is what worked for us

@darkpixel
Contributor

Can anyone in this thread who is willing try out 3007.1 with this patch and report their findings? It would be a huge help.

I just applied it to my 3007.1 master and restarted.

It definitely didn't fix minions hanging on Executing state x509.pem_managed for [/path/to/cert]...which could be a separate issue (the minion on the master is in charge of signing).

As for drops/disconnects, if everything is still connected after 24 hours that would be a pretty good sign for me that it's fixed.

@darkpixel
Contributor

I'd say that fixes it @dwoz.
All my minions are still online.

@darkpixel
Contributor

It's been 3 days. All my minions are connected! Hallelujah!

@johnl2323

johnl2323 commented Oct 9, 2024

Will this be backported to the stable 3006.X release?

dwoz added a commit to dwoz/salt that referenced this issue Oct 10, 2024
@ukdaf

ukdaf commented Oct 10, 2024

(quoting dwoz's ZeroMQSocketMonitor patch and request for testers from above)

@dwoz This does not seem to apply to 3006.x, but your remark that it was from someone unable to upgrade to 3007.x suggests it should be?

dwoz added a commit that referenced this issue Oct 10, 2024
@romanbakaleyko

romanbakaleyko commented Oct 13, 2024

I have the same problem on 3006.9 and am curious about the patch, but from what I understand the patch is for the 3007.x version.

@DDRBoxman

This patch worked great on a salt master on 3007.1 with a few thousand minions 👍

@Rosamaha1

Great to read about the fix here!
Waiting for the v3007.2 release and hoping that I can update salt-master then without any problems :)

@q1x
Contributor

q1x commented Oct 30, 2024

Hi @dwoz any indication of a date for a fix for this?

@ukdaf

ukdaf commented Nov 1, 2024

Hi @dwoz any indication of a date for a fix for this?

@q1x Given the current disarray of the project, I wouldn't hold out too much hope. :(

@Rosamaha1

Hi @dwoz any indication of a date for a fix for this?

@q1x Given the current disarray of the project, I wouldn't hold out too much hope. :(

What's going on at the moment? Seems like I missed something...

@darkpixel
Contributor

It's been going on for a while. Salt was bought by VMware, restructured, then bought out by Broadcom, and Broadcom doesn't seem to care about SaltStack. All of this happened in the middle of a big transition in the code while trying to make stuff more "community maintainable", so stuff is in disarray and the future seems uncertain. Hopefully the dust will settle soon and things will improve.

@darkpixel
Contributor

There's more to this connectivity issue than what @dwoz pointed out.

I've been playing around extensively with the Linux and BSD minions today.

If I use clusterssh to open an SSH session to all of them and run salt-call -l info saltutil.refresh_pillar, about half of them return False. I closed the sessions that returned True.

I ran the command a handful of times on the ones that returned False and they continually returned False.

Then I ran service salt-minion restart, waited a few seconds, and re-ran salt-call -l info saltutil.refresh_pillar. Every single one of them returned True and continued returning True for an unknown amount of time, until they stop responding to commands from the master and running salt-call -l info saltutil.refresh_pillar locally on the minion starts failing again.

@dwoz
Contributor

dwoz commented Nov 18, 2024

Hi @dwoz any indication of a date for a fix for this?

@q1x Given the current disarray of the project, I wouldn't hold out too much hope. :(

What's going on at the moment? Seems like I missed something...

We will get things back on track. We were ready to do a release at the end of October, but our CI/CD infrastructure went away with some service consolidation. We are currently working to get CI/CD pipelines going again, and this effort is coming along. Hopefully it won't be too much longer before we can start cutting releases again. Likely a few weeks more before we are there.

@dwoz
Contributor

dwoz commented Nov 18, 2024

@darkpixel

Then I ran service salt-minion restart, waited a few seconds, and re-ran salt-call -l info saltutil.refresh_pillar. Every single one of them returned True and continued returning True for an unknown amount of time, until they stop responding to commands from the master and running salt-call -l info saltutil.refresh_pillar locally on the minion starts failing again.

Can you please provide debug or trace log output for a failing minion?

@darkpixel
Contributor

@dwoz

Can you please provide debug or trace log output for a failing minion?

I need to amend my previous report. Restarting the salt master/minion does NOT fix it.
Restarting the local minion does fix it.

I think this snippet from the logs is the culprit:

[DEBUG   ] Using selector: EpollSelector
[DEBUG   ] Using selector: EpollSelector
[DEBUG   ] Publisher connecting to /var/run/salt/minion/minion_event_08c88717e0_pull.ipc
[DEBUG   ] Unable to connect pusher: Stream is closed
[DEBUG   ] Using selector: EpollSelector

I'm not well-versed on transports in Salt, but I think the local salt-call -l info saltutil.refresh_pillar command is piggybacking on the existing minion service's connection to the master, and that connection seems to be closed/disconnected and never tries to re-establish itself. When a minion is in this state, no calls from the master succeed--I just get a timeout message.

The affected minions span both Linux and FreeBSD. All the Linux minions are 3007.1 with the exception of one 3006.8.
All the FreeBSD minions are 3006.8 because it's "unsupported" and the package maintainer had trouble with the 3007.x build and hasn't released a fix yet. It seems to affect about ~30 minions fairly consistently.

Full minion log here:

# salt-call -l trace saltutil.refresh_pillar
[TRACE   ] Setting up log file logging: {'log_path': '/var/log/salt/minion', 'log_level': 'warning', 'log_format': '%(asctime)s,%(msecs)03d [%(name)-17s][%(module)s][%(levelname)-8s] %(message)s', 'date_format': '%Y-%m-%d %H:%M:%S', 'max_bytes': 0, 'backup_count': 0, 'user': 'root'}
[TRACE   ] The required configuration section, 'fluent_handler', was not found the in the configuration. Not loading the fluent logging handlers module.
[TRACE   ] None of the required configuration sections, 'logstash_udp_handler' and 'logstash_zmq_handler', were found in the configuration. Not loading the Logstash logging handlers module.
[TRACE   ] Error loading log_handlers.sentry_mod: Cannot find 'raven' python library, 'sentry_handler' config is empty or not defined
[TRACE   ] Processing <bound method SaltfileMixIn.process_saltfile of <salt.cli.call.SaltCall object at 0x7469f14062c0>>
[TRACE   ] Processing <bound method ConfigDirMixIn.process_config_dir of <salt.cli.call.SaltCall object at 0x7469f14062c0>>
[DEBUG   ] Missing configuration file: /etc/salt/minion
[DEBUG   ] Including configuration from '/etc/salt/minion.d/_schedule.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/_schedule.conf
[DEBUG   ] Including configuration from '/etc/salt/minion.d/beacons.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/beacons.conf
[DEBUG   ] Including configuration from '/etc/salt/minion.d/engine.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/engine.conf
[DEBUG   ] Including configuration from '/etc/salt/minion.d/f_defaults.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/f_defaults.conf
[DEBUG   ] Including configuration from '/etc/salt/minion.d/reactor.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/reactor.conf
[DEBUG   ] Using cached minion ID from /etc/salt/minion_id: my-minion.customer.tld
[TRACE   ] Processing <bound method OutputOptionsMixIn.process_output of <salt.cli.call.SaltCall object at 0x7469f14062c0>>
[TRACE   ] Processing <bound method OutputOptionsMixIn.process_output_file of <salt.cli.call.SaltCall object at 0x7469f14062c0>>
[TRACE   ] Processing <bound method OutputOptionsMixIn.process_state_verbose of <salt.cli.call.SaltCall object at 0x7469f14062c0>>
[TRACE   ] Processing <bound method NoParseMixin.process_no_parse of <salt.cli.call.SaltCall object at 0x7469f14062c0>>
[TRACE   ] Processing functools.partial(<function OutputOptionsMixIn._mixin_setup.<locals>.process at 0x7469f14c2710>, <CustomOption at 0x7469ef54f6a0: --out-indent/--output-indent>)
[TRACE   ] Processing functools.partial(<function OutputOptionsMixIn._mixin_setup.<locals>.process at 0x7469ef58c820>, <CustomOption at 0x7469ef54f5e0: --out-file-append/--output-file-append>)
[TRACE   ] Processing functools.partial(<function OutputOptionsMixIn._mixin_setup.<locals>.process at 0x7469f2163d90>, <CustomOption at 0x7469ef54fbe0: --no-color/--no-colour>)
[TRACE   ] Processing functools.partial(<function OutputOptionsMixIn._mixin_setup.<locals>.process at 0x7469ef58c8b0>, <CustomOption at 0x7469ef54f730: --force-color/--force-colour>)
[TRACE   ] Processing functools.partial(<function OutputOptionsMixIn._mixin_setup.<locals>.process at 0x7469ef58c940>, <CustomOption at 0x7469ef54f310: --state-output/--state_output>)
[TRACE   ] Processing <bound method SaltCallOptionParser.process_module_dirs of <salt.cli.call.SaltCall object at 0x7469f14062c0>>
[TRACE   ] Processing <bound method LogLevelMixIn.process_log_level of <salt.cli.call.SaltCall object at 0x7469f14062c0>>
[TRACE   ] Processing <bound method LogLevelMixIn.process_log_file of <salt.cli.call.SaltCall object at 0x7469f14062c0>>
[TRACE   ] Processing <bound method LogLevelMixIn.process_log_level_logfile of <salt.cli.call.SaltCall object at 0x7469f14062c0>>
[TRACE   ] Processing <function OutputOptionsMixIn._mixin_after_parsed at 0x7469efd8e7a0>
[TRACE   ] Processing <function SaltCallOptionParser._mixin_after_parsed at 0x7469efd8fbe0>
[TRACE   ] Processing <function LogLevelMixIn.__setup_logging_routines at 0x7469efd8d1b0>
[TRACE   ] Processing <function MergeConfigMixIn.__merge_config_with_cli at 0x7469efd8cc10>
[TRACE   ] Processing <function LogLevelMixIn.__setup_console_logger_config at 0x7469efd8d480>
[TRACE   ] Processing <function LogLevelMixIn.__setup_logfile_logger_config at 0x7469efd8d510>
[TRACE   ] Processing <function LogLevelMixIn.__setup_logging_config at 0x7469efd8d5a0>
[TRACE   ] Processing <function LogLevelMixIn.__verify_logging at 0x7469efd8d6c0>
[WARNING ] Insecure logging configuration detected! Sensitive data may be logged.
[TRACE   ] Processing <function LogLevelMixIn.__setup_logging at 0x7469efd8d630>
[TRACE   ] Setting up console logging: {'log_level': 'trace', 'log_format': '[%(levelname)-8s] %(message)s', 'date_format': '%H:%M:%S'}
[TRACE   ] Setting up log file logging: {'log_path': '/var/log/salt/minion', 'log_level': 'warning', 'log_format': '%(asctime)s,%(msecs)03d [%(name)-17s][%(module)s][%(levelname)-8s] %(message)s', 'date_format': '%Y-%m-%d %H:%M:%S', 'max_bytes': 0, 'backup_count': 0, 'user': 'root'}
[TRACE   ] The required configuration section, 'fluent_handler', was not found the in the configuration. Not loading the fluent logging handlers module.
[TRACE   ] None of the required configuration sections, 'logstash_udp_handler' and 'logstash_zmq_handler', were found in the configuration. Not loading the Logstash logging handlers module.
[TRACE   ] Error loading log_handlers.sentry_mod: Cannot find 'raven' python library, 'sentry_handler' config is empty or not defined
[DEBUG   ] Configuration file path: /etc/salt/minion
[DEBUG   ] Grains refresh requested. Refreshing grains.
[DEBUG   ] Missing configuration file: /etc/salt/minion
[DEBUG   ] Including configuration from '/etc/salt/minion.d/_schedule.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/_schedule.conf
[DEBUG   ] Including configuration from '/etc/salt/minion.d/beacons.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/beacons.conf
[DEBUG   ] Including configuration from '/etc/salt/minion.d/engine.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/engine.conf
[DEBUG   ] Including configuration from '/etc/salt/minion.d/f_defaults.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/f_defaults.conf
[DEBUG   ] Including configuration from '/etc/salt/minion.d/reactor.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/reactor.conf
[TRACE   ] Error loading grains.avstatus: emsisoft grain only available on Windows
[TRACE   ] Error loading grains.bitlocker_status: bitlocker_status grain only available on Windows
[TRACE   ] Error loading grains.emsisoft: emsisoft grain only available on Windows
[DEBUG   ] The functions from module 'public_ip' are being loaded by dir() on the loaded module
[DEBUG   ] The functions from module 'samba_status' are being loaded by dir() on the loaded module
[DEBUG   ] The functions from module 'syncthing' are being loaded by dir() on the loaded module
[DEBUG   ] The functions from module 'syncthing_router' are being loaded by dir() on the loaded module
[TRACE   ] Error loading grains.tv: tv grain only available on Windows
[DEBUG   ] The functions from module 'uptime' are being loaded by dir() on the loaded module
[TRACE   ] Error loading grains.win_computerinfo: 'computerinfo' grain only available on Windows
[TRACE   ] Error loading grains.win_dc: win_dc grain only available on Windows
[TRACE   ] Error loading grains.win_domain_connection: win_domain_connection grain only available on Windows
[TRACE   ] Error loading grains.win_pkgs: win_pkgs grain only available on Windows
[TRACE   ] Error loading grains.win_proc: win_procs grain only available on Windows
[TRACE   ] Error loading grains.win_screen_resolution: win_screen_resolution grain only available on Windows
[TRACE   ] Error loading grains.win_services: win_services grain only available on Windows
[TRACE   ] Error loading grains.windows_boot_mode: 'windows_boot_mode' grain only available on Windows
[DEBUG   ] The functions from module 'windows_reboot_reason' are being loaded by dir() on the loaded module
[TRACE   ] Error loading grains.windows_system_enclosure_types: 'windows_chassis_type' grain only available on Windows
[TRACE   ] Error loading grains.windows_uptime: 'windows_uptime' grain only available on Windows
[TRACE   ] Error loading grains.windows_users: 'windows_users' grain only available on Windows
[TRACE   ] Error loading grains.xdr: xdr grain only available on Windows
[TRACE   ] 'drill' could not be found in the following search path: '['/usr/local/sbin', '/usr/local/bin', '/usr/sbin', '/usr/bin', '/sbin', '/bin']'
[TRACE   ] 'drill' could not be found in the following search path: '['/usr/local/sbin', '/usr/local/bin', '/usr/sbin', '/usr/bin', '/sbin', '/bin']'
[DEBUG   ] The functions from module 'core' are being loaded by dir() on the loaded module
[DEBUG   ] The functions from module 'disks' are being loaded by dir() on the loaded module
[DEBUG   ] The functions from module 'extra' are being loaded by dir() on the loaded module
[DEBUG   ] The functions from module 'lvm' are being loaded by dir() on the loaded module
[DEBUG   ] The functions from module 'mdadm' are being loaded by dir() on the loaded module
[DEBUG   ] The functions from module 'minion_process' are being loaded by dir() on the loaded module
[TRACE   ] Error loading grains.napalm: "napalm" (/opt/saltstack/salt/lib/python3.10/site-packages/salt/grains/napalm.py) cannot be loaded: NAPALM is not installed: ``pip install napalm``
[TRACE   ] Error loading grains.nxos: No host specified and no UDS found at /tmp/nginx_local/nginx_1_be_nxapi.sock

[DEBUG   ] The functions from module 'opts' are being loaded by dir() on the loaded module
[DEBUG   ] The functions from module 'package' are being loaded by dir() on the loaded module
[TRACE   ] Error loading grains.pending_reboot: 'pending_reboot' grain only available on Windows
[DEBUG   ] Override  __utils__: <module 'salt.loaded.int.grains.zfs' from '/opt/saltstack/salt/lib/python3.10/site-packages/salt/grains/zfs.py'>
[DEBUG   ] The functions from module 'zfs' are being loaded by dir() on the loaded module
[TRACE   ] Loading core.append_domain grain
[TRACE   ] Loading core.cwd grain
[TRACE   ] Loading core.default_gateway grain
[TRACE   ] Loading core.dns grain
[TRACE   ] Loading core.fqdns grain
[TRACE   ] Loading core.get_machine_id grain
[TRACE   ] Loading core.get_master grain
[TRACE   ] Loading core.get_server_id grain
[TRACE   ] Loading core.hostname grain
[TRACE   ] Loading core.hwaddr_interfaces grain
[TRACE   ] Loading core.id_ grain
[TRACE   ] Loading core.ip4_interfaces grain
[TRACE   ] Loading core.ip6_interfaces grain
[TRACE   ] Loading core.ip_fqdn grain
[TRACE   ] Loading core.ip_interfaces grain
[TRACE   ] Loading core.kernelparams grain
[TRACE   ] Loading core.locale_info grain
[TRACE   ] Loading core.os_data grain
[TRACE   ] Adding systemd grains
[TRACE   ] Adding init grain
[TRACE   ] Getting lsb_release distro information
[TRACE   ] lsb_release python bindings not available
[TRACE   ] Attempting to parse /etc/lsb-release
[TRACE   ] Failed to parse /etc/lsb-release: [Errno 2] No such file or directory: '/etc/lsb-release'
[TRACE   ] Getting OS name, release, and codename from freedesktop_os_release
[TRACE   ] Failed to get lsb_distrib_id, trying to parse os-release
[TRACE   ] Getting OS name, release, and codename from distro id, version, codename
[TRACE   ] Loading core.path grain
[TRACE   ] Loading core.pythonexecutable grain
[TRACE   ] Loading core.pythonpath grain
[TRACE   ] Loading core.pythonversion grain
[TRACE   ] Loading core.saltpath grain
[TRACE   ] Loading core.saltversion grain
[TRACE   ] Loading core.saltversioninfo grain
[TRACE   ] Loading core.zmqversion grain
[TRACE   ] Loading public_ip.get_public_ip grain
[DEBUG   ] Starting new HTTPS connection (1): ip.mymaster.tld:443
[DEBUG   ] https://ip.mymaster.tld:443 "GET / HTTP/1.1" 200 179
[DEBUG   ] Starting new HTTPS connection (1): portal.mymaster.tld:443
[DEBUG   ] https://portal.mymaster.tld:443 "POST /salt/api/v1/minion/ip/ HTTP/1.1" 200 28
[TRACE   ] Loading samba_status.samba_status grain
[TRACE   ] Loading syncthing.syncthing_device_id grain
[TRACE   ] Loading syncthing_router.syncthing_device_id grain
[TRACE   ] Loading uptime.uptime grain
[TRACE   ] Loading windows_reboot_reason.windows_reboot_reason grain
[TRACE   ] Loading disks.disks grain
[TRACE   ] Device sdy reports itself as an HDD
[TRACE   ] Device sdf reports itself as an SSD
[TRACE   ] Device sdo reports itself as an HDD
[TRACE   ] Device sdw reports itself as an HDD
[TRACE   ] Device sdd reports itself as an HDD
[TRACE   ] Device sdm reports itself as an HDD
[TRACE   ] Device sdu reports itself as an SSD
[TRACE   ] Device sdb reports itself as an SSD
[TRACE   ] Device sdk reports itself as an HDD
[TRACE   ] Device sds reports itself as an HDD
[TRACE   ] Device sdi reports itself as an SSD
[TRACE   ] Device sdq reports itself as an HDD
[TRACE   ] Device sdz reports itself as an HDD
[TRACE   ] Device sdg reports itself as an HDD
[TRACE   ] Device sdx reports itself as an HDD
[TRACE   ] Device sde reports itself as an HDD
[TRACE   ] Device sdaa reports itself as an HDD
[TRACE   ] Device sdn reports itself as an HDD
[TRACE   ] Device sdv reports itself as an HDD
[TRACE   ] Device sdc reports itself as an SSD
[TRACE   ] Device sdl reports itself as an HDD
[TRACE   ] Device sdt reports itself as an HDD
[TRACE   ] Device sda reports itself as an SSD
[TRACE   ] Device sdj reports itself as an HDD
[TRACE   ] Device sdr reports itself as an SSD
[TRACE   ] Device sdh reports itself as an HDD
[TRACE   ] Device sdp reports itself as an HDD
[TRACE   ] Loading extra.config grain
[TRACE   ] Loading extra.shell grain
[TRACE   ] Loading extra.transactional grain
[TRACE   ] 'transactional-update' could not be found in the following search path: '['/usr/local/sbin', '/usr/local/bin', '/usr/sbin', '/usr/bin', '/sbin', '/bin']'
[TRACE   ] Loading extra.uefi grain
[TRACE   ] Loading lvm.lvm grain
[TRACE   ] Loading mdadm.mdadm grain
[TRACE   ] Loading minion_process.grains grain
[TRACE   ] Loading opts.opts grain
[TRACE   ] Loading package.package grain
[TRACE   ] Loading zfs.zfs grain
[DEBUG   ] The functions from module 'zfs' are being loaded by dir() on the loaded module
[DEBUG   ] LazyLoaded zfs.is_supported
[DEBUG   ] Using selector: EpollSelector
[DEBUG   ] Connecting to master. Attempt 1 of 1
[DEBUG   ] "salt.mymaster.tld" Not an IP address? Assuming it is a hostname.
[DEBUG   ] Master URI: tcp://208.70.52.16:4506
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'my-minion.customer.tld', 'tcp://208.70.52.16:4506')
[DEBUG   ] Generated random reconnect delay between '4312ms' and '19312ms' (4312)
[DEBUG   ] Setting zmq_reconnect_ivl to '4312ms'
[DEBUG   ] Setting zmq_reconnect_ivl_max to '19312ms'
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key
[DEBUG   ] salt.crypt._get_key_with_evict: Loading private key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] salt.crypt.get_rsa_pub_key: Loading public key
[TRACE   ] ReqChannel send clear load={'cmd': '_auth', 'id': 'my-minion.customer.tld', 'nonce': '877459b49f0e41de934588da7da54c85', 'token': b'\xb4\xf0\xea\x89\xdfNw\xa64\x15\xfeD0$)v\xd6\x86\'\x8c\xd2\xc2\xaa\x0c\x82jn\xebF\x97\xdb"?\x8c\x8e\x8b\xbb\\\xab\xeaI\xc2\x91?\xe8-F\x8c\x81\x041I\xf35\xf6\xd5\xcd\xb9\x105\xf9\x14\x9eW\xa8\xc3hU\xd5\x14\xe9\xe3\xa0-\x8d\r{D\x19\x131v\x8d\x01#\x80\xdaw\xcc\x05\x18\xf5\xae\xf3\xf3tsv\x82\x03 \\\xea\xc6\xff\xf4\x14\x9e\x140\x85`\x8c\x95M\xca\xc8\x01c\xabU\xc4\xa5\x95s\xaf\x99\xba\x82\xe6.\xc3^GHJ\x9a\xd7\xc3\xe7\x85\xbc\xeb\xfa\xce\x85\xdc\xc92\xde:\xa4\x8a5\x1d\xfa/!\xf5\xadO\xb6\x00#\xe2\xc7m\x94\x90\xc3\xd6R[\xc9\xa9\xdej\xc2\x112|\x97\x82\x98#\xb1{\xf7\xa1\xe8]}\xf6\xcf4\xc94C_\\`C\xec\x8a^\xb0E\xb4\xbeZ\xc5\xaf\xbbfl\xbe\xe6\xf1\xc6\x1c\xed\\\xbeM\x1aH\xaf\xda|\xaaD\xe3\x00\xf7\x8b\xbc5\x1c\x8c\r\xee\x7fT\xd6\xfa6\x92\xc0\xd8Jd\xd6Fa[\xf9', 'pub': '-----BEGIN PUBLIC KEY-----\nMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAjumArX5jmi0/L2slUER/\nGX7JU4XPAj0ZiVwpNBg3GrhbCSFtiCRhHVmDcxxxEhpouH9BUELx8S16QiWc3hOT\n8k4PIdBDsck8+3uaRwUx0xG90SHnq6SVS0042+kZwhA8D6cQG10PTo0Xj2kK4Kls\nzfbhw1BdVa+yuMaPvIlw/UGQlb6eiuuoQlR9L4vN/IBTEu1uiCz8CDH+dOV6aIef\nn3P81d6SZRBUVrr4HJQDJIKLvrBvYI6oyyjpe/t66RlHDnmk5ZEqaCqFM/hs6AS4\nICdM6tSxnFAW/j8m2O6/i5Cy84ZI/6h4DhyPKz/YzxZXNpz+SseRxtV2to+/xjJK\ntQIDAQAB\n-----END PUBLIC KEY-----'}
[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG   ] <salt.crypt.AsyncAuth object at 0x7469eef44e80> Got new master aes key.
[DEBUG   ] Closing AsyncReqChannel instance
[DEBUG   ] Connecting the Minion to the Master publish port, using the URI: tcp://208.70.52.16:4505
[DEBUG   ] Using selector: EpollSelector
[DEBUG   ] Using selector: EpollSelector
[DEBUG   ] Publisher connecting to /var/run/salt/minion/minion_event_08c88717e0_pull.ipc
[DEBUG   ] Unable to connect pusher: Stream is closed
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Determining pillar cache
[DEBUG   ] Using selector: EpollSelector
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'my-minion.customer.tld', 'tcp://208.70.52.16:4506')
[TRACE   ] ext_pillar_extra_data = {}
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] salt.crypt.verify_signature: Loading public key
[DEBUG   ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG   ] salt.crypt.verify_signature: Verifying signature
[DEBUG   ] Closing AsyncReqChannel instance
[DEBUG   ] The functions from module 'jinja' are being loaded by dir() on the loaded module
[DEBUG   ] LazyLoaded jinja.render
[DEBUG   ] The functions from module 'yaml' are being loaded by dir() on the loaded module
[DEBUG   ] LazyLoaded yaml.render
[DEBUG   ] The functions from module 'saltutil' are being loaded by dir() on the loaded module
[DEBUG   ] LazyLoaded saltutil.refresh_pillar
[DEBUG   ] The functions from module 'direct_call' are being loaded by dir() on the loaded module
[DEBUG   ] LazyLoaded direct_call.execute
[DEBUG   ] The functions from module 'event' are being loaded by dir() on the loaded module
[DEBUG   ] LazyLoaded event.fire
[DEBUG   ] Using selector: EpollSelector
[DEBUG   ] Using selector: EpollSelector
[DEBUG   ] Publisher connecting to /var/run/salt/minion/minion_event_08c88717e0_pull.ipc
[DEBUG   ] Unable to connect pusher: Stream is closed
[DEBUG   ] Using selector: EpollSelector
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'my-minion.customer.tld', 'tcp://208.70.52.16:4506')
[TRACE   ] ReqChannel send crypt load={'cmd': '_return', 'id': 'my-minion.customer.tld', 'jid': 'req', 'return': False, 'retcode': 0, 'fun': 'saltutil.refresh_pillar', 'fun_args': []}
[DEBUG   ] Closing AsyncReqChannel instance
[DEBUG   ] The functions from module 'nested' are being loaded by dir() on the loaded module
[DEBUG   ] LazyLoaded nested.output
[TRACE   ] data = {'local': False}
local:
    False
root@my-minion:~# 

@darkpixel
Contributor

If your logs are being flooded with authentication requests, see bug #67071

The troubleshooting docs say "seconds" and "options can be set on the master"...but those are definitely minion settings and the values appear to be in milliseconds.
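
Assuming the settings in question are the ZeroMQ reconnect options, a sketch of how they would be set; these are minion-side and their values are milliseconds, not seconds:

# /etc/salt/minion (sketch; values in milliseconds)
recon_default: 1000      # initial delay before reconnecting to the master
recon_max: 60000         # upper bound for the randomized reconnect delay
recon_randomize: True    # pick a random delay between recon_default and recon_max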

@darkpixel
Contributor

After making a bunch of architectural changes, I found one way of hosing Salt.

If you use an external pillar provider (e.g., curl some JSON) and you either set it to delay responses by ~5,000 ms, or have it return bad data (404, garbage HTML instead of JSON, TCP connection reset, etc.), or a combination of both, you can get the master and minions into a hung state where the only way to bring them back (even after removing the delays and bad data) is to restart them.

You'll get authentication timeouts, file client timeouts, event bus timeouts, etc. I'm not sure if it's ZMQ that can't handle any sort of delay or hiccup in network traffic or what...but it devastates my infrastructure if I induce it.
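
For anyone trying to reproduce that failure mode, the shape of such a setup is just an ext_pillar entry on the master pointing at an HTTP endpoint. A sketch using the stock http_json pillar module; the URL is hypothetical, not the provider used above:

# /etc/salt/master (sketch)
ext_pillar:
  - http_json:
      url: https://pillar.example.com/minion-data    # hypothetical endpoint; slow or bad responses here stall pillar rendering
      with_grains: True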

@darkpixel
Contributor

When the master gets into this "weird" state, I notice most or all of the "MWorker" processes sitting at 100% CPU usage.
Authentication times out pretty much everywhere. Some fortunate minions (running salt-call) are able to authenticate, but they then time out with file client issues or time out returning to the master.

I'm assuming the MWorker processes are controlled by the worker_threads config knob.
I've tried doing a ton of playing with that setting, everything from 4 threads up to 64 (the system has 64 cores and 64 GB of RAM). Nothing seems to appease the MWorker gods. After a few minutes, hours, or a day they hit 100% CPU usage and Salt grinds to a halt until I restart the master. And sometimes the minions need a restart too, because they get locked into a zombie state.
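
One way to see what a pegged MWorker is actually doing is to dump its Python thread stacks. A sketch assuming py-spy, a third-party tool that is not part of Salt:

pip install py-spy                  # any Python on the box will do
top -b -n 1 | grep -i salt          # find the PID of a worker stuck at 100% CPU
py-spy dump --pid <worker-pid>      # prints the thread stacks of that process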

@darkpixel
Contributor

Another interesting note: when things are adjusted semi-properly and generally working well, the master receives floods of traffic in bursts. The master will receive about 1 gigabit of traffic for ten seconds, then it tapers off to nothing for 10 seconds, then another huge burst. It does this even with simple jobs (like state.sls salt.minion or test.ping) when targeting the ~500 minions. I can understand salt.minion pulling a few files from the master to generate the minion configs, but test.ping shouldn't really be pulling anything.

Maybe it's the pillar data...but after testing the external pillar provider, the combination of grains uploaded and pillar data downloaded averages 0.7 MB per minion. Multiply that by ~500 minions and you're talking about 350 MB of data if grains were uploaded and pillar was downloaded.

The returns from state.sls salt.minion and test.ping shouldn't be very big. Not 1 gigabit for 10 seconds, then almost nothing for 10 seconds, repeated for 10 minutes or however long it takes all the jobs to complete.

[screenshot from 2024-12-11 showing the bursty traffic pattern]

When I get it the most stable I possibly can and kick off a job, the Minion running on the Master throws a ton of these messages into the log while the job is running...even though the master's minion isn't the target of the job:

2024-12-11 10:21:13,223 [salt.payload     ][payload][CRITICAL] Could not deserialize msgpack message. This often happens when trying to read a file not in binary mode. To see message payload, enable debug logging and retry. Exception: unpack(b) received extra data.
2024-12-11 10:21:13,223 [salt.transport.zeromq][zeromq][ERROR   ] Exception while running callback
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/payload.py", line 92, in loads
    ret = salt.utils.msgpack.unpackb(msg, **loads_kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/msgpack.py", line 137, in unpackb
    return msgpack.unpackb(packed, **_sanitize_msgpack_unpack_kwargs(kwargs))
  File "msgpack/_unpacker.pyx", line 201, in msgpack._cmsgpack.unpackb
msgpack.exceptions.ExtraData: unpack(b) received extra data.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 394, in consume
    await callback(msg)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 481, in wrap_callback
    payload = self.transport._decode_messages(messages)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 323, in _decode_messages
    payload = salt.payload.loads(messages)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/payload.py", line 109, in loads
    raise SaltDeserializationError(exc_msg) from exc
salt.exceptions.SaltDeserializationError: Could not deserialize msgpack message. See log for more info.
2024-12-11 10:21:13,224 [salt.crypt       ][crypt][CRITICAL] A message with an invalid serial was received.
this serial: 75
last serial: 75
The minion will not honor this request.

@ldav1s
Contributor

ldav1s commented Jan 24, 2025

I've been experiencing this issue as well, at least with what is in the initial report. The salt master is running 3006.9 (onedir, Ubuntu 20.04, installed through apt, with additions for gitfs) and manages minions also on 3006.9 (onedir, Ubuntu 20.04, installed through apt). The minions themselves are a mix of on-prem hosts (2 datacenters; salt connects through a public interface) and VMs (same 2 datacenters as on-prem; salt connects through a private interface).

My observations on my setup:

  • The command salt '*' test.ping will fail fairly reliably with Minion did not return on a majority of the hosts, but only among the on-prem hosts in one datacenter. All the other minions will return fairly reliably.
  • Restarting the salt minion on the non-responding minions will make the non-responding minions return reliably for a while from salt '*' test.ping.
  • Didn't attempt salt-call test.ping from the non-responding minion.
  • Observed the timeout error in the minion logs on the non-responding minions

After reading the original bug (and restarting minions on the affected on-prem hosts), and observing that it would work for a while, I formulated the following idea: have a job on the salt master invoke salt '<on-prem-hosts-that-stop-responding>' test.ping every minute. I then proceeded to restart the salt minions on the affected on-prem hosts. I'm now at 20 hours after the restart. All the previously non-responding minions still respond. In other words, salt '*' test.ping completes without Minion did not return errors.
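
A minimal sketch of that keep-alive job as a cron entry on the master; the target list is hypothetical:

# /etc/cron.d/salt-keepalive (sketch)
* * * * * root salt -L 'onprem-host1,onprem-host2' test.ping --async >/dev/null 2>&1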

What this seems to suggest is that there could be an issue with the underlying dynamic IP routing setup and not anything to do with Salt. It's possible that an intermediate router is throwing away the route from the minion to the master on a timeout, and not restoring it until the salt minion gets restarted. The route from the master to the minion seems unaffected. There are lots of comments along the lines of "they are listening to the master" in this thread.
