[BUG] intermittent connection between master and minion #65265
Comments
Hi there! Welcome to the Salt Community! Thank you for making your first contribution. We have a lengthy process for issues and PRs. Someone from the Core Team will follow up as soon as possible. In the meantime, here’s some information that may help as you continue your Salt journey.
There are lots of ways to get involved in our community. Every month, there are around a dozen opportunities to meet with other contributors and the Salt Core team and collaborate in real time. The best way to keep track is by subscribing to the Salt Community Events Calendar.
Hi - I am seeing this same issue.
It seems worse with 3006.5 with Linux as the master when managing Windows minions.
This is affecting many of our endpoints. I can get them to re-establish communication by restarting the minion or the master, but they lose communication again.
Restarting the salt master seems to fix the issue for all minions for a while, but the issue will return after about 12 hours on a different, seemingly random selection of minions.
I seem to have a very similar issue with 3006.x, but in my case restarting the master does not have any effect and only a minion restart resolves the issue. Another oddity is that I can see in the minion logs that the minion is still receiving commands from the master and is able to execute them just fine, but the master seemingly never receives the response data. If I issue a … I don't recall having this issue on 3005.x, but I have not downgraded that far yet; so far both 3006.5 and 3006.4 minions have the problem for me. I'll try to run a tcpdump if I have time.
I am encountering similar issues. Everything is 3006.5. I've spent two days thinking I broke something in some recent changes I made, but I've found that the minions' jobs are succeeding; they just time out trying to communicate back to the master. I'm thinking this may be related to concurrency + load. I use this for test environment automation, and during tests I have concurrent jobs fired off by the scheduler for test data collections. That is where the issues start to show up in the logs. When this happens, the minions seem to try to re-send the data, which just compounds the problem. The logs on the master show that it is getting the messages, because it is flagging duplicate messages, but something seems to be getting lost while processing the return data. The traces all look the same and seem to indicate something is getting dropped in concurrency-related code:
2024-01-29 15:22:57,215 [salt.master :1924][ERROR ][115353] Error in function minion_pub:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 1910, in pub
payload = channel.send(payload_kwargs, timeout=timeout)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/asynchronous.py", line 125, in wrap
raise exc_info[1].with_traceback(exc_info[2])
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/asynchronous.py", line 131, in _target
result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
return future_cell[0].result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 338, in send
ret = yield self._uncrypted_transfer(load, timeout=timeout)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 309, in _uncrypted_transfer
ret = yield self.transport.send(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 909, in send
ret = yield self.message_client.send(load, timeout=timeout)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 589, in send
recv = yield future
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 387, in run_job
pub_data = self.pub(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 1913, in pub
raise SaltReqTimeoutError(
salt.exceptions.SaltReqTimeoutError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1918, in run_func
ret = getattr(self, func)(load)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1839, in minion_pub
return self.masterapi.minion_pub(clear_load)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/daemons/masterapi.py", line 952, in minion_pub
ret["jid"] = self.local.cmd_async(**pub_load)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 494, in cmd_async
pub_data = self.run_job(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 409, in run_job
raise SaltClientError(general_exception)
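For context on the trace above: the failure is a SaltReqTimeoutError on the request channel, followed by minions retrying their returns. The timeouts in play come from stock master and minion settings; a minimal sketch with illustrative values only (option names are the standard ones, the values are not a recommendation from this thread):

    # /etc/salt/master
    timeout: 30                 # seconds the CLI/LocalClient waits on the request channel
    gather_job_timeout: 15      # seconds to wait when polling minions for running jobs

    # /etc/salt/minion
    return_retry_timer: 5       # base delay (seconds) before a minion retries a failed return
    return_retry_timer_max: 10  # upper bound for the randomized retry delay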
I just discovered something. At any random time I might have 25-50 minions that don't appear to respond to jobs. They may or may not respond to …, but they ARE actually listening to the master. So my workflow is, stupidly:
@darkpixel Yes, I have found the same thing and have the same workflow. Something just gets stuck and responses get lost somewhere. They are always receiving events, however, as you say. In my experience …
This seems to still be an issue on 3006.7 when both minion and master are the same version |
3007.0 is...worse? Woke up to all ~600 minions in an environment being offline.
The log showed returns from every minion, but the master spit out … Restarted the salt-master service, got distracted for ~15 minutes, ran another … Used Cluster SSH to connect in to every machine I can reach across the internet and restarted the salt-minion service, and I'm down to a mix of ~60 (Windows, Linux, and BSD) that don't respond and that I can't reach. Maybe 10 of them are 3006.7. I'd love to test/switch to a different transport like websockets that would probably be more stable, but it appears to be "all or nothing": if I switch to websockets on the master, it looks like every minion will disconnect unless I also update them to use websockets...and if I update them to use websockets and something breaks, I'm going to have to spend the next month trying to get access to hosts to fix salt-minion.
It just happened on my master, which is 3007.0... I was running a highstate on a minion that involves certificate signing, and it refused to generate the certificate with no error messages in the salt master log. I tried restarting the salt master; no dice. About 10 minutes later I decided to restart the salt master's own minion...and suddenly certificate signing worked. The minion on the master wasn't communicating with the master...locally...on the same box...
Try some zmq tuning; I did it on my 3006.4 (latest really stable version):
Where do I need to add this, @gregorg? In the salt master config? And how do we need to add it?
Add this in …
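Neither the actual values nor the file path survived in the copy above. What follows is a minimal sketch of the kind of ZeroMQ-related tuning being discussed, assuming it goes in the master config (typically /etc/salt/master or a drop-in under /etc/salt/master.d/). The option names are standard master settings; the values are illustrative, not the ones @gregorg posted:

    # /etc/salt/master.d/zmq-tuning.conf -- illustrative values; restart salt-master after editing
    worker_threads: 16    # more MWorker processes to service auth requests and returns
    zmq_backlog: 2000     # connection backlog on the ZeroMQ request server
    pub_hwm: 10000        # high water mark on the publish socket before messages are dropped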
We upgraded salt to 3006.4 on the master and 20 minions, out of which 10 minions are not upgraded.
This is not a support ticket; look at the salt master logs.
I tried those settings @gregorg. It's been intermittent for the last three days...and this morning 100% of my minions are offline (even the local one on the salt master). If I connect to a box with a minion, the service is running, and I can totally run state.highstate locally and everything works properly. Restarting the master brings everything online. There's nothing that appears unusual in the master log. I can even see minions reporting their results if I do something like … I'd love to switch to a potentially more reliable transport, but it looks like Salt can only have one transport active at a time...so if I enable something like websockets, it looks like all my minions will be knocked offline until I reconfigure them.
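For what it's worth, the transport is a single setting that has to agree on both sides, which is exactly why a staged switch is painful. A hedged sketch only: zeromq and tcp are the long-standing values, and the websocket transport added in 3007 is assumed here to register under the name ws (verify against the 3007 release notes before relying on it):

    # must match in /etc/salt/master and /etc/salt/minion
    transport: ws    # default: zeromq; tcp is the other established option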
I just noticed an interesting log entry on the master. A bunch of my minions weren't talking again, even though the log had a ton of lines of "Got return from..." So I restarted salt-master and noticed this in the log:
Specifically this:
Maybe something's hanging the MWorkerQueue? |
Any improvement with these settings?
I didn't use those exact settings because my master is smaller and has fewer minions. It's no longer dropping all the minions every few hours...it's more like once or twice a week.
Also, I'm not sure if this is related or not, but it seems to be in the same vein: communication between the minions and the master is pretty unreliable.
Checking back in here: I think this actually resolved itself for me once I got all my minions to 3007.0. I've removed all the restart cron jobs and the minions appear to have been stable for days now. Is anyone else still having issues with 3007.0 minions?
Yeah... this may still be an issue for me as well. I'm not sure yet. I noticed some odd things last night in testing, but it could be unrelated. I definitely don't have the …
Hi all, I have the same problem on 3007.0.
Related to issue #66562.
Everyone, at least one user who could not upgrade to 3007.x fixed their environment with this patch:
Can anyone in this thread who is willing try out 3007.1 with this patch and report your findings? It would be a huge help.
@dwoz should it be applied on both master and minions in the environment, or is one or the other enough?
Just applying it to the master is what worked for us.
I just applied it to my 3007.1 master and restarted. It definitely didn't fix minions hanging on … As for drops/disconnects, if everything is still connected after 24 hours, that would be a pretty good sign for me that it's fixed.
I'd say that fixes it, @dwoz.
It's been 3 days. All my minions are connected! Hallelujah!
Will this be backported to the stable 3006.x release?
@dwoz This does not seem to apply to 3006.x, but your remark that it was from someone unable to upgrade to 3007.x suggests it should?
I have the same problem on 3006.9 and am curious about the patch, but from what I understand the patch is for the 3007.x version.
This patch worked great on a salt master on 3007.1 with a few thousand minions 👍
Great to read the fix here!
Hi @dwoz, any indication of a date for a fix for this?
It's been going on for a while. SaltStack was bought by VMware, restructured, then bought out by Broadcom, and Broadcom doesn't seem to care about SaltStack. All of this happened in the middle of a big transition in the code aimed at making things more "community maintainable", so things are in disarray and the future seems uncertain. Hopefully the dust will settle soon and things will improve.
There's more to this connectivity issue than the patch @dwoz pointed out. I've been playing around extensively with the Linux and BSD minions today. If I use clusterssh to get an SSH session open to all of them and I run … I ran the command a handful of times on the ones that returned … Then I ran …
We will get things back on track. We were ready to do a release at the end of October, but our CI/CD infrastructure went away with some service consolidation. We are currently working to get the CI/CD pipelines going again and this effort is coming along. Hopefully it won't be too much longer before we can start cutting releases again; likely a few more weeks before we are there.
Can you please provide debug or trace log output for a failing minion?
I need to amend my previous report. Restarting the salt master/minion does NOT fix it. I think this snippet from the logs is the culprit:
I'm not well-versed on transports in salt, but I think the local … The affected minions span both Linux and FreeBSD. All the Linux minions are 3007.1, with the exception of one 3006.8. Full minion log here:
If your logs are being flooded with authentication requests, see bug #67071. The troubleshooting docs say "seconds" and "options can be set on the master"...but those are definitely minion settings, and the values appear to be in milliseconds.
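The settings that comment appears to be describing are the ZeroMQ reconnect back-off options from the troubleshooting guide. Assuming that's the case, they belong in the minion config and are expressed in milliseconds, which matches the observation above. A minimal sketch with the documented defaults:

    # /etc/salt/minion
    recon_default: 1000      # initial wait before reconnecting to the master (ms)
    recon_max: 10000         # cap on the reconnect back-off (ms)
    recon_randomize: True    # spread reconnects out between recon_default and recon_max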
After making a bunch of architectural changes, I found one way of hosing Salt. If you use an external pillar provider (i.e. curl some JSON) and you either set it to delay responses by ~5,000 ms or return bad data (404, garbage HTML instead of JSON, TCP connection reset, etc.), or a combination of both, you can get the master and minions into a hung state where the only way to bring them back (even after removing the delays and bad data) is to restart them. You'll get authentication timeouts, file client timeouts, event bus timeouts, etc. Not sure if it's ZMQ that can't tolerate any sort of delay or hiccup in network traffic or what...but it devastates my infrastructure if I induce it.
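The comment doesn't say how the external pillar is wired in. For reference, an HTTP/JSON external pillar is typically configured on the master roughly like this (a sketch only, with a made-up URL); a slow or erroring endpoint at this point stalls every pillar compile, which is consistent with the hang described above:

    # /etc/salt/master -- hypothetical endpoint for illustration
    ext_pillar:
      - http_json:
          url: https://pillar.example.internal/api/pillar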
When the master gets into this "weird" state, I notice most or all of the "MWorker" processes sitting at 100% CPU usage. I'm assuming the MWorker processes are controlled by the …
Another interesting note...when things are adjusted semi-properly and things are generally working well, the master receives floods of traffic in bursts. The master will receive about 1 gigabit of traffic for ten seconds, then it peters off to nothing for 10 seconds, then another huge burst. It does this even for simple jobs (like …). Maybe it's the pillar data...but after testing the external pillar provider, the combination of grains uploaded and pillar data downloaded averages 0.7 MB per minion; multiply that by ~500 minions and you're talking about 350 MB of data if grains were uploaded and pillar was downloaded. The returns from … When I get it the most stable I possibly can and kick off a job, the minion running on the master throws a ton of these messages into the log while the job is running...even though the master's minion isn't the target of the job:
I've been experiencing this issue as well, at least with what is in the initial report. The salt master is running 3006.9 (onedir, Ubuntu 20.04, installed through apt, with additions for gitfs) and manages minions also on 3006.9 (onedir, Ubuntu 20.04, installed through apt). The minions themselves are a mix of on-prem hosts (2 datacenters; salt connects through a public interface) and VMs (same 2 datacenters as on-prem; salt connects through a private interface). My observations on my setup:
After reading the original bug (and restarting minions on the affected on-prem hosts), and observing that it would work for a while, I formulated the following idea: have a job on the salt master invoke … What this seems to suggest is that there could be an issue with the underlying dynamic IP routing setup and not anything to do with salt. It's possible that an intermediate router is throwing away the route from the minion to the master on a timeout, and not restoring it until the salt minion gets restarted. The route from the master to the minion seems unaffected. There are lots of comments on this thread along the lines of "they are listening to the master", etc.
Description
I am seeing a weird connection issue in my salt setup. There are ~30 minions registered with the master. For a few of them, the master couldn't connect to them anymore after a while.
salt '*' test.ping failed with the following error message: …
Here are a few observations:
salt-call test.ping works fine on the minion side. Other commands like salt-call state.apply also work fine. This indicates minion-to-master communication is fine, but master-to-minion communication is not.
Setup
Minions were installed with sudo ./bootstrap-salt.sh -A <master-ip-address> -i $(hostname) stable 3006.3; no custom config on the minion. The master is saltstack/salt:3006.3. Master configs: …
State file: …
Steps to Reproduce the behavior
Expected behavior
Screenshots
Versions Report
salt --versions-report
Additional context