[BUG] intermittent connection between master and minion #65265
Comments
Hi there! Welcome to the Salt Community! Thank you for making your first contribution. We have a lengthy process for issues and PRs. Someone from the Core Team will follow up as soon as possible. In the meantime, here’s some information that may help as you continue your Salt journey.
There are lots of ways to get involved in our community. Every month, there are around a dozen opportunities to meet with other contributors and the Salt Core team and collaborate in real time. The best way to keep track is by subscribing to the Salt Community Events Calendar.
Hi - I am seeing this same issue.
It seems worse with 3006.5 with Linux as the master when managing Windows minions.
This is affecting many of our endpoints. I can get them to re-establish communication by restarting the minion or the master, but they lose communication again.
Restarting the salt master seems to fix the issue for all minions for a while, but the issue will return after about 12 hours on a different, seemingly random selection of minions.
I seem to have a very similar issue with 3006.x, but in my case restarting the master does not have any effect and only a minion restart resolves the issue. Another oddity is that I can see in the minion logs that the minion is still receiving commands from the master and is able to execute them just fine, but the master seemingly never receives the response data. If I issue a … I don't recall having this issue on 3005.x, but I have not downgraded that far yet; so far both 3006.5 and 3006.4 minions have the problem for me. I'll try to run a tcpdump if I have time.
I am encountering similar issues. Everything is 3006.5. I've spent two days thinking I broke something in some recent changes I made, but I've found that the minions' jobs are succeeding; they just time out trying to communicate back to the master. I'm thinking this may be related to concurrency + load. I use this for test environment automation, and during tests I have concurrent jobs fired off by the scheduler for test data collections. That is where the issues start to show up in the logs. When this happens, the minions seem to try to re-send the data, which just compounds the problem. The logs on the master show that it is getting the messages, because it is flagging duplicate messages, but something seems to be getting lost while processing the return data. The traces all look the same and seem to indicate something is getting dropped in concurrency-related code:
2024-01-29 15:22:57,215 [salt.master :1924][ERROR ][115353] Error in function minion_pub:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 1910, in pub
payload = channel.send(payload_kwargs, timeout=timeout)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/asynchronous.py", line 125, in wrap
raise exc_info[1].with_traceback(exc_info[2])
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/asynchronous.py", line 131, in _target
result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
return future_cell[0].result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 338, in send
ret = yield self._uncrypted_transfer(load, timeout=timeout)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 309, in _uncrypted_transfer
ret = yield self.transport.send(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 909, in send
ret = yield self.message_client.send(load, timeout=timeout)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 589, in send
recv = yield future
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 387, in run_job
pub_data = self.pub(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 1913, in pub
raise SaltReqTimeoutError(
salt.exceptions.SaltReqTimeoutError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1918, in run_func
ret = getattr(self, func)(load)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1839, in minion_pub
return self.masterapi.minion_pub(clear_load)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/daemons/masterapi.py", line 952, in minion_pub
ret["jid"] = self.local.cmd_async(**pub_load)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 494, in cmd_async
pub_data = self.run_job(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 409, in run_job
raise SaltClientError(general_exception)
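For context on the trace above: the failure is a SaltReqTimeoutError on the request channel, followed by minions retrying their returns. The timeouts in play come from stock master and minion settings; a minimal sketch with illustrative values only (option names are the standard ones, the values are not a recommendation from this thread):

    # /etc/salt/master
    timeout: 30                 # seconds the CLI/LocalClient waits on the request channel
    gather_job_timeout: 15      # seconds to wait when polling minions for running jobs

    # /etc/salt/minion
    return_retry_timer: 5       # base delay (seconds) before a minion retries a failed return
    return_retry_timer_max: 10  # upper bound for the randomized retry delay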
I just discovered something. At any random time I might have 25-50 minions that don't appear to respond to jobs. They may or may not respond to …, but they ARE actually listening to the master. So my workflow is, stupidly:
@darkpixel Yes, I have found the same thing and have the same workflow. Something just gets stuck and responses get lost somewhere. They are always receiving events, however, as you say. In my experience …
This seems to still be an issue on 3006.7 when both minion and master are the same version |
3007.0 is...worse? Woke up to all ~600 minions in an environment being offline.
The log showed returns from every minion, but the master spit out … Restarted the salt-master service, got distracted for ~15 minutes, ran another … Used Cluster SSH to connect in to every machine I can reach across the internet and restarted the salt-minion service, and I'm down to a mix of ~60 (Windows, Linux, and BSD) that don't respond and that I can't reach. Maybe 10 of them are 3006.7. I'd love to test/switch to a different transport like websockets that would probably be more stable, but it appears to be "all or nothing": if I switch to websockets on the master, it looks like every minion will disconnect unless I also update them to use websockets...and if I update them to use websockets and something breaks, I'm going to have to spend the next month trying to get access to hosts to fix salt-minion.
It just happened on my master, which is 3007.0... I was running a highstate on a minion that involves certificate signing, and it refused to generate the certificate with no error messages in the salt master log. I tried restarting the salt master; no dice. About 10 minutes later I decided to restart the salt master's own minion...and suddenly certificate signing worked. The minion on the master wasn't communicating with the master...locally...on the same box...
Try some zmq tuning; I did it on my 3006.4 (latest really stable version):
Where do I need to add this, @gregorg? In the salt master config? And how do we need to add it?
Add this in …
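Neither the actual values nor the file path survived in the copy above. What follows is a minimal sketch of the kind of ZeroMQ-related tuning being discussed, assuming it goes in the master config (typically /etc/salt/master or a drop-in under /etc/salt/master.d/). The option names are standard master settings; the values are illustrative, not the ones @gregorg posted:

    # /etc/salt/master.d/zmq-tuning.conf -- illustrative values; restart salt-master after editing
    worker_threads: 16    # more MWorker processes to service auth requests and returns
    zmq_backlog: 2000     # connection backlog on the ZeroMQ request server
    pub_hwm: 10000        # high water mark on the publish socket before messages are dropped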
We upgraded salt to 3006.4 on the master and 20 minions, out of which 10 minions are not upgraded.
This is not a support ticket; look at the salt master logs.
I tried those settings @gregorg. It's been intermittent for the last three days...and this morning 100% of my minions are offline (even the local one on the salt master). If I connect to a box with a minion, the service is running, and I can totally run state.highstate locally and everything works properly. Restarting the master brings everything online. There's nothing that appears unusual in the master log. I can even see minions reporting their results if I do something like … I'd love to switch to a potentially more reliable transport, but it looks like Salt can only have one transport active at a time...so if I enable something like websockets, it looks like all my minions will be knocked offline until I reconfigure them.
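For what it's worth, the transport is a single setting that has to agree on both sides, which is exactly why a staged switch is painful. A hedged sketch only: zeromq and tcp are the long-standing values, and the websocket transport added in 3007 is assumed here to register under the name ws (verify against the 3007 release notes before relying on it):

    # must match in /etc/salt/master and /etc/salt/minion
    transport: ws    # default: zeromq; tcp is the other established option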
I just noticed an interesting log entry on the master. A bunch of my minions weren't talking again, even though the log had a ton of lines of "Got return from..." So I restarted salt-master and noticed this in the log:
Specifically this:
Maybe something's hanging the MWorkerQueue? |
Any improvement with these settings?
I didn't use those exact settings because my master is smaller and has fewer minions. It's no longer dropping all the minions every few hours...it's more like once or twice a week.
Also, I'm not sure if this is related or not, but it seems to be in the same vein: communication between the minions and the master is pretty unreliable.
Checking back in here: I think this actually resolved itself for me once I got all my minions to 3007.0. I've removed all the restart cron jobs and the minions appear to have been stable for days now. Is anyone else still having issues with 3007.0 minions?
Yeah... this may still be an issue for me as well. I'm not sure yet. I noticed some odd things last night in testing, but it could be unrelated. I definitely don't have the …
Hi all, I have the same problem on 3007.0.
Related to issue #66562.
Everyone, at least one user who could not upgrade to 3007.x fixed their environment with this patch:
Can anyone in this thread who is willing try out 3007.1 with this patch and report your findings? It would be a huge help.
@dwoz should it be applied on both master and minions in the environment, or is one or the other enough?
Just applying it to the master is what worked for us.
I just applied it to my 3007.1 master and restarted. It definitely didn't fix minions hanging on … As for drops/disconnects, if everything is still connected after 24 hours, that would be a pretty good sign for me that it's fixed.
I'd say that fixes it, @dwoz.
It's been 3 days. All my minions are connected! Hallelujah!
Will this be backported to the stable 3006.x release?
@dwoz This does not seem to apply to 3006.x, but your remark that it was from someone unable to upgrade to 3007.x suggests it should?
I have the same problem on 3006.9 and am curious about the patch, but from what I understand the patch is for the 3007.x version.
This patch worked great on a salt master on 3007.1 with a few thousand minions 👍
Great to read the fix here!
Hi @dwoz, any indication of a date for a fix for this?
It's been going on for a while. SaltStack was bought by VMware, restructured, then bought out by Broadcom, and Broadcom doesn't seem to care about SaltStack. All of this happened in the middle of a big transition in the code aimed at making things more "community maintainable", so things are in disarray and the future seems uncertain. Hopefully the dust will settle soon and things will improve.
There's more to this connectivity issue than the patch @dwoz pointed out. I've been playing around extensively with the Linux and BSD minions today. If I use clusterssh to get an SSH session open to all of them and I run … I ran the command a handful of times on the ones that returned … Then I ran …
We will get things back on track. We were ready to do a release at the end of October, but our CI/CD infrastructure went away with some service consolidation. We are currently working to get the CI/CD pipelines going again and this effort is coming along. Hopefully it won't be too much longer before we can start cutting releases again; likely a few more weeks before we are there.
Can you please provide debug or trace log output for a failing minion?
I need to amend my previous report. Restarting the salt master/minion does NOT fix it. I think this snippet from the logs is the culprit:
I'm not well-versed on transports in salt, but I think the local … The affected minions span both Linux and FreeBSD. All the Linux minions are 3007.1, with the exception of one 3006.8. Full minion log here:
If your logs are being flooded with authentication requests, see bug #67071. The troubleshooting docs say "seconds" and "options can be set on the master"...but those are definitely minion settings, and the values appear to be in milliseconds.
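The settings that comment appears to be describing are the ZeroMQ reconnect back-off options from the troubleshooting guide. Assuming that's the case, they belong in the minion config and are expressed in milliseconds, which matches the observation above. A minimal sketch with the documented defaults:

    # /etc/salt/minion
    recon_default: 1000      # initial wait before reconnecting to the master (ms)
    recon_max: 10000         # cap on the reconnect back-off (ms)
    recon_randomize: True    # spread reconnects out between recon_default and recon_max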
After making a bunch of architectural changes, I found one way of hosing Salt. If you use an external pillar provider (i.e. curl some JSON) and you either set it to delay responses by ~5,000 ms or return bad data (404, garbage HTML instead of JSON, TCP connection reset, etc.), or a combination of both, you can get the master and minions into a hung state where the only way to bring them back (even after removing the delays and bad data) is to restart them. You'll get authentication timeouts, file client timeouts, event bus timeouts, etc. Not sure if it's ZMQ that can't tolerate any sort of delay or hiccup in network traffic or what...but it devastates my infrastructure if I induce it.
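The comment doesn't say how the external pillar is wired in. For reference, an HTTP/JSON external pillar is typically configured on the master roughly like this (a sketch only, with a made-up URL); a slow or erroring endpoint at this point stalls every pillar compile, which is consistent with the hang described above:

    # /etc/salt/master -- hypothetical endpoint for illustration
    ext_pillar:
      - http_json:
          url: https://pillar.example.internal/api/pillar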
When the master gets into this "weird" state, I notice most or all of the "MWorker" processes sitting at 100% CPU usage. I'm assuming the MWorker processes are controlled by the …
Another interesting note...when things are adjusted semi-properly and things are generally working well, the master receives floods of traffic in bursts. The master will receive about 1 gigabit of traffic for ten seconds, then it peters off to nothing for 10 seconds, then another huge burst. It does this even for simple jobs (like …). Maybe it's the pillar data...but after testing the external pillar provider, the combination of grains uploaded and pillar data downloaded averages 0.7 MB per minion; multiply that by ~500 minions and you're talking about 350 MB of data if grains were uploaded and pillar was downloaded. The returns from … When I get it the most stable I possibly can and kick off a job, the minion running on the master throws a ton of these messages into the log while the job is running...even though the master's minion isn't the target of the job:
I've been experiencing this issue as well, at least with what is in the initial report. The salt master is running 3006.9 (onedir, Ubuntu 20.04, installed through apt, with additions for gitfs) and manages minions also on 3006.9 (onedir, Ubuntu 20.04, installed through apt). The minions themselves are a mix of on-prem hosts (2 datacenters; salt connects through a public interface) and VMs (same 2 datacenters as on-prem; salt connects through a private interface). My observations on my setup:
After reading the original bug (and restarting minions on the affected on-prem hosts), and observing that it would work for a while, I formulated the following idea: have a job on the salt master invoke … What this seems to suggest is that there could be an issue with the underlying dynamic IP routing setup and not anything to do with salt. It's possible that an intermediate router is throwing away the route from the minion to the master on a timeout, and not restoring it until the salt minion gets restarted. The route from the master to the minion seems unaffected. There are lots of comments on this thread along the lines of "they are listening to the master", etc.
Description
I am seeing a weird connection issue in my salt setup. There are ~30 minions registered with the master. For a few of them, the master couldn't connect to them anymore after a while.
salt '*' test.ping failed with the following error message: …
Here are a few observations:
salt-call test.ping works fine on the minion side. Other commands like salt-call state.apply also work fine. This indicates minion-to-master communication is fine, but master-to-minion communication is not.
Setup
Minions were installed with sudo ./bootstrap-salt.sh -A <master-ip-address> -i $(hostname) stable 3006.3; no custom config on the minion. The master is saltstack/salt:3006.3. Master configs: …
State file: …
Steps to Reproduce the behavior
Expected behavior
Screenshots
Versions Report
salt --versions-report
Additional context