[BUG] Getting the error "An extra return was detected from minion" with Multimaster Syndic setup post upgrade to 3006.3 one dir #65516
Comments
Hi there! Welcome to the Salt Community! Thank you for making your first contribution. We have a lengthy process for issues and PRs. Someone from the Core Team will follow up as soon as possible. In the meantime, here’s some information that may help as you continue your Salt journey.
There are lots of ways to get involved in our community. Every month, there are around a dozen opportunities to meet with other contributors and the Salt Core team and collaborate in real time. The best way to keep track is by subscribing to the Salt Community Events Calendar.
See also #65301
Have narrowed the problem down to the Syndic. I tried to replicate the same issue with a direct master-to-minion approach; the issue did not occur, and I managed to scale up to 15000 minions using the minion swarm.
The same issue happened with Salt 3004.2, but only beyond 20000 minions. With 3006.3, this error starts occurring at 2500 minions.
Hello, using the same version, 3006.3, I can confirm this bug too. This is then logged as:
Same issue for us since version 3006.3. A lot of extra returns from the syndic to the master of masters occur when more than ~1200 jobs are running simultaneously.
Same issue here. I actually didn't see it manifest until the MoMs were updated to 3006 (our syndics were at 3006 for a few weeks before the MoMs were upgraded). Once "delayed", the syndic seems to continually resend the "delayed" data even after the master receives it (the data is seen coming across the master event bus over and over, and is available in the master job cache after the first time it is received).
Additionally, I added some extra logging in _return_pub_syndic in minion.py, and the exception being thrown is "Nonce verification error", so perhaps this is related to #65114. I unfortunately can't upgrade to 3006.5 due to other issues, to see whether it corrects this in my environment. Curious whether any of the others in this thread have tested against 3006.5.
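For reference, a self-contained sketch of what such extra logging might look like; this is a hypothetical stand-in, not the commenter's actual change, and log_failed_publish is not a Salt function. It illustrates surfacing the exception held by a failed publish future, which the stock "Unable to call ... trying another..." message omits.

```python
# Hypothetical helper, not Salt API: log *why* a publish to a master
# failed, not just that it did.
import logging

log = logging.getLogger(__name__)


def log_failed_publish(func_name, master, future):
    """Log the cause of a failed publish to `master`.

    `future` is a completed future whose .exception() holds the error
    (e.g. "Nonce verification error" or "Message timed out").
    """
    log.error(
        "Unable to call %s on %s, trying another... (cause: %r)",
        func_name,
        master,
        future.exception(),
    )
```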
Hello @lomeroe, we're still using 3006.3. Upgrading our infrastructure is time-consuming. Regards.
I haven't seen the "nonce verification error" exception since I posted that comment, so that may have been a red herring or a one-off event. That said, I went ahead and patched our MoMs and syndics with the fixes from #65247, which are mentioned as the fix in #65114, just to test it out, but it didn't change anything with regard to this problem. Edit to add: the exception being thrown in _return_pub_syndic from my additional logging has just been "Message Timed Out".
Hello, we have several worker threads on the MoM to manage our 4000 minions. When a highstate is launched, each worker starts to report its job returns to the job cache. It seems that several workers try to report the same job for the same minion:
Maybe this will help.
We attempted raising several configuration values. In the end, the problem seemed to be that the syndic reads every return off the event bus during its polling interval and attempts to send them all to the MoM as one large return. This seemed to be the crux of the issue with simultaneous jobs: it was flooding the MoM event bus with one single, giant return that was overrunning everything else (i.e. one return to the MoM that contained hundreds of returns from the syndic). If the MoM could not handle it quickly enough, the syndic would just resend it (causing the "extra return" log events), which also exacerbated the issue on the MoM's event bus, since subsequent returns from the syndic to the MoM would be blocked until the large return finally succeeded or the syndic was restarted.

Perhaps throwing more CPU at the MoM, which would allow more worker threads, would have helped as well, but we were determined not to just throw more hardware at the problem. (Edit to add: I also don't know that more worker threads would matter; being one large return, it seems likely it is confined to one worker thread.)

Modifying our syndics with the following patch has yielded good results; it sends individual returns up to the MoM instead of grouping them together. Perhaps some middle ground could be found and returns could be sent in a configurable number of chunks, but in our environment we have not seen errors or issues with returns in our time running this patch. I'd be interested to see how it performs in the minion swarm setup used by @anandarajan-vivekanandam-agilysys. Here's the patch, for a 3006.7 minion.py:
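As a stand-in for the actual diff, here is a minimal, self-contained sketch of the behavior it describes. chunk_returns is a hypothetical helper, not Salt API: max_per_publish=1 reproduces the per-return forwarding of the patch, and larger values give the "configurable chunks" middle ground mentioned above.

```python
# Hypothetical sketch, not the actual patch: caps how many job returns
# go into each publish from the syndic up to the MoM.
from typing import Iterator, List


def chunk_returns(returns: List[dict], max_per_publish: int = 1) -> Iterator[List[dict]]:
    """Yield the returns collected in one polling interval in small slices.

    max_per_publish=1 mimics per-return forwarding; larger values trade
    MoM event-bus pressure against publish overhead.
    """
    for start in range(0, len(returns), max_per_publish):
        yield returns[start : start + max_per_publish]


# Inside the syndic's forwarding loop (SyndicManager._forward_events in
# salt/minion.py), the single batched call would then become roughly:
#
#     for batch in chunk_returns(collected_returns, max_per_publish=1):
#         self._return_pub_syndic(batch, master_id=master)
#
# instead of one self._return_pub_syndic(collected_returns, ...) call.
```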
@lomeroe We have upgraded the Salt versions on the masters (MoM and syndics) to 3006.5. We are no longer observing the "An extra return was detected" error in the master log. This may be mainly because we moved away from pinging every minion from the Master of Masters to gather live minion status. Instead, we configured beacons on the minions and started capturing the beacon events with an engine on the syndic masters. We have also restricted async job execution from the masters to 500. This approach has eliminated the extra-return problem, and the infrastructure is now stable and can hold up to 80K minions at a time. We could not upgrade our minions beyond 3004.2 because we are seeing frequent timeouts with the latest salt-minion versions, so I could only replicate this in our swarm for now.
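For readers who want to try the same approach, here is a minimal sketch of such a beacon-capturing engine, assuming a custom Python engine dropped into the syndic master's extension modules and enabled via the engines config. This is not the poster's code; the tag prefix and the handler body are placeholders.

```python
# Minimal sketch of a master-side engine that consumes beacon events.
# Assumes Salt's engine loader injects __opts__ as usual.
import logging

import salt.utils.event

log = logging.getLogger(__name__)


def start():
    """Engine entry point: listen on the master event bus for beacon events."""
    event_bus = salt.utils.event.get_master_event(
        __opts__, __opts__["sock_dir"], listen=True  # noqa: F821 (loader-injected)
    )
    for event in event_bus.iter_events(tag="salt/beacon/", full=True):
        if not event:
            continue
        # Placeholder: record minion liveness here instead of pinging
        # every minion from the MoM.
        log.debug("beacon %s: %s", event["tag"], event["data"])
```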
Description
We are using an M.O.M and multimaster syndic setup. We were able to scale up to 20000 minions with the minion swarm without any issues when we were using Salt 3004.2.
We recently upgraded the Salt master and syndic to the 3006.3 onedir version. Post upgrade, the M.O.M is getting spammed with the error message "salt-master[22011]: [ERROR ] An extra return was detected from minion XXXXXXX-XXXXXX-XXXXXX, please verify the minion, this could be a replay attack"
This is affecting performance on the master, and eventually errors appear on the Syndic too. Pasting the syndic log below:
Nov 6 10:43:26 mi-syndic-master-01-perf salt-syndic: [ERROR ] Unable to call _return_pub_multi on cnc-perf-mi.hospitalityrevolution.com, trying another...
Nov 6 10:43:41 mi-syndic-master-01-perf salt-syndic: [WARNING ] The minion failed to return the job information for job 20231106102101948872. This is often due to the master being shut down or overloaded. If the master is running, consider increasing the worker_threads value.
Nov 6 10:43:41 mi-syndic-master-01-perf salt-syndic: [ERROR ] Unable to call _return_pub_multi on cnc-perf-mi.hospitalityrevolution.com, trying another...
Nov 6 10:44:04 mi-syndic-master-01-perf salt-syndic: [WARNING ] The minion failed to return the job information for job 20231106102101948872. This is often due to the master being shut down or overloaded. If the master is running, consider increasing the worker_threads value.
Nov 6 10:44:05 mi-syndic-master-01-perf salt-syndic: [ERROR ] Unable to call _return_pub_multi on cnc-perf-mi.hospitalityrevolution.com, trying another...
Moreover, even after shutting down, we keep seeing the same message spamming the logs, and the minion returns are still captured from the Salt event bus via event_bus.get_event(tag='salt/job/*/ret/*', match_type='fnmatch').
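A sketch of an event-bus listener matching the get_event call quoted above; it assumes a script run on the master host with read access to /etc/salt/master, and the fnmatch glob follows the pattern in the report.

```python
# Sketch of a master-side listener for job returns on the event bus.
import salt.config
import salt.utils.event

opts = salt.config.client_config("/etc/salt/master")
event_bus = salt.utils.event.get_event(
    "master", sock_dir=opts["sock_dir"], opts=opts, listen=True
)
while True:
    ret = event_bus.get_event(
        tag="salt/job/*/ret/*", match_type="fnmatch", full=True
    )
    if ret:
        # Duplicate "extra returns" show up here as the same jid/minion
        # tag arriving over and over.
        print(ret["tag"])
```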
Setup
We have a single M.O.M and 4 syndic masters (all the syndic masters share the same pub key, since they are set up in multimaster mode). We also have a minion swarm setup for scale-testing purposes.
The M.O.M and syndic masters use the Salt 3006.3 onedir version.
The swarm minions use salt-minion 3004.2. The minion swarm Python code is attached:
MinionSwarm.zip
Steps to Reproduce the behavior
Use the minion swarm with the M.O.M and multimaster syndic setup.
Expected behavior
Minions should connect without issues.
Versions Report
M.O.M:
Salt Version:
Salt: 3006.3
Python Version:
Python: 3.10.13 (main, Sep 6 2023, 02:11:27) [GCC 11.2.0]
Dependency Versions:
cffi: 1.14.6
cherrypy: unknown
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.2
libgit2: Not Installed
looseversion: 1.0.2
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.2
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 22.0
pycparser: 2.21
pycrypto: 3.16.0
pycryptodome: Not Installed
pygit2: Not Installed
python-gnupg: 0.4.8
PyYAML: 6.0.1
PyZMQ: 23.2.0
relenv: 0.13.10
smmap: Not Installed
timelib: 0.2.4
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: centos 7.9.2009 Core
locale: utf-8
machine: x86_64
release: 3.10.0-1160.88.1.el7.x86_64
system: Linux
version: CentOS Linux 7.9.2009 Core
Syndic Master:
Salt Version:
Salt: 3006.3
Python Version:
Python: 3.10.13 (main, Sep 6 2023, 02:11:27) [GCC 11.2.0]
Dependency Versions:
cffi: 1.14.6
cherrypy: unknown
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.2
libgit2: Not Installed
looseversion: 1.0.2
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.2
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 22.0
pycparser: 2.21
pycrypto: 3.16.0
pycryptodome: Not Installed
pygit2: Not Installed
python-gnupg: 0.4.8
PyYAML: 6.0.1
PyZMQ: 23.2.0
relenv: 0.13.10
smmap: Not Installed
timelib: 0.2.4
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: centos 7.9.2009 Core
locale: utf-8
machine: x86_64
release: 3.10.0-1160.95.1.el7.x86_64
system: Linux
version: CentOS Linux 7.9.2009 Core