
[BUG] salt-minion gets disconnected from salt-master #66715

Open
gvfnix opened this issue Jul 15, 2024 · 6 comments

@gvfnix

gvfnix commented Jul 15, 2024

Description
During the day I found out that a recently updated salt-minion does not respond:

    salt-master:~$ salt us1101bzrd002 test.ping 
    us1101bzrd002:
        Minion did not return. [Not connected]
    ERROR: Minions returned with non-zero exit code

The minion itself seems to be pretty happy, though:

    us1101bzrd002:~$ sudo salt-call test.ping
    [DEBUG   ] Configuration file path: /etc/salt/minion
    [DEBUG   ] Grains refresh requested. Refreshing grains.
    [DEBUG   ] Reading configuration from /etc/salt/minion
    [DEBUG   ] Including configuration from '/etc/salt/minion.d/_schedule.conf'
    [DEBUG   ] Reading configuration from /etc/salt/minion.d/_schedule.conf
    [DEBUG   ] Including configuration from '/etc/salt/minion.d/grains-minion.conf'
    [DEBUG   ] Reading configuration from /etc/salt/minion.d/grains-minion.conf
    [DEBUG   ] Including configuration from '/etc/salt/minion.d/grains-percona.conf'
    [DEBUG   ] Reading configuration from /etc/salt/minion.d/grains-percona.conf
    [DEBUG   ] Including configuration from '/etc/salt/minion.d/misc.conf'
    [DEBUG   ] Reading configuration from /etc/salt/minion.d/misc.conf
    [DEBUG   ] Including configuration from '/etc/salt/minion.d/pymysql.conf'
    [DEBUG   ] Reading configuration from /etc/salt/minion.d/pymysql.conf
    [DEBUG   ] The functions from module 'google_metadata' are being loaded by dir() on the loaded module
    [DEBUG   ] The functions from module 'core' are being loaded by dir() on the loaded module
    [DEBUG   ] The functions from module 'disks' are being loaded by dir() on the loaded module
    [DEBUG   ] The functions from module 'extra' are being loaded by dir() on the loaded module
    [DEBUG   ] The functions from module 'lvm' are being loaded by dir() on the loaded module
    [DEBUG   ] The functions from module 'mdadm' are being loaded by dir() on the loaded module
    [DEBUG   ] The functions from module 'minion_process' are being loaded by dir() on the loaded module
    [DEBUG   ] The functions from module 'opts' are being loaded by dir() on the loaded module
    [DEBUG   ] The functions from module 'package' are being loaded by dir() on the loaded module
    [DEBUG   ] Override  __utils__: <module 'salt.loaded.int.grains.zfs' from '/opt/saltstack/salt/lib/python3.10/site-packages/salt/grains/zfs.py'>
    [DEBUG   ] The functions from module 'zfs' are being loaded by dir() on the loaded module
    [DEBUG   ] The `lspci` binary is not available on the system. GPU grains will not be available.
    [DEBUG   ] The functions from module 'zfs' are being loaded by dir() on the loaded module
    [DEBUG   ] LazyLoaded zfs.is_supported
    [DEBUG   ] Using selector: EpollSelector
    [DEBUG   ] Connecting to master. Attempt 1 (infinite attempts)
    [DEBUG   ] "salt-master.tangome.cloud" Not an IP address? Assuming it is a hostname.
    [DEBUG   ] Master URI: tcp://172.20.176.16:4506
    [DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'us1101bzrd002', 'tcp://172.20.176.16:4506')
    [DEBUG   ] Generated random reconnect delay between '13669ms' and '43669ms' (13669)
    [DEBUG   ] Setting zmq_reconnect_ivl to '13669ms'
    [DEBUG   ] Setting zmq_reconnect_ivl_max to '43669ms'
    [DEBUG   ] salt.crypt.get_rsa_key: Loading private key
    [DEBUG   ] salt.crypt._get_key_with_evict: Loading private key
    [DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
    [DEBUG   ] salt.crypt.get_rsa_pub_key: Loading public key
    [DEBUG   ] Decrypting the current master AES key
    [DEBUG   ] salt.crypt.get_rsa_key: Loading private key
    [DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
    [DEBUG   ] salt.crypt.get_rsa_pub_key: Loading public key
    [DEBUG   ] <salt.crypt.AsyncAuth object at 0x763add563430> Got new master aes key.
    [DEBUG   ] Closing AsyncReqChannel instance
    [DEBUG   ] Connecting the Minion to the Master publish port, using the URI: tcp://172.20.176.16:4505
    [DEBUG   ] Using selector: EpollSelector
    [DEBUG   ] Using selector: EpollSelector
    [DEBUG   ] Publisher connecting to /var/run/salt/minion/minion_event_02e4eee724_pull.ipc
    [DEBUG   ] salt.crypt.get_rsa_key: Loading private key
    [DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
    [DEBUG   ] Closing _TCPPubServerPublisher instance
    [DEBUG   ] Determining pillar cache
    [DEBUG   ] Using selector: EpollSelector
    [DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'us1101bzrd002', 'tcp://172.20.176.16:4506')
    [DEBUG   ] salt.crypt.get_rsa_key: Loading private key
    [DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
    [DEBUG   ] salt.crypt.verify_signature: Loading public key
    [DEBUG   ] salt.crypt.get_rsa_pub_key: Loading public key
    [DEBUG   ] salt.crypt.verify_signature: Verifying signature
    [DEBUG   ] Closing AsyncReqChannel instance
    [DEBUG   ] The functions from module 'jinja' are being loaded by dir() on the loaded module
    [DEBUG   ] LazyLoaded jinja.render
    [DEBUG   ] The functions from module 'yaml' are being loaded by dir() on the loaded module
    [DEBUG   ] LazyLoaded yaml.render
    [DEBUG   ] The functions from module 'test' are being loaded by dir() on the loaded module
    [DEBUG   ] LazyLoaded test.ping
    [DEBUG   ] The functions from module 'direct_call' are being loaded by dir() on the loaded module
    [DEBUG   ] LazyLoaded direct_call.execute
    [DEBUG   ] test.ping received for minion 'us1101bzrd002'
    [DEBUG   ] Using selector: EpollSelector
    [DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'us1101bzrd002', 'tcp://172.20.176.16:4506')
    [DEBUG   ] Closing AsyncReqChannel instance
    [DEBUG   ] The functions from module 'nested' are being loaded by dir() on the loaded module
    [DEBUG   ] LazyLoaded nested.output
    local:
        True

When I try to restart the minion, it does not stop cleanly:

    us1101bzrd002:~$ sudo systemctl stop salt-minion      # waits about a minute
    us1101bzrd002:~$ sudo systemctl status salt-minion
    × salt-minion.service - The Salt Minion
        Loaded: loaded (/lib/systemd/system/salt-minion.service; enabled; vendor preset: enabled)
        Active: failed (Result: timeout) since Thu 2024-07-11 15:13:19 UTC; 3s ago
        Docs: man:salt-minion(1)
                file:///usr/share/doc/salt/html/contents.html
                https://docs.saltproject.io/en/latest/contents.html
        Process: 3309726 ExecStart=/usr/bin/salt-minion (code=killed, signal=KILL)
    Main PID: 3309726 (code=killed, signal=KILL)
        Tasks: 3 (limit: 251201)
        Memory: 110.4M
            CPU: 35.774s
        CGroup: /system.slice/salt-minion.service
                └─3309911 "/opt/saltstack/salt/bin/python3.10 /usr/bin/salt-minion MultiMinionProcessManager MinionProcessManager"

    Jul 11 15:11:48 us1101bzrd002 salt-minion[3309911]: [WARNING ] Minion received a SIGTERM. Exiting.
    Jul 11 15:11:48 us1101bzrd002 salt-minion[3309911]: [INFO    ] Shutting down the Salt Minion
    Jul 11 15:11:48 us1101bzrd002 salt-minion[3309911]: The Salt Minion is shutdown. Minion received a SIGTERM. Exited.
    Jul 11 15:13:19 us1101bzrd002 systemd[1]: salt-minion.service: State 'stop-sigterm' timed out. Killing.
    Jul 11 15:13:19 us1101bzrd002 systemd[1]: salt-minion.service: Killing process 3309726 (python3.10) with signal SIGKILL.
    Jul 11 15:13:19 us1101bzrd002 systemd[1]: salt-minion.service: Main process exited, code=killed, status=9/KILL
    Jul 11 15:13:19 us1101bzrd002 systemd[1]: salt-minion.service: Failed with result 'timeout'.
    Jul 11 15:13:19 us1101bzrd002 systemd[1]: salt-minion.service: Unit process 3309911 (/opt/saltstack/) remains running after unit stopped.
    Jul 11 15:13:19 us1101bzrd002 systemd[1]: Stopped The Salt Minion.
    Jul 11 15:13:19 us1101bzrd002 systemd[1]: salt-minion.service: Consumed 35.774s CPU time.

It seems that MultiMinionProcessManager hangs and does not get terminated by systemd:

    us1101bzrd002:~$ ps -ef | grep salt-minion
    root      203915       1  0 Jul11 ?        00:00:10 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-minion MultiMinionProcessManager MinionProcessManager
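
One plausible reason the child survives: Salt's packaged salt-minion.service has historically shipped with KillMode=process (check with systemctl cat salt-minion), in which case systemd signals only the main PID on stop and leaves the MultiMinionProcessManager child behind. If that applies here, a drop-in like the following would make systemd clean up the whole cgroup on stop (a hypothetical sketch, at your own risk: the unit presumably uses KillMode=process deliberately so the minion can restart itself):

    # /etc/systemd/system/salt-minion.service.d/killmode.conf (hypothetical drop-in)
    [Service]
    KillMode=control-group
    # afterwards: sudo systemctl daemon-reload && sudo systemctl restart salt-minion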

After killing the MultiMinionProcessManager process and restarting salt-minion.service, the minion connects to the master normally for the next couple of hours. During that period of normal operation the minion can be restarted without any problem.
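
For reference, the manual workaround boils down to something like this (a sketch; the process pattern comes from the ps output above):

    sudo pkill -9 -f MultiMinionProcessManager   # kill the stuck manager process
    sudo systemctl restart salt-minion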

The issue does not reproduce in salt-minion 3004, but reproduces in salt-minion 3006.

Setup

salt-master 3007.1 runs in a container (based on Ubuntu 22.04) orchestrated by Kubernetes, with 4 CPUs and 8Gi of memory; resources are guaranteed.

Master config:

    env:
      driver: env
    ext_pillar:
      - git:
          - __env__ https://gitlab.internal.host/saltstack.git:
            - root: pillar
            - user: git
            - password: sdb://env/GITLAB_TOKEN
    fileserver_backend:
      - gitfs
    git_pillar_base: main
    gitfs_base: main
    gitfs_disable_saltenv_mapping: true
    gitfs_provider: pygit2
    gitfs_remotes:
      - https://gitlab.internal.host/saltstack.git:
          - user: git
          - password: sdb://env/GITLAB_TOKEN
          - update_interval: 30
    gitfs_root: state
    log_file: /dev/null
    log_fmt_console: MASTER %(asctime)s [%(levelname)s] %(message)s
    log_level: info
    master_id: production
    pillar_merge_lists: true
    presence_events: true
    schedule:
      cleanup_keys:
        function: manage.down
        kwargs:
          removekeys: true
        minutes: 60
      sync_runners:
        function: saltutil.sync_runners
        minutes: 1

Minion 3007.1 runs on Ubuntu Server 22.04.

Minion config:

    master: salt-master.internal.host
    master_tries: -1
    auth_tries: 30
    ping_interval: 1
    recon_max: 30000
    saltenv: base
    pillarenv: base
    minion_restart_command: [systemctl, restart, salt-minion]
    log_level: debug

Steps to Reproduce the behavior
Wait a couple of hours, then try to run any module from the salt-master targeting the salt-minion.
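
To catch the moment the minion drops, a simple polling loop on the master might help (a sketch using the minion ID from this report; it relies on the salt CLI exiting non-zero when the minion does not return, as shown at the top of this issue):

    # Poll the minion until it stops answering, then print the time it dropped:
    while salt us1101bzrd002 test.ping; do
        sleep 300    # check every 5 minutes
    done
    date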

Expected behavior
The module should run normally.

Versions Report

    Salt Version:
              Salt: 3007.1
    
    Python Version:
            Python: 3.10.14 (main, Apr  3 2024, 21:30:09) [GCC 11.2.0]
    
    Dependency Versions:
              cffi: 1.16.0
          cherrypy: 18.8.0
          dateutil: 2.8.2
         docker-py: Not Installed
             gitdb: Not Installed
         gitpython: Not Installed
            Jinja2: 3.1.4
           libgit2: Not Installed
      looseversion: 1.3.0
          M2Crypto: Not Installed
              Mako: Not Installed
           msgpack: 1.0.7
      msgpack-pure: Not Installed
      mysql-python: Not Installed
         packaging: 23.1
         pycparser: 2.21
          pycrypto: Not Installed
      pycryptodome: 3.19.1
            pygit2: Not Installed
      python-gnupg: 0.5.2
            PyYAML: 6.0.1
            PyZMQ: 25.1.2
            relenv: 0.16.0
            smmap: Not Installed
          timelib: 0.3.0
          Tornado: 6.3.3
              ZMQ: 4.3.4
    
    Salt Package Information:
      Package Type: onedir
    
    System Versions:
              dist: ubuntu 22.04.4 jammy
            locale: utf-8
           machine: x86_64
           release: 6.5.0-1020-gcp
            system: Linux
           version: Ubuntu 22.04.4 jammy
gvfnix added the Bug (broken, incorrect, or confusing behavior) and needs-triage labels Jul 15, 2024

@tjyang
Contributor

tjyang commented Jul 15, 2024

@gvfnix, please take a look at bug #66282. Also, 3007.1 is an STS release, not LTS.

@gvfnix
Author

gvfnix commented Jul 16, 2024

@tjyang I presume the issue is not related to salt-master, since by my observations so far it depends only on the salt-minion version.

salt-master:3006, salt-minion:3004 - the problem does not appear.
salt-master:3006, salt-minion:3006 - the problem first appeared.
salt-master:3007, salt-minion:3006 - the problem appears.
salt-master:3007, salt-minion:3007 - the problem appears.
salt-master:3007, salt-minion:3004 - the problem does not appear.

There is no MultiMinionProcessManager in salt-minion:3004, so my assumption is that this component is somehow related to the problem.

@tjyang
Contributor

tjyang commented Jul 20, 2024

@gvfnix thanks for sharing your observations.

  • salt-master:3006, salt-minion:3006 - the problem first appeared.

    A master on 3006.8 gives me much more stability on the pub port, with minions both older and newer than the master. But to my surprise, I have had a few incidents where I had to SSH into a 3006.8 minion and restart it to bring the pub port back up.

dwoz added this to the Chlorine v3007.2 milestone Jul 30, 2024
dwoz self-assigned this Jul 30, 2024
@dwoz
Contributor

dwoz commented Jul 30, 2024

@tjyang @gvfnix Does adding master_alive_interval in addition to ping_interval to your minion config solve the problem?
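
For anyone trying this, a minimal /etc/salt/minion excerpt combining the two options might look like the following (values are illustrative; per the Salt docs, master_alive_interval is in seconds while ping_interval is in minutes):

    # /etc/salt/minion (excerpt) -- illustrative values
    ping_interval: 1            # minutes between minion-to-master pings (already set above)
    master_alive_interval: 60   # seconds between checks that the master is still alive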

@MartinEmrich

If anyone finds this while googling for their variant of

    Jul 11 15:11:48 us1101bzrd002 salt-minion[3309911]: The Salt Minion is shutdown. Minion received a SIGTERM. Exited.
    Jul 11 15:13:19 us1101bzrd002 systemd[1]: salt-minion.service: State 'stop-sigterm' timed out. Killing.

This is what happened to me:

The new "Broadcom-Flavoured" Debian packages apparently mark /etc/salt/minion as a proper conffile.
This broke the automated setup during VM provisioning, which first prepares /etc/salt/minion and /etc/salt/minion_id, then installs the salt-minion package (e.g. apt-get install -y salt-minion), and finally starts the salt-minion service (via systemctl).

Now dpkg asks whether to overwrite the pre-existing Salt config files, which cannot be answered when running non-interactively, e.g. under cloud-init. This left the salt-minion package in a dpkg half-configured state; the service started nevertheless.
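
A quick way to confirm that state (a sketch; dpkg --audit lists packages left partially installed or configured):

    dpkg --audit                          # shows half-installed / half-configured packages
    dpkg -s salt-minion | grep '^Status'  # e.g. "Status: install ok half-configured"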

After starting the salt-minion service, the minion did start up, but then ran some dpkg-query command (according to its debug log, which I had enabled to discover this). That command finished up the salt-minion package installation, which in turn triggered a minion restart from within the salt-minion itself.

The effect for me was that the salt-minion ran at first, but then terminated itself about 90 seconds later during this phenomenon.

I suspect that this SIGTERM from within the salt-minion process tree itself is what makes salt-minion struggle?

My solution will be to suppress the interactive "overwrite? (Y/I/N/O/D/Z)" prompt by apt/dpkg means.
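
One way to do that (a sketch, assuming a Debian/Ubuntu apt-based install; --force-confold keeps the already-prepared config file without prompting):

    # Keep the pre-provisioned /etc/salt/minion instead of prompting:
    DEBIAN_FRONTEND=noninteractive apt-get install -y \
        -o Dpkg::Options::="--force-confold" salt-minion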
