
Reactor race condition: found in 2019.2.0, seen in other releases #54045

Closed
whytewolf opened this issue Jul 29, 2019 · 27 comments
Labels: Bug (broken, incorrect, or confusing behavior), Magnesium (Mg release after Na prior to Al), severity-high (2nd top severity, seen by most users, causes major problems), ZD (the issue is related to a Zendesk customer support ticket)

Comments

@whytewolf
Collaborator

Description of Issue

Looks like we have some kind of race condition in the reactor system when passing pillar data to orchestration.

Setup

The setup I used to test this is based on salt-api webhooks: the webhook sends data into the reactor, which carries that data into an orchestration that runs a saltify cloud.profile.

The reactor SLS files all rendered correctly, but the orchestrations were all over the place: some repeated pillar information, and in one case no pillar information was rendered at all.

The orchestration that rendered without any pillar data came from an attempt to mitigate the issue by sleeping 1 second between webhook calls.
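
For reference, the mitigation attempt was roughly the following (a sketch; the unmodified loop is shown under "Command run" below):

for i in {1..8}; do
  curl localhost:8000/hook/deploy_minion -d host=test-${i}
  sleep 1
done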

Steps to Reproduce Issue

config files

/etc/salt/cloud.providers.d/saltify.conf

salty:
  driver: saltify

/etc/salt/cloud.profiles.d/saltify_host.conf

saltify_host:
  provider: salty
  force_minion_config: true

/etc/salt/master.d/*

rest_cherrypy:
  port: 8000
  webhook_disable_auth: True
  disable_ssl: True
log_level_logfile: all
reactor:
  - 'salt/netapi/hook/deploy_minion':
    - /srv/salt/reactor/deploy_minion.sls

/srv/salt/reactor/deploy_minion.sls


# {{data}}


{% set post_data = data.get('post', {}) %}

#{{post_data}}

{% set host = post_data.get('host', '') %}

#{{host}}

{#
Initiate the Orchestration State
#}
deploy_minion_runner:
  runner.state.orch:
    - mods: orch.minion.deploy
    - pillar:
        params:
          host: {{ host }}

/srv/salt/orch/minion/deploy.sls

{% set errors = [] %}
{% set params = salt['pillar.get']('params', {}) %}

{% if 'host' not in params %}
{% do errors.append('Please provide a host') %}
{% set host = '' %}
{% else %}
{% set host = params.get('host') %}
{% endif %}

{% if errors|length > 0 %}
preflight_check:
  test.configurable_test_state:
    - name: Additional information required
    - changes: False
    - result: False
    - comment: '{{ errors|join(", ") }}'
{% else %}
preflight_check:
  test.succeed_without_changes:
    - name: 'All parameters accepted'
{% endif %}

provision_minion:
  salt.runner:
    - name: cloud.profile
    - prof: saltify_host
    - instances:
      - {{ host }}
    - require:
      - preflight_check
    - vm_overrides:
        ssh_host: {{ host }}
        ssh_username: centos
        ssh_keyfile: /etc/salt/id_rsa
        minion:
          master:
            - 172.16.22.60
            - 172.16.22.115

sync_all:
  salt.function:
    - tgt: {{ host }}
    - name: saltutil.sync_all

#execute_highstate:
# salt.state:
#   - tgt: {{ host }}
#   - highstate: True

Command run

for i in {1..8}; do curl localhost:8000/hook/deploy_minion -d host=test-${i}; done
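
One way to watch which pillar each orchestration run actually received is to tail the event bus in a second terminal while the loop runs (not part of the original reproduction; it assumes the state.event runner is available on this master):

salt-run state.event pretty=True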

master (3).log

As can be seen from the log, deploy_minion.sls rendered in this order:
test-1
test-2
test-3
test-4
test-5
test-6
test-7
test-8

But deploy.sls rendered in this order:
test-8
test-2
test-3
test-1
test-7
test-6
test-6
test-5

I have not found the cause of the issue yet.

Versions Report

Salt Version:
           Salt: 2019.2.0

Dependency Versions:
           cffi: Not Installed
       cherrypy: unknown
       dateutil: Not Installed
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.5 (default, Jun 20 2019, 20:27:34)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: centos 7.6.1810 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-862.14.4.el7.x86_64
         system: Linux
        version: CentOS Linux 7.6.1810 Core
@whytewolf
Collaborator Author

This seems to happen more when the orchestrations that the reactor runs are long-running. Moving to another setup that uses test.arg or some other orchestration that finishes quickly doesn't trigger the issue.
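
For comparison, a quick orchestration along these lines (a rough sketch standing in for the test.arg setup mentioned above; the file name and state ID are made up) is over fast enough that the race doesn't show up:

/srv/salt/orch/minion/quick.sls

{% set params = salt['pillar.get']('params', {}) %}

quick_check:
  test.configurable_test_state:
    - name: 'Quick pillar check'
    - changes: False
    - result: True
    - comment: 'host: {{ params.get("host", "") }}'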

@whytewolf
Collaborator Author

zd-3944

@whytewolf whytewolf added the ZD The issue is related to a Zendesk customer support ticket. label Jul 29, 2019
@garethgreenaway garethgreenaway added this to the Approved milestone Jul 30, 2019
@garethgreenaway garethgreenaway added Bug broken, incorrect, or confusing behavior severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around P3 Priority 3 labels Jul 30, 2019
@stale

stale bot commented Jan 8, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

@stale stale bot added the stale label Jan 8, 2020
@whytewolf
Collaborator Author

this is not stale

@stale

stale bot commented Jan 8, 2020

Thank you for updating this issue. It is no longer marked as stale.

@stale stale bot removed the stale label Jan 8, 2020
@sagetherage
Contributor

@whytewolf this won't get marked stale any further; we updated the config, so this shouldn't be a problem any longer. Let me know if you see it on other tickets too, but you shouldn't.

@whytewolf
Collaborator Author

@sagetherage Yep, I ended up getting a flood of these when the timer was adjusted, but it all looks good now.

@sagetherage
Contributor

@whytewolf Yes, we didn't communicate that very well (at all) and should have. Although we are on a better path, we still want to iterate on the config or use GH Actions. Let me know if you have other ideas; I'd love to hear them!

@oeuftete
Contributor

ZD-4993

@gee456

gee456 commented Apr 29, 2020

I wanted to add that we also see this inconsistent reactor-to-orchestration behavior: the reactor fails to pass data, or will even run a server ID twice and skip others. To duplicate this, add the states below and accept 15+ minions at once.

Details of flow

  1. Reactor salt/minion/*/start calls salt://reactor_state.sls
  2. state reactor_state.sls uses runner.state.orchestrate to call orchestrate_state.sls
  3. reactor_state.sls fails to pass event_data to orchestrate_state.sls

salt -V
Salt Version:
Salt: 2019.2.0

Dependency Versions:
cffi: 1.6.0
cherrypy: Not Installed
dateutil: 1.5
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.7.2
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: 0.21.1
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.6
mysql-python: Not Installed
pycparser: 2.14
pycrypto: 2.6.1
pycryptodome: Not Installed
pygit2: Not Installed
Python: 2.7.5 (default, Sep 26 2019, 13:23:47)
python-gnupg: 2.3.1
PyYAML: 3.11
PyZMQ: 15.3.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.1.4

System Versions:
dist: redhat 7.8 Maipo
locale: UTF-8
machine: x86_64
release: 3.10.0-1127.el7.x86_64
system: Linux
version: Red Hat Enterprise Linux Server 7.8 Maipo

################################
reactor.conf

reactor:
  - 'salt/minion/*/start':
    - salt://reactor_state.sls

################################
reactor_state.sls

reactor_state_call:
  runner.state.orchestrate:
    - timeout: 2000
    - args:
      - mods: orchestrate_state
      - pillar:
          event_tag: {{ tag }}
          event_data: {{ data|json }}

################################
orchestrate_state.sls

{% set tag = salt.pillar.get('event_tag',{}) %}
{% set data = salt.pillar.get('event_data',{}) %}

orchestrate_state_data:
  test.configurable_test_state:
    - name: Data and tag pillar
    - changes: False
    - result: True
    - comment: |
        event_data: {{data}}
        tag: {{tag}}

{% if 'id' not in data %}
orchestrate_state-notify-on-Data-failure:
  test.configurable_test_state:
    - name: Data or pillar Failure on Orchestration Call
    - changes: False
    - result: False
    - comment: |
        Failed to get proper pillar values for event_data in orchstandardconf orchestration
        event_data: {{data}}
{% else %}
orchestrate_state-success:
  test.configurable_test_state:
    - name: Data success on Orchestration Call
    - changes: False
    - result: True
    - comment: |
        Finished ok
        event_data: {{data}}
{% endif %}
################################

Here is the output of a failed job and a successful job

failed job "20200428210139495968"

from the event bus

salt/run/20200428210508971833/ret {
"stamp": "2020-04-28T21:05:09.741246",
"fun": "runner.jobs.list_job",
"fun_args": [
"20200428210139495968"
],
"jid": "20200428210508971833",
"return": {
"EndTime": "2020, Apr 28 21:01:50.904429",
"Error": "Cannot contact returner or no job with this jid",
"Result": {
"gcp6387testsse01.earaa6387.gcp.slb.com_master": {
"return": {
"stamp": "2020-04-28T21:01:39.509888",
"fun": "runner.state.orchestrate",
"fun_args": [
{
"mods": "orchestrate_state",
"pillar": {
"event_data": {
"stamp": "2020-04-28T21:01:39.212370",
"cmd": "minion_event",
"data": "Minion gcp6387temp36 started at Tue Apr 28 21:01:39 2020",
"id": "gcp6387temp36",
"pretag": null,
"tag": "salt/minion/gcp6387temp36/start"
},
"event_tag": "salt/minion/gcp6387temp36/start"
}
}
],
"jid": "20200428210139495968",
"return": {
"data": {
"gcp6387testsse01.earaa6387.gcp.slb.com_master": {
"test
|-orchestrate_state-notify-on-Data-failure
|-Data or pillar Failure on Orchestration Call
|-configurable_test_state": {
"id": "orchestrate_state-notify-on-Data-failure",
"run_num": 1,
"sls": "orchestrate_state",
"changes": {},
"comment": "Failed to get proper pillar values for event_data in orchstandardconf orchestration \nevent_data: {} \n",
"duration": 2.84,
"name": "Data or pillar Failure on Orchestration Call",
"result": false,
"start_time": "21:01:50.889726"
},
"test
|-orchestrate_state_data_|-Data and tag pillar_|-configurable_test_state": {
"id": "orchestrate_state_data",
"run_num": 0,
"sls": "orchestrate_state",
"changes": {},
"comment": "event_data: {} \ntag: {}\n",
"duration": 3.148,
"name": "Data and tag pillar",
"result": true,
"start_time": "21:01:50.885958"
}
}
},
"outputter": "highstate",
"retcode": 1
},
"success": false,
"user": "Reactor"
}
}
},
"StartTime": "2020, Apr 28 21:01:39.495968",
"jid": "20200428210139495968"
},
"success": true,
"user": "root"
}

[root@gcp6387testsse01 ~]# salt-run jobs.list_job 20200428210139495968
EndTime:
2020, Apr 28 21:01:50.904429
Error:
Cannot contact returner or no job with this jid
Result:
----------
gcp6387testsse01.earaa6387.gcp.slb.com_master:
----------
return:
----------
_stamp:
2020-04-28T21:01:39.509888
fun:
runner.state.orchestrate
fun_args:
|

----------
mods:
orchestrate_state
pillar:
----------
event_data:
----------
_stamp:
2020-04-28T21:01:39.212370
cmd:
_minion_event
data:
Minion gcp6387temp36 started at Tue Apr 28 21:01:39 2020
id:
gcp6387temp36
pretag:
None
tag:
salt/minion/gcp6387temp36/start
event_tag:
salt/minion/gcp6387temp36/start
jid:
20200428210139495968
return:
----------
data:
----------
gcp6387testsse01.earaa6387.gcp.slb.com_master:
----------
test_|-orchestrate_state-notify-on-Data-failure_|-Data or pillar Failure on Orchestration Call_|-configurable_test_state:
----------
id:
orchestrate_state-notify-on-Data-failure
run_num:
1
sls:
orchestrate_state
changes:
----------
comment:
Failed to get proper pillar values for event_data in orchstandardconf orchestration
event_data: {}
duration:
2.84
name:
Data or pillar Failure on Orchestration Call
result:
False
start_time:
21:01:50.889726
test_|-orchestrate_state_data_|-Data and tag pillar_|-configurable_test_state:
----------
id:
orchestrate_state_data
run_num:
0
sls:
orchestrate_state
changes:
----------
comment:
event_data: {}
tag: {}
duration:
3.148
name:
Data and tag pillar
result:
True
start_time:
21:01:50.885958
outputter:
highstate
retcode:
1
success:
False
user:
Reactor
StartTime:
2020, Apr 28 21:01:39.495968
jid:
20200428210139495968

ok job from event bus

salt/run/20200428210624555837/ret {
"stamp": "2020-04-28T21:06:25.339035",
"fun": "runner.jobs.list_job",
"fun_args": [
"20200428210149620344"
],
"jid": "20200428210624555837",
"return": {
"EndTime": "2020, Apr 28 21:01:54.637924",
"Error": "Cannot contact returner or no job with this jid",
"Result": {
"gcp6387testsse01.earaa6387.gcp.slb.com_master": {
"return": {
"stamp": "2020-04-28T21:01:49.636384",
"fun": "runner.state.orchestrate",
"fun_args": [
{
"mods": "orchestrate_state",
"pillar": {
"event_data": {
"stamp": "2020-04-28T21:01:45.194576",
"cmd": "minion_event",
"data": "Minion gcp6387temp38 started at Tue Apr 28 21:01:45 2020",
"id": "gcp6387temp38",
"pretag": null,
"tag": "salt/minion/gcp6387temp38/start"
},
"event_tag": "salt/minion/gcp6387temp38/start"
}
}
],
"jid": "20200428210149620344",
"return": {
"data": {
"gcp6387testsse01.earaa6387.gcp.slb.com_master": {
"test
|-orchestrate_state-success
|-Data success on Orchestration Call
|-configurable_test_state": {
"id": "orchestrate_state-success",
"run_num": 1,
"sls": "orchestrate_state",
"changes": {},
"comment": "Finished ok \nevent_data: {u'stamp': u'2020-04-28T21:01:45.194576', u'cmd': u'minion_event', u'data': u'Minion gcp6387temp38 started at Tue Apr 28 21:01:45 2020', u'id': u'gcp6387temp38', u'pretag': None, u'tag': u'salt/minion/gcp6387temp38/start'} \n",
"duration": 0.31,
"name": "Data success on Orchestration Call",
"result": true,
"start_time": "21:01:54.636565"
},
"test
|-orchestrate_state_data
|-Data and tag pillar
|-configurable_test_state": {
"id": "orchestrate_state_data",
"run_num": 0,
"sls": "orchestrate_state",
"changes": {},
"comment": "event_data: {u'_stamp': u'2020-04-28T21:01:45.194576', u'cmd': u'_minion_event', u'data': u'Minion gcp6387temp38 started at Tue Apr 28 21:01:45 2020', u'id': u'gcp6387temp38', u'pretag': None, u'tag': u'salt/minion/gcp6387temp38/start'} \ntag: salt/minion/gcp6387temp38/start\n",
"duration": 0.417,
"name": "Data and tag pillar",
"result": true,
"start_time": "21:01:54.636010"
}
}
},
"outputter": "highstate",
"retcode": 0
},
"success": true,
"user": "Reactor"
}
}
},
"StartTime": "2020, Apr 28 21:01:49.620344",
"jid": "20200428210149620344"
},
"success": true,
"user": "root"
}

[root@gcp6387testsse01 ~]# salt-run jobs.list_job 20200428210149620344
EndTime:
2020, Apr 28 21:01:54.637924
Error:
Cannot contact returner or no job with this jid
Result:
----------
gcp6387testsse01.earaa6387.gcp.slb.com_master:
----------
return:
----------
_stamp:
2020-04-28T21:01:49.636384
fun:
runner.state.orchestrate
fun_args:
|

----------
mods:
orchestrate_state
pillar:
----------
event_data:
----------
_stamp:
2020-04-28T21:01:45.194576
cmd:
_minion_event
data:
Minion gcp6387temp38 started at Tue Apr 28 21:01:45 2020
id:
gcp6387temp38
pretag:
None
tag:
salt/minion/gcp6387temp38/start
event_tag:
salt/minion/gcp6387temp38/start
jid:
20200428210149620344
return:
----------
data:
----------
gcp6387testsse01.earaa6387.gcp.slb.com_master:
----------
test_|-orchestrate_state-success_|-Data success on Orchestration Call_|-configurable_test_state:
----------
id:
orchestrate_state-success
run_num:
1
sls:
orchestrate_state
changes:
----------
comment:
Finished ok
event_data: {u'_stamp': u'2020-04-28T21:01:45.194576', u'cmd': u'_minion_event', u'data': u'Minion gcp6387temp38 started at Tue Apr 28 21:01:45 2020', u'id': u'gcp6387temp38', u'pretag': None, u'tag': u'salt/minion/gcp6387temp38/start'}
duration:
0.31
name:
Data success on Orchestration Call
result:
True
start_time:
21:01:54.636565
test_|-orchestrate_state_data_|-Data and tag pillar_|-configurable_test_state:
----------
id:
orchestrate_state_data
run_num:
0
sls:
orchestrate_state
changes:
----------
comment:
event_data: {u'_stamp': u'2020-04-28T21:01:45.194576', u'cmd': u'_minion_event', u'data': u'Minion gcp6387temp38 started at Tue Apr 28 21:01:45 2020', u'id': u'gcp6387temp38', u'pretag': None, u'tag': u'salt/minion/gcp6387temp38/start'}
tag: salt/minion/gcp6387temp38/start
duration:
0.417
name:
Data and tag pillar
result:
True
start_time:
21:01:54.636010
outputter:
highstate
retcode:
0
success:
True
user:
Reactor
StartTime:
2020, Apr 28 21:01:49.620344
jid:
20200428210149620344

@amalaguti

amalaguti commented May 4, 2020

I was able to deploy 30 to 50 minions without issues by adding some randomness to the orchestration...

{% set random_start = range(1, 30) | random %}
pause:
  salt.runner:
    - name: test.sleep
    - s_time: {{ random_start }}
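
Presumably the pause goes at the top of the orchestration so the real work only starts after the random delay. A sketch of how it could be wired into the deploy.sls example from the original report (the require on pause is an assumption on my part, following the bare-ID requisite style used there, and {{ host }} comes from that example's pillar handling):

{% set random_start = range(1, 30) | random %}

pause:
  salt.runner:
    - name: test.sleep
    - s_time: {{ random_start }}

provision_minion:
  salt.runner:
    - name: cloud.profile
    - prof: saltify_host
    - instances:
      - {{ host }}
    - require:
      - pause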

@whytewolf whytewolf added CS-S1 and removed CS-S3 labels May 13, 2020
@sagetherage sagetherage removed the P3 Priority 3 label Jun 3, 2020
@oeuftete
Contributor

oeuftete commented Jun 8, 2020

ZD-4955

@sagetherage sagetherage added the Magnesium Mg release after Na prior to Al label Jun 9, 2020
@sagetherage sagetherage added severity-high 2nd top severity, seen by most users, causes major problems and removed severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around labels Jun 25, 2020
@sagetherage sagetherage assigned Ch3LL and unassigned s0undt3ch Jun 25, 2020
@oeuftete
Contributor

ZD-5336

@sagetherage sagetherage modified the milestones: Approved, Magnesium Jul 14, 2020
@max-arnold
Contributor

I think I got bitten by this bug on 2019.2.5, but it is hard to reduce my states to a minimal test case. Unfortunately, it looks like Salt reactors can't be relied upon 😟

@doesitblend
Collaborator

When I run the suggested reproduction steps above, I get the attached log.
gh54045.log

@theskabeater
Contributor

@oeuftete I was able to replicate your results using your example. @Ch3LL mentioned testing with this PR: #56513 -- and it appears to solve the issue in your example. I'm seeing all reactor and orchestration logs without any duplicate or missing indices in the log messages. Would you mind checking out that PR to see if you get similar results?

@oeuftete
Contributor

@theskabeater Yes, I just tested the current master (04a0068) plus a patch generated from #56513, and I'm seeing all of the reaction orchestrations happen now, once only, in both the parallel and non-parallel cases. Great!

I'm also going to link in the issue at #58369 to make sure this issue is visible there.

@Ch3LL
Contributor

Ch3LL commented Oct 5, 2020

Since #56513 is merged I am closing here

@Ch3LL Ch3LL closed this as completed Oct 5, 2020
@lomeroe
Contributor

lomeroe commented Dec 2, 2020

Are there PRs other than #56513 that this fix is dependent on? 3002.2 appears to have all the modifications from that PR and I'm still seeing this issue.

# salt-call -V
Salt Version:
          Salt: 3002.2

Dependency Versions:
          cffi: 1.11.5
      cherrypy: Not Installed
      dateutil: 2.6.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.10.1
       libgit2: Not Installed
      M2Crypto: 0.35.2
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.14
      pycrypto: Not Installed
  pycryptodome: Not Installed
        pygit2: Not Installed
        Python: 3.6.8 (default, Apr 16 2020, 01:36:27)
  python-gnupg: 0.4.6
        PyYAML: 3.12
         PyZMQ: 19.0.0
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.3

System Versions:
          dist: centos 8 Core
        locale: UTF-8
       machine: x86_64
       release: 4.18.0-193.28.1.el8_2.x86_64
        system: Linux
       version: CentOS Linux 8 Core

@oeuftete
Contributor

oeuftete commented Dec 2, 2020

@lomeroe I will reopen this. The important memory leak fixes that went into 3002.2 reverted the changes that resolved this issue. I believe that pending PR #58853 will re-resolve this issue.

@oeuftete oeuftete reopened this Dec 2, 2020
@lomeroe
Contributor

lomeroe commented Dec 2, 2020

@oeuftete good info, thanks!

@sagetherage
Contributor

Looks like #58853 was merged and will be in the Aluminium release; closing this issue.

@oeuftete
Contributor

ZD-6199. Seen on 3002.2 with no reactors involved, but it does use multiple concurrent orchestrations with inline pillar data.

@s0undt3ch
Collaborator

@oeuftete so, this probably needs a new issue for the scenario you're facing?
Or reopen this one?

@oeuftete
Contributor

@s0undt3ch I think we're OK with no new issue and keeping this closed. I added the comment just for Enterprise issue cross-referencing.

I believe the "new" issue is just a less common way to reproduce the issue of cross-talk in concurrent orchestrations with pillar data. The only difference (I think) is that the previous examples were (conveniently) triggered by reactors, whereas the recent one was via manual/API-driven orchestrations.

@oeuftete
Contributor

oeuftete commented Mar 9, 2021

Using the test case in #54045 (comment), I tested with https://github.com/saltstack/salt/releases/tag/v3003rc1, and both the regular and parallel cases are working fine.
