Reactor race condition, found in 2019.2.0, seen in other releases #54045
Comments
This seems to happen more when the orchestrations that the reactor runs are long-running. Moving to another setup that uses test.arg or some other orchestration that finishes quickly doesn't trigger the issue.
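(For context, a minimal sketch of what such a quick control orchestration could look like; this is not from the original report, and the pillar key is an assumption.)

```yaml
# Hypothetical control case, not from the report: an orchestration that only
# calls test.arg finishes almost immediately, and the race reportedly does not
# trigger with orchestrations this short.
quick_check:
  salt.function:
    - name: test.arg
    - tgt: '*'
    - arg:
      - {{ pillar.get('host', 'no-host-passed') }}
```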
ZD-3944
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.
This is not stale.
Thank you for updating this issue. It is no longer marked as stale.
@whytewolf this won't get marked
@sagetherage Yep, I ended up getting a flood of these when the timer was adjusted, but it all looks good now.
@whytewolf Yes, we didn't communicate that very well (at all) and should have, and although we are on a better path, we still want to iterate on the config or use GH Actions. LMK if you have other ideas, love to hear it!
ZD-4993 |
I wanted to add that we also see this inconsistent reactor-to-orchestration behavior: the reactor fails to pass data, or will even run a server ID twice and skip others. To duplicate this, add the states below and accept 15+ minions at once. Details of the flow:
[Collapsed attachments not preserved in this copy: the salt -V output (dependency and system versions), the reactor/orchestration states used (including a {% if 'id' not in data %} guard), and the output of a failed job and a successful job. Recoverable details: failed job, jid 20200428210139495968, returned on the event bus at salt/run/20200428210508971833/ret and inspected with salt-run jobs.list_job 20200428210139495968; successful job, jid 20200428210149620344, returned at salt/run/20200428210624555837/ret and inspected with salt-run jobs.list_job 20200428210149620344.]
I was able to deploy 30 to 50 minions without issues by adding some randomness to the orchestration...
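(The actual change isn't shown above; below is a minimal sketch of one way to add such randomness, assuming the orchestration receives the target host in pillar and deploys it through the saltify_host profile from the reproduction steps. The delay step and the target minion ID are hypothetical.)

```yaml
# /srv/salt/orch/minion/deploy.sls -- hypothetical sketch, not the commenter's
# actual workaround: a random pause before cloud.profile so that concurrent
# orchestration runs are less likely to collide.
{% set host = pillar['host'] %}
{% set delay = range(1, 30) | random %}

random_pause_{{ host }}:
  salt.function:
    - name: test.sleep
    - tgt: saltmaster01        # hypothetical: any minion that is always up
    - arg:
      - {{ delay }}

deploy_{{ host }}:
  salt.runner:
    - name: cloud.profile
    - prof: saltify_host
    - instances:
      - {{ host }}
    - require:
      - salt: random_pause_{{ host }}
```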
ZD-4955
ZD-5336
I think I got bitten by this bug on 2019.2.5, but it is hard to reduce my states to a minimal test case. Unfortunately, it looks like Salt reactors can't be relied upon 😟
When I run the suggested steps above to reproduce this, I get the attached log.
@oeuftete I was able to replicate your results using your example. @Ch3LL mentioned testing with this PR: #56513 -- and it appears to solve the issue in your example. I'm seeing all reactor and orchestration logs without any duplicate or missing indices in the log messages. Would you mind checking out that PR to see if you get similar results?
@theskabeater Yes, I just tested the current master (04a0068) plus a patch generated from #56513, and I'm seeing all of the reaction orchestrations happen now, once only, in both the parallel and non-parallel cases. Great! I'm also going to link this issue in #58369 to make sure it is visible there.
Since #56513 is merged I am closing here |
Are there PRs other than #56513 that this fix is dependent on? 3002.2 appears to have all the modifications from that PR and I'm still seeing this issue.
@oeuftete good info, thanks!
looks like #58853 was merged and will be in the Aluminium release, closing this issue |
ZD-6199. 3002.2, reactors not involved, but multiple concurrent orchestrations using inline pillar data.
@oeuftete so, this probably needs a new issue for the scenario you're facing?
@s0undt3ch I think we're OK with no new issue and keeping this closed. I added the comment just for Enterprise issue cross-referencing. I believe the "new" issue is just a less common way to reproduce the cross-talk between concurrent orchestrations with pillar data. The only difference (I think) is that the previous examples were (conveniently) triggered by reactors, whereas the recent one was via manual/API-driven orchestrations.
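(For the record, the ZD-6199 variant can be pictured as something like the following salt-api /run payload: two orchestrations started concurrently with different inline pillar data. This is an illustrative sketch only; the orchestration file and values are assumptions, and eauth credential fields are omitted.)

```yaml
# Hypothetical /run lowstate (YAML body) starting two orchestrations
# concurrently with different inline pillar -- illustrative only.
- client: runner_async
  fun: state.orchestrate
  mods: orch.minion.deploy
  pillar:
    host: test-1
- client: runner_async
  fun: state.orchestrate
  mods: orch.minion.deploy
  pillar:
    host: test-2
```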
Using the test case in #54045 (comment), I tested with https://github.com/saltstack/salt/releases/tag/v3003rc1, and both the regular and parallel cases are working fine.
Description of Issue
Looks like we have some kind of race condition in the reactor system when passing pillar data to orchestration.
Setup
The setup I used to test this was based on salt-api webhooks: data is sent from the webhook into the reactor, which carries that data into an orchestration that runs a saltify cloud.profile. The reactors all rendered 100% correctly, but the orchestrations were all over the place: some repeated pillar information, and in one case no pillar information was rendered at all.
The orchestration that rendered without any pillar data came from an attempt to mitigate the issue by sleeping 1 second in between sending the webhooks.
Steps to Reproduce Issue
config files
/etc/salt/cloud.providers.d/saltify.conf
/etc/salt/cloud.profiles.d/saltify_host.conf
/etc/salt/master.d/*
/srv/salt/reactor/deploy_minion.sls
/srv/salt/orch/minion/deploy.sls
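(The contents of these files were attached as collapsed blocks in the original issue and are not reproduced in this copy. The sketch below shows what a matching setup could look like, inferred only from the file names, the saltify cloud.profile mentioned in the setup, and the webhook command below; every specific value is an assumption.)

```yaml
# Hypothetical reconstruction of the listed files -- the originals are not
# shown here, so names and values are assumptions.

# /etc/salt/cloud.providers.d/saltify.conf
saltify-provider:
  driver: saltify

# /etc/salt/cloud.profiles.d/saltify_host.conf
saltify_host:
  provider: saltify-provider

# /etc/salt/master.d/api.conf -- salt-api webhook endpoint (port 8000 matches
# the curl command below)
rest_cherrypy:
  port: 8000
  disable_ssl: True
  webhook_disable_auth: True

# /etc/salt/master.d/reactor.conf -- route the webhook event to the reactor
reactor:
  - 'salt/netapi/hook/deploy_minion':
    - /srv/salt/reactor/deploy_minion.sls

# /srv/salt/reactor/deploy_minion.sls -- hand the POSTed host to the
# orchestration as pillar (the 'id' guard mirrors the fragment visible in an
# earlier comment)
{% if 'id' not in data %}
deploy_minion:
  runner.state.orchestrate:
    - args:
      - mods: orch.minion.deploy
      - pillar:
          host: {{ data['post']['host'] }}
{% endif %}

# /srv/salt/orch/minion/deploy.sls -- deploy the host through the saltify profile
deploy_{{ pillar['host'] }}:
  salt.runner:
    - name: cloud.profile
    - prof: saltify_host
    - instances:
      - {{ pillar['host'] }}
```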
Command run
for i in {1..8}; do curl localhost:8000/hook/deploy_minion -d host=test-${i}; done
master (3).log
As can be seen from the log, deploy_minion.sls rendered in this order:
test-1
test-2
test-3
test-4
test-5
test-6
test-7
test-8
But deploy.sls rendered in this order:
test-8
test-2
test-3
test-1
test-7
test-6
test-6
test-5
I have not found the cause of the issue yet.
Versions Report