
All of the returners (excluding mysql and local_cache) are not handling the "req" job id correctly #51809

Open
arizvisa opened this issue Feb 25, 2019 · 20 comments
Labels
Bug broken, incorrect, or confusing behavior Confirmed Salt engineer has confirmed bug/feature - often including a MCVE severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around
Milestone

Comments

@arizvisa
Contributor

Description of Issue/Question

Hiya. I'm the author of PR #51363 (etcd returner), and I noticed that the mysql returner specially handles the case when the job id is set to "req". This job id seems to be submitted by a minion when it is activated independent of a command from the master. (Relevant code: https://github.com/saltstack/salt/blob/develop/salt/returners/mysql.py#L294)

The way this magic job id is handled by the returner is that, when it is seen, it gets replaced with a new job id generated from returner.prep_jid. However, it appears that only the mysql and local_cache returners handle this job id explicitly; everything else does not tamper with it at all. Which returner is handling it correctly?

Is this something that a returner is actually not supposed to receive, i.e. the result of a bug or misconfiguration? Most of the logic that touches this magic job id immediately replaces it with a call to prep_jid; however, there are a few cases where this logic is missed, which results in a returner receiving this 'req' job...
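
For reference, the pattern those two returners follow looks roughly like this (an illustrative sketch only, not the actual mysql code; the in-memory store and the jid format are stand-ins for whatever a real backend does):

# Illustrative sketch only, not the actual mysql returner code.
# The jid format and the in-memory "storage" are stand-ins for the real backend.
import datetime

_STORE = {}  # stand-in for a real backend


def prep_jid(nocache=False, passed_jid=None):
    '''Generate a fresh jid (a real returner would use salt's jid utilities).'''
    return passed_jid or datetime.datetime.now().strftime('%Y%m%d%H%M%S%f')


def returner(ret):
    '''
    Store a return, remapping the magic 'req' jid to a freshly generated
    one so standalone minion jobs do not collide with each other.
    '''
    if ret.get('jid') == 'req':
        ret['jid'] = prep_jid(nocache=ret.get('nocache', False))
    _STORE[ret['jid']] = ret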

Setup

To view the job id in use, simply set startup_states in a minion config to execute some state that takes a while (so that you can race against the state being applied). Configure any of the non-local returners so that you can manually check the job being created.

Steps to Reproduce Issue

While the state is being applied, view the active jobs with salt-run jobs.active and you'll see that the state started by the minion has a job id of "req".

Versions Report

Salt Version:
           Salt: 2019.2.0

Dependency Versions:
           cffi: 1.11.5
       cherrypy: Not Installed
       dateutil: 2.8.0
      docker-py: Not Installed
          gitdb: 2.0.3
      gitpython: 2.1.11
          ioflo: 1.7.5
         Jinja2: 2.10
        libgit2: 0.27.8
        libnacl: 1.6.1
       M2Crypto: Not Installed
           Mako: 1.0.7
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
   pycryptodome: 3.7.0
         pygit2: 0.27.4
         Python: 2.7.15 (default, Oct 15 2018, 15:26:09)
   python-gnupg: Not Installed
         PyYAML: 4.2
          PyZMQ: 17.0.0
           RAET: 0.6.8
          smmap: 2.0.3
        timelib: Not Installed
        Tornado: 5.0.2
            ZMQ: 4.1.6

System Versions:
           dist: fedora 29 Twenty Nine
         locale: ANSI_X3.4-1968
        machine: x86_64
        release: 4.14.96-coreos
         system: Linux
        version: Fedora 29 Twenty Nine
@arizvisa
Contributor Author

arizvisa commented Feb 26, 2019

It looks like the reason returners might get req as the job id instead of a real job id is the following code. But again, I'm really unfamiliar with this beastly code-base, so I'm not 100% sure:

https://github.com/saltstack/salt/blob/develop/salt/daemons/masterapi.py#L780

    def _return(self, load):
        '''
        Handle the return data sent from the minions
        '''
...
        if load['jid'] == 'req':
            # The minion is returning a standalone job, request a jobid
            prep_fstr = '{0}.prep_jid'.format(self.opts['master_job_cache'])
            load['jid'] = self.mminion.returners[prep_fstr](nocache=load.get('nocache', False))

            # save the load, since we don't have it
            saveload_fstr = '{0}.save_load'.format(self.opts['master_job_cache'])
            self.mminion.returners[saveload_fstr](load['jid'], load)
...
        if not self.opts['job_cache'] or self.opts.get('ext_job_cache'):    # XXX: is this right?
            return

        fstr = '{0}.update_endtime'.format(self.opts['master_job_cache'])
        if (self.opts.get('job_cache_store_endtime')
                and fstr in self.mminion.returners):
            self.mminion.returners[fstr](load['jid'], endtime)

        fstr = '{0}.returner'.format(self.opts['master_job_cache'])
        self.mminion.returners[fstr](load)

You can see the conditional at XXX where it's checking whether the caches don't exist. However, it looks like the test for both job_cache and ext_job_cache was intended to be wrapped in parentheses, because this line is essentially saying that if there's _not_ a job cache or there _is_ an external job cache, then skip the following code, which calls the returner() function and uses the master_job_cache.

Is this intentional, or was it really supposed to be the following, which results in the returner being called when an external job cache is configured?

if not (self.opts['job_cache'] or self.opts.get('ext_job_cache')):
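
To make the difference concrete, here's a tiny, purely illustrative comparison of the two conditionals (plain booleans stand in for the opts above; this is not Salt code):

# Purely illustrative: plain booleans stand in for self.opts['job_cache']
# and self.opts.get('ext_job_cache').
import itertools

for job_cache, ext_job_cache in itertools.product([False, True], repeat=2):
    current = not job_cache or ext_job_cache       # conditional as currently written
    proposed = not (job_cache or ext_job_cache)    # parenthesized alternative
    print('job_cache={0} ext_job_cache={1} -> skip returner? current={2} proposed={3}'.format(
        job_cache, ext_job_cache, current, proposed))

# With job_cache=True and ext_job_cache=True, the current form skips the
# master_job_cache returner call, while the parenthesized form does not.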

@Ch3LL
Contributor

Ch3LL commented Feb 26, 2019

it looks like adding req to the jid was on purpose. See this PR: #16392

And if you see this comment: #16392 (comment), it looks like we need to add the code handling the req return to all the returners.

@Ch3LL Ch3LL added Bug broken, incorrect, or confusing behavior Question The issue is more of a question rather than a bug or a feature request and removed Bug broken, incorrect, or confusing behavior labels Feb 26, 2019
@Ch3LL Ch3LL added this to the Approved milestone Feb 26, 2019
@arizvisa
Contributor Author

Ok. Because yeah, if that explicit check doesn't exist, it results in the load data being overwritten, since all jids spawned from a minion are 'req'. I'll update my PR then.

Thanks!

@arizvisa
Contributor Author

arizvisa commented Feb 27, 2019

Ftr, that unfortunately means that all the returners (other than mysql and local_cache) are busted. Maybe not totally busted, actually, since there's no error (just silence), but essentially any jobs spawned by minions will cease to exist...

Should I create a new issue pointing that out, or perhaps re-open this and change the title?

@Ch3LL
Contributor

Ch3LL commented Mar 4, 2019

let's re-title this and track here :)

@Ch3LL Ch3LL reopened this Mar 4, 2019
@arizvisa arizvisa changed the title [Question] the "req" job id and how it relates to returners All of the returners (excluding mysql and local_cache) are not handling the "req" job id correctly Mar 6, 2019
@arizvisa
Contributor Author

arizvisa commented Mar 6, 2019

Okay, cool. Re-titled it.

As another question, is there some way to see the 'req' jobs that are currently being run in a salt environment? Although returners should re-map 'req' to a new job id, I've been unable to figure out how to view that new job id without actually watching the event queue. If we go back to the prior logic where 'req' is not re-mapped, then you can view the job... but if more than one minion makes a 'req', then the previous results are overwritten.

@arizvisa
Contributor Author

arizvisa commented Mar 8, 2019

@Ch3LL, do you mind re-tagging this since it isn't a question anymore?

@arizvisa
Contributor Author

arizvisa commented Mar 8, 2019

Ftr, PR #51363 fixes this issue just for the etcd returner. So if/once that gets merged, there's one less returner to fix.

@Ch3LL Ch3LL added Bug broken, incorrect, or confusing behavior severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around P3 Priority 3 triage and removed Question The issue is more of a question rather than a bug or a feature request labels Mar 12, 2019
@Ch3LL
Contributor

Ch3LL commented Mar 12, 2019

if we go to the prior logic where 'req' is not re-mapped, then you can view the job

How were you previously viewing it when it was working?

@arizvisa
Contributor Author

By using 'req' as the job id. However, any time another client started another job, it would overwrite the previous job data for 'req'.

@Ch3LL
Contributor

Ch3LL commented Mar 20, 2019

When you say overwrite, do you mean it's overwriting the req job in etcd where the job returns are stored?

@arizvisa
Contributor Author

Yes. It's overwriting because the job id is 'req', so in any returner that lets you use a non-integral job id, the data will be overwritten if the job id is being used as a unique key. Offhand I think cassandra does this, but there are probably others.

@arizvisa
Contributor Author

What might work as a workaround would be to introduce minion-specific job ids. This way job ids can be unique, which matches the assumptions that the sql-based returners make when creating an index on the job id.

Something like req-${MINION_ID}-${TIMESTAMP} or similar, since it seems that once a job id hits the returner, salt doesn't care about it anymore (not sure about this). But this way the job id is still searchable via returner.list_jobs (and friends).
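
As a rough illustration of that idea (the exact format and where it would be generated are precisely what needs discussing, so treat this as a sketch, not a concrete proposal):

# Rough sketch of the suggested req-${MINION_ID}-${TIMESTAMP} scheme.
# The format is hypothetical; the point is just that the jid stays
# recognizable as a 'req' job while remaining unique per minion and run.
import datetime


def prep_req_jid(minion_id):
    '''Build a unique, searchable jid for a minion-initiated ('req') job.'''
    stamp = datetime.datetime.utcnow().strftime('%Y%m%d%H%M%S%f')
    return 'req-{0}-{1}'.format(minion_id, stamp)

# e.g. prep_req_jid('web01') -> 'req-web01-20190321154210123456'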

This'll definitely need some discussion though, as I'm really only familiar with the returner interfaces from my work on etcd. Pretty much the most complete returners are the sql-based ones, the local_cache one, the write-only ones (email-based and such), and the etcd one (afaict, and whenever it gets merged).

@Ch3LL
Contributor

Ch3LL commented Mar 21, 2019

Thanks for clarifying. Wanted to make sure I was following correctly.

ping @saltstack/team-core any ideas on how we could approach this?

@waynew
Contributor

waynew commented Apr 30, 2019

Apparently I need to dig into returner code, because it's a place I'm not familiar with at all right now 😛

@arizvisa
Contributor Author

arizvisa commented May 7, 2019

@waynew, if you want, check out my PR for etcd support in the returners. I'd be willing to bet that it's the most complete and documented of the returner implementations.

#51363

Here's a link to the specific file in my submitted branch: https://github.com/arizvisa/saltstack-salt/blob/etcd-returner-is-busted/salt/returners/etcd_return.py

@arizvisa
Contributor Author

arizvisa commented May 7, 2019

If you use the etcd returner PR, it's very easy to monitor what a client is doing, as etcd simplifies monitoring its schema via etcdclient. You can watch for "/$jid/$id/$num" (or similar, I haven't messed with this in a long time), where your jid is "req" or a result from "prep_jid", and your "num" is an incrementing number for the state that the client is processing.
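
For what it's worth, a rough sketch of that watching approach with the python-etcd (etcd2) client; the '/salt/return' prefix here is an assumption, so substitute whatever path your returner is actually configured to write under:

# Rough sketch using python-etcd (the etcd2 client). The '/salt/return'
# prefix is an assumption; use whatever prefix your returner writes under,
# and look for keys whose jid component is 'req' vs. a generated jid.
import etcd

client = etcd.Client(host='127.0.0.1', port=2379)

for event in client.eternal_watch('/salt/return', recursive=True):
    # each change under the prefix shows up here as the minion writes
    # its per-state results
    print('{0} {1} {2}'.format(event.action, event.key, event.value))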

Otherwise you'll need to use state.event to watch what's happening as jobs get returned and keep an eye out for "$jid", which can be "req" or a real jid depending on the issue I mentioned.

Or you can also just plant logs in the returner that you're using (non-local ones).

@stale

stale bot commented Jan 8, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

@stale stale bot added the stale label Jan 8, 2020
@arizvisa
Contributor Author

arizvisa commented Jan 8, 2020

This is referenced by #52775 and seems to be the only mention of the special handling of the "req" jid by the returners. Each returner has to explicitly handle this job, so this should probably be left open.

Relevant documentation for how the etcd2 returner deals with this special case can be found in the salt.returners.etcd_return module merged by PR #51363.

@stale

stale bot commented Jan 8, 2020

Thank you for updating this issue. It is no longer marked as stale.

@stale stale bot removed the stale label Jan 8, 2020
@waynew waynew added the Confirmed Salt engineer has confirmed bug/feature - often including a MCVE label Feb 27, 2020
@sagetherage sagetherage removed the P3 Priority 3 label Jun 3, 2020