-
Notifications
You must be signed in to change notification settings - Fork 814
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[windows] new service and packaging #2417
Conversation
The service part of the windows agent has been rewritten: instead of having the service file (former `agent.py`) including the "supervisor" logic, both now live in two separate files. In addition to a better separation of concerns, this allows us to put our supervisor outside of the executable generated by `py2exe`. Which again, makes it easier for somebody willing to contribute to access the agent's python venv and it's pip. Hacking on the agent on Windows should now be a bit easier. The `agent_service.py` file takes care of managing the global `Datadog Agent` service... all it does is starting `supervisor.py` using `subprocess.Popen`) and killing that process when the OS asks the service to stop. It comes along with the set of changes brought by the Omnibus 4 build.
Our home-made Windows supervisor is know using processes launched using `subprocess.Popen` instead of `multiprocessing`. It gets closer to the behaviour of `supervisord` on Linux and leads to a much smaller code for the supervisor itself. On the other hand it makes it a bit harder to have a windows-specific behaviour for our subprocesses (like JMXFetch) but cross-compatibility issues should rather be taken care of in the code of these subprocesses themselves instead of doing it in the upper layer.
For now the path setup in the new Windows supervisor and in the service layer on top of it were convenient paths to hack on the agent. We now use the real ones. Logs for our new supervisor and service layer have also been added to the flare command.
Before the supervisor was killed in a very brutal and rude manner. Now both layers can talk through a socket bound to port 9001 so that the service layer can ask the supervisor to terminate smoothly and the latter replies once he's done so that the service can finally terminate. Also, the way the service layer computes the path of the supervisor has been enhanced to support different agent installation configurations: So far, only the MSI installer was supported. With this change (along with the rest of the Omnibus 4 build), it is now possible to use the source install script as well (even though we don't have one yet on Windows). Finally, just git pull-ing the `dd-agent` repo, installing the requirements and running the agent (YES, even as a regular Windows service) has also been made possible. This commit is particularly intented to people willing to get hack on the agent.
The embedded python shell for `dd-agent` isn't necessary with the new agent build. Accessing the embedded python is a simple as calling the embedded python executable... and so is calling pip :) I also simplified the way the agent service discovers where the agent is actually installed (just by using `os.path.join` instead of calling `os.path.dirname` several times).
We now use the Python built in logging module in the `win32/ddagent_service.py` file as well. Logs are quite pretty now :) Also a useless log line has just been removed from the `windows_supervisor.py` file.
When the latter is started without the `start` argument. Also a useless import has been removed from the service file.
Even though we're not supposed to use the GUI on OSX anymore... who knows what could happen.
This is actually a fix for the biggest syntax error ever... I couldn't rebase because of a recent merge :/
Windows has its special one. (memory watchdog was removed) The old one is now the POSIX one. (still memory watchdog when enabled) They now all live in the utils/watchdog.py file.
As it doesn't exist
And add missing run method
It works for us.
windows_supervisor: flake8 and comments win32/ddagent_service -> win32/service: it's the only service. + flake8 and comments
When the API key is not set in the config but available in the registry.
For get_config, it was failing the collector.
Doc is not the most explicit
import supervisor.xmlrpc | ||
except ImportError: | ||
# Not used/shipped on Windows | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agent.py
from utils.compat import supervisor_xmlrpc
utils/compat.py
Nitpick: I like having these try/except ImportError on a separate compat
module, i.e.
try:
import supervisor.xmlrpc as supervisor_xmlrpc
except ImportError:
supervisor_xmlrpc = None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I respectfully disagree. I don't think it's needed for one dependency. Actually I would prefer extracting the function using supervisor.xmlrpc
in its own utils/
and have the logic there.
utils/flare.py
Outdated
) | ||
self._add_log_file_tar( | ||
"{0}/service.log".format(os.path.dirname(self._collector_log)) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
supervisor.log
does not exist anymore
win32/service.py
Outdated
log.error( | ||
"{0} reached the limit of restarts ({1} tries during the last {2}s" | ||
" (max authorized: {3})). Not restarting." | ||
.format(self._name, len(self._restarts), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not use .format()
here. (I wrote this line :p)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--> there are multiple occurrences of this.
utils/watchdog.py
Outdated
psutil.Process().kill() | ||
|
||
def reset(self): | ||
log.debug("Resetting watchdog for %d" % self._duration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am worried this gets too spammy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We were already doing it in the Unix Watchdog, I think it's fine.
utils/windows_configuration.py
Outdated
except OSError as e: | ||
log.exception('Unable to save new datadog.conf') | ||
|
||
log.debug('Sucessfully saved the new datadog.cong') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Sucessfully/Successfully
and s/datadog.cong/datadog.conf
win32/service.py
Outdated
|
||
log.info("%s is stopped.", self._name) | ||
else: | ||
log.info('%s was not running.', self._name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't seem useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched it to debug. I think it's still useful because it tells us the service didn't find the process. (more obvious than an absence of log)
Issue: JMXFetch is not properly exited when the service stops. |
Can we also document how to use the Datadog Agent from the command-line? We used to have a |
Fix log formatting, use JMXFetchProcess (so that we can actually stop the java jmxfetch process).
If the service somehow fails to write the config, the agent processes will now query the registry as well.
It could be useful to debug the GUI/service.
JMXFetch now exits properly with the agent. |
As they're responsible for the logging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few comments but nothing major. Feel free to merge once they're answered/addressed
One more thing, that you can take care of later: let's remove these functions
Lines 664 to 702 in 47f9a3e
def set_win32_cert_path(): | |
"""In order to use tornado.httpclient with the packaged .exe on Windows we | |
need to override the default ceritifcate location which is based on the path | |
to tornado and will give something like "C:\path\to\program.exe\tornado/cert-file". | |
If pull request #379 is accepted (https://github.com/facebook/tornado/pull/379) we | |
will be able to override this in a clean way. For now, we have to monkey patch | |
tornado.httpclient._DEFAULT_CA_CERTS | |
""" | |
if hasattr(sys, 'frozen'): | |
# we're frozen - from py2exe | |
prog_path = os.path.dirname(sys.executable) | |
crt_path = os.path.join(prog_path, 'ca-certificates.crt') | |
else: | |
cur_path = os.path.dirname(__file__) | |
crt_path = os.path.join(cur_path, 'packaging', 'datadog-agent', 'win32', | |
'install_files', 'ca-certificates.crt') | |
import tornado.simple_httpclient | |
log.info("Windows certificate path: %s" % crt_path) | |
tornado.simple_httpclient._DEFAULT_CA_CERTS = crt_path | |
def set_win32_requests_ca_bundle_path(): | |
"""In order to allow `requests` to validate SSL requests with the packaged .exe on Windows, | |
we need to override the default certificate location which is based on the location of the | |
requests or certifi libraries. | |
We override the path directly in requests.adapters so that the override works even when the | |
`requests` lib has already been imported | |
""" | |
import requests.adapters | |
if hasattr(sys, 'frozen'): | |
# we're frozen - from py2exe | |
prog_path = os.path.dirname(sys.executable) | |
ca_bundle_path = os.path.join(prog_path, 'cacert.pem') | |
requests.adapters.DEFAULT_CA_BUNDLE_PATH = ca_bundle_path | |
log.info("Default CA bundle path of the requests library: {0}" | |
.format(requests.adapters.DEFAULT_CA_BUNDLE_PATH)) |
setup.py
Outdated
@@ -118,8 +70,8 @@ def __init__(self, **kw): | |||
'compressed': True, | |||
'bundle_files': 3, | |||
'excludes': ['numpy'], | |||
'dll_excludes': ['crypt32.dll',"IPHLPAPI.DLL", "NSI.dll", "WINNSI.DLL", "WTSAPI32.dll"], | |||
'ascii':False, | |||
'dll_excludes': ["IPHLPAPI.DLL", "NSI.dll", "WINNSI.DLL", "WTSAPI32.dll", "WTSAPI32.dll"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 things here:
- is removing
crypt32.dll
intentional? (was added here: [core] [windows] Fixes windows agent for windows server 2008 and earlier #2486) WTSAPI32.dll
is listed twice, probably not intentional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not, re-included it.
setup.py
Outdated
|
||
# windows-specific deps | ||
install_requires.append('pywin32==217') | ||
|
||
# Modules to force-include in the exe | ||
include_modules = [ | ||
# stdlib | ||
'os', | ||
'time', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we need to explicitly include modules from the stdlib, but whatever :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed them, it still works.
win32/service.py
Outdated
# Let's have an uptime counter | ||
self.start_ts = None | ||
|
||
# find the main agent dir, for instance C:\Program Files\Datadog\Datadog Agent\agent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the comment is incorrect right? Think this should read C:\Program Files\Datadog\Datadog Agent
del env['PYTHONHOME'] | ||
if env['PATH'][-1] != ';': | ||
env['PATH'] += ';' | ||
env['PATH'] += "{};{};".format(os.path.join(dd_dir, 'bin'), os.path.join(dd_dir, 'embedded')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we also add embedded
to the path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because gohai
requires python
to be in the path (to get the python version), and python.exe
is in embedded
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's remove it, we don't need this info in gohai anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually you can disregard my comment: we do the same path handling on Linux so let's do the same thing here
Remove stdlib from include_modules (it shouldn't be needed). Re-include `crypt32.dll` for WS 2008.
* master: (67 commits) [network] dont combine connection states (#3158) renames function in line with other checks [couchbase] Modified service_check_tags in couchbase.py to include user-specified tags. (#3079) [docker] fix image tag extraction Fix tests, refactor how we collect container and volume states test_docker_daemon.py: fix syntax errors Beginning work on docker_daemon tests. Add 5 opt-in checks to docker_daemon add mention of office hours (#3171) updates psycopg2 to 2.6.2 (#3170) [trace-agent] adding output of 'trace-agent -info' (#3164) [riak] Change default value in configuration example to match default value from the code move setting parameter to instance level riak security support [dns_check][ci] Bring back assertions on metrics (#3162) [powerdns_recursor] adds support for v4. (#3166) [tcp_check] Add custom tags to response_time gauge catch can't connect instead of failing on nodata found (#3127) [php-fpm] add http_host tag [dns_check] Document NXDOMAIN usage in yaml example file ...
bba9887
to
04e40d0
Compare
win32/service.py
Outdated
|
||
self._proc.terminate() | ||
|
||
self._proc.wait(timeout=self._STOP_TIMEOUT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's also document that this wait
step is not supposed to do anything right now (since terminate
simply kills the process on Windows)
67fa052
to
20c5e6c
Compare
We want to JMXFetch to exit properly because issues could arise should it close abruptly its JMX connectioni (this is pure hypothesis, but let's stay safe). We still kill it if we have to wait more than 15s. (= one full run of JMXFetch)
20c5e6c
to
af04f0b
Compare
Cool stuff! |
This PR completely changes the agent on Windows.
Try it here : https://drive.google.com/a/datadoghq.com/file/d/0B6bdbouD69UhS3djVzBlaDRxZDQ/view?usp=sharing
@elafarge is the main author of this PR, I fixed some bugs and rebased.
Previous state
The agent on Windows has two parts :
win32/agent.py
) : starts and watch 4 processes (collector, forwarder, dogstatsd, jmxfetch), which reuse a lot of code from the Unix agent (except the collector, not usingagent.py
).Both of them are shipped with
py2exe
, so we have to specify all their dependencies insetup.py
, and it's really difficult to add a Python package once installed.Why change this ?
agent.py
) is annoying.Omnibus helps us prepare the dependencies for the agent, but it won't solve our problems if the agent is still completely shipped frozen using py2exe.
What does this PR do ?
First things first : it doesn't change the GUI. (it adds two displayable logs files, but that's it)
However it completely changes the service : it now lives in
win32/service.py
, and only launches a "Windows supervisor" (windows_supervisor.py
).This service and the GUI are STILL shipped using py2exe. (see updated
setup.py
), but the Windows supervisor and the agent are not. They are using a python shipped by Omnibus.The Windows supervisor launches the agent processes (collector, dogstatsd, forwarder, jmxfetch) and restarts them if they die. It relies on
psutil
to do so.The agent is slightly modified for Windows compatibility (watchdog and os.signal).
It is now way easier to play with the Windows agent, and it's even possible to interact with it using the command line. And the python shipped with it is a real/full one, with pip and the possibility to install other dependencies.
This PR also adds "custom checks" for Windows, i.e. we could specify a directory where customers could put custom checks (Unix agent-like).
Does it affect other platforms ?
It shouldn't, but the watchdog and the agent were modified/moved for Windows compatibility. It was only tested on Windows.
Main review points
win32/service.py
) : launch the Windows supervisor and stops it.windows_supervisor.py
), written from scratch, starts and stops the agent processes.utils/watchdog.py
).Related PR
DataDog/dd-agent-omnibus#74
DataDog/omnibus-software#48
DataDog/omnibus-ruby#20
https://github.com/DataDog/osx-dd-agent-build-box/pull/1