Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[windows] new service and packaging #2417

Merged
merged 56 commits into from
Feb 13, 2017
Merged

Conversation

degemer
Copy link
Member

@degemer degemer commented Apr 13, 2016

This PR completely changes the agent on Windows.

Try it here : https://drive.google.com/a/datadoghq.com/file/d/0B6bdbouD69UhS3djVzBlaDRxZDQ/view?usp=sharing

@elafarge is the main author of this PR, I fixed some bugs and rebased.

Previous state

The agent on Windows has two parts :

  • the service (win32/agent.py) : starts and watch 4 processes (collector, forwarder, dogstatsd, jmxfetch), which reuse a lot of code from the Unix agent (except the collector, not using agent.py).
  • the gui : awful GUI we all know, displays the state/logs/conf files. Only easy way to edit the configuration files on Windows. Interacts with the service to start/stop/restart it.

Both of them are shipped with py2exe, so we have to specify all their dependencies in setup.py, and it's really difficult to add a Python package once installed.

Why change this ?

  1. It's really difficult to add a check dependency.
  2. It's a pain to debug anything on Windows since the main interface is the GUI (or the embedded python, but it's frozen by py2exe).
  3. Having to maintain a part of the code for Windows specifically for Windows (and duplicate the feature added to agent.py) is annoying.

Omnibus helps us prepare the dependencies for the agent, but it won't solve our problems if the agent is still completely shipped frozen using py2exe.

What does this PR do ?

First things first : it doesn't change the GUI. (it adds two displayable logs files, but that's it)

However it completely changes the service : it now lives in win32/service.py, and only launches a "Windows supervisor" (windows_supervisor.py).

This service and the GUI are STILL shipped using py2exe. (see updated setup.py), but the Windows supervisor and the agent are not. They are using a python shipped by Omnibus.

The Windows supervisor launches the agent processes (collector, dogstatsd, forwarder, jmxfetch) and restarts them if they die. It relies on psutil to do so.

The agent is slightly modified for Windows compatibility (watchdog and os.signal).

It is now way easier to play with the Windows agent, and it's even possible to interact with it using the command line. And the python shipped with it is a real/full one, with pip and the possibility to install other dependencies.

This PR also adds "custom checks" for Windows, i.e. we could specify a directory where customers could put custom checks (Unix agent-like).

Does it affect other platforms ?

It shouldn't, but the watchdog and the agent were modified/moved for Windows compatibility. It was only tested on Windows.

Main review points

  • New service (win32/service.py) : launch the Windows supervisor and stops it.
  • Windows supervisor (windows_supervisor.py), written from scratch, starts and stops the agent processes.
  • Watchdog for Windows and Linux (utils/watchdog.py).

Related PR

DataDog/dd-agent-omnibus#74
DataDog/omnibus-software#48
DataDog/omnibus-ruby#20
https://github.com/DataDog/osx-dd-agent-build-box/pull/1

Etienne LAFARGE and others added 24 commits April 13, 2016 16:07
The service part of the windows agent has been rewritten: instead of
having the service file (former `agent.py`) including the "supervisor"
logic, both now live in two separate files.

In addition to a better separation of concerns, this allows us to put
our supervisor outside of the executable generated by `py2exe`. Which
again, makes it easier for somebody willing to contribute to access the
agent's python venv and it's pip. Hacking on the agent on Windows should
now be a bit easier.

The `agent_service.py` file takes care of managing the global `Datadog
Agent` service... all it does is starting  `supervisor.py` using
`subprocess.Popen`) and killing that process when the OS asks the
service to stop.

It comes along with the set of changes brought by the Omnibus 4 build.
Our home-made Windows supervisor is know using processes launched using
`subprocess.Popen` instead of `multiprocessing`. It gets closer to the
behaviour of `supervisord` on Linux and leads to a much smaller code for
the supervisor itself. On the other hand it makes it a bit harder to
have a windows-specific behaviour for our subprocesses (like JMXFetch)
but cross-compatibility issues should rather be taken care of in the
code of these subprocesses themselves instead of doing it in the upper
layer.
For now the path setup in the new Windows supervisor and in the service
layer on top of it were convenient paths to hack on the agent. We now
use the real ones.

Logs for our new supervisor and service layer have also been added to
the flare command.
Before the supervisor was killed in a very brutal and rude manner.
Now both layers can talk through a socket bound to port 9001 so that
the service layer can ask the supervisor to terminate smoothly and
the latter replies once he's done so that the service can finally
terminate.

Also, the way the service layer computes the path of the supervisor has
been enhanced to support different agent installation configurations:

So far, only the MSI installer was supported. With this change (along
with the rest of the Omnibus 4 build), it is now possible to use the
source install script as well (even though we don't have one yet on
Windows). Finally, just git pull-ing the `dd-agent` repo, installing
the requirements and running the agent (YES, even as a regular
Windows service) has also been made possible.

This commit is particularly intented to people willing to get hack on
the agent.
The embedded python shell for `dd-agent` isn't necessary with the new
agent build. Accessing the embedded python is a simple as calling the
embedded python executable... and so is calling pip :)

I also simplified the way the agent service discovers where the agent is
actually installed (just by using `os.path.join` instead of calling
`os.path.dirname` several times).
We now use the Python built in logging module in the
`win32/ddagent_service.py` file as well. Logs are quite
pretty now :)

Also a useless log line has just been removed from the
`windows_supervisor.py` file.
When the latter is started without the `start` argument.
Also a useless import has been removed from the service file.
Even though we're not supposed to use the GUI on OSX anymore... who
knows what could happen.
This is actually a fix for the biggest syntax error ever...
I couldn't rebase because of a recent merge :/
Windows has its special one. (memory watchdog was removed)
The old one is now the POSIX one. (still memory watchdog when enabled)
They now all live in the utils/watchdog.py file.
windows_supervisor: flake8 and comments

win32/ddagent_service -> win32/service: it's the only service. + flake8 and
comments
When the API key is not set in the config but available in the registry.
For get_config, it was failing the collector.
Doc is not the most explicit
@masci masci modified the milestones: 5.11.0, 5.12.0 Jan 24, 2017
import supervisor.xmlrpc
except ImportError:
# Not used/shipped on Windows
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agent.py

from utils.compat import supervisor_xmlrpc

utils/compat.py
Nitpick: I like having these try/except ImportError on a separate compat module, i.e.

try:
  import supervisor.xmlrpc as supervisor_xmlrpc
except ImportError:
  supervisor_xmlrpc = None

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I respectfully disagree. I don't think it's needed for one dependency. Actually I would prefer extracting the function using supervisor.xmlrpc in its own utils/ and have the logic there.

utils/flare.py Outdated
)
self._add_log_file_tar(
"{0}/service.log".format(os.path.dirname(self._collector_log))
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

supervisor.log does not exist anymore

win32/service.py Outdated
log.error(
"{0} reached the limit of restarts ({1} tries during the last {2}s"
" (max authorized: {3})). Not restarting."
.format(self._name, len(self._restarts),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not use .format() here. (I wrote this line :p)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--> there are multiple occurrences of this.

psutil.Process().kill()

def reset(self):
log.debug("Resetting watchdog for %d" % self._duration)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am worried this gets too spammy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We were already doing it in the Unix Watchdog, I think it's fine.

except OSError as e:
log.exception('Unable to save new datadog.conf')

log.debug('Sucessfully saved the new datadog.cong')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Sucessfully/Successfully and s/datadog.cong/datadog.conf

win32/service.py Outdated

log.info("%s is stopped.", self._name)
else:
log.info('%s was not running.', self._name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't seem useful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched it to debug. I think it's still useful because it tells us the service didn't find the process. (more obvious than an absence of log)

@yannmh
Copy link
Member

yannmh commented Jan 24, 2017

Issue: JMXFetch is not properly exited when the service stops.

@yannmh
Copy link
Member

yannmh commented Jan 24, 2017

Can we also document how to use the Datadog Agent from the command-line? We used to have a shell.exe Python entry point, but it's not the case anymore.

Fix log formatting, use JMXFetchProcess (so that we can actually stop
the java jmxfetch process).
If the service somehow fails to write the config, the agent processes
will now query the registry as well.
It could be useful to debug the GUI/service.
@degemer
Copy link
Member Author

degemer commented Jan 25, 2017

JMXFetch now exits properly with the agent.

As they're responsible for the logging.
Copy link
Member

@olivielpeau olivielpeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few comments but nothing major. Feel free to merge once they're answered/addressed

One more thing, that you can take care of later: let's remove these functions

dd-agent/config.py

Lines 664 to 702 in 47f9a3e

def set_win32_cert_path():
"""In order to use tornado.httpclient with the packaged .exe on Windows we
need to override the default ceritifcate location which is based on the path
to tornado and will give something like "C:\path\to\program.exe\tornado/cert-file".
If pull request #379 is accepted (https://github.com/facebook/tornado/pull/379) we
will be able to override this in a clean way. For now, we have to monkey patch
tornado.httpclient._DEFAULT_CA_CERTS
"""
if hasattr(sys, 'frozen'):
# we're frozen - from py2exe
prog_path = os.path.dirname(sys.executable)
crt_path = os.path.join(prog_path, 'ca-certificates.crt')
else:
cur_path = os.path.dirname(__file__)
crt_path = os.path.join(cur_path, 'packaging', 'datadog-agent', 'win32',
'install_files', 'ca-certificates.crt')
import tornado.simple_httpclient
log.info("Windows certificate path: %s" % crt_path)
tornado.simple_httpclient._DEFAULT_CA_CERTS = crt_path
def set_win32_requests_ca_bundle_path():
"""In order to allow `requests` to validate SSL requests with the packaged .exe on Windows,
we need to override the default certificate location which is based on the location of the
requests or certifi libraries.
We override the path directly in requests.adapters so that the override works even when the
`requests` lib has already been imported
"""
import requests.adapters
if hasattr(sys, 'frozen'):
# we're frozen - from py2exe
prog_path = os.path.dirname(sys.executable)
ca_bundle_path = os.path.join(prog_path, 'cacert.pem')
requests.adapters.DEFAULT_CA_BUNDLE_PATH = ca_bundle_path
log.info("Default CA bundle path of the requests library: {0}"
.format(requests.adapters.DEFAULT_CA_BUNDLE_PATH))
since they're useless with these changes

setup.py Outdated
@@ -118,8 +70,8 @@ def __init__(self, **kw):
'compressed': True,
'bundle_files': 3,
'excludes': ['numpy'],
'dll_excludes': ['crypt32.dll',"IPHLPAPI.DLL", "NSI.dll", "WINNSI.DLL", "WTSAPI32.dll"],
'ascii':False,
'dll_excludes': ["IPHLPAPI.DLL", "NSI.dll", "WINNSI.DLL", "WTSAPI32.dll", "WTSAPI32.dll"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 things here:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not, re-included it.

setup.py Outdated

# windows-specific deps
install_requires.append('pywin32==217')

# Modules to force-include in the exe
include_modules = [
# stdlib
'os',
'time',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need to explicitly include modules from the stdlib, but whatever :/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed them, it still works.

win32/service.py Outdated
# Let's have an uptime counter
self.start_ts = None

# find the main agent dir, for instance C:\Program Files\Datadog\Datadog Agent\agent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comment is incorrect right? Think this should read C:\Program Files\Datadog\Datadog Agent

del env['PYTHONHOME']
if env['PATH'][-1] != ';':
env['PATH'] += ';'
env['PATH'] += "{};{};".format(os.path.join(dd_dir, 'bin'), os.path.join(dd_dir, 'embedded'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we also add embedded to the path?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because gohai requires python to be in the path (to get the python version), and python.exe is in embedded.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's remove it, we don't need this info in gohai anyway

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually you can disregard my comment: we do the same path handling on Linux so let's do the same thing here

Remove stdlib from include_modules (it shouldn't be needed).
Re-include `crypt32.dll` for WS 2008.
* master: (67 commits)
  [network] dont combine connection states (#3158)
  renames function in line with other checks
  [couchbase] Modified service_check_tags in couchbase.py to include user-specified tags. (#3079)
  [docker] fix image tag extraction
  Fix tests, refactor how we collect container and volume states
  test_docker_daemon.py: fix syntax errors
  Beginning work on docker_daemon tests.
  Add 5 opt-in checks to docker_daemon
  add mention of office hours (#3171)
  updates psycopg2 to 2.6.2 (#3170)
  [trace-agent] adding output of 'trace-agent -info' (#3164)
  [riak] Change default value in configuration example to match default value from the code
  move setting parameter to instance level
  riak security support
  [dns_check][ci] Bring back assertions on metrics (#3162)
  [powerdns_recursor] adds support for v4. (#3166)
  [tcp_check] Add custom tags to response_time gauge
  catch can't connect instead of failing on nodata found (#3127)
  [php-fpm] add http_host tag
  [dns_check] Document NXDOMAIN usage in yaml example file
  ...
@degemer degemer force-pushed the quentin/windows-omnibus-5.0 branch from bba9887 to 04e40d0 Compare February 7, 2017 22:29
win32/service.py Outdated

self._proc.terminate()

self._proc.wait(timeout=self._STOP_TIMEOUT)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's also document that this wait step is not supposed to do anything right now (since terminate simply kills the process on Windows)

@degemer degemer force-pushed the quentin/windows-omnibus-5.0 branch 3 times, most recently from 67fa052 to 20c5e6c Compare February 10, 2017 20:10
We want to JMXFetch to exit properly because issues could arise should
it close abruptly its JMX connectioni (this is pure hypothesis, but
let's stay safe). We still kill it if we have to wait more than 15s.
(= one full run of JMXFetch)
@degemer degemer force-pushed the quentin/windows-omnibus-5.0 branch from 20c5e6c to af04f0b Compare February 10, 2017 20:33
@degemer degemer merged commit 7251c9c into master Feb 13, 2017
@degemer degemer deleted the quentin/windows-omnibus-5.0 branch February 13, 2017 16:33
@raboof
Copy link

raboof commented Feb 14, 2017

Cool stuff!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants