-
-
Notifications
You must be signed in to change notification settings - Fork 652
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pantsd without forking #7596
Pantsd without forking #7596
Conversation
7cfc6e8
to
b8d598d
Compare
d1642af
to
3513d9f
Compare
### Problem In the context of #7596, pants tasks will be executed outside of the main thread. This means that tasks like python-repl cannot override signal handling (code), because this can only be done from the main thread. ### Solution Create an atomic variable, SignalHandler._ignore_sigint, that gates SIGINT handling. Create a global context manager inside ExceptionSink that toggles this for a certain time if needed. ### Result Any thread can pause the handling of SIGINT in a controlled way. python-repl no longer installs signal handlers, but rather it toggles the handling of the signal. ### Caveat We may want to gate all signals individually, but the semantics of SIGINT are usually different enough from SIGTERM and SIGKILL that I think it's okay to only gate the first one.
Heads up, #7623 just merged and it unblocks a better fix for |
3513d9f
to
49ae0dd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks good, although once things are green the travis edits need to be nuked.
class NoopExiter(Exiter): | ||
def exit(self, result, *args, **kwargs): | ||
if result != 0: | ||
# TODO this exception was not designed for this, but it happens to do what we want. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expanding this comment would be good: what is it that we want from the NoopExiter
? As with many things, I expect that we can clean it up post merge, but would be good to include a bit more context in case ... that takes a while.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some context on this: We want the invocation of LocalPantsRunner.run()
to raise an exception if it finishes with an error code, because then DaemonPantsRunner
can catch that and process that as necessary (sending the error message to the nailgun client, for example). This processing happens in this set of catch blocks.
_GracefulTerminationException
was designed to be a way to signal "errors that happen when we try to close the pants run before it's done". These exceptions were raised by this call.
In this line, we are changing the semantics of _GracefulTerminationException()
, from "pants hasn't been able to complete the run" to "pants hasn't been able to complete the run, or pants has completed the run and it is a failure (error code >0)".
The correct way to do this would be to either explicitly update the docstring of _GracefulTerminationException
to account for the change in semantics, or to create a new exception type (maybe PantsFinishedWithFailureException
), and add another elif
branch to this block of code
|
||
def __init__(self, socket): | ||
def __init__(self, maybe_shutdown_socket): | ||
# N.B. Assuming a fork()'d child, cause os._exit to be called here to avoid the routine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part isn't accurate anymore, so would be good to clarify what this class does now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added todo in #7606
graph_helper = None | ||
target_roots = None | ||
options_bootstrapper = None | ||
# N.B. This will be overridden with the correct value if options |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment should still be accurate I think? If it's not, then creating the subprocess_dir
should no longer be necessary, and we can leave a TODO about removing it.
yield fd | ||
else: | ||
with open('/dev/null', 'rb') as fh: | ||
yield fh.fileno() | ||
|
||
# The NailgunStreamWriter probably shouldn't grab the socket without a lock... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should convert this to a TODO that references a ticket, probably.
self._exiter.exit(e.exit_code) | ||
except Exception: | ||
except Exception as e: | ||
# LocalPantsRunner.set_start_time resets the global exiter, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still true? My understanding was that context managers were added to reset the exiters.
src/python/pants/java/nailgun_io.py
Outdated
finally: | ||
os.close(read_fd) | ||
|
||
# TODO: The error is that when the client finishes (and presumably closes the socket), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to refer to a ticket, or expand/remove this.
@@ -402,7 +403,7 @@ def test_pantsd_pid_change(self): | |||
def test_pantsd_memory_usage(self): | |||
"""Validates that after N runs, memory usage has increased by no more than X percent.""" | |||
number_of_runs = 10 | |||
max_memory_increase_fraction = 0.35 | |||
max_memory_increase_fraction = 0.60 # TODO this should be closer to 0.35 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can refer to Patrick's ticket for this.
@@ -477,6 +479,7 @@ def test_pantsd_parse_exception_success(self): | |||
finally: | |||
rm_rf(test_path) | |||
|
|||
@unittest.skip("Pantsd is not expected to handle parallel runs anymore.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these will be skipped rather than deleted, we should probably refer to a ticket about re-enabling parallel runs for v2.
os.kill(client_pid, signum) | ||
waiter_run = waiter_handle.join() | ||
self.assert_failure(waiter_run) | ||
|
||
|
||
print("Run finished, stderr data is {}".format(waiter_run.stderr_data)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very exciting! Thanks for all the work on this.
# That said, this means that under pantsd, | ||
# Ctrl-C will simply crash the repl (and the daemon). | ||
# TODO(#7623) Potential more robust (but more invasive) fix. | ||
self._run_repl(pex, **pex_run_kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fixed as of #7623, we can just replace the whole body of this method with:
self._run_repl(pex, **pex_run_kwargs) | |
with ExceptionSink.ignoring_signint(): | |
self._run_repl(pex, **pex_run_kwargs) |
@@ -425,6 +425,7 @@ def _exit_with_failure(cls, terminal_msg): | |||
@classmethod | |||
def _log_unhandled_exception_and_exit(cls, exc_class=None, exc=None, tb=None, add_newline=False): | |||
"""A sys.excepthook implementation which logs the error and exits with failure.""" | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover blank line
from pants.init.util import clean_global_runtime_state | ||
from pants.java.nailgun_io import NailgunStreamStdinReader, NailgunStreamWriter | ||
from pants.java.nailgun_protocol import ChunkType, NailgunProtocol | ||
from pants.pantsd.process_manager import ProcessManager | ||
from pants.java.nailgun_protocol import ChunkType, MaybeShutdownSocket, NailgunProtocol | ||
from pants.util.contextutil import hermetic_environment_as, stdio_as | ||
from pants.util.socket import teardown_socket | ||
|
||
|
||
class DaemonSignalHandler(SignalHandler): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this signal handler ever takes effect, because we don't have a good way to enable it from the main thread (this whole piece of code runs in a child thread, and we can't use custom signal handlers in those for now). I think this entire class can be removed.
class NoopExiter(Exiter): | ||
def exit(self, result, *args, **kwargs): | ||
if result != 0: | ||
# TODO this exception was not designed for this, but it happens to do what we want. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some context on this: We want the invocation of LocalPantsRunner.run()
to raise an exception if it finishes with an error code, because then DaemonPantsRunner
can catch that and process that as necessary (sending the error message to the nailgun client, for example). This processing happens in this set of catch blocks.
_GracefulTerminationException
was designed to be a way to signal "errors that happen when we try to close the pants run before it's done". These exceptions were raised by this call.
In this line, we are changing the semantics of _GracefulTerminationException()
, from "pants hasn't been able to complete the run" to "pants hasn't been able to complete the run, or pants has completed the run and it is a failure (error code >0)".
The correct way to do this would be to either explicitly update the docstring of _GracefulTerminationException
to account for the change in semantics, or to create a new exception type (maybe PantsFinishedWithFailureException
), and add another elif
branch to this block of code
"""An Exiter that emits unhandled tracebacks and exit codes via the Nailgun protocol.""" | ||
"""An Exiter that emits unhandled tracebacks and exit codes via the Nailgun protocol. | ||
|
||
TODO: This no longer really follows the Exiter API, per-se (or at least, it doesn't call super). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this no longer follows the Exiter
API, and (for now) we don't need the DaemonExiter
to act as an exiter, we could maybe inline this code, or turn it into a standalone class and make this not inherit Exiter
(possibly changing the name to DaemonNailgunCloser
or something to avoid confusing this with other *Exiter
classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added todo in #7606
|
||
self._exiter = DaemonExiter(socket) | ||
self.exit_code = exit_code | ||
self._exiter = DaemonExiter(maybe_shutdown_socket) | ||
|
||
def _make_identity(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may not need this anymore. I'm not sure, but it seems related to ProcessManager
. If we need it, I suspect it will be for Nailgun communication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added TODO
options_bootstrapper=options_bootstrapper, | ||
) | ||
global_options = options.for_global_scope() | ||
setup_logging_from_options(global_options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth noting that this change only works because of the encapsulated_logger
in DaemonPantsRunner
, and therefore we don't have to gate logging setup anymore.
src/python/pants/bin/pants_runner.py
Outdated
@@ -57,7 +65,10 @@ def run(self): | |||
for message_regexp in global_bootstrap_options.ignore_pants_warnings: | |||
warnings.filterwarnings(action='ignore', message=message_regexp) | |||
|
|||
if global_bootstrap_options.enable_pantsd: | |||
# TODO For kill-pantsd and clean-all, we cannot run them in the daemon itself, because |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self._services = services | ||
|
||
def handle_sigint(self, signum, _frame): | ||
for service in self._services: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should maybe terminate itself but not watchman (with this function).
Something like:
class PantsDaemonSignalHandler(SignalHandler):
def __init__(self, daemon):
super(PantsDaemonSignalHandler, self).__init__()
self._daemon = daemon
def handle_sigint(self, signum, _frame):
self._daemon.terminate(include_watchman=False)
# Send a useful message to users
|
||
def handle_sigint(self, signum, _frame): | ||
for service in self._services: | ||
service.record_exception(KeyboardInterrupt('remote client sent control-c!')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a test for an api that I forgot to remove, it doesn't actually do anything, this line can be safely removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woo!
@@ -458,6 +459,7 @@ def test_pantsd_invalidation_stale_sources(self): | |||
finally: | |||
rm_rf(test_path) | |||
|
|||
@unittest.skip("Pantsd is not expected to handle parallel runs anymore.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this long-term true? I thought we did expect pantsd to still handle parallel runs of some goals, including list?
If it's long-term true, let's delete the test.
If it's not long-term true, let's add a link to a ticket about fixing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, added TODO
e8fa10d
to
9e85d3e
Compare
This synchronises background threads that read from the nailgun socket with the DaemonPantsRunner which closes it
**Implementation for #8002** Before removing forking from pantsd (#7596 ), we needed to warm up the v2 product graph of the daemon before forking into a pantsd-runner process (relevant code here). Now, however, this warming is not necessary (because we don't fork a process per run of pants anymore), so we can do some cleanup to that code: We can remove the concept of "warming" (here), which is invoked from scheduler_service.prefork (here). This probably means that we can remove the entire warm_product_graph function. We can reword the code around scheduler_service.prefork so that it doesn't mention forking at all. It was called prefork because that's what we did before #7596, but it can be called something different now. We can merge the try:except block in DaemonPantsRunner.create() (here) into the try:except block in DaemonPantsRunner.run() (here). The run() function aims to centralize and correctly handle all the exceptions that happen on a pants run, so the "prefork" logic should be integrated there as well. This will help a lot in correctly diagnosing pantsd problems, as this split and the wrangled exception handling that happen as a result have caused numerous issues.
Problem
see #7390. New PR with a shared branch for collaboration (if only github would allow editing target branches in PRs)
Solution
TODO:
Result