Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pantsd without forking #7596

Merged
merged 19 commits into from
May 3, 2019
Merged

Pantsd without forking #7596

merged 19 commits into from
May 3, 2019

Conversation

ity
Copy link
Contributor

@ity ity commented Apr 19, 2019

Problem

see #7390. New PR with a shared branch for collaboration (if only github would allow editing target branches in PRs)

Solution

TODO:

  • REPLs can't install signal handlers.
  • Environment variables are inherited for every pants run under pantsd.

Result

@ity ity added this to the 1.16.x milestone Apr 19, 2019
@blorente blorente force-pushed the pantsd-sans-forking branch 2 times, most recently from 7cfc6e8 to b8d598d Compare April 25, 2019 09:20
@blorente blorente force-pushed the pantsd-sans-forking branch from d1642af to 3513d9f Compare April 26, 2019 14:20
blorente added a commit that referenced this pull request Apr 30, 2019
### Problem
In the context of #7596, pants tasks will be executed outside of the main thread. This means that tasks like python-repl cannot override signal handling (code), because this can only be done from the main thread.

### Solution
Create an atomic variable, SignalHandler._ignore_sigint, that gates SIGINT handling. Create a global context manager inside ExceptionSink that toggles this for a certain time if needed.

### Result
Any thread can pause the handling of SIGINT in a controlled way.
python-repl no longer installs signal handlers, but rather it toggles the handling of the signal.

### Caveat
We may want to gate all signals individually, but the semantics of SIGINT are usually different enough from SIGTERM and SIGKILL that I think it's okay to only gate the first one.
@blorente
Copy link
Contributor

Heads up, #7623 just merged and it unblocks a better fix for python_repl.py, namely using ExceptionSink.ignoring_sigint().

@ity ity force-pushed the pantsd-sans-forking branch from 3513d9f to 49ae0dd Compare April 30, 2019 20:58
Copy link
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good, although once things are green the travis edits need to be nuked.

class NoopExiter(Exiter):
def exit(self, result, *args, **kwargs):
if result != 0:
# TODO this exception was not designed for this, but it happens to do what we want.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expanding this comment would be good: what is it that we want from the NoopExiter? As with many things, I expect that we can clean it up post merge, but would be good to include a bit more context in case ... that takes a while.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some context on this: We want the invocation of LocalPantsRunner.run() to raise an exception if it finishes with an error code, because then DaemonPantsRunner can catch that and process that as necessary (sending the error message to the nailgun client, for example). This processing happens in this set of catch blocks.

_GracefulTerminationException was designed to be a way to signal "errors that happen when we try to close the pants run before it's done". These exceptions were raised by this call.

In this line, we are changing the semantics of _GracefulTerminationException(), from "pants hasn't been able to complete the run" to "pants hasn't been able to complete the run, or pants has completed the run and it is a failure (error code >0)".

The correct way to do this would be to either explicitly update the docstring of _GracefulTerminationException to account for the change in semantics, or to create a new exception type (maybe PantsFinishedWithFailureException), and add another elif branch to this block of code


def __init__(self, socket):
def __init__(self, maybe_shutdown_socket):
# N.B. Assuming a fork()'d child, cause os._exit to be called here to avoid the routine
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part isn't accurate anymore, so would be good to clarify what this class does now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added todo in #7606

graph_helper = None
target_roots = None
options_bootstrapper = None
# N.B. This will be overridden with the correct value if options
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should still be accurate I think? If it's not, then creating the subprocess_dir should no longer be necessary, and we can leave a TODO about removing it.

yield fd
else:
with open('/dev/null', 'rb') as fh:
yield fh.fileno()

# The NailgunStreamWriter probably shouldn't grab the socket without a lock...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should convert this to a TODO that references a ticket, probably.

self._exiter.exit(e.exit_code)
except Exception:
except Exception as e:
# LocalPantsRunner.set_start_time resets the global exiter,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still true? My understanding was that context managers were added to reset the exiters.

finally:
os.close(read_fd)

# TODO: The error is that when the client finishes (and presumably closes the socket),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to refer to a ticket, or expand/remove this.

@@ -402,7 +403,7 @@ def test_pantsd_pid_change(self):
def test_pantsd_memory_usage(self):
"""Validates that after N runs, memory usage has increased by no more than X percent."""
number_of_runs = 10
max_memory_increase_fraction = 0.35
max_memory_increase_fraction = 0.60 # TODO this should be closer to 0.35
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can refer to Patrick's ticket for this.

@@ -477,6 +479,7 @@ def test_pantsd_parse_exception_success(self):
finally:
rm_rf(test_path)

@unittest.skip("Pantsd is not expected to handle parallel runs anymore.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these will be skipped rather than deleted, we should probably refer to a ticket about re-enabling parallel runs for v2.

os.kill(client_pid, signum)
waiter_run = waiter_handle.join()
self.assert_failure(waiter_run)


print("Run finished, stderr data is {}".format(waiter_run.stderr_data))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xx

Copy link
Contributor

@blorente blorente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very exciting! Thanks for all the work on this.

# That said, this means that under pantsd,
# Ctrl-C will simply crash the repl (and the daemon).
# TODO(#7623) Potential more robust (but more invasive) fix.
self._run_repl(pex, **pex_run_kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fixed as of #7623, we can just replace the whole body of this method with:

Suggested change
self._run_repl(pex, **pex_run_kwargs)
with ExceptionSink.ignoring_signint():
self._run_repl(pex, **pex_run_kwargs)

@@ -425,6 +425,7 @@ def _exit_with_failure(cls, terminal_msg):
@classmethod
def _log_unhandled_exception_and_exit(cls, exc_class=None, exc=None, tb=None, add_newline=False):
"""A sys.excepthook implementation which logs the error and exits with failure."""

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover blank line

from pants.init.util import clean_global_runtime_state
from pants.java.nailgun_io import NailgunStreamStdinReader, NailgunStreamWriter
from pants.java.nailgun_protocol import ChunkType, NailgunProtocol
from pants.pantsd.process_manager import ProcessManager
from pants.java.nailgun_protocol import ChunkType, MaybeShutdownSocket, NailgunProtocol
from pants.util.contextutil import hermetic_environment_as, stdio_as
from pants.util.socket import teardown_socket


class DaemonSignalHandler(SignalHandler):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this signal handler ever takes effect, because we don't have a good way to enable it from the main thread (this whole piece of code runs in a child thread, and we can't use custom signal handlers in those for now). I think this entire class can be removed.

class NoopExiter(Exiter):
def exit(self, result, *args, **kwargs):
if result != 0:
# TODO this exception was not designed for this, but it happens to do what we want.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some context on this: We want the invocation of LocalPantsRunner.run() to raise an exception if it finishes with an error code, because then DaemonPantsRunner can catch that and process that as necessary (sending the error message to the nailgun client, for example). This processing happens in this set of catch blocks.

_GracefulTerminationException was designed to be a way to signal "errors that happen when we try to close the pants run before it's done". These exceptions were raised by this call.

In this line, we are changing the semantics of _GracefulTerminationException(), from "pants hasn't been able to complete the run" to "pants hasn't been able to complete the run, or pants has completed the run and it is a failure (error code >0)".

The correct way to do this would be to either explicitly update the docstring of _GracefulTerminationException to account for the change in semantics, or to create a new exception type (maybe PantsFinishedWithFailureException), and add another elif branch to this block of code

"""An Exiter that emits unhandled tracebacks and exit codes via the Nailgun protocol."""
"""An Exiter that emits unhandled tracebacks and exit codes via the Nailgun protocol.

TODO: This no longer really follows the Exiter API, per-se (or at least, it doesn't call super).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this no longer follows the Exiter API, and (for now) we don't need the DaemonExiter to act as an exiter, we could maybe inline this code, or turn it into a standalone class and make this not inherit Exiter (possibly changing the name to DaemonNailgunCloser or something to avoid confusing this with other *Exiter classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added todo in #7606


self._exiter = DaemonExiter(socket)
self.exit_code = exit_code
self._exiter = DaemonExiter(maybe_shutdown_socket)

def _make_identity(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may not need this anymore. I'm not sure, but it seems related to ProcessManager. If we need it, I suspect it will be for Nailgun communication.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added TODO

options_bootstrapper=options_bootstrapper,
)
global_options = options.for_global_scope()
setup_logging_from_options(global_options)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth noting that this change only works because of the encapsulated_logger in DaemonPantsRunner, and therefore we don't have to gate logging setup anymore.

@@ -57,7 +65,10 @@ def run(self):
for message_regexp in global_bootstrap_options.ignore_pants_warnings:
warnings.filterwarnings(action='ignore', message=message_regexp)

if global_bootstrap_options.enable_pantsd:
# TODO For kill-pantsd and clean-all, we cannot run them in the daemon itself, because
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are #7205 (issue) and #7206 (PR), that originally proposed this solution. So this PR should either close that, or leave a TODO linking to that.

self._services = services

def handle_sigint(self, signum, _frame):
for service in self._services:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should maybe terminate itself but not watchman (with this function).

Something like:

class PantsDaemonSignalHandler(SignalHandler):
   def __init__(self, daemon):
    super(PantsDaemonSignalHandler, self).__init__()
    self._daemon = daemon

   def handle_sigint(self, signum, _frame):
    self._daemon.terminate(include_watchman=False)
    # Send a useful message to users


def handle_sigint(self, signum, _frame):
for service in self._services:
service.record_exception(KeyboardInterrupt('remote client sent control-c!'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a test for an api that I forgot to remove, it doesn't actually do anything, this line can be safely removed.

Copy link
Contributor

@illicitonion illicitonion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woo!

@@ -458,6 +459,7 @@ def test_pantsd_invalidation_stale_sources(self):
finally:
rm_rf(test_path)

@unittest.skip("Pantsd is not expected to handle parallel runs anymore.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this long-term true? I thought we did expect pantsd to still handle parallel runs of some goals, including list?

If it's long-term true, let's delete the test.
If it's not long-term true, let's add a link to a ticket about fixing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, added TODO

@ity ity force-pushed the pantsd-sans-forking branch from abe3e0c to 2a79fc9 Compare May 3, 2019 03:27
@blorente blorente force-pushed the pantsd-sans-forking branch from e8fa10d to 9e85d3e Compare May 3, 2019 09:24
@ity ity force-pushed the pantsd-sans-forking branch from 9e85d3e to b6ef5b3 Compare May 3, 2019 14:54
@ity ity force-pushed the pantsd-sans-forking branch from b6ef5b3 to 9215622 Compare May 3, 2019 15:32
@ity ity merged commit 5ee0e81 into pantsbuild:master May 3, 2019
blorente pushed a commit that referenced this pull request Jul 24, 2019
**Implementation for #8002**
Before removing forking from pantsd (#7596 ), we needed to warm up the v2 product graph of the daemon before forking into a pantsd-runner process (relevant code here).

Now, however, this warming is not necessary (because we don't fork a process per run of pants anymore), so we can do some cleanup to that code:

We can remove the concept of "warming" (here), which is invoked from scheduler_service.prefork (here). This probably means that we can remove the entire warm_product_graph function.
We can reword the code around scheduler_service.prefork so that it doesn't mention forking at all. It was called prefork because that's what we did before #7596, but it can be called something different now.
We can merge the try:except block in DaemonPantsRunner.create() (here) into the try:except block in DaemonPantsRunner.run() (here). The run() function aims to centralize and correctly handle all the exceptions that happen on a pants run, so the "prefork" logic should be integrated there as well.
This will help a lot in correctly diagnosing pantsd problems, as this split and the wrangled exception handling that happen as a result have caused numerous issues.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants