Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pantsd output is inconsistently truncated when piped #7465

Closed
Eric-Arellano opened this issue Mar 30, 2019 · 17 comments
Closed

Pantsd output is inconsistently truncated when piped #7465

Eric-Arellano opened this issue Mar 30, 2019 · 17 comments

Comments

@Eric-Arellano
Copy link
Contributor

Eric-Arellano commented Mar 30, 2019

Multiple runs produce inconsistent line counts:

$ PANTS_ENABLE_PANTSD=true ./pants --no-v1 --v2 list :: | wc -l
    1158

$ PANTS_ENABLE_PANTSD=true ./pants --no-v1 --v2 list :: | wc -l
    1298

$ PANTS_ENABLE_PANTSD=true ./pants --no-v1 --v2 list :: | wc -l
     549

This happens with Python 2, although less severely:

$ PANTS_ENABLE_PANTSD=true ./pants2 --no-v1 --v2 list :: | wc -l
    1675

$ PANTS_ENABLE_PANTSD=true ./pants2 --no-v1 --v2 list :: | wc -l
    1605

$ PANTS_ENABLE_PANTSD=true ./pants2 --no-v1 --v2 list :: | wc -l
    1605

$ PANTS_ENABLE_PANTSD=true ./pants2 --no-v1 --v2 list :: | wc -l
    1675

Note, however that removing the pipe, i.e. PANTS_ENABLE_PANTSD=true ./pants2 --no-v1 --v2 list ::, leads to the entire output being printed correctly.

--

This only happens on macOS, not Ubuntu.

@Eric-Arellano Eric-Arellano changed the title Pantsd output is inconsistently truncated when piped Pantsd output is inconsistently truncated when piped on macOS Mar 30, 2019
@stuhood stuhood changed the title Pantsd output is inconsistently truncated when piped on macOS Pantsd output is inconsistently truncated when piped Apr 30, 2019
@stuhood
Copy link
Member

stuhood commented Apr 30, 2019

I've edited the title based on the very high likelihood of a relation to #6687.

@blorente
Copy link
Contributor

blorente commented May 8, 2019

After a brief attempt to retro, this is what I get:

$ git show --quiet head
commit 3742edb4040967899c404b43570caa721bb3bd95
Author: Eric Arellano <[email protected]>
Date:   Tue May 7 22:07:20 2019 -0700

$ ./pants kill-pantsd
...
               SUCCESS

$ ./pants --enable-pantsd --v2 --no-v1 list :: | wc -l
    1695
$ ./pants --enable-pantsd --v2 --no-v1 list :: | wc -l
    1695
$ ./pants --enable-pantsd --v2 --no-v1 list :: | wc -l
    1695

I will keep investigating, my next step is to see whether test_v2_list is still flaky. My blind guess is that this got fixed when we changed a bunch of how the Nailgun stdio worked as part of #7596

@blorente
Copy link
Contributor

blorente commented May 8, 2019

./pants --owner-of=tests/python/pants_test/engine/legacy/test_console_rule_integration.py clean-all test -- -k test_v2_list passes locally several times for me, might need to close both issues.

@stuhood
Copy link
Member

stuhood commented May 8, 2019

Yay!

@Eric-Arellano
Copy link
Contributor Author

PANTS_ENABLE_PANTSD=true ./pants --no-v1 --v2 list :: | wc -l still is inconsistent for me on macOS :/ consistent with my original issue.

@blorente
Copy link
Contributor

blorente commented May 10, 2019

I haven’t been able to repro this on sha 5eed2e7be5aa4672485f06e043bcec24bb9c668d. I have tried two laptops:

  • 10.14.3, MacBook Pro (13-inch, 2017, Four Thunderbolt 3 Ports), 4 cores.
  • 10.14.4, an older MacBook Pro, 2 cores.

Could you tell me what sha you’re running it from, and what macOS version you have?

@Eric-Arellano
Copy link
Contributor Author

This is on master 5eed2e7. macOS 10.14.4 (Mojave), 2018 MacBook Pro.

pants at master · PANTS_ENABLE_PANTSD=true ./pants --no-v1 --v2 list :: | wc -l
    1315
〉pants at master · PANTS_ENABLE_PANTSD=true ./pants --no-v1 --v2 list :: | wc -l
    1315
〉pants at master · PANTS_ENABLE_PANTSD=true ./pants --no-v1 --v2 list :: | wc -l
    1008
〉pants at master · PANTS_ENABLE_PANTSD=true ./pants --no-v1 --v2 list :: | wc -l
    1315
〉pants at master · PANTS_ENABLE_PANTSD=true ./pants --no-v1 --v2 list :: | wc -l
    1008
〉pants at master · PANTS_ENABLE_PANTSD=true ./pants --no-v1 --v2 list :: | wc -l
    1154

With Py2:

pants at master · PANTS_ENABLE_PANTSD=true ./pants2 --no-v1 --v2 list :: | wc -l
    1698
〉pants at master · PANTS_ENABLE_PANTSD=true ./pants2 --no-v1 --v2 list :: | wc -l
    1698
〉pants at master · PANTS_ENABLE_PANTSD=true ./pants2 --no-v1 --v2 list :: | wc -l
    1698
〉pants at master · PANTS_ENABLE_PANTSD=true ./pants2 --no-v1 --v2 list :: | wc -l
    1698
〉pants at master · PANTS_ENABLE_PANTSD=true ./pants2 --no-v1 --v2 list :: | wc -l
    1698
〉pants at master · PANTS_ENABLE_PANTSD=true ./pants2 --no-v1 --v2 list :: | wc -l
    1698
〉pants at master · PANTS_ENABLE_PANTSD=true ./pants2 --no-v1 --v2 list :: | wc -l
    1698
〉pants at master · PANTS_ENABLE_PANTSD=true ./pants2 --no-v1 --v2 list :: | wc -l
    1698

So now is the inconsistency related to Py3..? Notably Py2 used to be inconsistent for me too, whereas now it appears deterministic.

@blorente
Copy link
Contributor

Might be related to the number of cores then? We are running in the same OS and the same commit, the only difference is the computer itself... I don't think I'll have access to a more powerful OSX machine anytime soon, but can investigate vms in my personal Linux Mint (~Ubuntu) laptop.

@Eric-Arellano
Copy link
Contributor Author

Might be related to the number of cores then?

Potentially.

pants at master · sysctl -n hw.ncpu
12

(Let me know if you'd like me to run any other commands)

I do think it's significant that there's a difference between Py2 and Py3. I'm really curious what the cause is, and suspect it may have to do with sys.stdout buffering changing between the two versions.

@blorente
Copy link
Contributor

I do think it's significant that there's a difference between Py2 and Py3. I'm really curious what the cause is, and suspect it may have to do with sys.stdout buffering changing between the two versions.

This is probably the case, but it's really hard to verify without access to a repro.

Could you run pantsd in debug mode and upload the resulting logs?

$ ./pants -ldebug --enable-pantsd list :: | wc -l
... You'll see some spurious output.
  <the result of wc -l>
$ cat .pants.d/pantsd/pantsd.log | pbcopy

@Eric-Arellano
Copy link
Contributor Author

Could you run pantsd in debug mode and upload the resulting logs?

https://gist.github.com/Eric-Arellano/f5b4c03af61784083c0ea443d1b6b056

@blorente
Copy link
Contributor

Thanks! I can't see anything obvious, but could you try this patch? Will only work on python3:

diff --git a/src/python/pants/engine/goal.py b/src/python/pants/engine/goal.py
index 7ad6735..21ce3e1 100644
--- a/src/python/pants/engine/goal.py
+++ b/src/python/pants/engine/goal.py
@@ -139,9 +139,12 @@ class LineOriented(object):
     if output_file:
       stdout = open(output_file, 'w')
 
     try:
-      print_stdout = lambda msg: print(msg, file=stdout, end=sep)
-      print_stderr = lambda msg: print(msg, file=stderr)
+      print_stdout = lambda msg: print(msg, file=stdout, end=sep, flush=True)
+      print_stderr = lambda msg: print(msg, file=stderr, flush=True)
       yield print_stdout, print_stderr
     finally:
       if output_file:

I'm sorry for the dance, if this fixes it then it was probably the buffering. If not, I will try to increase the buffer size to see if it repros on my machine.

@Eric-Arellano
Copy link
Contributor Author

Yes, that diff worked!

〉pants at master · ./pants --enable-pantsd list :: | wc -l
    1698
〉pants at master ✗ · ./pants --enable-pantsd list :: | wc -l
    1698
〉pants at master ✗ · ./pants --enable-pantsd list :: | wc -l
    1698
〉pants at master ✗ · ./pants --enable-pantsd list :: | wc -l
    1698

Yay!

However, when piping there's now consistently really bad performance:

time ./pants --enable-pantsd list :: | wc -l
    1698
./pants --enable-pantsd list ::  0.82s user 0.26s system 10% cpu 10.307 total
wc -l  0.00s user 0.00s system 0% cpu 10.306 total

This is only when piping. When outputting normally, it takes about 1.5 seconds. Also, ./pants2 was always taking about 8-9 seconds so I think this performance issue is not a Python 3 issue, but rather an issue with "when Pantsd is enabled and the output is piped, performance slows by 500%". I can't imagine we pipe much so probably not worth spending time, but wanted to at least flag.

@patliu85
Copy link
Contributor

patliu85 commented May 13, 2019

Seems like it takes longer for NailgunStreamWriter.open_multi(...) to finish writing (flushing out) the output in following section

      with writer.running():
        yield write_fds, writer

so when following section (in DaemonPantsRunner._pipe_stdio(...) exits, some outputs have not been received by NailgunStreamStdinReader reader yet.

    with maybe_handle_stdin(handle_stdin) as stdin_fd,\
      NailgunStreamWriter.open_multi(maybe_shutdown_socket.socket, types, ttys) as ((stdout_fd, stderr_fd), writer),\
      stdio_as(stdout_fd=stdout_fd, stderr_fd=stderr_fd, stdin_fd=stdin_fd):
      # N.B. This will be passed to and called by the `DaemonExiter` prior to sending an
      # exit chunk, to avoid any socket shutdown vs write races.
      stdout, stderr = sys.stdout, sys.stderr
      def finalizer():
        try:
          stdout.flush()
          stderr.flush()
        finally:
          time.sleep(.001)  # HACK: Sleep 1ms in the main thread to free the GIL.
          writer.stop()
          writer.join()
          stdout.close()
          stderr.close()
      yield finalizer

During my testing, if I provide an extra 1 second (sleep(1)) either at the end of NailgunStreamWriter.open_multi(...) or at the end of with maybe_handle_stdin(handle_stdin), it generates the output correctly. Probably latency happens within NailgunStreamWriter.open_multi(...).

@patliu85
Copy link
Contributor

patliu85 commented May 13, 2019

The default buffer size is 8192. Instead of using the default one, if we pass in buf_size=40000 (8192 * 5), that would work too.

diff --git a/src/python/pants/bin/daemon_pants_runner.py b/src/python/pants/bin/daemon_pants_runner.py

     with maybe_handle_stdin(handle_stdin) as stdin_fd,\
-      NailgunStreamWriter.open_multi(maybe_shutdown_socket.socket, types, ttys) as ((stdout_fd, stderr_fd), writer),\
+      NailgunStreamWriter.open_multi(maybe_shutdown_socket.socket, types, ttys, buf_size=40000) as ((stdout_fd, stderr_fd), writer),\
       stdio_as(stdout_fd=stdout_fd, stderr_fd=stderr_fd, stdin_fd=stdin_fd):

@stuhood
Copy link
Member

stuhood commented May 18, 2019

I think that we might have seen a repro of this on Linux as well (which isn't too surprising, since it seems like a timing issue).

stuhood pushed a commit that referenced this issue May 21, 2019
### Problem

Dealing with raw file descriptors is confusing at best and error-prone at worst.

### Solution

Define a `Pipe` abstraction that owns (read: "managest the lifetimes of") the write and read ends of a pipe, as well as provide convenience methods to query the state of the fds.

### Result

It's now a bit more obvious what pipes are open.

Part one of the fix for #7465
illicitonion pushed a commit that referenced this issue May 22, 2019
## Problem

As described in #7465, when we pipe the output of a pants run with pantsd into anything, it may truncate output.
The root cause:
- `NailgunStreamWriter` runs in another thread, and is periodically reading from `std{out,err}` and writing it in the Nailgun socket. Among other uses, we create one of these whenever we need to redirect stdout/err from pants into the client.
- In the current implementation of pantsd, this happens in DaemonPantsRunner, whenever we call to `nailgunned_stdio` and either stdout, err or in are not TTYs (there's more nuance, but it's not relevant). We create pipes and give the writing end to the `DaemonPantsRunner`, while the `NailgunStreamWriter` listens to the reading end for output. The `DaemonPantsRunner` then redirects `sys.stdout/err` (and some other stuff) to the reading ends of those pipes, through `stdio_as`. This happens in two places:
  - When we warm up the v2 product graph, where things like v2 list are computed.
  - When we actually execute the pants run under pantsd.
- When the daemon is finished with `nailgunned_stdio`, it stops the `NailgunStreamWriter` thread that is transmitting output to the client through the socket, and then it joins  until it's done.
**- The issue:** When the daemon `stop()`s the thread, this stops reading immediately, even if it hasn't finished transmitting every chunk to the client. Therefore, there is a race condition between `NailgunStreamWriter` trying to send all the chunk through the socket and `DaemonPantsRunner` trying to close it. This can be checked by inserting a `time.sleep(0.1)` [here](https://github.com/pantsbuild/pants/blob/3c17cd2b54b468590ddf8f486f37652ed0db6afc/src/python/pants/java/nailgun_io.py#L201), which will make the issue repro consistently.

## Solution

Depends on #7740, and includes commits from that.

**The only commit worth reviewing is b9d0c95**

The `DaemonPantsRunner` must no longer `stop()` the thread, but rather mark the pipes that are is stdout/err as "I'm not going to write to this anymore, please close yourself when you're done reading from this".

We create the class `PipedNailgunStreamWriter`, wich is aware that it's reading end belongs to a pipe. Whenever we don't have anything to read, we check if the write end of the pipe is closed. If so, we stop trying to read from that pipe. When no pipes are left, we close ourselves.

## Result

Hopefully, `./pants --enable-pantsd list :: | wc -l` should give consistent output, and #7465 should be resolved.
illicitonion pushed a commit that referenced this issue May 22, 2019
### Problem

Dealing with raw file descriptors is confusing at best and error-prone at worst.

### Solution

Define a `Pipe` abstraction that owns (read: "managest the lifetimes of") the write and read ends of a pipe, as well as provide convenience methods to query the state of the fds.

### Result

It's now a bit more obvious what pipes are open.

Part one of the fix for #7465
illicitonion pushed a commit that referenced this issue May 22, 2019
## Problem

As described in #7465, when we pipe the output of a pants run with pantsd into anything, it may truncate output.
The root cause:
- `NailgunStreamWriter` runs in another thread, and is periodically reading from `std{out,err}` and writing it in the Nailgun socket. Among other uses, we create one of these whenever we need to redirect stdout/err from pants into the client.
- In the current implementation of pantsd, this happens in DaemonPantsRunner, whenever we call to `nailgunned_stdio` and either stdout, err or in are not TTYs (there's more nuance, but it's not relevant). We create pipes and give the writing end to the `DaemonPantsRunner`, while the `NailgunStreamWriter` listens to the reading end for output. The `DaemonPantsRunner` then redirects `sys.stdout/err` (and some other stuff) to the reading ends of those pipes, through `stdio_as`. This happens in two places:
  - When we warm up the v2 product graph, where things like v2 list are computed.
  - When we actually execute the pants run under pantsd.
- When the daemon is finished with `nailgunned_stdio`, it stops the `NailgunStreamWriter` thread that is transmitting output to the client through the socket, and then it joins  until it's done.
**- The issue:** When the daemon `stop()`s the thread, this stops reading immediately, even if it hasn't finished transmitting every chunk to the client. Therefore, there is a race condition between `NailgunStreamWriter` trying to send all the chunk through the socket and `DaemonPantsRunner` trying to close it. This can be checked by inserting a `time.sleep(0.1)` [here](https://github.com/pantsbuild/pants/blob/3c17cd2b54b468590ddf8f486f37652ed0db6afc/src/python/pants/java/nailgun_io.py#L201), which will make the issue repro consistently.

## Solution

Depends on #7740, and includes commits from that.

**The only commit worth reviewing is b9d0c95**

The `DaemonPantsRunner` must no longer `stop()` the thread, but rather mark the pipes that are is stdout/err as "I'm not going to write to this anymore, please close yourself when you're done reading from this".

We create the class `PipedNailgunStreamWriter`, wich is aware that it's reading end belongs to a pipe. Whenever we don't have anything to read, we check if the write end of the pipe is closed. If so, we stop trying to read from that pipe. When no pipes are left, we close ourselves.

## Result

Hopefully, `./pants --enable-pantsd list :: | wc -l` should give consistent output, and #7465 should be resolved.
@Eric-Arellano
Copy link
Contributor Author

Fixed by #7782. Thank you @blorente and @patliu85 for the work on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants