Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

create WorkunitOutput subsystem which redirects workunit output #8765

Conversation

cosmicexplorer
Copy link
Contributor

@cosmicexplorer cosmicexplorer commented Dec 6, 2019

TODO: test cases!!!

Problem

We would like to be able to get some parts of pants's output in different places, for multiple reasons:

  • We might want to see the output/logging from pants when running a ./pants run, but direct the process's output somewhere else (instead of amidst all the pants output).
  • We might want to parse compiler error messages from outside of pants (this is a real use case at Twitter), without having to parse the output of pants itself to know where the compiler errors are.
  • We might want to see helpful stdout/stderr messages from some tool like coursier, which pants currently just hides all output from, unless it fails.

For some background on how this particular interface came to be, see this comment which tangentially introduced the idea of redirecting output: #7071 (comment), although it was buried amidst a larger discussion of output in v2 rules.

Solution

(also see: ./pants help workunit-output)

  • Create the WorkunitOutput subsystem and attach it to the RunTracker.
    • Parse --redirections as a dict mapping file names to "redirection specs", which are just nested dicts. This looks something like:
> ./pants --workunit-output-redirections="\
{'out.txt': [{
    'workunit_scope': '.*',
    'labels': ['TOOL', 'TEST'],
    'output_names': ['stdout', 'stderr'],
  }]}" test tests/python/pants_test/util: -- -vs
  • The above command line would tee all output sent to workunits with path matching regex .* (everything), which declare any of the TOOL or TEST labels, on the outputs named stdout and stderr (exact match). In this precise command line, that would mean that all pytest output would be appended to a file named out.txt.
  • Note that the output file name /dev/stdout is treated specially, and results in the tee subprocess inheriting the pants stdout. This means that workunit output which would normally be hidden (such as coursier) can be selectively toggled on by the user.
  • All of the above is accomplished via introducing OutputTeeingFileBackedRWBuf, which uses a tee subprocess to spread output across multiple files.

Result

Pants can redirect some of its output elsewhere with a command-line option without affecting the rest of the pants run!

accomplish this via introducing OutputTeeingFileBackedRWBuf, which uses a 'tee' subprocess to spread
output across multiple files.
@illicitonion
Copy link
Contributor

I'm not sure we want to be exposing this when the future of workunits in v2 is kind of up in the air, as it's presumably something we'd need to maintain compatibility for, or deprecate, which could constrain v2 direction noticeably... Is there a reason this is currently urgent?

Assuming we do do it, though, the interface of concatenating (and maybe interleaving? I didn't look at synchronisation) all of the outputs together into one file seems weird to me. How would you feel about the "output path" actually being a directory into which we'll put structure (either named files per workunit, or even a directory hierarchy, or something, given they have hierarchy)? If the user then wants to concat them, great, they can.

@ShaneDelmore
Copy link
Contributor

This would be useful to me for multiple scenarios. Will a work-unit support saving to the build-cache and allow me to retrieve it for cached artifacts? For example, if a build completed with 3 compiler warnings, and later I rebuilt, it would be useful to still be able to retrieve the warnings from the cached artifact.

@illicitonion
Copy link
Contributor

illicitonion commented Dec 6, 2019

This would be useful to me for multiple scenarios. Will a work-unit support saving to the build-cache and allow me to retrieve it for cached artifacts? For example, if a build completed with 3 compiler warnings, and later I rebuilt, it would be useful to still be able to retrieve the warnings from the cached artifact.

That would not come from this change. We'd need to include stdout and stderr in those cache bundles (which personally I think we should do), but that would be orthogonal. This would only output this information onto a file on your local disk, only if you didn't hit a cache.

@ShaneDelmore
Copy link
Contributor

Got it, I think I slightly misunderstood the change then. Although it wouldn't have to be std-in for my purposes as long as I could declare something as a cacheable artifact. For example, the diff-changed goal outputs a git patch to stdout, but I only did that as I am scraping stdout to retrieve it. I would love to be able to write it to a file instead, that I could easily retrieve later from the build cache.

@illicitonion
Copy link
Contributor

Got it, I think I slightly misunderstood the change then. Although it wouldn't have to be std-in for my purposes as long as I could declare something as a cacheable artifact. For example, the diff-changed goal outputs a git patch to stdout, but I only did that as I am scraping stdout to retrieve it. I would love to be able to write it to a file instead, that I could easily retrieve later from the build cache.

That, you actually can do, as long as you're willing to modify the pants task. Anything you write to the vt.results_dir will be included in the cache bundle, e.g. as is done here:

with self.invalidated(binaries, invalidate_dependents=True) as invalidation_check:
python_deployable_archive = self.context.products.get('deployable_archives')
python_pex_product = self.context.products.get('pex_archives')
for vt in invalidation_check.all_vts:
pex_path = os.path.join(vt.results_dir, f'{vt.target.name}.pex')
if not vt.valid:
self.context.log.debug(f'cache for {vt.target} is invalid, rebuilding')
self._create_binary(vt.target, vt.results_dir)
else:
self.context.log.debug(f'using cache for {vt.target}')

@cosmicexplorer
Copy link
Contributor Author

How would you feel about the "output path" actually being a directory into which we'll put structure (either named files per workunit, or even a directory hierarchy, or something, given they have hierarchy)?

Especially after talking with @ShaneDelmore further about what he might want from this kind of option, I'm thinking of a two-fold plan to unify these:

  1. (in this PR) Create cache entries for workunit output (so throwing away most of the current impl), and figure out some hacky way to print out the on-disk location of workunit outputs to a user (this is my interpretation of your suggestion -- let me know if I'm off).
  2. (in a followup PR) make the way to print out the on-disk location of cache entries less hacky.

For (2), @ShaneDelmore suggested something like:

> ./pants fetch lspIndexes dataproducts/project/::

Translating that into some idea of pants options could be:

> ./pants fetch --task=compile.lsp-indexes dataproducts/project/::

which would be interpreted as "fetch the cached results_dir from a remote cache if necessary, then print out the local paths to the output of the task named lsp-indexes in the compile goal".

I'm going to spend 5 minutes making this into a google doc.

@cosmicexplorer
Copy link
Contributor Author

@illicitonion
Copy link
Contributor

How would you feel about the "output path" actually being a directory into which we'll put structure (either named files per workunit, or even a directory hierarchy, or something, given they have hierarchy)?

Especially after talking with @ShaneDelmore further about what he might want from this kind of option, I'm thinking of a two-fold plan to unify these:

  1. (in this PR) Create cache entries for workunit output (so throwing away most of the current impl), and figure out some hacky way to print out the on-disk location of workunit outputs to a user (this is my interpretation of your suggestion -- let me know if I'm off).
  2. (in a followup PR) make the way to print out the on-disk location of cache entries less hacky.

Honestly, I'd probably rather just port scalafix to v2, which gets stdout and stderr caching and replay for free, rather than building an entire new caching mechanism for stdout and stderr specifically for v1, given this is already a solved problem in v2.

For (2), @ShaneDelmore suggested something like:

> ./pants fetch lspIndexes dataproducts/project/::

Translating that into some idea of pants options could be:

> ./pants fetch --task=compile.lsp-indexes dataproducts/project/::

which would be interpreted as "fetch the cached results_dir from a remote cache if necessary, then print out the local paths to the output of the task named lsp-indexes in the compile goal".

I'm going to spend 5 minutes making this into a google doc.

Commented on the doc.

@Eric-Arellano
Copy link
Contributor

Closing due to being stale. Workunits have seen some substantial changes the past few months, from what I can tell.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants