-
-
Notifications
You must be signed in to change notification settings - Fork 646
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix find_binary
rule to not use the cache
#10751
Fix find_binary
rule to not use the cache
#10751
Conversation
[ci skip-rust] [ci skip-build-wheels]
@@ -4,6 +4,7 @@ | |||
import random | |||
import uuid | |||
from dataclasses import dataclass, field | |||
from uuid import UUID as UUID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it's convenient to have centralized import statements. This is exporting it.
We do this type of centralized import in a couple places, like rules.py
exporting Get
and MultiGet
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but those are our own internal types, so we can choose where to make them publicly available. This is a python stdlib type, and it's more confusing than convenient to export it as-if it's an internal type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unclear to me that this is really the behavior we want.
@@ -42,7 +41,7 @@ | |||
from pants.core.util_rules.source_files import SourceFiles, SourceFilesRequest | |||
from pants.engine.addresses import Addresses | |||
from pants.engine.fs import AddPrefix, Digest, DigestSubset, MergeDigests, PathGlobs, Snapshot | |||
from pants.engine.internals.uuid import UUIDRequest | |||
from pants.engine.internals.uuid import UUID, UUIDRequest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just seems confusing and adds cognitive load: "What is this seemingly-custom UUID class? Oh it's actually just a superfluous alias for the well-known Python stdlib class."
|
||
# We get a UUID so that we ignore the cache every time, as this script depends on the state of | ||
# the external environment. | ||
script_digest, uuid = await MultiGet( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC this will cause every rule that depends on this product to rerun every time. Is that really what we want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if the end result of the rule is the same, those downstream rules won't be invalidated. It only means that this Param will always be re-evaluated. This is the same with how the new TestEnvironment type works.
@gshuflin can you please confirm this is correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that the uncacheable UUID Get here will only cause this rule to have to rerun, rather than invalidating every downstream rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but every upstream rule will have to rerun. Is that what we want? What uses find_binary
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but every upstream rule will have to rerun. Is that what we want? What uses find_binary?
If your Python interpreters changed? Yes, absolutely, I think we should rerun every dependent rule.
For example, if you removed Python 3.7 from your machine and installed Python 3.8, that should invalidate your Python processes. It's not guaranteed that your tests would still pass.
Otherwise, the only way to get Pants to re-search for your interpreters is to purge lmdb_store
, or modify --python-setup-interpreter-search-paths
. There's no other way to get Pants to re-evaluate where your interpreters are.
--
Atm, find_binary
is only used for finding Python. In the example plugin repo, it's used to find zip
and bash
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it will rerun every single time
This specific rule will rerun every time, but the downstream rules will only rerun if the output of this specific rule changes.
This is the same with Greg's new --test-extra-env
feature he added. That re-evaluated os.environ
every single Pants run, yet we are still able to get caching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if a rule await Get
s on this rule (or something that depends on it)? The entire body of that rule has to rerun every time as well, no? Since this rule might be run conditionally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A not-so-side-note: Fixing things for the Pants client side is great, but we never have this problem for remoting since we effectively have a hash of the remote executuion image as a key in Procss executions via remote_execution_extra_platform_properties - "container-image=..."
. So if the fix affects both sides its good to keep in mind we could probably strive to do better in the future and not impact remoting somehow with an evolution of the current client side fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This specific rule will rerun every time, but the downstream rules will only rerun if the output of this specific rule changes.
This is correct. Uncacheable nodes are not completely free, but they are cheaper than actually re-running the logic above the uncacheable node.
An uncacheable node runs once per "session" (typically one pants run), and everything that depends on an uncacheable node is effectively always marked "dirty". Dirty nodes do not re-run unless their dependencies (recursively, all the way down to the uncacheable node) have changed output values, represented using an integer generation value. That rust-only recursion on integer graph ids and integer generation values (called "cleaning") is much cheaper than actually running the nodes.
BUT: two things.
- It's possible that this patch will experience the kinds of issues that @gshuflin ran into on Failed processes memoized under pantsd #10129 (...although we've been using uncacheable nodes for
test --force
, so perhaps not) - This might be better modeled (eventually?) as a native operation of the
CommandRunner
s as I suggested originally on Harden PATH setting in pex runs #9760: see the followup ticket I created after Harden PATH setting in pex runs #9760 landed: Make PATH scanning/filtering a native operation #10526. As @jsirois mentioned above, the behavior in the remote and local cases is potentially different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@Eric-Arellano here's an alternative PR - it looks like I may need to use this mechanism a third time to straighten out --use-first...: #10768. |
Is there an issue detailing this problem? I couldn't find one. |
No, I only realized the issue this week when I was iterating on MyPy handling of interpreter constraints and found that using |
That should be fixed by #10770. |
Superseded by #10770 and $10768. Thanks John! |
Because this rule depends entirely on the state of the external environment, it's more correct to not cache it. For example, if a user uninstalls a program or installs a new one, Pants needs to detect that automatically.
This does not fix every issue. The original motivation for this PR is Pyenv, where we want
pyenv global 2.7.15
to cause invalidation for most Pex processes. Unfortunately, this does not fix it because Pyenv uses a single folder.shims/
, so the output of this rule is simplypath/to/.shims
and things stay the same. But, this PR still improves the situation.[ci skip-rust]
[ci skip-build-wheels]