-
-
Notifications
You must be signed in to change notification settings - Fork 652
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support using @dataclass for engine params #8181
support using @dataclass for engine params #8181
Conversation
d7a2080
to
13569b5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EPIC. I love it!
src/python/pants/engine/README.md
Outdated
Exactly(FieldType))`). The bare type name (e.g. `FieldType`) can also be used as a shorthand for | ||
`Exactly(FieldType)`. If the tuple form is used, the constructor will create your object, then raise | ||
an error if the field value does not satisfy the type constraint. | ||
[`DatatypeMixin`](../util.objects.py) is intended to decouple the *declaration* of engine `Param` types from their *usage*. In particular, `DatatypeMixin` declares a few `@abstractmethod`s such as `get_type_dict()`, which it uses to implement several generic methods, including `__str__()`, `__repr__()`, and `copy()` (which returns a new instance of the same class, with one or more fields modified). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it be feasible to get rid of DatatypeMixin
and @dataclasstype
at some point? It's a bit of a weird appendage. Any special methods on a datatype that the engine requires it can presumably implement as functions (e.g., copy(dataclass_instance, **kwargs)
instead of dataclass_instance.copy(**kwargs)
).
It seems best if the datatypes have little or no API at all, other than their declared fields. And being able to say "your pants products can be any frozen dataclass, with no extra boilerplate or syntax" is pretty hot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other conditions described in the Requirements for an Engine Param
are pretty important: must have useful eq and a useful hash. If there is a way to guarantee that on a dataclass, then maybe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All dataclasses get a useful __eq__
by default (using, effectively, tuple comparison of fields), unless you set eq=False
. And they all get a hash if frozen=True
and eq=True
.
True, both of these being meaningful depend on the types of the fields, but we would still be validating those, so we can ensure that.
It would be nice if our @datatype
was just an alias for @dataclass(frozen=True)
(with eq=True
by default), as long as we can use the latter as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if our @datatype was just an alias for @DataClass(frozen=True) (with eq=True by default), as long as we can use the latter as well.
I like this! I think just exposing a single uniform @datatype
annotation sounds great -- I'll use methods in pants.base.deprecated
and introduce a temporary annotation named @datatype_wrapper
or something to phase out the existing datatype
.
Any special methods on a datatype that the engine requires it can presumably implement as functions (e.g., copy(dataclass_instance, **kwargs) instead of dataclass_instance.copy(**kwargs)).
I hadn't thought of this and I think this is a great solution that accomplishes the goals I was looking for! Thank you!
It seems best if the datatypes have little or no API at all, other than their declared fields. And being able to say "your pants products can be any frozen dataclass, with no extra boilerplate or syntax" is pretty hot.
I agree in the extreme! I will work to implement this now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can @dataclasstype
imply @dataclass(frozen=True)
? The boilerplate of the two annotations is a bummer.
src/python/pants/engine/README.md
Outdated
Exactly(FieldType))`). The bare type name (e.g. `FieldType`) can also be used as a shorthand for | ||
`Exactly(FieldType)`. If the tuple form is used, the constructor will create your object, then raise | ||
an error if the field value does not satisfy the type constraint. | ||
[`DatatypeMixin`](../util.objects.py) is intended to decouple the *declaration* of engine `Param` types from their *usage*. In particular, `DatatypeMixin` declares a few `@abstractmethod`s such as `get_type_dict()`, which it uses to implement several generic methods, including `__str__()`, `__repr__()`, and `copy()` (which returns a new instance of the same class, with one or more fields modified). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other conditions described in the Requirements for an Engine Param
are pretty important: must have useful eq and a useful hash. If there is a way to guarantee that on a dataclass, then maybe.
e3240bc
to
101ca1e
Compare
@dataclasstype
to wrap @dataclass
es with datatype()
methods
I've just been able to remove |
@@ -122,7 +122,7 @@ print(x.content) # 'a string' | |||
# datatype objects can be easily inspected: | |||
print(x) # 'FormattedInt(content="a string")' | |||
|
|||
# Most mypy types, including parameterized types, may be used in field definitions. | |||
# All mypy types, including parameterized types, may be used in field definitions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what "mypy types" are. Do we mean the types in the typing
module?
PS We still limit to dataclasses whose fields are (str|bytes|bool|int|a few other scalar types|collections of valid field types|dataclasses of field types), right? So you can't literally use any dataclass? I fear this sentence might read as contradicting that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what "mypy types" are. Do we mean the types in the
typing
module?
It was intended to read as "anything that can be used as the type of a type annotation", which would include the typing
module, however:
PS We still limit to dataclasses whose fields are (str|bytes|bool|int|a few other scalar types|collections of valid field types|dataclasses of field types), right? So you can't literally use any dataclass? I fear this sentence might read as contradicting that.
I think you're totally right here. You'll see in
pants/src/python/pants/engine/rules.py
Lines 29 to 41 in 5dca08b
class _type_field_constraint(TypeConstraint): | |
def __init__(self): | |
super().__init__('<@dataclass `frozen=True` attribute-checking constraint>') | |
def satisfied_by(self, obj): | |
if not issubclass(type(obj), type): | |
return False | |
if dataclasses.is_dataclass(obj) and not obj.__dataclass_params__.frozen: | |
raise TypeConstraintError( | |
f'type {obj} is a dataclass declared without `frozen=True`! ' | |
'The engine requires that fields in params are immutable for stable hashing!') | |
return True |
@dataclass
classes used in a UnionRule
, RootRule
, or TaskRule
have set frozen=True
-- I think we can perform a little more validation here to avoid confusing error messages from the engine about unhashability. I'll do that now, then update the README to indicate which @dataclass
field types that pants accepts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is hashability enough? Don't we also need serializability? How do things get cached otherwise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PS We still limit to dataclasses whose fields are (str|bytes|bool|int|a few other scalar types|collections of valid field types|dataclasses of field types), right? So you can't literally use any dataclass? I fear this sentence might read as contradicting that.
Hm. Thinking about it again, I would prefer to try to avoid whitelisting specific types and instead just require hashability, which I think is the only real constraint on engine params? If you try to hash a dataclass which uses a List[_]
field, for example, you get this:
>>> @dataclass(frozen=True)
... class C:
... x: int
... y: Tuple[str]
... z: List[int]
...
>>> C(3, ("a",), [3])
C(x=3, y=('a',), z=[3])
>>> hash(C(3, ("a",), [3]))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 2, in __hash__
TypeError: unhashable type: 'list'
This does not occur for dataclasses with just Tuple[_]
fields, for example. So I think we might be able to amend this section to say just that the only requirement on dataclass fields is hashability? What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the end don't we need to be able to serialize all engine datatype
s, for the cache? This is an area of the engine I am fuzzy on, but I was under the strong impression that all Params can be cached, which presumably means they can be serialized. Although I don't know how this serialization happens today, if at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So say we codegen Foo.proto
to Foo.scala
. Foo.proto
will be in the CAS because it was an on-disk source, it sounds like. Foo.scala
is generated by remote process execution, so now it's sitting on disk on the remote machine. What happens to it then that allows it to participate in a subsequent process execution say to compile it to Foo.class
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uploads and downloads to a remote CAS happen on-demand when you execute a process remotely.
^ this. So after the process finishes executing remotely, we sync the output digests locally.
EDIT: Oh. Hm. Sorry: I went looking for code to confirm this, and then realized that we apparently download lazily currently: downloading eagerly would look like calling
pants/src/rust/engine/fs/store/src/lib.rs
Lines 495 to 499 in 70cb3c3
/// | |
/// Download a directory from Remote ByteStore recursively to the local one. Called only with the | |
/// Digest of a Directory. | |
/// | |
pub fn ensure_local_has_recursive_directory( |
But particularly in the context of remoting, you might not want to fetch something locally. So I cannot now remember whether this laziness was intentional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, I'm very happy with the laziness... Why fetch a file unless you need it. Imagine a large .jar or something...
But then can the second process execution (that takes the Foo.java as input) just reference it via the CAS, so we never even see the full file locally? I.e., the first remote process execution creates it in the CAS, the second reads it from there, and we never need to download the intermediate file to the Pants-running machine, just its digest?
And my other question was more about this: in the remote execution of the proto compiler, something has to know to take the resulting Foo.scala file from the local filesystem and stick it in the CAS (regardless of whether we then fetch it eagerly or lazily). How does that decision get made? I.e., what knows that "Foo.scala" is a file we care about, a product, but "incidental log file.txt" is not...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then can the second process execution (that takes the Foo.java as input) just reference it via the CAS, so we never even see the full file locally? I.e., the first remote process execution creates it in the CAS, the second reads it from there, and we never need to download the intermediate file to the Pants-running machine, just its digest?
Correct. But whether you download it or not (ie, lazy or no), it already exists in the remote CAS. Before executing the next process we do (confirmed) eagerly call
pants/src/rust/engine/fs/store/src/lib.rs
Lines 394 to 400 in 70cb3c3
/// | |
/// Ensures that the remote ByteStore has a copy of each passed Fingerprint, including any files | |
/// contained in any Directories in the list. | |
/// | |
/// Returns a structure with the summary of operations. | |
/// | |
pub fn ensure_remote_has_recursive( |
And my other question was more about this: in the remote execution of the proto compiler, something has to know to take the resulting Foo.scala file from the local filesystem and stick it in the CAS (regardless of whether we then fetch it eagerly or lazily). How does that decision get made? I.e., what knows that "Foo.scala" is a file we care about, a product, but "incidental log file.txt" is not...
The high-level (python) ExecuteProcessRequest
type has fields to define what should be captured from the process-execution sandbox, and these map ~1:1 with the bazel remoting API:
pants/src/python/pants/engine/isolated_process.py
Lines 21 to 22 in e3cf637
('output_files', hashable_string_list), | |
('output_directories', hashable_string_list), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK so, to sum up:
Say we're running Pants on machine P1, and executing processes remotely on machines M1, M2.
-
Foo.proto
is a local source file, so it's already in the CAS and P1 has its fingerprint. -
P1 calls
ExecuteProcess(protoc, Fingerprint(Foo.proto))
, sending the request to M1. -
M1 fetches
Foo.proto
from the CAS, runsprotoc
on it (I'm eliding the question of where M1 gets aprotoc
binary...) thus generatingFoo.scala
. -
M1 puts
Foo.scala
in the CAS, returning justFingerprint(Foo.scala)
to P1. -
P1 calls
ExecuteProcess(scalac, Fingerprint(Foo.scala))
, sending the request to M2. -
M2 fetches
Foo.scala
from the CAS, runsscalac
on it, thus generatingFoo.class
. -
M2 puts
Foo.class
in the CAS, returning justFingerprint(Foo.class)
to P1.
So P1 does not have the content of Foo.scala
or Foo.class
locally at any point.
Does that about sum it up reasonably?
If so, one more question: what does that mean for running ensure_remote_has_recursive
before step 5 above? P1 doesn't have Foo.scala
's content, so it seems like it can't ensure anything. It can I guess error out if the CAS is missing the file, but then so can M2 directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks!
@@ -122,7 +122,7 @@ print(x.content) # 'a string' | |||
# datatype objects can be easily inspected: | |||
print(x) # 'FormattedInt(content="a string")' | |||
|
|||
# Most mypy types, including parameterized types, may be used in field definitions. | |||
# All mypy types, including parameterized types, may be used in field definitions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cosmicexplorer: that's correct. I try to differentiate between them by calling the in-memory approach "memoization in the graph", vs calling the persistent approach "caching". Memoization happens for all rules and requires python hash/eq, caching happens only for process executions, and uses the protobuf definitions.
src/python/pants/util/objects.py
Outdated
|
||
# TODO(#7074): Migrate to python 3 dataclasses! | ||
@deprecated('1.22.0.dev0', hint_message='Use @dataclasstype to declare typed named tuples!') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should try to target dev2
with deprecations where possible. See:
https://groups.google.com/d/topic/pants-devel/rO1K7j1iCwk/discussion
I made a change to try and encourage this, so I'm surprised it's not taking effect here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you describe what change you made to encourage this? I'm not seeing any code in deprecated.py
that relates to this, and I'm wondering if I should make that change in a separate PR.
b9511ae
to
cb48ca8
Compare
fc9866b
to
c6ff796
Compare
77901f4
to
bcf8e5d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so cool!
definitions to provide a unique and descriptive type is highly recommended: | ||
disambiguate between various types of data in `@rule` signatures, so declaring small, unique classes | ||
to encapsulate different states can help to make `@rule`sets with complex control flow between rules | ||
more declarative and self-documenting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good rewrite!
"""Holds a normalized index of Rules used to instantiate Nodes.""" | ||
|
||
rules: Dict[Any, Any] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be tightened at all? I think rules are seen as Callable
? Don't let this block landing the PR though. We can come back to this and I'd rather land the great work you've done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a better version of lines 573-575 - should delete the original.
Thanks for adding type hints! Is Any
representative or because we don't know how to make it more precise? If the latter, it's better to do Dict
and Set
with no types specified so that we know to later go back and make the type hint more precise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a followup PR that addresses those more-precise type hints!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #8496!
#8181 demonstrated that it's possible to use `@dataclass(frozen=True)` as an alternative to `datatype()`. As argued there, this brings several benefits, including aligning us more to the stdlib and allowing MyPy to understand our code for free. This PR takes a first pass to convert >75% of `datatype` usages to `@dataclass`. It naively does this translation: * Does not attempt to add any more type hints than were previously used. * Leaves untouched certain `datatypes()` that inherit normal classes with class variables, which would get overridden * Leaves untouched certain `datatypes()` with custom `__new__()` that is not easily replicated
… with V2 (#8431) ### Problem In #8423 (comment), we realized the need to be able to not always use `frozen=True` when using dataclasses for the V2 engine. Specifically, when we need a custom `__init__()`, we cannot have `frozen=True` because `__init__()` must be able to mutate the object's attributes. However, we still want to keep the benefits of `frozen=True` as much as possible. After we create the object, we do not want to allow any additional modifications. This is important because the V2 engine uses the `__hash__()` of an object to determine its node ID - if the object mutates after creation, the ID will become invalid. ### Solution As suggested in #8423 (comment), introduce a new decorator `@frozen_after_init` that only allows setting attributes in the `__init__()` and then marks the object as frozen after. This is done by setting a flag in the `__init__()` to mark that the object should there-on-out be frozen. Then, we modify `__setattr__()` to check for that flag to prohibit any mutations other than those in the `__init__()`. ### Result `@dataclass(unsafe_hash=True)` can now be marked with `@frozen_after_init` so that we can have the custom `__init__()`s but not lose the benefits of `frozen=True`. This also works on any arbitrary class. We do not yet enforce that dataclasses _must_ use this attribute (or `frozen=True`), which is left for #8181.
332cfd8
to
3fa073a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay! Only holding off on approval because I didn't have enough time to closely review this yet.
"""Holds a normalized index of Rules used to instantiate Nodes.""" | ||
|
||
rules: Dict[Any, Any] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a better version of lines 573-575 - should delete the original.
Thanks for adding type hints! Is Any
representative or because we don't know how to make it more precise? If the latter, it's better to do Dict
and Set
with no types specified so that we know to later go back and make the type hint more precise.
src/python/pants/util/objects.py
Outdated
|
||
# TODO(#7074): Migrate to python 3 dataclasses! | ||
@deprecated('1.24.0.dev2', hint_message='Use @dataclass to declare typed named tuples instead!') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll want to remove this without a deprecation cycle once possible under the same logic as the enum()
removal, i.e. that this API was experimental (albeit highly useful!).
Maybe hold off on deprecating this so that we can have a dedicated PR for the removal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think that deprecating it is exactly what makes it easier to remove eventually? If we don't want to have a full deprecation cycle, we can kill it early. But if we're planning to remove it, I think a deprecation cycle sounds reasonable?
311f8ab
to
01e7854
Compare
deprecate datatype()! add test for rules with python dataclasses! remove all the conversion to custom datatypes!!! fix the readme!!! ensure @DataClass types have frozen=True when used in the engine clarify mypy compatibility don't display deprecation warnings in CI bump deprecation version fix import ordering make lint and unit tests pass fix mypy line number remove deprecation warning ignore filter in travis fix weird mypy errors try to see if removing the deprecation stops the timeouts! fix test failure and uncomment deprecation!
Co-Authored-By: Eric Arellano <[email protected]>
54b1e8a
to
a5ee2c2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thank you!
This reverts commit 00c4f8c.
Problem
A first step to fixing #7074. We would like to be able to make use of python 3
@dataclass
es, partially because they are the canonical way to make typed tuples in python now, and partially because it allows us to use mypy to resolve type errors at "compile" time instead of performing type checks at runtime when the tuple is constructed, which is what we do now, e.g.:pants/src/python/pants/util/objects.py
Lines 120 to 125 in b4e316f
Solution
README.md
to demonstrate how to use@dataclass
and remove mentions of the olddatatype()
.@dataclass
objects.Result
Users are able to use a very canonical
@dataclass
declaration syntax for engine params which is statically type-checked with mypy!