-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CHORE] Tidy up typing of binary ops [1/2] #1114
Conversation
} | ||
let supertype = try_get_supertype(lhs, rhs)?; | ||
Ok(Field::new(left.to_field(schema)?.name.as_str(), supertype)) | ||
let result_type = (&left_field.dtype + &right_field.dtype)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
fn physical_add(lhs: &Series, rhs: &Series, output_type: &DataType) -> DaftResult<Series> { | ||
use DataType::*; | ||
match output_type { | ||
Utf8 => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! This is nice because we can match the intended behavior of the Series operation against the intended return type as intended by the type-system, instead of having Series do its own typing logic.
I do wonder about cases where the result dtype isn't fully expressive about the actual runtime implementation though. For example, if the result if a "date", it may not provide enough information about the implementation (is this a date plus a duration or a duration plus a date?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separately, I guess instead of our current unit tests, we can maybe even just add asserts in our code that will ensure that the series output dtype matches the type-system's intended dtype? Not quite sure how best to tie the resolve-time and runtime behavior together yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it will still be up to the implementation to decide how to actually perform the operation. But the implementation can choose to do it however it wants since there will be no conflicting specification anywhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separately, I guess instead of our current unit tests, we can maybe even just add asserts in our code that will ensure that the series output dtype matches the type-system's intended dtype? Not quite sure how best to tie the resolve-time and runtime behavior together yet.
This can be accomplished by checking Schema vs Series datatypes at Table construction time, right?
Similar to #1114, for the remaining binary ops (logical and comp). Also cleans up the macros in SeriesLike binary_ops as well. --------- Co-authored-by: Xiayue Charles Lin <[email protected]>
This PR tidies up typing of arithmetic binary ops.
It makes the following small-but-key improvements:
1. Centralize output type resolution in one place (
daft-core::datatype::binary_op
)Before, both Expressions (
daft-dsl::expr
) and Series (daft-core::series
) implemented their own logic to determine (a) are the input types valid and (b) what should the output type be. This is a nontrivial amount of conditional logic (over 100 lines) that was duplicated twice and had no guarantee of being in sync.In addition, it made implementing new types for Series operations unwieldy, since one would have to interleave the computation implementation among the input type validation logic and the output type logic.
In this PR, input type validation and output type resolution is factored out to DataType. You can Add a DataType and DataType, and produce a Result which will tell you whether the operation is okay and what the output type will be.
Expression then simply calls DataType. Series calls DataType, and then does the computation logic after.
2. Change type resolution logic from implicit (subtractive) to explicit (additive).
Before, type resolution logic was "opt-out" - all operations would be resolved by the same type matrix, and there was a lot of logic excluding types from that behaviour. Overall, this made it difficult to reason about and modify what types were supported and what logic was being called. There is some concrete evidence of this in the code:
This PR changes the behaviour to be "opt-in". Binary ops explicitly specify the types they implement behaviour on. If behaviour is not specified, then it does not work - simple as that.
This change is mostly code rearrangement. The typing logic is separated between numeric types, remaining physical types, and logical types. Operations separately specify their behaviour across these different types using match rules instead of return statements, while still calling default implementations where appropriate.
Now it is really easy to see whether some behaviour is implemented or not, and where the correct place is to add new behaviour.
This PR implements the changes for arithmetic binary ops, which are the most complicated of the binary ops. A followup PR will implement the changes for logical and comparison ops, and remove the old supertype utility.