Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CHORE] Tidy up typing of binary ops [1/2] #1114

Merged
merged 12 commits into from
Jul 5, 2023
Merged

Conversation

xcharleslin
Copy link
Contributor

@xcharleslin xcharleslin commented Jun 30, 2023

This PR tidies up typing of arithmetic binary ops.

It makes the following small-but-key improvements:

1. Centralize output type resolution in one place (daft-core::datatype::binary_op)

Before, both Expressions (daft-dsl::expr) and Series (daft-core::series) implemented their own logic to determine (a) are the input types valid and (b) what should the output type be. This is a nontrivial amount of conditional logic (over 100 lines) that was duplicated twice and had no guarantee of being in sync.

In addition, it made implementing new types for Series operations unwieldy, since one would have to interleave the computation implementation among the input type validation logic and the output type logic.

In this PR, input type validation and output type resolution is factored out to DataType. You can Add a DataType and DataType, and produce a Result which will tell you whether the operation is okay and what the output type will be.

Expression then simply calls DataType. Series calls DataType, and then does the computation logic after.

2. Change type resolution logic from implicit (subtractive) to explicit (additive).

Before, type resolution logic was "opt-out" - all operations would be resolved by the same type matrix, and there was a lot of logic excluding types from that behaviour. Overall, this made it difficult to reason about and modify what types were supported and what logic was being called. There is some concrete evidence of this in the code:

  • dead code in the supertype match rules (mostly in the logical types)
  • the existence of multiple allowed/disallowed behaviours that we were unaware of (e.g. being unable to add bools, types where cmp is allowed but arithmetic is not, etc.)
  • multiple return statements in the typing logic (14 in Expression binary ops, 13 in Series arithmetic alone), which is a bit of a control-flow maze

This PR changes the behaviour to be "opt-in". Binary ops explicitly specify the types they implement behaviour on. If behaviour is not specified, then it does not work - simple as that.

This change is mostly code rearrangement. The typing logic is separated between numeric types, remaining physical types, and logical types. Operations separately specify their behaviour across these different types using match rules instead of return statements, while still calling default implementations where appropriate.

Now it is really easy to see whether some behaviour is implemented or not, and where the correct place is to add new behaviour.


This PR implements the changes for arithmetic binary ops, which are the most complicated of the binary ops. A followup PR will implement the changes for logical and comparison ops, and remove the old supertype utility.

@codecov
Copy link

codecov bot commented Jul 1, 2023

Codecov Report

Merging #1114 (459431b) into main (5eb5318) will not change coverage.
The diff coverage is n/a.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1114   +/-   ##
=======================================
  Coverage   88.49%   88.49%           
=======================================
  Files          54       54           
  Lines        5476     5476           
=======================================
  Hits         4846     4846           
  Misses        630      630           

@xcharleslin xcharleslin changed the title Charles/opstyping Tidy up typing of binary ops [1/2] Jul 3, 2023
@xcharleslin xcharleslin changed the title Tidy up typing of binary ops [1/2] [CHORE] Tidy up typing of binary ops [1/2] Jul 3, 2023
@github-actions github-actions bot added the chore label Jul 3, 2023
@xcharleslin xcharleslin marked this pull request as ready for review July 3, 2023 21:59
@xcharleslin xcharleslin merged commit 447cb2f into main Jul 5, 2023
@xcharleslin xcharleslin deleted the charles/opstyping branch July 5, 2023 18:28
}
let supertype = try_get_supertype(lhs, rhs)?;
Ok(Field::new(left.to_field(schema)?.name.as_str(), supertype))
let result_type = (&left_field.dtype + &right_field.dtype)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

fn physical_add(lhs: &Series, rhs: &Series, output_type: &DataType) -> DaftResult<Series> {
use DataType::*;
match output_type {
Utf8 => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This is nice because we can match the intended behavior of the Series operation against the intended return type as intended by the type-system, instead of having Series do its own typing logic.

I do wonder about cases where the result dtype isn't fully expressive about the actual runtime implementation though. For example, if the result if a "date", it may not provide enough information about the implementation (is this a date plus a duration or a duration plus a date?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separately, I guess instead of our current unit tests, we can maybe even just add asserts in our code that will ensure that the series output dtype matches the type-system's intended dtype? Not quite sure how best to tie the resolve-time and runtime behavior together yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it will still be up to the implementation to decide how to actually perform the operation. But the implementation can choose to do it however it wants since there will be no conflicting specification anywhere else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separately, I guess instead of our current unit tests, we can maybe even just add asserts in our code that will ensure that the series output dtype matches the type-system's intended dtype? Not quite sure how best to tie the resolve-time and runtime behavior together yet.

This can be accomplished by checking Schema vs Series datatypes at Table construction time, right?

xcharleslin added a commit that referenced this pull request Jul 6, 2023
Similar to #1114, for the
remaining binary ops (logical and comp).

Also cleans up the macros in SeriesLike binary_ops as well.

---------

Co-authored-by: Xiayue Charles Lin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants