-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Make output dtype known for list.to_struct
when fields
are passed
#19439
Conversation
}; | ||
|
||
if *max_fields > 0 { | ||
inferred.min(*max_fields) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
upper_bound
was previously ignored when width was being inferred during execution - it was only used during IR resolving for getting the output names
@@ -343,9 +343,6 @@ impl serde::Serialize for PlCredentialProvider { | |||
{ | |||
use serde::ser::Error; | |||
|
|||
// TODO: | |||
// * Add magic bytes here to indicate a python function | |||
// * Check the Python version on deserialize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drive-by - outdated todo
@@ -1092,7 +1092,7 @@ def to_array(self, width: int) -> Expr: | |||
|
|||
def to_struct( | |||
self, | |||
n_field_strategy: ToStructStrategy = "first_non_null", | |||
n_field_strategy: ListToStructWidthStrategy = "first_non_null", | |||
fields: Sequence[str] | Callable[[int], str] | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have this functionality where you can pass a Python function that gets called with a field index to generate names, but I'm not sure it's needed since you can just pass a list of field names. Maybe we should remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It allows you to create fields at runtime. Which is a bit dangerous as it will be dependent of the data. But I think we should keep for now.
I want to add a return_dtype: DataType
argument as well that allows you to specify the datatype immediately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a Notes
warning that it is a bug in a users query if they don't set fields
or return_dtype
? Then we can make it required in 2.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to add a
return_dtype: DataType
argument as well that allows you to specify the datatype immediately.
I think we can already do this by passing a list of names to the fields
parameter. For the function case, we also get the output schema if the upper_bound
is set - I can add a warning if we get a function without upper_bound
being set?
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think I want to add the warnings/notes in a separate PR
Nice! If a user doesn't provide any fields/dtype. We must set the output type to |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #19439 +/- ##
==========================================
+ Coverage 79.94% 79.95% +0.01%
==========================================
Files 1534 1534
Lines 211059 211121 +62
Branches 2444 2444
==========================================
+ Hits 168727 168799 +72
+ Misses 41777 41767 -10
Partials 555 555 ☔ View full report in Codecov by Sentry. |
ba311d8
to
6de0c3e
Compare
25bfbd9
to
c9197aa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice one Simon. Thanks, this will fix a lot of hairy bugs.
If we receive a list of names in the
fields
parameter, we will be able to return a dtype into_field
(or if we have a non-zero upper_bound)Also refactors to define
list.to_struct
to be aListFunction
in the IR.