Re-work API for list.to_struct() #19525

Open
nameexhaustion opened this issue Oct 30, 2024 · 0 comments
Labels
A-api (Area: changes to the public API), A-dtype-list/array (Area: list/array data type), enhancement (New feature or an improvement of an existing feature), P-low (Priority: low)
nameexhaustion (Collaborator) commented Oct 30, 2024

Description

This is what the current API looks like:

def to_struct(
    self,
    n_field_strategy: Literal["first_non_null", "max_width"] = "first_non_null",
    fields: Sequence[str] | Callable[[int], str] | None = None,
    upper_bound: int = 0,
) -> Expr:

This is problematic for lazy execution: if a user calls list.to_struct() without specifying the field names, we cannot determine the schema during IR resolution, which can lead to confusing errors in subsequent expressions.

Specifically, the error message could look something like "struct.prefix_fields() not supported for dtype Unknown". It is not immediately clear that this is caused by an earlier list.to_struct() call in the query that did not specify field names.
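To illustrate why the schema cannot be known up front, here is a toy model in plain Python of how the two n_field_strategy values pick a field count. This is only a sketch of my reading of the strategy names, not the actual Polars implementation; `infer_n_fields` is a hypothetical helper, and treating empty lists like nulls for "first_non_null" is an assumption.

```python
from typing import Literal, Optional, Sequence

Strategy = Literal["first_non_null", "max_width"]


def infer_n_fields(rows: Sequence[Optional[list]], strategy: Strategy) -> int:
    """Toy model: how many struct fields a list column would produce."""
    if strategy == "first_non_null":
        # Width of the first row that is neither null nor empty (assumed reading).
        for row in rows:
            if row:
                return len(row)
        return 0
    if strategy == "max_width":
        # Width of the longest list anywhere in the column.
        return max((len(row) for row in rows if row is not None), default=0)
    raise ValueError(f"unknown strategy: {strategy!r}")


column = [None, [1, 2], [3, 4, 5]]
print(infer_n_fields(column, "first_non_null"))  # 2
print(infer_n_fields(column, "max_width"))       # 3
```

Both answers require looking at the data, which is exactly what is not available while the lazy plan is being resolved.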

I would propose something like:

def to_struct(
    self,
    field_names: Sequence[str] | Callable[[int], str],
    *,
    infer_fields: Literal["first_non_null", "max_width"] | None = None,
) -> Expr:

Where we would forbid arguments that have an ambiguous number of output columns when executing on a LazyFrame:

Parameters Series DataFrame/LazyFrame
list.to_struct(['a', 'b'])
list.to_struct(field_names=lambda x: f"c{x}") error: infer_fields must be specified
list.to_struct(field_names=lambda x: f"c{x}", infer_fields="first_non_null")
list.to_struct(field_names=lambda x: f"c{x}", infer_fields="max_width")
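The argument check implied by the rows above could be sketched roughly as follows, in plain Python. `plan_field_names` is a hypothetical helper made up for illustration, not part of Polars, and it only models the two cases the table spells out: an explicit list of names is always fine, and a callable without infer_fields is an error.

```python
from typing import Callable, Literal, Optional, Sequence, Union

InferFields = Optional[Literal["first_non_null", "max_width"]]


def plan_field_names(
    field_names: Union[Sequence[str], Callable[[int], str]],
    *,
    infer_fields: InferFields = None,
) -> Optional[list]:
    """Hypothetical check: return the names if they are known without data.

    Returns None when the names can only be resolved by scanning the data.
    """
    if callable(field_names):
        if infer_fields is None:
            # The callable alone does not say how many fields to generate.
            raise ValueError("infer_fields must be specified")
        # The width depends on the data, so the names are resolved later.
        return None
    # An explicit sequence fixes both the number and the names of the fields.
    return list(field_names)


print(plan_field_names(["a", "b"]))  # ['a', 'b']
print(plan_field_names(lambda i: f"c{i}", infer_fields="max_width"))  # None
```

A None result here marks the cases where the output width is ambiguous until the data is seen, which are the ones that would need to be restricted on a LazyFrame.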
Notes
  • We should probably also update arr.to_struct
nameexhaustion added the enhancement, A-api, P-low, and A-dtype-list/array labels Oct 30, 2024
nameexhaustion added this to the 2.0.0 milestone Oct 30, 2024
github-project-automation bot moved this to Ready in Backlog Oct 30, 2024
nameexhaustion changed the title from "Re-work API for list.to_struct" to "Re-work API for list.to_struct() / array.to_struct()" Dec 5, 2024
nameexhaustion changed the title from "Re-work API for list.to_struct() / array.to_struct()" to "Re-work API for list.to_struct()" Dec 5, 2024