Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename .list.lengths and .str.lengths #11577

Closed
stinodego opened this issue Oct 6, 2023 · 10 comments · Fixed by #11613
Closed

Rename .list.lengths and .str.lengths #11577

stinodego opened this issue Oct 6, 2023 · 10 comments · Fixed by #11613
Assignees
Labels
accepted Ready for implementation deprecation Add a deprecation warning to outdated functionality enhancement New feature or an improvement of an existing feature

Comments

@stinodego
Copy link
Contributor

stinodego commented Oct 6, 2023

Description

As mentioned by @alexander-beedie in #11462 (comment)

lengths is a weird name, being plural. I started renaming it to length, but then I figured .len would also be an option (both Series and Expr have a method called len), and in the case of the list namespace, count also makes sense.

We generally prefer fully spelled out names (i.e. length instead of len), but both Rust and Python have len as the naming convention:

So I would propose to:

  • Deprecate .list.lengths(), add .list.len() as the replacement and add .list.count()
  • Deprecate .str.lengths(), add .str.len() as the replacement.

Thoughts?

@stinodego stinodego added enhancement New feature or an improvement of an existing feature deprecation Add a deprecation warning to outdated functionality labels Oct 6, 2023
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Oct 6, 2023

Deprecate .str.lengths(), add .list.len() as the replacement.

str.len()? :)

Yup, len is pretty universal, but I'm good with length too, and that does feel very in-keeping with the way that the rest of the API leans towards being more explicit (though in that case we should unify with the other examples that currently use len). Certainly anyone looking for len will have no trouble seeing it coming and hitting <TAB> to go all the way ;)

While I don't mind list.count(), I don't think there's a lot of justification for special-casing it; people will naturally go looking for len/length and then wonder why it's not there, or whether it has non-standard behaviour (that would justify a non-standard name?)

@stinodego
Copy link
Contributor Author

stinodego commented Oct 6, 2023

str.len()? :)

Woops, fixed.

Yeah maybe we should leave out list.count for now. I think anything available on Expr should also be available on Expr.list. But we can leave that for another day.

Not 100% yet on len vs length... Either is OK I guess, but as you say, if we go for length we should also rename Series/Expr.len to length. Using length would definitely be more in line with our overall API design.

@ritchie46 want to make the decision here?

@stinodego stinodego self-assigned this Oct 6, 2023
@ritchie46
Copy link
Member

I want to go for len then as it is well established and it means we don't have to rename the Series method.

@stinodego stinodego added the accepted Ready for implementation label Oct 6, 2023
@github-project-automation github-project-automation bot moved this to Ready in Backlog Oct 6, 2023
@Julian-J-S
Copy link
Contributor

One important thing to consider regarding str.len is that len behaves differently in python and rust!

  • python: char count; ["Café", "345", "東京"] -> [4, 3, 2]
  • rust: byte count; ["Café", "345", "東京"] -> [5, 3, 6]

currently we have str.lengths for byte-count and n_chars for char-count

Because strings are different from lists and arrays I suggest 2 functions

  • str.len_bytes
  • str.len_chars
    and maybe for consistency
  • str.len (using str.len_bytes)

what do you think?

@avimallu
Copy link
Contributor

avimallu commented Oct 8, 2023

I think this deprecation to len is unnecessary, but I do understand the need to switch to length. As Polars starts maturing, people (including me) expect Polars to be consistent internally even in the API.

Going by the design of the rest of the API, len should be present as a convenient alias of length. Polars, as was mentioned long ago by Ritchie, should not mimic "common" functions (looking at strptime) and make it better (like to_datetime).

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Oct 8, 2023

I suggest 2 functions

  • str.len_bytes
  • str.len_chars
    and maybe for consistency
  • str.len (using str.len_bytes)

what do you think?

I think adding the third "generic" len function would undo all of the good work done by having two clear type-specific funcs (as it reintroduces ambiguity - it wouldn't be clear which len_<type> a basic len would redirect to ;)

@ritchie46, @stinodego: It does raise a good point though, we have gone to the trouble of having two expressions to account for this; perhaps unifying them both under len_<type> names is actually the way to go - that way anyone looking for len* will be immediately be shown that they have a choice to make as to what, exactly, they think they are getting the length of. We would mitigate ambiguity, and any hunting around for alternative expressions, as they'd be right next to each other in the docs/autocomplete/etc? 🤔

  • str.len_bytes / str.length_bytes
  • str.len_chars / str.length_chars

@stinodego
Copy link
Contributor Author

I don't really like that so much. Most unexperienced users will pick len_chars, even though they are working with ASCII data, because they might not understand the difference between bytes and chars. And that will cost performance.

But I do agree that n_chars is a related function and the current naming does not reflect that very well. Not sure about the best solution just yet.

@Julian-J-S
Copy link
Contributor

I was actually thinking the same that some users might pick length_chars instead of length_bytes when dealing with ASCII data BUT

@orlp
Copy link
Collaborator

orlp commented Oct 9, 2023

@stinodego I don't think that len_chars has to be that expensive, especially not if we add an ASCII-specific happy path.

@stinodego
Copy link
Contributor Author

Yeah I am coming around on the idea of len_chars and len_bytes. I'll draw it up like that.

length_bytes doesn't work imo, because that leaves me wondering "What is a length byte?". length_as_number_of_bytes is too wordy, n_bytes can work but doesn't show a connection to len. I think len_bytes / len_chars is perfect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation deprecation Add a deprecation warning to outdated functionality enhancement New feature or an improvement of an existing feature
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants