Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python, rust): new n_chars functionality for utf8 strings #5252

Merged
merged 1 commit into from
Oct 18, 2022
Merged

feat(python, rust): new n_chars functionality for utf8 strings #5252

merged 1 commit into from
Oct 18, 2022

Conversation

alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented Oct 18, 2022

Closes #5249.


The linked issue raises a good point; we're currently returning nbytes rather than what most people (and python) would consider to be the "actual" string length, and offering no function for getting that length. However, getting nbytes has value (and is faster).

So, this PR separates the two concepts into str.lengths and str.n_bytes, to offer the "expected" and the existing functionality (in a more unambiguous form; you can opt to use n_bytes when you know you're only dealing with ascii, for example):

Updated:

  • str.lengths remains unchanged (non-breaking for existing code).
  • str.n_chars (new) now provides the character-count functionality.
  • the docstrings for the respective functions explain the difference.

Example:

df = pl.DataFrame({
    "s":["Café","xyz","東京"]
}).with_columns([
    pl.col("s").str.lengths().alias("length"),
    pl.col("s").str.n_chars().alias("nchars"),
])
# shape: (3, 3)
# ┌──────┬────────┬────────┐
# │ s    ┆ length ┆ nchars │
# │ ---  ┆ ---    ┆ ---    │
# │ str  ┆ u32    ┆ u32    │
# ╞══════╪════════╪════════╡
# │ Café ┆ 5      ┆ 4      │  # << é: two bytes, one char
# ├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
# │ xyz  ┆ 3      ┆ 3      │  # << ascii: bytes and chars match
# ├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
# │ 東京  ┆ 6      ┆ 2      │  # << 東京 (tokyo): six bytes, two chars
# └──────┴────────┴────────┘

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Oct 18, 2022
@ritchie46
Copy link
Member

I want to follow the rust defaults here. Which is lengths for n_bytes and char lengths explicitly.

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Oct 18, 2022

I want to follow the rust defaults here. Which is lengths for n_bytes and char lengths explicitly.

Fair enough - though even the rust docs state "it might not be what a human considers the length of the string" ;)

In that case should be unambiguous what's going on; how about this...

  • str.lengths stays "as-is" and returns n_bytes
  • str.n_chars returns the "human" string length?

...and I add a Notes section in the respective docstrings that goes out of its way to explain the difference? That should help point people at what they need/expect.

@fangyinc
Copy link

@alexander-beedie

  • str.lengths stays "as-is" and returns n_bytes
  • str.n_chars returns the "human" string length?

It looks like well, the meaning of str.lengths is not change.

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Oct 18, 2022

@ritchie46: done; patch is now non-breaking (clearly a better idea), and character-counting functionality is provided by the new n_chars expression/method, with docstring updates to point the user between the functions (depending on what they want/expect) and examples that illustrate the difference.

@alexander-beedie alexander-beedie changed the title feat(python, rust): separate "lengths" and "nbytes" functions for utf8 strings feat(python, rust): new "n_chars" functionality for utf8 strings Oct 18, 2022
@alexander-beedie alexander-beedie changed the title feat(python, rust): new "n_chars" functionality for utf8 strings feat(python, rust): new n_chars functionality for utf8 strings Oct 18, 2022
@ritchie46
Copy link
Member

Looks great! Can we have a docs entry for n_chars?

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Oct 18, 2022

Looks great! Can we have a docs entry for n_chars?

Added; one day I'll remember to do that without the reminder...😅

@ritchie46 ritchie46 merged commit f8d60ee into pola-rs:master Oct 18, 2022
@alexander-beedie alexander-beedie deleted the string-chars-count branch October 20, 2022 19:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New string chars count api for string expr
3 participants