feat(python, rust): new `n_chars` functionality for utf8 strings #5252

alexander-beedie · 2022-10-18T12:51:36Z

Closes #5249.

The linked issue raises a good point; we're currently returning nbytes rather than what most people (and python) would consider to be the "actual" string length, and offering no function for getting that length. However, getting nbytes has value (and is faster).

So, this PR separates the two concepts into str.lengths and str.n_bytes, to offer the "expected" and the existing functionality (in a more unambiguous form; you can opt to use n_bytes when you know you're only dealing with ascii, for example):

Updated:

str.lengths remains unchanged (non-breaking for existing code).
str.n_chars (new) now provides the character-count functionality.
the docstrings for the respective functions explain the difference.

Example:

df = pl.DataFrame({
    "s":["Café","xyz","東京"]
}).with_columns([
    pl.col("s").str.lengths().alias("length"),
    pl.col("s").str.n_chars().alias("nchars"),
])
# shape: (3, 3)
# ┌──────┬────────┬────────┐
# │ s    ┆ length ┆ nchars │
# │ ---  ┆ ---    ┆ ---    │
# │ str  ┆ u32    ┆ u32    │
# ╞══════╪════════╪════════╡
# │ Café ┆ 5      ┆ 4      │  # << é: two bytes, one char
# ├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
# │ xyz  ┆ 3      ┆ 3      │  # << ascii: bytes and chars match
# ├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
# │ 東京  ┆ 6      ┆ 2      │  # << 東京 (tokyo): six bytes, two chars
# └──────┴────────┴────────┘

ritchie46 · 2022-10-18T13:06:23Z

I want to follow the rust defaults here. Which is lengths for n_bytes and char lengths explicitly.

alexander-beedie · 2022-10-18T13:14:10Z

I want to follow the rust defaults here. Which is lengths for n_bytes and char lengths explicitly.

Fair enough - though even the rust docs state "it might not be what a human considers the length of the string" ;)

In that case should be unambiguous what's going on; how about this...

str.lengths stays "as-is" and returns n_bytes
str.n_chars returns the "human" string length?

...and I add a Notes section in the respective docstrings that goes out of its way to explain the difference? That should help point people at what they need/expect.

fangyinc · 2022-10-18T13:27:01Z

@alexander-beedie

str.lengths stays "as-is" and returns n_bytes

str.n_chars returns the "human" string length?

It looks like well, the meaning of str.lengths is not change.

alexander-beedie · 2022-10-18T14:30:56Z

@ritchie46: done; patch is now non-breaking (clearly a better idea), and character-counting functionality is provided by the new n_chars expression/method, with docstring updates to point the user between the functions (depending on what they want/expect) and examples that illustrate the difference.

ritchie46 · 2022-10-18T14:52:13Z

Looks great! Can we have a docs entry for n_chars?

alexander-beedie · 2022-10-18T15:48:46Z

Looks great! Can we have a docs entry for n_chars?

Added; one day I'll remember to do that without the reminder...😅

github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Oct 18, 2022

owrior mentioned this pull request Oct 18, 2022

fix(rust):Count chars rather than bytes #5251

Closed

alexander-beedie changed the title ~~feat(python, rust): separate "lengths" and "nbytes" functions for utf8 strings~~ feat(python, rust): new "n_chars" functionality for utf8 strings Oct 18, 2022

alexander-beedie changed the title ~~feat(python, rust): new "n_chars" functionality for utf8 strings~~ feat(python, rust): new n_chars functionality for utf8 strings Oct 18, 2022

feat(python, rust): add "n_chars" functions for utf8 strings

5011de8

ritchie46 merged commit f8d60ee into pola-rs:master Oct 18, 2022

alexander-beedie deleted the string-chars-count branch October 20, 2022 19:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(python, rust): new `n_chars` functionality for utf8 strings #5252

feat(python, rust): new `n_chars` functionality for utf8 strings #5252

alexander-beedie commented Oct 18, 2022 •

edited

Loading

ritchie46 commented Oct 18, 2022

alexander-beedie commented Oct 18, 2022 •

edited

Loading

fangyinc commented Oct 18, 2022

alexander-beedie commented Oct 18, 2022 •

edited

Loading

ritchie46 commented Oct 18, 2022

alexander-beedie commented Oct 18, 2022 •

edited

Loading

feat(python, rust): new n_chars functionality for utf8 strings #5252

feat(python, rust): new n_chars functionality for utf8 strings #5252

Conversation

alexander-beedie commented Oct 18, 2022 • edited Loading

ritchie46 commented Oct 18, 2022

alexander-beedie commented Oct 18, 2022 • edited Loading

fangyinc commented Oct 18, 2022

alexander-beedie commented Oct 18, 2022 • edited Loading

ritchie46 commented Oct 18, 2022

alexander-beedie commented Oct 18, 2022 • edited Loading

feat(python, rust): new `n_chars` functionality for utf8 strings #5252

feat(python, rust): new `n_chars` functionality for utf8 strings #5252

alexander-beedie commented Oct 18, 2022 •

edited

Loading

alexander-beedie commented Oct 18, 2022 •

edited

Loading

alexander-beedie commented Oct 18, 2022 •

edited

Loading

alexander-beedie commented Oct 18, 2022 •

edited

Loading