Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

str.split by an empty string produces incorrect results #14604

Closed
2 tasks done
stinodego opened this issue Feb 20, 2024 · 10 comments · Fixed by #15922
Closed
2 tasks done

str.split by an empty string produces incorrect results #14604

stinodego opened this issue Feb 20, 2024 · 10 comments · Fixed by #15922
Labels
A-dtype-string Area: string data type bug Something isn't working P-low Priority: low python Related to Python Polars

Comments

@stinodego
Copy link
Contributor

stinodego commented Feb 20, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

s = pl.Series(["abc", "ẞ", ""])
r = s.str.split("")
print(r)

Log output

shape: (2,)
Series: '' [list[str]]
[
        ["", "a", … ""]
        ["", "", ""]
        ["", ""]
]

Issue description

When splitting by an empty string, I would expect the string to be split into separate characters. This works, however, the result includes an empty string both at the start and end of the list.

Setting inclusive=True gets rid of the empty string at the end, but not at the start:

import polars as pl

s = pl.Series(["abc", "ẞ", ""])
r = s.str.split("", inclusive=True)
print(r)
shape: (2,)
Series: '' [list[str]]
[
        ["", "a", … "c"]
        ["", "ẞ"]
        [""]
]

Expected behavior

Expected output of the original example would be:

shape: (2,)
Series: '' [list[str]]
[
        ["a", "b", "c"]
        ["ẞ"]
        [""]
]

Installed versions

main

@stinodego stinodego added bug Something isn't working python Related to Python Polars P-low Priority: low A-dtype-string Area: string data type labels Feb 20, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Feb 20, 2024
@stinodego
Copy link
Contributor Author

Seems like this is the default behavior of Rust split though, so... maybe my expectations are incorrect?

fn main()  {
    let v: Vec<&str> = "Hello world!".split("").collect();
    println!("{:?}", v)
}
["", "H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d", "!", ""]

@stinodego stinodego added needs triage Awaiting prioritization by a maintainer and removed P-low Priority: low labels Feb 20, 2024
@etiennebacher
Copy link
Contributor

In case you want to compare to other languages, here's the behavior in R:

strsplit(c("abc", "", ""), "")
#> [[1]]
#> [1] "a" "b" "c"
#> 
#> [[2]]
#> [1] "ẞ"
#> 
#> [[3]]
#> character(0)

stringr::str_split(c("abc", "", ""), "")
#> [[1]]
#> [1] "a" "b" "c"
#> 
#> [[2]]
#> [1] "ẞ"
#> 
#> [[3]]
#> character(0)

@Julian-J-S
Copy link
Contributor

Julian-J-S commented Feb 20, 2024

Also python does not allow an empty separator

"Hello World!".split("")
# > ValueError: empty separator


# python "solution"
list("Hello World!")
# ['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd', '!']

What somehow makes sense because you can't split on "nothing".
You can only split on something you can actually find.

Maybe dont allow emtpy separators and force user to use None to mean split into chars?

@stinodego stinodego removed this from Backlog Feb 20, 2024
@stinodego stinodego added P-medium Priority: medium and removed needs triage Awaiting prioritization by a maintainer labels Feb 20, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Feb 20, 2024
@stinodego stinodego added P-low Priority: low and removed P-medium Priority: medium labels Feb 20, 2024
@stinodego
Copy link
Contributor Author

stinodego commented Feb 20, 2024

I discussed this with @orlp , and we indeed want to go for the expected behavior listed in the issue description.

The empty string input will be a special case that splits the string into its characters. Splitting an empty string this way will result in a list containing one empty string.

@cmdlineluser
Copy link
Contributor

I had resorted to using .extract_all for this (although it produces an empty list for the last case).

s.str.extract_all("(?s).")
# shape: (3,)
# Series: '' [list[str]]
# [
#     ["a", "b", "c"]
#     ["ẞ"]
#     []
# ]

@Julian-J-S
Copy link
Contributor

I discussed this with @orlp , and we indeed want to go for the expected behavior listed in the issue description.

The empty string input will be a special case that splits the string into its characters. Splitting an empty string this way will result in a list containing one empty string.

@stinodego
I am not convinced that splitting an emtpy string should return a list containing one emtpy string. I think this is not the expected behaviour but should be an emtpy list just like the example from @cmdlineluser.

There is no "split by nothing" so this special use case would instead mean "iterate over chars" I would assume?!

python

for line in ["abc", "ß", ""]:
    print(f'{line:5} -> {list(line)}')

# abc   -> ['a', 'b', 'c']
# ß     -> ['ß']
#       -> []

rust

vec!["abc", "ß", ""]
.iter()
.map(|line| line.chars().collect::<Vec<char>>())
.collect::<Vec<_>>();

// [['a', 'b', 'c'], ['ß'], []]

@orlp
Copy link
Collaborator

orlp commented Feb 21, 2024

@JulianCologne I feel it's a bit of a 0 to the 0th power situation. Is that 1 or is that 0? It depends from which side you approach "".split(""):

"bar".split("") -> ["b", "a", "r"]  # Desired as discussed.
"ba".split("") -> ["b", "a"]  # Desired as discussed.
"b".split("") -> ["b"]  # Desired as discussed.
"".split("") ->  ?
"".split("b") -> [""]  # Defined by Python, want to be consistent with.
"".split("ba") -> [""]  # Defined by Python, want to be consistent with.
"".split("bar") -> [""]  # Defined by Python, want to be consistent with.

@Julian-J-S
Copy link
Contributor

@orlp Interesting thoughts, however...

splitting by nothing is not defined.
So the new idea becomes "iterate over the chars" and imo the expected behaviour is to have as many items in the list as the text is long

List length should be equal to utf8-char-count

  • "bar" -> 3 chars
  • "ba" -> 2 chars
  • "b" -> 1 chars
  • "" -> 0 chars

Also your second half examples have a different meaning!
Splitting by something that is not there will result in the original string being returned

  • "XXX".split('abc') -> ['XXX']

Logic 1) if sep is empty -> special case -> list of all chars

"bar".split("") -> ["b", "a", "r"] # length: 3
"ba".split("") -> ["b", "a"] # length: 2
"b".split("") -> ["b"] # length: 1
"".split("") -> [] # length: 0

Logic 2) if sep is not found -> special case -> keep original string

"XXX".split("bar") -> ["XXX"]
"XXX".split("ba") -> ["XXX"]
"XXX".split("b") -> ["XXX"]
"XXX".split("") -> ["X", "X", "X"] # Different case! Cannot search for emtpy string so requires different logic from above! :)

Conclusion

"".split("") muss follow "Logic 1" as you cannot check for an emtpy string in your text.

@stinodego
Copy link
Contributor Author

I would actually tend to agree with @JulianCologne here - returning an empty list in that special case would be more useful.

@orlp
Copy link
Collaborator

orlp commented Apr 7, 2024

I'm fine with it, let's make it an empty list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-dtype-string Area: string data type bug Something isn't working P-low Priority: low python Related to Python Polars
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants