str.split by an empty string produces incorrect results #14604
Comments
Seems like this is the default behavior of Rust:
fn main() {
let v: Vec<&str> = "Hello world!".split("").collect();
println!("{:?}", v)
}
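For comparison, Python's `re.split` (3.7 and later) treats an empty pattern in a similar way: the empty string matches at every character boundary, including the very start and the very end, which is where the leading and trailing empty strings come from. This is only a sketch of the general mechanism, not of the Rust or Polars implementation.

```python
import re

text = "abc"

# An empty pattern matches at every boundary: before "a", between each
# pair of characters, and after "c". That is len(text) + 1 matches.
boundaries = [m.start() for m in re.finditer("", text)]

# Splitting at all of those boundaries leaves an empty piece at each end.
pieces = re.split("", text)

print(boundaries)
print(pieces)
```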
|
In case you want to compare to other languages, here's the behavior in R:
strsplit(c("abc", "ẞ", ""), "")
#> [[1]]
#> [1] "a" "b" "c"
#>
#> [[2]]
#> [1] "ẞ"
#>
#> [[3]]
#> character(0)
stringr::str_split(c("abc", "ẞ", ""), "")
#> [[1]]
#> [1] "a" "b" "c"
#>
#> [[2]]
#> [1] "ẞ"
#>
#> [[3]]
#> character(0) |
Also, Python does not allow an empty separator:
"Hello World!".split("")
# > ValueError: empty separator
# python "solution"
list("Hello World!")
# ['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd', '!']
Which somehow makes sense, because you can't split on "nothing". Maybe don't allow empty separators and force the user to use |
I discussed this with @orlp, and we indeed want to go for the expected behavior listed in the issue description. An empty-string separator will be a special case that splits the string into its characters. Splitting an empty string this way will result in a list containing one empty string. |
I had resorted to using:
s.str.extract_all("(?s).")
# shape: (3,)
# Series: '' [list[str]]
# [
# ["a", "b", "c"]
# ["ẞ"]
# []
# ] |
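The regex workaround above can be mirrored with Python's standard `re` module to see why it yields one entry per character and an empty list for an empty string. `chars_via_regex` is a hypothetical helper for illustration, not a Polars function, and Python's `re` is not the regex engine Polars uses.

```python
import re

def chars_via_regex(s: str) -> list:
    # "(?s)" turns on dot-all mode, so "." also matches a newline.
    # findall returns every single-character match in order.
    return re.findall(r"(?s).", s)

results = [chars_via_regex(s) for s in ["abc", "ẞ", ""]]
print(results)
```

Because the pattern needs exactly one character per match, an empty input produces no matches at all, hence the empty list.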
@stinodego There is no "split by nothing" so this special use case would instead mean "iterate over chars" I would assume?!
python
for line in ["abc", "ß", ""]:
    print(f'{line:5} -> {list(line)}')
# abc   -> ['a', 'b', 'c']
# ß     -> ['ß']
#       -> []
rust
vec!["abc", "ß", ""]
    .iter()
    .map(|line| line.chars().collect::<Vec<char>>())
    .collect::<Vec<_>>();
// [['a', 'b', 'c'], ['ß'], []] |
@JulianCologne I feel it's a bit of a 0 to the 0th power situation. Is that 1 or is that 0? It depends on which side you approach it from.
|
@orlp Interesting thoughts, however... splitting by nothing is not defined. The list length should be equal to the UTF-8 character count.
Also, the examples in your second half have a different meaning!
Logic 1) if sep is empty -> special case -> list of all chars
"bar".split("") -> ["b", "a", "r"] # length: 3
"ba".split("") -> ["b", "a"] # length: 2
"b".split("") -> ["b"] # length: 1
"".split("") -> [] # length: 0
Logic 2) if sep is not found -> special case -> keep original string
"XXX".split("bar") -> ["XXX"]
"XXX".split("ba") -> ["XXX"]
"XXX".split("b") -> ["XXX"]
"XXX".split("") -> ["X", "X", "X"] # Different case! Cannot search for an empty string, so this requires different logic from above! :)
Conclusion
|
I would actually tend to agree with @JulianCologne here - returning an empty list in that special case would be more useful. |
I'm fine with it, let's make it an empty list. |
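The behavior agreed on in this thread can be sketched in plain Python as follows. `split_or_chars` is a hypothetical helper written for illustration; it is not the Polars implementation, and it iterates over Unicode code points rather than grapheme clusters.

```python
def split_or_chars(s: str, sep: str) -> list:
    """Hypothetical helper, not part of Polars."""
    if sep == "":
        # Special case: there is nothing to search for, so return the
        # characters. An empty input gives an empty list.
        return list(s)
    # Regular case: str.split keeps the whole string when sep is not found.
    return s.split(sep)

print(split_or_chars("bar", ""))
print(split_or_chars("", ""))
print(split_or_chars("XXX", "bar"))
```

With this rule the list length for an empty separator always equals the number of characters, so no empty strings appear at the start or end.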
Checks
Reproducible example
Log output
Issue description
When splitting by an empty string, I would expect the string to be split into separate characters. This works, however, the result includes an empty string both at the start and end of the list.
Setting inclusive=True gets rid of the empty string at the end, but not at the start:
Expected behavior
Expected output of the original example would be:
Installed versions
main