Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Poor performance scaling of str.count with long strings #15567

Closed
GregoryKimball opened this issue Apr 18, 2024 · 1 comment · Fixed by #15578
Closed

[BUG] Poor performance scaling of str.count with long strings #15567

GregoryKimball opened this issue Apr 18, 2024 · 1 comment · Fixed by #15578
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)

Comments

@GregoryKimball
Copy link
Contributor

Describe the bug
During large string testing, I noticed that we have poor performance scaling of the str.count API with string length.

Steps/Code to reproduce bug

df = cudf.DataFrame({'a':['abc','cdf']})
for n in [1,10,100,1000, 10000]:
    df['b'] = df['a'].str.repeat(n)
    t0 = time.time()
    _ = df['b'].str.count('a')
    t1 = time.time()
    string_length = int(df['b'].str.len().mean())
    print(f'count on string len {string_length}, took {t1-t0} seconds')
count on string len 3, took 0.0012462139129638672 seconds
count on string len 30, took 0.00027561187744140625 seconds
count on string len 300, took 0.0019071102142333984 seconds
count on string len 3000, took 0.12342023849487305 seconds
count on string len 30000, took 11.811477422714233 seconds

Expected behavior
The observed scaling is something like O(n^2), but I would expect scaling more like O(n).

Additional context
This issue is one component of improving performance with long strings in #13048

@GregoryKimball GregoryKimball added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Apr 18, 2024
@davidwendt davidwendt self-assigned this Apr 18, 2024
@davidwendt
Copy link
Contributor

davidwendt commented Apr 18, 2024

This is a character count and not the number of bytes so resolving multi-byte characters requires some computation.
Regardless, it should be possible to compute the character count in a more parallel way for long strings.

Thought this was a call to cudf::strings::character_count() when it is actually a call to cudf::strings::count_re() which is a regex call.

https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/api/cudf.core.column.string.stringmethods.count/#cudf.core.column.string.StringMethods.count

Regex calls like this one will be difficult to make scale well for long strings as mentioned in #13048

rapids-bot bot pushed a commit that referenced this issue Apr 25, 2024
Improves performance of `cudf::strings::count_re` when pattern starts with a literal character.
Although this is a specific use case, the regex code has special logic to help speed up the search in this case.

Since the pattern indicates the target must contain this character as the start of the matching sequence, it first does a normal find for the character before continuing matching the remaining pattern. The `find()` function can be inefficient for long strings since it is character based and must resolve the character's byte-position by counting from the beginning of the string. For a function like `count_re()` all occurrences are matched within a target meaning longer target strings can incur expensive counting.

The solution included here is to introduce a more efficient `find_char()` utility that accepts a `string_view::const_iterator()` which automatically keeps track of its byte and character positions. This helps minimize byte/character counting in between calls from `count_re()` and other similar functions that make repeated calls for all matches (e.g. `replace_re()` and `split_re()`).

Close #15567

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)

URL: #15578
@GregoryKimball GregoryKimball removed this from libcudf Jul 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants