[BUG] Poor performance scaling of `str.count` with long strings #15567

GregoryKimball · 2024-04-18T18:40:54Z

Describe the bug
During large string testing, I noticed that we have poor performance scaling of the str.count API with string length.

Steps/Code to reproduce bug

df = cudf.DataFrame({'a':['abc','cdf']})
for n in [1,10,100,1000, 10000]:
    df['b'] = df['a'].str.repeat(n)
    t0 = time.time()
    _ = df['b'].str.count('a')
    t1 = time.time()
    string_length = int(df['b'].str.len().mean())
    print(f'count on string len {string_length}, took {t1-t0} seconds')

count on string len 3, took 0.0012462139129638672 seconds
count on string len 30, took 0.00027561187744140625 seconds
count on string len 300, took 0.0019071102142333984 seconds
count on string len 3000, took 0.12342023849487305 seconds
count on string len 30000, took 11.811477422714233 seconds

Expected behavior
The observed scaling is something like O(n^2), but I would expect scaling more like O(n).

Additional context
This issue is one component of improving performance with long strings in #13048

The text was updated successfully, but these errors were encountered:

davidwendt · 2024-04-18T19:38:10Z

This is a character count and not the number of bytes so resolving multi-byte characters requires some computation.
Regardless, it should be possible to compute the character count in a more parallel way for long strings.

Thought this was a call to cudf::strings::character_count() when it is actually a call to cudf::strings::count_re() which is a regex call.

https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/api/cudf.core.column.string.stringmethods.count/#cudf.core.column.string.StringMethods.count

Regex calls like this one will be difficult to make scale well for long strings as mentioned in #13048

Improves performance of `cudf::strings::count_re` when pattern starts with a literal character. Although this is a specific use case, the regex code has special logic to help speed up the search in this case. Since the pattern indicates the target must contain this character as the start of the matching sequence, it first does a normal find for the character before continuing matching the remaining pattern. The `find()` function can be inefficient for long strings since it is character based and must resolve the character's byte-position by counting from the beginning of the string. For a function like `count_re()` all occurrences are matched within a target meaning longer target strings can incur expensive counting. The solution included here is to introduce a more efficient `find_char()` utility that accepts a `string_view::const_iterator()` which automatically keeps track of its byte and character positions. This helps minimize byte/character counting in between calls from `count_re()` and other similar functions that make repeated calls for all matches (e.g. `replace_re()` and `split_re()`). Close #15567 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Nghia Truong (https://github.com/ttnghia) URL: #15578

GregoryKimball added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Apr 18, 2024

GregoryKimball added this to libcudf Apr 18, 2024

davidwendt self-assigned this Apr 18, 2024

davidwendt mentioned this issue Apr 22, 2024

Improve performance for cudf::strings::count_re #15578

Merged

3 tasks

rapids-bot bot closed this as completed in #15578 Apr 25, 2024

GregoryKimball removed this from libcudf Jul 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Poor performance scaling of `str.count` with long strings #15567

[BUG] Poor performance scaling of `str.count` with long strings #15567

GregoryKimball commented Apr 18, 2024

davidwendt commented Apr 18, 2024 •

edited

Loading

[BUG] Poor performance scaling of str.count with long strings #15567

[BUG] Poor performance scaling of str.count with long strings #15567

Comments

GregoryKimball commented Apr 18, 2024

davidwendt commented Apr 18, 2024 • edited Loading

[BUG] Poor performance scaling of `str.count` with long strings #15567

[BUG] Poor performance scaling of `str.count` with long strings #15567

davidwendt commented Apr 18, 2024 •

edited

Loading