Describe the bug
During large string testing, I noticed that we have poor performance scaling of the `str.count` API with string length.
Steps/Code to reproduce bug
```python
import time

import cudf

df = cudf.DataFrame({'a': ['abc', 'cdf']})
for n in [1, 10, 100, 1000, 10000]:
    df['b'] = df['a'].str.repeat(n)
    t0 = time.time()
    _ = df['b'].str.count('a')
    t1 = time.time()
    string_length = int(df['b'].str.len().mean())
    print(f'count on string len {string_length}, took {t1 - t0} seconds')
```

Output:

```
count on string len 3, took 0.0012462139129638672 seconds
count on string len 30, took 0.00027561187744140625 seconds
count on string len 300, took 0.0019071102142333984 seconds
count on string len 3000, took 0.12342023849487305 seconds
count on string len 30000, took 11.811477422714233 seconds
```
Expected behavior
The observed scaling is something like O(n^2), but I would expect scaling more like O(n).
Additional context
This issue is one component of improving performance with long strings in #13048
This is a character count and not a byte count, so resolving multi-byte characters requires some computation.
Regardless, it should be possible to compute the character count in a more parallel way for long strings.
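For context, counting UTF-8 characters amounts to counting the bytes that are not continuation bytes, and each byte can be tested independently. A minimal host-side sketch of that idea (illustrative only, not libcudf code; on the GPU the same per-byte test could feed a parallel reduction such as `thrust::count_if`):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Count UTF-8 characters by counting bytes that are NOT continuation
// bytes (10xxxxxx). Each byte is tested independently of its neighbors,
// which is what makes the count parallelizable as a reduction.
std::size_t count_utf8_chars(std::string const& s)
{
  return static_cast<std::size_t>(
    std::count_if(s.begin(), s.end(), [](char c) {
      return (static_cast<unsigned char>(c) & 0xC0) != 0x80;
    }));
}
```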
I thought this was a call to `cudf::strings::character_count()`, but it is actually a call to `cudf::strings::count_re()`, which is a regex call.
Improves performance of `cudf::strings::count_re` when the pattern starts with a literal character.
Although this is a specific use case, the regex code has special logic to help speed up the search in this case.
Since the pattern indicates the target must contain this character as the start of the matching sequence, it first does a normal find for the character before continuing to match the remaining pattern. The `find()` function can be inefficient for long strings since it is character based and must resolve the character's byte position by counting from the beginning of the string. For a function like `count_re()`, all occurrences are matched within a target, meaning longer target strings can incur expensive counting, as sketched below.
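To illustrate the cost (a hypothetical sketch, not the libcudf source): resolving a character position to a byte offset means walking the UTF-8 bytes from offset zero, so doing this once per match recounts the same prefix over and over.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical illustration: resolving a character position to a byte
// offset requires walking the UTF-8 bytes from the start of the string,
// skipping continuation bytes (10xxxxxx).
std::size_t byte_offset_of(std::string_view s, std::size_t char_pos)
{
  std::size_t byte = 0;
  for (std::size_t ch = 0; ch < char_pos && byte < s.size(); ++ch) {
    ++byte;  // consume the lead byte of the current character
    while (byte < s.size() &&
           (static_cast<unsigned char>(s[byte]) & 0xC0) == 0x80) {
      ++byte;  // skip continuation bytes
    }
  }
  return byte;
}
// A character-based find() pays this cost for every match: N matches over
// an L-byte string costs O(N * L), which behaves like O(L^2) when matches
// are spread throughout the string -- matching the quadratic timings above.
```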
The solution included here is to introduce a more efficient `find_char()` utility that accepts a `string_view::const_iterator`, which automatically keeps track of its byte and character positions. This helps minimize byte/character counting between calls from `count_re()` and other similar functions that make repeated calls for all matches (e.g. `replace_re()` and `split_re()`).
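A minimal sketch of the idea, assuming a simplified host-side iterator shape (the actual implementation works on device via `string_view::const_iterator`): because the iterator carries both positions, each `find_char()` call resumes where the previous match ended instead of recounting from the start of the string.

```cpp
#include <cstddef>
#include <string_view>

// Assumed shape, for illustration only: an iterator that keeps its byte
// and character positions in sync while advancing over UTF-8 data.
struct utf8_iterator {
  std::string_view str;
  std::size_t byte_pos{0};
  std::size_t char_pos{0};

  bool at_end() const { return byte_pos >= str.size(); }

  void operator++()
  {
    ++byte_pos;  // move past the lead byte
    while (byte_pos < str.size() &&
           (static_cast<unsigned char>(str[byte_pos]) & 0xC0) == 0x80) {
      ++byte_pos;  // skip continuation bytes
    }
    ++char_pos;
  }
};

// A find_char() in the spirit of the PR: it scans forward from the
// iterator's current position, so repeated calls (as count_re() makes,
// one per match) never recount the prefix already consumed.
// Assumes an ASCII target, whose byte never collides with a lead byte
// of a multi-byte character at byte_pos.
bool find_char(utf8_iterator& it, char target)
{
  for (; !it.at_end(); ++it) {
    if (it.str[it.byte_pos] == target) { return true; }
  }
  return false;
}
```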
Closes #15567
Authors:
- David Wendt (https://github.com/davidwendt)
Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Nghia Truong (https://github.com/ttnghia)
URL: #15578