Performance improvement for some libcudf regex functions for long strings #13322

davidwendt · 2023-05-09T20:00:01Z

Description

Changes the internal regex logic to minimize character counting, which improves performance on longer strings. The improvement applies mainly to libcudf regex functions that return strings (i.e. extract, replace, split). The changes also clean up the internal device APIs for clarity and easier maintenance. The most significant change makes the position variables input-only and returns an optional pair to indicate a successful match.

More optimizations are possible here: where character positions are passed back and forth, byte positions could be used instead to further reduce counting. Initial measurements showed this noticeably slowed down small strings, so more analysis is required before continuing this optimization.

Reference: #13048

More Detail

First, there is a change to some internal regex function signatures. Notably, this affects the reprog_device::find() and reprog_device::extract() member functions declared in cpp/src/strings/regex/regex.cuh, which are used by all the libcudf regex functions. The in/out parameters are now input-only parameters (pass by value) and the return is an optional pair that includes the match result. Also, the begin parameter is now an iterator and the end parameter now has a default. This change requires updating all the definitions and uses of the find and extract member functions.
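The shape of this signature change can be illustrated with a small host-side sketch. This is not the libcudf code: `find_match`, `count_matches`, and `match_pair` are hypothetical names, the matcher is a plain substring search rather than a regex, and `std::string_view` stands in for `cudf::string_view`. It only shows the calling convention described above: inputs passed by value, an iterator for the begin position, a defaulted end, and an optional pair as the result.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>
#include <utility>

// [begin, end) positions of a match (hypothetical alias for this sketch)
using match_pair = std::pair<int, int>;

// Inputs are passed by value and never modified; the result is the match
// range, or std::nullopt when nothing matches. end = -1 means "end of text".
std::optional<match_pair> find_match(std::string_view text,
                                     std::string_view needle,
                                     std::string_view::const_iterator begin,
                                     int end = -1)
{
  auto const start = static_cast<std::size_t>(begin - text.begin());
  auto const limit = (end < 0) ? text.size() : static_cast<std::size_t>(end);
  auto const pos   = text.substr(0, limit).find(needle, start);
  if (pos == std::string_view::npos) { return std::nullopt; }
  return match_pair{static_cast<int>(pos), static_cast<int>(pos + needle.size())};
}

// Caller side: the iterator is advanced past each match, and the loop ends
// when the optional comes back empty.
int count_matches(std::string_view text, std::string_view needle)
{
  assert(!needle.empty());  // an empty needle would never advance the iterator
  int count = 0;
  auto itr  = text.begin();
  while (auto const match = find_match(text, needle, itr)) {
    ++count;
    itr = text.begin() + match->second;
  }
  return count;
}
```

With in/out parameters the caller has to reset the positions before every call; returning an optional pair keeps the caller's own position state untouched and makes the success check explicit.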

Using an iterator as the begin parameter allows for some optimizations in the calling code to minimize character counting that may be needed for processing multi-byte UTF-8 characters. Rather than using the cudf::string_view::byte_offset() member function to convert character positions to byte positions, an iterator can be incremented as we traverse through the string, which helps reduce some character counting. So the changes here involve removing some calls to byte_offset() and incrementing (really moving) iterators with a pattern like `itr += (new_pos - itr.position());`. A separate PR, #13428, adds a move_to iterator member function.
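To make the cost difference concrete, here is a minimal host-side sketch, assuming valid UTF-8 input. The names `bytes_in_char`, `byte_offset_from_start`, and `utf8_iterator` are hypothetical and are not the libcudf implementation; the iterator only moves forward. It contrasts recomputing a byte offset from the start of the string with an iterator that remembers both its character position and its byte position.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Number of bytes in the UTF-8 character that starts with this lead byte.
inline int bytes_in_char(unsigned char lead)
{
  if (lead < 0x80) { return 1; }
  if ((lead & 0xE0) == 0xC0) { return 2; }
  if ((lead & 0xF0) == 0xE0) { return 3; }
  return 4;
}

// Converting a character position to a byte offset from scratch walks every
// character before that position, each time it is called.
std::size_t byte_offset_from_start(std::string_view s, int char_pos)
{
  std::size_t bytes = 0;
  for (int i = 0; i < char_pos && bytes < s.size(); ++i) {
    bytes += bytes_in_char(static_cast<unsigned char>(s[bytes]));
  }
  return bytes;
}

// Forward-only iterator that tracks the character and byte positions together,
// so advancing only walks the characters between the old and new positions.
class utf8_iterator {
 public:
  explicit utf8_iterator(std::string_view s) : s_{s} {}

  int position() const { return char_pos_; }
  std::size_t byte_offset() const { return byte_pos_; }

  utf8_iterator& operator+=(int n)
  {
    assert(n >= 0);  // backward moves are left out of this sketch
    for (; n > 0 && byte_pos_ < s_.size(); --n) {
      byte_pos_ += bytes_in_char(static_cast<unsigned char>(s_[byte_pos_]));
      ++char_pos_;
    }
    return *this;
  }

 private:
  std::string_view s_;
  int char_pos_{0};
  std::size_t byte_pos_{0};
};
```

When a loop visits match after match, `itr += (new_pos - itr.position())` touches each character roughly once overall, whereas calling a from-the-start conversion for every match re-reads the prefix of the string each time, which is what hurts on long strings.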

It is possible to reduce the character counting even more as mentioned above but further optimization requires some deeper analysis.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

harrism

There are a lot of disparate changes here. I can't review them in detail, but I wonder what is the relationship between them all? Is there something common between the optimizations?

davidwendt · 2023-06-14T11:59:16Z

> There are a lot of disparate changes here. I can't review them in detail, but I wonder what is the relationship between them all? Is there something common between the optimizations?

Yes. There are two main changes here. One is the change to the signatures of the two reprog_device member functions find() and extract() and this allows for the 2nd main change which is using the string_view::const_iterator more instead of the string_view::byte_offset() to help minimize character counting. I've also added more detail about both of these changes in the PR description under the More Detail heading.

harrism

Nice speedups!

Adds a `move_to()` function to the `cudf::string_view::const_iterator` class to help minimize character counting when creating and incrementing the iterator on multi-byte UTF-8 characters. The function simply moves the iterator from the current character position to the given one. This is just a shortcut for the form

```
itr += (new_position - itr.position());
```

This pattern is repeated many times in #13322 and will likely recur in future PRs that require the same behavior. The PR also includes an update to `string_view::begin()` to set the byte offset directly rather than waste instructions calculating it.

Authors:
- David Wendt (https://github.com/davidwendt)
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Karthikeyan (https://github.com/karthikeyann)

URL: #13428
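The shortcut described above can be sketched as a small generic helper. This is an illustration only: the free function `move_to` and the `char_cursor` type are hypothetical stand-ins (the PR adds a member function on the libcudf iterator), and `char_cursor` counts single-byte characters only so that the example stays self-contained.

```cpp
#include <cassert>

// Works with any iterator-like type that exposes position() and operator+=.
// It is exactly the repeated pattern wrapped in one place.
template <typename Iterator>
Iterator& move_to(Iterator& itr, int new_position)
{
  itr += (new_position - itr.position());
  return itr;
}

// Minimal stand-in cursor for demonstration; it tracks a character position only.
struct char_cursor {
  int pos{0};
  int position() const { return pos; }
  char_cursor& operator+=(int n)
  {
    pos += n;
    return *this;
  }
};
```

Wrapping the subtraction in a named function keeps call sites shorter and leaves only one place to change if the movement logic is tuned later.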

mythrocks · 2023-06-20T17:31:30Z

I'm reviewing this right now. Please pardon my prior absence.

mythrocks

A couple of clarifications and nitpicks. Looks good to me.

cpp/include/cudf/strings/detail/utilities.cuh

cpp/src/strings/contains.cu

cpp/src/strings/extract/extract.cu

cpp/src/strings/regex/regex.cuh

davidwendt · 2023-06-23T21:28:47Z

/merge

Performance improvement for libcudf regex functions for long strings

d798d86

davidwendt added the 2 - In Progress, libcudf, strings, improvement, and non-breaking labels May 9, 2023

davidwendt self-assigned this May 9, 2023

davidwendt added 12 commits May 11, 2023 09:05

Merge branch 'branch-23.06' into regex-byte-interface

4dd7b8f

add experimental workaround for contains

a857000

Merge branch 'branch-23.06' into regex-byte-interface

f8036ae

Merge branch 'branch-23.06' into regex-byte-interface

d70df5d

Merge branch 'branch-23.06' into regex-byte-interface

75b59c1

rework just regex calling logic to minimize character counting

bba83e5

Merge branch 'branch-23.06' into regex-byte-interface

0a98c53

Merge branch 'branch-23.06' into regex-byte-interface

d706051

Merge branch 'branch-23.06' into regex-byte-interface

a5c29eb

Merge branch 'branch-23.06' into regex-byte-interface

8e8e06d

remove unneeded length() calls

7af758c

Merge branch 'branch-23.06' into regex-byte-interface

1dbcf4e

davidwendt mentioned this pull request May 17, 2023

Add move_to function to cudf::string_view iterator #13369

Closed

3 tasks

davidwendt added 2 commits May 18, 2023 17:09

Merge branch 'branch-23.06' into regex-byte-interface

d797fb0

Merge branch 'branch-23.06' into regex-byte-interface

a6e83ca

davidwendt mentioned this pull request May 24, 2023

Add a move_to function to cudf::string_view::const_iterator #13428

Merged

3 tasks

Merge branch 'branch-23.08' into regex-byte-interface

7d035fa

davidwendt changed the base branch from branch-23.06 to branch-23.08 May 24, 2023 20:00

davidwendt added 3 commits May 24, 2023 20:00

Merge branch 'branch-23.08' into regex-byte-interface

1bb235f

fix merge conflict

75dc6dc

Merge branch 'branch-23.08' into regex-byte-interface

091f232

davidwendt mentioned this pull request May 26, 2023

Rework libcudf regex benchmarks with nvbench #13464

Merged

3 tasks

Merge branch 'branch-23.08' into regex-byte-interface

6d1af39

davidwendt added 2 commits June 9, 2023 17:15

Merge branch 'branch-23.08' into regex-byte-interface

43add7b

Merge branch 'branch-23.08' into regex-byte-interface

c8f1b84

davidwendt changed the title ~~Performance improvement for libcudf regex functions for long strings~~ Performance improvement for some libcudf regex functions for long strings Jun 12, 2023

Merge branch 'branch-23.08' into regex-byte-interface

8b331a8

davidwendt changed the title ~~Performance improvement for some libcudf regex functions for long strings~~ Performance improvement for some libcudf regex functions for long strings Jun 13, 2023

harrism reviewed Jun 14, 2023

davidwendt changed the title ~~Performance improvement for some libcudf regex functions for long strings~~ Performance improvement for some libcudf regex functions for long strings Jun 14, 2023

Merge branch 'branch-23.08' into regex-byte-interface

2f95fae

davidwendt changed the title ~~Performance improvement for some libcudf regex functions for long strings~~ Performance improvement for some libcudf regex functions for long strings Jun 14, 2023

harrism approved these changes Jun 14, 2023

Merge branch 'branch-23.08' into regex-byte-interface

905f960

mythrocks reviewed Jun 20, 2023

davidwendt added 2 commits June 20, 2023 18:33

remove unneeded variable declaration

4cb3d83

Merge branch 'branch-23.08' into regex-byte-interface

fd71b2f

davidwendt changed the title ~~Performance improvement for some libcudf regex functions for long strings~~ Performance improvement for some libcudf regex functions for long strings Jun 21, 2023

davidwendt requested a review from mythrocks June 22, 2023 12:27

davidwendt changed the title ~~Performance improvement for some libcudf regex functions for long strings~~ Performance improvement for some libcudf regex functions for long strings Jun 22, 2023

Merge branch 'branch-23.08' into regex-byte-interface

37fdc91

mythrocks approved these changes Jun 22, 2023

davidwendt changed the title ~~Performance improvement for some libcudf regex functions for long strings~~ Performance improvement for some libcudf regex functions for long strings Jun 22, 2023

Merge branch 'branch-23.08' into regex-byte-interface

c5cfba6

davidwendt changed the title ~~Performance improvement for some libcudf regex functions for long strings~~ Performance improvement for some libcudf regex functions for long strings Jun 23, 2023

Merge branch 'branch-23.08' into regex-byte-interface

5abb769

davidwendt changed the title ~~Performance improvement for some libcudf regex functions for long strings~~ Performance improvement for some libcudf regex functions for long strings Jun 23, 2023

rapids-bot bot merged commit f0c62cb into rapidsai:branch-23.08 Jun 23, 2023

davidwendt deleted the regex-byte-interface branch June 23, 2023 21:28
