Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvement for some libcudf regex functions for long strings #13322

Merged
merged 46 commits into from
Jun 23, 2023

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented May 9, 2023

Description

Changes the internal regex logic to minimize character counting to help performance with longer strings. The improvement applies mainly to libcudf regex functions that return strings (i.e. extract, replace, split). The changes here also improve the internal device APIs for clarity to improve maintenance. The most significant change makes the position variables input-only and returning an optional pair to indicate a successful match.

There are some more optimizations that are possible here where character positions are passed back and forth that could be replaced with byte positions to further reduce counting. Initial measurements showed this noticeably slowed down small strings so more analysis is required before continuing this optimization.

Reference: #13048

More Detail

First, there is a change to some internal regex function signatures. Notable the reprog_device::find() and reprog_device::extract() member functions declared in cpp/src/strings/regex/regex.cuh that are used by all the libcudf regex functions. The in/out parameters are now input-only parameters (pass by value) and the return is an optional pair that includes the match result. Also, the begin parameter is now an iterator and the end parameter now has a default. This change requires updating all the definitions and uses of the find and extract member functions.

Using an iterator as the begin parameter allows for some optimizations in the calling code to minimize character counting that may be needed for processing multi-byte UTF-8 characters. Rather than using the cudf::string_view::byte_offset() member function to convert character positions to byte positions, an iterator can be incremented as we traverse through the string which helps reduce some character counting. So the changes here involve removing some calls to byte_offset() and incrementing (really moving) iterators with a pattern like itr += (new_pos - itr.position()); There is another PR #13428 to make a move_to iterator member function.

It is possible to reduce the character counting even more as mentioned above but further optimization requires some deeper analysis.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 9, 2023
@davidwendt davidwendt self-assigned this May 9, 2023
@davidwendt davidwendt changed the base branch from branch-23.06 to branch-23.08 May 24, 2023 20:00
@davidwendt davidwendt changed the title Performance improvement for libcudf regex functions for long strings Performance improvement for some libcudf regex functions for long strings Jun 12, 2023
@davidwendt davidwendt changed the title Performance improvement for some libcudf regex functions for long strings Performance improvement for some libcudf regex functions for long strings Jun 13, 2023
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of disparate changes here. I can't review them in detail, but I wonder what is the relationship between them all? Is there something common between the optimizations?

@davidwendt davidwendt changed the title Performance improvement for some libcudf regex functions for long strings Performance improvement for some libcudf regex functions for long strings Jun 14, 2023
@davidwendt
Copy link
Contributor Author

There are a lot of disparate changes here. I can't review them in detail, but I wonder what is the relationship between them all? Is there something common between the optimizations?

Yes. There are two main changes here. One is the change to the signatures of the two reprog_device member functions find() and extract() and this allows for the 2nd main change which is using the string_view::const_iterator more instead of the string_view::byte_offset() to help minimize character counting. I've also added more detail about both of these changes in the PR description under the More Detail heading.

@davidwendt davidwendt changed the title Performance improvement for some libcudf regex functions for long strings Performance improvement for some libcudf regex functions for long strings Jun 14, 2023
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice speedups!

rapids-bot bot pushed a commit that referenced this pull request Jun 20, 2023
Adds a `move_to()` function the `cudf::string_view::const_iterator` class to help minimize character counting when creating and incrementing the iterator on multi-byte UTF8 characters.
The function simply moves the iterator from the current character position to the given one. This is just a shortcut for the form
```
itr += (new_position - itr.position());
```
This pattern is repeated many times in #13322 and likely future PRs that require the same behavior.
The PR also includes an update to the `string_view::begin()` to set the byte-offset directly rather than waste instructions calculating it.

Authors:
  - David Wendt (https://github.com/davidwendt)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #13428
@mythrocks
Copy link
Contributor

mythrocks commented Jun 20, 2023

I'm reviewing this right now. Please pardon my prior absence.

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of clarifications and nitpicks. Looks good to me.

@davidwendt davidwendt changed the title Performance improvement for some libcudf regex functions for long strings Performance improvement for some libcudf regex functions for long strings Jun 21, 2023
@davidwendt davidwendt requested a review from mythrocks June 22, 2023 12:27
@davidwendt davidwendt changed the title Performance improvement for some libcudf regex functions for long strings Performance improvement for some libcudf regex functions for long strings Jun 22, 2023
@davidwendt davidwendt changed the title Performance improvement for some libcudf regex functions for long strings Performance improvement for some libcudf regex functions for long strings Jun 22, 2023
@davidwendt davidwendt changed the title Performance improvement for some libcudf regex functions for long strings Performance improvement for some libcudf regex functions for long strings Jun 23, 2023
@davidwendt davidwendt changed the title Performance improvement for some libcudf regex functions for long strings Performance improvement for some libcudf regex functions for long strings Jun 23, 2023
@davidwendt
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit f0c62cb into rapidsai:branch-23.08 Jun 23, 2023
@davidwendt davidwendt deleted the regex-byte-interface branch June 23, 2023 21:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants