Cleanup cudf::strings::detail::regex_parser class source #10975

Merged (15 commits) on Jun 15, 2022

Contributor

@davidwendt davidwendt commented May 25, 2022

Cleans up the regex_parser class source to fix member names, comments, and simplify some logic creating parse-items.
Minimal changes were made to the regex_compiler to accommodate some public interface changes.

Also, the regex_compiler::expand_counted() function was moved into regex_parser since it only needs the parser class' _items data. It seemed more apt for the member function to be part of regex_parser than regex_compiler.

Reference #3582

@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. tech debt improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 25, 2022
@davidwendt davidwendt self-assigned this May 25, 2022
} d;
};
std::vector<Item> m_items;
std::vector<regex_parser::Item> expand_counted_items()
Contributor Author

This is not a new function but was simply moved from the regex_compiler class to here since it only needs the regex_parser::_items variable.

@codecov

codecov bot commented May 25, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@77ca025).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08   #10975   +/-   ##
===============================================
  Coverage                ?   86.34%           
===============================================
  Files                   ?      144           
  Lines                   ?    22738           
  Branches                ?        0           
===============================================
  Hits                    ?    19632           
  Misses                  ?     3106           
  Partials                ?        0           


@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels May 27, 2022
@davidwendt davidwendt marked this pull request as ready for review May 27, 2022 18:17
@davidwendt davidwendt requested a review from a team as a code owner May 27, 2022 18:17
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

This is dense, but appears to mostly be renames of members. Looks better than it did for sure.

Contributor

@devavret devavret left a comment

LGTM. minor nits.

int quoted = nextc(yy);
_chr = 0;
char32_t chr = 0;
auto quoted = next_char(chr);
Contributor

Don't know about performance, but if the signature were `cuda::std::pair<bool, char32_t> next_char()` then this would look slightly neater: `auto [is_quoted, chr] = next_char();`
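
To illustrate the suggestion above, here is a minimal host-side sketch of a pair-returning `next_char()` consumed with structured bindings. The `toy_parser` class, its members, and the backslash-quoting rule are assumptions made for illustration; this is not the libcudf `regex_parser`, and it uses `std::pair` rather than `cuda::std::pair` so that it compiles with a plain C++17 compiler.

```cpp
#include <cstddef>
#include <string_view>
#include <utility>

// Hypothetical stand-in for the parser; not the libcudf class.
class toy_parser {
 public:
  explicit toy_parser(std::u32string_view pattern) : _pattern{pattern} {}

  // Returns {is_quoted, chr}.
  // Assumption: a backslash marks the following character as quoted.
  // At end of input this returns {false, 0}.
  std::pair<bool, char32_t> next_char()
  {
    if (_pos >= _pattern.size()) { return {false, 0}; }
    auto const chr = _pattern[_pos++];
    if (chr == U'\\' && _pos < _pattern.size()) { return {true, _pattern[_pos++]}; }
    return {false, chr};
  }

 private:
  std::u32string_view _pattern;
  std::size_t _pos{0};
};
```

A call site then reads `auto [is_quoted, chr] = parser.next_char();` with no output parameter to declare beforehand. Whether this performs differently from the out-parameter form in the real code is not something this sketch measures.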

Comment on lines 172 to 176
int32_t _id_ccls_w{-1}; // alphanumeric
int32_t _id_ccls_W{-1}; // not alphanumeric
int32_t _id_ccls_s{-1}; // space
int32_t _id_ccls_d{-1}; // digit
int32_t _id_ccls_D{-1}; // not digit
Contributor

At the risk of making the code too verbose, can these get vowels and more elaborate docs?
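
One possible shape for what this comment asks for is sketched below. The struct name, the member names, and the reading of `-1` as "not yet created" are all assumptions for illustration; the thread does not say which names, if any, were adopted.

```cpp
#include <cstdint>

// Hypothetical renaming of the members shown above; these identifiers are
// illustrative only and are not taken from libcudf.
struct builtin_class_ids {
  // Id of the built-in class for \w (alphanumeric).
  // Assumption: -1 means the class has not been created yet.
  std::int32_t word_class_id{-1};
  // Id of the built-in class for \W (not alphanumeric).
  std::int32_t not_word_class_id{-1};
  // Id of the built-in class for \s (space).
  std::int32_t space_class_id{-1};
  // Id of the built-in class for \d (digit).
  std::int32_t digit_class_id{-1};
  // Id of the built-in class for \D (not digit).
  std::int32_t not_digit_class_id{-1};
};
```

The trade-off is the one the reviewer names: the longer identifiers are more verbose but no longer depend on the reader decoding `ccls` or telling `_w` from `_W` by letter case alone.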

@davidwendt
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 0d520e9 into rapidsai:branch-22.08 Jun 15, 2022
@davidwendt davidwendt deleted the regex-parser-cleanup branch June 15, 2022 21:51
rapids-bot bot pushed a commit that referenced this pull request Jun 27, 2022
This cleans up the awkward range literals for supporting the `CCLASS` and `NCCLASS` regex instructions. The range values were always paired (first, last) but arranged consecutively in a flat vector, so `[idx]` and `[idx+1]` formed a range pair whenever `idx` was even. This PR introduces a `reclass_range` class that holds the pairs so we can use normal algorithms to manipulate them.
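
The difference between the two layouts described above can be sketched as follows. The `range_pair` struct and both helper functions are hypothetical and written only for illustration; the actual `reclass_range` class in libcudf may look different.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical pair type standing in for the idea behind reclass_range.
struct range_pair {
  char32_t first;
  char32_t last;
};

// Flat layout: [first0, last0, first1, last1, ...].
// The caller has to step by two and keep idx even.
bool in_flat_ranges(std::vector<char32_t> const& flat, char32_t ch)
{
  for (std::size_t idx = 0; idx + 1 < flat.size(); idx += 2) {
    if (ch >= flat[idx] && ch <= flat[idx + 1]) { return true; }
  }
  return false;
}

// Paired layout: each element is one complete range, so a standard
// algorithm can be applied directly.
bool in_ranges(std::vector<range_pair> const& ranges, char32_t ch)
{
  return std::any_of(ranges.begin(), ranges.end(), [ch](range_pair const& r) {
    return ch >= r.first && ch <= r.last;
  });
}
```

With the paired layout, the "stride of two" convention is encoded in the element type instead of in every loop that reads the vector.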

There is some overlap with code changes in PR #10975 

Reference #3582

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - MithunR (https://github.com/mythrocks)

URL: #11045
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change