Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup regex compiler operators and operands source #10879

Merged

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented May 17, 2022

Cleans up the regcomp.cpp source to fix class names, comments, and simplify logic around processing operators and operands returned by the parser. Several class member variables used for state are moved or eliminated. Some member functions and variables are renamed. Cleanup of the parser logic will be in a follow-on PR.

Reference #3582
Follow on to #10843

@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. tech debt strings strings issues (C++ and Python) improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 17, 2022
@davidwendt davidwendt self-assigned this May 17, 2022
@davidwendt davidwendt changed the title Cleanup regex_compiler operators and operands Cleanup regex compiler operators and operands source May 17, 2022
@codecov
Copy link

codecov bot commented May 17, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@ca4bc97). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 869fa6f differs from pull request most recent head 848c64c. Consider uploading reports for the commit 848c64c to get more accurate results

@@               Coverage Diff               @@
##             branch-22.08   #10879   +/-   ##
===============================================
  Coverage                ?   86.32%           
===============================================
  Files                   ?      144           
  Lines                   ?    22696           
  Branches                ?        0           
===============================================
  Hits                    ?    19593           
  Misses                  ?     3103           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ca4bc97...848c64c. Read the comment docs.

@davidwendt davidwendt changed the base branch from branch-22.06 to branch-22.08 May 24, 2022 21:36
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels May 25, 2022
@davidwendt davidwendt marked this pull request as ready for review May 25, 2022 17:11
@davidwendt davidwendt requested a review from a team as a code owner May 25, 2022 17:11
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice cleanups. The PR description should probably be updated now that #10843 is merged.


static std::vector<int> tokens{STAR, STAR_LAZY, QUEST, QUEST_LAZY, PLUS, PLUS_LAZY, RBRA};
_last_was_and =
std::any_of(tokens.begin(), tokens.end(), [token](auto t) { return t == token; });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess tokens are cheap (and this is host code) so this doesn't really matter, but you could capture the token by reference to avoid copying right? Also, you could iterator using cbegin and cend instead and make the function parameter auto const t.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy by value for int types is preferred over by reference if the value is not changed. I think the cbegin/cend is a good idea. I don't think declaring t as const is that helpful here. The compiler will generate the same code and a human only has one line of code to parse.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm dumb, when I wrote "I guess tokens are cheap" I didn't check the vector above to see exactly what tokens are. Yes, definitely agree that copy by value is preferred here. I also agree that the const is not necessary, was just a minor suggestion.

Should we make the tokens vector constexpr, then? Also, just a note that if we ever switch OperatorType to be a C++-style scoped enum we should probably add a base type specifier that aligns with the type of the tokens vector just to be explicit and avoid type conversions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think constexpr would be more appropriate. Unfortunately you can only declare constexpr for literal or reference types.
So I prefer using static here.
It looks like c++20 will support constexpr with std::vector.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that's a good point. This would probably be a good place to use std::array (with CTAD so that you don't even need to specify the length).

@davidwendt davidwendt requested a review from vyasr May 25, 2022 20:06
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of small follow-ups to my previous review questions, but nothing blocking.

@davidwendt
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 1179e46 into rapidsai:branch-22.08 May 27, 2022
@davidwendt davidwendt deleted the regex-compiler-oper-cleanup branch May 27, 2022 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants