Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New fixed_regex_linter #1021

Merged
merged 34 commits into from
May 19, 2022
Merged

New fixed_regex_linter #1021

merged 34 commits into from
May 19, 2022

Conversation

MichaelChirico
Copy link
Collaborator

@MichaelChirico MichaelChirico commented Mar 28, 2022

Part of #962

Saved this one for last because I expected it to be a pain... the work setup is not very friendly to compiling packages with C/C++ code locally.

C/C++ means it comes with some overhead (?) of adding low-level code to the package. I'm not sure how comfortable you @AshesITR are with C/C++, but I think it's pretty straightforward as far as C code goes. Basically, I started writing R code that I knew would look 90% the same and actually be more natural (and faster) in C, so I ported it there.

NB: originally, it's implemented in C++ because google is very C++-heavy, but the code is 99% the same in C vs. C++ and R is C-friendlier, so made the switch here.

cc @jimhester here as well who has more experience with C/C++ and might weigh in on utility here.

@AshesITR
Copy link
Collaborator

I'm fine with C code, but I think we should first convert the STR_CONST nodes to the actual R string, to avoid having to handle R quoting and (raw) string literals.
One think to keep in mind is that installing packages with compiled code from source can be more challenging to the end user, so I'd ideally like to see a performance comparison of C vs. a pure R solution.

@MichaelChirico
Copy link
Collaborator Author

think we should first convert the STR_CONST nodes to the actual R string, to avoid having to handle R quoting and (raw) string literals.

the thinking is why bother doing any string processing in R when we can just jump straight to C instead

I'd ideally like to see a performance comparison of C vs. a pure R solution.

it should basically be a comparison of R-level and C-level for loops right?

installing packages with compiled code from source can be more challenging to the end user

my sense is that's more a problem when using exotic libraries / system dependencies but here is pretty pure C+R API. maybe windows users?

@AshesITR
Copy link
Collaborator

think we should first convert the STR_CONST nodes to the actual R string, to avoid having to handle R quoting and (raw) string literals.

the thinking is why bother doing any string processing in R when we can just jump straight to C instead

Because eval(xml_text(node)) will allow us to delegate the unescaping logic to R core without having to rewrite it in C, lessening the maintenance burden.

I'd ideally like to see a performance comparison of C vs. a pure R solution.

it should basically be a comparison of R-level and C-level for loops right?

Maybe the R solution can do smart things with regexes to detect fixed regexes?

installing packages with compiled code from source can be more challenging to the end user

my sense is that's more a problem when using exotic libraries / system dependencies but here is pretty pure C+R API. maybe windows users?

Windows users need RTools (a) installed and (b) properly configured to be able to install anything from source. Exotic libraries do increase the pain even more, though, since it requires operating pacman within the MSYS environment in RTools.
I've also had pain on macOS with specific libraries (OpenMP), but plain C code usually worked.
linux is the only OS I never had trouble building things from source.

@MichaelChirico
Copy link
Collaborator Author

MichaelChirico commented Mar 30, 2022

Tried re-writing is_not_regex using pure R.

for an input str = "'^x'" with length 1, the C version is 10x faster; for an input str with length 100 (just rep() that str 100 times), the C version is 150x faster; length 1000 --> 600x.

for a longer pattern, str = paste(c(letters, LETTERS), collapse = ""), the C version is 120x faster for length-1 input; length 10-->400x; length 100 --> 800x.

of course, for typical usage, there will only be 1 or up to a handful of patterns found on a given expression, but the really bad scaling w.r.t. pattern length is more damning.

@MichaelChirico
Copy link
Collaborator Author

MichaelChirico commented Mar 30, 2022

Because eval(xml_text(node)) will allow us to delegate the unescaping logic to R core without having to rewrite it in C, lessening the maintenance burden.

I think this is much more important for raw strings, but we have several other linters that need a pass to update logic for raw strings; lets do that all at once instead

@AshesITR
Copy link
Collaborator

AshesITR commented Mar 30, 2022

If I understand correctly, the actual R string must match a regular expression like ((?<![\\\[])(?:\\\\)*\\[a-zA-Z0-9<>]|(?<![\\\[])(?:\\\\)*[\^${(.*+?|]|(?<![\\\[])(?:\\\\)*\[[^\]]{2,}\]), right?
If so, that should be very fast using grepl().

@MichaelChirico
Copy link
Collaborator Author

MichaelChirico commented Mar 30, 2022

it's a fair amount more than that -- e.g. the linter handles escapes. see the tests

@AshesITR
Copy link
Collaborator

Can you test my regex? I'm pretty sure regex escapes are handled by the negative lookbehind. Maybe it's needed in more places in the regex.

@AshesITR
Copy link
Collaborator

e.g.:

grepl(r"(((?<![\\\[])(?:\\\\)*\\[a-zA-Z0-9<>]|(?<![\\\[])(?:\\\\)*[\^${(.*+?|]|(?<![\\\[])(?:\\\\)*\[[^\]]{2,}\]))", c("a[.]", "a\\[ab]", "[ab]", "\\w", "\\\\w", "\\\\\\w"), perl = TRUE)

@MichaelChirico
Copy link
Collaborator Author

it doesn't pass the tests

[ FAIL 11 | WARN 0 | SKIP 0 | PASS 39 ]

I don't doubt regex can be used -- in fact that's how I started writing this linter long ago. but as you see it quickly spirals out of control. I don't think rex::rex() will really be of much help in taming that ungodly beast of a regex either.

@AshesITR
Copy link
Collaborator

I still see no reason why fixed regexes shouldn't be a regular language, algorithmically.
I'll try to improve the regex a bit more before accepting defeat 😄

@AshesITR
Copy link
Collaborator

AshesITR commented Mar 30, 2022

@MichaelChirico I can't repro the failures:

is_not_regex <- function(str, skip_start = FALSE, skip_end = FALSE) {
  # .Call(lintr_is_not_regex, str, skip_start, skip_end)
  # Handle string quoting and escaping using R directly
  str <- vapply(str, function(elem) eval(parse(text = elem)), character(1L))

  rx_fixed_regex <- r"(((?<![\\\[])(?:\\\\)*\\[a-zA-Z0-9<>]|(?<![\\\[])(?:\\\\)*[\^${(.*+?|]|(?<![\\\[])(?:\\\\)*\[[^\]]{2,}\]))"
  !grepl(rx_fixed_regex, str, perl = TRUE)
}

FAIL 0 for me and not at all bad performance:
✔ | 50 | fixed_regex_linter [1.3s]

The C version took 200ms less time.
✔ | 50 | fixed_regex_linter [1.1s]

@MichaelChirico
Copy link
Collaborator Author

MichaelChirico commented Mar 30, 2022

I still see no reason why fixed regexes shouldn't be a regular language, algorithmically.

I agree, but it's still a mess.

Could you write it without r"" strings? Maybe I escaped it wrong.

@AshesITR
Copy link
Collaborator

sure:
"((?<![\\\\\\[])(?:\\\\\\\\)*\\\\[a-zA-Z0-9<>]|(?<![\\\\\\[])(?:\\\\\\\\)*[\\^${(.*+?|]|(?<![\\\\\\[])(?:\\\\\\\\)*\\[[^\\]]{2,}\\])"

@AshesITR
Copy link
Collaborator

I still see no reason why fixed regexes shouldn't be a regular language, algorithmically.

I agree, but it's still a mess.

My pain with compiling C code under non-admin Windows accounts is larger than with cooking up that regex.
If we use it, I'll be sure to decompose it into easily comprehensible parts. glue::glue() and rex::rex() will come in very handy there with keeping the escape mess under control without using raw strings.

@MichaelChirico
Copy link
Collaborator Author

I was missing the eval(parse()) part in mine, hence the failures.

OK. You're going to have to write a white paper in the comments for that regex.

To proceed it will be easier to do the following:

(1) Merge as is
(2) Add the R-only version in a PR
(3) Run .dev/compare_branches to check the differences between the implementations
(4) Iterate

@AshesITR
Copy link
Collaborator

I've pushed my implementation with comments to fixed_regex-R; PTAL.
compare_branches is a good idea! Do I need to create a sibling PR for that or is having the branch around sufficient?

@MichaelChirico
Copy link
Collaborator Author

I just checked and we would need to update the script to take two branches as an argument (it doesn't now -- master is the comparison)

@AshesITR
Copy link
Collaborator

Sounds like a worthy addition?

@MichaelChirico MichaelChirico added this to the 3.0.0 milestone May 16, 2022
@MichaelChirico
Copy link
Collaborator Author

oops. need approval again here :)

@AshesITR
Copy link
Collaborator

So the plan is change the merge target of #1032 to this branch and then merge both PRs?

@MichaelChirico
Copy link
Collaborator Author

Merge both to master in quick succession:

#1032 (comment)

@AshesITR AshesITR merged commit 4e17790 into master May 19, 2022
@AshesITR AshesITR deleted the fixed_regex branch May 19, 2022 19:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants