-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New fixed_regex_linter #1021
New fixed_regex_linter #1021
Conversation
I'm fine with C code, but I think we should first convert the |
the thinking is why bother doing any string processing in R when we can just jump straight to C instead
it should basically be a comparison of R-level and C-level for loops right?
my sense is that's more a problem when using exotic libraries / system dependencies but here is pretty pure C+R API. maybe windows users? |
Because
Maybe the R solution can do smart things with regexes to detect fixed regexes?
Windows users need RTools (a) installed and (b) properly configured to be able to install anything from source. Exotic libraries do increase the pain even more, though, since it requires operating |
Tried re-writing for an input for a longer pattern, of course, for typical usage, there will only be 1 or up to a handful of patterns found on a given expression, but the really bad scaling w.r.t. pattern length is more damning. |
I think this is much more important for raw strings, but we have several other linters that need a pass to update logic for raw strings; lets do that all at once instead |
If I understand correctly, the actual R string must match a regular expression like |
it's a fair amount more than that -- e.g. the linter handles escapes. see the tests |
Can you test my regex? I'm pretty sure regex escapes are handled by the negative lookbehind. Maybe it's needed in more places in the regex. |
e.g.: grepl(r"(((?<![\\\[])(?:\\\\)*\\[a-zA-Z0-9<>]|(?<![\\\[])(?:\\\\)*[\^${(.*+?|]|(?<![\\\[])(?:\\\\)*\[[^\]]{2,}\]))", c("a[.]", "a\\[ab]", "[ab]", "\\w", "\\\\w", "\\\\\\w"), perl = TRUE) |
it doesn't pass the tests
I don't doubt regex can be used -- in fact that's how I started writing this linter long ago. but as you see it quickly spirals out of control. I don't think |
I still see no reason why fixed regexes shouldn't be a regular language, algorithmically. |
@MichaelChirico I can't repro the failures: is_not_regex <- function(str, skip_start = FALSE, skip_end = FALSE) {
# .Call(lintr_is_not_regex, str, skip_start, skip_end)
# Handle string quoting and escaping using R directly
str <- vapply(str, function(elem) eval(parse(text = elem)), character(1L))
rx_fixed_regex <- r"(((?<![\\\[])(?:\\\\)*\\[a-zA-Z0-9<>]|(?<![\\\[])(?:\\\\)*[\^${(.*+?|]|(?<![\\\[])(?:\\\\)*\[[^\]]{2,}\]))"
!grepl(rx_fixed_regex, str, perl = TRUE)
} FAIL 0 for me and not at all bad performance: The C version took 200ms less time. |
I agree, but it's still a mess. Could you write it without |
sure: |
My pain with compiling C code under non-admin Windows accounts is larger than with cooking up that regex. |
I was missing the OK. You're going to have to write a white paper in the comments for that regex. To proceed it will be easier to do the following: (1) Merge as is |
I've pushed my implementation with comments to |
I just checked and we would need to update the script to take two branches as an argument (it doesn't now -- |
Sounds like a worthy addition? |
oops. need approval again here :) |
So the plan is change the merge target of #1032 to this branch and then merge both PRs? |
Merge both to |
Part of #962
Saved this one for last because I expected it to be a pain... the work setup is not very friendly to compiling packages with C/C++ code locally.
C/C++ means it comes with some overhead (?) of adding low-level code to the package. I'm not sure how comfortable you @AshesITR are with C/C++, but I think it's pretty straightforward as far as C code goes. Basically, I started writing R code that I knew would look 90% the same and actually be more natural (and faster) in C, so I ported it there.
NB: originally, it's implemented in C++ because google is very C++-heavy, but the code is 99% the same in C vs. C++ and R is C-friendlier, so made the switch here.
cc @jimhester here as well who has more experience with C/C++ and might weigh in on utility here.