New fixed_regex_linter #1021

MichaelChirico · 2022-03-28T22:43:37Z

Part of #962

Saved this one for last because I expected it to be a pain... the work setup is not very friendly to compiling packages with C/C++ code locally.

C/C++ means it comes with some overhead (?) of adding low-level code to the package. I'm not sure how comfortable you @AshesITR are with C/C++, but I think it's pretty straightforward as far as C code goes. Basically, I started writing R code that I knew would look 90% the same and actually be more natural (and faster) in C, so I ported it there.

NB: originally, it's implemented in C++ because google is very C++-heavy, but the code is 99% the same in C vs. C++ and R is C-friendlier, so made the switch here.

cc @jimhester here as well who has more experience with C/C++ and might weigh in on utility here.

AshesITR · 2022-03-29T12:08:12Z

I'm fine with C code, but I think we should first convert the STR_CONST nodes to the actual R string, to avoid having to handle R quoting and (raw) string literals.
One think to keep in mind is that installing packages with compiled code from source can be more challenging to the end user, so I'd ideally like to see a performance comparison of C vs. a pure R solution.

MichaelChirico · 2022-03-29T15:12:51Z

think we should first convert the STR_CONST nodes to the actual R string, to avoid having to handle R quoting and (raw) string literals.

the thinking is why bother doing any string processing in R when we can just jump straight to C instead

I'd ideally like to see a performance comparison of C vs. a pure R solution.

it should basically be a comparison of R-level and C-level for loops right?

installing packages with compiled code from source can be more challenging to the end user

my sense is that's more a problem when using exotic libraries / system dependencies but here is pretty pure C+R API. maybe windows users?

AshesITR · 2022-03-29T15:18:05Z

think we should first convert the STR_CONST nodes to the actual R string, to avoid having to handle R quoting and (raw) string literals.

the thinking is why bother doing any string processing in R when we can just jump straight to C instead

Because eval(xml_text(node)) will allow us to delegate the unescaping logic to R core without having to rewrite it in C, lessening the maintenance burden.

I'd ideally like to see a performance comparison of C vs. a pure R solution.

it should basically be a comparison of R-level and C-level for loops right?

Maybe the R solution can do smart things with regexes to detect fixed regexes?

installing packages with compiled code from source can be more challenging to the end user

my sense is that's more a problem when using exotic libraries / system dependencies but here is pretty pure C+R API. maybe windows users?

Windows users need RTools (a) installed and (b) properly configured to be able to install anything from source. Exotic libraries do increase the pain even more, though, since it requires operating pacman within the MSYS environment in RTools.
I've also had pain on macOS with specific libraries (OpenMP), but plain C code usually worked.
linux is the only OS I never had trouble building things from source.

MichaelChirico · 2022-03-30T01:56:19Z

Tried re-writing is_not_regex using pure R.

for an input str = "'^x'" with length 1, the C version is 10x faster; for an input str with length 100 (just rep() that str 100 times), the C version is 150x faster; length 1000 --> 600x.

for a longer pattern, str = paste(c(letters, LETTERS), collapse = ""), the C version is 120x faster for length-1 input; length 10-->400x; length 100 --> 800x.

of course, for typical usage, there will only be 1 or up to a handful of patterns found on a given expression, but the really bad scaling w.r.t. pattern length is more damning.

MichaelChirico · 2022-03-30T02:04:37Z

Because eval(xml_text(node)) will allow us to delegate the unescaping logic to R core without having to rewrite it in C, lessening the maintenance burden.

I think this is much more important for raw strings, but we have several other linters that need a pass to update logic for raw strings; lets do that all at once instead

AshesITR · 2022-03-30T05:52:59Z

If I understand correctly, the actual R string must match a regular expression like ((?<![\\\[])(?:\\\\)*\\[a-zA-Z0-9<>]|(?<![\\\[])(?:\\\\)*[\^${(.*+?|]|(?<![\\\[])(?:\\\\)*\[[^\]]{2,}\]), right?
If so, that should be very fast using grepl().

MichaelChirico · 2022-03-30T06:06:45Z

it's a fair amount more than that -- e.g. the linter handles escapes. see the tests

AshesITR · 2022-03-30T06:09:18Z

Can you test my regex? I'm pretty sure regex escapes are handled by the negative lookbehind. Maybe it's needed in more places in the regex.

AshesITR · 2022-03-30T06:12:50Z

e.g.:

grepl(r"(((?<![\\\[])(?:\\\\)*\\[a-zA-Z0-9<>]|(?<![\\\[])(?:\\\\)*[\^${(.*+?|]|(?<![\\\[])(?:\\\\)*\[[^\]]{2,}\]))", c("a[.]", "a\\[ab]", "[ab]", "\\w", "\\\\w", "\\\\\\w"), perl = TRUE)

MichaelChirico · 2022-03-30T18:03:18Z

it doesn't pass the tests

[ FAIL 11 | WARN 0 | SKIP 0 | PASS 39 ]

I don't doubt regex can be used -- in fact that's how I started writing this linter long ago. but as you see it quickly spirals out of control. I don't think rex::rex() will really be of much help in taming that ungodly beast of a regex either.

AshesITR · 2022-03-30T18:39:06Z

I still see no reason why fixed regexes shouldn't be a regular language, algorithmically.
I'll try to improve the regex a bit more before accepting defeat 😄

AshesITR · 2022-03-30T18:50:15Z

@MichaelChirico I can't repro the failures:

is_not_regex <- function(str, skip_start = FALSE, skip_end = FALSE) {
  # .Call(lintr_is_not_regex, str, skip_start, skip_end)
  # Handle string quoting and escaping using R directly
  str <- vapply(str, function(elem) eval(parse(text = elem)), character(1L))

  rx_fixed_regex <- r"(((?<![\\\[])(?:\\\\)*\\[a-zA-Z0-9<>]|(?<![\\\[])(?:\\\\)*[\^${(.*+?|]|(?<![\\\[])(?:\\\\)*\[[^\]]{2,}\]))"
  !grepl(rx_fixed_regex, str, perl = TRUE)
}

FAIL 0 for me and not at all bad performance:
✔ | 50 | fixed_regex_linter [1.3s]

The C version took 200ms less time.
✔ | 50 | fixed_regex_linter [1.1s]

MichaelChirico · 2022-03-30T18:54:23Z

I still see no reason why fixed regexes shouldn't be a regular language, algorithmically.

I agree, but it's still a mess.

Could you write it without r"" strings? Maybe I escaped it wrong.

AshesITR · 2022-03-30T18:54:59Z

sure:
"((?<![\\\\\\[])(?:\\\\\\\\)*\\\\[a-zA-Z0-9<>]|(?<![\\\\\\[])(?:\\\\\\\\)*[\\^${(.*+?|]|(?<![\\\\\\[])(?:\\\\\\\\)*\\[[^\\]]{2,}\\])"

AshesITR · 2022-03-30T18:57:45Z

I still see no reason why fixed regexes shouldn't be a regular language, algorithmically.

I agree, but it's still a mess.

My pain with compiling C code under non-admin Windows accounts is larger than with cooking up that regex.
If we use it, I'll be sure to decompose it into easily comprehensible parts. glue::glue() and rex::rex() will come in very handy there with keeping the escape mess under control without using raw strings.

MichaelChirico · 2022-03-30T19:11:55Z

I was missing the eval(parse()) part in mine, hence the failures.

OK. You're going to have to write a white paper in the comments for that regex.

To proceed it will be easier to do the following:

(1) Merge as is
(2) Add the R-only version in a PR
(3) Run .dev/compare_branches to check the differences between the implementations
(4) Iterate

AshesITR · 2022-03-30T19:20:41Z

I've pushed my implementation with comments to fixed_regex-R; PTAL.
compare_branches is a good idea! Do I need to create a sibling PR for that or is having the branch around sufficient?

MichaelChirico · 2022-03-30T19:23:19Z

I just checked and we would need to update the script to take two branches as an argument (it doesn't now -- master is the comparison)

AshesITR · 2022-03-30T19:25:17Z

Sounds like a worthy addition?

MichaelChirico · 2022-05-19T18:32:06Z

oops. need approval again here :)

AshesITR · 2022-05-19T18:41:33Z

So the plan is change the merge target of #1032 to this branch and then merge both PRs?

MichaelChirico · 2022-05-19T18:44:31Z

Merge both to master in quick succession:

#1032 (comment)

New fixed_regex_linter

f67e9a0

MichaelChirico added the google-linters label Mar 28, 2022

MichaelChirico mentioned this pull request Mar 28, 2022

Google linters for 3.0.0 release #962

Closed

38 tasks

MichaelChirico added 3 commits March 28, 2022 22:50

fix package name in so

67c9e1d

back-compatible read-only macro

f7954ba

Merge branch 'master' into fixed_regex

4663843

AshesITR mentioned this pull request Mar 30, 2022

implement fixed_regex_linter as plain R + regex #1032

Merged

MichaelChirico and others added 4 commits March 31, 2022 22:37

Merge branch 'master' into fixed_regex

033de02

Merge branch 'master' into fixed_regex

8b47bf8

Merge branch 'master' into fixed_regex

3a0739b

Merge branch 'master' into fixed_regex

b7957e7

MichaelChirico added 10 commits April 9, 2022 10:12

use one-string regex to eliminate false negatives

de4973e

copy tests over from R branch

0943393

T is SYMBOL not NUM_CONST

7486983

extend [$CHAR] handling to handle many more cases

06f8930

test for []]

f33e182

catch unicode escapes outside char classes too

d470a95

more cases that are false positives in the R-only branch

8a9c9f0

Merge branch 'master' into fixed_regex

1e94aa0

try different UTF-16 codepoints

ac2237e

try using the ?Quotes example code point

cdca617

AshesITR mentioned this pull request Apr 26, 2022

Upstreaming linters from Google's internal linting suite #884

Closed

76 tasks

MichaelChirico and others added 8 commits April 26, 2022 21:58

Merge branch 'master' into fixed_regex

968c2d3

more tests

c085a2e

Merge branch 'master' into fixed_regex

9fe1861

merge tests from R branch

ec2cf87

Merge branch 'master' into fixed_regex

4da8d25

source_expression

5b1dc59

typo

b0e62b7

restore tests with single escaping; add comment about perl=TRUE TODO

438ba37

MichaelChirico added this to the 3.0.0 milestone May 16, 2022

MichaelChirico added 4 commits May 16, 2022 16:28

Merge branch 'master' into fixed_regex

00519de

corresponding test

03b7111

Merge branch 'master' into fixed_regex

ca0be28

test parsing

d9d287c

AshesITR approved these changes May 19, 2022

View reviewed changes

AshesITR merged commit 4e17790 into master May 19, 2022

AshesITR deleted the fixed_regex branch May 19, 2022 19:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New fixed_regex_linter #1021

New fixed_regex_linter #1021

MichaelChirico commented Mar 28, 2022 •

edited

Loading

AshesITR commented Mar 29, 2022

MichaelChirico commented Mar 29, 2022

AshesITR commented Mar 29, 2022

MichaelChirico commented Mar 30, 2022 •

edited

Loading

MichaelChirico commented Mar 30, 2022 •

edited

Loading

AshesITR commented Mar 30, 2022 •

edited

Loading

MichaelChirico commented Mar 30, 2022 •

edited

Loading

AshesITR commented Mar 30, 2022

AshesITR commented Mar 30, 2022

MichaelChirico commented Mar 30, 2022

AshesITR commented Mar 30, 2022

AshesITR commented Mar 30, 2022 •

edited

Loading

MichaelChirico commented Mar 30, 2022 •

edited

Loading

AshesITR commented Mar 30, 2022

AshesITR commented Mar 30, 2022

MichaelChirico commented Mar 30, 2022

AshesITR commented Mar 30, 2022

MichaelChirico commented Mar 30, 2022

AshesITR commented Mar 30, 2022

MichaelChirico commented May 19, 2022

AshesITR commented May 19, 2022

MichaelChirico commented May 19, 2022

New fixed_regex_linter #1021

New fixed_regex_linter #1021

Conversation

MichaelChirico commented Mar 28, 2022 • edited Loading

AshesITR commented Mar 29, 2022

MichaelChirico commented Mar 29, 2022

AshesITR commented Mar 29, 2022

MichaelChirico commented Mar 30, 2022 • edited Loading

MichaelChirico commented Mar 30, 2022 • edited Loading

AshesITR commented Mar 30, 2022 • edited Loading

MichaelChirico commented Mar 30, 2022 • edited Loading

AshesITR commented Mar 30, 2022

AshesITR commented Mar 30, 2022

MichaelChirico commented Mar 30, 2022

AshesITR commented Mar 30, 2022

AshesITR commented Mar 30, 2022 • edited Loading

MichaelChirico commented Mar 30, 2022 • edited Loading

AshesITR commented Mar 30, 2022

AshesITR commented Mar 30, 2022

MichaelChirico commented Mar 30, 2022

AshesITR commented Mar 30, 2022

MichaelChirico commented Mar 30, 2022

AshesITR commented Mar 30, 2022

MichaelChirico commented May 19, 2022

AshesITR commented May 19, 2022

MichaelChirico commented May 19, 2022

MichaelChirico commented Mar 28, 2022 •

edited

Loading

MichaelChirico commented Mar 30, 2022 •

edited

Loading

MichaelChirico commented Mar 30, 2022 •

edited

Loading

AshesITR commented Mar 30, 2022 •

edited

Loading

MichaelChirico commented Mar 30, 2022 •

edited

Loading

AshesITR commented Mar 30, 2022 •

edited

Loading

MichaelChirico commented Mar 30, 2022 •

edited

Loading