Stopping at first occurrence #232

skanskan · 2016-10-02T20:37:15Z

When working with vectors, it would be great if functions such as stri_detect_regex had an option to stop as soon as the first occurrence is found
(like grep's -m option).

That would make many operations much faster, such as quickly detecting columns containing some string.

I've also asked for the same in base R's grep, but I guess it will take them years to do it.

skanskan · 2016-10-03T08:56:56Z

Maybe the name could be str_any().

gagolews · 2016-10-03T09:05:32Z

Sounds like a nice idea and not too much work. But:

What use cases do you see for that?
Please provide example inputs and desired outputs.

skanskan · 2016-10-03T09:39:20Z

Imagine I've just read a large CSV file into a data.frame or data.table and I want to set the column classes properly.
I could identify the class of each column by checking which columns contain a string of the form "xx/xx/xxxx", such as "11/09/2014".

If there are many rows, grep needs a long time to process every vector.
But as soon as you detect a match, you don't need to continue checking (assuming the user knows that all elements of a column have the same class).

More generally, this can be beneficial if you want to test many things in many vectors.
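
A minimal sketch of the idea in base R. Note that any_match() is a hypothetical helper (it is not part of base R or stringi), and the chunk size, example data, and date pattern are assumptions chosen only for illustration. It scans the vector in chunks and returns as soon as one chunk contains a match:

```r
# Hypothetical helper: scan x chunk by chunk and stop at the first chunk
# that contains a match, instead of testing every element.
any_match <- function(x, pattern, chunk_size = 1000L) {
  n <- length(x)
  start <- 1L
  while (start <= n) {
    end <- min(start + chunk_size - 1L, n)
    if (any(grepl(pattern, x[start:end]))) return(TRUE)
    start <- end + 1L
  }
  FALSE
}

# Illustrative data: one column of IDs, one column of dates.
df <- data.frame(
  id   = c("a1", "b2", "c3"),
  date = c("11/09/2014", "12/09/2014", "13/09/2014"),
  stringsAsFactors = FALSE
)

date_pattern <- "^[0-9]{2}/[0-9]{2}/[0-9]{4}$"
is_date_col <- vapply(df, any_match, logical(1), pattern = date_pattern)
```

This only works around the missing option: each chunk is still matched in full, so the saving depends on how early the first match appears.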

gagolews · 2016-10-03T09:44:17Z

Do you mean something like:

stri_detect_regex2(c("aaa", "bbb", "ccc"), "bbb")
## [1] FALSE, TRUE, NA # stop at first TRUE, the remainders are unknown
stri_detect_regex2(c("aaa", "bbb", "ccc"), "ddd")
## [1] FALSE, FALSE, FALSE

By the way, did you experience any real situation in practice, where stri_detect_regex was not fast enough?

skanskan · 2016-10-03T09:52:00Z

yes, I mean something like that.

Yes, I'm working with several datasets and in my tests, for this kind of work, stri_detect_regex is not faster than grep.

I've posted an example on Stack Overflow:
http://stackoverflow.com/questions/39817277/detecting-columns-containing-any-value-quickly-with-grep/39817603?noredirect=1#comment66932589_39817603

They gave me some tricks, but I thought it would be much better if grep had that option built in.
As grep is not going to include it soon, I guessed I would have a better chance with stringi or stringr.

gagolews · 2016-10-03T09:54:40Z

OK, this looks easy to implement. Will do that "in due time"

skanskan · 2016-10-03T11:15:37Z

Thank you.

gagolews · 2019-02-07T10:43:56Z

Self-note: add a max_count argument to all stri_detect_* functions?
But the output might not be in order.

gagolews · 2019-02-08T13:32:00Z

DONE.

> stri_detect_regex(c("aaa", "bbb", "ccc"), "bbb", max_count=1)
[1] FALSE  TRUE    NA
> stri_detect_regex(c("aaa", "bbb", "ccc"), "ddd", max_count=1)
[1] FALSE FALSE FALSE
> stri_detect_regex(c("abc", "def", "123", "ghi", "456", "789", "jkl"),
+                   "^[0-9]+$", max_count=1)
[1] FALSE FALSE  TRUE    NA    NA    NA    NA
> stri_detect_regex(c("abc", "def", "123", "ghi", "456", "789", "jkl"),
+                   "^[0-9]+$", max_count=2)
[1] FALSE FALSE  TRUE FALSE  TRUE    NA    NA
> stri_detect_regex(c("abc", "def", "123", "ghi", "456", "789", "jkl"),
+                   "^[0-9]+$", negate=TRUE, max
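
For the column-detection use case raised earlier in this thread, max_count could be used roughly as follows. This is only a sketch: has_match() is a hypothetical wrapper (not a stringi function), the example data and pattern are made up for illustration, and it assumes an installed stringi version that includes the max_count argument.

```r
# Assumes a stringi version whose stri_detect_regex() accepts max_count.
library(stringi)

# Hypothetical wrapper: TRUE if any element of x matches the pattern.
# Elements after the first match come back as NA, so they are dropped
# with na.rm = TRUE before reducing with any().
has_match <- function(x, pattern) {
  any(stri_detect_regex(x, pattern, max_count = 1), na.rm = TRUE)
}

# Illustrative data: one column of IDs, one column of dates.
df <- data.frame(
  id   = c("a1", "b2", "c3"),
  date = c("11/09/2014", "12/09/2014", "13/09/2014"),
  stringsAsFactors = FALSE
)

date_cols <- names(df)[vapply(df, has_match, logical(1),
                              pattern = "^[0-9]{2}/[0-9]{2}/[0-9]{4}$")]
```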

gagolews closed this as completed in 1fcf565 Feb 8, 2019

gagolews added a commit that referenced this issue Feb 8, 2019

more unit tests in #232

367ef53
