stopping at first occurrence #232

Closed
skanskan opened this issue Oct 2, 2016 · 9 comments

Comments

@skanskan

skanskan commented Oct 2, 2016

When working with vectors, it would be great if functions such as stri_detect_regex had an option to stop at the first occurrence
(like grep's -m parameter).

That would make many operations much faster, such as quickly detecting columns that contain some string.

I've also asked for the same in base R grep, but I guess they will need years to do it.

@skanskan
Author

skanskan commented Oct 3, 2016

Maybe the name could be str_any().

@gagolews
Owner

gagolews commented Oct 3, 2016

Sounds like a nice idea and not too much work. But:

  1. What use cases do you see for that?
  2. Provide example inputs and desired outputs.

@skanskan
Author

skanskan commented Oct 3, 2016

Imagine I've just read a large CSV file into a data.frame or data.table and I want to set the column classes properly.
I could identify the class of each column by checking which columns contain a string of the form "xx/xx/xxxx", such as "11/09/2014".

If there are many rows, grep needs a long time to process every vector.
But as soon as you detect a match, you don't need to keep checking (assuming the user knows that all elements of a column are of the same class).

More generally, this can be beneficial if you want to test many things in many vectors.
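
For illustration, the use case above can be approximated in base R today by scanning each column in chunks and returning as soon as one chunk contains a match. This is only a sketch: `any_match`, its `chunk_size` argument, the toy data frame, and the date pattern are all hypothetical names and values chosen for the example, not part of stringi or base R.

```r
# Hypothetical helper: TRUE as soon as any element of `x` matches `pattern`.
# `x` is scanned in chunks, so later chunks are skipped once a match is found.
any_match <- function(x, pattern, chunk_size = 10000L) {
  n <- length(x)
  start <- 1L
  while (start <= n) {
    end <- min(start + chunk_size - 1L, n)
    if (any(grepl(pattern, x[start:end]))) {
      return(TRUE)
    }
    start <- end + 1L
  }
  FALSE
}

# Toy data standing in for a freshly read CSV (all columns are character).
df <- data.frame(
  id    = c("1", "2", "3"),
  visit = c("11/09/2014", "02/10/2014", "15/01/2015"),
  note  = c("ok", "late", "ok"),
  stringsAsFactors = FALSE
)

# Assumed pattern for strings of the form "xx/xx/xxxx".
date_pattern <- "^[0-9]{2}/[0-9]{2}/[0-9]{4}$"

# One logical per column: does the column contain at least one date-like string?
is_date_col <- vapply(df, any_match, logical(1), pattern = date_pattern)
names(df)[is_date_col]
```

The chunk size trades early exit against vectorisation: small chunks stop sooner after a match, large chunks spend less time in the R-level loop.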

@gagolews
Owner

gagolews commented Oct 3, 2016

Do you mean something like:

stri_detect_regex2(c("aaa", "bbb", "ccc"), "bbb")
## [1] FALSE, TRUE, NA # stop at first TRUE, the remainders are unknown
stri_detect_regex2(c("aaa", "bbb", "ccc"), "ddd")
## [1] FALSE, FALSE, FALSE

By the way, did you experience any real situation in practice, where stri_detect_regex was not fast enough?
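
The semantics proposed in this comment can be sketched in base R as follows. `detect_until_first` is a hypothetical name used only for this illustration; the element-by-element loop is slow and is meant to show the intended output, not to serve as an efficient implementation.

```r
# Hypothetical helper mimicking the proposed output: results up to and
# including the first match are filled in; everything after it stays NA.
detect_until_first <- function(x, pattern) {
  out <- rep(NA, length(x))
  for (i in seq_along(x)) {
    out[i] <- grepl(pattern, x[i])
    if (out[i]) break  # stop at the first TRUE; the remainder is unknown
  }
  out
}

detect_until_first(c("aaa", "bbb", "ccc"), "bbb")
detect_until_first(c("aaa", "bbb", "ccc"), "ddd")
```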

@skanskan
Author

skanskan commented Oct 3, 2016

Yes, I mean something like that.

Yes, I'm working with several datasets, and in my tests stri_detect_regex is not faster than grep for this kind of work.

I've posted an example on Stack Overflow:
http://stackoverflow.com/questions/39817277/detecting-columns-containing-any-value-quickly-with-grep/39817603?noredirect=1#comment66932589_39817603

They gave me some tricks, but I thought it would be much better if grep had that option built in.
As grep is not going to include it soon, I guessed I would have a better chance with stringi or stringr.

@gagolews
Owner

gagolews commented Oct 3, 2016

OK, this looks easy to implement. Will do that "in due time"

@skanskan
Author

skanskan commented Oct 3, 2016

Thank you.

@gagolews
Owner

gagolews commented Feb 7, 2019

Self-note: add a max_count argument to all stri_detect_* functions?
But the output might not be in order.

@gagolews
Owner

gagolews commented Feb 8, 2019

DONE.

> stri_detect_regex(c("aaa", "bbb", "ccc"), "bbb", max_count=1)
[1] FALSE  TRUE    NA
> stri_detect_regex(c("aaa", "bbb", "ccc"), "ddd", max_count=1)
[1] FALSE FALSE FALSE
> stri_detect_regex(c("abc", "def", "123", "ghi", "456", "789", "jkl"),
+                   "^[0-9]+$", max_count=1)
[1] FALSE FALSE  TRUE    NA    NA    NA    NA
> stri_detect_regex(c("abc", "def", "123", "ghi", "456", "789", "jkl"),
+                   "^[0-9]+$", max_count=2)
[1] FALSE FALSE  TRUE FALSE  TRUE    NA    NA
> stri_detect_regex(c("abc", "def", "123", "ghi", "456", "789", "jkl"),
+                   "^[0-9]+$", negate=TRUE, max

gagolews added a commit that referenced this issue Feb 8, 2019
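
As a usage sketch for the original "does this vector contain any match" question, the `max_count` argument shown in this thread can be combined with `any()`. The wrapper name `has_match` is hypothetical, and the code assumes a stringi version that includes `max_count`.

```r
library(stringi)

# Hypothetical wrapper: TRUE if at least one element of `x` matches `pattern`.
# With max_count = 1, elements after the first match come back as NA,
# so na.rm = TRUE is needed to collapse the result to a single TRUE/FALSE.
has_match <- function(x, pattern) {
  any(stri_detect_regex(x, pattern, max_count = 1), na.rm = TRUE)
}

has_match(c("aaa", "bbb", "ccc"), "bbb")
has_match(c("aaa", "bbb", "ccc"), "ddd")
```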