Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of regexp_count #13011

Closed
alamb opened this issue Oct 18, 2024 · 8 comments · Fixed by #13364
Closed

Improve performance of regexp_count #13011

alamb opened this issue Oct 18, 2024 · 8 comments · Fixed by #13364
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Oct 18, 2024

Is your feature request related to a problem or challenge?

@Omega359 and @xinlifoobar added the regexp_count function in #12970

However, regexp_count seems to be significantly slower than regexp_match and regexp_like. (see #12080 (comment) for a benchmark)

Describe the solution you'd like

I would like to review the benchmark for regexp_count and figure out why it is slower and if so try to improve it

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label Oct 18, 2024
@buraksenn
Copy link
Contributor

I'm not sure I can handle this but want to try. I can unassign if somebody is already working on this

@buraksenn
Copy link
Contributor

take

@Dandandan
Copy link
Contributor

I am wondering if this is not really caused by doing more work - regexp_match only needs to find the first match and can return early, while regexp_count needs to find all matches and return the count.

@Omega359
Copy link
Contributor

Possible @Dandandan however I don't think it would explain the difference with regexp_replace which I would think would have to do even more work.

@Dandandan
Copy link
Contributor

Yeah that is suspicious I agree, I think it is worth checking whether regexp_replace in the benchmark does replace all matches or just the first value.

@buraksenn buraksenn removed their assignment Oct 21, 2024
@jonathanc-n
Copy link
Contributor

I added the global flags for regexp_replace so it actually replaces all the matches:

regexp_count_1000 string
                        time:   [4.7333 ms 4.7585 ms 4.7831 ms]
                        change: [-1.3314% -0.5013% +0.3281%] (p = 0.24 > 0.05)
                        No change in performance detected.

Benchmarking regexp_count_1000 utf8view: Collecting 100 samples in estimated 5.1350 s (1
regexp_count_1000 utf8view
                        time:   [4.7126 ms 4.7726 ms 4.8517 ms]
                        change: [-1.2727% +0.1796% +1.9213%] (p = 0.84 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

Benchmarking regexp_like_1000: Collecting 100 samples in estimated 5.1855 s (2400 iterat
regexp_like_1000        time:   [2.1724 ms 2.1958 ms 2.2264 ms]
                        change: [-1.2309% -0.1167% +1.1757%] (p = 0.87 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe

Benchmarking regexp_match_1000: Collecting 100 samples in estimated 5.0246 s (2000 itera
regexp_match_1000       time:   [2.5040 ms 2.5100 ms 2.5208 ms]
                        change: [-2.0363% -1.1304% -0.3542%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

Benchmarking regexp_replace_1000: Collecting 100 samples in estimated 5.3076 s (1700 ite
regexp_replace_1000     time:   [3.0762 ms 3.0808 ms 3.0855 ms]
                        change: [-3.0060% -1.5322% -0.4700%] (p = 0.01 < 0.05)
                        Change within noise threshold.

There is definitely a change compared to the benchmark taken by @xinlifoobar however the count function is still marginally slower.

@alamb
Copy link
Contributor Author

alamb commented Oct 24, 2024

So is the conclusion we can't reproduce the difference in benchmark numbers, and thus we should close this ticket?

@Omega359
Copy link
Contributor

It's still 50% slower than replace so I think it may be worth investigating at some point. Is it a huge blocker? Unlikely unless this is a hotspot in someone's processing pipeline

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants