Improve performance of `regexp_count` #13011

alamb · 2024-10-18T20:23:17Z

Is your feature request related to a problem or challenge?

@Omega359 and @xinlifoobar added the regexp_count function in #12970

However, regexp_count seems to be significantly slower than regexp_match and regexp_like. (see #12080 (comment) for a benchmark)

Describe the solution you'd like

I would like to review the benchmark for regexp_count and figure out why it is slower and if so try to improve it

Describe alternatives you've considered

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

buraksenn · 2024-10-21T13:12:43Z

I'm not sure I can handle this but want to try. I can unassign if somebody is already working on this

buraksenn · 2024-10-21T13:12:46Z

take

Dandandan · 2024-10-21T13:37:27Z

I am wondering if this is not really caused by doing more work - regexp_match only needs to find the first match and can return early, while regexp_count needs to find all matches and return the count.

Omega359 · 2024-10-21T13:38:38Z

Possible @Dandandan however I don't think it would explain the difference with regexp_replace which I would think would have to do even more work.

Dandandan · 2024-10-21T13:52:36Z

Yeah that is suspicious I agree, I think it is worth checking whether regexp_replace in the benchmark does replace all matches or just the first value.

jonathanc-n · 2024-10-22T23:00:13Z

I added the global flags for regexp_replace so it actually replaces all the matches:

regexp_count_1000 string
                        time:   [4.7333 ms 4.7585 ms 4.7831 ms]
                        change: [-1.3314% -0.5013% +0.3281%] (p = 0.24 > 0.05)
                        No change in performance detected.

Benchmarking regexp_count_1000 utf8view: Collecting 100 samples in estimated 5.1350 s (1
regexp_count_1000 utf8view
                        time:   [4.7126 ms 4.7726 ms 4.8517 ms]
                        change: [-1.2727% +0.1796% +1.9213%] (p = 0.84 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

Benchmarking regexp_like_1000: Collecting 100 samples in estimated 5.1855 s (2400 iterat
regexp_like_1000        time:   [2.1724 ms 2.1958 ms 2.2264 ms]
                        change: [-1.2309% -0.1167% +1.1757%] (p = 0.87 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe

Benchmarking regexp_match_1000: Collecting 100 samples in estimated 5.0246 s (2000 itera
regexp_match_1000       time:   [2.5040 ms 2.5100 ms 2.5208 ms]
                        change: [-2.0363% -1.1304% -0.3542%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

Benchmarking regexp_replace_1000: Collecting 100 samples in estimated 5.3076 s (1700 ite
regexp_replace_1000     time:   [3.0762 ms 3.0808 ms 3.0855 ms]
                        change: [-3.0060% -1.5322% -0.4700%] (p = 0.01 < 0.05)
                        Change within noise threshold.

There is definitely a change compared to the benchmark taken by @xinlifoobar however the count function is still marginally slower.

alamb · 2024-10-24T13:59:09Z

So is the conclusion we can't reproduce the difference in benchmark numbers, and thus we should close this ticket?

Omega359 · 2024-10-24T14:16:20Z

It's still 50% slower than replace so I think it may be worth investigating at some point. Is it a huge blocker? Unlikely unless this is a hotspot in someone's processing pipeline

alamb added the enhancement New feature or request label Oct 18, 2024

alamb mentioned this issue Oct 18, 2024

feat: Add regexp_count function #12970

Merged

github-actions bot assigned buraksenn Oct 21, 2024

buraksenn removed their assignment Oct 21, 2024

Dimchikkk mentioned this issue Nov 11, 2024

improve performance of regexp_count #13364

Merged

comphead closed this as completed in #13364 Nov 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance of `regexp_count` #13011

Improve performance of `regexp_count` #13011

alamb commented Oct 18, 2024

buraksenn commented Oct 21, 2024

buraksenn commented Oct 21, 2024

Dandandan commented Oct 21, 2024

Omega359 commented Oct 21, 2024

Dandandan commented Oct 21, 2024

jonathanc-n commented Oct 22, 2024

alamb commented Oct 24, 2024

Omega359 commented Oct 24, 2024

Improve performance of regexp_count #13011

Improve performance of regexp_count #13011

Comments

alamb commented Oct 18, 2024

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

buraksenn commented Oct 21, 2024

buraksenn commented Oct 21, 2024

Dandandan commented Oct 21, 2024

Omega359 commented Oct 21, 2024

Dandandan commented Oct 21, 2024

jonathanc-n commented Oct 22, 2024

alamb commented Oct 24, 2024

Omega359 commented Oct 24, 2024

Improve performance of `regexp_count` #13011

Improve performance of `regexp_count` #13011