
Blocking behavior when run in loop on multiple threads #1013

Closed
c-antin opened this issue Jun 16, 2023 · 30 comments
@c-antin

c-antin commented Jun 16, 2023

What version of regex are you using?

latest (1.8.4)

Describe the bug at a high level.

When repeatedly replacing a string in a loop on multiple threads (CPU count - 1), the threads seem to block (run in sequence) instead of running in parallel. Something appears to be blocking the threads (shared memory, maybe?).
Tested on two different multi-core machines, both with Windows 10 64-bit.

What are the steps to reproduce the behavior?

cargo run --release -- 1 (runtime 1s) vs cargo run --release -- 6 (runtime 6s)

use std::time::Instant;

use regex::Regex;

fn main() {
    let mut args = std::env::args();
    let _ = args.next();
    let arg1 = args.next();
    if arg1.is_none() {
        panic!("arg1 missing!");
    }
    let n = arg1.unwrap().parse().unwrap();
    let ts = Instant::now();
    let mut handles = Vec::with_capacity(n);
    for _ in 0..n {
        handles.push(std::thread::spawn(|| {
            let mut subject = "#".to_string();
            let search = Regex::new(&regex::escape("#")).unwrap();
            let replace = "benchmark#";

            let ts = Instant::now();
            for _ in 0..38000 {
                subject = search.replace_all(&subject, replace).to_string();
            }
            println!("thread {}", ts.elapsed().as_secs_f32());
        }));
    }

    for handle in handles {
        handle.join().unwrap();
    }
    println!("total {}", ts.elapsed().as_secs_f32());
}

What is the actual behavior?

The total runtime is the sequential runtime.

What is the expected behavior?

The total runtime should be at most the runtime of the slowest thread on a multi-core machine.

@c-antin
Author

c-antin commented Jun 16, 2023

In contrast, str::replace() behaves as expected (though it is slower overall):

use std::time::Instant;

fn main() {
    let mut args = std::env::args();
    let _ = args.next();
    let arg1 = args.next();
    if arg1.is_none() {
        panic!("arg1 missing!");
    }
    let n = arg1.unwrap().parse().unwrap();
    let ts = Instant::now();
    let mut handles = Vec::with_capacity(n);
    for _ in 0..n {
        handles.push(std::thread::spawn(|| {
            let mut subject = "#".to_string();
            let search = "#";
            let replace = "benchmark#";

            let ts = Instant::now();
            for _ in 0..38000 {
                subject = subject.replace(search, replace);
            }
            println!("thread {}", ts.elapsed().as_secs_f32());
        }));
    }

    for handle in handles {
        handle.join().unwrap();
    }
    println!("total {}", ts.elapsed().as_secs_f32());
}

@BurntSushi
Member

I almost immediately marked this as a duplicate of #934, but then noticed that you aren't actually sharing a single regex across multiple threads. Instead, you're creating a new regex inside of each thread. So that's good.

OK, so first, the str::replace case:

$ time ./target/release/i1013 1
thread 2.8771846, len=342001
total 2.87751

real    2.880
user    2.752
sys     0.127
maxmem  10 MB
faults  0

$ time ./target/release/i1013 8
thread 3.0523975, len=342001
thread 3.0524435, len=342001
thread 3.0539224, len=342001
thread 3.0530257, len=342001
thread 3.0521066, len=342001
thread 3.0520856, len=342001
thread 3.054312, len=342001
thread 3.0543876, len=342001
total 3.0549417

real    3.058
user    23.736
sys     0.614
maxmem  11 MB
faults  0

$ time ./target/release/i1013 16
thread 3.056869, len=342001
thread 3.057645, len=342001
thread 3.0561466, len=342001
thread 3.0594864, len=342001
thread 3.0581493, len=342001
thread 3.056206, len=342001
thread 3.0597017, len=342001
thread 3.4257228, len=342001
thread 3.9589233, len=342001
thread 4.4013414, len=342001
thread 4.437958, len=342001
thread 4.438284, len=342001
thread 4.44464, len=342001
thread 4.4500113, len=342001
thread 4.453276, len=342001
thread 4.549546, len=342001
total 4.5521426

real    4.555
user    59.188
sys     0.602
maxmem  20 MB
faults  0

$ time ./target/release/i1013 32
[.. snip ..]

real    8.356
user    3:01.36
sys     1.979
maxmem  43 MB
faults  0

I added a len field to the output to get an idea of how big the haystack is here, since it grows with each iteration, and thus each iteration is (likely) slower than the previous one. I note here that with 8 threads (my CPU has 16 physical cores, with 8 of them capable of hyperthreading) the overall timing is slightly slower than the single-threaded case. So I'd say your benchmark already looks a little suspicious to me: if the threads were doing truly independent work, I'd expect 8 threads to give a speed boost. I then bumped it up to 16, and you can really start to see a slowdown. I did it again at 32, and again, another major slowdown.
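As a sanity check on the shape of this benchmark (my arithmetic, not from the thread): the subject starts as `#` (1 byte), the replacement `benchmark#` contains exactly one `#`, so every pass replaces exactly one occurrence and grows the subject by a net 9 bytes, which lands exactly on the reported `len=342001`:

```rust
fn main() {
    // Subject starts as "#" (1 byte); each replace_all pass swaps the
    // single "#" for "benchmark#", a net growth of 9 bytes per iteration.
    let final_len = 1u64 + 9 * 38_000;
    assert_eq!(final_len, 342_001); // matches the len printed in the runs above
    println!("final haystack length: {final_len}");
}
```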

OK, so let's move to the regex example:

$ time ./target/release/i1013 1
thread 0.55535185, len=342001
total 0.5558321

real    0.558
user    0.294
sys     0.264
maxmem  10 MB
faults  0

$ time ./target/release/i1013 8
thread 0.636936, len=342001
thread 0.63673854, len=342001
thread 0.6364426, len=342001
thread 0.6364415, len=342001
thread 0.6370899, len=342001
thread 0.63735807, len=342001
thread 0.63703513, len=342001
thread 0.63717663, len=342001
total 0.6389583

real    0.642
user    2.780
sys     2.247
maxmem  15 MB
faults  0

$ time ./target/release/i1013 16
thread 0.66286546, len=342001
thread 0.66296595, len=342001
thread 0.6636208, len=342001
thread 0.6610685, len=342001
thread 0.66311973, len=342001
thread 0.6657458, len=342001
thread 0.6727337, len=342001
thread 0.6765716, len=342001
thread 0.74356234, len=342001
thread 0.8505939, len=342001
thread 0.8857657, len=342001
thread 0.8865663, len=342001
thread 0.88712204, len=342001
thread 0.89316446, len=342001
thread 0.9211197, len=342001
thread 0.9504156, len=342001
total 0.957126

real    0.960
user    9.415
sys     2.684
maxmem  29 MB
faults  0

$ time ./target/release/i1013 32
[.. snip ..]

real    3.114
user    43.941
sys     21.019
maxmem  56 MB
faults  0

So here, 8 threads is a little slower than 1 thread, which about lines up with str::replace. And then 16 threads get slower and 32 threads get even slower.

So from where I'm sitting, it looks like str::replace and Regex::replace_all share the same problem: your benchmark doesn't scale.

My next step was to run with 1 thread and 64 threads (to really provoke any contention issues) under a profiler and see where they differ. From a quick look under perf, they look about the same. The 64-thread case is spending a bit more time in to_string and reserve, which makes sense, because you'd expect some contention here to come from the allocator. Because you're allocating memory like a wild man here. My take is that this is probably more of an allocation benchmark than a string replacement benchmark.
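One small, concrete allocation the benchmark pays for on every pass (an aside, not something raised in the thread): replace_all returns a Cow<str>, and calling .to_string() on it copies into a fresh String even when the Cow already owns its buffer, whereas Cow::into_owned() hands the existing buffer back without the extra copy. A minimal std-only sketch of the difference:

```rust
use std::borrow::Cow;

fn main() {
    // An owned Cow, like the one replace_all yields after a replacement.
    let owned: Cow<str> = Cow::Owned(String::from("benchmark#"));
    let buf = owned.as_ptr();

    // into_owned() moves the existing String out: same heap buffer.
    let s = owned.into_owned();
    assert_eq!(s.as_ptr(), buf);

    // By contrast, .to_string() (via Display) would have copied the data
    // into a brand new allocation.
    println!("no copy performed: {}", s);
}
```

So in the repro, writing `subject = search.replace_all(&subject, replace).into_owned();` would shave one allocation per iteration, though it wouldn't change the overall picture.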

The fact that the profiles look the same leads me to make some guesses:

  • Whatever is causing the contention is either not easily visible in the profile or is something that I'm overlooking.
  • That the key difference is in the kernel. Indeed, if you look at sys time, it looks pretty damn high to me.

Okay, so to test my hypothesis, I'll take a look at this program under strace. If it's kernel time, it's probably (not necessarily) going to show up there:

$ strace --follow-forks --summary-only ./target/release/i1013 1
strace: Process 2691969 attached
thread 0.5947787, len=342001
total 0.59618706
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 94.29    0.239578      239578         1           futex
  5.32    0.013516           4      3068           madvise
  0.18    0.000449           1       415           mprotect
  0.16    0.000419           2       140           munmap
  0.05    0.000123           0       151           mmap
  0.00    0.000000           0         6           read
  0.00    0.000000           0         2           write
  0.00    0.000000           0         5           close
  0.00    0.000000           0         1           poll
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         6           rt_sigaction
  0.00    0.000000           0         5           rt_sigprocmask
  0.00    0.000000           0         2           pread64
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         6           sigaltstack
  0.00    0.000000           0         2         1 arch_prctl
  0.00    0.000000           0         2           sched_getaffinity
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         5           openat
  0.00    0.000000           0         4           newfstatat
  0.00    0.000000           0         2           set_robust_list
  0.00    0.000000           0         2           prlimit64
  0.00    0.000000           0         2           getrandom
  0.00    0.000000           0         2           rseq
  0.00    0.000000           0         1           clone3
------ ----------- ----------- --------- --------- ------------------
100.00    0.254085          66      3836         2 total

$ strace --follow-forks --summary-only ./target/release/i1013 64
[.. snip ..]
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 89.36 1592.561657        9893    160968           madvise
  5.73  102.202700        3894     26245           mprotect
  4.47   79.705961      724599       110         9 futex
  0.24    4.313051        8184       527           mmap
  0.16    2.817502        5994       470           munmap
  0.02    0.274966        1069       257           rt_sigprocmask
  0.01    0.205649        3163        65           write
  0.00    0.038651         198       195           sigaltstack
  0.00    0.038407         600        64           clone3
  0.00    0.017135         259        66           sched_getaffinity
  0.00    0.014628         225        65           getrandom
  0.00    0.014367         221        65           set_robust_list
  0.00    0.012189         187        65           rseq
  0.00    0.000772         128         6           read
  0.00    0.000754         150         5           openat
  0.00    0.000146          29         5           close
  0.00    0.000027           4         6           rt_sigaction
  0.00    0.000015           5         3           brk
  0.00    0.000011           5         2           pread64
  0.00    0.000011           2         4           newfstatat
  0.00    0.000009           9         1           poll
  0.00    0.000009           4         2           prlimit64
  0.00    0.000005           5         1           set_tid_address
  0.00    0.000004           2         2         1 arch_prctl
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
------ ----------- ----------- --------- --------- ------------------
100.00 1782.218626        9419    189201        11 total

OK, so your program is absolutely getting hammered by madvise here. To me, that means this isn't regex related and is really just a property of your benchmark. And perhaps not something that is well supported by your system's allocator and/or kernel. But my experience in terms of figuring out why exactly madvise is slow is limited. My own debugging journey mostly stops here. There is one thing left I can do: run an experiment to confirm the problem isn't in the regex engine by removing allocation from the equation.

So what I did was move your replacement code to the beginning of the program, before spawning threads. Then, in each iteration in each thread, I just searched the corresponding subject string and counted the number of matches, without doing any allocation. Here's the code:

use std::{sync::Arc, time::Instant};

use regex::Regex;

fn main() {
    let mut args = std::env::args();
    let _ = args.next();
    let arg1 = args.next();
    if arg1.is_none() {
        panic!("arg1 missing!");
    }
    let n = arg1.unwrap().parse().unwrap();

    const SUBJECT_COUNT: usize = 38_000;
    let mut subjects = vec!["".to_string(); SUBJECT_COUNT];
    let mut subject = "#".to_string();
    let search = Regex::new(&regex::escape("#")).unwrap();
    let replace = "benchmark#";
    for i in 0..SUBJECT_COUNT {
        subjects[i] = subject.clone();
        subject = search.replace_all(&subject, replace).to_string();
    }
    let subjects = Arc::new(subjects);

    let ts = Instant::now();
    let mut handles = Vec::with_capacity(n);
    for _ in 0..n {
        let subjects = Arc::clone(&subjects);
        handles.push(std::thread::spawn(move || {
            let search = Regex::new(&regex::escape("#")).unwrap();
            let mut len = 0;
            let ts = Instant::now();
            for i in 0..SUBJECT_COUNT {
                assert_eq!(1, search.find_iter(&subjects[i]).count());
                len += subjects[i].len();
            }
            println!("thread {}, len={}", ts.elapsed().as_secs_f32(), len);
        }));
    }

    for handle in handles {
        handle.join().unwrap();
    }
    println!("total {}", ts.elapsed().as_secs_f32());
}

$ ./target/release/i1013 1
thread 0.27373123, len=6497867000
total 0.27405682

$ ./target/release/i1013 8
total 0.3998035

$ ./target/release/i1013 16
[.. snip ..]
total 0.83051115

$ ./target/release/i1013 32
[.. snip ..]
total 3.1729655

Wow, okay, that was unexpected. Do we have the same syscall problem?

$ strace --follow-forks --summary-only ./target/release/i1013 64
total 11.690092
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 18.71    0.077004         364       211           mmap
 15.81    0.065070        1251        52         3 futex
 12.34    0.050803           1     28427           brk
  9.77    0.040229         204       197           mprotect
  9.10    0.037454         576        65           rseq
  7.57    0.031170         479        65           set_robust_list
  7.44    0.030637         196       156           munmap
  7.08    0.029141         149       195           sigaltstack
  3.89    0.016013          62       257           rt_sigprocmask
  3.59    0.014775         223        66           getrandom
  2.41    0.009905         152        65           write
  2.08    0.008571         133        64           clone3
  0.09    0.000385         385         1           execve
  0.04    0.000152           2        66           sched_getaffinity
  0.02    0.000072          18         4           openat
  0.02    0.000067          13         5           read
  0.01    0.000027           6         4           newfstatat
  0.01    0.000026           6         4           close
  0.01    0.000026           4         6           rt_sigaction
  0.00    0.000018          18         1         1 access
  0.00    0.000012           6         2         1 arch_prctl
  0.00    0.000011          11         1           poll
  0.00    0.000010           5         2           pread64
  0.00    0.000010           5         2           prlimit64
  0.00    0.000006           0        64           madvise
  0.00    0.000004           4         1           set_tid_address
------ ----------- ----------- --------- --------- ------------------
100.00    0.411598          13     29983         5 total

Hmmm... Nope. The sys time looks quite short overall. So, back to the profiler. Looking at the profile, basically all of the time is being spent inside the core memchr loop... and that's exactly what I would expect:

[profiler screenshot: memchr-01]

And then zooming in shows it's doing what I expect it to:

[profiler screenshot: memchr-02]

So at this point, my brain is heading towards memory access time and/or CPU cache effects. I usually attempt to measure those in aggregate with perf stat, but I'm not seeing any appreciable difference between 1 thread and 64 threads.

At this point, I'm pretty sure this is not a problem in the regex crate. I'll also note that I find your benchmark design to be extremely pathological. Your core loop does more work on each iteration, essentially growing the subject string bigger and bigger. At least the rate of growth is linear, but it's still really weird.

How did you get to this point? Does the benchmark you've shared here actually model your real world use case accurately?

@hodgesds

I believe your test is biased towards benchmarking allocations rather than the actual work of the regex.

First, use a bpftrace script to measure allocations (please double-check the syntax, as it's late Friday).

single core:

$  bpftrace -e 'tracepoint:kmem:*alloc* {@[probe,cpid] = count(); }' -c 'target/release/regex 1'
Attaching 5 probes...
thread 1.1783528
total 1.179049

@[tracepoint:kmem:mm_page_alloc_zone_locked, 2655106]: 252
@[tracepoint:kmem:kmalloc, 2655106]: 573
@[tracepoint:kmem:kmem_cache_alloc, 2655106]: 1727
@[tracepoint:kmem:mm_page_alloc, 2655106]: 375304

multiple cores:

$ bpftrace -e 'tracepoint:kmem:*alloc* {@[probe,cpid] = count(); }' -c 'target/release/regex 4'
Attaching 5 probes...
thread 4.1351075
thread 4.135858
thread 4.137098
thread 4.1385818
total 4.1394463

@[tracepoint:kmem:mm_page_alloc_zone_locked, 2654615]: 5110
@[tracepoint:kmem:kmem_cache_alloc, 2654615]: 16079
@[tracepoint:kmem:kmalloc, 2654615]: 16385
@[tracepoint:kmem:mm_page_alloc, 2654615]: 1282717

Observe the much higher allocation counts.

Next, run perf stat for some generic perf data:

$ perf stat  target/release/regex 1
thread 1.1153394
total 1.116027

 Performance counter stats for 'target/release/regex 1':

          1,113.55 msec task-clock                #    0.995 CPUs utilized
               437      context-switches          #  392.438 /sec
                17      cpu-migrations            #   15.266 /sec
           372,328      page-faults               #  334.361 K/sec
     3,597,145,405      cycles                    #    3.230 GHz
       168,577,155      stalled-cycles-frontend   #    4.69% frontend cycles idle
     1,547,334,224      stalled-cycles-backend    #   43.02% backend cycles idle
     4,575,551,050      instructions              #    1.27  insn per cycle
                                                  #    0.34  stalled cycles per insn
       737,651,519      branches                  #  662.431 M/sec
         2,936,354      branch-misses             #    0.40% of all branches

       1.119224078 seconds time elapsed

       0.598586000 seconds user
       0.508347000 seconds sys
    
$  perf stat  target/release/regex 4
thread 3.923317
thread 3.93273
thread 3.9353752
thread 3.9382114
total 3.9394941

 Performance counter stats for 'target/release/regex 4':

         15,129.91 msec task-clock                #    3.837 CPUs utilized
            10,494      context-switches          #  693.593 /sec
                50      cpu-migrations            #    3.305 /sec
         1,386,717      page-faults               #   91.654 K/sec
    39,830,593,731      cycles                    #    2.633 GHz
     2,120,262,502      stalled-cycles-frontend   #    5.32% frontend cycles idle
     5,361,095,960      stalled-cycles-backend    #   13.46% backend cycles idle
    17,993,713,805      instructions              #    0.45  insn per cycle
                                                  #    0.30  stalled cycles per insn
     2,895,171,170      branches                  #  191.354 M/sec
        18,850,435      branch-misses             #    0.65% of all branches

       3.942821196 seconds time elapsed

       7.553263000 seconds user
       7.103674000 seconds sys

Backend cycles are heavily stalled on the single-core run, which could imply that the program is waiting on memory access. The run with more cores is less stalled, as there is more contention across the cores. Also, in general: more allocations -> more cycles.

Similar to the profiles above, perf record can be used:

perf record --call-graph dwarf -g  target/release/regex 4
thread 3.9480581
thread 3.9485035
thread 3.9549568
thread 3.9550252
total 3.9564981
[ perf record: Woken up 1702 times to write data ]
[ perf record: Captured and wrote 458.627 MB perf.data (56707 samples) ]

View the report with perf report --inline -v -g:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 58K of event 'cycles'
# Event count (approx.): 36623009650
#
# Children      Self  Command  Shared Object         Symbol                                                                                                                                                          
# ........  ........  .......  ....................  ......................................................................................................................................................................
#
    99.90%     0.00%  regex    libc.so.6             [.] 0x00007f08b168d65b
            |
            ---0x7f08b168d65b
               0x7f08b160c3e9
               std::sys::unix::thread::Thread::new::thread_start
               |
                --99.89%--core::ops::function::FnOnce::call_once{{vtable.shim}}
                          |
                           --99.89%--std::sys_common::backtrace::__rust_begin_short_backtrace
                                     |
                                     |--40.47%--regex::re_unicode::Regex::replace_all
                                     |          |
                                     |          |--10.52%--<regex::re_unicode::Matches as core::iter::traits::iterator::Iterator>::next
                                     |          |          |
                                     |          |           --10.50%--<regex::re_trait::Matches<R> as core::iter::traits::iterator::Iterator>::next
                                     |          |                     |
                                     |          |                      --10.29%--memchr::memchr::x86::avx::memchr
                                     |          |
                                     |          |--7.85%--alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
                                     |          |          |
                                     |          |           --7.82%--alloc::raw_vec::finish_grow
                                     |          |                     |
                                     |          |                      --7.78%--realloc
                                     |          |                                |
                                     |          |                                |--6.86%--0x7f08b161bf12
                                     |          |                                |          |
                                     |          |                                |          |--2.22%--0x7f08b16d3845
                                     |          |                                |          |
                                     |          |                                |           --1.61%--0x7f08b16d383b
                                     |          |                                |
                                     |          |                                 --0.58%--0x7f08b161bed1
                                     |          |                                           |
                                     |          |                                            --0.51%--0x7f08b161bafd
                                     |          |
                                     |          |--5.61%--0x7f08b16d3845
                                     |          |          |
                                     |          |           --1.00%--asm_exc_page_fault
                                     |          |                     |
                                     |          |                      --0.91%--exc_page_fault
                                     |          |                                do_user_addr_fault
                                     |          |                                |
                                     |          |                                 --0.81%--handle_mm_fault

@c-antin
Author

c-antin commented Jun 17, 2023

It seems to be Windows-related only (tested on another Windows 10 machine and on a macOS machine). It's probably a race for memory. Any tips on reducing the heap allocations without changing the overall result?
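One option (a hypothetical sketch, not from the thread) is to keep two buffers and swap them each pass, so the replacement writes into already-reserved capacity instead of allocating a fresh String every iteration. Shown here with std's `match_indices` for self-containment; `Regex::find_iter`, which yields matches with `start()`/`end()`, slots into the same loop:

```rust
fn main() {
    let replace = "benchmark#";
    let mut subject = String::from("#");
    let mut scratch = String::new();

    for _ in 0..1_000 {
        scratch.clear(); // keeps its capacity from previous passes
        let mut last = 0;
        for (start, m) in subject.match_indices('#') {
            scratch.push_str(&subject[last..start]);
            scratch.push_str(replace);
            last = start + m.len();
        }
        scratch.push_str(&subject[last..]);
        // Swap instead of reallocating: both buffers retain their capacity.
        std::mem::swap(&mut subject, &mut scratch);
    }

    // 1 byte to start, plus 9 bytes of net growth per iteration.
    assert_eq!(subject.len(), 1 + 9 * 1_000);
}
```

After the first few iterations the buffers stop growing their capacity at all, so the per-iteration heap traffic drops to (amortized) zero while the replacement work stays the same.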

@c-antin
Author

c-antin commented Jun 17, 2023

I've done further testing with the onig crate (had to install LLVM + LIBCLANG_PATH=...\LLVM\bin + default-features = false):
cargo run --release -- 1:

thread 11.220713
total 11.22415

cargo run --release -- 5:

thread 13.780354
thread 13.904895
thread 13.947433
thread 14.0239725
thread 14.159876
total 14.164065

Even though it's 11x slower, there is only some contention (as expected), instead of a sequential runtime of 55sec.

use std::time::Instant;

use onig::Regex;

fn main() {
    let mut args = std::env::args();
    let _ = args.next();
    let arg1 = args.next();
    if arg1.is_none() {
        panic!("arg1 missing!");
    }
    let n = arg1.unwrap().parse().unwrap();
    let ts = Instant::now();
    let mut handles = Vec::with_capacity(n);
    for _ in 0..n {
        handles.push(std::thread::spawn(|| {
            let mut subject = "#".to_string();
            let search = Regex::new("#").unwrap();
            let replace = "benchmark#";

            let ts = Instant::now();
            for _ in 0..38000 {
                subject = search.replace_all(&subject, replace).to_string();
            }
            println!("thread {}", ts.elapsed().as_secs_f32());
        }));
    }

    for handle in handles {
        handle.join().unwrap();
    }
    println!("total {}", ts.elapsed().as_secs_f32());
}

@c-antin
Author

c-antin commented Jun 17, 2023

When testing aho-corasick, I get the sequential runtime again (all on Windows 10):
cargo run --release -- 1:

thread 0.87699443
total 0.878341

cargo run --release -- 5:

thread 4.460018
thread 4.474072
thread 4.494944
thread 4.545341
thread 4.5797343
total 4.582171

use std::time::Instant;

use aho_corasick::AhoCorasick;

fn main() {
    let mut args = std::env::args();
    let _ = args.next();
    let arg1 = args.next();
    if arg1.is_none() {
        panic!("arg1 missing!");
    }
    let n = arg1.unwrap().parse().unwrap();
    let ts = Instant::now();
    let mut handles = Vec::with_capacity(n);
    for _ in 0..n {
        handles.push(std::thread::spawn(|| {
            let mut subject = "#".to_string();
            let search = &["#"];
            let replace = &["benchmark#"];

            let ts = Instant::now();
            let ac = AhoCorasick::new(search).unwrap();
            for _ in 0..38000 {
                subject = ac.try_replace_all(&subject, replace).unwrap();
            }
            println!("thread {}", ts.elapsed().as_secs_f32());
        }));
    }

    for handle in handles {
        handle.join().unwrap();
    }
    println!("total {}", ts.elapsed().as_secs_f32());
}

@BurntSushi
Member

@c-antin Can you please try the second program in my comment? The one that moves the allocs out of the loop. It seems like a critical data point here, right?

@c-antin
Author

c-antin commented Jun 17, 2023

@BurntSushi Tested your code (allocs out of loop) with the following results:
cargo run --release -- 1:

thread 0.4410655, len=6497867000
total 0.4435104

cargo run --release -- 5:

thread 0.5343989, len=6497867000
thread 0.5340821, len=6497867000
thread 0.53457254, len=6497867000
thread 0.53373957, len=6497867000
thread 0.54389656, len=6497867000
total 0.5460506

So there is only some contention, which means the per-thread searches do run in parallel.

@BurntSushi
Member

Hmm, okay. That in turn argues in favor of allocation being the problem. Or perhaps the combination of allocation and regex searching. In particular, I wonder if the use of AVX2 in the regex search is somehow causing problems. Your onig and str::replace examples probably do not use AVX2. They are also slower on their own, which in theory could reduce contention on the allocator and give the appearance of better scaling behavior? Maybe?

OK, so let's test the AVX2+alloc idea. As you've noted, aho-corasick exhibits the same problem. But that is not quite an independent data point, since it executes the same strategy as the regex: a memmem search from the memchr crate, which will use AVX2 to accelerate the search. The problem with AVX2 is that on some CPUs it can lead to down-clocking. So I wonder if there is something bad going on where it ping-pongs between the allocator and an AVX2 search routine. It's kind of a wild guess, but we can test it.

First, let's try to strip this down to its nuts & bolts. We'll use memchr directly and implement our own replacement routine:

use std::time::Instant;

use memchr::memchr_iter;

fn main() {
    let mut args = std::env::args();
    let _ = args.next();
    let arg1 = args.next();
    if arg1.is_none() {
        panic!("arg1 missing!");
    }
    let n = arg1.unwrap().parse().unwrap();
    let ts = Instant::now();
    let mut handles = Vec::with_capacity(n);
    for _ in 0..n {
        handles.push(std::thread::spawn(|| {
            let mut subject = "#".to_string();
            let ts = Instant::now();
            for _ in 0..38000 {
                subject = replace_hash(&subject);
            }
            println!("thread {}", ts.elapsed().as_secs_f32());
        }));
    }

    for handle in handles {
        handle.join().unwrap();
    }
    println!("total {}", ts.elapsed().as_secs_f32());
}

#[inline(never)]
fn replace_hash(haystack: &str) -> String {
    let mut new = String::with_capacity(haystack.len());
    let mut last_match = 0;
    for offset in memchr_iter(b'#', haystack.as_bytes()) {
        new.push_str(&haystack[last_match..offset]);
        new.push_str("benchmark#");
        last_match = offset + 1;
    }
    new.push_str(&haystack[last_match..]);
    new
}

And in your Cargo.toml:

memchr = { version = "2.5", features = ["std"] }

Can you try the above program? Does it have the same scaling problem? Hopefully it does. If so, change the memchr entry in your Cargo.toml to this:

memchr = { version = "2.5", default-features = false }

This disables the std feature in particular, which means memchr can't use AVX2. Instead, it will just use standard SSE2 instructions. Now re-build and re-run. What is the scaling behavior?
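As an aside (not something memchr exposes directly, since it does its own internal dispatch), std's runtime feature detection can at least confirm what the CPU reports: SSE2 is part of the x86-64 baseline, while AVX2 varies by CPU. A minimal sketch:

```rust
fn main() {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // SSE2 is part of the x86-64 baseline, so it is always true on
        // x86_64; AVX2 depends on the particular CPU.
        println!("sse2: {}", is_x86_feature_detected!("sse2"));
        println!("avx2: {}", is_x86_feature_detected!("avx2"));
    }
    #[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
    println!("not an x86 target");
}
```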

Finally, let's try one last example that doesn't use memchr at all, and instead just uses a normal iterator loop over the bytes one-at-a-time to find a match:

use std::time::Instant;

fn main() {
    let mut args = std::env::args();
    let _ = args.next();
    let arg1 = args.next();
    if arg1.is_none() {
        panic!("arg1 missing!");
    }
    let n = arg1.unwrap().parse().unwrap();
    let ts = Instant::now();
    let mut handles = Vec::with_capacity(n);
    for _ in 0..n {
        handles.push(std::thread::spawn(|| {
            let mut subject = "#".to_string();
            let ts = Instant::now();
            for _ in 0..38000 {
                subject = replace_hash(&subject);
            }
            println!("thread {}", ts.elapsed().as_secs_f32());
        }));
    }

    for handle in handles {
        handle.join().unwrap();
    }
    println!("total {}", ts.elapsed().as_secs_f32());
}

#[inline(never)]
fn replace_hash(haystack: &str) -> String {
    let mut new = String::with_capacity(haystack.len());
    let mut last_match = 0;
    while let Some(offset) = memchr(b'#', haystack.as_bytes(), last_match) {
        new.push_str(&haystack[last_match..offset]);
        new.push_str("benchmark#");
        last_match = offset + 1;
    }
    new.push_str(&haystack[last_match..]);
    new
}

fn memchr(needle: u8, haystack: &[u8], start: usize) -> Option<usize> {
    haystack[start..].iter().position(|&b| b == needle).map(|o| start + o)
}

What is the scaling behavior of this program?

So there should be three different programs to try here.

@c-antin (Author) commented Jun 17, 2023

memchr with feature std on Windows 10 64bit:
cargo run --release 1:

thread 1.0436482
total 1.0453712

cargo run --release 5:

thread 3.5881858
thread 3.6013646
thread 3.6201432
thread 3.626615
thread 3.631054
total 3.635428

We have partially sequential runs, though not as bad as with regex.

@c-antin (Author) commented Jun 17, 2023

memchr no std on Windows 10 64bit:
cargo run --release 1:

thread 1.0176221
total 1.0185086

cargo run --release 5:

thread 3.5486186
thread 3.5496693
thread 3.5511684
thread 3.5966206
thread 3.6153443
total 3.6199574

No change.

@c-antin (Author) commented Jun 17, 2023

iter on Windows 10 64bit:
cargo run --release 1:

thread 5.061595
total 5.063283

cargo run --release 5:

thread 6.931339
thread 6.9519405
thread 7.039762
thread 7.042991
thread 7.1338468
total 7.1413355

About 2s extra runtime. Comparable behavior to str::replace and onig.

@c-antin (Author) commented Jun 17, 2023

memchr with std on macOS (!same machine as Win10 = same CPU!):
cargo run --release 1:

thread 0.4384527
total 0.43863383

cargo run --release 5:

thread 0.5838767
thread 0.6257434
thread 0.64514965
thread 0.74490994
thread 0.7479834
total 0.7481185

About 0.3s extra runtime.

@c-antin (Author) commented Jun 17, 2023

memchr without std on macOS (!same machine as Win10 = same CPU!):
cargo run --release 1:

thread 0.41908023
total 0.41929638

cargo run --release 5:

thread 0.6537725
thread 0.6840306
thread 0.73337364
thread 0.73414886
thread 0.8091553
total 0.80930233

About 0.4s extra runtime.

@BurntSushi (Member)

I'm out of ideas here. From what I can see, you're able to reproduce the issue with just memchr. There is not a lot going on in memchr other than using vector instructions and making the search "fast." It could "just" be the case that using a faster search exacerbates contention issues.

I think at this point you're probably on your own here. I can't reproduce your specific problem and I don't have any other obvious ideas to try. Like the fact that you're able to reproduce the problem even after disabling std in the memchr crate means that it should be taking this code path: https://github.com/BurntSushi/memchr/blob/be564d425bfcabd2c9925a8a360f4a62b866bb90/src/memchr/x86/mod.rs#L83-L92

There's nothing interesting happening there. It's just calling a function. There's no CPU feature check and no AVX registers being used. There are no locks or any other synchronization primitives involved here. The only other sources of contention that I can see are in the allocations and CPU cache effects. I note that even in your str::replace and onig examples, you are still getting overall slowdowns that are, in an absolute sense, seemingly comparable to memchr. That seems like a strong hint to me. It's just that relatively speaking, the slowdown is more striking with memchr, possibly not because it is provoking anything interesting, but because memchr is itself faster.

Anyway, the point is, we've whittled your benchmark down to some pretty simple code that is provoking the issue.

So getting back to probably the most important question in this issue:

Any tips on reducing the heap allocations without changing the overall result?

This isn't possible to meaningfully answer as far as I can tell. I don't think you've said what the actual problem you're trying to solve is. Does your real program really allocate new strings that keep growing without bound? I think you need to describe your actual problem.

memchr with std on macOS (!same machine as Win10 = same CPU!):

Okay now that's interesting. This to me is another data point pointing the finger at the allocator. macOS is going to be using a different allocator than Windows.

@hodgesds commented Jun 17, 2023

You could try writing a shim custom allocator that keeps track of allocations, though you'd probably want thread-local counters to avoid unnecessary overhead. Alternatively, jemalloc could be used with its default statistics (I tested it a little bit with similar results).

@V0ldek commented Jun 17, 2023

At the suggestion of @cloud11665 on Twitter, I ran @BurntSushi's non-allocating version on my Ryzen 7950X (32 threads) and I'm getting the same pattern:

./target/release/regex-bench 1
...
total 0.13975821

./target/release/regex-bench 4
...
total 0.18820013

./target/release/regex-bench 8
...
total 0.24301866


./target/release/regex-bench 16
...
total 1.7194413


./target/release/regex-bench 32
...
total 2.8915215

The memchr version also gets worse with many threads:

./target/release/regex-bench-memchr 1
...
total 0.23130777

./target/release/regex-bench-memchr 4
...
total 0.2585047

./target/release/regex-bench-memchr 8
...
total 0.29258442

./target/release/regex-bench-memchr 16
...
total 0.6513686

./target/release/regex-bench-memchr 32
...
total 1.781904

I think it's pretty interesting that for 1-8 threads the slowdown is barely noticeable. Sometimes for 4 threads I get better execution times than for 1.

So this doesn't seem to be an issue with AVX downclocking; it's clearly the allocator. Maybe we should test a different allocator and see its perf profile?

@c-antin (Author) commented Jun 17, 2023

This isn't possible to meaningfully answer as far as I can tell. I don't think you've said what the actual problem you're trying to solve is. Does your real program really allocate new strings that keep growing without bound? I think you need to describe your actual problem.

I ported this benchmark from PHP (equivalent loop with growing str_replace/preg_replace) and ran it in parallel without contention on the same machine. When trying the same with Rust I discovered this blocking behavior. Only difference: it's threads (Rust) vs processes (PHP) and regex vs PCRE2.

@BurntSushi (Member)

Also, in the replace_hash function when using the memchr crate, you might be able to observe something interesting by reserving a higher capacity that avoids a realloc in the loop below it:

    let mut new = String::with_capacity(haystack.len() + 50);

If allocs are the problem, this might be an interesting debugging step because it would in theory reduce the number of allocs by half I think?

I ported this benchmark from PHP (equivalent loop with growing str_replace/preg_replace) and ran it in parallel without contention on the same machine. When trying the same with Rust I discovered this blocking behavior. Only difference: it's threads (Rust) vs processes (PHP) and regex vs PCRE2.

Okay so this is "just" a benchmark and not actually modeling anything in the real world? If so, I'd say this is probably a bad benchmark without any other context.

@V0ldek thanks! I think the key here is that neither you nor me are reproducing the much more significant slowdown being observed by @c-antin on his Windows machine. I wonder if other Windows users can reproduce this.

@c-antin (Author) commented Jun 17, 2023

It seems to be thread and Windows related.
If spawned in separate processes, the runtime is as expected (some contention, but not sequential):
cargo run --release -- 1:

process 1.1345332
total 1.1544021

cargo run --release -- 5:

process 1.420188
process 1.440525
process 1.4621232
process 1.4780271
process 1.4917529
total 1.5082042

main.rs:

use std::{process::Command, time::Instant};

fn main() {
    let mut args = std::env::args();
    let _ = args.next();
    let arg1 = args.next();
    if arg1.is_none() {
        panic!("arg1 missing!");
    }
    let n = arg1.unwrap().parse().unwrap();
    let ts = Instant::now();
    let mut handles = Vec::with_capacity(n);
    for _ in 0..n {
        handles.push(std::thread::spawn(|| {
            let mut child = Command::new("./target/release/compute.exe")
                .spawn()
                .unwrap();
            child.wait().unwrap();
        }));
    }

    for handle in handles {
        handle.join().unwrap();
    }
    println!("total {}", ts.elapsed().as_secs_f32());
}

bin/compute.rs:

use std::time::Instant;

use regex::Regex;

fn main() {
    let mut subject = "#".to_string();
    let search = Regex::new(&regex::escape("#")).unwrap();
    let replace = "benchmark#";

    let ts = Instant::now();
    for _ in 0..38000 {
        subject = search.replace_all(&subject, replace).to_string();
    }
    println!("process {}", ts.elapsed().as_secs_f32());
}

@c-antin (Author) commented Jun 17, 2023

Okay so this is "just" a benchmark and not actually modeling anything in the real world? If so, I'd say this is probably a bad benchmark without any other context.

I wouldn't say it's a bad benchmark. Old-school webapps did nothing but str_replace of an ever-growing String of HTML tags.

@V0ldek commented Jun 17, 2023

I think the key here is that neither you nor me are reproducing the much more significant slowdown being observed by @c-antin on his Windows machine. I wonder if other Windows users can reproduce this.

Oh, sure, I missed this consideration. It might be worth saying that my results are from Linux but via WSL. Here are the results ran natively on Windows:

.\target\release\regex-bench.exe 1
total 0.14726941

.\target\release\regex-bench.exe 4
total 0.1904817

.\target\release\regex-bench.exe 8
total 0.2505401

.\target\release\regex-bench.exe 16
total 1.5162753

.\target\release\regex-bench.exe 32
total 3.470512

and the memchr bench:

.\target\release\regex-bench-memchr.exe 1
total 0.3882568

.\target\release\regex-bench-memchr.exe 4
total 0.98220855

.\target\release\regex-bench-memchr.exe 8
total 1.9249403

.\target\release\regex-bench-memchr.exe 16
total 7.0616555

.\target\release\regex-bench-memchr.exe 32
total 16.557264

I think that's confirmation enough: clearly some shenanigans with the allocator, specifically on Windows.

@BurntSushi (Member)

The multi-process versus multi-thread problem makes sense. In the multi-process case, each process is using its own allocator, while a single allocator is likely shared among multiple threads in the multi-thread case.

I wouldn't say it's a bad benchmark. Old-school webapps did nothing but str_replace of an ever-growing String of HTML tags.

I wrote my fair share of old school web apps. In PHP no less. I don't ever recall seeing code that did 38,000 substitutions. :-) Anywho, we'll probably have to disagree about this. But I think I have my actual answer here: you don't seem to be trying to solve a real problem here other than porting an existing benchmark to a new environment, right? If so, then I don't think there is much I can do to help you "optimize" the program, because it really depends on what your benchmarking parameters are.

I'm going to close this issue for now because while we perhaps don't have concrete confirmation of what the problem is, it seems safe to say the problem is not in the regex engine. Anyone should feel free to update this issue if they discover more things. :)

@BurntSushi (Member)

@V0ldek Yeah I agree. That looks like pretty compelling evidence to me.

BurntSushi closed this as not planned on Jun 17, 2023.
@V0ldek commented Jun 17, 2023

@c-antin

I think this is an interesting find; you should consider opening an issue in rust-lang/rust about performance issues with allocation in multi-threaded programs under Windows. If anyone knows what's actually going on under the hood, it's them.

The performance impact is massive, which means that it's possibly widely impactful.

@BurntSushi (Member)

@V0ldek Rust itself just uses the system allocator, which is usually maintained by different folks. In this case, that's the allocator shipped with what I'm guessing is MSVC. So probably the Windows folks.

@c-antin (Author) commented Jun 17, 2023

As suggested by @V0ldek I opened an issue here.

@code-ape commented Jun 17, 2023

Did anyone try to rule out the OS thread scheduler? Apparently I missed a lot in the 2 hours since this was posted, but I threw together a slight alteration of the benchmark that pins the threads to individual CPUs and measures work over time. I suspect this is a scheduler issue based on my experience with low-latency work in my professional life.

Demo repo if anyone is interested: https://github.com/code-ape/regex-bench-rs

@c-antin (Author) commented Jun 17, 2023

mimalloc did the trick, see rust-lang/rust#112744 (comment)
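For reference, switching the global allocator is a small change. Assuming the `mimalloc` crate (this is its documented usage, not code from the thread), add the dependency and two lines:

```rust
// Cargo.toml: mimalloc = "0.1"
use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
```

Everything else in the benchmark stays the same; all `String` allocations then go through mimalloc instead of the MSVC heap.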

@V0ldek commented Jun 17, 2023

Yup, with mimalloc I get the following timings in the original code from @c-antin:

.\target\release\regex-bench.exe 1
total 0.3116931

.\target\release\regex-bench.exe 4
total 0.30242363

.\target\release\regex-bench.exe 8
total 0.3190105

.\target\release\regex-bench.exe 16
total 0.3425554
