Blocking behavior when run in loop on multiple threads #1013
In contrast:

```rust
use std::time::Instant;

fn main() {
    let mut args = std::env::args();
    let _ = args.next();
    let arg1 = args.next();
    if arg1.is_none() {
        panic!("arg1 missing!");
    }
    let n = arg1.unwrap().parse().unwrap();
    let ts = Instant::now();
    let mut handles = Vec::with_capacity(n);
    for _ in 0..n {
        handles.push(std::thread::spawn(|| {
            let mut subject = "#".to_string();
            let search = "#";
            let replace = "benchmark#";
            let ts = Instant::now();
            for _ in 0..38000 {
                subject = subject.replace(search, replace);
            }
            println!("thread {}", ts.elapsed().as_secs_f32());
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    println!("total {}", ts.elapsed().as_secs_f32());
}
```
|
I almost immediately marked this as a duplicate of #934, but then noticed that you aren't actually sharing a single regex across multiple threads. Instead, you're creating a new regex inside of each thread. So that's good. OK, so first, the
I added a OK, so let's move to the regex example:
So here, 8 threads is a little slower than 1 thread, which about lines up with So from where I'm sitting, it looks like My next step was to run with 1 thread and 64 threads (to really provoke any contention issues) under a profiler and see where they differ. From a quick look under The fact that the profiles look the same leads me to make some guesses:
Okay, so to test my hypothesis, I'll take a look at this program under
OK, so your program is absolutely getting hammered by

So what I did was move your replacement code to the beginning of the program, before spawning threads. Then, in each iteration in each thread, I just searched the corresponding subject string and counted the number of matches. But I didn't do any allocation. Here's the code:

```rust
use std::{sync::Arc, time::Instant};

use regex::Regex;

fn main() {
    let mut args = std::env::args();
    let _ = args.next();
    let arg1 = args.next();
    if arg1.is_none() {
        panic!("arg1 missing!");
    }
    let n = arg1.unwrap().parse().unwrap();

    const SUBJECT_COUNT: usize = 38_000;
    let mut subjects = vec!["".to_string(); SUBJECT_COUNT];
    let mut subject = "#".to_string();
    let search = Regex::new(&regex::escape("#")).unwrap();
    let replace = "benchmark#";
    for i in 0..SUBJECT_COUNT {
        subjects[i] = subject.clone();
        subject = search.replace_all(&subject, replace).to_string();
    }
    let subjects = Arc::new(subjects);

    let ts = Instant::now();
    let mut handles = Vec::with_capacity(n);
    for _ in 0..n {
        let subjects = Arc::clone(&subjects);
        handles.push(std::thread::spawn(move || {
            let search = Regex::new(&regex::escape("#")).unwrap();
            let mut len = 0;
            let ts = Instant::now();
            for i in 0..SUBJECT_COUNT {
                assert_eq!(1, search.find_iter(&subjects[i]).count());
                len += subjects[i].len();
            }
            println!("thread {}, len={}", ts.elapsed().as_secs_f32(), len);
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    println!("total {}", ts.elapsed().as_secs_f32());
}
```
Wow, okay, that was unexpected. Do we have the same syscall problem?
Hmmm... Nope. And then zooming in shows it's doing what I expect it to: So at this point, my brain is heading towards memory access time and/or CPU cache effects. I usually attempt to measure those in aggregate with At this point, I'm pretty sure this is not a problem in the regex crate. I'll also note that I find your benchmark design to be extremely pathological. Your core loop is doing more work on each iteration, and is essentially growing the replacement string bigger and bigger. At least the rate of growth is linear, but it's still really weird. How did you get to this point? Does the benchmark you've shared here actually model your real world use case accurately? |
I believe your test is biased towards benchmarking allocations rather than the actual work of the regex. First, use a single core:
multiple cores:
Observe much higher allocations. Next, run
Backend cycles are heavily stalled on the single-core run, which could imply that the program is waiting on memory access. They are less stalled on the run with more cores, as there is more contention across the cores. Also, in general, more allocations -> more cycles. Similar to the profiles above
View report with
|
It seems only Windows related (tested on another Windows 10 machine and on a macOS machine). It's probably a race for memory. Any tips on reducing the heap allocations without changing the overall result? |
I've done further testing with the onig crate (had to install LLVM + LIBCLANG_PATH=...\LLVM\bin + default-features = false):
Even though it's 11x slower, there is only some contention (as expected), instead of a sequential runtime of 55 sec.

```rust
use std::time::Instant;

use onig::Regex;

fn main() {
    let mut args = std::env::args();
    let _ = args.next();
    let arg1 = args.next();
    if arg1.is_none() {
        panic!("arg1 missing!");
    }
    let n = arg1.unwrap().parse().unwrap();
    let ts = Instant::now();
    let mut handles = Vec::with_capacity(n);
    for _ in 0..n {
        handles.push(std::thread::spawn(|| {
            let mut subject = "#".to_string();
            let search = Regex::new("#").unwrap();
            let replace = "benchmark#";
            let ts = Instant::now();
            for _ in 0..38000 {
                subject = search.replace_all(&subject, replace).to_string();
            }
            println!("thread {}", ts.elapsed().as_secs_f32());
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    println!("total {}", ts.elapsed().as_secs_f32());
}
```
|
When testing aho-corasick, I get the sequential runtime again (all on Windows 10):
```rust
use std::time::Instant;

use aho_corasick::AhoCorasick;

fn main() {
    let mut args = std::env::args();
    let _ = args.next();
    let arg1 = args.next();
    if arg1.is_none() {
        panic!("arg1 missing!");
    }
    let n = arg1.unwrap().parse().unwrap();
    let ts = Instant::now();
    let mut handles = Vec::with_capacity(n);
    for _ in 0..n {
        handles.push(std::thread::spawn(|| {
            let mut subject = "#".to_string();
            let search = &["#"];
            let replace = &["benchmark#"];
            let ts = Instant::now();
            let ac = AhoCorasick::new(search).unwrap();
            for _ in 0..38000 {
                subject = ac.try_replace_all(&subject, replace).unwrap();
            }
            println!("thread {}", ts.elapsed().as_secs_f32());
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    println!("total {}", ts.elapsed().as_secs_f32());
}
```
|
@c-antin Can you please try the second program in my comment? The one that moves the allocs out of the loop. It seems like a critical data point here, right? |
@BurntSushi Tested your code (allocs out of loop) with the following results:
So only some contention, hence the |
Hmm okay. That in turn argues in favor of allocation being the problem. Or perhaps the combination of allocation and regex searching. And in particular, I wonder if the use of AVX2 in the regex searching is causing problems somehow. Your onig and

OK, so let's test the AVX2+alloc idea. As you've noted,

First, let's try to strip this down to its nuts & bolts. We'll use the memchr crate:

```rust
use std::time::Instant;

use memchr::memchr_iter;

fn main() {
    let mut args = std::env::args();
    let _ = args.next();
    let arg1 = args.next();
    if arg1.is_none() {
        panic!("arg1 missing!");
    }
    let n = arg1.unwrap().parse().unwrap();
    let ts = Instant::now();
    let mut handles = Vec::with_capacity(n);
    for _ in 0..n {
        handles.push(std::thread::spawn(|| {
            let mut subject = "#".to_string();
            let ts = Instant::now();
            for _ in 0..38000 {
                subject = replace_hash(&subject);
            }
            println!("thread {}", ts.elapsed().as_secs_f32());
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    println!("total {}", ts.elapsed().as_secs_f32());
}

#[inline(never)]
fn replace_hash(haystack: &str) -> String {
    let mut new = String::with_capacity(haystack.len());
    let mut last_match = 0;
    for offset in memchr_iter(b'#', haystack.as_bytes()) {
        new.push_str(&haystack[last_match..offset]);
        new.push_str("benchmark#");
        last_match = offset + 1;
    }
    new.push_str(&haystack[last_match..]);
    new
}
```

And in your Cargo.toml:

```toml
memchr = { version = "2.5", features = ["std"] }
```

Can you try the above program? Does it have the same scaling problem? Hopefully it does. If so, change the memchr dependency to:

```toml
memchr = { version = "2.5", default-features = false }
```

This disables the std feature, and with it memchr's runtime AVX2 detection.

Finally, let's try one last example that doesn't use the memchr crate at all:

```rust
use std::time::Instant;

fn main() {
    let mut args = std::env::args();
    let _ = args.next();
    let arg1 = args.next();
    if arg1.is_none() {
        panic!("arg1 missing!");
    }
    let n = arg1.unwrap().parse().unwrap();
    let ts = Instant::now();
    let mut handles = Vec::with_capacity(n);
    for _ in 0..n {
        handles.push(std::thread::spawn(|| {
            let mut subject = "#".to_string();
            let ts = Instant::now();
            for _ in 0..38000 {
                subject = replace_hash(&subject);
            }
            println!("thread {}", ts.elapsed().as_secs_f32());
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    println!("total {}", ts.elapsed().as_secs_f32());
}

#[inline(never)]
fn replace_hash(haystack: &str) -> String {
    let mut new = String::with_capacity(haystack.len());
    let mut last_match = 0;
    while let Some(offset) = memchr(b'#', haystack.as_bytes(), last_match) {
        new.push_str(&haystack[last_match..offset]);
        new.push_str("benchmark#");
        last_match = offset + 1;
    }
    new.push_str(&haystack[last_match..]);
    new
}

fn memchr(needle: u8, haystack: &[u8], start: usize) -> Option<usize> {
    haystack[start..].iter().position(|&b| b == needle).map(|o| start + o)
}
```

What is the scaling behavior of this program?

So there should be three different programs to try here. |
We have partially sequential runs, though not as bad as regex. |
No change. |
About 2s extra runtime. Comparable behavior to |
About 0.3s extra runtime. |
About 0.4s extra runtime. |
I'm out of ideas here. From what I can see, you're able to reproduce the issue with just I think at this point you're probably on your own here. I can't reproduce your specific problem and I don't have any other obvious ideas to try. Like the fact that you're able to reproduce the problem even after disabling There's nothing interesting happening there. It's just calling a function. There's no CPU feature check and no AVX registers being used. There are no locks or any other synchronization primitives involved here. The only other sources of contention that I can see are in the allocations and CPU cache effects. I note that even in your Anyway, the point is, we've whittled your benchmark down to some pretty simple code that is provoking the issue. So getting back to probably the most important question in this issue:
This isn't possible to meaningfully answer as far as I can tell. I don't think you've said what the actual problem you're trying to solve is. Does your real program really allocate new strings that keep growing without bound? I think you need to describe your actual problem.
Okay now that's interesting. This to me is another data point pointing the finger at the allocator. macOS is going to be using a different allocator than Windows. |
You could try writing a shim custom allocator that keeps track of allocations. Although you'd probably want thread-local counters to avoid unnecessary overhead. Alternatively, jemalloc could be used (I tested it a little bit, with similar results) with its default statistics. |
At the suggestion of @cloud11665 on Twitter, I ran @BurntSushi's non-allocating version on my Ryzen 7950X (32 threads) and I'm getting the same pattern:

```
./target/release/regex-bench 1
...
total 0.13975821
./target/release/regex-bench 4
...
total 0.18820013
./target/release/regex-bench 8
...
total 0.24301866
./target/release/regex-bench 16
...
total 1.7194413
./target/release/regex-bench 32
...
total 2.8915215
```

The memchr version:

```
./target/release/regex-bench-memchr 1
...
total 0.23130777
./target/release/regex-bench-memchr 4
...
total 0.2585047
./target/release/regex-bench-memchr 8
...
total 0.29258442
./target/release/regex-bench-memchr 16
...
total 0.6513686
./target/release/regex-bench-memchr 32
...
total 1.781904
```

I think it's pretty interesting that for 1-8 threads the slowdown is barely noticeable. Sometimes for 4 threads I get better execution times than for 1. So this doesn't seem to be an issue with AVX downclocking; it's clearly the allocator. Maybe we should test a different allocator and see its perf profile? |
I ported this benchmark from PHP (equivalent loop with growing str_replace/preg_replace) and ran it in parallel without contention on the same machine. When trying the same with Rust I discovered this blocking behavior. Only difference: it's threads (Rust) vs processes (PHP) and |
Also, in the replace_hash function, try:

```rust
let mut new = String::with_capacity(haystack.len() + 50);
```

If allocs are the problem, this might be an interesting debugging step, because it would in theory reduce the number of allocs by half, I think?
Okay, so this is "just" a benchmark and not actually modeling anything in the real world? If so, I'd say this is probably a bad benchmark without any other context. @V0ldek thanks! I think the key here is that neither you nor I am reproducing the much more significant slowdown being observed by @c-antin on his Windows machine. I wonder if other Windows users can reproduce this. |
It seems to be thread and Windows related.
main.rs:

```rust
use std::{process::Command, time::Instant};

fn main() {
    let mut args = std::env::args();
    let _ = args.next();
    let arg1 = args.next();
    if arg1.is_none() {
        panic!("arg1 missing!");
    }
    let n = arg1.unwrap().parse().unwrap();
    let ts = Instant::now();
    let mut handles = Vec::with_capacity(n);
    for _ in 0..n {
        handles.push(std::thread::spawn(|| {
            let mut child = Command::new("./target/release/compute.exe")
                .spawn()
                .unwrap();
            child.wait().unwrap();
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    println!("total {}", ts.elapsed().as_secs_f32());
}
```

bin/compute.rs:

```rust
use std::time::Instant;

use regex::Regex;

fn main() {
    let mut subject = "#".to_string();
    let search = Regex::new(&regex::escape("#")).unwrap();
    let replace = "benchmark#";
    let ts = Instant::now();
    for _ in 0..38000 {
        subject = search.replace_all(&subject, replace).to_string();
    }
    println!("process {}", ts.elapsed().as_secs_f32());
}
```
|
I wouldn't say it's a bad benchmark. Old-school webapps did nothing but |
Oh, sure, I missed this consideration. It might be worth saying that my results are from Linux, but via WSL. Here are the results run natively on Windows:

```
.\target\release\regex-bench.exe 1
total 0.14726941
.\target\release\regex-bench.exe 4
total 0.1904817
.\target\release\regex-bench.exe 8
total 0.2505401
.\target\release\regex-bench.exe 16
total 1.5162753
.\target\release\regex-bench.exe 32
total 3.470512
```

and the memchr bench:

```
.\target\release\regex-bench-memchr.exe 1
total 0.3882568
.\target\release\regex-bench-memchr.exe 4
total 0.98220855
.\target\release\regex-bench-memchr.exe 8
total 1.9249403
.\target\release\regex-bench-memchr.exe 16
total 7.0616555
.\target\release\regex-bench-memchr.exe 32
total 16.557264
```

I think that's confirmation enough: clearly some shenanigans with the allocator, specifically on Windows. |
The multi-process versus multi-thread problem makes sense. In the multi-process case, each process is using its own allocator, while a single allocator is likely shared among multiple threads in the multi-thread case.
I wrote my fair share of old school web apps. In PHP no less. I don't ever recall seeing code that did 38,000 substitutions. :-) Anywho, we'll probably have to disagree about this. But I think I have my actual answer here: you don't seem to be trying to solve a real problem here other than porting an existing benchmark to a new environment, right? If so, then I don't think there is much I can do to help you "optimize" the program, because it really depends on what your benchmarking parameters are. I'm going to close this issue for now because while we perhaps don't have concrete confirmation of what the problem is, it seems safe to say the problem is not in the regex engine. Anyone should feel free to update this issue if they discover more things. :) |
@V0ldek Yeah I agree. That looks like pretty compelling evidence to me. |
I think this is an interesting find, you should consider opening an issue in rust-lang/rust about performance issues with allocation from multi-threaded programs under Windows. If anyone knows what's actually going on under the hood it's them. The performance impact is massive, which means that it's possibly widely impactful. |
@V0ldek Rust itself just uses the system allocator, which is usually maintained by different folks. In this case, the allocator used in what I'm guessing is MSVC. So probably the Windows folks. |
Did anyone try to rule out the Demo repo if anyone is interested: https://github.com/code-ape/regex-bench-rs |
|
Yup, with

```
.\target\release\regex-bench.exe 1
total 0.3116931
.\target\release\regex-bench.exe 4
total 0.30242363
.\target\release\regex-bench.exe 8
total 0.3190105
.\target\release\regex-bench.exe 16
total 0.3425554
```
|
**What version of regex are you using?**

latest (1.8.4)

**Describe the bug at a high level.**

When repeatedly replacing a string in a loop on multiple threads (CPU count - 1), the threads seem to block (run in sequence) instead of running in parallel. It seems like something is blocking the threads (maybe shared memory?).

Tested on 2 different multi-core machines, both with Windows 10 64-bit.

**What are the steps to reproduce the behavior?**

`cargo run --release -- 1` (runtime 1s) vs `cargo run --release -- 6` (runtime 6s)

**What is the actual behavior?**

The total runtime is the sequential runtime.

**What is the expected behavior?**

The total runtime should be at most the runtime of the slowest thread on a multi-core machine.