Compute time doesn't seem to scale well with increasing number of threads past a certain point #92
Comments
There are only 16 physical cores, right? So using more than 16 threads will not help for compute-bound workloads. That sort of thing helps for I/O-bound workloads, where the threads are not all busy at the same time.
This was on a 16-core/32-thread machine.
Primarily I would like to see better performance when increasing from 16 threads on a 16 physical core machine to 32 threads on a 32 physical core machine. By doubling the threads (and the total number of physical cores) I was only getting a modest speedup of ~20%.
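One rough way to read that ~20% figure is through Amdahl's law. The sketch below is only an illustration and rests on assumptions not stated in the thread: that "~20% speedup" means throughput at 32 threads is 1.2x throughput at 16 threads, and that the workload has a fixed serial fraction with no memory contention or GC effects. The function names are hypothetical, not part of Omniscape.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Ideal speedup on n threads when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)


def parallel_fraction(ratio: float, n1: int, n2: int) -> float:
    """Solve amdahl_speedup(p, n2) / amdahl_speedup(p, n1) == ratio for p.

    Assumes Amdahl's model holds exactly (an idealization).
    """
    # ratio = ((1 - p) + p/n1) / ((1 - p) + p/n2)
    # With a = 1 - 1/n this becomes: 1 - p*a1 = ratio * (1 - p*a2)
    a1 = 1.0 - 1.0 / n1
    a2 = 1.0 - 1.0 / n2
    return (ratio - 1.0) / (ratio * a2 - a1)


if __name__ == "__main__":
    # Assumed reading of the comment: 16 -> 32 threads gives a 1.2x speedup.
    p = parallel_fraction(1.2, 16, 32)
    print(f"implied parallel fraction: {p:.3f}")
    print(f"speedup ceiling under this model: {1.0 / (1.0 - p):.1f}x")
```

Under those assumptions the implied parallel fraction is about 0.89, which would cap the speedup at roughly 9x regardless of thread count. If memory contention or GC is the real bottleneck, the model does not apply, so this is a rough reading at best.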
Perhaps memory contention when reading from the same memory locations? Or is more GC happening?
Hmm, that's a good thought about memory being read from the same locations... that will certainly be happening sometimes. I do randomize the solves, so that should help, in that the regions for adjacent points won't often be solved at the same time (and overlapping portions of the inputs are less likely to be read simultaneously as a result). There's also plenty of GC for all of the inputs for each Circuitscape solve.
I'm seeing this too. JULIA_NUM_THREADS of 8-12 seems to be the sweet spot for large processes. Larger values (on machines with enough cores, of course) don't appear to give a performance gain, and reduce performance in some cases.
On further real-world testing, JULIA_NUM_THREADS of 6 seems to be quicker than 8 on large (multi-day) processes.
I have noticed scaling issues with Omniscape's multithreading, where once the number of threads gets high enough, compute time actually starts to increase. The problem I was using is quite large, and I'm running it on an expensive VM, so below, instead of recording the actual compute times, I'm showing the projected compute time from ProgressMeter.jl after letting the job run for a while until the ETA stabilizes. These benchmarks were run on an Azure VM with 64 logical cores (and 32 physical cores) and 256GB RAM:
☝🏻 This mostly makes sense, as only using physical cores could make for more efficient use of the processors.
It gets a bit stranger when switching to a 32 logical core VM (16 physical cores) with 128GB RAM. Both VMs use Intel Xeon processors, so there shouldn't be any difference in single-thread processor speed. I would expect using 63 logical cores to be faster than using 31, and I'd also expect, based on the above, that on a machine with 32 logical cores and 16 physical cores, using 16 threads would similarly outperform using 31 threads. However, that is not the case. Using 16 threads ran about as fast as 31 threads, not faster. The 16 threads job is also not much slower than the 32 threads job above.
This Omniscape run used a moving window size of 668, so about 1.4M pixels per Circuitscape solve. This means that Circuitscape solve time is >>> overhead from parallel processing.
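For reference, the ~1.4M figure is consistent with treating 668 as the radius (in pixels) of a circular moving window. That interpretation is an assumption here, and the helper name below is hypothetical:

```python
import math


def pixels_in_circular_window(radius: int) -> float:
    """Approximate pixel count of a circular window: area = pi * r^2.

    Assumes the window is a full circle with no nodata or edge clipping.
    """
    return math.pi * radius ** 2


if __name__ == "__main__":
    n = pixels_in_circular_window(668)
    print(f"approx. pixels per solve: {n:,.0f}")
```

Windows near the raster edge or over nodata areas would contain fewer pixels, so this is an upper-end estimate per solve.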
I'm hoping there may be ways to make Omniscape scale more favorably with increasing number of threads. Things like continental-scale analyses may not be possible at this time given these numbers. The best solution may involve hierarchical parallel processing, but maybe there are some simpler steps that could be taken to improve scaling.
cc @ViralBShah @ranjanan