Atomic operations on array elements #32455
Comments
Ah, I see, this differs from #29943. That is asking for atomic operations on pointers, while this is about doing atomic operations via pointers to other structures like arrays. I will try to clarify the title. |
This recently came up here: https://discourse.julialang.org/t/parallelizing-a-histogram/41260/9 Would be great to have |
Quoting my comment in the discourse thread:
|
Came across this - that's on a GPU, though: https://devblogs.nvidia.com/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/ |
A slowdown of 10x may actually be quite acceptable in some cases, compared to the alternative. Obviously, when you deal with things like small histograms (histograms are just a nice example use case for this), it's most efficient for every thread to histogram separately in 1st level cache, and then merge afterwards. But if the histogram is large (multi-dimensional, many bins), atomics with a 10x overhead (compared to non-atomic writes) may still be faster than maintaining one histogram per thread, especially if the number of threads is large. Just a hunch, of course. I guess it'll come down to memory bandwidths, latencies and caching behavior in the individual use case ... hard to predict. @tfk, do you think the |
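For reference, the "every thread histograms separately, then merge" approach mentioned above could look roughly like the sketch below. This is only an illustration: `hist_per_task`, its `ntasks` keyword and the toy input are made-up names, not code from the benchmark discussed in this thread.

```julia
# Hypothetical helper: count bin indices `idxs` into `nbins` bins using one
# private histogram per task, merged serially at the end.
function hist_per_task(idxs::Vector{Int}, nbins::Int; ntasks::Int = Threads.nthreads())
    chunklen = max(1, cld(length(idxs), ntasks))
    tasks = map(Iterators.partition(idxs, chunklen)) do chunk
        Threads.@spawn begin
            local_hist = zeros(Int, nbins)   # private to this task, nothing shared
            for i in chunk
                local_hist[i] += 1           # plain, non-atomic increment
            end
            local_hist
        end
    end
    # Merge step: costs O(nbins) per task, which is what hurts for huge histograms.
    total = zeros(Int, nbins)
    for t in tasks
        total .+= fetch(t)
    end
    return total
end

idxs = rand(1:16, 10_000)
h = hist_per_task(idxs, 16)
```

The memory cost is one full histogram per task, which is why this stops being attractive once the histogram is large and the task count is high.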
I agree there might be some situations where this can be beneficial. I uploaded the code because I thought it'd be great if people could explore the use of it. But let me also note that my benchmark is about single-thread performance. If you are using multiple threads, atomics can hurt even more because they have to invalidate the cache of all CPUs. So I'd guess 10x is more like a lower bound of the overhead.
I didn't see anything suspicious about it when I quickly looked at the LLVM IR. (BTW, you might want to use @tkf to ping me, to avoid pinging another user.) |
Yes, I've been thinking about that - however, if the access pattern is scattered access to a fairly large amount of memory, the threads would, for the most part, not share cache lines, right? |
Ah, yes, you might be right. |
For a histogram-type operation, I'd expect a (histogram-shaped distribution) of cache conflicts, resulting in a large slowdown for each additional thread you add—starting at 10x slower just for going from single-threaded to 1 thread-but-threadsafe-with atomics. Unlike @tkf, I have no numbers though, and I'm just somewhat parroting his comment. This is actually relevant, because I'm currently thinking about not (yet) having atomics for individual array elements (just too likely to be slow and/or incorrect), so that you should perhaps instead just use a lock over the whole array. https://docs.google.com/document/d/e/2PACX-1vT7Ibthj9WyM8s5bcQbiKsVK6MtvzqmnPFMy-bcjZLlbqv55a0_sTJ99AkbvPIZk3t7MbhZ57NzaIzC/pub |
(note: the idealized cost of lock is on the order of a couple atomic operations, since that's about all it takes to implement one: it's important to mentally realize that the cost of a lock and an atomic are therefore somewhat nearly the same.) |
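To make the "lock over the whole array" alternative concrete, here is a minimal sketch. `hist_global_lock` is a hypothetical name and the input is toy data; it is not the code from any Gist referenced in this thread.

```julia
# Hypothetical helper: all threads increment one shared histogram, and a
# single lock guards the whole array.
function hist_global_lock(idxs::Vector{Int}, nbins::Int)
    hist = zeros(Int, nbins)
    lk = ReentrantLock()
    Threads.@threads for k in eachindex(idxs)
        i = idxs[k]          # per-item work (e.g. computing the bin) stays outside the lock
        lock(lk) do
            hist[i] += 1     # critical section: one plain read-modify-write
        end
    end
    return hist
end

lock_idxs = rand(1:16, 10_000)
lock_hist = hist_global_lock(lock_idxs, 16)
```

Every increment goes through the same lock here, so how well this scales depends on how much work is done outside the critical section.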
Well, you kinda dash my hopes for efficient scattered atomic adds (maybe I can try to do some parallel benchmarking to get us numbers, though). :-) But I'm very excited about the ideas in your manifest! |
I wonder what's the best strategy for implementing a histogram-like function with a very large output array using threads. Maybe (1) create a batch of local index lists (for the entries to be incremented) in each thread, grouped by sub-regions of the output array, and then (2) send them to the "writer" tasks that write to the corresponding sub-region of the output array? This way, I guess we can distribute the contention a bit by dividing it by the number of sub-regions (the contention in this strategy would come from the queue for communicating with each writer task). |
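A rough sketch of steps (1) and (2), assuming one `Channel` per sub-region as the queue. Everything here (`hist_writer_tasks`, `nregions`, `nproducers`, the buffer size of 32) is a hypothetical choice for illustration and has not been benchmarked.

```julia
# Hypothetical sketch: producers batch indices per sub-region, and one writer
# task per sub-region is the only task that touches that slice of `hist`.
function hist_writer_tasks(idxs::Vector{Int}, nbins::Int; nregions::Int = 4, nproducers::Int = 4)
    hist = zeros(Int, nbins)
    regionlen = cld(nbins, nregions)
    region(i) = cld(i, regionlen)              # sub-region that owns bin i
    queues = [Channel{Vector{Int}}(32) for _ in 1:nregions]

    # (2) Writers: drain their own queue until it is closed and empty.
    writers = map(queues) do q
        Threads.@spawn begin
            for batch in q
                for i in batch
                    hist[i] += 1               # no atomics: this region has one writer
                end
            end
        end
    end

    # (1) Producers: group a chunk of indices by sub-region, then send the batches.
    chunklen = max(1, cld(length(idxs), nproducers))
    producers = map(Iterators.partition(idxs, chunklen)) do chunk
        Threads.@spawn begin
            batches = [Int[] for _ in 1:nregions]
            for i in chunk
                push!(batches[region(i)], i)
            end
            for r in 1:nregions
                isempty(batches[r]) || put!(queues[r], batches[r])
            end
        end
    end

    foreach(wait, producers)
    foreach(close, queues)                     # buffered batches can still be taken after close
    foreach(wait, writers)
    return hist
end

writer_idxs = rand(1:16, 10_000)
writer_hist = hist_writer_tasks(writer_idxs, 16)
```

The remaining contention is on the channels, as noted above, and the batch size controls how often a producer has to touch them.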
EDIT: I messed something up here, see below for corrections. Using a global lock is not fast, for this use case.
@vtjnash, you were right, of course: Using a global lock is indeed fastest. Here's a benchmark I ran: https://gist.github.com/oschulz/dce9d2f2104deb2ff42edaa814bb3790
I'm filling a histogram with 201 x 201 x 201 bins with 10^6 I defined
Base.@propagate_inbounds function atomic_addindex!(A::Array{Int64}, v::U, I...) where {U}
    T = Int64
    i = LinearIndices(size(A))[I...]
    v_conv = convert(T, v)
    ptr = pointer(A, i)
    Base.Threads.llvmcall("%ptr = inttoptr i64 %0 to i64*\n%rv = atomicrmw add i64* %ptr, i64 %1 acq_rel\nret i64 %rv\n", T, Tuple{Ptr{T},T}, ptr, v_conv)::T
    A
end
I hope that's correct. With
So with the global lock, I get a 43x speedup on 64 threads - I wouldn't have expected it to scale that well. The random number generation itself is about 20 ms using a thread and 630 μs using all 64 threads, so it's only a relatively small fraction of the total time. When I run with
So if there's a bit of work to do (computing the random numbers and hist-bin index), the overhead of Yay for global locks! :-) EDIT: I messed something up here, see below for corrections. Using a global lock is not fast, for this use case. |
@oschulz I looked at your code on Gist. Your Where is your |
@oschulz I confirmed that without using I added sanity checks https://gist.github.com/biochem-fan/19072418e90dd783227a347efe892860/revisions and repaired function names. The results with 24 threads on Xeon E5-2643 v2 are:
So, atomic add is the fastest. |
Hi @biochem-fan , thanks for catching this. No idea what I did back there, looks like I definitely mixed the function names up and my In any case, I just updated my Gist (https://gist.github.com/oschulz/dce9d2f2104deb2ff42edaa814bb3790) to use
Here's some benchmarks on a 128-core machine (2x AMD EPYC 7662 64-core). The locking approach clearly doesn't scale, while (We also have to take reduced CPU clock boost under heavy multi-threading into account, CPU ran at 3.3 GHz during
Using 1 threads (mean time):
Using 2 threads (mean time):
Using 4 threads (mean time):
Using 8 threads (mean time):
Using 16 threads (mean time):
Using 32 threads (mean time):
Using 64 threads (mean time):
Using 128 threads (mean time):
I don't understand the results for 128 threads, though - even though things are spread over two NUMA domains now, it should be faster than with 64 threads, at least the pure RNG generation should scale well. I did use thread pinning, so each thread should have a separate physical core (no hyper-threading). Can Julia somehow only schedule 64 threads even if started with 128? |
@oschulz Thank you very much for the confirmation and the additional test results. Indeed, your result for 128 threads is interesting. Unfortunately I don't have access to machines with so many cores, so I cannot try it. These new results demonstrate that Until |
Of course! I agree that having something like Alternatively, this could go into a little package. I'm not sure how fragile this would be, with the LLVM intrinsics, regarding newer Julia (and therefore newer LLVM versions). The advantage would be that other packages could use it without depending on a newer Julia version. |
Maybe a little package (e.g. "AtomicArrays.jl") would be best for now. |
Does #37847 actually add atomic array accesses? I don't see any tests for this functionality, although I could just be overlooking them. |
No, I eventually noticed that there aren't many algorithms that would benefit from it. Internally, though, we do use structures that benefit from something similar, such as the access pattern used by IdDict (and others). However, I'm already proposing to make that particular access pattern (concurrent-reader-with-writer, single-writer-with-lock) safe by default, so that it wouldn't require anything different. But since that can now be simulated with an atomic-release fence, it isn't strictly necessary to add either. |
I don't think this issue is fully addressed. Let me reopen this issue since I believe it's reasonable to keep tracking it (but, of course, I'm open to being convinced otherwise). I do agree that supporting atomic operations on array elements can be misleading for people who are not familiar with parallel programming. The best approach is to create per-task mutable states and merge them wisely. I'm very serious about this, to the degree that I built a GitHub org, JuliaFolds, and a series of posts (https://juliafolds.github.io/data-parallelism/) around this principle. Having said that, I believe that atomic operations on arrays are an important ingredient for trickier and more interesting parallel and concurrent programs. I think addressing this issue would be important for Julia to stay relevant in high-performance technical computing.
I don't believe this is the case. There are plenty of practical nonblocking algorithms that require, e.g., CAS on an entry of an array. For example:
These data structures would be relevant in high-performance programming when the access pattern is sparse (= very little contention) and/or the resulting object is relatively large compared to the available memory. An extreme example is the GPU, where there are so many threads that even a moderately sized object (by CPU standards) cannot be allocated on a per-thread basis. This necessitates the use of atomics when computing, e.g., a histogram. Indeed, GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell | NVIDIA Developer Blog does not even have a "non-atomic" example and explains why adding atomics to one more level of memory was beneficial. Edit: I forgot to mention that when it comes to asynchronous updates on a large data structure, Chaotic Relaxation (Chazan and Miranker, 1969) is also an interesting example. There are also relatively recent examples of similar concepts in stochastic gradient descent (Recht et al., 2011) and parallel graph analytics (Lenharth, Nguyen, and Pingali, 2016). |
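To illustrate what "CAS on an entry of an array" means, here is a small sketch that uses a `Vector` of separately boxed `Threads.Atomic{Int}` as a stand-in for an array with atomic elements. The stand-in costs an extra pointer indirection per slot, which is exactly what native atomic array elements would avoid. `try_claim!` and the slot layout are hypothetical, not taken from any of the algorithms mentioned above.

```julia
# Try to claim slot `i` for `owner`, where 0 marks an empty slot.
# Returns true only for the single caller that wins the race.
function try_claim!(slots::Vector{Threads.Atomic{Int}}, i::Int, owner::Int)
    old = Threads.atomic_cas!(slots[i], 0, owner)   # returns the previous value
    return old == 0
end

slots = [Threads.Atomic{Int}(0) for _ in 1:8]
wins = [Threads.Atomic{Int}(0) for _ in 1:8]
Threads.@threads for owner in 1:64
    i = mod1(owner, 8)                              # eight owners compete per slot
    if try_claim!(slots, i, owner)
        Threads.atomic_add!(wins[i], 1)
    end
end
```

However many threads race, each slot ends up with exactly one winner, without any lock being taken.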
FWIW, just as a stopgap measure until we get this in |
Closed by AtomicMemory on 1.11/1.12 |
The atomic methods for Atomic call pointer_from_objref and then call the function on the resulting pointer with llvmcall. If we expose the pointer API, we can do atomic operations on arrays more easily. Is there any reason to dispatch the atomic operations only on the Atomic type?