-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace spinocks by mutexes in runtime modules #7485
Comments
Adding some more information about why would be helpful here -- do we have data showing it would be a clear improvement? If so, why did we use spinlocks in the first place? |
I believe when the spinlocks were used we didn't have good fine-grained lock primitives in the runtime and were dependent on heavier-weight OS locks. halide_mutex now adaptively spins, so it should be fine to replace spinlocks with it. |
Spinlocks in user-space are generally a bad idea, unless you are on a "real-time" priority scenario and can more or less ensure you won't be preempted, wich is never really the case in time-sharing operating systems. There's even a recent post/rant from Linus Torvalds about spinlocks in userspace: |
Spin locks are used to implement mutexes in fast_sync, so for that case, we need to replace them with futexes, where futexes are available. Probably also worth looking at the recent switch_to stuff in the Abseil mutex implementation for Linux. The main reason for spinlocks originally was mutexes required allocation and initialization, and effectively a heavier dependency. Our mutex implementation is a fixed size small value that is zero initialized. We should be able to make our mutex implementation superior to spinlocks in all cases that are under the same OS domain. (A spinlock, or other polled mechanism, may still be the only option at the ISA level between say a CPU and a GPU, or a DSP and a GPU where there is only shared memory available as an abstraction.) |
@zvookin Would it be possible to use the Halide/src/runtime/scoped_mutex_lock.h Line 12 in d031408
@abadams mentioned to me that it implements an adaptive mutex (spins just for a bit, then defers to a mutex). |
Yes, I'm told that on modern mutex implementations for Linux, there is no spinning on the futex. If a locking operation is contended, the code immediately enters a path the OS is aware of. (I doubt it goes straight to a system call of course.) I hope to revisit the fast_sync implementation at some point before the heat death of the universe, but would also be happy to collaborate with someone else on improving it. There is also the long standing parallel execution issue. Ideally I'd like Halide to have best in class support for all of this. There are possibilities at the language level as well. E.g. a scheduling directive giving an explicit amount of parallelism to apply to a given loop. |
Yeah, I am also not entirely convinced that spinning even if for just a bit is a good idea. My understanding is that the main benefits of an adaptive mutex comes from keeping the "lock" in user-space as to avoid context-switch just to acquire it when it is not being held. There's really no good general way to come up with an educated guess as to how many spins would be needed on average for an arbitrary case, so, when in doubt, why not 0 spins? Or is there a good heuristic that has been tested in all sorts of workloads and systems that I am unaware of? Anyway, I think we should only offer |
It's easy enough to change the current implementation and rebenchmark. E.g. if I remove spinning from our mutex the mullapudi2016_histogram test gets a full 4x slower (!) for both the manual and automatic schedule. I picked that benchmark because it uses fine-grained parallelism. Looking at the less extreme local laplacian app, removing spinning increases runtime from 26ms to 30ms for the manual schedule and from 33ms to 81ms for the automatic schedule. This is all on ubuntu. The big problem with our mutex is that it does a full sched_yield in the spin loop, which is great if you're the only one using the system and are trying to produce impressive benchmarking numbers, but a massive waste of CPU cycles in production scenarios. Doing something like replacing the sched_yield with the x86 pause instruction totally craters performance, giving similar slowdowns to the above. |
Also for reference, I believe the current system is based on this: https://webkit.org/blog/6161/locking-in-webkit/ |
Yup, I did browse that WebKit blog post, but the microbenchmarks seemed a bit contrived and artificial.
Ok, spinning it is, then! Did you try to different spin count values too? One parallel thread per core (no oversubscription)? No other processes competing for the CPU while running the histogram program?
That's a good point; Linus even mentions on the forum message I referenced:
Spinning is hard. Coming up with a good benchmark for it is even harder. |
Yeah one thread per core and no other work happening on the machine. I have not tried different spin count values. Currently it's 40. It's frustrating to me that the correct thing to do is apparently a lot slower on a meaningful whole-algorithm benchmark (local laplacian), than the thing Linus says is better. Reality isn't matching theory and I'm not sure why. |
Linus is probably basing his claims on more grounded real world scenarios reported in the wild, when there are other "busy" processes competing for a slice of the scheduler. I suppose doing a bit of spinning is fine for us since in the end we do end up doing the responsible thing and calling the os for help. Any process using Halide is sort of expected to be CPU intensive at times and users will probably not have many other competing processes running along. Let's just change the spin count from 40 to 42 ;) In all seriousness, I wonder how spinning with pause for a bit, then a few more times with yield, before calling the mutex, would perform |
For what is worth, I found this article on spinlocks from the perspective of real-time audio synthesis: The article describes an implementation of a spinlock with exponential back-off that is allegedly suitable for "real-time" processing. The context is well defined and explained: one real-time priority thread for the audio synthesis and one normal priority thread for parameter changes and such, running on a system without any other processes contending for the scheduler and cores; only the normal priority thread can spinlock, while the real-time thread only tries to lock and moves on to do something else when it fails. This is quite different than our case here with something like mullapudi2016_histogram, but maybe there are some interesting tidbits there to take home. |
I'm trying to triage outstanding correctness bugs, so I'm removing bug tag because this doesn't crash or produce incorrect output. Let's use the "performance" for what we might call "performance bugs". |
Title says all.
d3d12: #7489
metal: #7532
cuda:
opencl:
webgpu:
gpu_device_selection:
tracing:
The text was updated successfully, but these errors were encountered: