Implement lock-free queue for scheduler message queue #9710
Conversation
Narrowed it down to the rt::io tests.
valgrind told me the Abort came from running out of file descriptors. I bumped the fd limit up and now I get:
So the above failure was caused by the previous fd-limit failures not cleaning up directories in tmp. I had to rm those two directories, and now all tests pass.
I'd still be very wary of this change though.
Thanks! I'm out of town at the moment but will give this a close look this week.
I found SleeperList, which looked like a copy of the MessageQueue code. So I tried replacing it with mpsc_queue::Queue, but then I get crashes in the AtomicPtr::get refcount != 0 check. I haven't been able to track down which AtomicPtr is causing this yet, but it is troubling because I can't explain why or how this change would expose what looks like a bug in AtomicPtr. I suspect the bug is in my code, not AtomicPtr, but this sure is a head scratcher.
I haven't given this a whole lot of thought, but there are two high-level comments which I was wondering about:
@alexcrichton I was wondering whether SleeperList needs to be multi-consumer and whether that might be the issue; if so, it explains the crashes. I will implement another lock-free queue for SleeperList. I've also thought about using the type system to enforce the thread-safety properties of the queue: something similar to comm::stream(), which makes a Pusher that is cloneable and sendable and a Popper that isn't. I'll probably need some help with the intricacies of the type system.
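A rough sketch of that type-system idea in current Rust, using std's channel as a stand-in for the lock-free queue (the names Pusher, Popper, and queue() are illustrative, not from the patch): the producer handle can be cloned and sent to other threads, while the consumer handle has no Clone impl, so only one owner can ever pop.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

// Clonable, sendable producer handle.
pub struct Pusher<T> {
    tx: Sender<T>,
}

// Single-consumer handle: no Clone impl, and the Receiver inside is not
// shareable across threads, so the type system confines popping to one owner.
pub struct Popper<T> {
    rx: Receiver<T>,
}

impl<T> Clone for Pusher<T> {
    fn clone(&self) -> Pusher<T> {
        Pusher { tx: self.tx.clone() }
    }
}

impl<T> Pusher<T> {
    pub fn push(&self, value: T) {
        let _ = self.tx.send(value);
    }
}

impl<T> Popper<T> {
    pub fn pop(&self) -> Option<T> {
        self.rx.try_recv().ok()
    }
}

// Analogous to comm::stream(): the pair is created together, so exactly one
// consumer handle exists per queue.
pub fn queue<T>() -> (Pusher<T>, Popper<T>) {
    let (tx, rx) = channel();
    (Pusher { tx }, Popper { rx })
}
```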
Tests are now passing with the sleeper list using the multiple-producer multiple-consumer bounded queue. I used this algorithm: http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue. The downside is that this queue's capacity is bounded to powers of 2. However, it is very fast and easy to understand. I wasn't sure what I could safely do inside the scheduler when a push to the sleeper list failed because the queue was full, so for now I just put an assert! in and made the queue fairly large. Usually there would be a spin loop with back-off yielding.
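For reference, a compressed sketch of that bounded MPMC design in current Rust, push side only (pop is symmetric). This is illustrative rather than the patch's code: each slot carries a sequence number, capacity is rounded up to a power of two so indexing is a bit mask, and a full queue is reported back to the caller, who can then assert, back off, or retry.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

struct Slot<T> {
    seq: AtomicUsize,
    value: UnsafeCell<Option<T>>,
}

pub struct BoundedMpmcQueue<T> {
    buffer: Box<[Slot<T>]>,
    mask: usize,
    enqueue_pos: AtomicUsize,
    dequeue_pos: AtomicUsize, // used by the (omitted) pop side
}

unsafe impl<T: Send> Send for BoundedMpmcQueue<T> {}
unsafe impl<T: Send> Sync for BoundedMpmcQueue<T> {}

impl<T> BoundedMpmcQueue<T> {
    pub fn with_capacity(requested: usize) -> BoundedMpmcQueue<T> {
        // Round up to a power of two so `pos & mask` replaces a modulo.
        let cap = requested.next_power_of_two().max(2);
        let buffer = (0..cap)
            .map(|i| Slot { seq: AtomicUsize::new(i), value: UnsafeCell::new(None) })
            .collect::<Vec<_>>()
            .into_boxed_slice();
        BoundedMpmcQueue {
            buffer,
            mask: cap - 1,
            enqueue_pos: AtomicUsize::new(0),
            dequeue_pos: AtomicUsize::new(0),
        }
    }

    /// Returns Err(value) when the queue is full; the caller decides whether
    /// to assert, spin with back-off, or drop the message.
    pub fn push(&self, value: T) -> Result<(), T> {
        let mut pos = self.enqueue_pos.load(Ordering::Relaxed);
        loop {
            let slot = &self.buffer[pos & self.mask];
            let seq = slot.seq.load(Ordering::Acquire);
            if seq == pos {
                // The slot is free for this ticket; try to claim it.
                match self.enqueue_pos.compare_exchange_weak(
                    pos, pos + 1, Ordering::Relaxed, Ordering::Relaxed,
                ) {
                    Ok(_) => {
                        unsafe { *slot.value.get() = Some(value) };
                        // Publish the value by bumping the slot's sequence.
                        slot.seq.store(pos + 1, Ordering::Release);
                        return Ok(());
                    }
                    Err(current) => pos = current,
                }
            } else if seq < pos {
                // Consumers haven't freed this slot yet: the queue is full.
                return Err(value);
            } else {
                // Another producer claimed this ticket; reload and retry.
                pos = self.enqueue_pos.load(Ordering::Relaxed);
            }
        }
    }
}
```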
Forgive me for taking so long on this. It looks very impressive, but I'm wary of hastily merging concurrency code these days.

I think it's not too big a problem for the SleeperList to have a bound, at least for now. At the moment the number of work-stealing schedulers is fixed at startup, putting a bound on the size of the SleeperList. In the future that number will be able to increase dynamically, but I still don't think that having a reasonable limit on the number of schedulers is a major problem.

This does convert the SleeperList from a stack to a queue. I'm not sure if this is good or bad. I chose a stack originally with the vague idea that it would be better for cache reasons to keep using threads that were most recently active, but it may not matter at all.

This also makes all pushes to the scheduler message queue allocate, so there's going to be a performance tradeoff here. This will probably scale a lot better when message queues are contested, but be slower to push/pop.

A minor nit, but I would prefer that the mpsc queue be encapsulated in the scheduler-specific MessageQueue type. Partly just for the naming, but also to make it easier to swap out MessageQueue implementations.

Because of the special license this won't pass.

Have you measured any difference in performance with this implementation? I'm going to run some tests locally here.
It would be appropriate to put links to the original C source directly in the Rust source. Can you do so?
I agree, this merge should not be taken lightly. I have run the tests through valgrind's helgrind and drd tools without noticing any problems; however, the output is rather noisy even without this patch, so I could be missing things. I'd feel better if Rust had http://clang.llvm.org/docs/ThreadSanitizer.html support and/or was helgrind-clean.

I will look into implementing a lock-free stack for SleeperList; I missed that detail.

The allocation per queue push is a concern. It is possible to use a bounded free-list to reduce it, but that would complicate the code, and I'd want to see measurements showing the allocation is an issue first, because jemalloc is hard to beat.

I like encapsulating MessageQueue. Perhaps allowing a runtime flag that swaps between implementations could be a good way to go forward with merging this; that minimizes risk and makes testing easier than rebuilding all of Rust from this branch.

I have tried, but so far failed, to find a good measure of performance for this. Benchmarking is hard. I think it would be fairly trivial to come up with some micro-benchmarks, but I was hoping to see a difference in real-world applications. I tried running the stage1 tests and benchmarks, and also apachebench at various levels of concurrency against the rust-http server, hoping to see some improvement, but I didn't. I suspect that when running the tests and benchmarks the scheduler overhead was too small. As for rust-http, I believe I talked to you on irc about the issues with libuv and io on multiple threads; the rust-http server really only utilizes a single core when io-bound because of libuv. Perhaps Servo would be a good test?
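One way the "encapsulate it and switch at runtime" idea could look; a minimal sketch with made-up names, where std's Mutex<VecDeque> stands in for the existing locked queue and the lock-free queue from this patch would become a second variant.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Which backend to use could be picked once at startup, e.g. from a
// runtime flag or environment variable.
enum Backend<T> {
    Locked(Mutex<VecDeque<T>>),
    // LockFree(mpsc_queue::Queue<T>),   // the queue from this patch
}

pub struct MessageQueue<T> {
    backend: Backend<T>,
}

impl<T> MessageQueue<T> {
    pub fn with_locked_backend() -> MessageQueue<T> {
        MessageQueue { backend: Backend::Locked(Mutex::new(VecDeque::new())) }
    }

    pub fn push(&self, msg: T) {
        match &self.backend {
            Backend::Locked(q) => q.lock().unwrap().push_back(msg),
        }
    }

    pub fn pop(&self) -> Option<T> {
        match &self.backend {
            Backend::Locked(q) => q.lock().unwrap().pop_front(),
        }
    }
}
```

The scheduler only ever names MessageQueue, so swapping or A/B-testing implementations stays local to this one module.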
I'll update this branch when I get a chance, but until then the code also exists in stand-alone form at https://github.com/toffaletti/rust-code with links to the original algorithms, if that helps. On second thought, I think the code I'm linking to is more C-style pseudo-code. I'll try to dig up some actual compilable code (I believe some was posted on the mailing list) if you like.

https://github.com/toffaletti/rust-code/blob/master/mpmc_bounded_queue.rs#L29

I'll also explain the padding in a comment in the code, but I wanted to bring up an issue with it here. The padding is to prevent false sharing. For it to work, the padding has to be at least the size of the CPU cache line. I hard-coded it at 64 bytes because I'm not aware of a good way to determine the actual value at compile time or runtime. I've seen references to some older server-class CPUs using 128-byte cache lines. I have no idea what ARM uses, but I think most modern desktop/server CPUs use 64.
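To make the false-sharing point concrete: the hot, frequently written words (the producer and consumer cursors) each want their own cache line so a write to one doesn't invalidate the line holding the other. In today's Rust the padding would usually be spelled with an alignment attribute rather than a hand-counted byte array; a minimal sketch assuming a 64-byte line:

```rust
use std::sync::atomic::AtomicUsize;

// Each wrapped value gets its own (assumed 64-byte) cache line.
#[repr(align(64))]
struct CacheLinePadded<T>(T);

struct Cursors {
    enqueue_pos: CacheLinePadded<AtomicUsize>,
    dequeue_pos: CacheLinePadded<AtomicUsize>,
}
```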
I think microbenchmarks are probably the best bet for measuring any results at this point. As you say, most of our real-world code is going to swamp the scheduler itself with additional inefficiencies; we probably don't have anything that would actually be impacted by the scalability improvements this should bring (nor do I even have a massively multi-core machine to test on right now). Last time I was tuning the scheduler I used this stupid benchmark and was just optimizing profiles to remove as much of the scheduler overhead as possible. I was also using mutrace to measure lock contention, which this change should reduce to zero.
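A throughput microbenchmark along those lines might look like the sketch below. It uses std's channel purely to show the shape of the measurement (spawn N producers, drain from one consumer, report messages per second); the interesting comparison would come from swapping the lock-free queue in for the channel. Thread and message counts are arbitrary.

```rust
use std::sync::mpsc::channel;
use std::thread;
use std::time::Instant;

fn main() {
    let producers = 4;
    let msgs_per_producer: u64 = 1_000_000;
    let (tx, rx) = channel::<u64>();

    let start = Instant::now();
    let handles: Vec<_> = (0..producers)
        .map(|_| {
            let tx = tx.clone();
            thread::spawn(move || {
                for i in 0..msgs_per_producer {
                    tx.send(i).unwrap();
                }
            })
        })
        .collect();
    drop(tx); // so the consumer sees the channel close when producers finish

    let mut received: u64 = 0;
    while rx.recv().is_ok() {
        received += 1;
    }
    for h in handles {
        h.join().unwrap();
    }

    let elapsed = start.elapsed();
    println!(
        "{} messages in {:?} ({:.0} msgs/sec)",
        received,
        elapsed,
        received as f64 / elapsed.as_secs_f64()
    );
}
```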
@toffaletti I wouldn't worry about the lock-free stack. There's no evidence a stack is better here.
This does turn the
Oh, the biggest source of contention is actually probably the
I'll take a shot at the WorkQueue after I get some benchmarking and testing in place to measure these changes. I think it'll be about 2-3 weeks before I have something more to show.
My first attempt at implementing the Chase-Lev deque is pretty ugly: https://github.com/toffaletti/rust-code/blob/master/chase_lev_deque.rs

Interestingly, I believe I found a bug in the take function of the C11 implementation in http://www.di.ens.fr/~zappa/readings/ppopp13.pdf. They used unsigned ints for top/bottom while the original algorithm used signed ints. The overflow caused by bottom = bottom - 1 when bottom is 0 would cause the check top <= bottom to pass, so take would return a value even when the queue was supposed to be empty. I suspect they never noticed this bug because they were storing ints and weren't distinguishing between 0 as a valid return value and 0 as empty.

I also implemented the deque in C++11 (https://github.com/toffaletti/libten/blob/ws_sched/include/ten/work_deque.hh) and plugged it into a task scheduler which has a decent test suite. I haven't seen any crashes yet, but helgrind/drd do report some races. I probably have a bug in the memory ordering for array initialization and/or array grow.
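The signedness problem described above can be shown in isolation. This is just an illustration of the index arithmetic, not the deque itself: with unsigned indices, decrementing bottom past zero wraps around, so the emptiness check gives the wrong answer.

```rust
fn main() {
    // Unsigned top/bottom, as in the C11 paper's implementation.
    let top: u64 = 0;
    let bottom: u64 = 0;            // the deque is empty

    // take() begins by decrementing bottom...
    let b = bottom.wrapping_sub(1); // wraps to u64::MAX instead of -1

    // ...then checks whether an element remains. The check passes, so
    // take() would "return" an element from an empty deque.
    assert!(top <= b);

    // With signed indices, as in the original algorithm, the same check
    // correctly reports the deque as empty.
    let (top, bottom): (i64, i64) = (0, 0);
    let b = bottom - 1;             // -1
    assert!(!(top <= b));
}
```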
I've not forgotten about this PR. I have stress-tested it here quite a bit without any failures, so I'm feeling pretty good about it, but I've yet to read the code for the data structures closely, which I still want to do.
    2u
} else {
    // use next power of 2 as capacity
    2f64.pow(&((capacity as f64).log2().ceil())) as uint
This can use uint::next_power_of_two.
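In today's Rust the equivalent helper is the next_power_of_two method on the unsigned integer types rather than a free function in uint; a one-line sketch of the suggested replacement for the floating-point round-up:

```rust
fn main() {
    let requested: usize = 1000;
    // Round the requested capacity up to the next power of two.
    let capacity = requested.next_power_of_two();
    assert_eq!(capacity, 1024);
}
```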
Reopened as #10080
@toffaletti You may also be interested in #8568
}

struct State<T> {
    pad0: [u8, ..64],
These can be less than 64 bytes, you can save a bit of memory by using 64 minus the size of the padded object.
Rust has no sizeof that works at compile time.
Yeah, I hit that too with my padding. Since it's hard-coded, though, you can safely change the size to 56, knowing that buffer and mask will fill the remaining space. I admit it's a pretty minor issue; I wish my first comment on your patch had been something more helpful than this :)
That won't work on 32 bit machines though.
@toffaletti you could do something with #[cfg(target_word_size = "32")] and #[cfg(target_word_size = "64")] to conditionally choose a definition of the structs.
I'd also have to conditionally define the constructor. I chose to keep the code simple instead of saving the 16-32 bytes per cpu core.
You're right, I should have said 60 bytes. I forgot about the possibility of the cache line boundary being in between buffer and mask.
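A sketch of the conditional definition being discussed, using the 56/60-byte figures from the thread. Note that target_word_size was the cfg key of the time; in current Rust the equivalent is target_pointer_width.

```rust
use std::sync::atomic::AtomicUsize;

// One padded cursor per cache line, assuming 64-byte lines.
#[cfg(target_pointer_width = "64")]
struct PaddedCursor {
    pos: AtomicUsize, // 8 bytes
    pad: [u8; 56],    // 64 - 8
}

#[cfg(target_pointer_width = "32")]
struct PaddedCursor {
    pos: AtomicUsize, // 4 bytes
    pad: [u8; 60],    // 64 - 4
}
```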
I implemented the lock-free queue algorithm from Dmitry Vyukov (http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue) as a potential replacement for std::rt::message_queue::MessageQueue.

`make check-stage2` passes, but I've done little else to validate this change, and it is definitely a change that shouldn't be taken lightly. I included the license Dmitry asked me to use when I implemented a C++11 version of this queue a while ago. If the license is an issue, I suggest reaching out to Dmitry. I believe he works at Google now and recently rewrote Go's scheduler.
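For readers unfamiliar with the algorithm, here is a sketch of the same push/pop scheme in current Rust. It is the non-intrusive node-based variant (values are boxed into nodes) rather than the intrusive one, and the names and memory orderings are my choices, not the patch's: any number of producers swing head forward atomically, while the single consumer walks tail through the linked nodes, using a stub node so producers and the consumer never contend on the same pointer when the queue is empty.

```rust
use std::cell::UnsafeCell;
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    next: AtomicPtr<Node<T>>,
    value: Option<T>,
}

pub struct MpscQueue<T> {
    head: AtomicPtr<Node<T>>,       // producers push here
    tail: UnsafeCell<*mut Node<T>>, // only the single consumer touches this
}

unsafe impl<T: Send> Send for MpscQueue<T> {}
unsafe impl<T: Send> Sync for MpscQueue<T> {}

impl<T> MpscQueue<T> {
    pub fn new() -> MpscQueue<T> {
        let stub = Box::into_raw(Box::new(Node {
            next: AtomicPtr::new(ptr::null_mut()),
            value: None,
        }));
        MpscQueue { head: AtomicPtr::new(stub), tail: UnsafeCell::new(stub) }
    }

    /// Safe to call from any number of threads.
    pub fn push(&self, value: T) {
        let node = Box::into_raw(Box::new(Node {
            next: AtomicPtr::new(ptr::null_mut()),
            value: Some(value),
        }));
        // Swing head to the new node, then link the previous head to it.
        let prev = self.head.swap(node, Ordering::AcqRel);
        unsafe { (*prev).next.store(node, Ordering::Release) };
    }

    /// Must only be called from the single consumer thread.
    pub fn pop(&self) -> Option<T> {
        unsafe {
            let tail = *self.tail.get();
            let next = (*tail).next.load(Ordering::Acquire);
            if next.is_null() {
                // Empty, or a producer swapped head but hasn't linked yet.
                return None;
            }
            // Advance tail; the old tail is freed and the node we read
            // from becomes the new stub.
            *self.tail.get() = next;
            drop(Box::from_raw(tail));
            (*next).value.take()
        }
    }
}

impl<T> Drop for MpscQueue<T> {
    fn drop(&mut self) {
        while self.pop().is_some() {}
        unsafe { drop(Box::from_raw(*self.tail.get())) };
    }
}
```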