perf(mpsc): rewrite and optimize wait queue (#22)
This branch rewrites the MPSC channel wait queue implementation (again), in order to improve performance. This undoes a decently large amount of the perf regression from PR #20.

In particular, I've made the following changes:

* Simplified the design a bit, and reduced the number of CAS loops in both the notify and wait paths
* Factored out fast paths (which touch the state variable without locking) from the notify and wait operations into separate functions, and marked them as `#[inline(always)]`. If we weren't able to perform the operation without actually touching the linked list, we call into a separate `#[inline(never)]` function that actually locks the list and performs the slow path. This means that code that uses these functions still has a function call in it, but a few instructions for performing a CAS can be inlined and the function call avoided in some cases. This *significantly* improves performance!
* Separated the `wait` function into `start_wait` (called the first time a node waits) and `continue_wait` (called if the node is woken, to handle spurious wakeups). This allows simplifying the code for modifying the waker so that we don't have to pass big closures around.
* Other miscellaneous optimizations, such as cache padding some variables that should have been cache padded.

## Performance Comparison

These benchmarks were run against the current `main` branch (f77d534).

### async/mpsc_reusable

```
async/mpsc_reusable/ThingBuf/10
                        time:   [43.953 us 44.522 us 45.057 us]
                        change: [+0.0419% +1.7594% +3.5099%] (p = 0.05 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
async/mpsc_reusable/ThingBuf/50
                        time:   [140.91 us 142.24 us 143.53 us]
                        change: [-31.201% -29.539% -27.824%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
async/mpsc_reusable/ThingBuf/100
                        time:   [250.31 us 255.03 us 259.68 us]
                        change: [-18.966% -17.190% -15.202%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
```

### async/mpsc_integer

```
async/mpsc_integer/ThingBuf/10
                        time:   [208.99 us 215.30 us 221.32 us]
                        change: [+0.6957% +3.8603% +6.9400%] (p = 0.02 < 0.05)
                        Change within noise threshold.
async/mpsc_integer/ThingBuf/50
                        time:   [407.46 us 412.74 us 418.31 us]
                        change: [-39.128% -36.567% -33.267%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  7 (7.00%) high severe
async/mpsc_integer/ThingBuf/100
                        time:   [534.35 us 541.42 us 548.91 us]
                        change: [-44.820% -41.502% -37.120%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe
```

### async/spsc/try_send_reusable

```
async/spsc/try_send_reusable/ThingBuf/100
                        time:   [12.310 us 12.353 us 12.398 us]
                        thrpt:  [8.0656 Melem/s 8.0952 Melem/s 8.1236 Melem/s]
                 change:
                        time:   [-7.5146% -7.1996% -6.8566%] (p = 0.00 < 0.05)
                        thrpt:  [+7.3613% +7.7582% +8.1252%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
async/spsc/try_send_reusable/ThingBuf/500
                        time:   [46.691 us 46.778 us 46.871 us]
                        thrpt:  [10.668 Melem/s 10.689 Melem/s 10.709 Melem/s]
                 change:
                        time:   [-9.4767% -9.2760% -9.0811%] (p = 0.00 < 0.05)
                        thrpt:  [+9.9881% +10.224% +10.469%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
async/spsc/try_send_reusable/ThingBuf/1000
                        time:   [89.763 us 90.757 us 91.843 us]
                        thrpt:  [10.888 Melem/s 11.018 Melem/s 11.140 Melem/s]
                 change:
                        time:   [-9.4302% -8.8637% -8.2018%] (p = 0.00 < 0.05)
                        thrpt:  [+8.9346% +9.7257% +10.412%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  8 (8.00%) high severe
async/spsc/try_send_reusable/ThingBuf/5000
                        time:   [415.34 us 417.89 us 420.42 us]
                        thrpt:  [11.893 Melem/s 11.965 Melem/s 12.038 Melem/s]
                 change:
                        time:   [-13.113% -12.774% -12.411%] (p = 0.00 < 0.05)
                        thrpt:  [+14.170% +14.644% +15.093%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild
async/spsc/try_send_reusable/ThingBuf/10000
                        time:   [847.35 us 848.63 us 849.98 us]
                        thrpt:  [11.765 Melem/s 11.784 Melem/s 11.802 Melem/s]
                 change:
                        time:   [-11.345% -10.820% -10.388%] (p = 0.00 < 0.05)
                        thrpt:  [+11.592% +12.133% +12.796%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
```

### async/spsc/try_send_integer

```
async/spsc/try_send_integer/ThingBuf/100
                        time:   [7.2254 us 7.2467 us 7.2690 us]
                        thrpt:  [13.757 Melem/s 13.799 Melem/s 13.840 Melem/s]
                 change:
                        time:   [-13.292% -12.912% -12.520%] (p = 0.00 < 0.05)
                        thrpt:  [+14.312% +14.826% +15.330%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
async/spsc/try_send_integer/ThingBuf/500
                        time:   [34.358 us 34.477 us 34.582 us]
                        thrpt:  [14.458 Melem/s 14.503 Melem/s 14.553 Melem/s]
                 change:
                        time:   [-18.539% -18.312% -18.072%] (p = 0.00 < 0.05)
                        thrpt:  [+22.058% +22.417% +22.758%]
                        Performance has improved.
async/spsc/try_send_integer/ThingBuf/1000
                        time:   [69.107 us 69.273 us 69.434 us]
                        thrpt:  [14.402 Melem/s 14.436 Melem/s 14.470 Melem/s]
                 change:
                        time:   [-17.759% -17.604% -17.444%] (p = 0.00 < 0.05)
                        thrpt:  [+21.130% +21.365% +21.594%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
async/spsc/try_send_integer/ThingBuf/5000
                        time:   [349.44 us 353.41 us 357.81 us]
                        thrpt:  [13.974 Melem/s 14.148 Melem/s 14.309 Melem/s]
                 change:
                        time:   [-14.832% -14.252% -13.447%] (p = 0.00 < 0.05)
                        thrpt:  [+15.537% +16.621% +17.415%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe
async/spsc/try_send_integer/ThingBuf/10000
                        time:   [712.89 us 732.58 us 754.24 us]
                        thrpt:  [13.258 Melem/s 13.650 Melem/s 14.027 Melem/s]
                 change:
                        time:   [-16.082% -15.161% -14.129%] (p = 0.00 < 0.05)
                        thrpt:  [+16.454% +17.870% +19.164%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
```

I'm actually not really sure why this also improved the `try_send` benchmarks, which don't touch the wait queue...but I'll take it!

Signed-off-by: Eliza Weisman <[email protected]>
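
For readers unfamiliar with the fast-path/slow-path split described in the change list, here is a minimal sketch of the general pattern. It is not the code from this commit: the `WaitQueue` type, the `EMPTY`/`WAITING`/`WAKING` state values, and the use of a `Mutex<VecDeque<u32>>` in place of the real intrusive linked list of wakers are all hypothetical stand-ins chosen for illustration. The only parts taken from the description above are the idea of a lock-free CAS on a state variable in an `#[inline(always)]` function, with a fallback to an `#[inline(never)]` function that takes the lock.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

// Hypothetical state values, invented for this sketch.
const EMPTY: usize = 0; // no waiters and no stored wakeup
const WAITING: usize = 1; // at least one waiter is queued
const WAKING: usize = 2; // a wakeup is stored for the next waiter

/// Hypothetical stand-in for a wait queue. Waiters are plain `u32` ids in a
/// mutex-protected `VecDeque` rather than wakers in an intrusive list.
pub struct WaitQueue {
    state: AtomicUsize,
    waiters: Mutex<VecDeque<u32>>,
}

impl WaitQueue {
    pub fn new() -> Self {
        Self {
            state: AtomicUsize::new(EMPTY),
            waiters: Mutex::new(VecDeque::new()),
        }
    }

    /// Fast path: a single CAS on the state word, small enough to inline.
    /// Returns the id of the waiter that was woken, if any.
    #[inline(always)]
    pub fn notify(&self) -> Option<u32> {
        if self
            .state
            .compare_exchange(EMPTY, WAKING, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            // Nobody was waiting; the wakeup is stored without locking.
            return None;
        }
        self.notify_slow()
    }

    /// Slow path: lock the queue. Kept out of line so callers only inline
    /// the CAS above.
    #[inline(never)]
    fn notify_slow(&self) -> Option<u32> {
        let mut waiters = self.waiters.lock().unwrap();
        loop {
            match self.state.load(Ordering::Acquire) {
                WAITING => {
                    let woken = waiters.pop_front();
                    if waiters.is_empty() {
                        self.state.store(EMPTY, Ordering::Release);
                    }
                    return woken;
                }
                // A wakeup is already stored; nothing more to do.
                WAKING => return None,
                // The queue emptied in the meantime; retry storing the wakeup.
                _ => {
                    if self
                        .state
                        .compare_exchange(EMPTY, WAKING, Ordering::AcqRel, Ordering::Acquire)
                        .is_ok()
                    {
                        return None;
                    }
                }
            }
        }
    }

    /// Fast path: consume a stored wakeup without locking.
    /// Returns `true` if the caller may proceed, `false` if it was enqueued.
    #[inline(always)]
    pub fn wait(&self, id: u32) -> bool {
        if self
            .state
            .compare_exchange(WAKING, EMPTY, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            return true;
        }
        self.wait_slow(id)
    }

    /// Slow path: lock the queue and enqueue, unless a wakeup raced in.
    #[inline(never)]
    fn wait_slow(&self, id: u32) -> bool {
        let mut waiters = self.waiters.lock().unwrap();
        loop {
            let consumed = match self.state.load(Ordering::Acquire) {
                WAKING => self
                    .state
                    .compare_exchange(WAKING, EMPTY, Ordering::AcqRel, Ordering::Acquire)
                    .is_ok(),
                // Already WAITING: only lock holders change this state.
                WAITING => break,
                _ => {
                    if self
                        .state
                        .compare_exchange(EMPTY, WAITING, Ordering::AcqRel, Ordering::Acquire)
                        .is_ok()
                    {
                        break;
                    }
                    false
                }
            };
            if consumed {
                return true;
            }
        }
        waiters.push_back(id);
        false
    }
}
```

In this sketch, a `notify` with no waiters and a `wait` that finds a stored wakeup each cost one inlined CAS and never touch the mutex; every other case pays for a function call and a lock. Whether the same split pays off elsewhere depends on how often the fast path is actually hit, so it is worth benchmarking rather than assuming.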