Lazy-initialize the global reference pool to reduce its overhead when unused #4178
Conversation
Force-pushed from c36a48f to 7683a50.
This gets a big 👍 from me; I'd been wondering about lock-free structures myself, though I hadn't ever thought about it in depth. I'll try to test out the performance of this tomorrow; I suspect it will be similar to #4174 in perf in the empty case and probably quite good in the busy case too...
I am more sceptical about the busy case, because this will heap-allocate for each and every deferred reference-count update. So while the synchronization is cheap, CPU time, at least, should increase significantly due to the additional allocator work.
Note that there is one difference insofar as the lock-free implementation will keep the same performance if used intermittently. I do feel that this will only be relevant to the embedding use case though, and hence we might want to prefer the lazy version that is fast when never used or always used. (In the end, I can also see working around slow …)
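For illustration, here is a minimal sketch of the kind of lock-free pool being debated; the names and shape are hypothetical, not PyO3's actual code. It is a Treiber-style stack where every deferred decrement boxes one node, which is exactly the per-update allocation (and later deallocation) that the busy-case concern is about.

```rust
use std::ptr::NonNull;
use std::sync::atomic::{AtomicPtr, Ordering};

// Hypothetical sketch: each deferred reference-count update pushes one freshly
// boxed node onto a lock-free stack, so the busy case pays an allocation per update.
struct Node {
    obj: NonNull<pyo3::ffi::PyObject>,
    next: *mut Node,
}

static HEAD: AtomicPtr<Node> = AtomicPtr::new(std::ptr::null_mut());

fn defer_decref(obj: NonNull<pyo3::ffi::PyObject>) {
    let node = Box::into_raw(Box::new(Node {
        obj,
        next: std::ptr::null_mut(),
    }));
    let mut head = HEAD.load(Ordering::Relaxed);
    loop {
        unsafe { (*node).next = head };
        match HEAD.compare_exchange_weak(head, node, Ordering::Release, Ordering::Relaxed) {
            Ok(_) => break,
            Err(actual) => head = actual,
        }
    }
}

// Applying the updates later (with the GIL held) walks the stack and frees every node again.
unsafe fn apply_pending_decrefs() {
    let mut node = HEAD.swap(std::ptr::null_mut(), Ordering::Acquire);
    while !node.is_null() {
        let boxed = Box::from_raw(node);
        pyo3::ffi::Py_DECREF(boxed.obj.as_ptr());
        node = boxed.next;
    }
}
```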
Force-pushed from 7683a50 to 2e2e5d9.
@davidhewitt I thought about this some more and I really think we should not place a fair mutex into our Python-Rust entry point, i.e. … Just biting the bullet and using parking_lot might be a simple solution, but I think @alex would be quite unhappy about the dependency (and so would I). So I tried a bit harder and pushed a block-based implementation here that sadly gains another tunable (block capacity), but it would allow us to amortize the allocation overhead. Of course, it does imply quite a bit of complicated lock-free code that might actually be worse than a dependency. (I would not propose to merge the code as it is, even if the performance is satisfactory. If we do want this, we should factor out the lock-free blocked stealing stack into a generic, separately documented and tested data structure. I also had a look for a dependency offering this but did not find any; in particular, the stealing operation appears to be missing from the (mostly defunct) crates in this space.)
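To make the block idea concrete, here is a rough sketch of the shape of such a blocked stealing stack. All names are hypothetical and a plain mutex stands in for the real lock-free synchronization, so only the amortization and the steal semantics are shown, not the compare-exchange machinery the actual crate needs.

```rust
use std::sync::Mutex;

// The tunable mentioned above: one allocation covers this many deferred updates.
const BLOCK_CAPACITY: usize = 64;

struct Block<T> {
    items: Vec<T>, // holds at most BLOCK_CAPACITY entries
    next: Option<Box<Block<T>>>,
}

// Sketch only: the real structure is lock-free; a Mutex keeps the example short.
struct Collector<T> {
    head: Mutex<Option<Box<Block<T>>>>,
}

impl<T> Collector<T> {
    const fn new() -> Self {
        Collector { head: Mutex::new(None) }
    }

    fn push(&self, value: T) {
        let mut head = self.head.lock().unwrap();
        match head.as_mut() {
            // Common case: the current block still has room, no allocation.
            Some(block) if block.items.len() < BLOCK_CAPACITY => block.items.push(value),
            // Amortized case: allocate a new block only once the current one is full.
            _ => {
                let mut block = Box::new(Block {
                    items: Vec::with_capacity(BLOCK_CAPACITY),
                    next: head.take(),
                });
                block.items.push(value);
                *head = Some(block);
            }
        }
    }

    // "Stealing" detaches the whole chain of blocks at once, so it can be drained
    // (and its blocks freed or reused) away from the hot push path.
    fn steal(&self) -> Option<Box<Block<T>>> {
        self.head.lock().unwrap().take()
    }
}
```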
CodSpeed Performance Report: merging #4178 will not alter performance.
I'm not hard opposed to …; I think the vast majority of extensions (citation needed!) are probably not dropping references while the GIL is not held.
Force-pushed from 2e2e5d9 to ddd9f10.
I put the data structure itself into a separate crate available at https://github.com/adamreichold/lockfree-collector, where it does pass a simple smoke test running under Miri. (Strangely enough, Miri does not seem to care for `fetch_update` but is fine with an open-coded `compare_exchange_weak` loop.)
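For context, the difference is roughly between these two ways of writing the same head update. This is a simplified sketch rather than the collector's real code, which also links the new node to the old head inside the loop.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Closure-based form: the shape Miri complained about in the smoke test.
fn publish_with_fetch_update<T>(head: &AtomicPtr<T>, new: *mut T) {
    let _ = head.fetch_update(Ordering::Release, Ordering::Relaxed, |_old| Some(new));
}

// Open-coded compare_exchange_weak loop: semantically the same publication,
// but one which Miri accepts.
fn publish_with_cas_loop<T>(head: &AtomicPtr<T>, new: *mut T) {
    let mut old = head.load(Ordering::Relaxed);
    while let Err(actual) =
        head.compare_exchange_weak(old, new, Ordering::Release, Ordering::Relaxed)
    {
        old = actual;
    }
}
```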
Obligatory: consider filing a bug against Miri with a reproducer :-)
Indeed. Did so immediately: rust-lang/miri#3613. (Albeit I am not really sure whether the closure doesn't indeed introduce additional requirements (in the form of LLVM attributes attached to its arguments or captures) that the concurrent accesses violate.)
Force-pushed from cd9a49f to 0f41e8b.
If your goal is to have a feature to opt out of the pool … That said, I am excited by the new development on the lock-free data structure and intend to go read it.
Force-pushed from 3cee3f0 to 714f4d4.
I updated the lock-free collector to never deallocate blocks (just like our current …).
👍 I can hopefully test this tomorrow at the PyCon sprints. Today we had a wave of new contributors.
Force-pushed from 3063128 to e4110f7.
With #4200 merged and this branch rebased, the comparison is now fair (as both branches exercise the pool again), so I ran the macOS numbers quickly this morning (Python 3.12, M1 MBP):
So at least for macOS (as we somewhat expected) this makes a big difference. I'd be keen to proceed with this, as it hopefully removes the need for users to reach for the more drastic cfg flags to rip the pool out entirely.
This is the no-op test, right? Did you run anything that actually exercises the pool (besides our single-threaded benchmark, which CodSpeed ran here as well and which did not regress after I made the lock-free collector reuse allocations)? Should I give adding a multi-threaded benchmark here a try?
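In case it helps frame the discussion, a multi-threaded workload along the lines being proposed could look roughly like the sketch below; this is a hypothetical example, not the benchmark that ended up in the PR. Each thread gets its own batch of owned references and drops them while the GIL is not held, so every drop has to go through the pool.

```rust
use pyo3::prelude::*;

fn main() {
    pyo3::prepare_freethreaded_python();

    // Create the objects while holding the GIL...
    let batches: Vec<Vec<PyObject>> = Python::with_gil(|py| {
        (0..4)
            .map(|_| (0..10_000).map(|_| py.None()).collect())
            .collect()
    });

    // ...then drop them from several threads while the GIL is not held, so every
    // drop enqueues a deferred reference-count decrement in the pool.
    let handles: Vec<_> = batches
        .into_iter()
        .map(|batch| std::thread::spawn(move || drop(batch)))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }

    // Re-acquiring the GIL is what actually applies the queued decrements.
    Python::with_gil(|_py| ());
}
```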
Agreed. I guess the main issue is where to put the crate. Should I just move it into this repository? I think having it here would simplify the release process but further complicate our already complicated CI. I don't think the data structure itself will see too many releases, so personally I would just put it into the organisation as a separate repository and publish version 0.1.0 so we can use that here.
Force-pushed from e4110f7 to 9123994.
Force-pushed from 9123994 to 8d87c83.
So I added some benchmarks for dropping without the GIL and doing that using multiple threads, and I guess I should have started with that: no performance optimization without measurement. After doing that, I am dropping the lock-free collector on the floor, as its overhead of constantly allocating new blocks when there is contention makes it really slow. (This is admittedly the worst case, and common patterns like using Rayon will exhibit collection concurrently with enqueuing new decrements, but I guess it does happen somewhere and the performance profile is just too bad.) I then replaced it by a commit that uses … So here are the results of the pushed benchmarks:
I am not sure if the benchmarks are stable enough or if they will need to be added to the CodSpeed exceptions. This also still leaves a fair (i.e. queuing) mutex in there on macOS, which is a throughput hazard (though certainly not as much as my ill-designed lock-free collector), for which I currently have no other solution than to go back to using …
Ah, that is unfortunate. I do not think it was wasted effort, as I've wondered about lock-free options for a while and am pleased we've experimented with them, even if the outcome was unsuccessful.
Just for completeness, I tested this branch on macOS with the std mutex replaced by parking_lot. I get a 2x improvement on the …
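Purely to illustrate what that swap amounts to (the names and field types below are guesses, not PyO3's actual definitions): the pool keeps its pending decrements behind a mutex, and only the lock implementation changes.

```rust
use once_cell::sync::Lazy;
use std::ptr::NonNull;

// Hypothetical wrapper so raw object pointers can live in a global queue.
struct ObjPtr(NonNull<pyo3::ffi::PyObject>);
unsafe impl Send for ObjPtr {}

// Current shape: std's mutex, which queues waiters fairly on macOS (as noted above)
// and so becomes a throughput hazard when many threads drop references at once.
static POOL_STD: Lazy<std::sync::Mutex<Vec<ObjPtr>>> =
    Lazy::new(|| std::sync::Mutex::new(Vec::new()));

// The experiment: the same queue behind parking_lot's unfair mutex, which is what
// gave the roughly 2x improvement, at the cost of a dependency.
static POOL_PARKING_LOT: Lazy<parking_lot::Mutex<Vec<ObjPtr>>> =
    Lazy::new(|| parking_lot::Mutex::new(Vec::new()));

fn defer_decref_std(obj: NonNull<pyo3::ffi::PyObject>) {
    POOL_STD.lock().unwrap().push(ObjPtr(obj));
}

fn defer_decref_parking_lot(obj: NonNull<pyo3::ffi::PyObject>) {
    // Identical call shape; only the lock implementation differs.
    POOL_PARKING_LOT.lock().push(ObjPtr(obj));
}
```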
Combined with the removal of the GILPool entirely (when the gil-refs feature is not enabled), hopefully this means that 0.22 call overheads are much less of a bother for users!
@@ -18,6 +18,7 @@ rust-version = "1.63"
 cfg-if = "1.0"
 libc = "0.2.62"
 memoffset = "0.9"
+once_cell = "1"
Roll on MSRV 1.70 :)
This is an alternative approach to the discussion in #4174, to at least have something concrete to discuss w.r.t. lock-free approaches.
This ensures that we never need more than a single atomic access (albeit acquire-release instead of relaxed) to check for a dirty pool without adding dependencies or depending on implementation details of std.
If the pool is never used, this should be as cheap as having a separate flag. However, if the pool is used, it does add costs to each deferred reference-count update, i.e. allocating the node and also freeing it when the update is applied. This could be alleviated by allocating blocks of pointers instead of boxing them individually, but I think I would want to use a dependency for that.
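A minimal sketch of the lazy initialization described here, with hypothetical names standing in for PyO3's internal pool type; the same pattern is available in std as std::sync::OnceLock once the MSRV reaches 1.70, which is what the "Roll on MSRV 1.70" comment above alludes to.

```rust
use once_cell::sync::OnceCell;

// Hypothetical stand-in; the real ReferencePool is internal to PyO3 and not shown here.
struct ReferencePool {
    // the queue of deferred reference-count updates would live here
}

static POOL: OnceCell<ReferencePool> = OnceCell::new();

// Hot path when entering Python: if the pool was never initialized, this is a
// single acquire load that sees `None` and returns immediately, which is the
// "as cheap as a separate flag" case described above.
fn apply_pending_updates(f: impl FnOnce(&ReferencePool)) {
    if let Some(pool) = POOL.get() {
        f(pool);
    }
}

// Only when a reference-count update actually has to be deferred is the pool
// allocated and initialized.
fn pool() -> &'static ReferencePool {
    POOL.get_or_init(|| ReferencePool {})
}
```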