Pursue Elimination of RingBuffer Pooling #1908
Here are some tests with JMC while running the benchmarks. We can't use these queues as-is, since they are single-consumer and we need multi-consumer. However, the results do suggest that we can get a data structure that achieves this performance without object pooling. |
It's quite odd that Spsc is failing, since everything should be serialized, i.e., only 1 thread offering values to the internal queues at a time and only 1 thread collecting from all queues, right? Thread hopping should not affect Spsc. I can only speculate, but when I did some backpressure-related code lately, I often didn't get the consumption phase right at first: in the case where the upstream produces and the downstream requests at the same time, both may end up in the drain loop. This is why I used the emitting flag in onBackpressureBlock and in the backpressure-aware ReplaySubject. |
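As a hedged illustration of the emitting-flag idea mentioned above, here is a minimal sketch (class and field names are hypothetical, and real implementations also re-check for work that arrived while the flag was being cleared):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: producer and requester may both reach the drain point at the
// same time, but the CAS on `emitting` lets only one enter the loop.
class EmitterSketch {
    final Queue<Object> queue = new ConcurrentLinkedQueue<>();
    final AtomicBoolean emitting = new AtomicBoolean();
    int emitted;

    void drain() {
        if (!emitting.compareAndSet(false, true)) {
            return; // someone else is already draining
        }
        try {
            Object o;
            while ((o = queue.poll()) != null) {
                emitted++; // deliver downstream in the real code
            }
        } finally {
            emitting.set(false);
        }
    }
}
```

This simplified version can miss items offered between the last poll and the flag reset; the actual operators handle that with a missed-work counter or a re-check loop.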
Yes, it should work, and the other spsc/mpsc queue impls work, so it suggests an issue with those impls (they aren't yet officially released). |
@benjchristensen with which params are you running jmh + jfr? |
I manually attach JFR after the test passes the warmup and hits the first iteration. I increase the iteration time to some longer number (been doing 25 seconds) so it runs far longer than the 60 seconds I capture via JFR.
I start flight recording when the first iteration starts:
|
Just a silly oversight on my part. These are unbounded queues and our use case requires them to be bounded. Thus, this check never triggers:

```java
if (!queue.offer(on.next(o))) {
    throw new MissingBackpressureException();
}
```

If we use one of these unbounded queues we'll need to add the overhead of tracking the size, or modify the queue implementation to correctly do that for us. |
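A minimal sketch of the size-tracking option, assuming an external atomic counter around an unbounded queue (`BoundedQueue` and its fields are hypothetical names; the extra AtomicInteger traffic is exactly the overhead being weighed):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: bound an unbounded queue by reserving a slot in `size` first
// and backing out if the capacity would be exceeded.
class BoundedQueue<T> {
    private final Queue<T> delegate = new ConcurrentLinkedQueue<>();
    private final AtomicInteger size = new AtomicInteger();
    private final int capacity;

    BoundedQueue(int capacity) { this.capacity = capacity; }

    boolean offer(T t) {
        if (size.incrementAndGet() > capacity) {
            size.decrementAndGet();
            return false; // caller would throw MissingBackpressureException
        }
        return delegate.offer(t);
    }

    T poll() {
        T t = delegate.poll();
        if (t != null) {
            size.decrementAndGet();
        }
        return t;
    }
}
```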
ConcurrentLinkedQueue takes a performance hit compared with the JCTools implementations for some of the use cases; it's okay for others. |
I modified the code to use the SPSC linked list; only one unit test is failing. Scroll to the right to see the results.

[Flight Recorder comparisons: 1.0.1 vs. modified to use SPSC Linked List, for observeOn and for merge]

It does seem, though, that the trade-offs in performance and object allocation are not severe, and thus this probably warrants the change to allow avoiding object pooling. I'll submit a PR with the changes for evaluation and further experimentation. |
PR #1909 contains the code for the previous perf tests. |
A note while we're working on this ... whatever decision we make here I think should target 1.1 since it will affect memory and GC behavior so is larger than a patch. I also want to give enough time to really think through this ... sleeping on whatever solutions we work on since this is very nuanced and the tradeoffs need to be thought through. |
I've been exploring weak and phantom references tonight while exploring whether it's possible to continue using object pooling but without manual release. I have it working, and performance is good, but at high velocity on JMH tests the ReferenceQueue is filling up in an unbounded manner. Anyone have good experience using WeakReferences for an object pool with high throughput? I wish there was a better way to get a callback from the WeakReference than it being put into a ReferenceQueue. |
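To make the ReferenceQueue issue above concrete, here is a hedged sketch of the pattern (all names are hypothetical): the weak reference only tells us a holder object died, so the reusable payload must be kept strongly reachable alongside it, and someone must drain the ReferenceQueue — which is where the unbounded growth under high throughput shows up.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch: automated release-to-pool via weak references.
class WeakPool {
    static final class Entry extends WeakReference<Object> {
        final int[] payload; // the reusable buffer; stays strongly reachable
        Entry(Object holder, int[] payload, ReferenceQueue<Object> q) {
            super(holder, q);
            this.payload = payload;
        }
    }

    final ReferenceQueue<Object> refQueue = new ReferenceQueue<>();
    final ConcurrentLinkedQueue<int[]> pool = new ConcurrentLinkedQueue<>();

    // Must be called periodically; between calls the ReferenceQueue can
    // grow without bound under high allocation velocity, as noted above.
    void drain() {
        Reference<?> ref;
        while ((ref = refQueue.poll()) != null) {
            pool.offer(((Entry) ref).payload);
        }
    }
}
```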
The problem with pooling is that it can now become the congestion point: multiple threads might want to grab an element from the pool at the same time. The alternative is reference counting:

```java
public void onNext(Object o) {
    Queue<Object> q = queue;
    if (q != null && q.acquire()) { // i.e., CAS 1 -> 2
        q.offer(o);
        q.release(); // i.e., decrementAndGet
    }
}

public void unsubscribe() {
    Queue<Object> q = queue;
    if (q != null) {
        queue = null;
        q.release();
    }
}
```

This could be achieved via an AtomicInteger. When borrowed, the queue starts out at 1; use toggles this between 1 and 2. If an unsubscription happens, the party that goes from 1 to 0 puts the queue back into the pool. The drawback is that you now have extra bookkeeping, which may halve the throughput in RxRingBufferPerf. |
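A self-contained sketch of that acquire/release scheme, under the stated state encoding (1 = idle/borrowed, 2 = in use, 0 = unsubscribed); `RefCountedQueue` and the `Runnable` return-to-pool hook are hypothetical names for illustration:

```java
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: whichever party decrements the state from 1 to 0 is the one
// that returns the queue to the pool, so onNext and unsubscribe can race
// safely without a mutex.
class RefCountedQueue<T> {
    final Queue<T> queue = new ArrayBlockingQueue<>(128);
    final AtomicInteger state = new AtomicInteger(1); // borrowed from pool

    boolean acquire() {
        return state.compareAndSet(1, 2); // CAS 1 -> 2; fails after unsubscribe
    }

    void release(Runnable returnToPool) {
        if (state.decrementAndGet() == 0) {
            returnToPool.run(); // last releaser recycles the queue
        }
    }
}
```

In normal use, `release` just takes the state from 2 back to 1; only when an unsubscribe has already dropped the state to 1 does the in-flight `onNext`'s release hit 0 and recycle the queue.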
Sure, but if it's a concurrent queue, which it is, then that contention should not be severe. I agree it is a point of contention, but if that contention is cheaper than the object allocation/collection then it's a net win.
This could work, but it effectively synchronizes unsubscribe and queue usage (cheaper than a mutex, but still a form of synchronization), which definitely has a cost, both in performance and cognitively, every time this queue is used. Hence we are pursuing either a solution that doesn't need to be pooled, or my pursuit of weak references so the release to the pool is automated. |
I think false sharing isn't really an issue for us because of how the queues are used: either the producer is likely to win the drain race and emit data, or the queue is practically full and only 1 element is put for each 1 element taken (request(1) in merge and observeOn), and the two sides are far away from each other, so no false sharing happens. So an unpadded queue of 128 capacity takes up 1024+16+32 bytes, nicely fitting in L1. The padded version takes 65k. |
I have done some benchmarking on the different solutions that have been posted, including removing the pooling altogether, and here are the numbers. Note that we now have a queue size of 128, not 1024 like we used to.
|
Here is a comparison of no pooling with size at 128 vs 1024:
The drop from 1024 to 128 makes it work okay with no pooling. At 1024 we needed the pooling. |
Any of you able to check performance and object allocation behavior using #1944 so it's not just my judgement on this? Is removing the object pool okay while using SpscArrayQueue as long as we stay at size 128? Can you get WeakReference pooling to work better than the "no pooling" performance? |
@akarnokd based on the perf numbers above what do you think we should do? |
The no-pool 128 version seems to be the best choice generally. Maybe the low spscCreateUseDestroy1 case can be improved by removing the pad and complex hierarchy from the main class. |
Should we move forward with this change in 1.0.x or should it be 1.1? I have not seen any evidence in my testing to suggest that this is a significant change in GC behavior ... which surprised me and is why I'm hesitant and want confirmation. |
By the way, nothing we're doing here should affect Android since we use a LinkedList without pooling on Android. |
What versions of Android are supported? |
We use the SpscRingBuffer for non-Android since it uses sun.misc.Unsafe.
Here is the size for Android: https://github.com/ReactiveX/RxJava/blob/1.x/src/main/java/rx/internal/util/RxRingBuffer.java#L264 and the list impl is chosen when sun.misc.Unsafe is unavailable. If you want to make a change for Android, would you like to submit a PR? |
On Android we generally prefer up-front allocation over lazy allocation of somewhat disposable wrapper objects (such as those inside a LinkedList). This is because of the comparatively poor handling of short-lived objects, and to avoid the GC as much as possible in our 60fps world. We also definitely wouldn't want to pay the cost of having the array expand unless it exceeded 16 values. |
It will never exceed 16 due to how the backpressure approach works; it will throw a MissingBackpressureException instead. Often the buffer is not needed, or it's only needed for a single value (using Observable like a Future), so on Android which is better: allocating a 16-slot array and maybe only using 1 slot, or allocating linked nodes only as needed but potentially allocating more in a slow/contended stream of data? Do you want to submit a PR with the changes done as you'd like them for Android? |
I have added #1969 with Flight Recorder tests showing the impact of removing pooling. It's not catastrophic but it also shows signs of being a potential problem. This is exactly the type of metrics that pushed me to add object pooling. That said, because we've dropped from 1024 to 128 the impact is not anywhere near as bad as it was before. |
I have submitted another variant in #2189 It makes different tradeoffs to allow for object pooling to work in most use cases, normal GC in edge cases (haven't found a consistent one yet, only theoretical, but I haven't tried hard yet), while maintaining more-or-less the same performance characteristics as the current 1.x code. I have NOT battle-tested this and intend on sleeping on it then reviewing again, but wanted to post to trigger discussions and get a review on the direction and trade-offs. |
Problem definition:
A refresher on what has been attempted so far while exploring this:
The behavior we're trying to deal with is:
Here are performance numbers of the various tests I've done:
|
I'd like to move forward on something since we do currently have a correctness issue. Unless there is some performance or functional issue I am not yet aware of, I suggest we move forward with #2189 since it seems to work while retaining performance, despite not being the most elegant of approaches. Once we've merged and released to fix the correctness issue, other more elegant solutions can continue to be pursued. I'd love for something such as the WriterReaderPhaser to work and retain performance. |
As a fun aside, and because this is happening over the holidays, here are some pages from a book I just read my girls the other day that totally made me think of this issue and many others like it :-) Book at http://www.amazon.com/Most-Magnificent-Thing-Ashley-Spires-ebook/dp/B00IZH626G Spoiler ... she eventually succeeds, though still with a few imperfections :-) |
Just to let you know I have added an Spsc which might fit the bill here: |
Quite clever; no CAS, no tombstone, and no copying. However, correct me if I'm wrong, but it seems the code on line 135 may read beyond the buffer if the offset is at the last element. The other thing I see is that the queue should disallow offering Object[] values, because they would confuse the poll. Better yet, when a new array needs to be communicated, wrap it in a private holder class so it is not confused with any other type; resizing should be infrequent enough not to cause significant overhead. |
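The private-holder suggestion can be sketched as follows, heavily simplified and single-threaded for illustration (`HolderSketch` and `NextBuffer` are hypothetical names; a real SPSC queue would use ordered loads/stores and power-of-two index masking):

```java
// Sketch: because NextBuffer is a type users cannot offer, poll() can
// unambiguously distinguish "switch to the new array" from any element.
class HolderSketch {
    static final class NextBuffer {
        final Object[] array;
        NextBuffer(Object[] array) { this.array = array; }
    }

    Object[] buffer = new Object[4];
    int index;

    Object poll() {
        Object o = buffer[index];
        if (o instanceof NextBuffer) {
            // Follow the marker the producer planted when it resized.
            buffer = ((NextBuffer) o).array;
            index = 0;
            o = buffer[index];
        }
        if (o != null) {
            buffer[index] = null;
            index++;
        }
        return o;
    }
}
```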
Since the queues are limited to RxRingBuffer.SIZE, it might be worth considering having this growable queue not grow in several steps but jump immediately to its maximum capacity. In my two-phased version, I also triggered growth after a certain number of elements had been offered; this helped eliminate the cost of the CAS for long-running queues. Here, if poll() can know that no more resizing can happen, the extra check could be skipped. |
Fixed as per your suggestions in JCTools/JCTools#43 |
The RxRingBuffer has been fixed and the JCTools queues have been upgraded in 1.0.5. Pooling seems to be the only way to increase the performance of single-shot merging of values. |
The use of object pooling significantly complicates the `unsubscribe` behavior and when it is safe to release. This issue is to track research on how we can eliminate pooling to improve or fix correctness while still maintaining good performance (and memory allocation and GC behavior).

The `merge` and `observeOn` use cases are good for tracking performance and already have JMH benchmarks in place. Here is how to run the tests:

Here are results for `merge` and `observeOn` comparing use of a pooled `SpmcArrayQueue` vs `SynchronizedQueue` (a synchronized LinkedList). The `**` markers indicate where performance degradation happened.

The `SynchronizedQueue` was only ever intended for use by environments without `sun.misc.Unsafe` (such as Android), so it is worth exploring other alternatives that don't involve a ring buffer (and its allocation overhead) but are thread-safe for the single-producer-multi-consumer use cases the RxRingBuffer is used in (and then rename it to RxQueue or something like that).
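For orientation, a `SynchronizedQueue`-style fallback amounts to roughly the following (a minimal sketch, assuming the same bounded-capacity contract as the ring buffer; `SynchronizedQueueSketch` is a hypothetical name, not the actual class):

```java
import java.util.LinkedList;

// Sketch: a LinkedList guarded by intrinsic locks, usable where
// sun.misc.Unsafe (and thus the JCTools queues) is unavailable.
class SynchronizedQueueSketch<T> {
    private final LinkedList<T> list = new LinkedList<>();
    private final int capacity;

    SynchronizedQueueSketch(int capacity) { this.capacity = capacity; }

    public synchronized boolean offer(T t) {
        if (list.size() >= capacity) {
            return false; // bounded, like RxRingBuffer
        }
        return list.offer(t);
    }

    public synchronized T poll() {
        return list.poll();
    }
}
```

Every operation takes the lock, and each element costs a node allocation, which is why it loses to the array-backed JCTools queues in the benchmarks above.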