
Optimized observeOn/subscribeOn #2603

Closed
wants to merge 4 commits

Conversation

akarnokd (Member) commented Feb 4, 2015

For scalar synchronous values, observeOn and subscribeOn are essentially the same operation.

Benchmark results: (i7 4770k, Java 1.8u31, Windows 7 x64)

gradlew benchmarks "-Pjmh=-f 1 -tu s -bm thrpt -wi 10 -i 10 -r 5 .*ComputationSchedulerPerf.*"

Benchmark                        (size)    1.x Score  Score error   PR  Score  Score error
CompSchedulerPerf.observeOn           1   182479,416    11686,648  210536,292     9760,937  
CompSchedulerPerf.observeOn          10   174911,567    13659,846  179245,342    12725,807  
CompSchedulerPerf.observeOn         100    30997,516      958,587   27000,106      623,127  
CompSchedulerPerf.observeOn        1000     6701,672      314,471    8623,296      703,255  
CompSchedulerPerf.observeOn       10000      927,207       98,571     985,409       32,412  
CompSchedulerPerf.observeOn      100000      111,968        1,176     110,597        2,883  
CompSchedulerPerf.observeOn     1000000       11,790        0,195      11,512        0,146  
CompSchedulerPerf.subscribeOn         1   202557,014    13692,298  204616,231    11272,351  
CompSchedulerPerf.subscribeOn        10   180090,498    14058,995  190338,591     1822,349  
CompSchedulerPerf.subscribeOn       100   113625,737    32012,148  114659,399    50494,246  
CompSchedulerPerf.subscribeOn      1000    35600,359     2975,926   36856,078     1770,347  
CompSchedulerPerf.subscribeOn     10000     4220,548      311,551    3942,293      368,052  
CompSchedulerPerf.subscribeOn    100000      487,187       18,150     472,594       22,474  
CompSchedulerPerf.subscribeOn   1000000       52,191        0,349      50,250        0,510  

Unfortunately, the benchmark results were quite noisy even with more warmup and measurement iterations. I'd say the changes give about +10% for the size = 1 case, but running the same code twice (observeOn 1, subscribeOn 1) gives inconsistent values; I suspect the main cause is GC.

akarnokd (Member, Author) commented Feb 5, 2015

New benchmark results (i7 920, Java 1.8u31, Windows 7 x64)

Benchmark      (size)    Score 1.x            v2     v2 error   Diff %   Diff x
observeOn           1    145464,015   164994,196     3694,358    13,43     1,13
observeOn          10    132737,962   137155,992     1012,799     3,33     1,03
observeOn         100     25681,480    45319,683      514,267    76,47     1,76
observeOn        1000      3422,822    12182,024      460,586   255,91     3,55
observeOn       10000       498,887     1582,275       18,479   217,16     3,17
observeOn      100000        85,889      155,932        2,873    81,55     1,81
observeOn     1000000         8,661       15,959        1,313    84,26     1,84
subscribeOn         1    155442,402   162559,432     1046,801     4,58     1,04
subscribeOn        10    144919,758   152370,842      753,133     5,14     1,05
subscribeOn       100    107249,189   138655,111      836,759    29,28     1,29
subscribeOn      1000     27860,564    57548,020    11424,426   106,56     2,06
subscribeOn     10000      3481,585     9069,788      229,033   160,51     2,60
subscribeOn    100000       362,142      980,154       28,928   170,65     2,70
subscribeOn   1000000        35,890       92,740        1,033   158,40     2,58

There are 2x-3x throughput improvements. I've verified that no exceptions are thrown and that the last emitted value is always size - 1 (all values were delivered). Note that these improvements are due to overhead reduction. With Blackhole.consumeCPU(5), the numbers look like this (a rough sketch of this style of measurement follows the table):

Benchmark      (size)           1.x    1.x error           v2     v2 error
observeOn           1    140296,417      441,165   160466,648    18522,034
observeOn          10    133460,788     2365,310   135523,349     1877,668
observeOn         100     25786,907     1948,696    47624,460      925,317
observeOn        1000      3559,582       81,327    11185,990      988,573
observeOn       10000       552,615       93,469     1370,051      321,379
observeOn      100000        79,281        2,550      141,401        4,252
observeOn     1000000         8,083        0,922       13,593        1,866
subscribeOn         1    155594,824     1630,359   159551,161     2583,546
subscribeOn        10    142827,468     2120,922   146452,153     1323,202
subscribeOn       100     93820,703    21969,555   102171,399    51531,245
subscribeOn      1000     23061,551     2348,392    37658,325     3900,619
subscribeOn     10000      2798,071      144,547     5128,935      331,227
subscribeOn    100000       278,345       11,796      523,751       50,789
subscribeOn   1000000        28,944        1,060       54,256        1,252
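For reference, here is a rough sketch of this style of measurement, assuming the JMH annotations and the RxJava 1.x API; the class name and parameters are illustrative, and the actual ComputationSchedulerPerf benchmark may differ in its details.

import java.util.concurrent.CountDownLatch;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

import rx.Observable;
import rx.schedulers.Schedulers;

// Simplified sketch only; not the actual ComputationSchedulerPerf source.
@State(Scope.Thread)
public class ObserveOnPerfSketch {

    @Param({ "1", "1000", "1000000" })
    public int size;

    @Benchmark
    public void observeOnWithPerItemWork() throws InterruptedException {
        final CountDownLatch latch = new CountDownLatch(1);
        Observable.range(0, size)
                .observeOn(Schedulers.computation())
                .subscribe(
                        v -> Blackhole.consumeCPU(5),   // burn a few CPU "tokens" per item
                        e -> latch.countDown(),
                        latch::countDown);
        latch.await();                                   // wait for the asynchronous delivery to finish
    }
}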

Most likely, the improvement comes from changing isUnsubscribed to read a volatile field instead of always entering synchronized blocks whose locks cannot stay biased.
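As a minimal sketch of that change (class and field names are illustrative, not the actual RxJava subscription classes):

class SynchronizedSubscription {
    // Before: the hot-path check enters the same monitor that unsubscribe() uses,
    // so once another thread touches the lock it cannot stay biased.
    private boolean unsubscribed;                        // guarded by "this"

    public synchronized boolean isUnsubscribed() { return unsubscribed; }

    public synchronized void unsubscribe() { unsubscribed = true; }
}

class VolatileSubscription {
    // After: the hot-path check is a plain volatile read; no monitor is acquired.
    private volatile boolean unsubscribed;

    public boolean isUnsubscribed() { return unsubscribed; }

    public void unsubscribe() { unsubscribed = true; }
}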

akarnokd changed the title from "Optimized Scalar synchronous value observeOn/subscribeOn" to "Optimized observeOn/subscribeOn" on Feb 5, 2015
daschl (Contributor) commented Feb 5, 2015

@akarnokd great stuff, I'll try it out early next week and report some numbers :)

akarnokd (Member, Author) commented Feb 5, 2015

Now on: i7 4770k, Java 1.8u31, Windows 7 x64

Benchmark      (size)    1.x Score  Score error     PR Score  Score error
observeOn           1   182479,416    11686,648   206216,641     4097,849  
observeOn          10   174911,567    13659,846   183143,285     6557,560  
observeOn         100    30997,516      958,587    71018,915     2110,252  
observeOn        1000     6701,672      314,471    13325,539     1940,749  
observeOn       10000      927,207       98,571     1648,631      250,874  
observeOn      100000      111,968        1,176      200,704        7,900  
observeOn     1000000       11,790        0,195       19,285        0,630  
subscribeOn         1   202557,014    13692,298   205876,054     9574,248  
subscribeOn        10   180090,498    14058,995   194964,365     7460,966  
subscribeOn       100   113625,737    32012,148   149761,195   104683,280  
subscribeOn      1000    35600,359     2975,926    87861,668    17543,179  
subscribeOn     10000     4220,548      311,551    13208,973     2794,952  
subscribeOn    100000      487,187       18,150     1695,555      226,512  
subscribeOn   1000000       52,191        0,349      180,335       30,122

benjchristensen (Member) commented:

I see these improvements ... still reviewing the code:

Benchmark                                   (size)   Mode   Samples          1.x        2603    
r.s.ComputationSchedulerPerf.observeOn           1  thrpt         5   104110.926  115707.286
r.s.ComputationSchedulerPerf.observeOn          10  thrpt         5   100723.402  105825.148
r.s.ComputationSchedulerPerf.observeOn         100  thrpt         5    24609.763   65571.461
r.s.ComputationSchedulerPerf.observeOn        1000  thrpt         5     3212.434   13020.027
r.s.ComputationSchedulerPerf.observeOn        2000  thrpt         5     2057.133    
r.s.ComputationSchedulerPerf.observeOn        3000  thrpt         5     1807.644    
r.s.ComputationSchedulerPerf.observeOn        4000  thrpt         5     2000.223    
r.s.ComputationSchedulerPerf.observeOn       10000  thrpt         5      955.002    1555.493  
r.s.ComputationSchedulerPerf.observeOn      100000  thrpt         5       96.628     160.218  
r.s.ComputationSchedulerPerf.observeOn     1000000  thrpt         5        9.508      16.559  
r.s.ComputationSchedulerPerf.subscribeOn         1  thrpt         5   114212.000  118885.516  
r.s.ComputationSchedulerPerf.subscribeOn        10  thrpt         5   112376.809  112270.024  
r.s.ComputationSchedulerPerf.subscribeOn       100  thrpt         5    88433.002  104240.739  
r.s.ComputationSchedulerPerf.subscribeOn      1000  thrpt         5    31503.640   64446.984  
r.s.ComputationSchedulerPerf.subscribeOn      2000  thrpt         5    16292.135    
r.s.ComputationSchedulerPerf.subscribeOn      3000  thrpt         5    11372.297    
r.s.ComputationSchedulerPerf.subscribeOn      4000  thrpt         5     9927.774    
r.s.ComputationSchedulerPerf.subscribeOn     10000  thrpt         5     3932.988    8200.048
r.s.ComputationSchedulerPerf.subscribeOn    100000  thrpt         5      437.626    1439.069
r.s.ComputationSchedulerPerf.subscribeOn   1000000  thrpt         5       43.104     146.385


public abstract class ObjectPool<T> {
    private Queue<T> pool;
    private final int maxSize;

-   private Scheduler.Worker schedulerWorker;
+   private final ScheduledExecutorService schedulerWorker;
akarnokd (Member, Author) commented on the diff:

We should combine all periodic maintenance-related threads into a single service: ObjectPool, IO scheduler's cleanup pool, JDK 6 computation scheduler's purge.
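One possible shape for such a shared service, purely as a sketch (the class name and API here are hypothetical, not existing RxJava code):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: one shared daemon thread for all periodic maintenance
// tasks (pool eviction, worker cleanup, scheduler purge) instead of one
// thread per concern.
enum MaintenanceScheduler {
    INSTANCE;

    private final ScheduledExecutorService service =
        Executors.newSingleThreadScheduledExecutor(new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "RxMaintenanceScheduler");
                t.setDaemon(true);
                return t;
            }
        });

    public ScheduledFuture<?> schedulePeriodically(Runnable task, long period, TimeUnit unit) {
        return service.scheduleAtFixedRate(task, period, period, unit);
    }
}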

benjchristensen (Member) replied:

How does having an extra thread doing this work affect memory, caches, etc., compared to keeping it on the existing event loops?

I ask because we accidentally ran extra threads in a test with Netty+RxJava and it decreased throughput by 40%. That was a far higher workload, because metrics were being captured on the other threads, but it makes me less certain about choices that put work on other threads.

final Subscriber<? super T> child = this.child;
final NotificationLite<T> on = this.on;
@SuppressWarnings("rawtypes")
final AtomicLongFieldUpdater<ObserveOnSubscriber> counter = COUNTER_UPDATER;
benjchristensen (Member) commented on the diff:

Curious, what benefit does assigning these references give? Why use counter instead of COUNTER_UPDATER directly?

Same for the other assignments here.

akarnokd (Member, Author) replied:

Due to the volatile access in the loop, the JIT is not allowed to hoist these fields into registers (a field read can't be moved before a volatile read), so without the local copies they would be re-read on every iteration.
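A simplified sketch of the pattern, with illustrative names rather than the actual ObserveOnSubscriber fields:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

import rx.functions.Action1;

// Illustrative wip/drain loop; not the actual operator code.
class DrainLoopSketch<T> {
    final Queue<T> queue = new ConcurrentLinkedQueue<T>();
    final AtomicLong counter = new AtomicLong();

    void onNext(T value, Action1<? super T> child) {
        queue.offer(value);
        if (counter.getAndIncrement() == 0) {
            drain(child);
        }
    }

    void drain(Action1<? super T> child) {
        // Copy the instance fields into locals once. The volatile read/write in
        // counter.decrementAndGet() below acts as a barrier, so the JIT cannot keep
        // re-reads of this.queue in a register across iterations; locals can stay there.
        final Queue<T> q = queue;
        final AtomicLong c = counter;
        do {
            T v;
            while ((v = q.poll()) != null) {
                child.call(v);
            }
        } while (c.decrementAndGet() > 0);   // volatile read-modify-write each pass
    }
}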

akarnokd (Member, Author) commented:

@benjchristensen what is the verdict on this PR?

benjchristensen (Member) commented:

I think the only open question is the discussion about exposing EventLoopScheduler publicly, which I don't think should happen.

akarnokd (Member, Author) commented:

The optimization is built on direct access, which is not possible if the class is package-private. I could create a Schedulers.scheduleSingle(Scheduler, Action0) that calls schedule directly, and since they are both in the same package, there would be no need to expose EventLoopScheduler.
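A hypothetical sketch of such a helper (this method does not exist in RxJava; only the generic path is shown, and a real version living in the same package could instead hand the action straight to the package-private EventLoopScheduler without creating a per-action worker):

import rx.Scheduler;
import rx.Subscription;
import rx.functions.Action0;

// Hypothetical helper, sketched for illustration only -- not existing RxJava API.
final class ScheduleSingleSketch {

    static Subscription scheduleSingle(Scheduler scheduler, final Action0 action) {
        // Generic fallback: one worker per action, released when the action is done.
        final Scheduler.Worker worker = scheduler.createWorker();
        worker.schedule(new Action0() {
            @Override
            public void call() {
                try {
                    action.call();
                } finally {
                    worker.unsubscribe();
                }
            }
        });
        return worker;
    }
}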

benjchristensen (Member) commented:

So why don't you just move it to the rx.internal packages as you suggested before?

benjchristensen (Member) commented:

Can you please rebase this, and move that file to rx.internal so it is not made part of the public API?

akarnokd (Member, Author) commented:

Okay.
