Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.0.13 Release Candidate #3076

Closed
benjchristensen opened this issue Jul 14, 2015 · 27 comments
Closed

1.0.13 Release Candidate #3076

benjchristensen opened this issue Jul 14, 2015 · 27 comments

Comments

@benjchristensen
Copy link
Member

Update: After production canary testing on the Netflix API we have gained confidence with merge and are now ready to release 1.0.13. We did however choose to rollback changes to replay() and cache() and will pursue them in the next release. The discussion in the comments below shares the details.


This is a pretty significant release as it has major changes to merge and replay, along with the new rx.Single type.

The merge changes are the most sensitive, as they affect almost all applications (either through merge directly, or the ubiquitous use of flatMap).

I'm going to wait a few days before releasing to give people a chance to use the SNAPSHOT. Use the snapshot from a Gradle build like this:

repositories {
    maven { url 'https://oss.jfrog.org/libs-snapshot' }
}

dependencies {
    compile 'io.reactivex:rxjava:1.0.13-SNAPSHOT'
}

This release has quite a few bug fixes and some new functionality. Items of note are detailed here with the list of changes at the bottom.

merge

The merge operator went through a major rewrite to fix some edge case bugs in the previous version. This has been sitting for months going through review and performance testing due to the importance and ubiquity of its usage. It is believed this rewrite is now production ready and achieves the goal of being more correct (no known edge cases at this time) while retaining comparable performance and memory usage.

Special thanks to @akarnokd for this as merge is a challenging one to implement.

window fix and behavior change

Unsubscription bugs were fixed in window. Along the way it also resulted in a fix to one of the window overloads that had a functional discrepancy.

window(Func0<? extends Observable<? extends TClosing>> closingSelector)

This is a small behavior change that corrects it. If you use this overload, please review the change to ensure your application is not affected by an assumption of the previously buggy behavior: #3039

Note that this behavior change only affects that particular overload while the broader bug fixes affect all window overloads.

rx.Single

After much discussion it was decided to add a new type to represent an Observable that emits a single item. Much bike-shedding led to the name Single. This was chosen because Future, Promise and Task are overused and already have nuanced connotations that differ from rx.Single, and we didn't want long, obnoxious names with Observable as a prefix or suffix. Read the issue thread if you want to dig into the long debates.

If you want to understand the reasoning behind adding this type, you can read about it in this comment.

In short, request/response semantics are so common that it was decided worth creating a type that composes well with an Observable but only exposes request/response. The difference in behavior and comparability was also deemed worth having an alternative to Future. In particular, a Single is lazy whereas Future is eager. Additionally, merging of Singles becomes an Observable, whereas combining Futures always emits another Future.

Note that the API is added in an @Experimental state. We are fairly confident this will stick around, but are holding final judgement until it is used more broadly. We will promote to a stable API in v1.1 or v1.2.

Examples below demonstrate use of Single.

// Hello World
Single<String> hello = Single.just("Hello World!");
hello.subscribe(System.out::println);

// Async request/response
Single<String> one = getData(1);
Single<String> two = getOtherData(2);

// merge request/responses into an Observable of multiple values (not possible with Futures)
Observable<String> merged = one.mergeWith(two);

// zip request/responses into another Single (similar to combining 2 Futures)
Single<String> zipped = one.zipWith(two, (a, b) -> a + b);

// flatMap to a Single
Single<String> flatMapSingle = one.flatMap(v -> {
    return getOtherData(5);
});

// flatMap to an Observable
Observable<Integer> flatMapObservable = one.flatMapObservable(v -> {
    return Observable.just(1, 2, 3);
});

// toObservable
Observable<String> toObservable = one.toObservable();

// toSingle
Single<Integer> toSingle = Observable.just(1).toSingle();

public static Single<String> getData(int id) {
    return Single.<String> create(s -> {
        // do blocking IO
        s.onSuccess("data_" + id);
    }).subscribeOn(Schedulers.io());
}

public static Single<String> getOtherData(int id) {
    return Single.<String> create(s -> {
        // simulate non-blocking IO
        new Thread(() -> {
            try {
                s.onSuccess("other_" + id);
            } catch (Exception e) {
                s.onError(e);
            }
        }).start();
    });
}
ConnectableObservable.autoConnect

A new feature was added to ConnectableObservable similar in behavior to refCount(), except that it doesn't disconnect when subscribers are lost. This is useful in triggering an "auto connect" once a certain number of subscribers have subscribed.

The JavaDocs and unit tests are good places to understand the feature.

Deprecated onBackpressureBlock

The onBackpressureBlock operator has been deprecated. It will not ever be removed during the 1.x lifecycle, but it is recommended to not use it. It has proven to be a common source of deadlocks and is difficult to debug. It is instead recommended to use non-blocking approaches to backpressure, rather than callstack blocking. Approaches to backpressure and flow control are discussed on the wiki.

Changes

  • Pull 3012 rx.Single
  • Pull 2983 Fixed multiple calls to onStart.
  • Pull 2970 Deprecated onBackpressureBlock
  • Pull 2997 Fix retry() race conditions
  • Pull 3028 Delay: error cut ahead was not properly serialized
  • Pull 3042 add backpressure support for defaultIfEmpty()
  • Pull 3049 single: add toSingle method to Observable
  • Pull 3055 toSingle() should use unsafeSubscribe
  • Pull 3023 ConnectableObservable autoConnect operator
  • Pull 2928 Merge and MergeMaxConcurrent unified and rewritten
  • Pull 3039 Window with Observable: fixed unsubscription and behavior
  • Pull 3045 ElementAt request management enhanced
  • Pull 3048 CompositeException extra NPE protection
  • Pull 3052 Reduce test failure likelihood of testMultiThreadedWithNPEinMiddle
  • Pull 3031 Fix OperatorFlatMapPerf.flatMapIntPassthruAsync Perf Test
  • Pull 2975 Deprecate and rename two timer overloads to interval
  • Pull 2982 TestSubscriber - add factory methods
  • Pull 2995 switchOnNext - ensure initial requests additive and fix request overflow
  • Pull 2972 Fixed window(time) to work properly with unsubscription, added
  • Pull 2990 Improve Subscriber readability
  • Pull 3018 TestSubscriber - fix awaitTerminalEventAndUnsubscribeOnTimeout
  • Pull 3034 Instantiate EMPTY lazily
  • Pull 3033 takeLast() javadoc fixes, standardize parameter names (count instead of num)
  • Pull 3043 TestSubscriber javadoc cleanup
  • Pull 3065 add Subscribers.wrap
  • Pull 3091 Fix autoConnect calling onStart twice.
  • Pull 3092 Single.toObservable

I will be adding examples and more comprehensive release notes for rx.Single.


These were reverted and won't be included in 1.0.13:

  • Pull 2969 Operator cache() now supports backpressure
  • Pull 3047 Operator replay() now supports backpressure
@wendigo
Copy link

wendigo commented Jul 14, 2015

Nice release :)

@daschl
Copy link
Contributor

daschl commented Jul 15, 2015

FWIW, I bumped the couchbase SDK to 1.0.13-SNAPSHOT and all the unit and integration tests look good. Looking forward to the 1.0.13 release, will be shipping with 2.2.0 :)

@benjchristensen
Copy link
Member Author

This SNAPSHOT badly failed our production canary testing. It seems be causing our application threads to block and grow overtime.

@abersnaze is currently doing the tedious bisect + canary test process to figure out which commit is causing the issue.

Until we can determine what it is we will not be releasing this version.

@akarnokd
Copy link
Member

Sorry to hear this. Can you give more details about the failure?
My guesses are:

  • window with boundary observable factory contains behavior change: at every window end condition, now the factory is called
  • the anomaly @davidmoten encountered with retry is now amplified
  • you are using custom operators which don't properly honor backpressure and the rewritten operators expose this in some ways.

@benjchristensen
Copy link
Member Author

All we know so far is that threadActiveCount on the app is now constantly increasing, and the only change is the RxJava jar. Here is the comparison between baseline and canary:

canary

@gfx
Copy link

gfx commented Jul 16, 2015

Really looking forward to this release 👏

BTW, why isn't the version v1.1.0? This version seems to include significant enhancements.

@benjchristensen
Copy link
Member Author

why isn't the version v1.1.0

So far all new functionality has been added with @Experimental or @Beta annotations. The intent is for them to stabilize and prove themselves and then at the 1.1.0 release promote them to "final"/"stable". Thus, the "significant enhancements" aren't considered formally released until we remove the annotations. This approach allows us to continue moving quickly without piling up changes for months, and avoid increment the minor number every few weeks (which no one wants ... since it implies more change than is really happening).

The change of merge is the only change in this release that really has made me cautious, as last time we touched merge "bad things happened" :-)

We definitely are nearing a 1.1.0 release as we have had several @Beta and @Experimental items for a few months now.

Does anyone think anything in this release should be removed from a 1.0.13 release and held off until we're ready for 1.1.0?

@gfx
Copy link

gfx commented Jul 16, 2015

@benjchristensen Thanks to detailed explanations. Single is what I thought is significant. I agree with you ;)

@abersnaze
Copy link
Contributor

turned out to be cache()

@benjchristensen
Copy link
Member Author

Thank you @abersnaze for figuring this out. #2969 will need to be reverted, which is proposed in #3081.

At this point we don't yet know what the issue is. Do we proceed with 1.0.13 without cache() changes, or do we hold up a release until we figure that out?

@akarnokd
Copy link
Member

I'd hold up the release because if there is indeed a problem with cache(), fixing it surely involves less drastic changes than a full revert.

@benjchristensen
Copy link
Member Author

Do you have time to help figure it out in the coming days? Also, we don't need to wait a month until 1.0.14. The delay since 1.0.12 is primarily my fault because I've been busy elsewhere. We can definitely move faster.

Another concern I have is your comment about replay().autoConnect(): #3081 (comment) Do you only suspect the new autoConnect() behavior, or all the changes on replay?

@akarnokd
Copy link
Member

I can spare some time to fix cache() but without the usage pattern, I can only guess. Certainly, releasing as is increases a likelihood the problem manifests somewhere else where the devs can create a report with a unit test reproducing the problem.

Both backpressure-aware replay() and cache() were built quite similarly (autoConnect() is not relevant from this aspect), therefore, you could change the Observable.cache() method implementation to use replay().autoConnect() and see if the same problem happens or not.

@benjchristensen
Copy link
Member Author

without the usage pattern, I can only guess

We are trying to build a unit test. I believe the issue we're seeing comes from use of cache() in Hystrix: https://github.com/Netflix/Hystrix/blob/990394e98ba7472b96d90d7246e785d0966ade9a/hystrix-core/src/main/java/com/netflix/hystrix/AbstractCommand.java#L481

We use Hystrix at very high volume, and cache() will be used by a large percentage of those calls. This explains why this impacted our canary so significantly.

releasing

Should we revert both cache() and replay() changes for now, release 1.0.13 with the Single, merge and other changes, then followup with 1.0.14 including cache() and replay()?

@akarnokd
Copy link
Member

I've looked at cache() again and it was prone to child-thrown exceptions, breaking the caching process completely. I've fixed it and replay() in #3088 along with other small changes.

@abersnaze
Copy link
Contributor

I ran two canaries. The first was with the #3088 and the second with cache() replaced replay().autoConnect(). Both showed signed of request threads locking up. At this point I would like to move forward with a smaller but stable build now. Pushing off backpressure on cache/replay to the 1.0.14 won't make the feature any later and will allow people access to the bug fixes in 1.0.13 that we know are okay available sooner.

image

@benjchristensen
Copy link
Member Author

Agreed. Let's revert these and move forward with 1.0.13 and release 1.0.14 when these are fixed.

@abersnaze
Copy link
Contributor

I've updated the #3081 to rollback both cache and replay.

@akarnokd
Copy link
Member

@benjchristensen Could you apply, release and undo the "undo" in #3081 so I don't need to start from scratch again?

One final test I'd conduct is to revert cache() to its previous form and add .onBackpressureBuffer() to it. If this hangs, it is likely there is a backpressure bug in the consumers of cache(). If it passes, my guess is that the downstream doesn't tolerate live synchronous replays (cache() used to connect first, then accept the child Subscriber).

@akarnokd
Copy link
Member

One other thing I can think of is the effect of the onStart patch. Some existing code relied on the fact that it will be called twice and is now under-requesting at the beginning.

@schrepfler
Copy link

I think the handling of this issue, distinguishing between experimental vs stable APIs, stability vs features philosophy, usage of Atlas and canary builds would make a nice addition to the Netflix tech blog. So many good practices demonstrated in one place, it's exemplary.

@abersnaze
Copy link
Contributor

@akarnokd the test of commit 96786bb + .onBackpressureBuffer() also fails the canary. What kind of bug should we be looking for? Not requesting enough?

@akarnokd
Copy link
Member

I'd check all places which call request but with an odd amount, for example, request(RxRingBuffer.SIZE / 2) in your code base.

@benjchristensen
Copy link
Member Author

replay and cache changes reverted: #3081

@akarnokd
Copy link
Member

Please consider including fix #3091 as well into 1.0.13.

@benjchristensen
Copy link
Member Author

Merged. The current 1.x branch is what I intend on releasing shortly.

@benjchristensen
Copy link
Member Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants