-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-3369] [CORE] [STREAMING] Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator #10413
Conversation
Test build #48112 has finished for PR 10413 at commit
|
@srowen I've been thinking about all the RDD APIs last night, and one thought I have is that if the goal is to push more people towards DataFrame/Dataset, then maybe it wouldn't be crazy to not change any of the existing RDD APIs to ensure maximum compatibility. |
I think it probably cuts the other way - if you want people to use dataframes then keeping the RDDs as-is is not that important. I don't see that RDDs are going away and fixing this API seems natural for 2.x. EDIT: I didn't really finish the thought -- the issue here is that this is fixing a moderate problem for Java callers. By having to return an Iterable it requires staging the entire output of a flatMap or mapPartitions in memory. This one is worth fixing. I certainly agree that breaking changes still need a decent reason, but 2.x is the right time to make those. This will be an interesting point of debate then when it comes to removing deprecated methods. |
Test build #48131 has finished for PR 10413 at commit
|
Test build #48179 has finished for PR 10413 at commit
|
Test build #2248 has finished for PR 10413 at commit
|
@rxin I wanted to sound you out on this one -- do you feel strongly against it or are the reasons above a little persuasive? |
Test build #48365 has finished for PR 10413 at commit
|
Let me think about it a little bit more... |
Test build #48606 has finished for PR 10413 at commit
|
Test build #48651 has finished for PR 10413 at commit
|
Test build #48655 has finished for PR 10413 at commit
|
…e producing only an Iterator, not Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable.
Test build #49028 has finished for PR 10413 at commit
|
Test build #49030 has finished for PR 10413 at commit
|
I'd like to raise the issue of whether to proceed with this change. @rxin what's your current thinking? I tend to favor it. Pinging, say, @pwendell or @vanzin or @andrewor14 for an opinion. |
Hm, I'd like to proceed with this as I am in favor of the change. I don't want to do so over strong-er objections. I don't see any additional opinions here. @rxin what do you think -- are there modifications to this that could make things more compatible? documentation? |
I think this change is pretty important, returning an Iterable means store the entire output in memory that really limits what you can do. |
If there's another vote in favor -- and no strong objection surfacing -- I'd like to proceed soon to merge this. |
Merged to master |
…s inconsistent with Scala's Iterator->Iterator Fix Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable. CC rxin pwendell for API change; tdas since it also touches streaming. Author: Sean Owen <[email protected]> Closes apache#10413 from srowen/SPARK-3369.
Fix Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable.
CC @rxin @pwendell for API change; @tdas since it also touches streaming.