SPARK-3290 [GRAPHX] No unpersist calls in SVDPlusPlus #4234
Conversation
Test build #26194 has started for PR 4234 at commit
I don't think this will always have the desired effect. In the cases where you unpersist upstream RDDs before materializing their results, it's equivalent to never caching anything at all. Rather than

```scala
val start = System.currentTimeMillis
val a = sc.parallelize(Array(1)).map(x => { Thread.sleep(5000); x }).cache()
a.count() // Use a once
val b = a.map(identity).cache() // Use a again
// Problem: b has not been computed yet, so a hasn't been used
a.unpersist()
b.count() // Now compute b using a
System.currentTimeMillis - start // => 10 seconds
```

the correct pattern is to materialize a result (here `b`) before unpersisting:

```scala
val start = System.currentTimeMillis
val a = sc.parallelize(Array(1)).map(x => { Thread.sleep(5000); x }).cache()
a.count() // Use a once
val b = a.map(identity).cache()
b.count() // materialize b (cache and compute it), forcing a to be used again
a.unpersist()
System.currentTimeMillis - start // => 5 seconds
```
Test build #26194 has finished for PR 4234 at commit

Test PASSed.
@ankurdave Ah of course. Hm, the thing is that there is a loop of transformations here, so in theory you'd have to save up all the RDD references and unpersist at the very end -- but even that may be insufficient, as the return value isn't materialized before it's returned. What's the right thing here?
@srowen Caching here is necessary since we do reuse some uncached datasets, so I don't think there's a good solution for uncaching in this kind of situation. Ideally Spark would handle this by refcounting RDDs, but the best we can do for now is probably to force materialization as PageRank and Pregel do.
OK, like add some
Adding
@ankurdave Hm, so I looked again and quickly realized the things being cached are Graphs.
@srowen The standard way to materialize a Graph is just to do
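For context, a GraphX Graph is backed by a vertex RDD and an edge RDD, so forcing both to be computed materializes the whole graph. A minimal sketch of that idea (the graph construction below is illustrative, not code from this PR, and assumes an existing SparkContext `sc`):

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Illustrative graph; assumes an existing SparkContext `sc`.
val edges: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
val graph: Graph[Int, Int] = Graph.fromEdges(edges, defaultValue = 0).cache()

// Counting the underlying vertex and edge RDDs forces the cached graph
// to actually be computed before any upstream data is unpersisted.
graph.vertices.count()
graph.edges.count()
```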
Test build #26432 has started for PR 4234 at commit

Test build #26432 has finished for PR 4234 at commit

Test PASSed.
@ankurdave What do you think of the result here?
Test build #27357 has started for PR 4234 at commit

Test build #27357 has finished for PR 4234 at commit

Test FAILed.
Jenkins, retest this please.
Test build #27358 has started for PR 4234 at commit

Test build #27358 has finished for PR 4234 at commit

Test PASSed.
@ankurdave Unless you have more comments, I think this can be merged shortly. The pattern here seems sound: with X cached, derive Y from X, cache Y, materialize Y, then unpersist X.
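That pattern can be sketched in plain RDD terms (an illustrative example, assuming an existing SparkContext `sc`):

```scala
// X cached and materialized
val x = sc.parallelize(1 to 100).map(_ * 2).cache()
x.count()

// Derive Y from X and cache it
val y = x.map(_ + 1).cache()

// Materialize Y *before* unpersisting X, so Y is computed
// while X is still in the cache
y.count()
x.unpersist()
```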
LGTM
This just unpersist()s each RDD in this code that was cache()ed.

Author: Sean Owen <[email protected]>

Closes #4234 from srowen/SPARK-3290 and squashes the following commits:

66c1e11 [Sean Owen] unpersist() each RDD that was cache()ed

(cherry picked from commit 0ce4e43)
Signed-off-by: Ankur Dave <[email protected]>
Thanks! Merged into master & branch-1.3.