
SPARK-3290 [GRAPHX] No unpersist calls in SVDPlusPlus #4234

Closed
wants to merge 1 commit

Conversation

@srowen (Member) commented Jan 27, 2015

This just unpersist()s each RDD in this code that was cache()ed.

@SparkQA commented Jan 27, 2015

Test build #26194 has started for PR 4234 at commit c0311bb.

  • This patch merges cleanly.

@ankurdave (Contributor):

I don't think this will always have the desired effect. In the cases where you unpersist upstream RDDs before materializing their results, it's equivalent to never caching anything at all. Rather than

val start = System.currentTimeMillis
val a = sc.parallelize(Array(1)).map(x => { Thread.sleep(5000); x}).cache()
a.count() // Use a once
val b = a.map(identity).cache() // Use a again
// Problem: b has not been computed yet, so a hasn't been used
a.unpersist()
b.count() // Now compute b using a
System.currentTimeMillis - start // => 10 seconds

the correct pattern is to materialize a result (here b) before unpersisting its upstream RDDs:

val start = System.currentTimeMillis
val a = sc.parallelize(Array(1)).map(x => { Thread.sleep(5000); x}).cache()
a.count() // Use a once
val b = a.map(identity).cache()
b.count() // materialize b (cache and compute it), forcing a to be used again
a.unpersist()
System.currentTimeMillis - start // => 5 seconds

@SparkQA commented Jan 28, 2015

Test build #26194 has finished for PR 4234 at commit c0311bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26194/

@srowen (Member, Author) commented Jan 28, 2015

@ankurdave Ah of course. Hm, the thing is there is a loop of transformations here, so in theory you'd have to save up all the RDD references and unpersist at the very end -- but even that may be insufficient as the return value isn't even materialized before it's returned. Also, operations like outerJoinVertices seem to cache() vertices internally too. It's a bit like the old days of memory management!

What's the right thing here? Don't cache() in this method at all, since lower layers do some of that management? Or force materialization?

@ankurdave (Contributor):

@srowen Caching here is necessary since we do reuse some uncached datasets, so I don't think there's a good solution for uncaching in this kind of situation.

Ideally Spark would handle this by refcounting RDDs, but the best we can do for now is probably to force materialization as PageRank and Pregel do.
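The forced-materialization approach mentioned here can be sketched roughly as follows (a minimal illustration only, not the actual PageRank or Pregel source; it assumes an existing SparkContext `sc`):

```scala
// Sketch of the materialize-then-unpersist pattern used by iterative
// algorithms: each iteration caches and counts the new RDD BEFORE
// dropping the previous one, so the cached data is actually reused.
var curr = sc.parallelize(1 to 100).cache()
curr.count()                        // materialize the initial RDD
for (i <- 1 to 10) {
  val next = curr.map(_ + 1).cache()
  next.count()                      // materialize next while curr is still cached
  curr.unpersist(blocking = false)  // now safe to drop the old iteration
  curr = next
}
```

The count() here exists purely to force evaluation; without it, unpersisting `curr` would make the lineage recompute from scratch, as in the anti-pattern example above.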

@srowen (Member, Author) commented Jan 29, 2015

OK, so add some count() calls to force materialization of the cached RDDs? Or just treat this as won't-fix?

@ankurdave (Contributor):

Adding count() calls would be good.

@srowen (Member, Author) commented Jan 29, 2015

@ankurdave Hm, so I looked again and quickly realized the things being cached are Graphs, which consist of several RDDs. It's possible to add count() calls for both vertices and edges, or a generic materialize() method that counts internally; either way, that adds an API method to an abstract class here. That works, although it might be viewed as overkill -- or just as what has to be done? Wanted to check before I take a shot at it; this may be outside my expertise, as I haven't worked with GraphX.

@ankurdave (Contributor):

@srowen The standard way to materialize a Graph is just to do graph.vertices.count(); graph.edges.count(). Due to the dependency structure this in fact materializes everything inside the Graph. A materialize() method that does this might be useful, but it does seem a bit like overkill.
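A materialize() helper along those lines could look like the sketch below (`materializeGraph` is a hypothetical name, not part of the GraphX API; requires Spark and GraphX on the classpath):

```scala
import org.apache.spark.graphx.Graph

// Hypothetical helper: forces computation of the RDDs backing a Graph.
// Counting both vertices and edges materializes everything inside the
// Graph, due to its internal dependency structure.
def materializeGraph[VD, ED](g: Graph[VD, ED]): Graph[VD, ED] = {
  g.vertices.count()
  g.edges.count()
  g
}
```

After calling this on a newly cached graph, the graphs it was derived from can be safely unpersisted without triggering recomputation later.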

@SparkQA commented Jan 30, 2015

Test build #26432 has started for PR 4234 at commit 350bb2b.

  • This patch merges cleanly.

@SparkQA commented Jan 30, 2015

Test build #26432 has finished for PR 4234 at commit 350bb2b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26432/

@srowen (Member, Author) commented Feb 8, 2015

@ankurdave What do you think of the result here?

@SparkQA commented Feb 12, 2015

Test build #27357 has started for PR 4234 at commit 66c1e11.

  • This patch merges cleanly.

@SparkQA commented Feb 12, 2015

Test build #27357 has finished for PR 4234 at commit 66c1e11.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27357/

@srowen (Member, Author) commented Feb 12, 2015

Jenkins, retest this please.

@SparkQA commented Feb 12, 2015

Test build #27358 has started for PR 4234 at commit 66c1e11.

  • This patch merges cleanly.

@SparkQA commented Feb 12, 2015

Test build #27358 has finished for PR 4234 at commit 66c1e11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27358/

@srowen (Member, Author) commented Feb 12, 2015

@ankurdave Unless you have more comments, I think this can be merged in a bit. The pattern here seems sound: with X cached, derive Y from X, cache Y, materialize Y, then unpersist X.
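That pattern, as a minimal Spark sketch (assuming an existing SparkContext `sc`; the RDDs here are illustrative, not from SVDPlusPlus):

```scala
// X is cached and already materialized.
val x = sc.parallelize(1 to 1000).cache()
x.count()

// Derive Y from X and cache it.
val y = x.map(_ * 2).cache()

// Materialize Y while X is still cached, so computing Y reuses X.
y.count()

// Only now is it safe to drop X without forcing recomputation later.
x.unpersist()
```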

@ankurdave (Contributor):

LGTM

@asfgit asfgit closed this in 0ce4e43 Feb 14, 2015
asfgit pushed a commit that referenced this pull request Feb 14, 2015
This just unpersist()s each RDD in this code that was cache()ed.

Author: Sean Owen <[email protected]>

Closes #4234 from srowen/SPARK-3290 and squashes the following commits:

66c1e11 [Sean Owen] unpersist() each RDD that was cache()ed

(cherry picked from commit 0ce4e43)
Signed-off-by: Ankur Dave <[email protected]>
@ankurdave (Contributor):

Thanks! Merged into master & branch-1.3.

@srowen srowen deleted the SPARK-3290 branch February 14, 2015 21:05