
[Spark-16579][SparkR] add install.spark function #14258

Closed
wants to merge 49 commits

Conversation

@junyangq
Contributor

junyangq commented Jul 19, 2016

What changes were proposed in this pull request?

Add an install_spark function to the SparkR package. Users can run install_spark() to install Spark into a local directory from within R.

Updates:

Several changes have been made (see the usage sketch below):

  • install.spark()
    • checks for an existing tar file in the cache folder, and downloads only if it is not found
    • mirror_url look-up is tried in priority order: user-provided -> preferred mirror site from the Apache website -> hardcoded backup option
    • uses Spark 2.0.0
  • sparkR.session()
    • can install Spark when it is not found in SPARK_HOME
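
A rough usage sketch in R (the function and argument names were still being renamed during review, e.g. install_spark/hadoop_version vs. install.spark/hadoopVersion, so treat the exact signature as illustrative):

library(SparkR)

# Download and unpack a Spark distribution into a local cache directory.
# A previously downloaded tarball in the cache is reused instead of re-downloading.
install.spark(hadoopVersion = "2.7")

# sparkR.session() can fall back to the same installation path when no
# Spark distribution is found under SPARK_HOME (local master only).
sparkR.session(master = "local[*]")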

How was this patch tested?

Manual tests, running the check-cran.sh script added in #14173.

@SparkQA

SparkQA commented Jul 19, 2016

Test build #62518 has finished for PR 14258 at commit 9d52d19.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 19, 2016

Test build #62519 has finished for PR 14258 at commit 89efb04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

cc @felixcheung @mengxr @sun-rui

@shivaram
Contributor

Thanks @junyangq, I'll take a look at this today. One question I had is about adding install_spark as a fallback option in sparkR.session if the jars are not found. Can we add that in this PR, or do you want to do that in a different PR?

packageName <- paste0(version, "-bin-hadoop", hadoop_version)
if (is.null(local_dir)) {
  local_dir <- getOption("spark.install.dir",
                         rappdirs::app_dir("spark"))$cache()
Contributor

This will mean we need to add rappdirs as a dependency of SparkR ?

Contributor Author

Yes. Perhaps we can change the implementation to avoid that dependency?

Contributor Author

Sure - will do that!
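
One way to avoid the rappdirs dependency discussed above is to resolve a per-user cache directory directly. A minimal sketch under that assumption (the helper name and the exact per-platform paths are illustrative, not the code that was merged):

# Resolve a per-user cache directory for the Spark distribution without
# depending on the rappdirs package (paths are illustrative).
spark_cache_dir <- function() {
  if (.Platform$OS.type == "windows") {
    base <- Sys.getenv("LOCALAPPDATA",
                       unset = file.path(Sys.getenv("USERPROFILE"), "AppData", "Local"))
    file.path(base, "spark", "Cache")
  } else if (Sys.info()["sysname"] == "Darwin") {
    file.path(Sys.getenv("HOME"), "Library", "Caches", "spark")
  } else {
    base <- Sys.getenv("XDG_CACHE_HOME",
                       unset = file.path(Sys.getenv("HOME"), ".cache"))
    file.path(base, "spark")
  }
}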

@junyangq
Contributor Author

I can add that in this PR @shivaram

# Functions to install Spark in case the user directly downloads SparkR
# from CRAN.

#' Download and Install Spark to Local Directory
Member

This might be a bit confusing? "If I have the SparkR package running why do I have to install Spark"?

Contributor Author

Yeah, it could be. Would "Spark Core" clear up some of the confusion?

@SparkQA

SparkQA commented Jul 21, 2016

Test build #62660 has finished for PR 14258 at commit 98087ad.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 21, 2016

Test build #62662 has finished for PR 14258 at commit 0db89b7.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 21, 2016

Test build #62667 has finished for PR 14258 at commit 503cb9f.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' @note install_spark since 2.1.0
install_spark <- function(hadoop_version = NULL, mirror_url = NULL,
                          local_dir = NULL) {
  version <- paste0("spark-", spark_version_default())
Contributor

@sun-rui Jul 21, 2016

No need to create a function for the Spark version; just use packageVersion("SparkR"). The SparkR version should map 1:1 to the Spark version.

Contributor Author

Sounds good. Thanks :)
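
A minimal sketch of the suggestion above, deriving the Spark version from the installed SparkR package version instead of a separate helper (illustrative only; hadoop_version is fixed here, whereas it is a function argument in the PR):

# The SparkR package version is kept in sync with the Spark release, so it can
# stand in for a hardcoded default Spark version inside install_spark().
hadoop_version <- "2.7"
version <- paste0("spark-", packageVersion("SparkR"))
packageName <- paste0(version, "-bin-hadoop", hadoop_version)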

@SparkQA

SparkQA commented Jul 21, 2016

Test build #62688 has finished for PR 14258 at commit 78d6f91.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 26, 2016

Test build #62860 has finished for PR 14258 at commit e4fe002.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 28, 2016

Test build #62982 has finished for PR 14258 at commit 64756de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

@junyangq I just ran the CRAN checks locally and I see the problem you ran into in #14357 -- the problem is that if we try to run tests that depend on a Java-side change that is in master but not in 2.0.0, then they will fail. I think this shouldn't be a problem in the long run, in the sense that we should match the SparkR and Spark versions closely. However, for the first cut I'm fine with re-opening #14357 and, say, disabling some of the tests temporarily?

@junyangq
Contributor Author

junyangq commented Jul 28, 2016

@shivaram Would a rebuild after cleaning solve the problem, as in #14357? And did you mean disabling those tests in this PR first?

@shivaram
Contributor

Not sure clean + rebuild will solve the problem here. The problem is that we load the Spark 2.0.0 JARs via install_spark (i.e. JARs that don't have the fix in #14095), while we use R test code from the master branch, which has the updated unit test. In other words, we need to use R code that doesn't have that test (i.e. branch-2.0); the master branch cannot be used with the Spark 2.0.0 JARs.

This may not be a big problem if we only enable CRAN checks on branch-2.0, but it seems like disabling the tests as in #14357 is an easy way to avoid confusion for now.

@junyangq
Contributor Author

I see - since the local test uses the downloaded JARs rather than the ones built from the master sources, while the test code comes from master, this causes the problem. I will first disable the tests and then run the CRAN checks.

This is because of the change of output format from 2.0 to 2.1: the downloaded JARs are 2.0, while the test code in the master branch assumes the new output format.
@junyangq
Contributor Author

@mengxr @shivaram @felixcheung Thank you for the discussion and review of this PR. For compatibility between the SparkR version and the JARs, perhaps it's better to send the PR to branch-2.0 and try to merge it there first, and come back to this one later when a new version of the Spark package is available for download.

junyangq changed the title from [Spark-16579][SparkR] add install_spark function to [Spark-16579][SparkR] add install.spark function on Jul 30, 2016
@@ -36,7 +36,7 @@
 #' \code{without-hadoop}.
 #'
 #' @param hadoopVersion Version of Hadoop to install. Default is \code{"2.7"}. It can take other
-#' version number in the format of "int.int".
+#' version number in the format of "x.y" where x and y are integer.
Member

"are integers"?

Contributor Author

Yes, thanks!

@felixcheung
Member

For my comment on #14258 (comment)

Like this:

private[deploy] def isShell(res: String): Boolean = {

@junyangq
Contributor Author

junyangq commented Aug 1, 2016

@felixcheung Sorry, I still didn't get there. It seems that internally it checks via args.primaryResource. I was wondering if there is a good way to access that in SparkR. Thanks!

@shivaram
Contributor

shivaram commented Aug 3, 2016

@junyangq Is #14448 different from this PR, or is it the same one on branch-2.0? I can just merge this into both branches, so I think we don't need a new PR.

@junyangq
Contributor Author

junyangq commented Aug 4, 2016

@shivaram There is only one additional minor change there. The reason I opened #14448 against branch-2.0 is that we download the 2.0 JARs, and there are some API changes from 2.0 to the current master (e.g. showString), so I guess it would cause problems if we used the master R code with the 2.0 JARs.

@shivaram
Contributor

shivaram commented Aug 4, 2016

I see - so I was thinking that we could merge this into master as well, since it's not going to fail any tests or affect any users building SparkR from source -- I don't think we make any promises about the master branch to users. As long as the same code works in branch-2.0, we can just backport this (if we do want a separate PR for branch-2.0 that's fine, but it's just easier to keep all the code review on one PR).

@felixcheung @mengxr Any other comments on this ?

@junyangq
Contributor Author

junyangq commented Aug 5, 2016

Sounds good to me. It doesn't fail any tests except for the CRAN one if you delete --no-test.

@felixcheung
Member

felixcheung commented Aug 7, 2016

I think we should go ahead with this and get some feedback from the community, if we can, as early as possible.
LGTM - we can see later whether we can improve how we detect that we are running from the shell.

@@ -365,6 +365,23 @@ sparkR.session <- function(
}
overrideEnvs(sparkConfigMap, paramMap)
}
# do not download if it is run in the sparkR shell
if (!grepl(".*shell\\.R$", Sys.getenv("R_PROFILE_USER"), perl = TRUE)) {
  if (!nzchar(master) || is_master_local(master)) {
Member

shouldn't we also fail if master != local but SPARK_HOME is not defined or spark jar is not in SPARK_HOME?

Member

To clarify, I mean this check isn't restricted to local only, right?

Contributor Author

Ah, since the install function is restricted to local for now, perhaps we only have to do the check for local master? If so, then I think it would be better to flip the order of if statements...
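
A sketch of the flipped order proposed above: check for a local (or unset) master first, and only then skip the download when running inside the bundled SparkR shell. The master variable, the is_master_local regex, and the assumption that install.spark() returns the installation directory are illustrative, not taken verbatim from the merged code:

master <- Sys.getenv("MASTER")  # stand-in for the session's master argument

is_master_local <- function(master) {
  grepl("^local(\\[([0-9]+|\\*)\\])?$", master, perl = TRUE)
}

# Only consider auto-installing Spark for a local master, and skip the
# download entirely when running inside the bundled sparkR shell.
if (!nzchar(master) || is_master_local(master)) {
  if (!grepl(".*shell\\.R$", Sys.getenv("R_PROFILE_USER"), perl = TRUE)) {
    sparkHome <- Sys.getenv("SPARK_HOME")
    if (!nzchar(sparkHome)) {
      sparkHome <- install.spark()  # assumed to return the installation directory
    }
  }
}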

@SparkQA

SparkQA commented Aug 9, 2016

Test build #63455 has finished for PR 14258 at commit d84ba06.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 9, 2016

Test build #63456 has finished for PR 14258 at commit 3aeb4eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

LGTM

@shivaram
Contributor

Thanks @junyangq and @felixcheung -- LGTM. Merging this to master and branch-2.0.
We should add some tests for this and enable the checks to run on every PR, but we can do that as part of SPARK-16577.

asfgit closed this in 214ba66 on Aug 10, 2016
asfgit pushed a commit that referenced this pull request Aug 10, 2016
Add an install_spark function to the SparkR package. User can run `install_spark()` to install Spark to a local directory within R.

Updates:

Several changes have been made:

- `install.spark()`
    - check existence of tar file in the cache folder, and download only if not found
    - trial priority of mirror_url look-up: user-provided -> preferred mirror site from apache website -> hardcoded backup option
    - use 2.0.0

- `sparkR.session()`
    - can install spark when not found in `SPARK_HOME`

Manual tests, running the check-cran.sh script added in #14173.

Author: Junyang Qian <[email protected]>

Closes #14258 from junyangq/SPARK-16579.

(cherry picked from commit 214ba66)
Signed-off-by: Shivaram Venkataraman <[email protected]>