
[Spark-16579][SparkR] add install.spark function #14258

Closed
wants to merge 49 commits

Conversation

@junyangq
Contributor

junyangq commented Jul 19, 2016

What changes were proposed in this pull request?

Add an install_spark function to the SparkR package. Users can run install_spark() to install Spark into a local directory from within R.

Updates:

Several changes have been made (see the usage sketch below):

  • install.spark()
    • checks for an existing tar file in the cache folder, and downloads only if it is not found
    • mirror_url look-up is tried in priority order: user-provided -> preferred mirror site from the Apache website -> hardcoded backup option
    • uses Spark 2.0.0
  • sparkR.session()
    • can install Spark when it is not found in SPARK_HOME
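
A rough usage sketch in R (the function and argument names were still being renamed during review, e.g. install_spark/hadoop_version vs. install.spark/hadoopVersion, so treat the exact signature as illustrative):

library(SparkR)

# Download and unpack a Spark distribution into a local cache directory.
# A previously downloaded tarball in the cache is reused instead of re-downloading.
install.spark(hadoopVersion = "2.7")

# sparkR.session() can fall back to the same installation path when no
# Spark distribution is found under SPARK_HOME (local master only).
sparkR.session(master = "local[*]")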

How was this patch tested?

Manual tests, running the check-cran.sh script added in #14173.

@SparkQA

SparkQA commented Jul 19, 2016

Test build #62518 has finished for PR 14258 at commit 9d52d19.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 19, 2016

Test build #62519 has finished for PR 14258 at commit 89efb04.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

cc @felixcheung @mengxr @sun-rui

@shivaram
Contributor

Thanks @junyangq, I'll take a look at this today. One question I had is about adding install_spark as a fallback option in sparkR.session if the jars are not found. Can we add that in this PR, or do you want to do that in a different PR?

packageName <- paste0(version, "-bin-hadoop", hadoop_version)
if (is.null(local_dir)) {
  local_dir <- getOption("spark.install.dir",
                         rappdirs::app_dir("spark"))$cache()
Contributor

This will mean we need to add rappdirs as a dependency of SparkR ?

Contributor Author

Yes. Perhaps we can change the implementation to avoid that dependency?

Contributor Author

Sure - will do that!
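
One way to avoid the rappdirs dependency discussed above is to resolve a per-user cache directory directly. A minimal sketch under that assumption (the helper name and the exact per-platform paths are illustrative, not the code that was merged):

# Resolve a per-user cache directory for the Spark distribution without
# depending on the rappdirs package (paths are illustrative).
spark_cache_dir <- function() {
  if (.Platform$OS.type == "windows") {
    base <- Sys.getenv("LOCALAPPDATA",
                       unset = file.path(Sys.getenv("USERPROFILE"), "AppData", "Local"))
    file.path(base, "spark", "Cache")
  } else if (Sys.info()["sysname"] == "Darwin") {
    file.path(Sys.getenv("HOME"), "Library", "Caches", "spark")
  } else {
    base <- Sys.getenv("XDG_CACHE_HOME",
                       unset = file.path(Sys.getenv("HOME"), ".cache"))
    file.path(base, "spark")
  }
}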

@junyangq
Contributor Author

I can add that in this PR @shivaram

# Functions to install Spark in case the user directly downloads SparkR
# from CRAN.

#' Download and Install Spark to Local Directory
Member

This might be a bit confusing? "If I have the SparkR package running why do I have to install Spark"?

Contributor Author

Yeah, it could be. Would "Spark Core" clear up some of the confusion?

@SparkQA

SparkQA commented Jul 21, 2016

Test build #62660 has finished for PR 14258 at commit 98087ad.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 21, 2016

Test build #62662 has finished for PR 14258 at commit 0db89b7.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 21, 2016

Test build #62667 has finished for PR 14258 at commit 503cb9f.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' @note install_spark since 2.1.0
install_spark <- function(hadoop_version = NULL, mirror_url = NULL,
                          local_dir = NULL) {
  version <- paste0("spark-", spark_version_default())
Contributor

@sun-rui Jul 21, 2016

No need to create a function for the Spark version; just use packageVersion("SparkR"). The SparkR version should map 1:1 to the Spark version.

Contributor Author

Sounds good. Thanks :)
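
A minimal sketch of the suggestion above, deriving the Spark version from the installed SparkR package version instead of a separate helper (illustrative only; hadoop_version is fixed here, whereas it is a function argument in the PR):

# The SparkR package version is kept in sync with the Spark release, so it can
# stand in for a hardcoded default Spark version inside install_spark().
hadoop_version <- "2.7"
version <- paste0("spark-", packageVersion("SparkR"))
packageName <- paste0(version, "-bin-hadoop", hadoop_version)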

@SparkQA

SparkQA commented Jul 21, 2016

Test build #62688 has finished for PR 14258 at commit 78d6f91.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 26, 2016

Test build #62860 has finished for PR 14258 at commit e4fe002.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 28, 2016

Test build #62982 has finished for PR 14258 at commit 64756de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

@junyangq I just ran the CRAN checks locally and I see the problem you ran into in #14357 -- the problem is that if we try to run tests that depend on a Java-side change that is in master but not in 2.0.0, then they will fail. I think this shouldn't be a problem in the long run, in the sense that we should match the SparkR and Spark versions closely. However, for the first cut I'm fine with re-opening #14357 and, say, disabling some of the tests temporarily?

@junyangq
Contributor Author

junyangq commented Jul 28, 2016

@shivaram Would a rebuild after cleaning solve the problem, as in #14357? And did you mean disabling those tests in this PR first?

@shivaram
Contributor

Not sure clean + rebuild will solve the problem here. The problem is that we load the Spark 2.0.0 JARs via install_spark (i.e. JARs that don't have the fix in #14095), while we use R test code from the master branch, which has the updated unit test. In other words, we need to use R code that doesn't have that test (i.e. branch-2.0); the master branch cannot be used with the Spark 2.0.0 JARs.

This may not be a big problem if we only enable CRAN checks on branch-2.0, but it seems like disabling the tests as in #14357 is an easy way to avoid confusion for now.

@junyangq
Contributor Author

I see - since the local test uses the downloaded JARs rather than the ones built from the master sources, while the test code comes from master, this causes the problem. I will first disable the tests and then run the CRAN checks.

This is because of the change of output format from 2.0 to 2.1: the downloaded JARs are 2.0, while the test code in the master branch assumes the new output format.
@junyangq
Contributor Author

@mengxr @shivaram @felixcheung Thank you for the discussion and review of this PR. For compatibility between the SparkR version and the JARs, perhaps it's better to send the PR to branch-2.0 and try to merge it there first, and come back to this one later when a new version of the Spark package is available for download.

junyangq changed the title from [Spark-16579][SparkR] add install_spark function to [Spark-16579][SparkR] add install.spark function on Jul 30, 2016
@@ -36,7 +36,7 @@
 #' \code{without-hadoop}.
 #'
 #' @param hadoopVersion Version of Hadoop to install. Default is \code{"2.7"}. It can take other
-#' version number in the format of "int.int".
+#' version number in the format of "x.y" where x and y are integer.
Member

"are integers"?

Contributor Author

Yes, thanks!

@felixcheung
Member

For my comment on #14258 (comment)

Like this:

private[deploy] def isShell(res: String): Boolean = {

@junyangq
Contributor Author

junyangq commented Aug 1, 2016

@felixcheung Sorry, I still didn't get there. It seems that internally it checks via args.primaryResource. I was wondering if there is a good way to access that in SparkR. Thanks!

@shivaram
Contributor

shivaram commented Aug 3, 2016

@junyangq Is #14448 different from this PR, or is it the same one on branch-2.0? I can just merge this into both branches, so I think we don't need a new PR.

@junyangq
Contributor Author

junyangq commented Aug 4, 2016

@shivaram There is only one additional minor change there. The reason I opened #14448 against branch-2.0 is that we download the 2.0 JARs, and there are some API changes from 2.0 to the current master (e.g. showString), so I guess it would cause problems if we used the master R code with the 2.0 JARs.

@shivaram
Contributor

shivaram commented Aug 4, 2016

I see - so I was thinking that we could merge this into master as well, since it's not going to fail any tests or affect any users building SparkR from source -- I don't think we make any promises about the master branch to users. As long as the same code works in branch-2.0, we can just backport this (if we do want a separate PR for branch-2.0 that's fine, but it's just easier to keep all the code review on one PR).

@felixcheung @mengxr Any other comments on this ?

@junyangq
Contributor Author

junyangq commented Aug 5, 2016

Sounds good to me. It doesn't fail any tests except for the CRAN one if you delete --no-test.

@felixcheung
Member

felixcheung commented Aug 7, 2016

I think we should go ahead with this and get some feedback from the community, if we can, as early as possible.
LGTM - we can see later whether we can improve how we detect that we are running from the shell.

@@ -365,6 +365,23 @@ sparkR.session <- function(
}
overrideEnvs(sparkConfigMap, paramMap)
}
# do not download if it is run in the sparkR shell
if (!grepl(".*shell\\.R$", Sys.getenv("R_PROFILE_USER"), perl = TRUE)) {
  if (!nzchar(master) || is_master_local(master)) {
Member

shouldn't we also fail if master != local but SPARK_HOME is not defined or spark jar is not in SPARK_HOME?

Member

To clarify, I mean this check isn't restricted to local only, right?

Contributor Author

Ah, since the install function is restricted to local for now, perhaps we only have to do the check for local master? If so, then I think it would be better to flip the order of if statements...
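
A sketch of the flipped order proposed above: check for a local (or unset) master first, and only then skip the download when running inside the bundled SparkR shell. The master variable, the is_master_local regex, and the assumption that install.spark() returns the installation directory are illustrative, not taken verbatim from the merged code:

master <- Sys.getenv("MASTER")  # stand-in for the session's master argument

is_master_local <- function(master) {
  grepl("^local(\\[([0-9]+|\\*)\\])?$", master, perl = TRUE)
}

# Only consider auto-installing Spark for a local master, and skip the
# download entirely when running inside the bundled sparkR shell.
if (!nzchar(master) || is_master_local(master)) {
  if (!grepl(".*shell\\.R$", Sys.getenv("R_PROFILE_USER"), perl = TRUE)) {
    sparkHome <- Sys.getenv("SPARK_HOME")
    if (!nzchar(sparkHome)) {
      sparkHome <- install.spark()  # assumed to return the installation directory
    }
  }
}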

@SparkQA

SparkQA commented Aug 9, 2016

Test build #63455 has finished for PR 14258 at commit d84ba06.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 9, 2016

Test build #63456 has finished for PR 14258 at commit 3aeb4eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

LGTM

@shivaram
Contributor

Thanks @junyangq and @felixcheung -- LGTM. Merging this to master and branch-2.0.
We should add some tests for this and enable the checks to run on every PR, but we can do that as part of SPARK-16577.

asfgit closed this in 214ba66 on Aug 10, 2016
asfgit pushed a commit that referenced this pull request Aug 10, 2016
Add an install_spark function to the SparkR package. User can run `install_spark()` to install Spark to a local directory within R.

Updates:

Several changes have been made:

- `install.spark()`
    - check existence of tar file in the cache folder, and download only if not found
    - trial priority of mirror_url look-up: user-provided -> preferred mirror site from apache website -> hardcoded backup option
    - use 2.0.0

- `sparkR.session()`
    - can install spark when not found in `SPARK_HOME`

Manual tests, running the check-cran.sh script added in #14173.

Author: Junyang Qian <[email protected]>

Closes #14258 from junyangq/SPARK-16579.

(cherry picked from commit 214ba66)
Signed-off-by: Shivaram Venkataraman <[email protected]>