[Spark-16579][SparkR] add install.spark function #14258
Conversation
Test build #62518 has finished for PR 14258 at commit
Test build #62519 has finished for PR 14258 at commit
Thanks @junyangq, I'll take a look at this today. One question I had is about adding
packageName <- paste0(version, "-bin-hadoop", hadoop_version)
if (is.null(local_dir)) {
  local_dir <- getOption("spark.install.dir",
                         rappdirs::app_dir("spark"))$cache()
This will mean we need to add rappdirs as a dependency of SparkR?
Yes. Perhaps we can change the implementation to avoid that dependency?
yeah it looks like we just want the cache dir from https://github.com/hadley/rappdirs/blob/6b42011053ec9db2068de3f93d3be0a9197b4043/R/cache.r#L42
Sure - will do that!
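For reference, a minimal sketch of the kind of helper discussed above: computing a per-user cache directory directly so that rappdirs does not become a package dependency. The function name, directory names, and environment-variable fallbacks are assumptions for illustration, not the merged implementation.

```r
# Hedged sketch: a per-user cache directory without depending on rappdirs.
# Directory layout and environment-variable fallbacks are assumptions.
spark_cache_path <- function() {
  if (.Platform$OS.type == "windows") {
    base <- Sys.getenv("LOCALAPPDATA",
                       unset = file.path(Sys.getenv("USERPROFILE"), "AppData", "Local"))
    file.path(base, "spark", "Cache")
  } else if (Sys.info()["sysname"] == "Darwin") {
    file.path(Sys.getenv("HOME"), "Library", "Caches", "spark")
  } else {
    base <- Sys.getenv("XDG_CACHE_HOME",
                       unset = file.path(Sys.getenv("HOME"), ".cache"))
    file.path(base, "spark")
  }
}
```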
I can add that in this PR @shivaram
# Functions to install Spark in case the user directly downloads SparkR
# from CRAN.

#' Download and Install Spark to Local Directory
This might be a bit confusing? "If I have the SparkR package running why do I have to install Spark"?
Yeah, it could be. Would "Spark Core" clear up some of the confusion?
Test build #62660 has finished for PR 14258 at commit
Test build #62662 has finished for PR 14258 at commit
Test build #62667 has finished for PR 14258 at commit
#' @note install_spark since 2.1.0
install_spark <- function(hadoop_version = NULL, mirror_url = NULL,
                          local_dir = NULL) {
  version <- paste0("spark-", spark_version_default())
No need to create a function for the Spark version - just use `packageVersion("SparkR")`. The SparkR version should map 1:1 to the Spark version.
Sounds good. Thanks :)
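A sketch of the suggested simplification, assuming the SparkR package version tracks the Spark release one-to-one (the Hadoop version literal is only an example):

```r
# Derive the Spark version from the installed SparkR package instead of a
# separate spark_version_default() helper; assumes a 1:1 version mapping.
version <- paste0("spark-", packageVersion("SparkR"))
packageName <- paste0(version, "-bin-hadoop", "2.7")
# e.g. "spark-2.1.0-bin-hadoop2.7" when SparkR 2.1.0 is installed
```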
Test build #62688 has finished for PR 14258 at commit
Test build #62860 has finished for PR 14258 at commit
Test build #62982 has finished for PR 14258 at commit
@junyangq I just ran the CRAN checks locally and I see the problem you ran into in #14357 -- the problem is that if we try to run tests which depend on a Java-side change that is in master but not in 2.0.0, then they will fail. I think this shouldn't be a problem in the long run, in the sense that we should match the SparkR and Spark versions closely. However, for the first cut I'm fine with re-opening #14357 and, say, disabling some of the tests temporarily?
Not sure clean + rebuild will solve the problem here. The problem is that we load the downloaded Spark 2.0.0 JARs. This may not be a big problem if we only enable CRAN checks on branch-2.0, but it seems like disabling the tests as in #14357 is an easy way to avoid confusion for now.
I see - since the local test uses the downloaded JARs rather than the ones built from the master sources, while the test code comes from master, this causes the problem. I will first disable the tests and run the CRAN checks.
This is because of a change in output from 2.0 to 2.1. The downloaded JARs are 2.0, while the test code in the master branch assumes the new output format.
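One hedged way to disable the version-sensitive tests temporarily would be a conditional skip along these lines; the helper name and the 2.1.0 cutoff are assumptions, and the actual change may simply comment the tests out instead:

```r
library(testthat)

# Skip tests that rely on the new output format when the installed SparkR
# (and therefore the downloaded JARs) predates the change.
skip_if_old_spark <- function() {
  if (packageVersion("SparkR") < "2.1.0") {
    skip("downloaded JARs are 2.0.x; output format changed in 2.1")
  }
}

test_that("output uses the new format", {
  skip_if_old_spark()
  # version-sensitive assertions would go here
  expect_true(TRUE)
})
```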
@mengxr @shivaram @felixcheung Thank you for the discussion and review of this PR. For compatibility between the SparkR version and the JARs, perhaps it's better to send the PR and try to merge it into branch-2.0 first, and come back to this later when the new version of the Spark package is available for download.
@@ -36,7 +36,7 @@
 #' \code{without-hadoop}.
 #'
 #' @param hadoopVersion Version of Hadoop to install. Default is \code{"2.7"}. It can take other
-#' version number in the format of "int.int".
+#' version number in the format of "x.y" where x and y are integer.
"are integers"?
Yes, thanks!
For my comment on #14258 (comment), like this:
@felixcheung Sorry, I still didn't get there. It seems that internally it checks via
I see - so I was thinking that we could merge this into master as well, since it's not going to fail any tests or affect any users building SparkR from source -- I don't think we make any promises about the master branch to users. As long as the same code works in branch-2.0, we can just backport this (if we do want a separate PR for branch-2.0 that's fine, but it's just easier to keep all the code review on one PR). @felixcheung @mengxr Any other comments on this?
Sounds good to me. It doesn't fail any tests except for the CRAN one if you delete
I think we should go ahead with this and, if we can, get some feedback from the community as early as possible.
@@ -365,6 +365,23 @@ sparkR.session <- function(
     }
     overrideEnvs(sparkConfigMap, paramMap)
   }
+  # do not download if it is run in the sparkR shell
+  if (!grepl(".*shell\\.R$", Sys.getenv("R_PROFILE_USER"), perl = TRUE)) {
+    if (!nzchar(master) || is_master_local(master)) {
shouldn't we also fail if master != local but SPARK_HOME is not defined or spark jar is not in SPARK_HOME?
To clarify, I mean this check isn't restricted to local only, right?
Ah, since the install function is restricted to local for now, perhaps we only have to do the check for a local master? If so, then I think it would be better to flip the order of the if statements...
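A hypothetical sketch of the flipped check being suggested, reusing the helper names from the diff above (`is_master_local`, `install.spark`); the exact conditions and fallback behavior are assumptions, not the merged code:

```r
# Check for a local (or unset) master first, and only then decide whether a
# local Spark installation is needed; non-local masters are left untouched.
maybeInstallSpark <- function(master, sparkHome) {
  if (!nzchar(master) || is_master_local(master)) {
    if (!nzchar(sparkHome) || !dir.exists(sparkHome)) {
      sparkHome <- install.spark()  # fall back to the cached local install
    }
  }
  sparkHome
}
```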
Test build #63455 has finished for PR 14258 at commit
Test build #63456 has finished for PR 14258 at commit
LGTM
Thanks @junyangq and @felixcheung -- LGTM. Merging this to master and branch-2.0 |
Add an install_spark function to the SparkR package. User can run `install_spark()` to install Spark to a local directory within R.

Updates: Several changes have been made:
- `install.spark()`
  - check existence of tar file in the cache folder, and download only if not found
  - trial priority of mirror_url look-up: user-provided -> preferred mirror site from apache website -> hardcoded backup option
  - use 2.0.0
- `sparkR.session()`
  - can install spark when not found in `SPARK_HOME`

Manual tests, running the check-cran.sh script added in #14173.

Author: Junyang Qian <[email protected]>

Closes #14258 from junyangq/SPARK-16579.

(cherry picked from commit 214ba66)
Signed-off-by: Shivaram Venkataraman <[email protected]>
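As a hedged illustration of the mirror look-up order described in the summary (user-provided, then the preferred mirror reported by the Apache site, then a hardcoded backup), something along these lines would work; the closer.lua endpoint and the jsonlite call are assumptions for illustration, and the real code may well avoid the extra dependency:

```r
# Resolve a download mirror in the priority order described above.
resolveMirrorUrl <- function(mirrorUrl = NULL) {
  if (!is.null(mirrorUrl)) {
    return(mirrorUrl)                                    # 1. user-provided
  }
  preferred <- tryCatch(
    jsonlite::fromJSON("https://www.apache.org/dyn/closer.lua?as_json=1")$preferred,
    error = function(e) NULL
  )
  if (!is.null(preferred)) {
    return(preferred)                                    # 2. preferred Apache mirror
  }
  "https://archive.apache.org/dist"                      # 3. hardcoded backup
}
```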
What changes were proposed in this pull request?
Add an install_spark function to the SparkR package. Users can run `install_spark()` to install Spark to a local directory within R.

Updates: Several changes have been made:
- `install.spark()`: checks for the tar file in the cache folder and downloads it only if not found; mirror_url look-up priority is user-provided -> preferred mirror site from the Apache website -> hardcoded backup option; uses 2.0.0
- `sparkR.session()`: can install Spark when it is not found in `SPARK_HOME`

How was this patch tested?
Manual tests, running the check-cran.sh script added in #14173.
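A brief usage sketch of the workflow this PR enables; the argument name follows the documentation diff above, and the automatic-install behavior of `sparkR.session()` is described in the summary:

```r
library(SparkR)

# Download and install a Spark build for Hadoop 2.7 into the local cache
# directory; the tarball is reused if it has already been downloaded.
install.spark(hadoopVersion = "2.7")

# sparkR.session() can then fall back to the local installation when
# SPARK_HOME is unset and the master is local.
sparkR.session(master = "local[*]")
```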