SPARK-1652: Remove incorrect deprecation warning in spark-submit #578
Conversation
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14525/
```diff
@@ -124,6 +124,8 @@ object SparkSubmit {
       }
     } else if (clusterManager == YARN) {
       childMainClass = "org.apache.spark.deploy.yarn.Client"
+      // Special flag to avoid deprecation warnings at the client
+      sysProps("SPARK_SUBMIT_YARN") = "true"
```
Nit: is YARN relevant? Could call it SPARK_SUBMIT and use it for standalone with cluster deploy mode as well.
One problem we have with existing configs and envs is that someone creates it for one purpose and then it gets used for a zillion other purposes and it becomes really hard to understand what it's for.
So in general I'm biased at this point towards creating fairly narrowly scoped internal options whenever possible and generalizing them (lazily) when it's necessary. In this case I'd actually hope we can just remove this thing if/when we decide to remove the YARN clients.
@sryza if you feel really strongly I could make it more general. I'm a bit scarred from some cases where options like this have taken on lives of their own :)
Even if it's a hack / workaround, I think it's probably better to work with the assumption that it'll end up sticking around for a while. With that in mind, a system property that signifies we started something from spark-submit makes more sense to me than a system property that signifies we started from spark-submit AND also happen to be running YARN. Especially if we'll need a similar one for standalone mode. Do you think SPARK_SUBMIT_YARN would be less likely to be abused than SPARK_SUBMIT?
That said, it's a tiny issue, and I don't feel super strongly, so if you feel differently, go ahead with the current patch.
Merged build triggered.
Merged build started.
Alright @sryza, I updated it. I think the fact that we'll need this for standalone mode justifies generalizing it.
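A hedged sketch of the generalized form under discussion, assuming it simply renames the property from the earlier diff (the merged patch may differ in details):

```scala
import scala.collection.mutable

object SparkSubmitFlagSketch {
  // Stand-in for the sysProps map SparkSubmit builds for the child process.
  val sysProps = mutable.Map[String, String]()

  // Hypothetical generalized flag: set once regardless of cluster manager,
  // so the YARN client now and a standalone-cluster path later can both
  // detect a spark-submit launch.
  sysProps("SPARK_SUBMIT") = "true"

  // On the consuming side, once the properties reach the child JVM:
  def launchedViaSparkSubmit: Boolean =
    sys.props.get("SPARK_SUBMIT") == Some("true")
}
```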
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14532/
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14536/
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14543/
Jenkins, retest this pleas.e
Jenkins, retest this please.
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
All automated tests passed.
This is a straightforward fix.

Author: Patrick Wendell <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Patrick Wendell <[email protected]>

Closes #578 from pwendell/spark-submit-yarn and squashes the following commits:

96027c7 [Patrick Wendell] Test fixes
b5be173 [Patrick Wendell] Review feedback
4ac9cac [Patrick Wendell] SPARK-1652: spark-submit for yarn prints warnings even though calling as expected

(cherry picked from commit 9f7a095)
Signed-off-by: Patrick Wendell <[email protected]>
SPARK-1076: zipWithIndex and zipWithUniqueId to RDD

Assigning ranks to an ordered or unordered data set is a common operation. This could be done by first counting records in each partition and then assigning ranks in parallel. The purpose of assigning ranks to an unordered set is usually to get a unique id for each item, e.g., to map feature names to feature indices. In such cases, the assignment could be done without counting records, saving one Spark job.

https://spark-project.atlassian.net/browse/SPARK-1076

== update ==

Because assigning ranks is very similar to Scala's zipWithIndex, I changed the method name to zipWithIndex and put the index in the value field.

Author: Xiangrui Meng <[email protected]>

Closes apache#578 and squashes the following commits:

52a05e1 [Xiangrui Meng] changed assignRanks to zipWithIndex, changed assignUniqueIds to zipWithUniqueId, minor updates
756881c [Xiangrui Meng] simplified RankedRDD by implementing assignUniqueIds separately, moved counting iterator size to Utils, do not count items in the last partition, and skip counting if there is only one partition
630868c [Xiangrui Meng] newline
21b434b [Xiangrui Meng] add assignRanks and assignUniqueIds to RDD
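To illustrate the two methods this commit adds, here is a small usage example against the final API names (the local-mode boilerplate is for the example only):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ZipExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("zip-example").setMaster("local[2]"))
    val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

    // zipWithIndex: contiguous 0-based indices in partition order. This
    // needs per-partition counts first, which typically costs one Spark
    // job (skipped when there is a single partition).
    rdd.zipWithIndex().collect().foreach(println)
    // (a,0) (b,1) (c,2) (d,3)

    // zipWithUniqueId: ids are unique but not contiguous -- an item in
    // partition k of n gets k, k+n, k+2n, ... -- so no counting job runs.
    rdd.zipWithUniqueId().collect().foreach(println)

    sc.stop()
  }
}
```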
SPARK-1076: Convert Int to Long to avoid overflow

Patch for PR apache#578.

Author: Xiangrui Meng <[email protected]>

Closes apache#589 and squashes the following commits:

98c435e [Xiangrui Meng] cast Int to Long to avoid Int overflow
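A hedged sketch of the class of bug being fixed, assuming the indices are cumulative per-partition counts (the values and names are illustrative, not from the patch):

```scala
// Hypothetical illustration: deriving each partition's starting index from
// per-partition record counts. Summing in Int overflows once the running
// total passes Int.MaxValue (~2.1 billion); accumulating in Long is safe.
object OverflowSketch extends App {
  val counts: Array[Long] = Array(1500000000L, 1500000000L, 7L)
  val startIndices: Array[Long] = counts.scanLeft(0L)(_ + _).init
  // startIndices == Array(0, 1500000000, 3000000000); as an Int the last
  // value would have wrapped around to a negative number.
  println(startIndices.mkString(", "))
}
```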
SPARK-1076: [Fix apache#578] add @transient to some vals

I'll try to be more careful next time.

Author: Xiangrui Meng <[email protected]>

Closes apache#591 and squashes the following commits:

2b4f044 [Xiangrui Meng] add @transient to prev in ZippedWithIndexRDD; add @transient to seed in PartitionwiseSampledRDD
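A minimal sketch of why @transient matters here, assuming the usual Spark pattern where driver-only fields must be kept out of serialized task closures (the class and field names are illustrative):

```scala
// Hypothetical sketch: fields consulted only on the driver should be
// @transient so they are not serialized into tasks shipped to executors.
class ZippedSketch(@transient val prev: AnyRef, val startIndices: Array[Long])
  extends Serializable {
  // `prev` is used on the driver at construction time; after a task is
  // deserialized on an executor it reads back as null, which is fine
  // because only `startIndices` is needed there.
  def startOf(partition: Int): Long = startIndices(partition)
}
```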
…taFrame to R DataFrame (apache#578)

## What changes were proposed in this pull request?

This PR targets to support Arrow optimization for conversion from a Spark DataFrame to an R DataFrame. Like the PySpark side, it falls back to the non-optimization code path when it's unable to use Arrow optimization.

This can be tested as below:

```bash
$ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true
```

```r
collect(createDataFrame(mtcars))
```

### Requirements

- R 3.5.x
- Arrow package 0.12+

```bash
Rscript -e 'remotes::install_github("apache/arrow@apache-arrow-0.12.0", subdir = "r")'
```

**Note:** currently, the Arrow R package is not in CRAN. Please take a look at ARROW-3204.
**Note:** currently, the Arrow R package does not seem to support Windows. Please take a look at ARROW-3204.

### Benchmarks

**Shell**

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=false --driver-memory 4g
```

```bash
sync && sudo purge
./bin/sparkR --conf spark.sql.execution.arrow.enabled=true --driver-memory 4g
```

**R code**

```r
df <- cache(createDataFrame(read.csv("500000.csv")))
count(df)

test <- function() {
  options(digits.secs = 6) # milliseconds
  start.time <- Sys.time()
  collect(df)
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  print(time.taken)
}

test()
```

**Data (350 MB):**

```r
object.size(read.csv("500000.csv"))
350379504 bytes
```

"500000 Records" from http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/

**Results**

```
Time difference of 221.32014 secs
```

```
Time difference of 15.51145 secs
```

The performance improvement was around **1426%**.

### Limitations

- For now, Arrow optimization with R does not support the cases when the data is `raw` or when the user explicitly gives a float type in the schema. These produce corrupt values, so we decided to fall back to the non-optimization code path in those cases.
- Due to ARROW-4512, it cannot send and receive batch by batch. It has to send all batches in Arrow stream format at once. This needs improvement later.

## How was this patch tested?

Existing tests related to Arrow optimization cover this change. Also, manually tested.
1. Remove the deprecated role "install-docker-ce".
2. Apply the role "install-openjdk" instead of installing OpenJDK in run.yaml.
3. Add the "--r" option to the "make-distribution.sh" command in the minikube job, to avoid a Docker build error caused by the R code not being found in the Dockerfile.

Related-Bug: theopenlab/openlab#310
Related-Bug: theopenlab/openlab#308
…pache#577)" (apache#578) This reverts commit c996bb0.