Merge pull request apache#476 from palantir/rk/more-merge
Merge from upstream
robert3005 authored Feb 1, 2019
2 parents a51fa9c + a23eb35 commit 6f62a0d
Showing 771 changed files with 27,757 additions and 19,523 deletions.
21 changes: 8 additions & 13 deletions .circleci/config.yml
@@ -2,7 +2,7 @@ version: 2

defaults: &defaults
docker:
- image: palantirtechnologies/circle-spark-base:0.1.0
- image: palantirtechnologies/circle-spark-base:0.1.3
resource_class: xlarge
environment: &defaults-environment
TERM: dumb
@@ -128,7 +128,7 @@ jobs:
<<: *defaults
# Some part of the maven setup fails if there's no R, so we need to use the R image here
docker:
- image: palantirtechnologies/circle-spark-r:0.1.0
- image: palantirtechnologies/circle-spark-r:0.1.3
steps:
# Saves us from recompiling every time...
- restore_cache:
@@ -147,12 +147,7 @@ jobs:
keys:
- build-binaries-{{ checksum "build/mvn" }}-{{ checksum "build/sbt" }}
- build-binaries-
- run: |
./build/mvn -T1C -DskipTests -Phadoop-cloud -Phadoop-palantir -Pkinesis-asl -Pkubernetes -Pyarn -Psparkr install \
| tee -a "/tmp/mvn-install.log"
- store_artifacts:
path: /tmp/mvn-install.log
destination: mvn-install.log
- run: ./build/mvn -DskipTests -Phadoop-cloud -Phadoop-palantir -Pkinesis-asl -Pkubernetes -Pyarn -Psparkr install
# Get sbt to run trivially, ensures its launcher is downloaded under build/
- run: ./build/sbt -h || true
- save_cache:
@@ -300,7 +295,7 @@ jobs:
# depends on build-sbt, but we only need the assembly jars
<<: *defaults
docker:
- image: palantirtechnologies/circle-spark-python:0.1.0
- image: palantirtechnologies/circle-spark-python:0.1.3
parallelism: 2
steps:
- *checkout-code
@@ -325,7 +320,7 @@ jobs:
# depends on build-sbt, but we only need the assembly jars
<<: *defaults
docker:
- image: palantirtechnologies/circle-spark-r:0.1.0
- image: palantirtechnologies/circle-spark-r:0.1.3
steps:
- *checkout-code
- attach_workspace:
@@ -438,7 +433,7 @@ jobs:
<<: *defaults
# Some part of the maven setup fails if there's no R, so we need to use the R image here
docker:
- image: palantirtechnologies/circle-spark-r:0.1.0
- image: palantirtechnologies/circle-spark-r:0.1.3
steps:
- *checkout-code
- restore_cache:
@@ -458,7 +453,7 @@ jobs:
deploy-gradle:
<<: *defaults
docker:
- image: palantirtechnologies/circle-spark-r:0.1.0
- image: palantirtechnologies/circle-spark-r:0.1.3
steps:
- *checkout-code
- *restore-gradle-wrapper-cache
@@ -470,7 +465,7 @@ jobs:
<<: *defaults
# Some part of the maven setup fails if there's no R, so we need to use the R image here
docker:
- image: palantirtechnologies/circle-spark-r:0.1.0
- image: palantirtechnologies/circle-spark-r:0.1.3
steps:
# This cache contains the whole project after version was set and mvn package was called
# Restoring first (and instead of checkout) as mvn versions:set mutates real source code...
1 change: 0 additions & 1 deletion .gitignore
@@ -81,7 +81,6 @@ work/
.credentials
dev/pr-deps
docs/.jekyll-metadata
*.crc

# For Hive
TempStatsStore/
2 changes: 2 additions & 0 deletions FORK.md
@@ -29,3 +29,5 @@
# Reverted
* [SPARK-25908](https://issues.apache.org/jira/browse/SPARK-25908) - Removal of `monotonicall_increasing_id`, `toDegree`, `toRadians`, `approxCountDistinct`, `unionAll`
* [SPARK-25862](https://issues.apache.org/jira/browse/SPARK-25862) - Removal of `unboundedPreceding`, `unboundedFollowing`, `currentRow`
* [SPARK-26127](https://issues.apache.org/jira/browse/SPARK-26127) - Removal of deprecated setters from tree regression and classification models
* [SPARK-25867](https://issues.apache.org/jira/browse/SPARK-25867) - Removal of KMeans computeCost
2 changes: 1 addition & 1 deletion R/WINDOWS.md
@@ -3,7 +3,7 @@
To build SparkR on Windows, the following steps are required

1. Install R (>= 3.1) and [Rtools](http://cran.r-project.org/bin/windows/Rtools/). Make sure to
include Rtools and R in `PATH`.
include Rtools and R in `PATH`. Note that support for R prior to version 3.4 is deprecated as of Spark 3.0.0.

2. Install
[JDK8](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html) and set
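For reference, a minimal R sketch of the version check implied by the note above (illustrative only, not part of this commit); it mirrors the deprecation warning added to the R startup profiles later in this diff:

```r
# Illustrative only, not part of this commit: verify the local R version against
# the R (>= 3.1) requirement and the R < 3.4 deprecation noted above.
r_version <- paste0(R.version$major, ".", R.version$minor)
if (utils::compareVersion(r_version, "3.4.0") == -1) {
  warning("Support for R prior to version 3.4 is deprecated since Spark 3.0.0")
}
```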
2 changes: 1 addition & 1 deletion R/pkg/DESCRIPTION
@@ -15,7 +15,7 @@ URL: http://www.apache.org/ http://spark.apache.org/
BugReports: http://spark.apache.org/contributing.html
SystemRequirements: Java (== 8)
Depends:
R (>= 3.0),
R (>= 3.1),
methods
Suggests:
knitr,
8 changes: 8 additions & 0 deletions R/pkg/R/DataFrame.R
@@ -767,6 +767,14 @@ setMethod("repartition",
#' using \code{spark.sql.shuffle.partitions} as number of partitions.}
#'}
#'
#' At least one partition-by expression must be specified.
#' When no explicit sort order is specified, "ascending nulls first" is assumed.
#'
#' Note that due to performance reasons this method uses sampling to estimate the ranges.
#' Hence, the output may not be consistent, since sampling can return different values.
#' The sample size can be controlled by the config
#' \code{spark.sql.execution.rangeExchange.sampleSizePerPartition}.
#'
#' @param x a SparkDataFrame.
#' @param numPartitions the number of partitions to use.
#' @param col the column by which the range partitioning will be performed.
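For reference, a minimal SparkR sketch exercising the range-repartitioning behavior documented in the hunk above (illustrative only, not part of this commit; the data frame and column are made up):

```r
# Illustrative only, not part of this commit: exercise the range-partitioning
# behavior documented above, assuming an active SparkSession.
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(value = 1:1000))

# Partition by a column expression; with no explicit sort order,
# "ascending nulls first" is assumed. Because partition ranges are estimated
# by sampling (see spark.sql.execution.rangeExchange.sampleSizePerPartition),
# the exact boundaries may differ between runs.
parted <- repartitionByRange(df, numPartitions = 4L, col = df$value)
head(parted)
```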
2 changes: 1 addition & 1 deletion R/pkg/R/functions.R
@@ -3370,7 +3370,7 @@ setMethod("flatten",
#'
#' @rdname column_collection_functions
#' @aliases map_entries map_entries,Column-method
#' @note map_entries since 2.4.0
#' @note map_entries since 3.0.0
setMethod("map_entries",
signature(x = "Column"),
function(x) {
4 changes: 2 additions & 2 deletions R/pkg/R/stats.R
@@ -109,7 +109,7 @@ setMethod("corr",
#'
#' Finding frequent items for columns, possibly with false positives.
#' Using the frequent element count algorithm described in
#' \url{http://dx.doi.org/10.1145/762471.762473}, proposed by Karp, Schenker, and Papadimitriou.
#' \url{https://doi.org/10.1145/762471.762473}, proposed by Karp, Schenker, and Papadimitriou.
#'
#' @param x A SparkDataFrame.
#' @param cols A vector column names to search frequent items in.
@@ -143,7 +143,7 @@ setMethod("freqItems", signature(x = "SparkDataFrame", cols = "character"),
#' *exact* rank of x is close to (p * N). More precisely,
#' floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
#' This method implements a variation of the Greenwald-Khanna algorithm (with some speed
#' optimizations). The algorithm was first present in [[http://dx.doi.org/10.1145/375663.375670
#' optimizations). The algorithm was first present in [[https://doi.org/10.1145/375663.375670
#' Space-efficient Online Computation of Quantile Summaries]] by Greenwald and Khanna.
#' Note that NA values will be ignored in numerical columns before calculation. For
#' columns only containing NA values, an empty list is returned.
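For reference, a minimal SparkR sketch of the two statistics functions whose documentation is touched above (illustrative only, not part of this commit; the data values are made up):

```r
# Illustrative only, not part of this commit: call the functions documented
# above, assuming an active SparkSession.
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(x = c(rep(1, 60), rep(2, 30), seq_len(10) + 2)))

# Frequent items: the algorithm may report false positives.
freqItems(df, "x", support = 0.25)

# Approximate quantiles: for each probability p and relativeError err, the rank
# of the returned value lies in floor((p - err) * N) .. ceil((p + err) * N).
approxQuantile(df, "x", probabilities = c(0.25, 0.5, 0.75), relativeError = 0.01)
```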
4 changes: 4 additions & 0 deletions R/pkg/inst/profile/general.R
@@ -16,6 +16,10 @@
#

.First <- function() {
if (utils::compareVersion(paste0(R.version$major, ".", R.version$minor), "3.4.0") == -1) {
warning("Support for R prior to version 3.4 is deprecated since Spark 3.0.0")
}

packageDir <- Sys.getenv("SPARKR_PACKAGE_DIR")
dirs <- strsplit(packageDir, ",")[[1]]
.libPaths(c(dirs, .libPaths()))
4 changes: 4 additions & 0 deletions R/pkg/inst/profile/shell.R
@@ -16,6 +16,10 @@
#

.First <- function() {
if (utils::compareVersion(paste0(R.version$major, ".", R.version$minor), "3.4.0") == -1) {
warning("Support for R prior to version 3.4 is deprecated since Spark 3.0.0")
}

home <- Sys.getenv("SPARK_HOME")
.libPaths(c(file.path(home, "R", "lib"), .libPaths()))
Sys.setenv(NOAWT = 1)
2 changes: 1 addition & 1 deletion R/pkg/tests/fulltests/test_sparkSQL.R
@@ -1674,7 +1674,7 @@ test_that("column functions", {

# check for unparseable
df <- as.DataFrame(list(list("a" = "")))
expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]], NA)
expect_equal(collect(select(df, from_json(df$a, schema)))[[1]][[1]]$a, NA)

# check if array type in string is correctly supported.
jsonArr <- "[{\"name\":\"Bob\"}, {\"name\":\"Alice\"}]"
1 change: 1 addition & 0 deletions R/pkg/tests/fulltests/test_streaming.R
@@ -127,6 +127,7 @@ test_that("Specify a schema by using a DDL-formatted string when reading", {
expect_false(awaitTermination(q, 5 * 1000))
callJMethod(q@ssq, "processAllAvailable")
expect_equal(head(sql("SELECT count(*) FROM people3"))[[1]], 3)
stopQuery(q)

expect_error(read.stream(path = parquetPath, schema = "name stri"),
"DataType stri is not supported.")
14 changes: 14 additions & 0 deletions R/pkg/vignettes/sparkr-vignettes.Rmd
@@ -57,6 +57,20 @@ First, let's load and attach the package.
library(SparkR)
```

```{r, include=FALSE}
# disable eval if java version not supported
override_eval <- tryCatch(!is.numeric(SparkR:::checkJavaVersion()),
error = function(e) { TRUE },
warning = function(e) { TRUE })
if (override_eval) {
opts_hooks$set(eval = function(options) {
options$eval = FALSE
options
})
}
```

`SparkSession` is the entry point into SparkR which connects your R program to a Spark cluster. You can create a `SparkSession` using `sparkR.session` and pass in options such as the application name, any Spark packages depended on, etc.

We use default settings in which it runs in local mode. It auto downloads Spark package in the background if no previous installation is found. For more details about setup, see [Spark Session](#SetupSparkSession).
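For reference, a minimal sparkR.session call of the kind described in the vignette text above (illustrative only, not part of this commit; the master, application name, and memory setting are example values):

```r
# Illustrative only, not part of this commit: start a local SparkSession with a
# few example options, as described in the vignette text above.
library(SparkR)
sparkR.session(
  master = "local[*]",                  # example: run locally with all cores
  appName = "sparkr-vignette-example",  # hypothetical application name
  sparkConfig = list(spark.driver.memory = "2g")
)
```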
2 changes: 1 addition & 1 deletion assembly/README
@@ -9,4 +9,4 @@ This module is off by default. To activate it specify the profile in the command

If you need to build an assembly for a different version of Hadoop the
hadoop-version system property needs to be set as in this example:
-Dhadoop.version=2.7.3
-Dhadoop.version=2.7.4
6 changes: 3 additions & 3 deletions assembly/pom.xml
@@ -20,12 +20,12 @@
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.11</artifactId>
<artifactId>spark-parent_2.12</artifactId>
<version>3.0.0-SNAPSHOT</version>
<relativePath>../pom.xml</relativePath>
</parent>

<artifactId>spark-assembly_2.11</artifactId>
<artifactId>spark-assembly_2.12</artifactId>
<name>Spark Project Assembly</name>
<url>http://spark.apache.org/</url>
<packaging>pom</packaging>
@@ -76,7 +76,7 @@
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_2.11</artifactId>
<artifactId>spark-avro_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
