Merge remote-tracking branch 'apache-github/master' into queryProgress
tdas committed Nov 29, 2016
2 parents b41a662 + 3c0beea commit d92f4bf
Showing 323 changed files with 6,160 additions and 3,033 deletions.
2 changes: 1 addition & 1 deletion .github/PULL_REQUEST_TEMPLATE
@@ -7,4 +7,4 @@
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.
Please review http://spark.apache.org/contributing.html before opening a pull request.
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -1,12 +1,12 @@
## Contributing to Spark

*Before opening a pull request*, review the
[Contributing to Spark wiki](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark).
[Contributing to Spark guide](http://spark.apache.org/contributing.html).
It lists steps that are required before creating a PR. In particular, consider:

- Is the change important and ready enough to ask the community to spend time reviewing?
- Have you searched for existing, related JIRAs and pull requests?
- Is this a new feature that can stand alone as a [third party project](https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects) ?
- Is this a new feature that can stand alone as a [third party project](http://spark.apache.org/third-party-projects.html) ?
- Is the change being proposed clearly explained and motivated?

When you contribute code, you affirm that the contribution is your original work and that you
2 changes: 1 addition & 1 deletion R/README.md
@@ -51,7 +51,7 @@ sparkR.session()

#### Making changes to SparkR

The [instructions](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) for making contributions to Spark also apply to SparkR.
The [instructions](http://spark.apache.org/contributing.html) for making contributions to Spark also apply to SparkR.
If you only make R file changes (i.e. no Scala changes) then you can just re-install the R package using `R/install-dev.sh` and test your changes.
Once you have made your changes, please include unit tests for them and run existing unit tests using the `R/run-tests.sh` script as described below.

2 changes: 1 addition & 1 deletion R/pkg/DESCRIPTION
@@ -11,7 +11,7 @@ Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"),
email = "[email protected]"),
person(family = "The Apache Software Foundation", role = c("aut", "cph")))
URL: http://www.apache.org/ http://spark.apache.org/
BugReports: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingBugReports
BugReports: http://spark.apache.org/contributing.html
Depends:
R (>= 3.0),
methods
6 changes: 4 additions & 2 deletions R/pkg/R/DataFrame.R
@@ -2541,7 +2541,8 @@ generateAliasesForIntersectedCols <- function (x, intersectedColNames, suffix) {
#'
#' Return a new SparkDataFrame containing the union of rows in this SparkDataFrame
#' and another SparkDataFrame. This is equivalent to \code{UNION ALL} in SQL.
#' Note that this does not remove duplicate rows across the two SparkDataFrames.
#'
#' Note: This does not remove duplicate rows across the two SparkDataFrames.
#'
#' @param x A SparkDataFrame
#' @param y A SparkDataFrame
@@ -2584,7 +2585,8 @@ setMethod("unionAll",
#' Union two or more SparkDataFrames
#'
#' Union two or more SparkDataFrames. This is equivalent to \code{UNION ALL} in SQL.
#' Note that this does not remove duplicate rows across the two SparkDataFrames.
#'
#' Note: This does not remove duplicate rows across the two SparkDataFrames.
#'
#' @param x a SparkDataFrame.
#' @param ... additional SparkDataFrame(s).
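A minimal SparkR sketch of the UNION ALL behaviour documented above (illustrative only, not part of this commit; it assumes a local SparkR session):

library(SparkR)
sparkR.session(master = "local[2]")

df1 <- createDataFrame(data.frame(id = c(1, 2), name = c("a", "b"), stringsAsFactors = FALSE))
df2 <- createDataFrame(data.frame(id = c(2, 3), name = c("b", "c"), stringsAsFactors = FALSE))

combined <- rbind(df1, df2)   # UNION ALL semantics: row (2, "b") appears twice
count(combined)               # 4, not 3
count(distinct(combined))     # 3 -- drop the duplicates explicitly if needed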
7 changes: 4 additions & 3 deletions R/pkg/R/functions.R
@@ -2296,7 +2296,7 @@ setMethod("n", signature(x = "Column"),
#' A pattern could be for instance \preformatted{dd.MM.yyyy} and could return a string like '18.03.1993'. All
#' pattern letters of \code{java.text.SimpleDateFormat} can be used.
#'
#' NOTE: Whenever possible, use specialized functions like \code{year}. These benefit from a
#' Note: Whenever possible, use specialized functions like \code{year}. These benefit from a
#' specialized implementation.
#'
#' @param y Column to compute on.
@@ -2341,7 +2341,7 @@ setMethod("from_utc_timestamp", signature(y = "Column", x = "character"),
#' Locate the position of the first occurrence of substr column in the given string.
#' Returns null if either of the arguments are null.
#'
#' NOTE: The position is not zero based, but 1 based index. Returns 0 if substr
#' Note: The position is not zero based, but 1 based index. Returns 0 if substr
#' could not be found in str.
#'
#' @param y column to check
@@ -2779,7 +2779,8 @@ setMethod("window", signature(x = "Column"),
#' locate
#'
#' Locate the position of the first occurrence of substr.
#' NOTE: The position is not zero based, but 1 based index. Returns 0 if substr
#'
#' Note: The position is not zero based, but 1 based index. Returns 0 if substr
#' could not be found in str.
#'
#' @param substr a character string to be matched.
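A short sketch of the 1-based instr/locate indexing and the date_format pattern behaviour described in these docs (illustrative only, not part of this commit; it assumes an active local SparkR session):

df <- createDataFrame(data.frame(s = c("apache spark", "sparkling"),
                                 d = c("2016-11-29", "1993-03-18"),
                                 stringsAsFactors = FALSE))

head(select(df, instr(df$s, "spark")))    # 8, 1  -- positions are counted from 1
head(select(df, instr(df$s, "flink")))    # 0, 0  -- substring not found
head(select(df, locate("spark", df$s)))   # same 1-based convention as instr
head(select(df, date_format(cast(df$d, "date"), "dd.MM.yyyy")))   # "29.11.2016", "18.03.1993"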
21 changes: 16 additions & 5 deletions R/pkg/R/mllib.R
@@ -278,8 +278,10 @@ setMethod("glm", signature(formula = "formula", family = "ANY", data = "SparkDat

#' @param object a fitted generalized linear model.
#' @return \code{summary} returns a summary object of the fitted model, a list of components
#' including at least the coefficients, null/residual deviance, null/residual degrees
#' of freedom, AIC and number of iterations IRLS takes.
#' including at least the coefficients matrix (which includes coefficients, standard error
#' of coefficients, t value and p value), null/residual deviance, null/residual degrees of
#' freedom, AIC and number of iterations IRLS takes. If there are collinear columns
#' in your data, the coefficients matrix only provides coefficients.
#'
#' @rdname spark.glm
#' @export
@@ -303,9 +305,18 @@ setMethod("summary", signature(object = "GeneralizedLinearRegressionModel"),
} else {
dataFrame(callJMethod(jobj, "rDevianceResiduals"))
}
coefficients <- matrix(coefficients, ncol = 4)
colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
rownames(coefficients) <- unlist(features)
# If the underlying WeightedLeastSquares uses the "normal" solver, we can provide
# coefficients, standard error of coefficients, t value and p value. Otherwise,
# the model is fitted by local "l-bfgs" and we can only provide coefficients.
if (length(features) == length(coefficients)) {
coefficients <- matrix(coefficients, ncol = 1)
colnames(coefficients) <- c("Estimate")
rownames(coefficients) <- unlist(features)
} else {
coefficients <- matrix(coefficients, ncol = 4)
colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
rownames(coefficients) <- unlist(features)
}
ans <- list(deviance.resid = deviance.resid, coefficients = coefficients,
dispersion = dispersion, null.deviance = null.deviance,
deviance = deviance, df.null = df.null, df.residual = df.residual,
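An illustrative sketch of the two summary() shapes described above (not part of this commit; it assumes a local SparkR session, and the collinear case mirrors the test added to test_mllib.R further down):

# Full-rank data: the "normal" solver is used, so the coefficients matrix has all
# four columns (Estimate, Std. Error, t value, Pr(>|t|)).
training <- suppressWarnings(createDataFrame(iris))
model <- spark.glm(training, Sepal_Length ~ Sepal_Width, family = "gaussian")
summary(model)$coefficients

# Collinear predictors: the fit falls back to l-bfgs and only the "Estimate"
# column is available.
collinear <- createDataFrame(data.frame(x1 = c(1, 2, 3, 4), x2 = c(2, 4, 6, 8), y = c(1, 2, 3, 4)))
summary(spark.glm(collinear, y ~ . - 1))$coefficients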
20 changes: 14 additions & 6 deletions R/pkg/R/sparkR.R
@@ -373,8 +373,13 @@ sparkR.session <- function(
overrideEnvs(sparkConfigMap, paramMap)
}

deployMode <- ""
if (exists("spark.submit.deployMode", envir = sparkConfigMap)) {
deployMode <- sparkConfigMap[["spark.submit.deployMode"]]
}

if (!exists(".sparkRjsc", envir = .sparkREnv)) {
retHome <- sparkCheckInstall(sparkHome, master)
retHome <- sparkCheckInstall(sparkHome, master, deployMode)
if (!is.null(retHome)) sparkHome <- retHome
sparkExecutorEnvMap <- new.env()
sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, sparkExecutorEnvMap,
@@ -550,24 +555,27 @@ processSparkPackages <- function(packages) {
#
# @param sparkHome directory to find Spark package.
# @param master the Spark master URL, used to check local or remote mode.
# @param deployMode whether to deploy your driver on the worker nodes (cluster)
# or locally as an external client (client).
# @return NULL if no need to update sparkHome, and new sparkHome otherwise.
sparkCheckInstall <- function(sparkHome, master) {
sparkCheckInstall <- function(sparkHome, master, deployMode) {
if (!isSparkRShell()) {
if (!is.na(file.info(sparkHome)$isdir)) {
msg <- paste0("Spark package found in SPARK_HOME: ", sparkHome)
message(msg)
NULL
} else {
if (!nzchar(master) || isMasterLocal(master)) {
msg <- paste0("Spark not found in SPARK_HOME: ",
sparkHome)
if (isMasterLocal(master)) {
msg <- paste0("Spark not found in SPARK_HOME: ", sparkHome)
message(msg)
packageLocalDir <- install.spark()
packageLocalDir
} else {
} else if (isClientMode(master) || deployMode == "client") {
msg <- paste0("Spark not found in SPARK_HOME: ",
sparkHome, "\n", installInstruction("remote"))
stop(msg)
} else {
NULL
}
}
} else {
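A hedged sketch of the new install-check branches (sparkCheckInstall is an internal helper, so the SparkR::: calls below are purely illustrative; they mirror the test added in test_sparkR.R further down and assume the session is not a SparkR shell):

# Local master and no usable SPARK_HOME: install.spark() downloads a local Spark build.
SparkR:::sparkCheckInstall(sparkHome = "", master = "local[2]", deployMode = "")

# Client mode against a remote cluster (either a "*-client" master string or
# spark.submit.deployMode=client) with no SPARK_HOME: stop() with install instructions.
SparkR:::sparkCheckInstall(sparkHome = "", master = "yarn-client", deployMode = "")
SparkR:::sparkCheckInstall(sparkHome = "", master = "", deployMode = "client")

# Cluster deploy mode: nothing to install on the driver side, returns NULL.
SparkR:::sparkCheckInstall(sparkHome = "", master = "", deployMode = "")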
4 changes: 4 additions & 0 deletions R/pkg/R/utils.R
@@ -777,6 +777,10 @@ isMasterLocal <- function(master) {
grepl("^local(\\[([0-9]+|\\*)\\])?$", master, perl = TRUE)
}

isClientMode <- function(master) {
grepl("([a-z]+)-client$", master, perl = TRUE)
}

isSparkRShell <- function() {
grepl(".*shell\\.R$", Sys.getenv("R_PROFILE_USER"), perl = TRUE)
}
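A quick self-contained check of what these two helpers accept (the definitions are copied from the hunk above so the sketch runs in plain R, without SparkR):

isMasterLocal <- function(master) grepl("^local(\\[([0-9]+|\\*)\\])?$", master, perl = TRUE)
isClientMode  <- function(master) grepl("([a-z]+)-client$", master, perl = TRUE)

isMasterLocal(c("local", "local[4]", "local[*]", "spark://host:7077"))
# TRUE TRUE TRUE FALSE
isClientMode(c("yarn-client", "mesos-client", "yarn-cluster", "local[2]"))
# TRUE TRUE FALSE FALSE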
9 changes: 9 additions & 0 deletions R/pkg/inst/tests/testthat/test_mllib.R
@@ -169,6 +169,15 @@ test_that("spark.glm summary", {
df <- suppressWarnings(createDataFrame(data))
regStats <- summary(spark.glm(df, b ~ a1 + a2, regParam = 1.0))
expect_equal(regStats$aic, 14.00976, tolerance = 1e-4) # 14.00976 is from summary() result

# Test spark.glm works on collinear data
A <- matrix(c(1, 2, 3, 4, 2, 4, 6, 8), 4, 2)
b <- c(1, 2, 3, 4)
data <- as.data.frame(cbind(A, b))
df <- createDataFrame(data)
stats <- summary(spark.glm(df, b ~ . - 1))
coefs <- unlist(stats$coefficients)
expect_true(all(abs(c(0.5, 0.25) - coefs) < 1e-4))
})

test_that("spark.glm save/load", {
46 changes: 46 additions & 0 deletions R/pkg/inst/tests/testthat/test_sparkR.R
@@ -0,0 +1,46 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

context("functions in sparkR.R")

test_that("sparkCheckInstall", {
# "local, yarn-client, mesos-client" mode, SPARK_HOME was set correctly,
# and the SparkR job was submitted by "spark-submit"
sparkHome <- paste0(tempdir(), "/", "sparkHome")
dir.create(sparkHome)
master <- ""
deployMode <- ""
expect_true(is.null(sparkCheckInstall(sparkHome, master, deployMode)))
unlink(sparkHome, recursive = TRUE)

# "yarn-cluster, mesos-cluster" mode, SPARK_HOME was not set,
# and the SparkR job was submitted by "spark-submit"
sparkHome <- ""
master <- ""
deployMode <- ""
expect_true(is.null(sparkCheckInstall(sparkHome, master, deployMode)))

# "yarn-client, mesos-client" mode, SPARK_HOME was not set
sparkHome <- ""
master <- "yarn-client"
deployMode <- ""
expect_error(sparkCheckInstall(sparkHome, master, deployMode))
sparkHome <- ""
master <- ""
deployMode <- "client"
expect_error(sparkCheckInstall(sparkHome, master, deployMode))
})
2 changes: 1 addition & 1 deletion R/pkg/inst/tests/testthat/test_sparkSQL.R
@@ -2684,7 +2684,7 @@ test_that("Call DataFrameWriter.load() API in Java without path and check argume
# It makes sure that we can omit path argument in read.df API and then it calls
# DataFrameWriter.load() without path.
expect_error(read.df(source = "json"),
paste("Error in loadDF : analysis error - Unable to infer schema for JSON at .",
paste("Error in loadDF : analysis error - Unable to infer schema for JSON.",
"It must be specified manually"))
expect_error(read.df("arbitrary_path"), "Error in loadDF : analysis error - Path does not exist")
expect_error(read.json("arbitrary_path"), "Error in json : analysis error - Path does not exist")
11 changes: 6 additions & 5 deletions README.md
@@ -29,8 +29,9 @@ To build Spark and its example programs, run:
You can build Spark using more than one thread by using the -T option with Maven, see ["Parallel builds in Maven 3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3).
More detailed documentation is available from the project site, at
["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html).
For developing Spark using an IDE, see [Eclipse](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse)
and [IntelliJ](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ).

For general development tips, including info on developing Spark using an IDE, see
["Useful Developer Tools" page](http://spark.apache.org/developer-tools.html).

## Interactive Scala Shell

@@ -80,7 +81,7 @@ can be run using:
./dev/run-tests

Please see the guidance on how to
[run tests for a module, or individual tests](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools).
[run tests for a module, or individual tests](http://spark.apache.org/developer-tools.html#individual-tests).

## A Note About Hadoop Versions

@@ -100,5 +101,5 @@ in the online documentation for an overview on how to configure Spark.

## Contributing

Please review the [Contribution to Spark](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark)
wiki for information on how to get started contributing to the project.
Please review the [Contribution to Spark guide](http://spark.apache.org/contributing.html)
for information on how to get started contributing to the project.
core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java
@@ -378,14 +378,14 @@ public long cleanUpAllAllocatedMemory() {
for (MemoryConsumer c: consumers) {
if (c != null && c.getUsed() > 0) {
// In case of failed task, it's normal to see leaked memory
logger.warn("leak " + Utils.bytesToString(c.getUsed()) + " memory from " + c);
logger.debug("unreleased " + Utils.bytesToString(c.getUsed()) + " memory from " + c);
}
}
consumers.clear();

for (MemoryBlock page : pageTable) {
if (page != null) {
logger.warn("leak a page: " + page + " in task " + taskAttemptId);
logger.debug("unreleased page: " + page + " in task " + taskAttemptId);
memoryManager.tungstenMemoryAllocator().free(page);
}
}
4 changes: 2 additions & 2 deletions core/src/main/scala/org/apache/spark/SSLOptions.scala
@@ -150,8 +150,8 @@ private[spark] object SSLOptions extends Logging {
* $ - `[ns].enabledAlgorithms` - a comma separated list of ciphers
*
* For a list of protocols and ciphers supported by particular Java versions, you may go to
* [[https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https Oracle
* blog page]].
* <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">
* Oracle blog page</a>.
*
* You can optionally specify the default configuration. If you do, for each setting which is
* missing in SparkConf, the corresponding setting is used from the default configuration.
23 changes: 5 additions & 18 deletions core/src/main/scala/org/apache/spark/SecurityManager.scala
@@ -21,7 +21,6 @@ import java.lang.{Byte => JByte}
import java.net.{Authenticator, PasswordAuthentication}
import java.security.{KeyStore, SecureRandom}
import java.security.cert.X509Certificate
import javax.crypto.KeyGenerator
import javax.net.ssl._

import com.google.common.hash.HashCodes
@@ -33,7 +32,6 @@ import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.internal.Logging
import org.apache.spark.internal.config._
import org.apache.spark.network.sasl.SecretKeyHolder
import org.apache.spark.security.CryptoStreamUtils._
import org.apache.spark.util.Utils

/**
@@ -185,7 +183,9 @@ import org.apache.spark.util.Utils
* setting `spark.ssl.useNodeLocalConf` to `true`.
*/

private[spark] class SecurityManager(sparkConf: SparkConf)
private[spark] class SecurityManager(
sparkConf: SparkConf,
ioEncryptionKey: Option[Array[Byte]] = None)
extends Logging with SecretKeyHolder {

import SecurityManager._
@@ -415,6 +415,8 @@ private[spark] class SecurityManager(sparkConf: SparkConf)
logInfo("Changing acls enabled to: " + aclsOn)
}

def getIOEncryptionKey(): Option[Array[Byte]] = ioEncryptionKey

/**
* Generates or looks up the secret key.
*
@@ -559,19 +561,4 @@ private[spark] object SecurityManager {
// key used to store the spark secret in the Hadoop UGI
val SECRET_LOOKUP_KEY = "sparkCookie"

/**
* Setup the cryptographic key used by IO encryption in credentials. The key is generated using
* [[KeyGenerator]]. The algorithm and key length is specified by the [[SparkConf]].
*/
def initIOEncryptionKey(conf: SparkConf, credentials: Credentials): Unit = {
if (credentials.getSecretKey(SPARK_IO_TOKEN) == null) {
val keyLen = conf.get(IO_ENCRYPTION_KEY_SIZE_BITS)
val ioKeyGenAlgorithm = conf.get(IO_ENCRYPTION_KEYGEN_ALGORITHM)
val keyGen = KeyGenerator.getInstance(ioKeyGenAlgorithm)
keyGen.init(keyLen)

val ioKey = keyGen.generateKey()
credentials.addSecretKey(SPARK_IO_TOKEN, ioKey.getEncoded)
}
}
}
4 changes: 0 additions & 4 deletions core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -422,10 +422,6 @@ class SparkContext(config: SparkConf) extends Logging {
}

if (master == "yarn" && deployMode == "client") System.setProperty("SPARK_YARN_MODE", "true")
if (_conf.get(IO_ENCRYPTION_ENABLED) && !SparkHadoopUtil.get.isYarnMode()) {
throw new SparkException("IO encryption is only supported in YARN mode, please disable it " +
s"by setting ${IO_ENCRYPTION_ENABLED.key} to false")
}

// "_jobProgressListener" should be set up before creating SparkEnv because when creating
// "SparkEnv", some messages will be posted to "listenerBus" and we should not miss them.