[SPARK-16508][SPARKR] doc updates and more CRAN check fixes #14734

Closed
wants to merge 6 commits
6 changes: 5 additions & 1 deletion R/pkg/NAMESPACE
@@ -1,5 +1,9 @@
# Imports from base R
importFrom(methods, setGeneric, setMethod, setOldClass)
# Do not include stats:: "rpois", "runif" - causes error at runtime
importFrom("methods", "setGeneric", "setMethod", "setOldClass")
importFrom("methods", "is", "new", "signature", "show")
Contributor:
Do these things show up as CRAN warnings? I don't see them on my machine.

Contributor:
I was wondering about this part as well.

@felixcheung (Member, Author), Aug 22, 2016:
Hmm, I'm not sure why, but what I see is a much longer list.

I'm still getting the same, longer output after upgrading to R 3.3.1.

Contributor:
It might have to do with the R version used. I am using R 3.2.1 on my machine, while this is from R 3.3.0; using a later R version is obviously better.

@felixcheung (Member, Author), Aug 22, 2016:
Possibly, that was my first thought. But Jenkins is running R 3.3.1 I think?

Oops, 3.1.1, so older.
log

importFrom("stats", "gaussian", "setNames")
importFrom("utils", "download.file", "packageVersion", "untar")

# Disable native libraries till we figure out how to package it
# See SPARKR-7839
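
For context on the review thread above, a minimal sketch of the two ways package code can satisfy R CMD check for symbols that live in base packages (illustrative only, not part of this diff):

# Option 1: declare the import once in NAMESPACE, as this file now does for the
#           methods/stats/utils helpers, e.g. importFrom("stats", "gaussian", "setNames").
# Option 2: fully qualify the call at the use site, which is what RDD.R does below,
#           avoiding a NAMESPACE import of "rpois"/"runif" that caused errors at runtime.
x <- stats::runif(1)       # instead of a bare runif(1)
n <- stats::rpois(1, 2.0)  # instead of a bare rpois(1, 2.0)
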
71 changes: 35 additions & 36 deletions R/pkg/R/DataFrame.R
@@ -150,7 +150,7 @@ setMethod("explain",

#' isLocal
#'
#' Returns True if the `collect` and `take` methods can be run locally
#' Returns True if the \code{collect} and \code{take} methods can be run locally
#' (without any Spark executors).
#'
#' @param x A SparkDataFrame
@@ -182,7 +182,7 @@ setMethod("isLocal",
#' @param numRows the number of rows to print. Defaults to 20.
#' @param truncate whether truncate long strings. If \code{TRUE}, strings more than
#' 20 characters will be truncated. However, if set greater than zero,
#' truncates strings longer than `truncate` characters and all cells
#' truncates strings longer than \code{truncate} characters and all cells
#' will be aligned right.
#' @param ... further arguments to be passed to or from other methods.
#' @family SparkDataFrame functions
@@ -642,10 +642,10 @@ setMethod("unpersist",
#' The following options for repartition are possible:
#' \itemize{
#' \item{1.} {Return a new SparkDataFrame partitioned by
#' the given columns into `numPartitions`.}
#' \item{2.} {Return a new SparkDataFrame that has exactly `numPartitions`.}
#' the given columns into \code{numPartitions}.}
#' \item{2.} {Return a new SparkDataFrame that has exactly \code{numPartitions}.}
#' \item{3.} {Return a new SparkDataFrame partitioned by the given column(s),
#' using `spark.sql.shuffle.partitions` as number of partitions.}
#' using \code{spark.sql.shuffle.partitions} as number of partitions.}
#'}
#' @param x a SparkDataFrame.
#' @param numPartitions the number of partitions to use.
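
For reference, a hedged usage sketch of the three repartition forms documented above (df is assumed to be an existing SparkDataFrame; the column name is illustrative):

df1 <- repartition(df, 3, col = df$name)      # form 1: by column(s) into 3 partitions
df2 <- repartition(df, numPartitions = 10)    # form 2: exactly 10 partitions
df3 <- repartition(df, col = df$name)         # form 3: by column(s), using spark.sql.shuffle.partitions
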
@@ -1132,9 +1132,8 @@ setMethod("take",

#' Head
#'
#' Return the first NUM rows of a SparkDataFrame as a R data.frame. If NUM is NULL,
#' then head() returns the first 6 rows in keeping with the current data.frame
#' convention in R.
#' Return the first \code{num} rows of a SparkDataFrame as a R data.frame. If \code{num} is not
#' specified, then head() returns the first 6 rows as with R data.frame.
#'
#' @param x a SparkDataFrame.
#' @param num the number of rows to return. Default is 6.
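
For reference, a hedged usage sketch of the head()/take() behavior documented above (assumes an active SparkSession; faithful is a base R dataset):

df <- createDataFrame(faithful)
head(df)       # first 6 rows, matching the R data.frame convention
head(df, 3)    # first 3 rows
take(df, 3)    # take() always requires an explicit row count
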
@@ -1406,11 +1405,11 @@ setMethod("dapplyCollect",
#'
#' @param cols grouping columns.
#' @param func a function to be applied to each group partition specified by grouping
#' column of the SparkDataFrame. The function `func` takes as argument
#' column of the SparkDataFrame. The function \code{func} takes as argument
#' a key - grouping columns and a data frame - a local R data.frame.
#' The output of `func` is a local R data.frame.
#' The output of \code{func} is a local R data.frame.
#' @param schema the schema of the resulting SparkDataFrame after the function is applied.
#' The schema must match to output of `func`. It has to be defined for each
#' The schema must match to output of \code{func}. It has to be defined for each
#' output column with preferred output column name and corresponding data type.
#' @return A SparkDataFrame.
#' @family SparkDataFrame functions
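
For reference, a hedged sketch of gapply with an explicit result schema, per the parameters documented above (assumes an active SparkSession; mtcars is a base R dataset):

df <- createDataFrame(mtcars)
schema <- structType(structField("cyl", "double"),
                     structField("avg_mpg", "double"))
result <- gapply(df, "cyl",
                 function(key, x) { data.frame(key, mean(x$mpg)) },
                 schema)
collect(result)
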
@@ -1497,9 +1496,9 @@ setMethod("gapply",
#'
#' @param cols grouping columns.
#' @param func a function to be applied to each group partition specified by grouping
#' column of the SparkDataFrame. The function `func` takes as argument
#' column of the SparkDataFrame. The function \code{func} takes as argument
#' a key - grouping columns and a data frame - a local R data.frame.
#' The output of `func` is a local R data.frame.
#' The output of \code{func} is a local R data.frame.
#' @return A data.frame.
#' @family SparkDataFrame functions
#' @aliases gapplyCollect,SparkDataFrame-method
@@ -1657,7 +1656,7 @@ setMethod("$", signature(x = "SparkDataFrame"),
getColumn(x, name)
})

#' @param value a Column or NULL. If NULL, the specified Column is dropped.
#' @param value a Column or \code{NULL}. If \code{NULL}, the specified Column is dropped.
#' @rdname select
#' @name $<-
#' @aliases $<-,SparkDataFrame-method
@@ -1747,7 +1746,7 @@ setMethod("[", signature(x = "SparkDataFrame"),
#' @family subsetting functions
#' @examples
#' \dontrun{
#' # Columns can be selected using `[[` and `[`
#' # Columns can be selected using [[ and [
#' df[[2]] == df[["age"]]
#' df[,2] == df[,"age"]
#' df[,c("name", "age")]
@@ -1792,7 +1791,7 @@ setMethod("subset", signature(x = "SparkDataFrame"),
#' select(df, df$name, df$age + 1)
#' select(df, c("col1", "col2"))
#' select(df, list(df$name, df$age + 1))
#' # Similar to R data frames columns can also be selected using `$`
#' # Similar to R data frames columns can also be selected using $
#' df[,df$age]
#' }
#' @note select(SparkDataFrame, character) since 1.4.0
@@ -2443,7 +2442,7 @@ generateAliasesForIntersectedCols <- function (x, intersectedColNames, suffix) {
#' Return a new SparkDataFrame containing the union of rows
#'
#' Return a new SparkDataFrame containing the union of rows in this SparkDataFrame
#' and another SparkDataFrame. This is equivalent to `UNION ALL` in SQL.
#' and another SparkDataFrame. This is equivalent to \code{UNION ALL} in SQL.
#' Note that this does not remove duplicate rows across the two SparkDataFrames.
#'
#' @param x A SparkDataFrame
@@ -2486,7 +2485,7 @@ setMethod("unionAll",

#' Union two or more SparkDataFrames
#'
#' Union two or more SparkDataFrames. This is equivalent to `UNION ALL` in SQL.
#' Union two or more SparkDataFrames. This is equivalent to \code{UNION ALL} in SQL.
#' Note that this does not remove duplicate rows across the two SparkDataFrames.
#'
#' @param x a SparkDataFrame.
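
For reference, a hedged sketch of the UNION ALL semantics described above (df1, df2, df3 are assumed SparkDataFrames with matching schemas):

u1 <- unionAll(df1, df2)     # two-DataFrame form; duplicate rows are kept
u2 <- rbind(df1, df2, df3)   # union of two or more SparkDataFrames
deduped <- distinct(u2)      # drop duplicate rows explicitly if needed
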
@@ -2519,7 +2518,7 @@ setMethod("rbind",
#' Intersect
#'
#' Return a new SparkDataFrame containing rows only in both this SparkDataFrame
#' and another SparkDataFrame. This is equivalent to `INTERSECT` in SQL.
#' and another SparkDataFrame. This is equivalent to \code{INTERSECT} in SQL.
#'
#' @param x A SparkDataFrame
#' @param y A SparkDataFrame
@@ -2547,7 +2546,7 @@ setMethod("intersect",
#' except
#'
#' Return a new SparkDataFrame containing rows in this SparkDataFrame
#' but not in another SparkDataFrame. This is equivalent to `EXCEPT` in SQL.
#' but not in another SparkDataFrame. This is equivalent to \code{EXCEPT} in SQL.
#'
#' @param x a SparkDataFrame.
#' @param y a SparkDataFrame.
@@ -2576,8 +2575,8 @@ setMethod("except",

#' Save the contents of SparkDataFrame to a data source.
#'
#' The data source is specified by the `source` and a set of options (...).
#' If `source` is not specified, the default data source configured by
#' The data source is specified by the \code{source} and a set of options (...).
#' If \code{source} is not specified, the default data source configured by
#' spark.sql.sources.default will be used.
#'
#' Additionally, mode is used to specify the behavior of the save operation when data already
@@ -2613,7 +2612,7 @@ setMethod("except",
#' @note write.df since 1.4.0
setMethod("write.df",
signature(df = "SparkDataFrame", path = "character"),
function(df, path, source = NULL, mode = "error", ...){
function(df, path, source = NULL, mode = "error", ...) {
if (is.null(source)) {
source <- getDefaultSqlSource()
}
@@ -2635,14 +2634,14 @@ setMethod("write.df",
#' @note saveDF since 1.4.0
setMethod("saveDF",
signature(df = "SparkDataFrame", path = "character"),
function(df, path, source = NULL, mode = "error", ...){
function(df, path, source = NULL, mode = "error", ...) {
write.df(df, path, source, mode, ...)
})
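
For reference, a hedged usage sketch of the save API above (paths and sources are examples only):

write.df(df, path = "people.parquet", source = "parquet", mode = "overwrite")
saveDF(df, path = "people.json", source = "json")   # same signature; mode defaults to "error"
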

#' Save the contents of the SparkDataFrame to a data source as a table
#'
#' The data source is specified by the `source` and a set of options (...).
#' If `source` is not specified, the default data source configured by
#' The data source is specified by the \code{source} and a set of options (...).
#' If \code{source} is not specified, the default data source configured by
#' spark.sql.sources.default will be used.
#'
#' Additionally, mode is used to specify the behavior of the save operation when
@@ -2675,7 +2674,7 @@ setMethod("saveDF",
#' @note saveAsTable since 1.4.0
setMethod("saveAsTable",
signature(df = "SparkDataFrame", tableName = "character"),
function(df, tableName, source = NULL, mode="error", ...){
function(df, tableName, source = NULL, mode="error", ...) {
if (is.null(source)) {
source <- getDefaultSqlSource()
}
@@ -2752,11 +2751,11 @@ setMethod("summary",
#' @param how "any" or "all".
#' if "any", drop a row if it contains any nulls.
#' if "all", drop a row only if all its values are null.
#' if minNonNulls is specified, how is ignored.
#' if \code{minNonNulls} is specified, how is ignored.
#' @param minNonNulls if specified, drop rows that have less than
#' minNonNulls non-null values.
#' \code{minNonNulls} non-null values.
#' This overwrites the how parameter.
#' @param cols optional list of column names to consider. In `fillna`,
#' @param cols optional list of column names to consider. In \code{fillna},
#' columns specified in cols that do not have matching data
#' type are ignored. For example, if value is a character, and
#' subset contains a non-character column, then the non-character
@@ -2879,8 +2878,8 @@ setMethod("fillna",
#' in your system to accommodate the contents.
#'
#' @param x a SparkDataFrame.
#' @param row.names NULL or a character vector giving the row names for the data frame.
#' @param optional If `TRUE`, converting column names is optional.
#' @param row.names \code{NULL} or a character vector giving the row names for the data frame.
#' @param optional If \code{TRUE}, converting column names is optional.
#' @param ... additional arguments to pass to base::as.data.frame.
#' @return A data.frame.
#' @family SparkDataFrame functions
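
For reference, a hedged sketch of collecting to a local R data.frame, per the caveat above about fitting the contents in driver memory:

local_df <- as.data.frame(df)   # pulls every row to the driver
str(local_df)
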
@@ -3058,7 +3057,7 @@ setMethod("str",
#' @note drop since 2.0.0
setMethod("drop",
signature(x = "SparkDataFrame"),
function(x, col, ...) {
function(x, col) {
stopifnot(class(col) == "character" || class(col) == "Column")
Contributor:
Just to clarify: removing ... is intentional? Just wondering, as we have the @param documentation above.

Contributor:
This actually follows from the discussion in #14705. A summary may be seen at #14735 (comment)

Contributor:
Thanks, that sounds good.

Member Author:
Right; in fact, this one was added in #14705, which we missed and which shouldn't have been added.


if (class(col) == "Column") {
@@ -3218,8 +3217,8 @@ setMethod("histogram",
#' and to not change the existing data.
#' }
#'
#' @param x s SparkDataFrame.
#' @param url JDBC database url of the form `jdbc:subprotocol:subname`.
#' @param x a SparkDataFrame.
#' @param url JDBC database url of the form \code{jdbc:subprotocol:subname}.
#' @param tableName yhe name of the table in the external database.
#' @param mode one of 'append', 'overwrite', 'error', 'ignore' save mode (it is 'error' by default).
#' @param ... additional JDBC database connection properties.
@@ -3237,7 +3236,7 @@ setMethod("histogram",
#' @note write.jdbc since 2.0.0
setMethod("write.jdbc",
signature(x = "SparkDataFrame", url = "character", tableName = "character"),
function(x, url, tableName, mode = "error", ...){
function(x, url, tableName, mode = "error", ...) {
jmode <- convertToJSaveMode(mode)
jprops <- varargsToJProperties(...)
write <- callJMethod(x@sdf, "write")
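
For reference, a hedged usage sketch of write.jdbc as documented in the hunk above (the JDBC URL and credentials are placeholders):

write.jdbc(df, "jdbc:postgresql://dbserver:5432/mydb", "schema.people",
           mode = "append", user = "username", password = "password")
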
10 changes: 5 additions & 5 deletions R/pkg/R/RDD.R
@@ -887,17 +887,17 @@ setMethod("sampleRDD",

# Discards some random values to ensure each partition has a
# different random seed.
runif(partIndex)
stats::runif(partIndex)

for (elem in part) {
if (withReplacement) {
count <- rpois(1, fraction)
count <- stats::rpois(1, fraction)
if (count > 0) {
res[ (len + 1) : (len + count) ] <- rep(list(elem), count)
len <- len + count
}
} else {
if (runif(1) < fraction) {
if (stats::runif(1) < fraction) {
len <- len + 1
res[[len]] <- elem
}
@@ -965,15 +965,15 @@ setMethod("takeSample", signature(x = "RDD", withReplacement = "logical",

set.seed(seed)
samples <- collectRDD(sampleRDD(x, withReplacement, fraction,
as.integer(ceiling(runif(1,
as.integer(ceiling(stats::runif(1,
-MAXINT,
MAXINT)))))
# If the first sample didn't turn out large enough, keep trying to
# take samples; this shouldn't happen often because we use a big
# multiplier for thei initial size
while (length(samples) < total)
samples <- collectRDD(sampleRDD(x, withReplacement, fraction,
as.integer(ceiling(runif(1,
as.integer(ceiling(stats::runif(1,
-MAXINT,
MAXINT)))))

30 changes: 15 additions & 15 deletions R/pkg/R/SQLContext.R
@@ -115,7 +115,7 @@ infer_type <- function(x) {
#' Get Runtime Config from the current active SparkSession
#'
#' Get Runtime Config from the current active SparkSession.
#' To change SparkSession Runtime Config, please see `sparkR.session()`.
#' To change SparkSession Runtime Config, please see \code{sparkR.session()}.
#'
#' @param key (optional) The key of the config to get, if omitted, all config is returned
#' @param defaultValue (optional) The default value of the config to return if they config is not
@@ -720,11 +720,11 @@ dropTempView <- function(viewName) {
#'
#' Returns the dataset in a data source as a SparkDataFrame
#'
#' The data source is specified by the `source` and a set of options(...).
#' If `source` is not specified, the default data source configured by
#' The data source is specified by the \code{source} and a set of options(...).
#' If \code{source} is not specified, the default data source configured by
#' "spark.sql.sources.default" will be used. \cr
#' Similar to R read.csv, when `source` is "csv", by default, a value of "NA" will be interpreted
#' as NA.
#' Similar to R read.csv, when \code{source} is "csv", by default, a value of "NA" will be
#' interpreted as NA.
#'
#' @param path The path of files to load
#' @param source The name of external data source
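
For reference, a hedged sketch of read.df with an explicit source, as described above (the file path and options are illustrative):

people <- read.df("data/people.csv", source = "csv",
                  header = "true", inferSchema = "true")
# With source = "csv", the string "NA" is interpreted as NA by default,
# similar to R's read.csv.
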
@@ -791,8 +791,8 @@ loadDF <- function(x, ...) {
#' Creates an external table based on the dataset in a data source,
#' Returns a SparkDataFrame associated with the external table.
#'
#' The data source is specified by the `source` and a set of options(...).
#' If `source` is not specified, the default data source configured by
#' The data source is specified by the \code{source} and a set of options(...).
#' If \code{source} is not specified, the default data source configured by
#' "spark.sql.sources.default" will be used.
#'
#' @param tableName a name of the table.
@@ -830,22 +830,22 @@ createExternalTable <- function(x, ...) {
#' Additional JDBC database connection properties can be set (...)
#'
#' Only one of partitionColumn or predicates should be set. Partitions of the table will be
#' retrieved in parallel based on the `numPartitions` or by the predicates.
#' retrieved in parallel based on the \code{numPartitions} or by the predicates.
#'
#' Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash
#' your external database systems.
#'
#' @param url JDBC database url of the form `jdbc:subprotocol:subname`
#' @param url JDBC database url of the form \code{jdbc:subprotocol:subname}
#' @param tableName the name of the table in the external database
#' @param partitionColumn the name of a column of integral type that will be used for partitioning
#' @param lowerBound the minimum value of `partitionColumn` used to decide partition stride
#' @param upperBound the maximum value of `partitionColumn` used to decide partition stride
#' @param numPartitions the number of partitions, This, along with `lowerBound` (inclusive),
#' `upperBound` (exclusive), form partition strides for generated WHERE
#' clause expressions used to split the column `partitionColumn` evenly.
#' @param lowerBound the minimum value of \code{partitionColumn} used to decide partition stride
#' @param upperBound the maximum value of \code{partitionColumn} used to decide partition stride
#' @param numPartitions the number of partitions, This, along with \code{lowerBound} (inclusive),
#' \code{upperBound} (exclusive), form partition strides for generated WHERE
#' clause expressions used to split the column \code{partitionColumn} evenly.
#' This defaults to SparkContext.defaultParallelism when unset.
#' @param predicates a list of conditions in the where clause; each one defines one partition
#' @param ... additional JDBC database connection named propertie(s).
#' @param ... additional JDBC database connection named properties.
#' @return SparkDataFrame
#' @rdname read.jdbc
#' @name read.jdbc
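
For reference, a hedged sketch of a partitioned read with read.jdbc using the parameters documented above (URL, table, and credentials are placeholders):

people <- read.jdbc("jdbc:postgresql://dbserver:5432/mydb", "schema.people",
                    partitionColumn = "id", lowerBound = 0, upperBound = 10000,
                    numPartitions = 8, user = "username", password = "password")
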