
[SPARK-19324][SPARKR] Spark JVM stdout output is getting dropped in SparkR #16670

Closed
felixcheung wants to merge 2 commits into apache:master from felixcheung:rjvmstdout

Conversation

felixcheung (Member) commented Jan 22, 2017

What changes were proposed in this pull request?

This mostly affects running a job from the driver in client mode, when results are expected to come back through stdout (which should be somewhat rare, but possible).

Before:

> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$count@jc, "explain", TRUE)
NULL

After:

> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$count@jc, "explain", TRUE)
count#11L
NULL

Now, `column.explain()` doesn't seem very useful on its own (we can get more extensive output with `DataFrame.explain()`), but there are other, more complex cases with `println` calls on the Scala/JVM side whose output is getting dropped.
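
For comparison, a minimal sketch (not part of this PR) of getting the fuller plan output via the SparkDataFrame-level `explain()`; the exact plan text varies by Spark version:

```r
library(SparkR)
sparkR.session()

a <- as.DataFrame(cars)
b <- group_by(a, "dist")
c <- count(b)

# Prints the parsed/analyzed/optimized/physical plans to stdout on the JVM side,
# which is the kind of output this change stops dropping in client mode.
explain(c, extended = TRUE)
```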

How was this patch tested?

manual

  } else {
-   system2(script, combinedArgs, wait = capture, stdout = capture)
+   system2(script, combinedArgs, wait = wait)
felixcheung (Member Author):

http://stat.ethz.ch/R-manual/R-devel/library/base/html/system2.html
stdout = FALSE means "discard output"
stdout = "" (the default) means send it to the console
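
A minimal sketch (not from the PR) illustrating those system2() stdout modes, using a plain `echo` as a stand-in for the Spark launcher script:

```r
system2("echo", "hello", stdout = FALSE)        # discard output; returns the exit status invisibly
system2("echo", "hello", stdout = "")           # default: output goes to the console
out <- system2("echo", "hello", stdout = TRUE)  # capture output as a character vector
print(out)                                      # "hello"
```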

felixcheung (Member Author) commented Jan 22, 2017:

http://www.astrostatistics.psu.edu/datasets/R/html/base/html/shell.html
On Windows, intern = FALSE seems to mean the output goes to the console.
(The doc page is missing on stat.ethz.ch.)
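
For illustration, a minimal Windows-only sketch (not from the PR) of that shell() behaviour; ECHO is just a stand-in command:

```r
if (.Platform$OS.type == "windows") {
  # intern = FALSE: the command's output is echoed straight to the console
  shell("ECHO a/b/c", translate = TRUE, intern = FALSE)
  # intern = TRUE: the output is captured and returned as a character vector instead
  captured <- shell("ECHO a/b/c", translate = TRUE, intern = TRUE)
  print(captured)
}
```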

Contributor:

Minor - can we explicitly pass stdout = "" and put this comment on top of it?
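
A hedged sketch (not the merged code) of what that suggestion might look like inside launchScript(); the surrounding function shape is assumed here for illustration:

```r
launchScript <- function(script, combinedArgs, wait = FALSE) {
  # stdout = "" (the system2 default) sends the child's output to the console;
  # FALSE/NULL would discard it, TRUE would capture it as a character vector.
  system2(script, combinedArgs, wait = wait, stdout = "")
}
```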

SparkQA commented Jan 22, 2017

Test build #71790 has finished for PR 16670 at commit 294ce99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -20,7 +20,7 @@ test_that("sparkJars tag in SparkContext", {
   if (.Platform$OS.type != "windows") {
     skip("This test is only for Windows, skipped")
   }
-  testOutput <- launchScript("ECHO", "a/b/c", capture = TRUE)
+  testOutput <- launchScript("ECHO", "a/b/c", wait = TRUE)
Contributor:

Can we add a similar test with something getting printed on stdout from the JVM?

felixcheung (Member Author):

We could, but unfortunately we don't actually call launchScript with wait/capture = TRUE; we call it with wait/capture = FALSE and expect the console/stdout output to leak through, and the call returns NULL.

I'll try to add a test for that.

felixcheung (Member Author):

Hmm, I've tried, and I don't think it would work.
When calling system2(..., wait = FALSE, stdout = "") the output to stdout actually comes from the child process, so I don't think we would be able to see it from the R process.
We could redirect it, but then it would be the same as system2(..., wait = FALSE, stdout = TRUE), and again that wouldn't be what we normally call.

I think we would need to dig deeper on this.

Contributor:

I see - I expected stdout = "" to pipe it to the R process's stdout. Could we also explicitly pass a fd to do this pipe?
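
As a sketch only (not from this PR), one way to approximate that: system2() also accepts a file name for stdout, which the R process can read back and print:

```r
logFile <- tempfile(fileext = ".log")
# Launch asynchronously and send the child's stdout to a file instead of the console.
system2("echo", "output from the child process", wait = FALSE, stdout = logFile)
Sys.sleep(1)                          # crude wait; the child runs in the background
cat(readLines(logFile), sep = "\n")   # surface the child's stdout in the R session
```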

felixcheung (Member Author) commented Jan 23, 2017:

From what I observed running SparkR as a package, I'm not sure we should always pipe/redirect the stdout - it could get very noisy when running SparkR from an IDE. On the other hand, often the result (error) alone is not enough to debug the issue.

I'd propose we don't redirect stdout by default, to keep the IDE experience cleaner, but we should have an API to turn this on and off programmatically on demand. It isn't clear system2 supports that, though: stdout is either TRUE (capture and return as a character vector), NULL/FALSE (drop), "" (to the console of that child process), or "name" (a file name to write into).
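
Purely as a hypothetical sketch of that on/off idea (the helper and option name below are invented for illustration and are not SparkR API); since system2 can't retarget stdout after launch, the switch would have to be consulted when the child is started:

```r
# Hypothetical helper, not part of SparkR: pick the stdout destination at launch time.
launchScriptVerbose <- function(script, combinedArgs, wait = FALSE) {
  dest <- if (isTRUE(getOption("sparkr.verboseJvmOutput"))) "" else FALSE
  system2(script, combinedArgs, wait = wait, stdout = dest)
}

# Usage (hypothetical): set options(sparkr.verboseJvmOutput = TRUE) before starting
# SparkR to let the JVM's stdout flow to the console; otherwise it is discarded.
```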

felixcheung (Member Author):

This is an example of what you see in an R IDE:

> head(p, 40)
Error in handleErrors(returnStatus, conn) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 70.0 failed 1 times, most recent failure: Lost task 0.0 in stage 70.0 (TID 115, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$4: (string) => double)
       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
       at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
       at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
       at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
       at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
       at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
       at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)

Contributor:

Is this what they see before this change or after it?

felixcheung (Member Author) commented Jan 26, 2017:

This is from #16670 (comment) - it's the "often the result (error) is not enough to debug the issue" case.

our options are:

Contributor:

I think the second option is fine - i.e. not redirecting it by default but letting it share stdout with the R process. That way, if the R IDE has some way to save or view logs from R's stdout, users can use that (does RStudio have something like this?).

The API to redirect on demand might be useful (it'd be something like setLogLevel?) but I'm not sure we can change it for an already running JVM.

Anyway, let me review this PR one more time; I think we can discuss the new API in a separate JIRA.
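
For reference, the existing SparkR knob alluded to above; a redirect-on-demand API would presumably follow a similar pattern (this only shows setLogLevel, not any new API):

```r
library(SparkR)
sparkR.session()
# Changes the log level of the already-running JVM SparkContext on demand.
setLogLevel("INFO")
```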

SparkQA commented Jan 27, 2017

Test build #72084 has finished for PR 16670 at commit 322b760.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

shivaram (Contributor):

LGTM. Merging this to master, branch-2.1

asfgit pushed a commit that referenced this pull request Jan 27, 2017
[SPARK-19324][SPARKR] Spark JVM stdout output is getting dropped in SparkR

## What changes were proposed in this pull request?

This mostly affects running a job from the driver in client mode, when results are expected to come back through stdout (which should be somewhat rare, but possible).

Before:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$count@jc, "explain", TRUE)
NULL
```

After:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$count@jc, "explain", TRUE)
count#11L
NULL
```

Now, `column.explain()` doesn't seem very useful on its own (we can get more extensive output with `DataFrame.explain()`), but there are other, more complex cases with `println` calls on the Scala/JVM side whose output is getting dropped.

## How was this patch tested?

manual

Author: Felix Cheung <[email protected]>

Closes #16670 from felixcheung/rjvmstdout.

(cherry picked from commit a7ab6f9)
Signed-off-by: Shivaram Venkataraman <[email protected]>
asfgit closed this in a7ab6f9 on Jan 27, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
[SPARK-19324][SPARKR] Spark JVM stdout output is getting dropped in SparkR
