
[SPARK-19324][SPARKR] Spark JVM stdout output is getting dropped in SparkR #16670

Closed
felixcheung wants to merge 2 commits into apache:master from felixcheung:rjvmstdout

Conversation

felixcheung (Member) commented Jan 22, 2017

What changes were proposed in this pull request?

This mostly affects running a job from the driver in client mode, when results are expected to come back through stdout (which should be somewhat rare, but possible).

Before:

> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$count@jc, "explain", TRUE)
NULL

After:

> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$count@jc, "explain", TRUE)
count#11L
NULL

Now, `column.explain()` doesn't seem very useful on its own (we can get more extensive output with `DataFrame.explain()`), but there are other, more complex cases with `println` calls on the Scala/JVM side whose output is getting dropped.
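
For comparison, a minimal sketch (not part of this PR) of getting the fuller plan output via the SparkDataFrame-level `explain()`; the exact plan text varies by Spark version:

```r
library(SparkR)
sparkR.session()

a <- as.DataFrame(cars)
b <- group_by(a, "dist")
c <- count(b)

# Prints the parsed/analyzed/optimized/physical plans to stdout on the JVM side,
# which is the kind of output this change stops dropping in client mode.
explain(c, extended = TRUE)
```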

How was this patch tested?

manual

  } else {
-   system2(script, combinedArgs, wait = capture, stdout = capture)
+   system2(script, combinedArgs, wait = wait)
felixcheung (Member Author):

http://stat.ethz.ch/R-manual/R-devel/library/base/html/system2.html
stdout = FALSE means "discard output"
stdout = "" (the default) means send it to the console
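
A minimal sketch (not from the PR) illustrating those system2() stdout modes, using a plain `echo` as a stand-in for the Spark launcher script:

```r
system2("echo", "hello", stdout = FALSE)        # discard output; returns the exit status invisibly
system2("echo", "hello", stdout = "")           # default: output goes to the console
out <- system2("echo", "hello", stdout = TRUE)  # capture output as a character vector
print(out)                                      # "hello"
```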

felixcheung (Member Author) commented Jan 22, 2017:

http://www.astrostatistics.psu.edu/datasets/R/html/base/html/shell.html
On Windows, intern = FALSE seems to mean the output goes to the console.
(The doc page is missing on stat.ethz.ch.)
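
For illustration, a minimal Windows-only sketch (not from the PR) of that shell() behaviour; ECHO is just a stand-in command:

```r
if (.Platform$OS.type == "windows") {
  # intern = FALSE: the command's output is echoed straight to the console
  shell("ECHO a/b/c", translate = TRUE, intern = FALSE)
  # intern = TRUE: the output is captured and returned as a character vector instead
  captured <- shell("ECHO a/b/c", translate = TRUE, intern = TRUE)
  print(captured)
}
```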

Contributor:

Minor - can we explicitly pass stdout = "" and put this comment on top of it?
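
A hedged sketch (not the merged code) of what that suggestion might look like inside launchScript(); the surrounding function shape is assumed here for illustration:

```r
launchScript <- function(script, combinedArgs, wait = FALSE) {
  # stdout = "" (the system2 default) sends the child's output to the console;
  # FALSE/NULL would discard it, TRUE would capture it as a character vector.
  system2(script, combinedArgs, wait = wait, stdout = "")
}
```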

SparkQA commented Jan 22, 2017

Test build #71790 has finished for PR 16670 at commit 294ce99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -20,7 +20,7 @@ test_that("sparkJars tag in SparkContext", {
   if (.Platform$OS.type != "windows") {
     skip("This test is only for Windows, skipped")
   }
-  testOutput <- launchScript("ECHO", "a/b/c", capture = TRUE)
+  testOutput <- launchScript("ECHO", "a/b/c", wait = TRUE)
Contributor:

Can we add a similar test with something getting printed on stdout from the JVM?

felixcheung (Member Author):

We could, but unfortunately we don't actually call launchScript with wait/capture = TRUE; we call it with wait/capture = FALSE and expect the console/stdout output to leak through, and the call returns NULL.

I'll try to add a test for that.

felixcheung (Member Author):

Hmm, I've tried, and I don't think it would work.
When calling system2(..., wait = FALSE, stdout = "") the output to stdout actually comes from the child process, so I don't think we would be able to see it from the R process.
We could redirect it, but then it would be the same as system2(..., wait = FALSE, stdout = TRUE), and again that wouldn't be what we normally call.

I think we would need to dig deeper on this.

Contributor:

I see - I expected stdout = "" to pipe it to the R process's stdout. Could we also explicitly pass a fd to do this pipe?
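
As a sketch only (not from this PR), one way to approximate that: system2() also accepts a file name for stdout, which the R process can read back and print:

```r
logFile <- tempfile(fileext = ".log")
# Launch asynchronously and send the child's stdout to a file instead of the console.
system2("echo", "output from the child process", wait = FALSE, stdout = logFile)
Sys.sleep(1)                          # crude wait; the child runs in the background
cat(readLines(logFile), sep = "\n")   # surface the child's stdout in the R session
```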

felixcheung (Member Author) commented Jan 23, 2017:

From what I observed running SparkR as a package, I'm not sure we should always pipe/redirect the stdout - it could get very noisy when running SparkR from an IDE. On the other hand, often the result (error) alone is not enough to debug the issue.

I'd propose we don't redirect stdout by default, to keep the IDE experience cleaner, but we should have an API to turn this on and off programmatically on demand. It isn't clear system2 supports that, though: stdout is either TRUE (capture and return as a character vector), NULL/FALSE (drop), "" (to the console of that child process), or "name" (a file name to write into).
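
Purely as a hypothetical sketch of that on/off idea (the helper and option name below are invented for illustration and are not SparkR API); since system2 can't retarget stdout after launch, the switch would have to be consulted when the child is started:

```r
# Hypothetical helper, not part of SparkR: pick the stdout destination at launch time.
launchScriptVerbose <- function(script, combinedArgs, wait = FALSE) {
  dest <- if (isTRUE(getOption("sparkr.verboseJvmOutput"))) "" else FALSE
  system2(script, combinedArgs, wait = wait, stdout = dest)
}

# Usage (hypothetical): set options(sparkr.verboseJvmOutput = TRUE) before starting
# SparkR to let the JVM's stdout flow to the console; otherwise it is discarded.
```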

felixcheung (Member Author):

This is an example of what you see in an R IDE:

> head(p, 40)
Error in handleErrors(returnStatus, conn) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 70.0 failed 1 times, most recent failure: Lost task 0.0 in stage 70.0 (TID 115, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$4: (string) => double)
       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
       at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
       at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
       at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
       at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
       at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
       at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)

Contributor:

Is this what they see before this change or after it?

felixcheung (Member Author) commented Jan 26, 2017:

This is from #16670 (comment) - it's the "often the result (error) is not enough to debug the issue" case.

our options are:

Contributor:

I think the second option is fine - i.e. not redirecting it by default but letting it share stdout with the R process. That way, if the R IDE has some way to save or view logs from R's stdout, users can use that (does RStudio have something like this?).

The API to redirect on demand might be useful (it'd be something like setLogLevel?) but I'm not sure we can change it for an already running JVM.

Anyway, let me review this PR one more time; I think we can discuss the new API in a separate JIRA.
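
For reference, the existing SparkR knob alluded to above; a redirect-on-demand API would presumably follow a similar pattern (this only shows setLogLevel, not any new API):

```r
library(SparkR)
sparkR.session()
# Changes the log level of the already-running JVM SparkContext on demand.
setLogLevel("INFO")
```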

SparkQA commented Jan 27, 2017

Test build #72084 has finished for PR 16670 at commit 322b760.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

shivaram (Contributor):

LGTM. Merging this to master, branch-2.1

asfgit pushed a commit that referenced this pull request Jan 27, 2017
[SPARK-19324][SPARKR] Spark JVM stdout output is getting dropped in SparkR

## What changes were proposed in this pull request?

This mostly affects running a job from the driver in client mode, when results are expected to come back through stdout (which should be somewhat rare, but possible).

Before:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$count@jc, "explain", TRUE)
NULL
```

After:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$count@jc, "explain", TRUE)
count#11L
NULL
```

Now, `column.explain()` doesn't seem very useful on its own (we can get more extensive output with `DataFrame.explain()`), but there are other, more complex cases with `println` calls on the Scala/JVM side whose output is getting dropped.

## How was this patch tested?

manual

Author: Felix Cheung <[email protected]>

Closes #16670 from felixcheung/rjvmstdout.

(cherry picked from commit a7ab6f9)
Signed-off-by: Shivaram Venkataraman <[email protected]>
asfgit closed this in a7ab6f9 on Jan 27, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
[SPARK-19324][SPARKR] Spark JVM stdout output is getting dropped in SparkR
