[SPARK-19324][SPARKR] Spark JVM stdout output is getting dropped in SparkR #16670
Conversation
```
 } else {
-  system2(script, combinedArgs, wait = capture, stdout = capture)
+  system2(script, combinedArgs, wait = wait)
```
http://stat.ethz.ch/R-manual/R-devel/library/base/html/system2.html
stdout = FALSE means "discard output"; stdout = "" (the default) means "to the console".
http://www.astrostatistics.psu.edu/datasets/R/html/base/html/shell.html
On Windows, intern = FALSE seems to mean output goes to the console.
(The corresponding doc page is missing on stat.ethz.ch.)
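A small Windows-only sketch of that shell() behaviour, assuming a cmd.exe environment (ECHO is a cmd builtin, as used in the test below):

```
# intern = FALSE: output goes to the console; returns the exit status
shell("ECHO hello", intern = FALSE)

# intern = TRUE: output is captured and returned as a character vector
out <- shell("ECHO hello", intern = TRUE)
print(out)
```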
Minor - can we explicitly pass stdout = "" and put this comment on top of it?
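A sketch of what that suggestion could look like applied to the hunk above (the surrounding launchScript body is assumed from the diff, not quoted in full):

```
} else {
  # stdout = "" is the system2 default: the child's output goes to the
  # console rather than being discarded; passed explicitly for clarity.
  system2(script, combinedArgs, wait = wait, stdout = "")
}
```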
Test build #71790 has finished for PR 16670 at commit
```
@@ -20,7 +20,7 @@ test_that("sparkJars tag in SparkContext", {
   if (.Platform$OS.type != "windows") {
     skip("This test is only for Windows, skipped")
   }
-  testOutput <- launchScript("ECHO", "a/b/c", capture = TRUE)
+  testOutput <- launchScript("ECHO", "a/b/c", wait = TRUE)
```
Can we add a similar test with something getting printed on stdout from the JVM?
We could, but unfortunately we don't actually call launchScript with wait/capture = TRUE;
we call it with wait/capture = FALSE and expect the console/stdout to leak through, returning NULL.
I'll try to add a test for that.
Hmm, I've tried; I don't think it would work.
When calling system2(..., wait = FALSE, stdout = ""),
the output to stdout actually comes from the child process, so I don't think we would be able to see it from the R process.
We could redirect it, but then it would be the same as system2(..., wait = FALSE, stdout = TRUE),
which again isn't what we normally call.
I think we would need to dig deeper on this.
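A sketch of the limitation being described, using `echo` as a stand-in for the JVM launch. Note that per the R documentation, `wait` is ignored (the call blocks) whenever stdout = TRUE, so an asynchronous launch cannot also capture the child's output:

```
# Async launch: the child writes to the shared console; system2() returns
# immediately without the text ever being a value in the R process.
system2("echo", "from the child", wait = FALSE, stdout = "")

# Capturing the output forces a blocking wait, which is not how SparkR
# normally launches the JVM.
out <- system2("echo", "from the child", stdout = TRUE)
print(out)
```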
I see - I expected stdout = "" to pipe it to the R process's stdout. Could we also explicitly pass a fd to do this pipe?
From what I observed running SparkR as a package, I'm not sure we should always pipe/redirect the stdout - it could get very noisy when running SparkR from an IDE. On the other hand, oftentimes the result (error) is not enough to debug the issue.
I'd propose we don't redirect stdout by default, keeping the IDE experience cleaner, but we should have an API to turn this on and off programmatically on demand. It is not clear system2 supports that, though: stdout is either TRUE (capture and return as a character vector), NULL/FALSE (drop), "" (to the console of that child process), or "name" (a file name to write into).
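One more possibility implied by the "name" mode above: redirect to a file and read it back on demand. This is only a sketch; "jvm-stdout.log" is an illustrative name, not anything SparkR actually uses:

```
log <- "jvm-stdout.log"
# Launch asynchronously, with the child's stdout going into the file
system2("echo", "some JVM output", wait = FALSE, stdout = log)
Sys.sleep(1)       # give the async child a moment to finish writing
readLines(log)     # inspect the accumulated output only when wanted
```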
This is an example of what a user in an R IDE sees:
```
> head(p, 40)
Error in handleErrors(returnStatus, conn) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 70.0 failed 1 times, most recent failure: Lost task 0.0 in stage 70.0 (TID 115, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$4: (string) => double)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
```
Is this what they see before this change or after it?
This is from #16670 (comment) - this is the "oftentimes the result (error) is not enough to debug the issue" case.
Our options are:
- redirect stdout always (could be very noisy)
- do not redirect stdout by default, but also do not drop it (this PR) - if the R shell shares the console with the JVM (aka the sparkR shell), the user would see the messages, but in an IDE the user would not, as in #16670 (comment)
- I'd further propose an API to turn on the redirect on demand, which would address the R IDE case; a rough sketch follows this list
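A purely hypothetical sketch of what such an on-demand API might look like, assuming the JVM's stdout had been redirected to a known log file at launch; neither showJVMOutput nor that log-file convention exists in SparkR:

```
# Hypothetical helper: echo whatever the JVM has written so far into the
# R console, so an IDE user can pull the output on demand.
showJVMOutput <- function(logFile = "jvm-stdout.log") {
  if (file.exists(logFile)) {
    writeLines(readLines(logFile))
  }
  invisible(NULL)
}
```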
I think the second option is fine - i.e. not redirecting it by default, but happening to share stdout with the R process. That way, if the R IDE has some way to save or view logs from R's stdout, users can use that (does RStudio have something like this?).
The API to redirect on demand might be useful (something like setLogLevel?), but I'm not sure we can change it for an already running JVM?
Anyway, let me review this PR one more time; I think we can discuss the new API in a separate JIRA.
Test build #72084 has finished for PR 16670 at commit
LGTM. Merging this to master, branch-2.1
Author: Felix Cheung <[email protected]> Closes #16670 from felixcheung/rjvmstdout. (cherry picked from commit a7ab6f9) Signed-off-by: Shivaram Venkataraman <[email protected]>
## What changes were proposed in this pull request?

This mostly affects running a job from the driver in client mode, when results are expected to come through stdout (which should be somewhat rare, but possible).

Before:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$countjc, "explain", TRUE)
NULL
```

After:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$countjc, "explain", TRUE)
count#11L
NULL
```

Now, `column.explain()` doesn't seem very useful (we can get more extensive output with `DataFrame.explain()`), but there are other, more complex examples with calls to `println` on the Scala/JVM side that are getting dropped.

## How was this patch tested?

manual