
[SPARK-3193] Output error info when Process exit code is not zero in test suite #2108

Closed
wants to merge 2 commits

Conversation

scwf
Contributor

@scwf scwf commented Aug 24, 2014

https://issues.apache.org/jira/browse/SPARK-3193
I noticed that PR tests sometimes fail because the Process exit code != 0; for example:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18688/consoleFull
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19118/consoleFull

[info] SparkSubmitSuite:
[info] - prints usage on empty input
[info] - prints usage with only --help
[info] - prints error with unrecognized options
[info] - handle binary specified but not class
[info] - handles arguments with --key=val
[info] - handles arguments to user program
[info] - handles arguments to user program with name collision
[info] - handles YARN cluster mode
[info] - handles YARN client mode
[info] - handles standalone cluster mode
[info] - handles standalone client mode
[info] - handles mesos client mode
[info] - handles confs with flag equivalents
[info] - launch simple application with spark-submit *** FAILED ***
[info] org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, local, file:/tmp/1408854098404-0/testJar-1408854098404.jar) exited with code 1
[info] at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:872)
[info] at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
[info] at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
[info] at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
[info] at org.apacSpark assembly has been built with Hive, including Datanucleus jars on classpath

This PR outputs the process's error info when it fails, which can be helpful for diagnosis.
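The idea can be sketched outside Spark with `scala.sys.process` (a minimal illustration of the technique, not the actual `Utils.executeAndGetOutput` change; `runAndReport` is a hypothetical name):

```scala
import scala.sys.process.{Process, ProcessLogger}

// Minimal sketch: run a command, capture stdout and stderr separately,
// and surface the captured stderr when the exit code is nonzero, so a
// failing forked process leaves a diagnosable trace in the test log.
def runAndReport(cmd: Seq[String]): String = {
  val out = new StringBuilder
  val err = new StringBuilder
  val logger = ProcessLogger(
    line => out.append(line).append('\n'),  // child's stdout
    line => err.append(line).append('\n'))  // child's stderr
  val exitCode = Process(cmd).!(logger)
  if (exitCode != 0) {
    // Include the error output in the exception instead of swallowing it.
    sys.error(s"Process $cmd exited with code $exitCode:\n$err")
  }
  out.toString
}
```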

@AmplabJenkins

Can one of the admins verify this patch?

@ScrapCodes
Member

Hey! Thanks for raising this concern.

The convention in Spark is to look in [sub-project]/target/unit-tests.log, and this applies to all test suites. So when you see a particular test fail on Jenkins, you can rerun that test locally and then check that sub-project's unit-tests.log.

I hope this helps. You can close this PR if you are convinced.

P.S.: Maybe we can expand our wiki page with this information.

@scwf
Contributor Author

scwf commented Aug 25, 2014

Hi @ScrapCodes, I think unit-tests.log is very big, and it's hard to find the lines that belong to a given test suite. The key point is that Jenkins fails intermittently due to "Process exitcode != 0", so rerunning may not reproduce the error. This PR outputs the error info only on failure.

See our discussion on the dev list:
http://apache-spark-developers-list.1001551.n3.nabble.com/mvn-test-error-td7858.html

@ScrapCodes
Member

Ahh, I am still not sure about the changes. Maybe the PR has more changes than just the fix you described?

@scwf
Contributor Author

scwf commented Aug 25, 2014

@ScrapCodes, yeah, here I also fixed the log4j config of the forked process, because the old config was not valid: with it, the forked process's InputStream produced no output.

@andrewor14
Contributor

test this please

@SparkQA

SparkQA commented Aug 27, 2014

QA tests have started for PR 2108 at commit 563fde1.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 27, 2014

Tests timed out after a configured wait of 120m.

@andrewor14
Contributor

retest this please

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have started for PR 2108 at commit 563fde1.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have finished for PR 2108 at commit 563fde1.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@scwf
Contributor Author

scwf commented Aug 29, 2014

@andrewor14, tests passed~

@andrewor14
Contributor

@scwf Looks like the code already redirects all stderr of the subprocess to the console. Also, if the process does fail, what we want is not the stdout but stderr, but your code logs the stdout to error. Is there a reason why you're configuring log4j at all? Not sure if I understand that part of your changes.

@scwf
Contributor Author

scwf commented Aug 30, 2014

@andrewor14, stderr may not be enough when the process exit code != 0 (see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19505/consoleFull). The reason I configure log4j is that

Logger.getRootLogger().setLevel(Level.WARN)

in DriverSuite has no effect unless we have configured log4j properties.
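For context on why a properties file matters here: `Logger.getRootLogger().setLevel(...)` only adjusts an already-initialized log4j, so the forked JVM also needs log4j 1.x configured to write somewhere the parent process can capture. An illustrative `log4j.properties` (not the exact file from this PR) could look like:

```properties
# Illustrative log4j 1.x config for the forked test JVM: send everything at
# WARN and above to stderr so the parent process can capture it.
log4j.rootLogger=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.Target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```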

@SparkQA

SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

@andrewor14
Contributor

I see. I would like to see the SparkSubmitSuite and DriverSuite tests fail with this PR so we know what the messages look like before we merge it. Unfortunately these are flaky themselves so we may have to test this a few times...

ok to test

@JoshRosen
Contributor

These tests are reliably failing in the Jenkins Maven build for Spark Master with YARN.

Maybe they'll fail here, too.

Jenkins, retest this please.

@JoshRosen
Contributor

I SSH'ed into one of the Jenkins boxes and ran the Maven build with this patch, which resulted in a very interesting error message when SparkSubmitSuite failed:

14/09/08 11:41:40 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
14/09/08 11:41:40 WARN AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4044: java.net.BindException: Address already in use
[... x 15 ..]
java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:444)
    at sun.nio.ch.Net.bind(Net.java:436)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
    at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
    at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.eclipse.jetty.server.Server.doStart(Server.java:293)
    at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:192)
    at org.apache.spark.ui.JettyUtils$$anonfun$3.apply(JettyUtils.scala:202)
    at org.apache.spark.ui.JettyUtils$$anonfun$3.apply(JettyUtils.scala:202)
    at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1414)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1410)
    at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:202)
    at org.apache.spark.ui.WebUI.bind(WebUI.scala:102)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:225)
    at org.apache.spark.deploy.JarCreationTest$.main(SparkSubmitSuite.scala:322)
    at org.apache.spark.deploy.JarCreationTest.main(SparkSubmitSuite.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
- spark submit includes jars passed in through --jar *** FAILED ***
  org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, local-cluster[2,1,512], --jars, file:/tmp/1410205297744-1/testJar-1410205297797.jar,file:/tmp/1410205297797-0/testJar-1410205297849.jar, file:/tmp/1410205297744-0/testJar-1410205297744.jar) exited with code 1
  at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:840)
  at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
  at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305)
  at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
  at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
  at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
  at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...

It looks like the SparkSubmitSuite test is failing due to running out of Spark web UI ports. This explains why I couldn't reproduce this failure when running that test in isolation.
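The "failed after 16 retries!" behavior in `Utils.startServiceOnPort` amounts to a linear port scan; a hedged standalone sketch of that retry loop (using plain `ServerSocket` rather than Jetty, and a hypothetical `startOnPort` name):

```scala
import java.net.ServerSocket
import scala.util.Try

// Sketch of the retry strategy: try startPort and, on a bind failure, step
// to the next port, giving up after maxRetries attempts. If every port in
// the window is occupied, the caller sees the equivalent of the
// "failed after 16 retries!" error above.
def startOnPort(startPort: Int, maxRetries: Int = 16): Option[ServerSocket] =
  (0 to maxRetries).iterator
    .map(offset => Try(new ServerSocket(startPort + offset)).toOption)
    .collectFirst { case Some(socket) => socket }
```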

@JoshRosen
Contributor

Oh, and the test failure message helpfully included the actual spark-submit command:

- spark submit includes jars passed in through --jar *** FAILED ***
  org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, local-cluster[2,1,512], --jars, file:/tmp/1410205297744-1/testJar-1410205297797.jar,file:/tmp/1410205297797-0/testJar-1410205297849.jar, file:/tmp/1410205297744-0/testJar-1410205297744.jar) exited with code 1
  ...

@JoshRosen
Contributor

I'm going to merge this patch, since this additional logging will be helpful in diagnosing the problem behind the failing DriverSuite and SparkSubmitSuite tests.

@JoshRosen
Contributor

Actually, let me make sure that it passes Jenkins first...

Jenkins, this is ok to test.

@srowen
Member

srowen commented Sep 8, 2014

I also was unable to reproduce it in Maven when only SparkSubmitSuite was run. This does seem consistent with the failure being caused by port conflict but I still can't guess why it doesn't happen in SBT.

I found that setting <forkMode>never</forkMode> in the scalatest plugin causes this stack trace to show up immediately in the test output too, which is at least handy for reproducing it.

Another interesting data point is that if I hack the code temporarily to choose a random UI port, it's all fine.

Choosing ports randomly might be a good idea in the longer-term anyway to facilitate parallel tests, and I believe is mentioned in another PR.

I suppose it's a potential solution here too, but would be good to understand why it doesn't happen with SBT. Neither is parallel. Maybe some difference in the test lifecycle allows the previous web servers to shut down more reliably before the next ones start? I'm only guessing at this stage.

@andrewor14
Contributor

@srowen Actually we already use random ports in SBT tests (by setting spark.ui.port to 0). This was introduced in #1777. This may be why the tests are failing in MVN much more frequently than in SBT. This also explains why we can't reproduce the test failure when we only run one or two tests, because we haven't had a chance to accumulate open ports.

@JoshRosen Even before this patch it printed the command it ran, but it seems that with this patch we also get the actual stack trace with the BindException, which is very nice.
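To make the `spark.ui.port=0` trick concrete: binding to port 0 asks the OS for any currently free ephemeral port, which is why randomized suites rarely collide. A tiny standalone illustration:

```scala
import java.net.ServerSocket

// Binding to port 0 delegates port selection to the OS, which hands back
// a currently-free ephemeral port; getLocalPort reveals the choice.
val socket = new ServerSocket(0)
val chosenPort = socket.getLocalPort
socket.close()
```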

@JoshRosen
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Sep 9, 2014

QA tests have started for PR 2108 at commit 563fde1.

  • This patch merges cleanly.


@SparkQA

SparkQA commented Sep 9, 2014

QA tests have finished for PR 2108 at commit 563fde1.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -18,9 +18,9 @@
package org.apache.spark

import java.io.File
import java.util.Properties
Contributor

import not used I think


- import org.apache.log4j.Logger
- import org.apache.log4j.Level
+ import org.apache.log4j.{PropertyConfigurator, Logger, Level}
Contributor

same here?

Contributor Author

yes, thanks for this

@srowen
Member

srowen commented Sep 9, 2014

@andrewor14 @JoshRosen Bingo! Adding <spark.ui.port>0</spark.ui.port> to the Maven build makes SparkSubmitSuite pass for me where it failed before. This was already set in the SBT build. I'll submit a PR against https://issues.apache.org/jira/browse/SPARK-3404

@scwf
Contributor Author

scwf commented Sep 9, 2014

@srowen, that is cool, but it cannot explain the SBT test failures. We can use this PR to run the tests a few more times to diagnose the SBT problem.

@srowen
Member

srowen commented Sep 9, 2014

@scwf The SBT build has already set -Dspark.ui.port=0, so I suspect this is in fact the difference:
https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L340

@scwf
Contributor Author

scwf commented Sep 9, 2014

@srowen, yeah, we should set spark.ui.port in Maven too, but the 'exited with code 1' error has also happened when testing with SBT, so there may be some other cause behind this.

@srowen
Member

srowen commented Sep 9, 2014

@scwf Yes there is still some underlying issue here, where tests hold open ports somehow for a long time. Randomizing the starting port usually avoids most collision, but not all the time. It's a step forward but it is not the ultimate solution.

@JoshRosen
Contributor

I wonder if this is a resource contention issue from having many parallel copies of the tests running on the same Jenkins worker. For example, we might be exhausting ephemeral ports.

I think that #2259 will help with this by dramatically reducing the number of ephemeral ports used by PySpark jobs.

@scwf
Contributor Author

scwf commented Sep 9, 2014

@JoshRosen, exhausting ephemeral ports seems unlikely; if that is the cause, we should reduce the number of ephemeral ports we use. Or we could verify that a port is free before using it, retrying several times to find one?

@andrewor14
Contributor

@scwf Actually we already try several times before getting a free one. My interpretation of this is we simply ran out of ports, such that no matter how many times we retry we can't get a free one. Using a random port for the UI certainly reduces probability of collision, but yes it is definitely not the final solution as it simply puts maven tests on par with SBT tests, the latter of which still fail sometimes but not always.

@JoshRosen #2259 may help for PySpark tests, but SparkSubmitSuite and DriverSuite are run before those, so regardless we need to find another way to reduce the number of ports used.

One solution @JoshRosen and I discussed is to simply not start the SparkUI during tests (maybe except for the UISuite). This alone will surely free up a significant number of ports we currently occupy.
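The "don't start the SparkUI during tests" idea could look roughly like this (a hypothetical sketch under assumed names — the `spark.ui.enabled` flag and `maybeStartUi` helper are illustrative, not Spark's actual API at the time):

```scala
import java.net.ServerSocket

// Gate UI startup on a config flag so suites that never look at the web UI
// consume no listening ports at all. Returns the bound socket when enabled.
def maybeStartUi(conf: Map[String, String]): Option[ServerSocket] = {
  val enabled = conf.getOrElse("spark.ui.enabled", "true").toBoolean
  if (enabled) Some(new ServerSocket(0)) else None
}
```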

@andrewor14
Contributor

For now, I will merge this (master + 1.1) because this helps us debug these test failures. Thanks.

@asfgit asfgit closed this in 2686233 Sep 9, 2014
asfgit pushed a commit that referenced this pull request Sep 9, 2014
…est suite


Author: scwf <[email protected]>

Closes #2108 from scwf/output-test-error-info and squashes the following commits:

0c48082 [scwf] minor fix according to comments
563fde1 [scwf] output errer info when Process exitcode not zero

(cherry picked from commit 2686233)
Signed-off-by: Andrew Or <[email protected]>

7 participants