Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3809][SQL]fix HiveThriftServer2Suite to make it work correctly #2671

Closed
wants to merge 3 commits into from

Conversation

scwf
Copy link
Contributor

@scwf scwf commented Oct 6, 2014

Currently HiveThriftServer2Suite is a fake one, actually HiveThriftServer not even started there. Issues here:
1 Thriftserver not started. Testing will get this error ---
ERROR HiveThriftServer2Suite: Failed to start Hive Thrift server within 30 seconds
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
2 Thriftserver not stoped. After test finished the process of thriftserver did not exit.

This patch fix this problems as follows:
1 Since thriftserver started as a daemon in #2509, output of thestart-thriftserver.sh has be redirect to a log file such as spark-kf-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-kf.out, so to see whether this file contain "ThriftBinaryCLIService listening on" ` to assert server started successfully

2 Start server in beforeAll

3 stop server in afterAll

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@scwf
Copy link
Contributor Author

scwf commented Oct 6, 2014

cc @liancheng

@liancheng
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented Oct 6, 2014

QA tests have started for PR 2671 at commit 0081a50.

  • This patch merges cleanly.

@liancheng
Copy link
Contributor

@scwf Thanks, this is a good catch. However, I should mention that HiveThriftServer2Suite is known to be flaky before the Thrift server is made a daemon. I had opened #2214 to try to fix this issue, but unfortunately Jenkins fails because of unknown reason that I couldn't reproduce locally. After numerous unsuccessful tries, I haven't got time to get it done yet. Sorry for the trouble... The essential issue fixed in #2214 is that the exception caught in HiveThriftServer2Suite is not re-thrown in the catch clause. That's why it always passes no matter what exception is thrown.

Back to this PR, I have several comments:

  1. Personally I don't prefer to start/stop the server process in beforeAll/afterAll. I'd like to make sure every test is executed against a Thrift server process with clean states.
  2. The sleeps introduced in this PR can be eliminated by starting a tail process to watch the log file, and then monitor the output of the tail process. Since an empty log file does no harm, we can try to create a new empty log file to ensure the file exists before executing tail.
  3. Log file should be removed after stopping the server process.

Since the Jenkins failure issue in #2214 is really tricky to fix and without fixing that, we couldn't make any change to HiveThriftServer2Suite, I'm going to open new PR to have another try to fix this issue together with those left in #2214.

Thread.sleep(5000)
// logFile may have not finished, try every second
while (!logFile.exists() || (!fileToString(logFile).contains(
"ThriftBinaryCLIService listening on") && tryNum < maxTries)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tryNum is never increased.

@SparkQA
Copy link

SparkQA commented Oct 6, 2014

QA tests have finished for PR 2671 at commit 0081a50.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@scwf
Copy link
Contributor Author

scwf commented Oct 6, 2014

Hi, @liancheng, thanks for you comments that are very useful.
1 Actually i have tried start a server for a every test case but the second one failed due to "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 82266. Stop it first." Now i get the reason from your patch "A Thread.sleep has to be introduced because the kill command used in stop-thriftserver.sh is not synchronous."
2 Use a tail process is ok, actually i think about it. Since the log file won't be a big file, so here i mike to use fileToString
3 Yeah, here should remove log file

@liancheng
Copy link
Contributor

BTW, Jenkins passes because the exception re-thrown issue is not fixed in your PR :) You may check the full console output for sure https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/271/consoleFull

And mine still suffers the mysterious timeout. Keep digging...

@liancheng
Copy link
Contributor

The key point to use tail is to eliminate the sleep call rather than avoid fileToString.

@scwf
Copy link
Contributor Author

scwf commented Oct 6, 2014

Get it, i will check this

@scwf
Copy link
Contributor Author

scwf commented Oct 6, 2014

It is very strange in my local maching it is ok. Hi @liancheng, can you get the log file in "stdout> starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /home/jenkins/workspace/NewSparkPullRequestBuilder/sbin/../logs/spark-root-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-test02.amplab.out"?
I think may be there is a already exist thriftserver process that leads server start failed on Jenkins, we need the log file to check.

@liancheng
Copy link
Contributor

When trying #2214 several weeks ago, the Thrift server process simply couldn't start on Jenkins server, but everything's fine on my local machine.

However, the pull request builder had been refactored a lot by Josh. It seems that #2675 fails simply because my timeout was too tight for Jenkins. I'm trying to relax the timeout a bit.

@scwf
Copy link
Contributor Author

scwf commented Oct 6, 2014

Ok, since now the process output redirect to log file, we can also check with it to see where is the problem

@scwf
Copy link
Contributor Author

scwf commented Oct 6, 2014

@liancheng, can you retest this, i add some print to diagnose this

@liancheng
Copy link
Contributor

Actually #2214 fixes several issues, you need all of them to make HiveThriftServer2Suite healthy. Please refer to the PR description for details.

@SparkQA
Copy link

SparkQA commented Oct 6, 2014

QA tests have started for PR 2671 at commit 48979f6.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 6, 2014

QA tests have finished for PR 2671 at commit 48979f6.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case e: Exception => logError("Source class " + classPath + " cannot be instantiated", e)
    • case class CacheTableCommand(tableName: String, plan: Option[LogicalPlan], isLazy: Boolean)
    • case class UncacheTableCommand(tableName: String) extends Command
    • case class CacheTableCommand(
    • case class UncacheTableCommand(tableName: String) extends LeafNode with Command
    • case class DescribeCommand(child: SparkPlan, output: Seq[Attribute])(

@scwf
Copy link
Contributor Author

scwf commented Oct 8, 2014

we will fix it in #2675

@scwf scwf closed this Oct 8, 2014
@scwf scwf deleted the fix-HiveThriftServer2Suite branch October 16, 2014 05:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants