
[SPARK-18167] Re-enable the non-flaky parts of SQLQuerySuite #15725

Closed
wants to merge 19 commits

Conversation

@ericl (Contributor) commented Nov 2, 2016

What changes were proposed in this pull request?

It seems the proximate cause of the test failures is that `cast(str as decimal)` in Derby will raise an exception instead of returning NULL. This is a problem since Hive sometimes inserts `__HIVE_DEFAULT_PARTITION__` entries into the partition table, as documented here: https://github.com/apache/hive/blob/trunk/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java#L1034

Basically, when these special default partitions are present, partition pruning pushdown using the SQL-direct mode will fail due to this cast exception. As commented in `MetaStoreDirectSql.java` above, this is normally fine since Hive falls back to JDO pruning; however, when the pruning predicate contains an unsupported operator such as `>`, the fallback fails as well.
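The failure chain can be sketched outside of Spark. Below is a minimal, hypothetical Python simulation (the function names and the simplified operator whitelist are mine, not Hive's actual API): direct-SQL pruning aborts on the cast of the default-partition string, and the JDO fallback then rejects the `>` operator.

```python
from decimal import Decimal, InvalidOperation

HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

def direct_sql_prune(values, threshold):
    # Derby-style semantics: casting a non-numeric string to DECIMAL raises
    # instead of yielding NULL, so a single default-partition entry aborts
    # the whole direct-SQL pruning query.
    return [v for v in values if Decimal(v) > threshold]

def jdo_prune(values, threshold, op):
    # The JDO fallback handles only a limited operator set; anything else
    # (e.g. ">") is rejected outright.
    if op not in ("=", "!="):
        raise ValueError("unsupported operator for JDO pruning: " + op)
    return [v for v in values if (v == str(threshold)) == (op == "=")]

def prune(values, threshold, op=">"):
    # Mirrors the fallback structure: try direct SQL, fall back to JDO.
    try:
        return direct_sql_prune(values, threshold)
    except InvalidOperation:
        return jdo_prune(values, threshold, op)
```

With only numeric partition values, `prune(["1", "5"], Decimal("2"))` returns `["5"]`; add `HIVE_DEFAULT_PARTITION` to the list and the same call ends in an error from the fallback, which is the observed test failure.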

The only remaining question is why this behavior is nondeterministic. We know that retries do not help when the test flakes, so the cause must be environmental. The current best hypothesis is that some config differs between Jenkins runs, which is why this PR prints out the Spark SQL and Hive confs for the test. The hope is that by comparing the config state for failures vs. successes we can isolate the root cause of the flakiness.

Update: we could not isolate the issue. It does not seem to be due to configuration differences. As such, I'm going to re-enable the non-flaky parts of the test, since we are fairly confident these issues only occur with Derby (which is not used in production).

How was this patch tested?

N/A

@ericl (Contributor, Author) commented Nov 2, 2016

This also re-enables the test. @marmbrus feel free to revert this patch as soon as we have a sample failure output.

@ericl (Contributor, Author) commented Nov 2, 2016

@rxin

@yhuai (Contributor) commented Nov 2, 2016

lgtm

@SparkQA commented Nov 2, 2016

Test build #67944 has finished for PR 15725 at commit 0dca3ac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericl (Contributor, Author) commented Nov 2, 2016

Oh cool it flaked on that run

jenkins retest this please. // let's try to get a successful run for comparison

@JoshRosen (Contributor) commented:

Jenkins, retest this please

@SparkQA commented Nov 2, 2016

Test build #67950 has finished for PR 15725 at commit 0dca3ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Nov 2, 2016

spark.sql.hive.metastorePartitionPruning is seen in the failed test, but not in the successful test.

@ericl (Contributor, Author) commented Nov 2, 2016

Oh wow, so it's deterministically broken, but some suite must be leaking a conf that disables the flag. This explains why I couldn't find a preceding suite to blame for the flake: it's merely the absence of a suite that causes the failure.
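The order-dependence described here can be illustrated with a tiny, hypothetical sketch (plain Python, not Spark code; the suite names and flag are stand-ins): a suite that flips a process-wide flag and never restores it makes a later suite's outcome depend purely on run order, so retries within the same order never help.

```python
def run_suites(order):
    # Fresh process-wide config for each simulated test run.
    config = {"metastorePartitionPruning": True}
    outcome = None
    for suite in order:
        if suite == "LeakySuite":
            # Flips the flag and forgets to restore it afterwards.
            config["metastorePartitionPruning"] = False
        elif suite == "SQLQuerySuite":
            # Passes only while pruning is still enabled.
            outcome = config["metastorePartitionPruning"]
    return outcome
```

Running `["SQLQuerySuite", "LeakySuite"]` passes while `["LeakySuite", "SQLQuerySuite"]` fails, matching a flake that tracks suite ordering rather than anything inside the test itself.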


@ericl closed this Nov 2, 2016
@viirya (Member) commented Nov 2, 2016

Should be HiveTableScanSuite, I think.

@viirya (Member) commented Nov 2, 2016

HiveTableScanSuite sets the config spark.sql.hive.metastorePartitionPruning to false at the end.

In the failed test run, HiveTableScanSuite runs before SQLQuerySuite. In the successful run, HiveTableScanSuite runs after SQLQuerySuite.

@ericl (Contributor, Author) commented Nov 2, 2016

After some local experimentation, I think that is actually a red herring. spark.sql.hive.metastorePartitionPruning defaults to true now, so omitting it has no effect. Also, I am not able to reproduce the failure locally by setting any of the Spark confs. (I haven't tried the Hive confs since there are so many, but I don't see anything obviously different there.)

@ericl changed the title from "[SPARK-18167] Print out spark confs, and hive confs when SQLQuerySuite fails" to "[SPARK-18167] [DO NOT MERGE] Print out spark confs, and hive confs when SQLQuerySuite fails" Nov 2, 2016
@ericl reopened this Nov 2, 2016
@SparkQA commented Nov 2, 2016

Test build #68001 has finished for PR 15725 at commit 0f3a787.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #68003 has finished for PR 15725 at commit 1e06b72.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #68007 has finished for PR 15725 at commit e131f60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #68009 has finished for PR 15725 at commit aeede50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #68005 has finished for PR 15725 at commit f002f41.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #3400 has finished for PR 15725 at commit b03bbfa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #68014 has finished for PR 15725 at commit b03bbfa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #68006 has finished for PR 15725 at commit 31041c0.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #3401 has finished for PR 15725 at commit b03bbfa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #3402 has finished for PR 15725 at commit b03bbfa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 3, 2016

Test build #68037 has finished for PR 15725 at commit b8c07d2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericl changed the title from "[SPARK-18167] [DO NOT MERGE] Print out spark confs, and hive confs when SQLQuerySuite fails" to "[SPARK-18167] Re-enable the non-flaky parts of SQLQuerySuite" Nov 3, 2016
@SparkQA commented Nov 3, 2016

Test build #68076 has finished for PR 15725 at commit b1e912b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor) commented Nov 4, 2016

LGTM. Merging to master and branch-2.1.

asfgit pushed a commit that referenced this pull request Nov 4, 2016
Author: Eric Liang <[email protected]>

Closes #15725 from ericl/print-confs-out.

(cherry picked from commit 4cee2ce)
Signed-off-by: Yin Huai <[email protected]>
@asfgit closed this in 4cee2ce Nov 4, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
Author: Eric Liang <[email protected]>

Closes apache#15725 from ericl/print-confs-out.