
[SPARK-18167] Re-enable the non-flaky parts of SQLQuerySuite #15725

Closed
wants to merge 19 commits

Conversation

@ericl (Contributor) commented Nov 2, 2016

What changes were proposed in this pull request?

It seems the proximate cause of the test failures is that `cast(str as decimal)` in Derby will raise an exception instead of returning NULL. This is a problem since Hive sometimes inserts `__HIVE_DEFAULT_PARTITION__` entries into the partition table, as documented here: https://github.com/apache/hive/blob/trunk/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java#L1034

Basically, when these special default partitions are present, partition pruning pushdown using the SQL-direct mode will fail due to this cast exception. As commented in `MetaStoreDirectSql.java` above, this is normally fine since Hive falls back to JDO pruning; however, when the pruning predicate contains an unsupported operator such as `>`, the fallback fails as well.
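The failure chain can be sketched outside of Spark. Below is a minimal, hypothetical Python simulation (the function names and the simplified operator whitelist are mine, not Hive's actual API): direct-SQL pruning aborts on the cast of the default-partition string, and the JDO fallback then rejects the `>` operator.

```python
from decimal import Decimal, InvalidOperation

HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

def direct_sql_prune(values, threshold):
    # Derby-style semantics: casting a non-numeric string to DECIMAL raises
    # instead of yielding NULL, so a single default-partition entry aborts
    # the whole direct-SQL pruning query.
    return [v for v in values if Decimal(v) > threshold]

def jdo_prune(values, threshold, op):
    # The JDO fallback handles only a limited operator set; anything else
    # (e.g. ">") is rejected outright.
    if op not in ("=", "!="):
        raise ValueError("unsupported operator for JDO pruning: " + op)
    return [v for v in values if (v == str(threshold)) == (op == "=")]

def prune(values, threshold, op=">"):
    # Mirrors the fallback structure: try direct SQL, fall back to JDO.
    try:
        return direct_sql_prune(values, threshold)
    except InvalidOperation:
        return jdo_prune(values, threshold, op)
```

With only numeric partition values, `prune(["1", "5"], Decimal("2"))` returns `["5"]`; add `HIVE_DEFAULT_PARTITION` to the list and the same call ends in an error from the fallback, which is the observed test failure.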

The only remaining question is why this behavior is nondeterministic. We know that retries do not help when the test flakes, so the cause must be environmental. The current best hypothesis is that some config differs between Jenkins runs, which is why this PR prints out the Spark SQL and Hive confs for the test. The hope is that by comparing the config state for failures vs. successes we can isolate the root cause of the flakiness.

Update: we could not isolate the issue. It does not seem to be due to configuration differences. As such, I'm going to re-enable the non-flaky parts of the test, since we are fairly confident these issues only occur with Derby (which is not used in production).

How was this patch tested?

N/A

@ericl (Contributor, Author) commented Nov 2, 2016

This also re-enables the test. @marmbrus feel free to revert this patch as soon as we have a sample failure output.

@ericl (Contributor, Author) commented Nov 2, 2016

@rxin

@yhuai (Contributor) commented Nov 2, 2016

lgtm

@SparkQA commented Nov 2, 2016

Test build #67944 has finished for PR 15725 at commit 0dca3ac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericl (Contributor, Author) commented Nov 2, 2016

Oh cool it flaked on that run

jenkins retest this please. // let's try to get a successful run for comparison

@JoshRosen (Contributor) commented:

Jenkins, retest this please

@SparkQA commented Nov 2, 2016

Test build #67950 has finished for PR 15725 at commit 0dca3ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Nov 2, 2016

spark.sql.hive.metastorePartitionPruning is seen in the failed test, but not in the successful test.

@ericl (Contributor, Author) commented Nov 2, 2016

Oh wow, so it's deterministically broken, but some suite must be leaking a conf that disables the flag. This explains why I couldn't find a preceding suite to blame for the flake: it's merely the absence of a suite that causes the failure.
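The order-dependence described here can be illustrated with a tiny, hypothetical sketch (plain Python, not Spark code; the suite names and flag are stand-ins): a suite that flips a process-wide flag and never restores it makes a later suite's outcome depend purely on run order, so retries within the same order never help.

```python
def run_suites(order):
    # Fresh process-wide config for each simulated test run.
    config = {"metastorePartitionPruning": True}
    outcome = None
    for suite in order:
        if suite == "LeakySuite":
            # Flips the flag and forgets to restore it afterwards.
            config["metastorePartitionPruning"] = False
        elif suite == "SQLQuerySuite":
            # Passes only while pruning is still enabled.
            outcome = config["metastorePartitionPruning"]
    return outcome
```

Running `["SQLQuerySuite", "LeakySuite"]` passes while `["LeakySuite", "SQLQuerySuite"]` fails, matching a flake that tracks suite ordering rather than anything inside the test itself.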


@ericl closed this Nov 2, 2016
@viirya (Member) commented Nov 2, 2016

Should be HiveTableScanSuite, I think.

@viirya (Member) commented Nov 2, 2016

HiveTableScanSuite sets the config spark.sql.hive.metastorePartitionPruning to false at the end.

In the failed test run, HiveTableScanSuite runs before SQLQuerySuite. In the successful run, HiveTableScanSuite runs after SQLQuerySuite.

@ericl (Contributor, Author) commented Nov 2, 2016

After some local experimentation, I think that is actually a red herring. spark.sql.hive.metastorePartitionPruning defaults to true now, so omitting it has no effect. Also, I am not able to reproduce the failure locally by setting any of the Spark confs. (I haven't tried the Hive confs since there are so many, but I don't see anything obviously different there.)

@ericl changed the title from "[SPARK-18167] Print out spark confs, and hive confs when SQLQuerySuite fails" to "[SPARK-18167] [DO NOT MERGE] Print out spark confs, and hive confs when SQLQuerySuite fails" Nov 2, 2016
@ericl reopened this Nov 2, 2016
@SparkQA commented Nov 2, 2016

Test build #68001 has finished for PR 15725 at commit 0f3a787.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #68003 has finished for PR 15725 at commit 1e06b72.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #68007 has finished for PR 15725 at commit e131f60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #68009 has finished for PR 15725 at commit aeede50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #68005 has finished for PR 15725 at commit f002f41.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #3400 has finished for PR 15725 at commit b03bbfa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #68014 has finished for PR 15725 at commit b03bbfa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #68006 has finished for PR 15725 at commit 31041c0.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #3401 has finished for PR 15725 at commit b03bbfa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2016

Test build #3402 has finished for PR 15725 at commit b03bbfa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 3, 2016

Test build #68037 has finished for PR 15725 at commit b8c07d2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericl changed the title from "[SPARK-18167] [DO NOT MERGE] Print out spark confs, and hive confs when SQLQuerySuite fails" to "[SPARK-18167] Re-enable the non-flaky parts of SQLQuerySuite" Nov 3, 2016
@SparkQA commented Nov 3, 2016

Test build #68076 has finished for PR 15725 at commit b1e912b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor) commented Nov 4, 2016

LGTM. Merging to master and branch-2.1.

asfgit pushed a commit that referenced this pull request Nov 4, 2016
Author: Eric Liang <[email protected]>

Closes #15725 from ericl/print-confs-out.

(cherry picked from commit 4cee2ce)
Signed-off-by: Yin Huai <[email protected]>
@asfgit closed this in 4cee2ce Nov 4, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
Author: Eric Liang <[email protected]>

Closes apache#15725 from ericl/print-confs-out.