[SPARK-18167] Re-enable the non-flaky parts of SQLQuerySuite #15725
This also re-enables the test. @marmbrus feel free to revert this patch as soon as we have a sample failure output. |
lgtm |
Test build #67944 has finished for PR 15725 at commit
|
Oh cool, it flaked on that run. Jenkins, retest this please. // let's try to get a successful run for comparison |
Jenkins, retest this please |
Test build #67950 has finished for PR 15725 at commit
|
|
Oh wow, so it's deterministically broken, but some suite must be leaking a
|
Should be HiveTableScanSuite. I think. |
In the failed test run, |
After some local experimentation, I think that is actually a red herring. |
eac0abc to 29ae24d
29ae24d to 39e8ce7
Test build #68001 has finished for PR 15725 at commit
|
Test build #68003 has finished for PR 15725 at commit
|
Test build #68007 has finished for PR 15725 at commit
|
Test build #68009 has finished for PR 15725 at commit
|
Test build #68005 has finished for PR 15725 at commit
|
Test build #3400 has finished for PR 15725 at commit
|
Test build #68014 has finished for PR 15725 at commit
|
Test build #68006 has finished for PR 15725 at commit
|
Test build #3401 has finished for PR 15725 at commit
|
Test build #3402 has finished for PR 15725 at commit
|
Test build #68037 has finished for PR 15725 at commit
|
Test build #68076 has finished for PR 15725 at commit
|
lgtm. merging to master and branch 2.1. |
## What changes were proposed in this pull request?

It seems the proximate cause of the test failures is that `cast(str as decimal)` in Derby will raise an exception instead of returning NULL. This is a problem since Hive sometimes inserts `__HIVE_DEFAULT_PARTITION__` entries into the partition table, as documented here: https://github.com/apache/hive/blob/trunk/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java#L1034

Basically, when these special default partitions are present, partition pruning pushdown using the SQL-direct mode will fail due to this cast exception. As commented on in `MetaStoreDirectSql.java` above, this is normally fine since Hive falls back to JDO pruning; however, when the pruning predicate contains an unsupported operator such as `>`, that will fail as well.

The only remaining question is why this behavior is nondeterministic. We know that when the test flakes, retries do not help; therefore the cause must be environmental. The current best hypothesis is that some config is different between different Jenkins runs, which is why this PR prints out the Spark SQL and Hive confs for the test. The hope is that by comparing the config state for failure vs success we can isolate the root cause of the flakiness.

**Update:** we could not isolate the issue. It does not seem to be due to configuration differences. As such, I'm going to enable the non-flaky parts of the test since we are fairly confident these issues only occur with Derby (which is not used in production).

## How was this patch tested?

N/A

Author: Eric Liang <[email protected]>

Closes #15725 from ericl/print-confs-out.

(cherry picked from commit 4cee2ce)
Signed-off-by: Yin Huai <[email protected]>
This is also flaky.. |
What changes were proposed in this pull request?
It seems the proximate cause of the test failures is that `cast(str as decimal)` in Derby will raise an exception instead of returning NULL. This is a problem since Hive sometimes inserts `__HIVE_DEFAULT_PARTITION__` entries into the partition table, as documented here: https://github.com/apache/hive/blob/trunk/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java#L1034

Basically, when these special default partitions are present, partition pruning pushdown using the SQL-direct mode will fail due to this cast exception. As commented on in `MetaStoreDirectSql.java` above, this is normally fine since Hive falls back to JDO pruning; however, when the pruning predicate contains an unsupported operator such as `>`, that will fail as well.

The only remaining question is why this behavior is nondeterministic. We know that when the test flakes, retries do not help; therefore the cause must be environmental. The current best hypothesis is that some config is different between different Jenkins runs, which is why this PR prints out the Spark SQL and Hive confs for the test. The hope is that by comparing the config state for failure vs success we can isolate the root cause of the flakiness.
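The failure-vs-success comparison could look roughly like the following. This is a minimal Python sketch, not code from this PR: the helper name `diff_confs` and the sample keys and values are hypothetical, and it assumes the confs printed by the test have already been parsed into plain dictionaries.

```python
def diff_confs(passing, failing):
    """Return {key: (passing_value, failing_value)} for every key that differs.

    A key that is missing from one run is reported with None on that side.
    """
    diffs = {}
    for key in sorted(set(passing) | set(failing)):
        ok_value = passing.get(key)
        bad_value = failing.get(key)
        if ok_value != bad_value:
            diffs[key] = (ok_value, bad_value)
    return diffs


if __name__ == "__main__":
    # Hypothetical conf dumps; these keys and values are illustrative only.
    passing_run = {"example.key.a": "true", "example.key.b": "10"}
    failing_run = {"example.key.a": "true", "example.key.b": "20", "example.key.c": "x"}
    for key, (ok_value, bad_value) in diff_confs(passing_run, failing_run).items():
        print(f"{key}: passing={ok_value!r} failing={bad_value!r}")
```

An empty result from a comparison like this would be consistent with the flakiness not being caused by configuration differences.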
Update: we could not isolate the issue. It does not seem to be due to configuration differences. As such, I'm going to enable the non-flaky parts of the test since we are fairly confident these issues only occur with Derby (which is not used in production).
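To make the Derby-specific failure path concrete, here is a toy Python model of the behavior described in this PR. It is only a sketch under stated assumptions: none of these functions exist in Hive or Spark, all names are hypothetical, and the fallback is simplified to support only `=` (which operators the real JDO path accepts is not modeled here).

```python
from decimal import Decimal, InvalidOperation

DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"


class CastError(Exception):
    """Stands in for the exception Derby raises on a bad string-to-decimal cast."""


class UnsupportedOperator(Exception):
    """Stands in for the fallback path rejecting an operator it cannot handle."""


def strict_cast(value):
    # Models the behavior described above: raise instead of returning NULL.
    try:
        return Decimal(value)
    except InvalidOperation:
        raise CastError(value)


def prune_direct_sql(partitions, op, bound):
    # Every partition value is cast, so one uncastable value fails the whole query.
    compare = {"=": lambda v: v == bound, ">": lambda v: v > bound}[op]
    return [p for p in partitions if compare(strict_cast(p))]


def prune_fallback(partitions, op, bound):
    # Toy fallback: string equality only (a simplifying assumption).
    if op != "=":
        raise UnsupportedOperator(op)
    return [p for p in partitions if p == str(bound)]


def prune(partitions, op, bound):
    # Try the direct path first and fall back only when the cast fails.
    try:
        return prune_direct_sql(partitions, op, bound)
    except CastError:
        return prune_fallback(partitions, op, bound)
```

In this model, `prune(["1", "3"], ">", Decimal(2))` succeeds on the direct path, `prune(["1", "3", DEFAULT_PARTITION], "=", Decimal(3))` succeeds through the fallback, and `prune(["1", "3", DEFAULT_PARTITION], ">", Decimal(2))` fails on both paths, which mirrors the test failure described above.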
How was this patch tested?
N/A