-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spark-5068][SQL]Fix bug query data when path doesn't exist for HiveContext #5059
Conversation
ok to test |
Test build #28738 has finished for PR 5059 at commit
|
val hivePartitionRDDs = partitionToDeserializer.map { case (partition, partDeserializer) => | ||
// SPARK-5068:get FileStatus and do the filtering locally when the path is not exists | ||
|
||
var existPathSet =collection.mutable.Set[String]() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space after =
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also the indent id off
Lots of style comments. Please checkout: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide Also, while this seems better it does seem like it could have a non-trivial performance penalty. Maybe we should have a config flag to turn it off? |
Test build #28791 has finished for PR 5059 at commit
|
Test build #28795 has finished for PR 5059 at commit
|
Thanks for your review. |
Test build #28796 has finished for PR 5059 at commit
|
def verifyPartitionPath( | ||
partitionToDeserializer: Map[HivePartition, Class[_ <: Deserializer]]): | ||
Map[HivePartition, Class[_ <: Deserializer]] = { | ||
if (!sc.getConf("spark.sql.hive.verifyPartitionPath", "true").toBoolean) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this into SQLConf?
Thanks for working on this! Only two minor comments, then I think we can probably commit this. We can probably leave the flag undocumented for now and only add it to the docs if we see real world cases where it makes a performance difference. |
val pathPattern = new Path(pathPatternStr) | ||
val fs = pathPattern.getFileSystem(sc.hiveconf) | ||
val matches = fs.globStatus(pathPattern) | ||
matches.map(fileStatus => existPathSet += fileStatus.getPath.toString) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use foreach
?
Test build #29776 has finished for PR 5059 at commit
|
Thanks for suggestion ! @chenghao-intel @marmbrus |
@marmbrus On your earlier documentation point, I was thinking that there might be genuine cases where the actual location of a partition might not be matching the expected partition structure, and where this check needs to be disabled. The case I just encountered was a unit test which was adding as a partition a randomly generated temp directory using ALTER TABLE / ADD PARTITION, which leads to an NPE in https://github.com/apache/spark/pull/5059/files#diff-8887a877bd52611df9aea06ccfe3a2d7R166 when running a SELECT against the corresponding table. Disabling the check using that configuration flag gets rid of that exception. Really not sure how common that use-case is though. |
We turned this off by default in Spark 1.5 as it was causing problems similar to what you saw. |
Perfect, thanks! |
This PR follow up PR #3907 & #3891 & #4356.
According to @marmbrus @liancheng 's comments, I try to use fs.globStatus to retrieve all FileStatus objects under path(s), and then do the filtering locally.
[1]. get pathPattern by path, and put it into pathPatternSet. (hdfs://cluster/user/demo/2016/08/12 -> hdfs://cluster/user/demo///*)
[2]. retrieve all FileStatus objects ,and cache them by undating existPathSet.
[3]. do the filtering locally
[4]. if we have new pathPattern,do 1,2 step again. (external table maybe have more than one partition pathPattern)
@chenghao-intel @jeanlyn