[SPARK-4443][SQL] Fix statistics for external table in spark sql hive
The `totalSize` of an external table is always zero, which influences the join strategy (a broadcast join is always chosen for external tables).

Author: w00228970 <[email protected]>

Closes #3304 from scwf/statistics and squashes the following commits:

568f321 [w00228970] fix statistics for external table

(cherry picked from commit 42389b1)
Signed-off-by: Michael Armbrust <[email protected]>
scwf authored and marmbrus committed Nov 18, 2014
1 parent ff2fe56 commit 060d621
Showing 3 changed files with 12 additions and 3 deletions.
@@ -447,16 +447,21 @@ private[hive] case class MetastoreRelation
 
   @transient override lazy val statistics = Statistics(
     sizeInBytes = {
+      val totalSize = hiveQlTable.getParameters.get(HiveShim.getStatsSetupConstTotalSize)
+      val rawDataSize = hiveQlTable.getParameters.get(HiveShim.getStatsSetupConstRawDataSize)
       // TODO: check if this estimate is valid for tables after partition pruning.
       // NOTE: getting `totalSize` directly from params is kind of hacky, but this should be
       // relatively cheap if parameters for the table are populated into the metastore. An
       // alternative would be going through Hadoop's FileSystem API, which can be expensive if a lot
       // of RPCs are involved. Besides `totalSize`, there are also `numFiles`, `numRows`,
       // `rawDataSize` keys (see StatsSetupConst in Hive) that we can look at in the future.
       BigInt(
-        Option(hiveQlTable.getParameters.get(HiveShim.getStatsSetupConstTotalSize))
-          .map(_.toLong)
-          .getOrElse(sqlContext.defaultSizeInBytes))
+        // When table is external, `totalSize` is always zero, which will influence join strategy
+        // so when `totalSize` is zero, use `rawDataSize` instead
+        // if the size is still less than zero, we use default size
+        Option(totalSize).map(_.toLong).filter(_ > 0)
+          .getOrElse(Option(rawDataSize).map(_.toLong).filter(_ > 0)
+          .getOrElse(sqlContext.defaultSizeInBytes)))
     }
   )
 
@@ -136,6 +136,8 @@ private[hive] object HiveShim
 
   def getStatsSetupConstTotalSize = StatsSetupConst.TOTAL_SIZE
 
+  def getStatsSetupConstRawDataSize = StatsSetupConst.RAW_DATA_SIZE
+
   def createDefaultDBIfNeeded(context: HiveContext) = { }
 
   def getCommandProcessor(cmd: Array[String], conf: HiveConf) = {
@@ -154,6 +154,8 @@ private[hive] object HiveShim
 
   def getStatsSetupConstTotalSize = StatsSetupConst.TOTAL_SIZE
 
+  def getStatsSetupConstRawDataSize = StatsSetupConst.RAW_DATA_SIZE
+
   def createDefaultDBIfNeeded(context: HiveContext) = {
     context.runSqlHive("CREATE DATABASE default")
     context.runSqlHive("USE default")
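The fallback chain introduced by this commit can be illustrated in isolation. Below is a minimal sketch, where `estimateSize` is a hypothetical stand-in for the statistics computation in `MetastoreRelation` (the real code reads the values from the Hive table's metastore parameters):

```scala
// Hedged sketch: `estimateSize` is a hypothetical helper mirroring the
// fallback chain from the diff: totalSize, then rawDataSize, then default.
def estimateSize(totalSize: Option[String],
                 rawDataSize: Option[String],
                 defaultSizeInBytes: Long): BigInt =
  BigInt(
    // Prefer a positive `totalSize`; external tables report zero here.
    totalSize.map(_.toLong).filter(_ > 0)
      // Fall back to a positive `rawDataSize`.
      .getOrElse(rawDataSize.map(_.toLong).filter(_ > 0)
      // Finally, fall back to the configured default size.
      .getOrElse(defaultSizeInBytes)))

// External table: `totalSize` is "0", so `rawDataSize` wins.
assert(estimateSize(Some("0"), Some("12345"), 10485760L) == BigInt(12345))
// Neither statistic populated: use the default.
assert(estimateSize(Some("0"), None, 10485760L) == BigInt(10485760L))
// Managed table with a valid `totalSize`.
assert(estimateSize(Some("42"), Some("12345"), 10485760L) == BigInt(42))
```

The `filter(_ > 0)` guard is the essential piece: before this change, the zero `totalSize` of an external table was taken at face value, making every external table look small enough to broadcast.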
