[SPARK-3739] [SQL] Update the split num based on block size for table scanning

In local mode, Hadoop/Hive ignores "mapred.map.tasks", so a small table file always becomes a single input split. Spark SQL, however, did not honor that in table scanning, so we got different results when running the Hive compatibility tests. This PR fixes that.
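
For illustration, here is a minimal, self-contained sketch of the split-count rule the message describes (the function name and parameters are hypothetical, chosen to mirror the diff below): in local mode the hint becomes 0, which lets Hadoop's InputFormat split purely on block size, matching Hive; otherwise the larger of "mapred.map.tasks" and Spark's default minimum is used.

// Hypothetical sketch of the split-count decision (names mirror the diff below).
def minSplits(isLocal: Boolean, mapredMapTasks: Int, defaultMinPartitions: Int): Int =
  if (isLocal) {
    // Hadoop ignores mapred.map.tasks when mapred.job.tracker is "local",
    // so pass 0 and let the InputFormat split on block size, as Hive does.
    0
  } else {
    math.max(mapredMapTasks, defaultMinPartitions)
  }

// Example: minSplits(isLocal = true,  mapredMapTasks = 8, defaultMinPartitions = 2) == 0
//          minSplits(isLocal = false, mapredMapTasks = 8, defaultMinPartitions = 2) == 8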

Author: Cheng Hao <[email protected]>

Closes #2589 from chenghao-intel/source_split and squashes the following commits:

dff38e7 [Cheng Hao] Remove the extra blank line
160a2b6 [Cheng Hao] fix the compiling bug
04d67f7 [Cheng Hao] Keep 1 split for small file in table scanning
chenghao-intel authored and marmbrus committed Dec 17, 2014
1 parent 902e4d5 commit 636d9fc
Showing 3 changed files with 517 additions and 5 deletions.
@@ -57,10 +57,15 @@ class HadoopTableReader(
     @transient hiveExtraConf: HiveConf)
   extends TableReader {
 
-  // Choose the minimum number of splits. If mapred.map.tasks is set, then use that unless
-  // it is smaller than what Spark suggests.
-  private val _minSplitsPerRDD = math.max(
-    sc.hiveconf.getInt("mapred.map.tasks", 1), sc.sparkContext.defaultMinPartitions)
+  // Hadoop honors "mapred.map.tasks" as a hint, but ignores it when mapred.job.tracker
+  // is "local". See https://hadoop.apache.org/docs/r1.0.4/mapred-default.html
+  //
+  // To keep consistency with Hive, we let it be 0 in local mode as well.
+  private val _minSplitsPerRDD = if (sc.sparkContext.isLocal) {
+    0 // will be split based on block size by default
+  } else {
+    math.max(sc.hiveconf.getInt("mapred.map.tasks", 1), sc.sparkContext.defaultMinPartitions)
+  }
 
   // TODO: set aws s3 credentials.
 
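For context, a hedged sketch of how a hint like _minSplitsPerRDD is typically consumed (the surrounding HadoopTableReader plumbing is assumed; rawTableRdd, jobConf, and inputFormatClass are placeholder names): it is passed as the minPartitions argument of SparkContext.hadoopRDD, which Hadoop's FileInputFormat.getSplits receives as numSplits, and a value of 0 there falls back to block-size-based splitting.

import org.apache.hadoop.io.Writable
import org.apache.hadoop.mapred.{InputFormat, JobConf}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: forward the min-splits hint to the Hadoop RDD.
// With minSplits == 0, FileInputFormat.getSplits falls back to roughly
// one split per block, which is the behavior local mode wants here.
def rawTableRdd(
    sc: SparkContext,
    jobConf: JobConf,
    inputFormatClass: Class[InputFormat[Writable, Writable]],
    minSplits: Int): RDD[(Writable, Writable)] =
  sc.hadoopRDD(jobConf, inputFormatClass, classOf[Writable], classOf[Writable], minSplits)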
