
[SPARK-2726] and [SPARK-2727] Remove SortOrder and do in-place sort. #1631

Closed
rxin wants to merge 6 commits into apache:master from rxin:sortOrder

Conversation

rxin
Contributor

@rxin rxin commented Jul 29, 2014

The pull request includes two changes:

  1. Removes SortOrder introduced by SPARK-2125. The key ordering already includes the SortOrder information, since an Ordering can be reversed. This is similar to Java's Comparator interface; rarely does an API accept both a Comparator and a SortOrder.
  2. Replaces the sortWith call in HashShuffleReader with an in-place quick sort (see the sketch below).
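
For illustration only, here is a minimal standalone sketch of the two ideas (hypothetical code, not the actual HashShuffleReader change): a reversed `Ordering` already encodes the direction that `SortOrder` carried, and `scala.util.Sorting.quickSort` sorts an array in place rather than allocating a new collection the way `sortWith` does.

```
import scala.util.Sorting

object InPlaceSortSketch extends App {
  // One Ordering value carries both the key comparison and the direction,
  // just as a java.util.Comparator would; no separate SortOrder flag needed.
  val byKeyDesc: Ordering[(String, Int)] =
    Ordering.by[(String, Int), String](_._1).reverse

  val pairs = Array(("b", 2), ("a", 1), ("c", 3))
  // In-place quick sort: mutates `pairs` instead of building a new array.
  Sorting.quickSort(pairs)(byKeyDesc)
  println(pairs.mkString(", "))  // (c,3), (b,2), (a,1)
}
```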

chenghao-intel and others added 6 commits July 28, 2014 10:59
In HiveTableScan.scala, a single ObjectInspector was created for all of the partition-based records, which can cause a ClassCastException if the object inspectors are not identical between the table and its partitions.

This is a follow-up to:
apache#1408
apache#1390

I ran a micro-benchmark locally with 15,000,000 records in total, and got the results below:

Patch applied  |  Partition-based table  |  Non-partition-based table
------------ | ------------- | -------------
No  |  1927 ms  |  1885 ms
Yes  | 1541 ms  |  1524 ms

The results show that this patch also improves performance.

PS: the benchmark code is attached below. (Thanks @liancheng.)
```
package org.apache.spark.sql.hive

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql._

object HiveTableScanPrepare extends App {
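  // Prepare step: creates a partitioned and a non-partitioned RCFile table holding the same records.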
  case class Record(key: String, value: String)

  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  val rdd = sparkContext.parallelize((1 to 3000000).map(i => Record(s"$i", s"val_$i")))

  import hiveContext._

  hql("SHOW TABLES")
  hql("DROP TABLE if exists part_scan_test")
  hql("DROP TABLE if exists scan_test")
  hql("DROP TABLE if exists records")
  rdd.registerAsTable("records")

  hql("""CREATE TABLE part_scan_test (key STRING, value STRING) PARTITIONED BY (part1 string, part2 STRING)
                 | ROW FORMAT SERDE
                 | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                 | STORED AS RCFILE
               """.stripMargin)
  hql("""CREATE TABLE scan_test (key STRING, value STRING)
                 | ROW FORMAT SERDE
                 | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                 | STORED AS RCFILE
               """.stripMargin)

  for (part1 <- 2000 until 2001) {
    for (part2 <- 1 to 5) {
      hql(s"""from records
                 | insert into table part_scan_test PARTITION (part1='$part1', part2='2010-01-$part2')
                 | select key, value
               """.stripMargin)
      hql(s"""from records
                 | insert into table scan_test select key, value
               """.stripMargin)
    }
  }
}

object HiveTableScanTest extends App {
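  // Test step: scans each table six times and reports per-run durations.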
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  hql("SHOW TABLES")
  val part_scan_test = hql("select key, value from part_scan_test")
  val scan_test = hql("select key, value from scan_test")

  val r_part_scan_test = (0 to 5).map(i => benchmark(part_scan_test))
  val r_scan_test = (0 to 5).map(i => benchmark(scan_test))
  println("Scanning Partition-Based Table")
  r_part_scan_test.foreach(printResult)
  println("Scanning Non-Partition-Based Table")
  r_scan_test.foreach(printResult)

  def printResult(result: (Long, Long)) {
    println(s"Duration: ${result._1} ms Result: ${result._2}")
  }

  def benchmark(srdd: SchemaRDD) = {
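    // Times a full scan (count) of the given SchemaRDD; returns (duration in ms, row count).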
    val begin = System.currentTimeMillis()
    val result = srdd.count()
    val end = System.currentTimeMillis()
    ((end - begin), result)
  }
}
```

Author: Cheng Hao <[email protected]>

Closes apache#1439 from chenghao-intel/hadoop_table_scan and squashes the following commits:

888968f [Cheng Hao] Fix issues in code style
27540ba [Cheng Hao] Fix the TableScan Bug while partition serde differs
40a24a7 [Cheng Hao] Add Unit Test
…rror in UnitTests

Floating point math is not exact, and most floating-point numbers end up being slightly imprecise due to rounding errors.

Simple values like 0.1 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations or the precision of intermediates can change the result.

As long as this imprecision stays small, it can usually be ignored; but it does mean that comparing two floats for exact equality is usually not what we want.
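
For instance (a two-line Scala illustration of the point above):

    println(0.1 + 0.2)        // prints 0.30000000000000004, not 0.3
    println(0.1 + 0.2 == 0.3) // false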

Based on discussion in the community, we have implemented two different APIs: one for relative tolerance and one for absolute tolerance, so test writers can choose whichever fits their circumstances.

Developers also need to specify the eps explicitly; there is no default value, since an implicit default can cause confusion.

When comparing against zero using relative tolerance, an exception is raised to warn users that the comparison is meaningless.

For relative tolerance, users can now write

    assert(23.1 ~== 23.52 relTol 0.02)
    assert(23.1 ~== 22.74 relTol 0.02)
    assert(23.1 ~= 23.52 relTol 0.02)
    assert(23.1 ~= 22.74 relTol 0.02)
    assert(!(23.1 !~= 23.52 relTol 0.02))
    assert(!(23.1 !~= 22.74 relTol 0.02))

    // This will throw exception with the following message.
    // "Did not expect 23.1 and 23.52 to be within 0.02 using relative tolerance."
    assert(23.1 !~== 23.52 relTol 0.02)

    // "Expected 23.1 and 22.34 to be within 0.02 using relative tolerance."
    assert(23.1 ~== 22.34 relTol 0.02)

For absolute error,

    assert(17.8 ~== 17.99 absTol 0.2)
    assert(17.8 ~== 17.61 absTol 0.2)
    assert(17.8 ~= 17.99 absTol 0.2)
    assert(17.8 ~= 17.61 absTol 0.2)
    assert(!(17.8 !~= 17.99 absTol 0.2))
    assert(!(17.8 !~= 17.61 absTol 0.2))

    // This will throw exception with the following message.
    // "Did not expect 17.8 and 17.99 to be within 0.2 using absolute error."
    assert(17.8 !~== 17.99 absTol 0.2)

    // "Expected 17.8 and 17.59 to be within 0.2 using absolute error."
    assert(17.8 ~== 17.59 absTol 0.2)
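
As a rough sketch of how such an API can be wired up with Scala implicit classes (hypothetical code for illustration; the actual implementation in Spark's test utilities differs in detail and also provides the `~=`/`!~=` variants):

```
object ApproxEquality {
  // Intermediate result of `x ~== y` / `x !~== y`: records both operands
  // and whether the assertion expects them to be close.
  case class Comparison(x: Double, y: Double, expectClose: Boolean) {
    def relTol(eps: Double): Boolean = {
      // Relative tolerance against zero is meaningless, so fail loudly.
      require(x != 0.0 && y != 0.0, "Comparing against 0 using relative tolerance.")
      val close = math.abs(x - y) / math.max(math.abs(x), math.abs(y)) < eps
      close == expectClose
    }
    def absTol(eps: Double): Boolean = {
      val close = math.abs(x - y) < eps
      close == expectClose
    }
  }

  // `~==` binds tighter than the alphanumeric relTol/absTol, so
  // `a ~== b relTol eps` parses as `(a ~== b).relTol(eps)`.
  implicit class DoubleApproxOps(val x: Double) extends AnyVal {
    def ~==(y: Double): Comparison = Comparison(x, y, expectClose = true)
    def !~==(y: Double): Comparison = Comparison(x, y, expectClose = false)
  }
}

// Usage, assuming the sketch above:
//   import ApproxEquality._
//   assert(23.1 ~== 23.52 relTol 0.02)  // passes: relative diff ~0.018 < 0.02
//   assert(17.8 !~== 17.99 absTol 0.2)  // fails: 17.8 and 17.99 are within 0.2
```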

Authors:
  DB Tsai <dbtsai@alpinenow.com>
  Marek Kolodziej <marek@alpinenow.com>

Author: DB Tsai <[email protected]>

Closes apache#1425 from dbtsai/SPARK-2479_comparing_floating_point and squashes the following commits:

8c7cbcc [DB Tsai] Alpine Data Labs
… fix)

JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Another try for apache#1399 & apache#1600. Those two PRs broke Jenkins builds because we made a separate `hive-thriftserver` profile in the sub-project `assembly`, but the `hive-thriftserver` module was defined outside the `hive-thriftserver` profile. Thus every pull request, even one that didn't touch SQL code, also executed the test suites defined in `hive-thriftserver`, and those tests failed because the related .class files were not included in the assembly jar.

In the most recent commit, module `hive-thriftserver` is moved into its own profile to fix this problem. All previous commits are squashed for clarity.
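
Structurally, the fix amounts to something like the following POM fragment (a hand-written sketch, not the exact Spark POM, and the module path is an assumption):

```
<!-- Declaring the module inside its own profile means it is only built
     and tested when -Phive-thriftserver is activated. -->
<profiles>
  <profile>
    <id>hive-thriftserver</id>
    <modules>
      <module>sql/hive-thriftserver</module>
    </modules>
  </profile>
</profiles>
```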

Author: Cheng Lian <[email protected]>

Closes apache#1620 from liancheng/jdbc-with-maven-fix and squashes the following commits:

629988e [Cheng Lian] Moved hive-thriftserver module definition into its own profile
ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server
Spark only transitively depends on commons-lang (via Hadoop, depending on the Hadoop version), so SignalLogger should use commons-lang3 instead.

Author: Aaron Davidson <[email protected]>

Closes apache#1621 from aarondav/lang3 and squashes the following commits:

93c93bf [Aaron Davidson] Use commons-lang3 in SignalLogger rather than commons-lang
Author: Yadong Qi <[email protected]>

Closes apache#1629 from watermen/bug-fix2 and squashes the following commits:

59b7237 [Yadong Qi] Update HiveQl.scala
@rxin
Contributor Author

rxin commented Jul 29, 2014

The diff is screwed up because I'm basing the pull request off the real ASF master.

Diff here: rxin@c9d37e1

@rxin
Contributor Author

rxin commented Jul 29, 2014

@jerryshao @mateiz

@SparkQA

SparkQA commented Jul 29, 2014

QA tests have started for PR 1631. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17335/consoleFull

@jerryshao
Contributor

Cool, much cleaner than the previous code, looks good to me :)

@SparkQA

SparkQA commented Jul 29, 2014

QA results for PR 1631:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class SparkSQLOperationManager(hiveContext: HiveContext) extends OperationManager with Logging {
class HadoopTableReader(

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17335/consoleFull

@rxin
Contributor Author

rxin commented Jul 29, 2014

Jenkins, retest this please.

@SparkQA

SparkQA commented Jul 29, 2014

QA tests have started for PR 1631. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17342/consoleFull

@mateiz
Contributor

mateiz commented Jul 29, 2014

Looks good to me too, though it might be better to use Java's Arrays.sort instead of Scala's quickSort since Java has fancier algorithms in new versions.
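
For reference, both calls look like this on a plain array of key-value pairs (a standalone sketch with assumed types, not the shuffle code itself); `java.util.Arrays.sort` uses TimSort for object arrays, while `scala.util.Sorting.quickSort` is a classic quicksort:

```
val pairs = Array((3, "c"), (1, "a"), (2, "b"))
val byKey = Ordering.by[(Int, String), Int](_._1)

// Scala's in-place quick sort, as used in this patch:
scala.util.Sorting.quickSort(pairs)(byKey)

// Java's Arrays.sort (TimSort for object arrays); a scala.math.Ordering
// is already a java.util.Comparator, so it can be passed directly:
java.util.Arrays.sort(pairs, byKey)
```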

@rxin
Contributor Author

rxin commented Jul 29, 2014

I tried that - had some issues with types between Scala and Java and resorted to the current implementation. In any case because this code will likely be replaced soon by the sorter, I'd leave it as is.
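
One plausible instance of that friction (an assumption on my part; the PR doesn't say which issue was hit): `java.util.Arrays.sort` only accepts a Comparator for object arrays, so sorting a primitive `Array[Int]` in a custom order through the Java API forces boxing first:

```
// Hypothetical illustration: primitive int[] has no comparator overload
// in java.util.Arrays, so a custom-order Java sort requires boxing.
val xs: Array[Int] = Array(3, 1, 2)
val boxed: Array[Integer] = xs.map(Int.box)
java.util.Arrays.sort(boxed, Ordering[Int].on[Integer](_.intValue).reverse)
println(boxed.mkString(", "))  // 3, 2, 1
```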

@SparkQA

SparkQA commented Jul 29, 2014

QA results for PR 1631:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class SparkSQLOperationManager(hiveContext: HiveContext) extends OperationManager with Logging {
class HadoopTableReader(

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17342/consoleFull

@rxin
Contributor Author

rxin commented Jul 29, 2014

Ok I'm merging this. Thanks for reviewing.

@rxin rxin closed this Jul 29, 2014
asfgit pushed a commit that referenced this pull request Jul 29, 2014
The pull request includes two changes:

1. Removes SortOrder introduced by SPARK-2125. The key ordering already includes the SortOrder information, since an Ordering can be reversed. This is similar to Java's Comparator interface; rarely does an API accept both a Comparator and a SortOrder.

2. Replaces the sortWith call in HashShuffleReader with an in-place quick sort.

Author: Reynold Xin <[email protected]>

Closes #1631 from rxin/sortOrder and squashes the following commits:

c9d37e1 [Reynold Xin] [SPARK-2726] and [SPARK-2727] Remove SortOrder and do in-place sort.
@rxin rxin deleted the sortOrder branch August 13, 2014 08:01
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014