
[SPARK-2726] and [SPARK-2727] Remove SortOrder and do in-place sort. #1631

Closed
rxin wants to merge 6 commits into apache:master from rxin:sortOrder

Conversation

rxin
Contributor

@rxin rxin commented Jul 29, 2014

The pull request includes two changes:

  1. Removes SortOrder introduced by SPARK-2125. The key ordering already includes the SortOrder information, since an Ordering can be reversed. This is similar to Java's Comparator interface; rarely does an API accept both a Comparator and a SortOrder.
  2. Replaces the sortWith call in HashShuffleReader with an in-place quick sort (see the sketch below).
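
For illustration only, here is a minimal standalone sketch of the two ideas (hypothetical code, not the actual HashShuffleReader change): a reversed `Ordering` already encodes the direction that `SortOrder` carried, and `scala.util.Sorting.quickSort` sorts an array in place rather than allocating a new collection the way `sortWith` does.

```
import scala.util.Sorting

object InPlaceSortSketch extends App {
  // One Ordering value carries both the key comparison and the direction,
  // just as a java.util.Comparator would; no separate SortOrder flag needed.
  val byKeyDesc: Ordering[(String, Int)] =
    Ordering.by[(String, Int), String](_._1).reverse

  val pairs = Array(("b", 2), ("a", 1), ("c", 3))
  // In-place quick sort: mutates `pairs` instead of building a new array.
  Sorting.quickSort(pairs)(byKeyDesc)
  println(pairs.mkString(", "))  // (c,3), (b,2), (a,1)
}
```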

chenghao-intel and others added 6 commits July 28, 2014 10:59
In HiveTableScan.scala, a single ObjectInspector was created for all of the partition-based records, which can cause a ClassCastException if the object inspectors are not identical between the table and its partitions.

This is a follow-up to:
apache#1408
apache#1390

I ran a micro-benchmark locally with 15,000,000 records in total, and got the results below:

Patch applied  |  Partition-based table  |  Non-partition-based table
------------ | ------------- | -------------
No  |  1927 ms  |  1885 ms
Yes  | 1541 ms  |  1524 ms

The results show that this patch also improves performance.

PS: the benchmark code is attached below. (Thanks @liancheng.)
```
package org.apache.spark.sql.hive

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql._

object HiveTableScanPrepare extends App {
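  // Prepare step: creates a partitioned and a non-partitioned RCFile table holding the same records.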
  case class Record(key: String, value: String)

  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  val rdd = sparkContext.parallelize((1 to 3000000).map(i => Record(s"$i", s"val_$i")))

  import hiveContext._

  hql("SHOW TABLES")
  hql("DROP TABLE if exists part_scan_test")
  hql("DROP TABLE if exists scan_test")
  hql("DROP TABLE if exists records")
  rdd.registerAsTable("records")

  hql("""CREATE TABLE part_scan_test (key STRING, value STRING) PARTITIONED BY (part1 string, part2 STRING)
                 | ROW FORMAT SERDE
                 | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                 | STORED AS RCFILE
               """.stripMargin)
  hql("""CREATE TABLE scan_test (key STRING, value STRING)
                 | ROW FORMAT SERDE
                 | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                 | STORED AS RCFILE
               """.stripMargin)

  for (part1 <- 2000 until 2001) {
    for (part2 <- 1 to 5) {
      hql(s"""from records
                 | insert into table part_scan_test PARTITION (part1='$part1', part2='2010-01-$part2')
                 | select key, value
               """.stripMargin)
      hql(s"""from records
                 | insert into table scan_test select key, value
               """.stripMargin)
    }
  }
}

object HiveTableScanTest extends App {
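  // Test step: scans each table six times and reports per-run durations.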
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  hql("SHOW TABLES")
  val part_scan_test = hql("select key, value from part_scan_test")
  val scan_test = hql("select key, value from scan_test")

  val r_part_scan_test = (0 to 5).map(i => benchmark(part_scan_test))
  val r_scan_test = (0 to 5).map(i => benchmark(scan_test))
  println("Scanning Partition-Based Table")
  r_part_scan_test.foreach(printResult)
  println("Scanning Non-Partition-Based Table")
  r_scan_test.foreach(printResult)

  def printResult(result: (Long, Long)) {
    println(s"Duration: ${result._1} ms Result: ${result._2}")
  }

  def benchmark(srdd: SchemaRDD) = {
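    // Times a full scan (count) of the given SchemaRDD; returns (duration in ms, row count).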
    val begin = System.currentTimeMillis()
    val result = srdd.count()
    val end = System.currentTimeMillis()
    ((end - begin), result)
  }
}
```

Author: Cheng Hao <[email protected]>

Closes apache#1439 from chenghao-intel/hadoop_table_scan and squashes the following commits:

888968f [Cheng Hao] Fix issues in code style
27540ba [Cheng Hao] Fix the TableScan Bug while partition serde differs
40a24a7 [Cheng Hao] Add Unit Test
…rror in UnitTests

Floating point math is not exact, and most floating-point numbers end up being slightly imprecise due to rounding errors.

Simple values like 0.1 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations or the precision of intermediates can change the result.

As long as this imprecision stays small, it can usually be ignored; but it does mean that comparing two floats for exact equality is usually not what we want.
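
For instance (a two-line Scala illustration of the point above):

    println(0.1 + 0.2)        // prints 0.30000000000000004, not 0.3
    println(0.1 + 0.2 == 0.3) // false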

Based on discussion in the community, we have implemented two different APIs: one for relative tolerance and one for absolute tolerance, so test writers can choose whichever fits their circumstances.

Developers also need to specify the eps explicitly; there is no default value, since an implicit default can cause confusion.

When comparing against zero using relative tolerance, an exception is raised to warn users that the comparison is meaningless.

For relative tolerance, users can now write

    assert(23.1 ~== 23.52 relTol 0.02)
    assert(23.1 ~== 22.74 relTol 0.02)
    assert(23.1 ~= 23.52 relTol 0.02)
    assert(23.1 ~= 22.74 relTol 0.02)
    assert(!(23.1 !~= 23.52 relTol 0.02))
    assert(!(23.1 !~= 22.74 relTol 0.02))

    // This will throw exception with the following message.
    // "Did not expect 23.1 and 23.52 to be within 0.02 using relative tolerance."
    assert(23.1 !~== 23.52 relTol 0.02)

    // "Expected 23.1 and 22.34 to be within 0.02 using relative tolerance."
    assert(23.1 ~== 22.34 relTol 0.02)

For absolute error,

    assert(17.8 ~== 17.99 absTol 0.2)
    assert(17.8 ~== 17.61 absTol 0.2)
    assert(17.8 ~= 17.99 absTol 0.2)
    assert(17.8 ~= 17.61 absTol 0.2)
    assert(!(17.8 !~= 17.99 absTol 0.2))
    assert(!(17.8 !~= 17.61 absTol 0.2))

    // This will throw exception with the following message.
    // "Did not expect 17.8 and 17.99 to be within 0.2 using absolute error."
    assert(17.8 !~== 17.99 absTol 0.2)

    // "Expected 17.8 and 17.59 to be within 0.2 using absolute error."
    assert(17.8 ~== 17.59 absTol 0.2)
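
As a rough sketch of how such an API can be wired up with Scala implicit classes (hypothetical code for illustration; the actual implementation in Spark's test utilities differs in detail and also provides the `~=`/`!~=` variants):

```
object ApproxEquality {
  // Intermediate result of `x ~== y` / `x !~== y`: records both operands
  // and whether the assertion expects them to be close.
  case class Comparison(x: Double, y: Double, expectClose: Boolean) {
    def relTol(eps: Double): Boolean = {
      // Relative tolerance against zero is meaningless, so fail loudly.
      require(x != 0.0 && y != 0.0, "Comparing against 0 using relative tolerance.")
      val close = math.abs(x - y) / math.max(math.abs(x), math.abs(y)) < eps
      close == expectClose
    }
    def absTol(eps: Double): Boolean = {
      val close = math.abs(x - y) < eps
      close == expectClose
    }
  }

  // `~==` binds tighter than the alphanumeric relTol/absTol, so
  // `a ~== b relTol eps` parses as `(a ~== b).relTol(eps)`.
  implicit class DoubleApproxOps(val x: Double) extends AnyVal {
    def ~==(y: Double): Comparison = Comparison(x, y, expectClose = true)
    def !~==(y: Double): Comparison = Comparison(x, y, expectClose = false)
  }
}

// Usage, assuming the sketch above:
//   import ApproxEquality._
//   assert(23.1 ~== 23.52 relTol 0.02)  // passes: relative diff ~0.018 < 0.02
//   assert(17.8 !~== 17.99 absTol 0.2)  // fails: 17.8 and 17.99 are within 0.2
```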

Authors:
  DB Tsai <dbtsai@alpinenow.com>
  Marek Kolodziej <marek@alpinenow.com>

Author: DB Tsai <[email protected]>

Closes apache#1425 from dbtsai/SPARK-2479_comparing_floating_point and squashes the following commits:

8c7cbcc [DB Tsai] Alpine Data Labs
… fix)

JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Another try for apache#1399 & apache#1600. Those two PRs broke Jenkins builds because we made a separate `hive-thriftserver` profile in the sub-project `assembly`, but the `hive-thriftserver` module was defined outside the `hive-thriftserver` profile. Thus every pull request, even one that didn't touch SQL code, also executed the test suites defined in `hive-thriftserver`, and those tests failed because the related .class files were not included in the assembly jar.

In the most recent commit, module `hive-thriftserver` is moved into its own profile to fix this problem. All previous commits are squashed for clarity.
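
Structurally, the fix amounts to something like the following POM fragment (a hand-written sketch, not the exact Spark POM, and the module path is an assumption):

```
<!-- Declaring the module inside its own profile means it is only built
     and tested when -Phive-thriftserver is activated. -->
<profiles>
  <profile>
    <id>hive-thriftserver</id>
    <modules>
      <module>sql/hive-thriftserver</module>
    </modules>
  </profile>
</profiles>
```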

Author: Cheng Lian <[email protected]>

Closes apache#1620 from liancheng/jdbc-with-maven-fix and squashes the following commits:

629988e [Cheng Lian] Moved hive-thriftserver module definition into its own profile
ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server
Spark only transitively depends on commons-lang (via Hadoop, depending on the Hadoop version), so SignalLogger should use commons-lang3 instead.

Author: Aaron Davidson <[email protected]>

Closes apache#1621 from aarondav/lang3 and squashes the following commits:

93c93bf [Aaron Davidson] Use commons-lang3 in SignalLogger rather than commons-lang
Author: Yadong Qi <[email protected]>

Closes apache#1629 from watermen/bug-fix2 and squashes the following commits:

59b7237 [Yadong Qi] Update HiveQl.scala
@rxin
Contributor Author

rxin commented Jul 29, 2014

The diff is screwed up because I'm basing the pull request off the real ASF master.

Diff here: rxin@c9d37e1

@rxin
Contributor Author

rxin commented Jul 29, 2014

@jerryshao @mateiz

@SparkQA

SparkQA commented Jul 29, 2014

QA tests have started for PR 1631. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17335/consoleFull

@jerryshao
Contributor

Cool, much cleaner than the previous code, looks good to me :)

@SparkQA

SparkQA commented Jul 29, 2014

QA results for PR 1631:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class SparkSQLOperationManager(hiveContext: HiveContext) extends OperationManager with Logging {
class HadoopTableReader(

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17335/consoleFull

@rxin
Contributor Author

rxin commented Jul 29, 2014

Jenkins, retest this please.

@SparkQA

SparkQA commented Jul 29, 2014

QA tests have started for PR 1631. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17342/consoleFull

@mateiz
Contributor

mateiz commented Jul 29, 2014

Looks good to me too, though it might be better to use Java's Arrays.sort instead of Scala's quickSort since Java has fancier algorithms in new versions.
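
For reference, both calls look like this on a plain array of key-value pairs (a standalone sketch with assumed types, not the shuffle code itself); `java.util.Arrays.sort` uses TimSort for object arrays, while `scala.util.Sorting.quickSort` is a classic quicksort:

```
val pairs = Array((3, "c"), (1, "a"), (2, "b"))
val byKey = Ordering.by[(Int, String), Int](_._1)

// Scala's in-place quick sort, as used in this patch:
scala.util.Sorting.quickSort(pairs)(byKey)

// Java's Arrays.sort (TimSort for object arrays); a scala.math.Ordering
// is already a java.util.Comparator, so it can be passed directly:
java.util.Arrays.sort(pairs, byKey)
```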

@rxin
Contributor Author

rxin commented Jul 29, 2014

I tried that - had some issues with types between Scala and Java and resorted to the current implementation. In any case because this code will likely be replaced soon by the sorter, I'd leave it as is.
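
One plausible instance of that friction (an assumption on my part; the PR doesn't say which issue was hit): `java.util.Arrays.sort` only accepts a Comparator for object arrays, so sorting a primitive `Array[Int]` in a custom order through the Java API forces boxing first:

```
// Hypothetical illustration: primitive int[] has no comparator overload
// in java.util.Arrays, so a custom-order Java sort requires boxing.
val xs: Array[Int] = Array(3, 1, 2)
val boxed: Array[Integer] = xs.map(Int.box)
java.util.Arrays.sort(boxed, Ordering[Int].on[Integer](_.intValue).reverse)
println(boxed.mkString(", "))  // 3, 2, 1
```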

@SparkQA

SparkQA commented Jul 29, 2014

QA results for PR 1631:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class SparkSQLOperationManager(hiveContext: HiveContext) extends OperationManager with Logging {
class HadoopTableReader(

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17342/consoleFull

@rxin
Contributor Author

rxin commented Jul 29, 2014

Ok I'm merging this. Thanks for reviewing.

@rxin rxin closed this Jul 29, 2014
asfgit pushed a commit that referenced this pull request Jul 29, 2014
The pull request includes two changes:

1. Removes SortOrder introduced by SPARK-2125. The key ordering already includes the SortOrder information, since an Ordering can be reversed. This is similar to Java's Comparator interface; rarely does an API accept both a Comparator and a SortOrder.

2. Replaces the sortWith call in HashShuffleReader with an in-place quick sort.

Author: Reynold Xin <[email protected]>

Closes #1631 from rxin/sortOrder and squashes the following commits:

c9d37e1 [Reynold Xin] [SPARK-2726] and [SPARK-2727] Remove SortOrder and do in-place sort.
@rxin rxin deleted the sortOrder branch August 13, 2014 08:01
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014