
[SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column #25214

Closed · wants to merge 16 commits

Conversation

@wangyum (Member) commented Jul 21, 2019

What changes were proposed in this pull request?

HIVE-12063 introduced padding of decimal numbers with trailing zeros to the scale of the column. The following description is copied from HIVE-12063:

HIVE-7373 was to address the problems of Hive trimming trailing zeros, which caused many issues, including treating 0.0, 0.00 and so on as 0, which has a different precision/scale. Please refer to the HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on cannot be read into decimal(1,1).
However, HIVE-11835 didn't address the problem of showing 0 in query results for decimal values such as 0.0, 0.00, etc. This causes confusion, as 0.0 and 0.00 have a different precision/scale than 0.
The proposal here is to pad zeros for query result to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. Internal decimal number representation doesn't change, however.
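As a rough sketch of the padding idea only (the helper name and arguments are made up for this example; this is not the actual Spark code path), a decimal value can be rendered to the column's declared scale like this:

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Render a decimal using the scale declared by the column type.
// Going from a smaller to a larger scale only appends zeros, so
// RoundingMode.UNNECESSARY never actually rounds in these calls.
def padToScale(value: JBigDecimal, columnScale: Int): String =
  value.setScale(columnScale, RoundingMode.UNNECESSARY).toPlainString

padToScale(new JBigDecimal("1"), 18)   // "1.000000000000000000"
padToScale(new JBigDecimal("0.0"), 1)  // "0.0"
```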

Spark SQL:

// bin/spark-sql
spark-sql> select cast(1 as decimal(38, 18));
1
spark-sql>

// bin/beeline
0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18));
+----------------------------+--+
| CAST(1 AS DECIMAL(38,18))  |
+----------------------------+--+
| 1.000000000000000000       |
+----------------------------+--+

// bin/spark-shell
scala> spark.sql("select cast(1 as decimal(38, 18))").show(false)
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|1.000000000000000000     |
+-------------------------+

// bin/pyspark
>>> spark.sql("select cast(1 as decimal(38, 18))").show()
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|     1.000000000000000000|
+-------------------------+

// bin/sparkR
> showDF(sql("SELECT cast(1 as decimal(38, 18))"))
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|     1.000000000000000000|
+-------------------------+

PostgreSQL:

postgres=# select cast(1 as decimal(38, 18));
       numeric
----------------------
 1.000000000000000000
(1 row)

Presto:

presto> select cast(1 as decimal(38, 18));
        _col0
----------------------
 1.000000000000000000
(1 row)

How was this patch tested?

unit tests and manual test:

spark-sql> select cast(1 as decimal(38, 18));
1.000000000000000000

Spark SQL Upgrading Guide:
(screenshot of the updated Spark SQL Upgrading Guide entry)

@@ -23,6 +23,8 @@ license: |
{:toc}

## Upgrading From Spark SQL 2.4 to 3.0
- Since Spark 3.0, we pad decimal numbers with trailing zeros to the scale of the column for Hive result.
Member commented:
btw, should we put this description about Hive-specific changes here? Would it be better in docs/sql-migration-guide-hive-compatibility.md?

@gengliangwang (Member) commented Aug 5, 2019:

It would be great to show an example here.

Member commented:

Since this is fairly minor, I don't think we should put it at the top of the migration guide.

@SparkQA commented Jul 21, 2019:

Test build #107949 has finished for PR 25214 at commit 6672e18.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 21, 2019:

Test build #107956 has finished for PR 25214 at commit 832d009.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member, Author) commented Jul 21, 2019:

retest this please

@SparkQA commented Jul 21, 2019:

Test build #107958 has finished for PR 25214 at commit 832d009.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member, Author) commented Jul 30, 2019:

Only bin/spark-sql does not pad with zeros; bin/beeline, bin/spark-shell, bin/pyspark, and bin/sparkR all pad with zeros.

wangyum added 2 commits August 3, 2019 14:33
# Conflicts:
#	sql/core/src/test/resources/sql-tests/results/pgSQL/int4.sql.out
#	sql/core/src/test/resources/sql-tests/results/pgSQL/numeric.sql.out
@SparkQA commented Aug 4, 2019:

Test build #108600 has finished for PR 25214 at commit a617812.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member) commented:
Does this only affect the output result of CLI applications?

@wangyum (Member, Author) commented Aug 5, 2019:

Yes. It only affects the output of CLI applications.

@SparkQA commented Aug 6, 2019:

Test build #108713 has finished for PR 25214 at commit 141c3d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -159,6 +159,32 @@ license: |

- Since Spark 3.0, a Dataset query fails if it contains an ambiguous column reference caused by a self join. A typical example: `val df1 = ...; val df2 = df1.filter(...);`, then `df1.join(df2, df1("a") > df2("a"))` returns an empty result, which is quite confusing. This is because Spark cannot resolve Dataset column references that point to tables being self-joined, and `df1("a")` is exactly the same as `df2("a")` in Spark. To restore the behavior before Spark 3.0, you can set `spark.sql.analyzer.failAmbiguousSelfJoin` to `false`.

- Since Spark 3.0, we pad decimal numbers with trailing zeros to the scale of the column for Hive result, for example:
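As an aside on the first bullet in the excerpt above (the ambiguous self-join), a small hypothetical sketch of the pattern and of the legacy flag; the DataFrames and the column name are illustrative only:

```scala
// Hypothetical reproduction of the ambiguous self-join described above.
val df1 = spark.range(10).withColumnRenamed("id", "a")
val df2 = df1.filter(df1("a") > 3)

// Under Spark 3.0 defaults this condition is ambiguous, because df1("a") and
// df2("a") resolve to the same underlying attribute of the self-joined plan.
// df1.join(df2, df1("a") > df2("a"))

// Restore the pre-3.0 behavior described in the migration note:
spark.conf.set("spark.sql.analyzer.failAmbiguousSelfJoin", "false")
```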
Member commented:
To me, it's not clear what "Hive result" means.

Member (Author) commented:
Which one do you think is better?
(screenshots of the two candidate wordings attached)

@HyukjinKwon (Member) commented:
Seems right to me. Trailing behaviour was a legacy behaviour from old Hive IIRC

# Conflicts:
#	docs/sql-migration-guide-upgrade.md
#	sql/core/src/test/resources/sql-tests/results/pgSQL/numeric.sql.out
#	sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
@SparkQA commented Aug 25, 2019:

Test build #109699 has finished for PR 25214 at commit b82d6a2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 25, 2019:

Test build #109706 has finished for PR 25214 at commit 243e3c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented:
@dongjoon-hyun and @gatorsmile WDYT?

@tooptoop4 (Contributor) commented:
@wangyum (Member, Author) commented Sep 19, 2019:

@tooptoop4 I think those are different things.

# Conflicts:
#	docs/sql-migration-guide-upgrade.md
#	sql/core/src/test/resources/sql-tests/results/postgreSQL/date.sql.out
#	sql/core/src/test/resources/sql-tests/results/postgreSQL/int4.sql.out
@SparkQA commented Oct 17, 2019:

Test build #112195 has finished for PR 25214 at commit 91bd74a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Oct 23, 2019:

The current change looks fine to me, too. Can anyone check this? @gatorsmile @dongjoon-hyun @viirya

@wangyum (Member, Author) commented Nov 22, 2019:

Thank you @maropu. I think we should continue this work because it is the default behavior of Hive 2.x, and we are going to upgrade the default built-in Hive to 2.3.

@maropu (Member) commented Nov 24, 2019:

yea, can you resolve the conflict? @wangyum

# Conflicts:
#	docs/sql-migration-guide.md
#	sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala
#	sql/core/src/test/resources/sql-tests/results/decimalArithmeticOperations.sql.out
#	sql/core/src/test/resources/sql-tests/results/literals.sql.out
#	sql/core/src/test/resources/sql-tests/results/postgreSQL/numeric.sql.out
@SparkQA commented Nov 24, 2019:

Test build #114341 has finished for PR 25214 at commit e221f0e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented Nov 24, 2019:

But let me defer to others (or maybe I'll merge it in a few days). I do know this will break a lot of external SQL tests, like TPC-DS, and those will need to be fixed accordingly.

@viirya (Member) commented Nov 25, 2019:

Looks fine as it follows other major databases as shown in the description.

# Conflicts:
#	docs/sql-migration-guide.md
@SparkQA commented Nov 25, 2019:

Test build #114387 has finished for PR 25214 at commit 1c30198.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -226,6 +226,32 @@ license: |

- Since Spark 3.0, when casting string values to date, timestamp, and interval values, the leading and trailing white spaces (<= ASCII 32) are trimmed before casting, e.g. `cast('2019-10-10\t' as date)` results in the date value `2019-10-10`. In Spark version 2.4 and earlier, only the trailing space is removed; thus, the result is `null`.

- Since Spark 3.0, we pad decimal numbers with trailing zeros to the scale of the column for Hive result, for example:
Member commented:
@wangyum, can you clarify what "Hive result" means?

Member (Author) commented:
/**
 * Returns the result as a hive compatible sequence of strings. This is used in tests and
 * `SparkSQLDriver` for CLI applications.
 */
def hiveResultString(executedPlan: SparkPlan): Seq[String] = executedPlan match {
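For context, a rough sketch of how this helper relates to the padded output discussed in this PR; the exact import path and call are illustrative and may differ across versions:

```scala
// Illustrative only: render a query result the way the spark-sql CLI does.
import org.apache.spark.sql.execution.HiveResult

val df = spark.sql("select cast(1 as decimal(38, 18))")
val rendered: Seq[String] =
  HiveResult.hiveResultString(df.queryExecution.executedPlan)
// With this change, rendered would contain "1.000000000000000000" rather than "1".
```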

Member commented:
I meant to clarify it in this migration guide ...

Member (Author) commented:
How about:
"Since Spark 3.0, we pad decimal numbers with trailing zeros to the scale of the column for the spark-sql interface."

Member commented:
+1

@SparkQA commented Nov 26, 2019:

Test build #114473 has finished for PR 25214 at commit 564c799.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a comment:
LGTM

@HyukjinKwon (Member) commented:
Merged to master.

@wangyum deleted the SPARK-28461 branch November 27, 2019 09:21
@dongjoon-hyun (Member) commented Nov 27, 2019:

Sorry guys. I'll revert this for now to recover those three Jenkins jobs.

Although the failure is only the difference between 0 and 0.0, and it succeeded with hadoop-2.7, I'm not sure about the fix because the behavior is different across the profiles.

[info] - decimal_1_1 *** FAILED *** (812 milliseconds)
[info]   Results do not match for decimal_1_1:
[info]   == Parsed Logical Plan ==
[info]   'Project [*]
[info]   +- 'UnresolvedRelation [decimal_1_1]
[info]   
[info]   == Analyzed Logical Plan ==
[info]   d: decimal(1,1)
[info]   Project [d#15908]
[info]   +- SubqueryAlias `default`.`decimal_1_1`
[info]      +- HiveTableRelation `default`.`decimal_1_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [d#15908]
[info]   
[info]   == Optimized Logical Plan ==
[info]   HiveTableRelation `default`.`decimal_1_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [d#15908]
[info]   
[info]   == Physical Plan ==
[info]   Scan hive default.decimal_1_1 [d#15908], HiveTableRelation `default`.`decimal_1_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [d#15908]
[info]   
[info]   d
[info]   !== HIVE - 30 row(s) ==   == CATALYST - 30 row(s) ==
[info]    -0.1                     -0.1
[info]    -0.2                     -0.2
[info]    -0.3                     -0.3
[info]    -0.9                     -0.9
[info]    -0.9                     -0.9
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]    0.1                      0.1
[info]    0.2                      0.2
[info]    0.3                      0.3
[info]    0.9                      0.9
[info]    0.9                      0.9
[info]    NULL                     NULL
[info]    NULL                     NULL
[info]    NULL                     NULL
[info]    NULL                     NULL
[info]    NULL                     NULL
[info]    NULL                     NULL
[info]    NULL                     NULL
[info]    NULL                     NULL

@dongjoon-hyun (Member) commented:
cc @gatorsmile and @cloud-fan

@dongjoon-hyun (Member) commented Nov 27, 2019:

It turns out this PR works only with Hive 1.2. With this PR on master, I reproduced the failure as follows:

$ build/sbt -Phive "hive/testOnly *.HiveCompatibilitySuite -- -z decimal"
...
[info] *** 1 TEST FAILED ***
[error] Failed: Total 4, Failed 1, Errors 0, Passed 3, Ignored 17
[error] Failed tests:
[error] 	org.apache.spark.sql.hive.execution.HiveCompatibilitySuite
[error] (hive/test:testOnly) sbt.TestsFailedException: Tests unsuccessful

The PR Builder is still using Hive 1.2 and Hadoop 2.7 because we haven't set up the new Jenkins job properly yet. For now, please test this locally with the above command and make a PR with [test-hadoop3.2].

https://issues.apache.org/jira/browse/SPARK-29988 will setup Jenkins properly.
https://issues.apache.org/jira/browse/SPARK-29991 will make [test-hive1.2] at PR Builder.

@dongjoon-hyun (Member) commented:
@HyukjinKwon (Member) commented Nov 27, 2019:

So it's one test failing on a different Hive version. I assume the test fails as expected because Hive fixed this in 2.0. Why didn't we just fix the test? The root cause seems clear, the test would be easy to fix, and it wouldn't have broken the PR build.

@HyukjinKwon (Member) commented:
I wish I could have left a comment before this PR was reverted. Since it is already reverted, @wangyum, can you open a new PR after fixing it?

attilapiros pushed a commit to attilapiros/spark that referenced this pull request on Dec 6, 2019: [SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column


Closes apache#25214 from wangyum/SPARK-28461.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>