
[SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column #25214

Closed · wants to merge 16 commits

Conversation

@wangyum (Member) commented Jul 21, 2019

What changes were proposed in this pull request?

HIVE-12063 introduced padding of decimal numbers with trailing zeros to the scale of the column. The following description is copied from HIVE-12063:

HIVE-7373 was to address the problems of Hive trimming trailing zeros, which caused many issues, including treating 0.0, 0.00 and so on as 0, which has a different precision/scale. Please refer to the HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on cannot be read into decimal(1,1).
However, HIVE-11835 didn't address the problem of showing 0 in query results for decimal values such as 0.0, 0.00, etc. This causes confusion, as 0.0 and 0.00 have a different precision/scale than 0.
The proposal here is to pad zeros for query result to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. Internal decimal number representation doesn't change, however.
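As a rough sketch of the padding idea only (the helper name and arguments are made up for this example; this is not the actual Spark code path), a decimal value can be rendered to the column's declared scale like this:

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Render a decimal using the scale declared by the column type.
// Going from a smaller to a larger scale only appends zeros, so
// RoundingMode.UNNECESSARY never actually rounds in these calls.
def padToScale(value: JBigDecimal, columnScale: Int): String =
  value.setScale(columnScale, RoundingMode.UNNECESSARY).toPlainString

padToScale(new JBigDecimal("1"), 18)   // "1.000000000000000000"
padToScale(new JBigDecimal("0.0"), 1)  // "0.0"
```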

Spark SQL:

// bin/spark-sql
spark-sql> select cast(1 as decimal(38, 18));
1
spark-sql>

// bin/beeline
0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18));
+----------------------------+--+
| CAST(1 AS DECIMAL(38,18))  |
+----------------------------+--+
| 1.000000000000000000       |
+----------------------------+--+

// bin/spark-shell
scala> spark.sql("select cast(1 as decimal(38, 18))").show(false)
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|1.000000000000000000     |
+-------------------------+

// bin/pyspark
>>> spark.sql("select cast(1 as decimal(38, 18))").show()
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|     1.000000000000000000|
+-------------------------+

// bin/sparkR
> showDF(sql("SELECT cast(1 as decimal(38, 18))"))
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|     1.000000000000000000|
+-------------------------+

PostgreSQL:

postgres=# select cast(1 as decimal(38, 18));
       numeric
----------------------
 1.000000000000000000
(1 row)

Presto:

presto> select cast(1 as decimal(38, 18));
        _col0
----------------------
 1.000000000000000000
(1 row)

How was this patch tested?

unit tests and manual test:

spark-sql> select cast(1 as decimal(38, 18));
1.000000000000000000

Spark SQL Upgrading Guide:
(screenshot of the updated Spark SQL Upgrading Guide entry)

@@ -23,6 +23,8 @@ license: |
{:toc}

## Upgrading From Spark SQL 2.4 to 3.0
- Since Spark 3.0, we pad decimal numbers with trailing zeros to the scale of the column for Hive result.
Member commented:
btw, should we put this description about Hive-specific changes here? Would it be better in docs/sql-migration-guide-hive-compatibility.md?

@gengliangwang (Member) commented Aug 5, 2019:

It would be great to show an example here.

Member commented:

Since this is fairly minor, I don't think we should put it at the top of the migration guide.

@SparkQA commented Jul 21, 2019:

Test build #107949 has finished for PR 25214 at commit 6672e18.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 21, 2019:

Test build #107956 has finished for PR 25214 at commit 832d009.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member, Author) commented Jul 21, 2019:

retest this please

@SparkQA commented Jul 21, 2019:

Test build #107958 has finished for PR 25214 at commit 832d009.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum (Member, Author) commented Jul 30, 2019:

Only bin/spark-sql does not pad with zeros; bin/beeline, bin/spark-shell, bin/pyspark, and bin/sparkR all pad with zeros.

wangyum added 2 commits August 3, 2019 14:33
# Conflicts:
#	sql/core/src/test/resources/sql-tests/results/pgSQL/int4.sql.out
#	sql/core/src/test/resources/sql-tests/results/pgSQL/numeric.sql.out
@SparkQA commented Aug 4, 2019:

Test build #108600 has finished for PR 25214 at commit a617812.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member) commented:
Does this only affect the output result of CLI applications?

@wangyum (Member, Author) commented Aug 5, 2019:

Yes. It only affects the output of CLI applications.

@SparkQA commented Aug 6, 2019:

Test build #108713 has finished for PR 25214 at commit 141c3d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -159,6 +159,32 @@ license: |

- Since Spark 3.0, a Dataset query fails if it contains an ambiguous column reference caused by a self join. A typical example: `val df1 = ...; val df2 = df1.filter(...);`, then `df1.join(df2, df1("a") > df2("a"))` returns an empty result, which is quite confusing. This is because Spark cannot resolve Dataset column references that point to tables being self-joined, and `df1("a")` is exactly the same as `df2("a")` in Spark. To restore the behavior before Spark 3.0, you can set `spark.sql.analyzer.failAmbiguousSelfJoin` to `false`.

- Since Spark 3.0, we pad decimal numbers with trailing zeros to the scale of the column for Hive result, for example:
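As an aside on the first bullet in the excerpt above (the ambiguous self-join), a small hypothetical sketch of the pattern and of the legacy flag; the DataFrames and the column name are illustrative only:

```scala
// Hypothetical reproduction of the ambiguous self-join described above.
val df1 = spark.range(10).withColumnRenamed("id", "a")
val df2 = df1.filter(df1("a") > 3)

// Under Spark 3.0 defaults this condition is ambiguous, because df1("a") and
// df2("a") resolve to the same underlying attribute of the self-joined plan.
// df1.join(df2, df1("a") > df2("a"))

// Restore the pre-3.0 behavior described in the migration note:
spark.conf.set("spark.sql.analyzer.failAmbiguousSelfJoin", "false")
```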
Member commented:
To me, it's not clear what "Hive result" means.

Member (Author) commented:
Which one do you think is better?
(screenshots of the two candidate wordings attached)

@HyukjinKwon (Member) commented:
Seems right to me. Trailing behaviour was a legacy behaviour from old Hive IIRC

# Conflicts:
#	docs/sql-migration-guide-upgrade.md
#	sql/core/src/test/resources/sql-tests/results/pgSQL/numeric.sql.out
#	sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
@SparkQA commented Aug 25, 2019:

Test build #109699 has finished for PR 25214 at commit b82d6a2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 25, 2019:

Test build #109706 has finished for PR 25214 at commit 243e3c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented:
@dongjoon-hyun and @gatorsmile WDYT?

@tooptoop4 (Contributor) commented:
@wangyum (Member, Author) commented Sep 19, 2019:

@tooptoop4 I think those are different things.

# Conflicts:
#	docs/sql-migration-guide-upgrade.md
#	sql/core/src/test/resources/sql-tests/results/postgreSQL/date.sql.out
#	sql/core/src/test/resources/sql-tests/results/postgreSQL/int4.sql.out
@SparkQA commented Oct 17, 2019:

Test build #112195 has finished for PR 25214 at commit 91bd74a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Oct 23, 2019:

The current change looks fine to me, too. Can anyone check this? @gatorsmile @dongjoon-hyun @viirya

@wangyum (Member, Author) commented Nov 22, 2019:

Thank you @maropu. I think we should continue this work because it is the default behavior of Hive 2.x, and we are going to upgrade the default built-in Hive to 2.3.

@maropu (Member) commented Nov 24, 2019:

yea, can you resolve the conflict? @wangyum

# Conflicts:
#	docs/sql-migration-guide.md
#	sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala
#	sql/core/src/test/resources/sql-tests/results/decimalArithmeticOperations.sql.out
#	sql/core/src/test/resources/sql-tests/results/literals.sql.out
#	sql/core/src/test/resources/sql-tests/results/postgreSQL/numeric.sql.out
@SparkQA commented Nov 24, 2019:

Test build #114341 has finished for PR 25214 at commit e221f0e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented Nov 24, 2019:

But let me defer to others (or maybe I'll merge it in a few days). I do know this will break a lot of external SQL tests, like TPC-DS, and those will need to be fixed accordingly.

@viirya (Member) commented Nov 25, 2019:

Looks fine as it follows other major databases as shown in the description.

# Conflicts:
#	docs/sql-migration-guide.md
@SparkQA commented Nov 25, 2019:

Test build #114387 has finished for PR 25214 at commit 1c30198.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -226,6 +226,32 @@ license: |

- Since Spark 3.0, when casting string values to date, timestamp, and interval values, the leading and trailing white spaces (<= ASCII 32) are trimmed before casting, e.g. `cast('2019-10-10\t' as date)` results in the date value `2019-10-10`. In Spark version 2.4 and earlier, only the trailing space is removed; thus, the result is `null`.

- Since Spark 3.0, we pad decimal numbers with trailing zeros to the scale of the column for Hive result, for example:
Member commented:
@wangyum, can you clarify what "Hive result" means?

Member (Author) commented:
/**
 * Returns the result as a hive compatible sequence of strings. This is used in tests and
 * `SparkSQLDriver` for CLI applications.
 */
def hiveResultString(executedPlan: SparkPlan): Seq[String] = executedPlan match {
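For context, a rough sketch of how this helper relates to the padded output discussed in this PR; the exact import path and call are illustrative and may differ across versions:

```scala
// Illustrative only: render a query result the way the spark-sql CLI does.
import org.apache.spark.sql.execution.HiveResult

val df = spark.sql("select cast(1 as decimal(38, 18))")
val rendered: Seq[String] =
  HiveResult.hiveResultString(df.queryExecution.executedPlan)
// With this change, rendered would contain "1.000000000000000000" rather than "1".
```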

Member commented:
I meant to clarify it in this migration guide ...

Member (Author) commented:
How about:
"Since Spark 3.0, we pad decimal numbers with trailing zeros to the scale of the column for the spark-sql interface."

Member commented:
+1

@SparkQA commented Nov 26, 2019:

Test build #114473 has finished for PR 25214 at commit 564c799.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a comment:
LGTM

@HyukjinKwon (Member) commented:
Merged to master.

@wangyum deleted the SPARK-28461 branch November 27, 2019 09:21
@dongjoon-hyun (Member) commented Nov 27, 2019:

Sorry guys. I'll revert this for now to recover those three Jenkins jobs.

Although the failure is only the difference between 0 and 0.0, and it succeeded with hadoop-2.7, I'm not sure about the fix because the behavior is different across the profiles.

[info] - decimal_1_1 *** FAILED *** (812 milliseconds)
[info]   Results do not match for decimal_1_1:
[info]   == Parsed Logical Plan ==
[info]   'Project [*]
[info]   +- 'UnresolvedRelation [decimal_1_1]
[info]   
[info]   == Analyzed Logical Plan ==
[info]   d: decimal(1,1)
[info]   Project [d#15908]
[info]   +- SubqueryAlias `default`.`decimal_1_1`
[info]      +- HiveTableRelation `default`.`decimal_1_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [d#15908]
[info]   
[info]   == Optimized Logical Plan ==
[info]   HiveTableRelation `default`.`decimal_1_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [d#15908]
[info]   
[info]   == Physical Plan ==
[info]   Scan hive default.decimal_1_1 [d#15908], HiveTableRelation `default`.`decimal_1_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [d#15908]
[info]   
[info]   d
[info]   !== HIVE - 30 row(s) ==   == CATALYST - 30 row(s) ==
[info]    -0.1                     -0.1
[info]    -0.2                     -0.2
[info]    -0.3                     -0.3
[info]    -0.9                     -0.9
[info]    -0.9                     -0.9
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]   !0                        0.0
[info]    0.1                      0.1
[info]    0.2                      0.2
[info]    0.3                      0.3
[info]    0.9                      0.9
[info]    0.9                      0.9
[info]    NULL                     NULL
[info]    NULL                     NULL
[info]    NULL                     NULL
[info]    NULL                     NULL
[info]    NULL                     NULL
[info]    NULL                     NULL
[info]    NULL                     NULL
[info]    NULL                     NULL

@dongjoon-hyun (Member) commented:
cc @gatorsmile and @cloud-fan

@dongjoon-hyun (Member) commented Nov 27, 2019:

It turns out this PR works only with Hive 1.2. With this PR on master, I reproduced the failure as follows:

$ build/sbt -Phive "hive/testOnly *.HiveCompatibilitySuite -- -z decimal"
...
[info] *** 1 TEST FAILED ***
[error] Failed: Total 4, Failed 1, Errors 0, Passed 3, Ignored 17
[error] Failed tests:
[error] 	org.apache.spark.sql.hive.execution.HiveCompatibilitySuite
[error] (hive/test:testOnly) sbt.TestsFailedException: Tests unsuccessful

The PR Builder is still using Hive 1.2 and Hadoop 2.7 because we haven't set up the new Jenkins job properly yet. For now, please test this locally with the above command and make a PR with [test-hadoop3.2].

https://issues.apache.org/jira/browse/SPARK-29988 will setup Jenkins properly.
https://issues.apache.org/jira/browse/SPARK-29991 will make [test-hive1.2] at PR Builder.

@dongjoon-hyun (Member) commented:
@HyukjinKwon (Member) commented Nov 27, 2019:

So it's one test failing on a different Hive version. I assume the test fails as expected because Hive fixed this in 2.0. Why didn't we just fix the test? The root cause seems clear, the test would be easy to fix, and it wouldn't have broken the PR build.

@HyukjinKwon (Member) commented:
I wish I could have left a comment before this PR was reverted. Since it is already reverted, @wangyum, can you open a new PR after fixing it?

attilapiros pushed a commit to attilapiros/spark that referenced this pull request on Dec 6, 2019: [SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column


Closes apache#25214 from wangyum/SPARK-28461.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>