[SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column #25214
Conversation
sql/hive/src/test/resources/golden/decimal_4-6-693c2e345731f9b2b547c3b75218458e
docs/sql-migration-guide-upgrade.md
@@ -23,6 +23,8 @@ license: |
{:toc}

## Upgrading From Spark SQL 2.4 to 3.0

- Since Spark 3.0, we pad decimal numbers with trailing zeros to the scale of the column for Hive result.
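An example of the behavior change, taken from this PR's description further below (the spark-sql CLI prompt formatting mirrors that description):

```sql
-- Spark 2.4 and earlier: trailing zeros are trimmed
spark-sql> select cast(1 as decimal(38, 18));
1

-- Spark 3.0: the value is padded to the column's scale (18)
spark-sql> select cast(1 as decimal(38, 18));
1.000000000000000000
```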
btw, should we put this description about Hive-specific changes here? Is it better in docs/sql-migration-guide-hive-compatibility.md?
It would be great to show an example here.
Since this is sort of minor, I don't think we should put it in the first place of the migration guide.
Test build #107949 has finished for PR 25214 at commit
Test build #107956 has finished for PR 25214 at commit
retest this please
Test build #107958 has finished for PR 25214 at commit
Only
# Conflicts:
# sql/core/src/test/resources/sql-tests/results/pgSQL/int4.sql.out
# sql/core/src/test/resources/sql-tests/results/pgSQL/numeric.sql.out
Test build #108600 has finished for PR 25214 at commit
Does this only affect the output result of CLI applications?
Yes. It only affects the output result of CLI applications.
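For contrast (also from the examples in this PR's description), beeline already padded the zeros before this change; only the spark-sql CLI trimmed them:

```sql
-- bin/beeline, before this change: already padded
0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18));
+----------------------------+--+
| CAST(1 AS DECIMAL(38,18))  |
+----------------------------+--+
| 1.000000000000000000       |
+----------------------------+--+
```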
# Conflicts:
# docs/sql-migration-guide-upgrade.md
Test build #108713 has finished for PR 25214 at commit
docs/sql-migration-guide-upgrade.md
@@ -159,6 +159,32 @@ license: |

- Since Spark 3.0, a Dataset query fails if it contains an ambiguous column reference caused by a self join. A typical example: `val df1 = ...; val df2 = df1.filter(...);`, then `df1.join(df2, df1("a") > df2("a"))` returns an empty result, which is quite confusing. This is because Spark cannot resolve Dataset column references that point to tables being self joined, and `df1("a")` is exactly the same as `df2("a")` in Spark. To restore the behavior before Spark 3.0, you can set `spark.sql.analyzer.failAmbiguousSelfJoin` to `false`.

- Since Spark 3.0, we pad decimal numbers with trailing zeros to the scale of the column for Hive result, for example:
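The example that follows this line in the actual diff is truncated here; judging from the PR description, it presumably showed:

```sql
spark-sql> select cast(1 as decimal(38, 18));
1.000000000000000000
```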
To me, it's not clear what "Hive result" means.
Seems right to me. Trailing behaviour was a legacy behaviour from old Hive, IIRC.
# Conflicts:
# docs/sql-migration-guide-upgrade.md
# sql/core/src/test/resources/sql-tests/results/pgSQL/numeric.sql.out
# sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
Test build #109699 has finished for PR 25214 at commit
Test build #109706 has finished for PR 25214 at commit
@dongjoon-hyun and @gatorsmile WDYT?
does this fix https://issues.apache.org/jira/browse/SPARK-23576 ?
@tooptoop4 I think they're different things.
# Conflicts:
# docs/sql-migration-guide-upgrade.md
# sql/core/src/test/resources/sql-tests/results/postgreSQL/date.sql.out
# sql/core/src/test/resources/sql-tests/results/postgreSQL/int4.sql.out
Test build #112195 has finished for PR 25214 at commit
The current change looks fine to me, too. Can anyone check this? @gatorsmile @dongjoon-hyun @viirya
Thank you, @maropu. I think we should continue this work because it is the default behavior of Hive 2.x, and we are going to upgrade the default built-in Hive to 2.3.
yea, can you resolve the conflict? @wangyum
# Conflicts:
# docs/sql-migration-guide.md
# sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala
# sql/core/src/test/resources/sql-tests/results/decimalArithmeticOperations.sql.out
# sql/core/src/test/resources/sql-tests/results/literals.sql.out
# sql/core/src/test/resources/sql-tests/results/postgreSQL/numeric.sql.out
Test build #114341 has finished for PR 25214 at commit
But let me defer to others (or maybe let me merge in a few days). At least I know this will break a lot of SQL tests outside, like TPC-DS, and they will need to be fixed accordingly.
Looks fine, as it follows other major databases, as shown in the description.
# Conflicts:
# docs/sql-migration-guide.md
Test build #114387 has finished for PR 25214 at commit
docs/sql-migration-guide.md
@@ -226,6 +226,32 @@ license: |

- Since Spark 3.0, when casting string values to date, timestamp and interval values, the leading and trailing white spaces (<= ASCII 32) will be trimmed before casting, e.g. `cast('2019-10-10\t' as date)` results in the date value `2019-10-10`. In Spark version 2.4 and earlier, only the trailing space was removed, so the result was `null`.

- Since Spark 3.0, we pad decimal numbers with trailing zeros to the scale of the column for Hive result, for example:
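The diff's example is again truncated here; it presumably matched the PR description (padding `cast(1 as decimal(38, 18))` to `1.000000000000000000`). As an additional, hypothetical illustration (values chosen for this rewrite, not taken from the PR), any decimal is padded to the declared scale of its column:

```sql
-- Spark 3.0 behavior in the spark-sql CLI: decimal(10, 4) has scale 4,
-- so three zeros are appended to 0.1
spark-sql> select cast(0.1 as decimal(10, 4));
0.1000
```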
@wangyum, can you clarify what "Hive result" means?
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala
Lines 33 to 37 in 051e691:

```scala
/**
 * Returns the result as a hive compatible sequence of strings. This is used in tests and
 * `SparkSQLDriver` for CLI applications.
 */
def hiveResultString(executedPlan: SparkPlan): Seq[String] = executedPlan match {
```
I meant to clarify it in this migration guide ...
How about:

> Since Spark 3.0, we pad decimal numbers with trailing zeros to the scale of the column for the `spark-sql` interface.
+1
Test build #114473 has finished for PR 25214 at commit
LGTM
Merged to master.
Sorry guys. I'll revert this for now to recover those three Jenkins jobs. Although the failure is only the difference from
cc @gatorsmile and @cloud-fan
It turns out this PR works only
The PR Builder is still using https://issues.apache.org/jira/browse/SPARK-29988 will set up Jenkins properly.
So it's one test failing on a different Hive version. I assume the test fails as expected because Hive fixed it in 2.0. Why didn't we just fix the test? The root cause seems clear, the test would have been easy to fix, and it wouldn't have broken the PR build.
I had hoped to leave a comment before this PR was reverted. Since it is already reverted, @wangyum, can you open a PR again after fixing it?
[SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column

## What changes were proposed in this pull request?

[HIVE-12063](https://issues.apache.org/jira/browse/HIVE-12063) improved padding decimal numbers with trailing zeros to the scale of the column. The following description is copied from the description of HIVE-12063.

> HIVE-7373 was to address the problems of trimming tailing zeros by Hive, which caused many problems including treating 0.0, 0.00 and so on as 0, which has different precision/scale. Please refer to HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on cannot be read into decimal(1,1). However, HIVE-11835 didn't address the problem of showing as 0 in query result for any decimal values such as 0.0, 0.00, etc. This causes confusion as 0 and 0.0 have different precision/scale than 0. The proposal here is to pad zeros for query result to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. Internal decimal number representation doesn't change, however.

**Spark SQL**:

```sql
// bin/spark-sql
spark-sql> select cast(1 as decimal(38, 18));
1
spark-sql>

// bin/beeline
0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18));
+----------------------------+--+
| CAST(1 AS DECIMAL(38,18))  |
+----------------------------+--+
| 1.000000000000000000       |
+----------------------------+--+

// bin/spark-shell
scala> spark.sql("select cast(1 as decimal(38, 18))").show(false)
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|1.000000000000000000     |
+-------------------------+

// bin/pyspark
>>> spark.sql("select cast(1 as decimal(38, 18))").show()
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|     1.000000000000000000|
+-------------------------+

// bin/sparkR
> showDF(sql("SELECT cast(1 as decimal(38, 18))"))
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|     1.000000000000000000|
+-------------------------+
```

**PostgreSQL**:

```sql
postgres=# select cast(1 as decimal(38, 18));
       numeric
----------------------
 1.000000000000000000
(1 row)
```

**Presto**:

```sql
presto> select cast(1 as decimal(38, 18));
        _col0
----------------------
 1.000000000000000000
(1 row)
```

## How was this patch tested?

unit tests and manual test:

```sql
spark-sql> select cast(1 as decimal(38, 18));
1.000000000000000000
```

Spark SQL Upgrading Guide:



Closes apache#25214 from wangyum/SPARK-28461.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>