Decimal Support for writing Parquet #1531
Conversation
Signed-off-by: Raza Jafri <[email protected]>
@@ -32,16 +33,30 @@
writer_confs={'spark.sql.legacy.parquet.datetimeRebaseModeInWrite': 'CORRECTED',
              'spark.sql.legacy.parquet.int96RebaseModeInWrite': 'CORRECTED'}

# https://github.com/rapidsai/cudf/issues/7152
we could write the tests and just xfail them, and we should file a follow-up issue to track it.
hmm... ok. I will see if I can do that and still use our data_gens
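The xfail approach suggested above could be sketched roughly as follows. This is a minimal illustration only: `DecimalGen`, `IntegerGen`, and `StringGen` are stand-in names for the plugin's data generators, not its actual test code.

```python
import pytest

# Hypothetical sketch of the xfail suggestion: keep reusing the shared
# data-generator lists, but mark the decimal cases as expected failures
# until rapidsai/cudf#7152 is fixed. The generator names are stand-ins.
decimal_param = pytest.param(
    ['DecimalGen'],
    marks=pytest.mark.xfail(reason='https://github.com/rapidsai/cudf/issues/7152'))
parquet_write_gens_list = [['IntegerGen', 'StringGen'], decimal_param]

@pytest.mark.parametrize('parquet_gens', parquet_write_gens_list)
def test_write_round_trip(parquet_gens):
    # The real test would write parquet_gens to Parquet and read it back;
    # here we only show the parametrize/xfail wiring.
    assert isinstance(parquet_gens, list)
```

This keeps the decimal cases visible in test reports (as XFAIL) rather than silently dropping them from the generator list.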
parquet_ts_write_options = ['INT96', 'TIMESTAMP_MICROS', 'TIMESTAMP_MILLIS']

@pytest.mark.parametrize('parquet_gens', parquet_write_gens_list, ids=idfn)
@pytest.mark.parametrize('reader_confs', reader_opt_confs)
@pytest.mark.parametrize('v1_enabled_list', ["", "parquet"])
@pytest.mark.parametrize('ts_type', parquet_ts_write_options)
@allow_non_gpu("CoalesceExec")
why is this?
Because CoalesceExec doesn't support Decimals yet.
Do we instead want to write a separate test for Decimals, so we are at least testing coalesce with the other types? We already have other tests covering coalesce, though.
yes, I don't want it to be ok if other types have that
@@ -41,6 +41,11 @@ object GpuParquetFileFormat {
    spark: SparkSession,
    options: Map[String, String],
    schema: StructType): Option[GpuParquetFileFormat] = {

if(!schema.forall(field => GpuOverrides.isSupportedType(field.dataType, allowDecimal = true))) {
nit: space after `if`
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuOrcFileFormat.scala
if(!schema.forall(field => GpuOverrides.isSupportedType(field.dataType))) {
  meta.willNotWorkOnGpu("Not all datatypes are supported")
At the very least, print the data types here, or tell us which one isn't supported.
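The request to surface the offending types could look like this in outline. This is a Python stand-in for the Scala shown above, under stated assumptions: `is_supported` stands in for `GpuOverrides.isSupportedType`, and the schema encoding is illustrative.

```python
# Python stand-in for the Scala check above: collect the unsupported
# fields and name them in the fallback message, instead of the generic
# "Not all datatypes are supported". is_supported() is an illustrative
# stand-in for GpuOverrides.isSupportedType, not the plugin's API.
def is_supported(dtype, allow_decimal=False):
    supported = {'IntegerType', 'LongType', 'StringType'}
    if allow_decimal:
        supported.add('DecimalType')
    return dtype in supported

# Schema encoded here as (field name, type name) pairs for illustration.
schema = [('a', 'IntegerType'), ('b', 'BinaryType'), ('c', 'MapType')]

unsupported = [f for f in schema if not is_supported(f[1], allow_decimal=True)]
if unsupported:
    detail = ', '.join(f'{name}: {dt}' for name, dt in unsupported)
    message = f'Unsupported datatypes found: {detail}'
```

Naming the offending fields in the message makes the GPU fallback reason actionable instead of forcing the user to guess which column was the problem.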
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala
Converting this to draft since it depends on a cudf change that is still pending.
This needs to update the static documentation in TypeChecks so that the Input/Output table specifies that Parquet supports decimals.
Signed-off-by: Raza Jafri <[email protected]>
parquet_ts_write_options = ['INT96', 'TIMESTAMP_MICROS', 'TIMESTAMP_MILLIS']

@pytest.mark.parametrize('parquet_gens', parquet_write_gens_list, ids=idfn)
@pytest.mark.parametrize('reader_confs', reader_opt_confs)
@pytest.mark.parametrize('v1_enabled_list', ["", "parquet"])
@pytest.mark.parametrize('ts_type', parquet_ts_write_options)
def test_write_round_trip(spark_tmp_path, parquet_gens, v1_enabled_list, ts_type, reader_confs):
def test_parquet_write_round_trip(spark_tmp_path, parquet_gens, v1_enabled_list, ts_type,
Do we really need "parquet" in the name when it's in parquet_write_test? Just make it wrap inputs.
Changed it. I made the change to be able to execute the test by name; otherwise it would run the ORC test as well.
OK, that is fine. You can also specify the test file: `src/main/python/parquet_write_test.py -k test_write_round_trip`
Thanks, that's really helpful for the future.
@@ -112,6 +118,7 @@ def test_compress_write_round_trip(spark_tmp_path, compress, v1_enabled_list, re

@pytest.mark.parametrize('parquet_gens', parquet_write_gens_list, ids=idfn)
@pytest.mark.parametrize('ts_type', parquet_ts_write_options)
@allow_non_gpu("CoalesceExec")
assume this isn't needed if in parquet_write_gens_list?
Signed-off-by: Raza Jafri <[email protected]>
@jlowe I have converted this from a draft as the cudf issue is merged. Appreciate the review. Can you PTAL?
build

build
@@ -41,6 +41,13 @@ object GpuParquetFileFormat {
    spark: SparkSession,
    options: Map[String, String],
    schema: StructType): Option[GpuParquetFileFormat] = {

val unSupportedTypes =
  schema.filter(field => !GpuOverrides.isSupportedType(field.dataType, allowDecimal = true))
consider `schema.filterNot` for readability
val unSupportedTypes =
  schema.filter(field => !GpuOverrides.isSupportedType(field.dataType, allowDecimal = true))
if (!unSupportedTypes.isEmpty) {
consider `if (unSupportedTypes.nonEmpty)`
def precisionsList(t: DataType): List[Int] = {
  t match {
    case d: DecimalType => List(d.precision)
    case s: StructType => s.flatMap(f => precisionsList(f.dataType)).toList
we could save the `toList` conversion if the return type of `precisionsList` were `Seq[Int]`
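The recursive precision collection being discussed might look like this in outline. This is a Python sketch, not the plugin's code: the tuple/list encoding of DecimalType and StructType is an assumption for illustration, standing in for Spark's DataType API.

```python
# Python sketch of precisionsList: walk a (possibly nested) type and
# collect the precision of every decimal it contains. A decimal is
# encoded here as ('decimal', precision) and a struct as a list of
# (name, type) fields -- an illustrative encoding, not Spark's API.
def precisions_list(t):
    if isinstance(t, tuple) and t[0] == 'decimal':
        return [t[1]]
    if isinstance(t, list):  # struct: recurse into each field's type
        return [p for _, field_type in t for p in precisions_list(field_type)]
    return []  # non-decimal scalar types contribute nothing

# A struct with a top-level decimal and a nested struct holding another.
nested = [('a', ('decimal', 10)),
          ('b', [('c', ('decimal', 5)), ('d', 'string')])]
```

Note the function already returns a plain list at every branch, which is why the review comment suggests widening the Scala return type to `Seq[Int]` and dropping the `toList` call on the struct branch.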
options: Map[String, String],
schema: StructType): Option[GpuOrcFileFormat] = {

val unSupportedTypes = schema.filter(field => !GpuOverrides.isSupportedType(field.dataType))
This looks the same as in GpuParquetFileFormat.scala; consider a shared utility.
It's not exactly the same. The predicate being tested is different. If you still think we can benefit from refactoring this, I can do it.
The difference is parameterizable, but it's not that big of a deal given it's just a few lines. Up to you.
Signed-off-by: Raza Jafri <[email protected]>
@gerashegalov Thanks for the review. I have incorporated most of your comments in the PR. The only thing that I haven't done is the utility method.
build
LGTM
* Decimal Support for writing Parquet
* addressed review comments
* updated static doc
* generated supported_ops
* addressed review comments

Signed-off-by: Raza Jafri <[email protected]>
Co-authored-by: Raza Jafri <[email protected]>
…IDIA#1531) Signed-off-by: spark-rapids automation <[email protected]>
This PR adds support for writing Decimal types to Parquet files.
At the time of writing, there is an issue in cudf (rapidsai/cudf#7152): Decimals with precision < 10 cannot be read back using CPU Spark.
This depends on rapidsai/cudf#7153.
Signed-off-by: Raza Jafri [email protected]