Decimal Support for writing Parquet #1531
Conversation
Signed-off-by: Raza Jafri <[email protected]>
@@ -32,16 +33,30 @@
writer_confs={'spark.sql.legacy.parquet.datetimeRebaseModeInWrite': 'CORRECTED',
              'spark.sql.legacy.parquet.int96RebaseModeInWrite': 'CORRECTED'}

# https://github.com/rapidsai/cudf/issues/7152
we could write the tests and just xfail them, and we should file a follow-up issue to track it.
hmm... ok. I will see if I can do that and still use our data_gens
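The xfail approach suggested above could be sketched roughly as follows. This is a minimal illustration only: `DecimalGen`, `IntegerGen`, and `StringGen` are stand-in names for the plugin's data generators, not its actual test code.

```python
import pytest

# Hypothetical sketch of the xfail suggestion: keep reusing the shared
# data-generator lists, but mark the decimal cases as expected failures
# until rapidsai/cudf#7152 is fixed. The generator names are stand-ins.
decimal_param = pytest.param(
    ['DecimalGen'],
    marks=pytest.mark.xfail(reason='https://github.com/rapidsai/cudf/issues/7152'))
parquet_write_gens_list = [['IntegerGen', 'StringGen'], decimal_param]

@pytest.mark.parametrize('parquet_gens', parquet_write_gens_list)
def test_write_round_trip(parquet_gens):
    # The real test would write parquet_gens to Parquet and read it back;
    # here we only show the parametrize/xfail wiring.
    assert isinstance(parquet_gens, list)
```

This keeps the decimal cases visible in test reports (as XFAIL) rather than silently dropping them from the generator list.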
parquet_ts_write_options = ['INT96', 'TIMESTAMP_MICROS', 'TIMESTAMP_MILLIS']

@pytest.mark.parametrize('parquet_gens', parquet_write_gens_list, ids=idfn)
@pytest.mark.parametrize('reader_confs', reader_opt_confs)
@pytest.mark.parametrize('v1_enabled_list', ["", "parquet"])
@pytest.mark.parametrize('ts_type', parquet_ts_write_options)
@allow_non_gpu("CoalesceExec")
why is this?
Because CoalesceExec doesn't support Decimals yet.
Do we instead want to write a separate test for Decimals, so we are at least testing coalesce with the other types? We already have other tests covering coalesce, though.
yes, I don't want it to be ok if other types have that
@@ -41,6 +41,11 @@ object GpuParquetFileFormat {
    spark: SparkSession,
    options: Map[String, String],
    schema: StructType): Option[GpuParquetFileFormat] = {

if(!schema.forall(field => GpuOverrides.isSupportedType(field.dataType, allowDecimal = true))) {
nit: space after `if`
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuOrcFileFormat.scala
if(!schema.forall(field => GpuOverrides.isSupportedType(field.dataType))) {
  meta.willNotWorkOnGpu("Not all datatypes are supported")
At the very least, print the data types here, or tell us which one isn't supported.
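The request to surface the offending types could look like this in outline. This is a Python stand-in for the Scala shown above, under stated assumptions: `is_supported` stands in for `GpuOverrides.isSupportedType`, and the schema encoding is illustrative.

```python
# Python stand-in for the Scala check above: collect the unsupported
# fields and name them in the fallback message, instead of the generic
# "Not all datatypes are supported". is_supported() is an illustrative
# stand-in for GpuOverrides.isSupportedType, not the plugin's API.
def is_supported(dtype, allow_decimal=False):
    supported = {'IntegerType', 'LongType', 'StringType'}
    if allow_decimal:
        supported.add('DecimalType')
    return dtype in supported

# Schema encoded here as (field name, type name) pairs for illustration.
schema = [('a', 'IntegerType'), ('b', 'BinaryType'), ('c', 'MapType')]

unsupported = [f for f in schema if not is_supported(f[1], allow_decimal=True)]
if unsupported:
    detail = ', '.join(f'{name}: {dt}' for name, dt in unsupported)
    message = f'Unsupported datatypes found: {detail}'
```

Naming the offending fields in the message makes the GPU fallback reason actionable instead of forcing the user to guess which column was the problem.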
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala
Converting this to draft since it depends on a cudf change that is still pending.
This needs to update the static documentation in TypeChecks so that the Input/Output table specifies that Parquet supports decimals.
Signed-off-by: Raza Jafri <[email protected]>
parquet_ts_write_options = ['INT96', 'TIMESTAMP_MICROS', 'TIMESTAMP_MILLIS']

@pytest.mark.parametrize('parquet_gens', parquet_write_gens_list, ids=idfn)
@pytest.mark.parametrize('reader_confs', reader_opt_confs)
@pytest.mark.parametrize('v1_enabled_list', ["", "parquet"])
@pytest.mark.parametrize('ts_type', parquet_ts_write_options)
def test_write_round_trip(spark_tmp_path, parquet_gens, v1_enabled_list, ts_type, reader_confs):
def test_parquet_write_round_trip(spark_tmp_path, parquet_gens, v1_enabled_list, ts_type,
Do we really need "parquet" in the name when it's in parquet_write_test? Just make it wrap inputs.
Changed it. I made the change to be able to execute the test by name; otherwise it would run the ORC test as well.
OK, that is fine. You can also specify the test file: `src/main/python/parquet_write_test.py -k test_write_round_trip`
Thanks, that's really helpful for the future.
@@ -112,6 +118,7 @@ def test_compress_write_round_trip(spark_tmp_path, compress, v1_enabled_list, re

@pytest.mark.parametrize('parquet_gens', parquet_write_gens_list, ids=idfn)
@pytest.mark.parametrize('ts_type', parquet_ts_write_options)
@allow_non_gpu("CoalesceExec")
assume this isn't needed if in parquet_write_gens_list?
Signed-off-by: Raza Jafri <[email protected]>
@jlowe I have converted this from a draft as the cudf issue is merged. Appreciate the review. Can you PTAL?
build

build
@@ -41,6 +41,13 @@ object GpuParquetFileFormat {
    spark: SparkSession,
    options: Map[String, String],
    schema: StructType): Option[GpuParquetFileFormat] = {

val unSupportedTypes =
  schema.filter(field => !GpuOverrides.isSupportedType(field.dataType, allowDecimal = true))
consider `schema.filterNot` for readability
val unSupportedTypes =
  schema.filter(field => !GpuOverrides.isSupportedType(field.dataType, allowDecimal = true))
if (!unSupportedTypes.isEmpty) {
consider `if (unSupportedTypes.nonEmpty)`
def precisionsList(t: DataType): List[Int] = {
  t match {
    case d: DecimalType => List(d.precision)
    case s: StructType => s.flatMap(f => precisionsList(f.dataType)).toList
we could save the `toList` conversion if the return type of `precisionsList` were `Seq[Int]`
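The recursive precision collection being discussed might look like this in outline. This is a Python sketch, not the plugin's code: the tuple/list encoding of DecimalType and StructType is an assumption for illustration, standing in for Spark's DataType API.

```python
# Python sketch of precisionsList: walk a (possibly nested) type and
# collect the precision of every decimal it contains. A decimal is
# encoded here as ('decimal', precision) and a struct as a list of
# (name, type) fields -- an illustrative encoding, not Spark's API.
def precisions_list(t):
    if isinstance(t, tuple) and t[0] == 'decimal':
        return [t[1]]
    if isinstance(t, list):  # struct: recurse into each field's type
        return [p for _, field_type in t for p in precisions_list(field_type)]
    return []  # non-decimal scalar types contribute nothing

# A struct with a top-level decimal and a nested struct holding another.
nested = [('a', ('decimal', 10)),
          ('b', [('c', ('decimal', 5)), ('d', 'string')])]
```

Note the function already returns a plain list at every branch, which is why the review comment suggests widening the Scala return type to `Seq[Int]` and dropping the `toList` call on the struct branch.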
options: Map[String, String],
schema: StructType): Option[GpuOrcFileFormat] = {

val unSupportedTypes = schema.filter(field => !GpuOverrides.isSupportedType(field.dataType))
This looks the same as in GpuParquetFileFormat.scala; consider a shared utility.
It's not exactly the same. The predicate being tested is different. If you still think we can benefit from refactoring this, I can do it.
The difference is parameterizable, but it's not that big of a deal given it's just a few lines. Up to you.
Signed-off-by: Raza Jafri <[email protected]>
@gerashegalov Thanks for the review. I have incorporated most of your comments in the PR. The only thing that I haven't done is the utility method.
build
LGTM
* Decimal Support for writing Parquet
* addressed review comments
* updated static doc
* generated supported_ops
* addressed review comments

Signed-off-by: Raza Jafri <[email protected]>
Co-authored-by: Raza Jafri <[email protected]>
…IDIA#1531) Signed-off-by: spark-rapids automation <[email protected]>
This PR adds support for writing Decimal types to Parquet files.
At the time of writing, there is an issue in cudf (rapidsai/cudf#7152): Decimals with precision < 10 cannot be read back using CPU Spark.
This depends on rapidsai/cudf#7153.
Signed-off-by: Raza Jafri [email protected]