Support reading decimal columns from parquet files #1294
Conversation
@sameerz @jlowe @tgravescs I personally feel that this is getting to be way too late to put into the 0.3 release. We really should be looking at bug fixes instead of new feature work, especially because we know that rapidsai/cudf#6909 is still an outstanding issue. But if this is something we need to get in, then at a minimum we need to throw an exception if we get back a data type we don't expect for a decimal column.
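To illustrate the kind of guard being asked for, here is a minimal sketch of a fail-fast type check. The helper name `assertExpectedTypes` is made up, and the exact cuDF Java accessors (`getType`, `getTypeId`) are assumptions about the current API rather than the plugin's actual code in `GpuParquetScan`:

```scala
import ai.rapids.cudf.{DType, Table}

// Hypothetical helper: verify that each column cuDF handed back has the
// type we asked for, instead of silently returning wrong data.
def assertExpectedTypes(table: Table, expected: Seq[DType.DTypeEnum]): Unit = {
  (0 until table.getNumberOfColumns).foreach { i =>
    val actual = table.getColumn(i).getType.getTypeId
    if (actual != expected(i)) {
      throw new IllegalStateException(
        s"Parquet read returned unexpected type for column $i: " +
          s"expected ${expected(i)}, got $actual")
    }
  }
}
```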
Agree. I would like to see the legacy decimal encoding supported before this goes in; otherwise we're left in a situation where the plugin crashes a query that used to work without it.
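For context, Spark can still write decimals in the legacy Parquet layout (FIXED_LEN_BYTE_ARRAY) when `spark.sql.parquet.writeLegacyFormat` is enabled, which is the encoding cuDF cannot read yet. A minimal sketch that produces such a file (the output path is made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("legacy-decimal").getOrCreate()

// With this flag on, decimal columns are stored as FIXED_LEN_BYTE_ARRAY
// instead of INT32/INT64, regardless of precision.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

spark.sql("SELECT CAST(1.23 AS DECIMAL(10, 2)) AS d")
  .write.parquet("/tmp/legacy_decimal")
```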
@sperlingxx per the above discussion, I think this PR should be retargeted to branch-0.4.
Yes! So, I labeled this pull request with WIP.
> Yes! So, I labeled this pull request with WIP.
That's fine, adding a request to retarget this PR to branch-0.4 so it cannot be accidentally merged in the interim.
Test failures are related; I'm guessing because the 3.1.0 shims were not updated.
I also noticed that this PR didn't generate a new docs/supported_ops.md as it should. We're not currently generating the documentation for scans from the …
This is looking better, but the supported_ops.md still states at the bottom of the page that Parquet does not support reading the Decimal type. The static table in TypeChecks needs to be updated to reflect that decimal is supported for Parquet input, and the supported_ops.md file regenerated by running mvn verify.
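Conceptually, the generated matrix comes from a static mapping of (file format, data type) to a support level. The sketch below is a made-up simplification of that idea, not the plugin's actual TypeChecks API:

```scala
// Hypothetical simplification of the static support table; the real
// TypeChecks code in spark-rapids is structured differently.
object ParquetReadSupport {
  sealed trait Support
  case object Supported extends Support
  case object NotSupported extends Support

  // Before this PR the Decimal entry would have been NotSupported;
  // flipping it is what makes the regenerated docs/supported_ops.md
  // advertise decimal reads for Parquet input.
  val byType: Map[String, Support] = Map(
    "INT"     -> Supported,
    "STRING"  -> Supported,
    "DECIMAL" -> Supported // was NotSupported before this change
  )
}
```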
The static table has been updated.
This pull request is to enable reading decimal columns from parquet files by turning `allowDecimal` to `true`. It also provides test coverage for decimal reading. But there are some limitations on decimal reading: we can only read decimal columns whose storage type is `INT32/64`. For now, cuDF doesn't support `FIXED_LEN_BYTE_ARRAY`. `INT32` will be read as `INT64` because we only support `DECIMAL64` in spark-rapids.
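As a usage sketch of what this enables: the config key `spark.rapids.sql.decimalType.enabled` and the paths below are assumptions on my part; check the plugin docs for the exact knob in your release.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("decimal-read")
  // Assumed config keys; consult the spark-rapids docs for your version.
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.decimalType.enabled", "true")
  .getOrCreate()

// Precision <= 18 fits the DECIMAL64 limit mentioned above; Spark writes
// such columns as INT32/INT64 in Parquet (non-legacy mode), which is the
// layout this PR lets the plugin read on the GPU.
spark.sql("SELECT CAST(12.34 AS DECIMAL(18, 2)) AS d")
  .write.mode("overwrite").parquet("/tmp/decimal_ok")

spark.read.parquet("/tmp/decimal_ok").show()
```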