Support reading decimal columns from parquet files #1294
Conversation
@sameerz @jlowe @tgravescs I personally feel that this is getting to be way too late to put into the 0.3 release. We really should be looking at bug fixes instead of new feature work, especially because we know that rapidsai/cudf#6909 is still an outstanding issue. But if this is something we need to get in, then at a minimum we need to throw an exception if we get back a data type we don't expect for a decimal column.
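To illustrate the kind of guard being asked for, here is a minimal sketch of a fail-fast type check. The helper name `assertExpectedTypes` is made up, and the exact cuDF Java accessors (`getType`, `getTypeId`) are assumptions about the current API rather than the plugin's actual code in `GpuParquetScan`:

```scala
import ai.rapids.cudf.{DType, Table}

// Hypothetical helper: verify that each column cuDF handed back has the
// type we asked for, instead of silently returning wrong data.
def assertExpectedTypes(table: Table, expected: Seq[DType.DTypeEnum]): Unit = {
  (0 until table.getNumberOfColumns).foreach { i =>
    val actual = table.getColumn(i).getType.getTypeId
    if (actual != expected(i)) {
      throw new IllegalStateException(
        s"Parquet read returned unexpected type for column $i: " +
          s"expected ${expected(i)}, got $actual")
    }
  }
}
```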
Agree. I would like to see the legacy decimal encoding supported before this goes in; otherwise we're left in a situation where the plugin crashes a query that used to work without it.
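For context, Spark can still write decimals in the legacy Parquet layout (FIXED_LEN_BYTE_ARRAY) when `spark.sql.parquet.writeLegacyFormat` is enabled, which is the encoding cuDF cannot read yet. A minimal sketch that produces such a file (the output path is made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("legacy-decimal").getOrCreate()

// With this flag on, decimal columns are stored as FIXED_LEN_BYTE_ARRAY
// instead of INT32/INT64, regardless of precision.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

spark.sql("SELECT CAST(1.23 AS DECIMAL(10, 2)) AS d")
  .write.parquet("/tmp/legacy_decimal")
```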
@sperlingxx per the above discussion, I think this PR should be retargeted to branch-0.4.
Yes! So, I labeled this pull request with WIP.
> Yes! So, I labeled this pull request with WIP.
That's fine, adding a request to retarget this PR to branch-0.4 so it cannot be accidentally merged in the interim.
Test failures are related; I'm guessing because the 3.1.0 shims were not updated.
I also noticed that this PR didn't generate a new docs/supported_ops.md as it should. We're not currently generating the documentation for scans from the …
This is looking better, but the supported_ops.md still states at the bottom of the page that Parquet does not support reading the Decimal type. The static table in TypeChecks needs to be updated to reflect that decimal is supported for Parquet input, and the supported_ops.md file regenerated by running mvn verify.
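Conceptually, the generated matrix comes from a static mapping of (file format, data type) to a support level. The sketch below is a made-up simplification of that idea, not the plugin's actual TypeChecks API:

```scala
// Hypothetical simplification of the static support table; the real
// TypeChecks code in spark-rapids is structured differently.
object ParquetReadSupport {
  sealed trait Support
  case object Supported extends Support
  case object NotSupported extends Support

  // Before this PR the Decimal entry would have been NotSupported;
  // flipping it is what makes the regenerated docs/supported_ops.md
  // advertise decimal reads for Parquet input.
  val byType: Map[String, Support] = Map(
    "INT"     -> Supported,
    "STRING"  -> Supported,
    "DECIMAL" -> Supported // was NotSupported before this change
  )
}
```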
The static table has been updated.
This pull request is to enable reading decimal columns from parquet files by turning `allowDecimal` to `true`. It also provides test coverage for decimal reading. But there are some limitations on decimal reading: we can only read decimal columns whose storage type is `INT32/64`. For now, cuDF doesn't support `FIXED_LEN_BYTE_ARRAY`. `INT32` will be read as `INT64` because we only support `DECIMAL64` in spark-rapids.
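As a usage sketch of what this enables: the config key `spark.rapids.sql.decimalType.enabled` and the paths below are assumptions on my part; check the plugin docs for the exact knob in your release.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("decimal-read")
  // Assumed config keys; consult the spark-rapids docs for your version.
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.decimalType.enabled", "true")
  .getOrCreate()

// Precision <= 18 fits the DECIMAL64 limit mentioned above; Spark writes
// such columns as INT32/INT64 in Parquet (non-legacy mode), which is the
// layout this PR lets the plugin read on the GPU.
spark.sql("SELECT CAST(12.34 AS DECIMAL(18, 2)) AS d")
  .write.mode("overwrite").parquet("/tmp/decimal_ok")

spark.read.parquet("/tmp/decimal_ok").show()
```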