Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-21396][SQL] Fixes MatchError when UDTs are passed through Hive Thriftserver #20385

Closed
wants to merge 2 commits into from

Conversation

atallahhezbor
Copy link
Contributor

Signed-off-by: Atallah Hezbor [email protected]

What changes were proposed in this pull request?

This PR proposes modifying the match statement that gets the columns of a row in HiveThriftServer. There was previously no case for UserDefinedType, so querying a table that contained them would throw a match error. The changes catch that case and return the string representation.

How was this patch tested?

While I would have liked to add a unit test, I couldn't easily incorporate UDTs into the HiveThriftServer2Suites pipeline. With some guidance I would be happy to push a commit with tests.

Instead I did a manual test by loading a DataFrame with Point UDT in a spark shell with a HiveThriftServer. Then in beeline, connecting to the server and querying that table.

Here is the result before the change

0: jdbc:hive2://localhost:10000> select * from chicago;
Error: scala.MatchError: org.apache.spark.sql.PointUDT@2d980dc3 (of class org.apache.spark.sql.PointUDT) (state=,code=0)

And after the change:

0: jdbc:hive2://localhost:10000> select * from chicago;
+---------------------------------------+--------------+------------------------+---------------------+--+
|                __fid__                | case_number  |          dtg           |        geom         |
+---------------------------------------+--------------+------------------------+---------------------+--+
| 109602f9-54f8-414b-8c6f-42b1a337643e  | 2            | 2016-01-01 19:00:00.0  | POINT (-77 38)      |
| 709602f9-fcff-4429-8027-55649b6fd7ed  | 1            | 2015-12-31 19:00:00.0  | POINT (-76.5 38.5)  |
| 009602f9-fcb5-45b1-a867-eb8ba10cab40  | 3            | 2016-01-02 19:00:00.0  | POINT (-78 39)      |
+---------------------------------------+--------------+------------------------+---------------------+--+

@gatorsmile
Copy link
Member

cc @liufengdb

@felixcheung
Copy link
Member

Jenkins test this please

@SparkQA
Copy link

SparkQA commented Jan 24, 2018

Test build #86597 has finished for PR 20385 at commit c8fb436.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -102,6 +102,8 @@ private[hive] class SparkExecuteStatementOperation(
to += from.getAs[Timestamp](ordinal)
case BinaryType =>
to += from.getAs[Array[Byte]](ordinal)
case udt: UserDefinedType[_] =>
to += from.get(ordinal).toString
Copy link

@liufengdb liufengdb Jan 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible from.get(ordinal) returns null, then a null pointer exception. I think a better way to add this case is by the method HiveUtils.toHiveString, which can potentially be reused and tested.

Copy link

@liufengdb liufengdb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a test is necessary.

@gatorsmile
Copy link
Member

The unit test cases are needed for HiveUtils.toHiveString.

@atallahhezbor
Copy link
Contributor Author

@liufengdb @gatorsmile I'm happy to write a unit test if you require it. Though as I mentioned before, I did not see a clear way of testing the functionality in SparkExecuteStatementOperation. It looks to me that a lot of those functions are not tested directly.

Also @gatorsmile, I'm not sure I understand you. If I move the functionality to HiveUtils.toHiveString, are you saying I should write a unit test of that whole function?

@gatorsmile
Copy link
Member

@atallahhezbor Yeah! Please help us improve the test coverage. We do not have a clear way to test the functionality in SparkExecuteStatementOperation

Adding unit test cases for HiveUtils.toHiveString is enough if we move the code changes to HiveUtils.toHiveString

@gatorsmile
Copy link
Member

ok to test

@liufengdb
Copy link

Actually, one more thing, do you need to consider the UDT as one attribute of a structured type? https://github.com/apache/spark/pull/20385/files#diff-842e3447fc453de26c706db1cac8f2c4L467

@SparkQA
Copy link

SparkQA commented Jan 31, 2018

Test build #86887 has finished for PR 20385 at commit e05041f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM

@atallahhezbor Could you submit another PR to address the comment from @liufengdb ?

This fix is nice to have in Spark 2.3. Let merge this now.

Thanks! Merged to master/2.3

@asfgit asfgit closed this in b2e7677 Feb 1, 2018
asfgit pushed a commit that referenced this pull request Feb 1, 2018
… Thriftserver

Signed-off-by: Atallah Hezbor <atallahhezborgmail.com>

## What changes were proposed in this pull request?

This PR proposes modifying the match statement that gets the columns of a row in HiveThriftServer. There was previously no case for `UserDefinedType`, so querying a table that contained them would throw a match error. The changes catch that case and return the string representation.

## How was this patch tested?

While I would have liked to add a unit test, I couldn't easily incorporate UDTs into the ``HiveThriftServer2Suites`` pipeline. With some guidance I would be happy to push a commit with tests.

Instead I did a manual test by loading a `DataFrame` with Point UDT in a spark shell with a HiveThriftServer. Then in beeline, connecting to the server and querying that table.

Here is the result before the change
```
0: jdbc:hive2://localhost:10000> select * from chicago;
Error: scala.MatchError: org.apache.spark.sql.PointUDT2d980dc3 (of class org.apache.spark.sql.PointUDT) (state=,code=0)

```

And after the change:
```
0: jdbc:hive2://localhost:10000> select * from chicago;
+---------------------------------------+--------------+------------------------+---------------------+--+
|                __fid__                | case_number  |          dtg           |        geom         |
+---------------------------------------+--------------+------------------------+---------------------+--+
| 109602f9-54f8-414b-8c6f-42b1a337643e  | 2            | 2016-01-01 19:00:00.0  | POINT (-77 38)      |
| 709602f9-fcff-4429-8027-55649b6fd7ed  | 1            | 2015-12-31 19:00:00.0  | POINT (-76.5 38.5)  |
| 009602f9-fcb5-45b1-a867-eb8ba10cab40  | 3            | 2016-01-02 19:00:00.0  | POINT (-78 39)      |
+---------------------------------------+--------------+------------------------+---------------------+--+
```

Author: Atallah Hezbor <[email protected]>

Closes #20385 from atallahhezbor/udts_over_hive.

(cherry picked from commit b2e7677)
Signed-off-by: gatorsmile <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants