[SPARK-4192][SQL] Internal API for Python UDT #3068
Conversation
Test build #22796 has started for PR 3068 at commit
Test build #22796 has finished for PR 3068 at commit
Test FAILed.
Test build #22798 has started for PR 3068 at commit
Test build #22799 has started for PR 3068 at commit
Test build #22798 has finished for PR 3068 at commit
Test FAILed.
Test build #22799 has finished for PR 3068 at commit
Test FAILed.
@@ -39,6 +39,7 @@
 from array import array
 from operator import itemgetter
 from itertools import imap
+import importlib
importlib is not available in Python 2.6; use `__import__` instead.
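The reviewer's suggestion can be sketched as follows. This is a hypothetical illustration of dynamic class loading with the builtin `__import__` (available in Python 2.6), not Spark's actual code; the `load_class` helper name is made up here.

```python
# Hypothetical sketch: load a class from a dotted module path without
# importlib, for compatibility with Python 2.6 (which lacks
# importlib.import_module).
def load_class(module_name, class_name):
    # __import__ returns the top-level package by default; passing a
    # non-empty fromlist makes it return the leaf module instead.
    module = __import__(module_name, fromlist=[class_name])
    return getattr(module, class_name)

# Example: load json.JSONEncoder dynamically.
encoder_cls = load_class("json", "JSONEncoder")
print(encoder_cls.__name__)  # JSONEncoder
```

This is the standard workaround: `__import__("a.b.c")` alone returns package `a`, so the `fromlist` argument is needed to reach the submodule directly.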
Test build #22802 has started for PR 3068 at commit
@@ -775,11 +954,22 @@ def _verify_type(obj, dataType):
     Traceback (most recent call last):
         ...
     ValueError:...
+    >>> from pyspark.tests import ExamplePoint, ExamplePointUDT
+    >>> _verify_type(ExamplePoint(1.0, 2.0), ExamplePointUDT())
It's better to remove these tests for ExamplePoint; they should be in tests.py (or are already covered).
It is in the same group as the other doctests for this private function. I didn't find one for `_verify_type` in SQLTests.
Your tests in tests.py have covered these internal functions, so I think it's fine to not have them here.
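For readers unfamiliar with the function under discussion, a `_verify_type`-style check can be sketched roughly as below. This is an illustrative simplification, not Spark's implementation; the name `verify_type` and the string type tags are invented for the example.

```python
# Hypothetical sketch of what a _verify_type-style helper does: confirm
# that a Python value matches an expected SQL type, raising ValueError
# on mismatch. Real Spark walks full schemas, including nested and
# user-defined types.
def verify_type(obj, expected_type):
    type_map = {"int": int, "double": float, "string": str}
    py_type = type_map[expected_type]
    if not isinstance(obj, py_type):
        raise ValueError("expected %s, got %r" % (expected_type, obj))

verify_type(1.5, "double")  # passes silently
try:
    verify_type("x", "int")
except ValueError as e:
    print(e)
```

The doctests debated above exercise exactly this kind of pass/raise behavior, which is why covering them once in tests.py suffices.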
Test build #22802 has finished for PR 3068 at commit
Test FAILed.
Force-pushed from f19eb2b to 7c4a6a9 (Compare)
test this please
return False

def _python_to_sql_converter(dataType):
Can `_create_converter` do this?
`_create_converter` doesn't do this. It is used to drop the names if the user provides Row objects, and is called by `_drop_schema`. I think we need to refactor the code a little bit during QA.
Agreed.
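The converter-building pattern being discussed can be sketched as follows. This is a hypothetical simplification of what a `_python_to_sql_converter`-style function does; the `PointUDT` class and the type-dispatch rules are invented for illustration and are not Spark's actual interface.

```python
# Hypothetical sketch: given a data type, build and return a function
# that converts a Python object into its SQL-side representation.
# UDTs are handled by delegating to their serialize method; other
# types pass through (real Spark also handles structs, maps, etc.).
def python_to_sql_converter(data_type):
    if hasattr(data_type, "serialize"):      # user-defined type
        return lambda obj: data_type.serialize(obj)
    if isinstance(data_type, list):          # array type: convert elements
        elem_conv = python_to_sql_converter(data_type[0])
        return lambda obj: [elem_conv(v) for v in obj]
    return lambda obj: obj                   # atomic types pass through

class PointUDT(object):
    def serialize(self, p):
        return [p[0], p[1]]

conv = python_to_sql_converter(PointUDT())
print(conv((1.0, 2.0)))  # [1.0, 2.0]
```

Building the converter once per data type, rather than re-inspecting the type for every row, is the design point that distinguishes this from a naive per-row conversion.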
Test build #22803 has finished for PR 3068 at commit
Test PASSed.
@davies I moved
Test build #22822 has started for PR 3068 at commit
case null => (null, null)
case udt =>
  val split = udt.lastIndexOf(".")
  (udt.substring(0, split), udt.substring(split + 1))
Nit: we could use pyClass without pyModule (similar to `class`) and do the split in Python.
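The reviewer's nit can be illustrated on the Python side like this. The helper name `split_py_class` is made up for the example; the idea is simply that the JVM ships one fully qualified string and Python splits it, instead of the Scala `lastIndexOf`/`substring` logic above.

```python
# Hypothetical sketch: receive a single fully qualified pyClass string
# (e.g. "pyspark.tests.ExamplePointUDT") and split off the module path
# on the Python side, mirroring the Scala substring logic.
def split_py_class(py_class):
    module_name, _, class_name = py_class.rpartition(".")
    return module_name, class_name

print(split_py_class("pyspark.tests.ExamplePointUDT"))
# ('pyspark.tests', 'ExamplePointUDT')
```

`rpartition` splits on the last dot, matching `lastIndexOf(".")` in the Scala snippet, and degrades gracefully (empty module) when no dot is present.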
LGTM, just some minor comments. BTW, serialize/deserialize usually means dumping an object into bytes; could we change UDT.serialize/deserialize to a better name? They actually convert a user-defined object into an object of a Spark SQL type.
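The naming concern can be seen in a minimal UDT sketch. This is an illustrative mock-up, not Spark's actual `UserDefinedType` API: class and method names here (`ExamplePointUDT`, `sql_type`) are stand-ins chosen to show why "serialize" is a slightly misleading name.

```python
# Hypothetical minimal UDT sketch: "serialize" here does not produce
# bytes -- it maps a user object to a value of the underlying SQL type,
# and "deserialize" maps it back.
class ExamplePoint(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

class ExamplePointUDT(object):
    def sql_type(self):
        return "array<double>"           # the SQL-side representation

    def serialize(self, point):
        return [point.x, point.y]        # user object -> SQL value

    def deserialize(self, datum):
        return ExamplePoint(datum[0], datum[1])  # SQL value -> user object

udt = ExamplePointUDT()
p2 = udt.deserialize(udt.serialize(ExamplePoint(1.0, 2.0)))
print(p2.x, p2.y)  # 1.0 2.0
```

The round trip goes user object → SQL value → user object, which is conversion between representations rather than serialization to bytes, hence the reviewer's suggestion to rename.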
Test build #22822 has finished for PR 3068 at commit
Test PASSed.
Note: in order to let Python UDFs work over UDTs, we should make UDTs picklable (registered in Pyrolite, similar to Vector).
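A pure-Python analog of this picklability requirement can be sketched with the stdlib `copyreg` module. This is only an illustration of the registration idea; Spark's actual registration happens in Pyrolite on the JVM side, and the `Vector` class here is a stand-in, not MLlib's.

```python
# Hypothetical analog of the Pyrolite registration idea: register a
# reducer for a user type so its instances can be pickled and cross
# process (or, in Spark's case, Python/JVM) boundaries.
import copyreg
import pickle

class Vector(object):
    def __init__(self, values):
        self.values = list(values)

def _reduce_vector(v):
    # (callable, args) tuple telling pickle how to rebuild the object
    return (Vector, (v.values,))

copyreg.pickle(Vector, _reduce_vector)

v2 = pickle.loads(pickle.dumps(Vector([1.0, 2.0])))
print(v2.values)  # [1.0, 2.0]
```

Registering an explicit reducer keeps the wire format under the library's control, which matters when the bytes must also be understood by a deserializer on the JVM side.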
output sqlType as well
Test build #22832 has started for PR 3068 at commit
Test build #22833 has started for PR 3068 at commit
Test build #22832 has finished for PR 3068 at commit
Test PASSed.
LGTM, let's ship it.
Test build #22833 has finished for PR 3068 at commit
Test PASSed.
LGTM
Following #2919, this PR adds Python UDT (for internal use only) with tests under "pyspark.tests". Before `SQLContext.applySchema`, we check whether we need to convert user-type instances into SQL recognizable data. In the current implementation, a Python UDT must be paired with a Scala UDT for serialization on the JVM side. A following PR will add VectorUDT in MLlib for both Scala and Python.

marmbrus jkbradley davies

Author: Xiangrui Meng <[email protected]>

Closes #3068 from mengxr/SPARK-4192-sql and squashes the following commits:

acff637 [Xiangrui Meng] merge master
dba5ea7 [Xiangrui Meng] only use pyClass for Python UDT output sqlType as well
2c9d7e4 [Xiangrui Meng] move import to global setup; update needsConversion
7c4a6a9 [Xiangrui Meng] address comments
75223db [Xiangrui Meng] minor update
f740379 [Xiangrui Meng] remove UDT from default imports
e98d9d0 [Xiangrui Meng] fix py style
4e84fce [Xiangrui Meng] remove local hive tests and add more tests
39f19e0 [Xiangrui Meng] add tests
b7f666d [Xiangrui Meng] add Python UDT

(cherry picked from commit 04450d1)
Signed-off-by: Xiangrui Meng <[email protected]>
Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map a RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file. Examples in Scala/Python are attached. The Scala code was copied from jkbradley. ~~This PR contains the changes from #3068. I will rebase after #3068 is merged.~~

marmbrus jkbradley

Author: Xiangrui Meng <[email protected]>

Closes #3070 from mengxr/SPARK-3573 and squashes the following commits:

3a0b6e5 [Xiangrui Meng] organize imports
236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples

(cherry picked from commit 1a9c6cd)
Signed-off-by: Xiangrui Meng <[email protected]>