[SPARK-4192][SQL] Internal API for Python UDT #3068
Conversation
Test build #22796 has started for PR 3068 at commit
Test build #22796 has finished for PR 3068 at commit
Test FAILed.
Test build #22798 has started for PR 3068 at commit
Test build #22799 has started for PR 3068 at commit
Test build #22798 has finished for PR 3068 at commit
Test FAILed.
Test build #22799 has finished for PR 3068 at commit
Test FAILed.
@@ -39,6 +39,7 @@
 from array import array
 from operator import itemgetter
 from itertools import imap
+import importlib
importlib is not available in Python 2.6; use `__import__` instead.
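The reviewer's suggestion can be sketched as follows. This is a hypothetical illustration of dynamic class loading with the builtin `__import__` (available in Python 2.6), not Spark's actual code; the `load_class` helper name is made up here.

```python
# Hypothetical sketch: load a class from a dotted module path without
# importlib, for compatibility with Python 2.6 (which lacks
# importlib.import_module).
def load_class(module_name, class_name):
    # __import__ returns the top-level package by default; passing a
    # non-empty fromlist makes it return the leaf module instead.
    module = __import__(module_name, fromlist=[class_name])
    return getattr(module, class_name)

# Example: load json.JSONEncoder dynamically.
encoder_cls = load_class("json", "JSONEncoder")
print(encoder_cls.__name__)  # JSONEncoder
```

This is the standard workaround: `__import__("a.b.c")` alone returns package `a`, so the `fromlist` argument is needed to reach the submodule directly.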
Test build #22802 has started for PR 3068 at commit
@@ -775,11 +954,22 @@ def _verify_type(obj, dataType):
     Traceback (most recent call last):
         ...
     ValueError:...
+    >>> from pyspark.tests import ExamplePoint, ExamplePointUDT
+    >>> _verify_type(ExamplePoint(1.0, 2.0), ExamplePointUDT())
It's better to remove these tests for ExamplePoint; they should be in tests.py (or are already covered).
It is in the same group as the other doctests for this private function. I didn't find one for `_verify_type` in SQLTests.
Your tests in tests.py have covered these internal functions, so I think it's fine to not have them here.
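For readers unfamiliar with the function under discussion, a `_verify_type`-style check can be sketched roughly as below. This is an illustrative simplification, not Spark's implementation; the name `verify_type` and the string type tags are invented for the example.

```python
# Hypothetical sketch of what a _verify_type-style helper does: confirm
# that a Python value matches an expected SQL type, raising ValueError
# on mismatch. Real Spark walks full schemas, including nested and
# user-defined types.
def verify_type(obj, expected_type):
    type_map = {"int": int, "double": float, "string": str}
    py_type = type_map[expected_type]
    if not isinstance(obj, py_type):
        raise ValueError("expected %s, got %r" % (expected_type, obj))

verify_type(1.5, "double")  # passes silently
try:
    verify_type("x", "int")
except ValueError as e:
    print(e)
```

The doctests debated above exercise exactly this kind of pass/raise behavior, which is why covering them once in tests.py suffices.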
Test build #22802 has finished for PR 3068 at commit
Test FAILed.
Force-pushed from f19eb2b to 7c4a6a9 (Compare)
test this please
return False

def _python_to_sql_converter(dataType):
Can `_create_converter` do this?
`_create_converter` doesn't do this. It is used to drop the names if the user provides Row objects, and is called by `_drop_schema`. I think we need to refactor the code a little bit during QA.
Agreed.
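The converter-building pattern being discussed can be sketched as follows. This is a hypothetical simplification of what a `_python_to_sql_converter`-style function does; the `PointUDT` class and the type-dispatch rules are invented for illustration and are not Spark's actual interface.

```python
# Hypothetical sketch: given a data type, build and return a function
# that converts a Python object into its SQL-side representation.
# UDTs are handled by delegating to their serialize method; other
# types pass through (real Spark also handles structs, maps, etc.).
def python_to_sql_converter(data_type):
    if hasattr(data_type, "serialize"):      # user-defined type
        return lambda obj: data_type.serialize(obj)
    if isinstance(data_type, list):          # array type: convert elements
        elem_conv = python_to_sql_converter(data_type[0])
        return lambda obj: [elem_conv(v) for v in obj]
    return lambda obj: obj                   # atomic types pass through

class PointUDT(object):
    def serialize(self, p):
        return [p[0], p[1]]

conv = python_to_sql_converter(PointUDT())
print(conv((1.0, 2.0)))  # [1.0, 2.0]
```

Building the converter once per data type, rather than re-inspecting the type for every row, is the design point that distinguishes this from a naive per-row conversion.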
Test build #22803 has finished for PR 3068 at commit
Test PASSed.
@davies I moved
Test build #22822 has started for PR 3068 at commit
case null => (null, null)
case udt =>
  val split = udt.lastIndexOf(".")
  (udt.substring(0, split), udt.substring(split + 1))
Nit: we could use pyClass without pyModule (similar to `class`) and do the split in Python.
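The reviewer's nit can be illustrated on the Python side like this. The helper name `split_py_class` is made up for the example; the idea is simply that the JVM ships one fully qualified string and Python splits it, instead of the Scala `lastIndexOf`/`substring` logic above.

```python
# Hypothetical sketch: receive a single fully qualified pyClass string
# (e.g. "pyspark.tests.ExamplePointUDT") and split off the module path
# on the Python side, mirroring the Scala substring logic.
def split_py_class(py_class):
    module_name, _, class_name = py_class.rpartition(".")
    return module_name, class_name

print(split_py_class("pyspark.tests.ExamplePointUDT"))
# ('pyspark.tests', 'ExamplePointUDT')
```

`rpartition` splits on the last dot, matching `lastIndexOf(".")` in the Scala snippet, and degrades gracefully (empty module) when no dot is present.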
LGTM, just some minor comments. BTW, serialize/deserialize usually means dumping an object into bytes; could we change UDT.serialize/deserialize to a better name? They actually convert a user-defined object into an object of a Spark SQL type.
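The naming concern can be seen in a minimal UDT sketch. This is an illustrative mock-up, not Spark's actual `UserDefinedType` API: class and method names here (`ExamplePointUDT`, `sql_type`) are stand-ins chosen to show why "serialize" is a slightly misleading name.

```python
# Hypothetical minimal UDT sketch: "serialize" here does not produce
# bytes -- it maps a user object to a value of the underlying SQL type,
# and "deserialize" maps it back.
class ExamplePoint(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

class ExamplePointUDT(object):
    def sql_type(self):
        return "array<double>"           # the SQL-side representation

    def serialize(self, point):
        return [point.x, point.y]        # user object -> SQL value

    def deserialize(self, datum):
        return ExamplePoint(datum[0], datum[1])  # SQL value -> user object

udt = ExamplePointUDT()
p2 = udt.deserialize(udt.serialize(ExamplePoint(1.0, 2.0)))
print(p2.x, p2.y)  # 1.0 2.0
```

The round trip goes user object → SQL value → user object, which is conversion between representations rather than serialization to bytes, hence the reviewer's suggestion to rename.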
Test build #22822 has finished for PR 3068 at commit
Test PASSed.
Note: in order to let Python UDFs work over UDTs, we should make UDTs picklable (registered in Pyrolite, similar to Vector).
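A pure-Python analog of this picklability requirement can be sketched with the stdlib `copyreg` module. This is only an illustration of the registration idea; Spark's actual registration happens in Pyrolite on the JVM side, and the `Vector` class here is a stand-in, not MLlib's.

```python
# Hypothetical analog of the Pyrolite registration idea: register a
# reducer for a user type so its instances can be pickled and cross
# process (or, in Spark's case, Python/JVM) boundaries.
import copyreg
import pickle

class Vector(object):
    def __init__(self, values):
        self.values = list(values)

def _reduce_vector(v):
    # (callable, args) tuple telling pickle how to rebuild the object
    return (Vector, (v.values,))

copyreg.pickle(Vector, _reduce_vector)

v2 = pickle.loads(pickle.dumps(Vector([1.0, 2.0])))
print(v2.values)  # [1.0, 2.0]
```

Registering an explicit reducer keeps the wire format under the library's control, which matters when the bytes must also be understood by a deserializer on the JVM side.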
output sqlType as well
Test build #22832 has started for PR 3068 at commit
Test build #22833 has started for PR 3068 at commit
Test build #22832 has finished for PR 3068 at commit
Test PASSed.
LGTM, let's ship it.
Test build #22833 has finished for PR 3068 at commit
Test PASSed.
LGTM
Following #2919, this PR adds Python UDT (for internal use only) with tests under "pyspark.tests". Before `SQLContext.applySchema`, we check whether we need to convert user-type instances into SQL recognizable data. In the current implementation, a Python UDT must be paired with a Scala UDT for serialization on the JVM side. A following PR will add VectorUDT in MLlib for both Scala and Python.

marmbrus jkbradley davies

Author: Xiangrui Meng <[email protected]>

Closes #3068 from mengxr/SPARK-4192-sql and squashes the following commits:

acff637 [Xiangrui Meng] merge master
dba5ea7 [Xiangrui Meng] only use pyClass for Python UDT output sqlType as well
2c9d7e4 [Xiangrui Meng] move import to global setup; update needsConversion
7c4a6a9 [Xiangrui Meng] address comments
75223db [Xiangrui Meng] minor update
f740379 [Xiangrui Meng] remove UDT from default imports
e98d9d0 [Xiangrui Meng] fix py style
4e84fce [Xiangrui Meng] remove local hive tests and add more tests
39f19e0 [Xiangrui Meng] add tests
b7f666d [Xiangrui Meng] add Python UDT

(cherry picked from commit 04450d1)
Signed-off-by: Xiangrui Meng <[email protected]>
Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map a RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file. Examples in Scala/Python are attached. The Scala code was copied from jkbradley. ~~This PR contains the changes from #3068. I will rebase after #3068 is merged.~~

marmbrus jkbradley

Author: Xiangrui Meng <[email protected]>

Closes #3070 from mengxr/SPARK-3573 and squashes the following commits:

3a0b6e5 [Xiangrui Meng] organize imports
236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples

(cherry picked from commit 1a9c6cd)
Signed-off-by: Xiangrui Meng <[email protected]>