[SPARK-30154][ML] PySpark UDF to convert MLlib vectors to dense arrays #26910

Closed
wants to merge 9 commits

Conversation

@WeichenXu123
Contributor

WeichenXu123 commented Dec 16, 2019

What changes were proposed in this pull request?

PySpark UDF to convert MLlib vectors to dense arrays.
Example:

from pyspark.sql.functions import col
from pyspark.ml.functions import vector_to_array

df.select(vector_to_array(col("features")))
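
For context, a fuller self-contained sketch (the session setup, vectors, and column names here are illustrative, not from the PR):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.linalg import Vectors
from pyspark.ml.functions import vector_to_array

spark = SparkSession.builder.getOrCreate()

# One dense and one sparse MLlib vector; both become plain arrays of doubles.
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 2.0, 3.0]),),
     (Vectors.sparse(3, [0, 2], [1.0, 3.0]),)],
    ["features"])

df.select(vector_to_array(col("features")).alias("features_arr")).show()
# Expected output (sparse entries materialize as zeros):
# +---------------+
# |   features_arr|
# +---------------+
# |[1.0, 2.0, 3.0]|
# |[1.0, 0.0, 3.0]|
# +---------------+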

Why are the changes needed?

If a PySpark user wants to convert MLlib sparse/dense vectors in a DataFrame into dense arrays, the efficient approach is to do the conversion in the JVM. However, that requires the PySpark user to write Scala code and register it as a UDF, which is often infeasible for a pure-Python project.
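
For comparison, the pure-Python workaround available today looks roughly like this (a sketch; the column names are illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# Pure-Python fallback: each Vector is serialized into the Python worker and
# converted row by row, which is much slower than doing it in the JVM.
to_array = udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))
df = df.withColumn("features_arr", to_array("features"))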

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

@SparkQA

SparkQA commented Dec 16, 2019

Test build #115396 has finished for PR 26910 at commit 794a10b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 16, 2019

Test build #115401 has finished for PR 26910 at commit afc71af.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 17, 2019

Test build #115424 has finished for PR 26910 at commit e2bb6c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr left a comment

@WeichenXu123 Made another pass. I'm a little concerned about the data conversion from JVM to Python, in particular what happens if I use a Pandas UDF to wrap the vector_to_array function. Say:

df = spark.read...

@pandas_udf("int")
def predict(batch):
  # expect batch to be a pd.Series of numpy arrays here
  ...
  return preds

predictions = df.select(predict(vector_to_array(col("features"))))

predictions.write....

Does the array data get boxed somewhere in the conversion path? I assume not, but we should double-check. Could you either profile the JVM or verify the code path?

cc: @HyukjinKwon
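
One cheap way to verify from the Python side what each batch element actually is, reusing the df and features column from the example above (a sketch; this checks Python-side types only, not boxing inside the JVM):

import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.ml.functions import vector_to_array

@pandas_udf("string")
def inspect(batch: pd.Series) -> pd.Series:
    # With Arrow enabled, each element should arrive as a numpy.ndarray of
    # float64 rather than a boxed object list.
    return batch.map(lambda arr: type(arr).__name__)

df.select(inspect(vector_to_array(col("features")))).show()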

case v: OldVector => v.toArray
case _ => throw new IllegalArgumentException(
  "function vector_to_array require an argument of type " +
  "`org.apache.spark.ml.linalg.Vector` or `org.apache.spark.mllib.linalg.Vector`.")
Contributor

Mention input type (or null) in the error message.

Contributor

I mean including null or vec.getClass.getName in the error message to help debugging.

Contributor

Please also add a test for the error message.
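
A hypothetical sketch of such a test from the PySpark side (the actual test in the PR may live on the Scala side and differ; the pytest fixture and names here are assumptions):

import pytest
from pyspark.sql.functions import col
from pyspark.ml.functions import vector_to_array

def test_vector_to_array_rejects_non_vector(spark):
    # `spark` is assumed to be a fixture providing a SparkSession.
    df = spark.createDataFrame([(1,)], ["not_a_vector"])
    with pytest.raises(Exception) as exc_info:
        df.select(vector_to_array(col("not_a_vector"))).collect()
    # Per the review comments, the message should name the offending input.
    assert "vector_to_array" in str(exc_info.value)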

@HyukjinKwon
Member

@cloud-fan, I know UDT became private and we plan to redesign it later.
However, what about allowing a UDT to be cast to its own sqlType? Or do you know why we don't allow this case?

scala> val df = Seq(Tuple1(org.apache.spark.ml.linalg.Vectors.dense(1.0, 2.0, 3.0))).toDF("vec")
df: org.apache.spark.sql.DataFrame = [vec: vector]

scala> df.selectExpr("cast(vec as string)").show()
+-------------+
|          vec|
+-------------+
|[1.0,2.0,3.0]|
+-------------+

scala> df.selectExpr("cast(vec as struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)").show()
org.apache.spark.sql.AnalysisException: cannot resolve '`vec`' due to data type mismatch: cannot cast struct<type:tinyint,size:int,indices:array<int>,values:array<double>> to struct<type:tinyint,size:int,indices:array<int>,values:array<double>>; line 1 pos 0;
'Project [unresolvedalias(cast(vec#74 as struct<type:tinyint,size:int,indices:array<int>,values:array<double>>), None)]
+- Project [_1#71 AS vec#74]
   +- LocalRelation [_1#71]

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:146)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:137)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:310)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)

Currently, a UDT can be cast to string but not to its own sqlType (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala#L88-L99).

It's internally an InternalRow, so I think it's fine to allow this case for now.

@WeichenXu123
Contributor Author

Quote from @mengxr

what happens if I use a Pandas UDF to wrap the vector_to_array function? Does the array data get boxed somewhere in the conversion path?

If this is true, I think it would be an array-type issue, not something specific to my vector_to_array function. Isn't it? @HyukjinKwon

@HyukjinKwon
Member

HyukjinKwon commented Dec 18, 2019

I don't think such things happen anyway. The problem is the UDF itself, as it needs to convert the Catalyst type to the Scala type and back, which is pretty slow.

@SparkQA

SparkQA commented Dec 18, 2019

Test build #115495 has finished for PR 26910 at commit 5aacfbc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123
Contributor Author

@HyukjinKwon @mengxr @srowen Any more comments? Thanks :)

@HyukjinKwon
Member

HyukjinKwon commented Dec 19, 2019

Once we allow #26910 (comment), I think we won't need this function. WDYT @cloud-fan?

@mengxr
Contributor

mengxr commented Dec 19, 2019

@HyukjinKwon I don't think the UDT change should block this PR. Even if we could cast to the sqlType, it would still be tedious for a PySpark user to do this simple conversion. And Pandas UDFs don't support nested columns, so users would need to move the nested columns to the top level.
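
To illustrate the nested-column point, a hypothetical sketch (assuming a cast-to-sqlType route existed and produced a struct column named vec_struct; the names are invented):

from pyspark.sql.functions import col

# Per the comment above, Pandas UDFs don't take nested struct columns, so the
# struct fields would have to be promoted to top-level columns first:
flat = df.select(col("vec_struct.values").alias("values"))
predictions = flat.select(predict(col("values")))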

@mengxr
Contributor

mengxr commented Dec 19, 2019

@WeichenXu123 You haven't addressed the comment about running the doctests yet.

@HyukjinKwon
Member

HyukjinKwon commented Dec 19, 2019

@mengxr, sure. Once we allow cast and extraction from a UDT directly (e.g., vector.values), we can deprecate and remove this API later. I don't mind adding this first, because supporting cast and extraction against UDTs would probably be a big job.

@SparkQA

SparkQA commented Dec 19, 2019

Test build #115554 has finished for PR 26910 at commit a41d01a.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 19, 2019

Test build #115565 has finished for PR 26910 at commit 22865e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 21, 2019

Test build #115639 has finished for PR 26910 at commit d257dce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit closed this in 88542bc on Jan 7, 2020
@mengxr
Contributor

mengxr commented Jan 7, 2020

LGTM. Merged into master. Thanks!
