Support reading SparseVectors and Vectors #425

abditag2 · 2019-09-05T06:24:53Z

It would be very useful if petastorm could read these two data types from HDFS.

selitvin · 2019-09-10T06:21:32Z

Agreed. We can try to add this. Not sure about the time frame, though.

mengxr · 2019-12-06T17:36:25Z

Efficient conversion requires Scala UDFs. Maybe we should add utility methods to Spark so in petastorm we can do the following:

from pyspark.sql.functions import col
from pyspark.ml.functions import vector_to_dense_array  # proposed utility, name not final
df.select(vector_to_dense_array(col("features")).alias("features"))

This approach doesn't require Scala code in petastorm. Created a Spark JIRA: https://issues.apache.org/jira/browse/SPARK-30154.
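
To illustrate what such a conversion does to a sparse vector, here is a minimal plain-Python sketch. It is not Spark's implementation: `sparse_to_dense` and its parameters are hypothetical names, and it assumes the sparse vector is given as a size plus parallel lists of indices and values, with every unlisted position treated as zero.

```python
from typing import List, Sequence


def sparse_to_dense(size: int, indices: Sequence[int], values: Sequence[float]) -> List[float]:
    """Expand a sparse vector given as (size, indices, values) into a dense list.

    Hypothetical helper for illustration only; not part of Spark or petastorm.
    """
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size  # positions not listed in `indices` stay zero
    for i, v in zip(indices, values):
        if not 0 <= i < size:
            raise ValueError(f"index {i} out of range for size {size}")
        dense[i] = float(v)
    return dense


# A 4-element vector with non-zero entries at positions 0 and 3.
print(sparse_to_dense(4, [0, 3], [1.0, 2.5]))  # prints [1.0, 0.0, 0.0, 2.5]
```

Doing this per row in a Python UDF would be slow on large datasets, which is why the conversion is better done on the JVM side.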

mengxr · 2020-01-07T01:19:12Z

FYI. The UDF was merged into Spark master: apache/spark#26910
