-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-11237][ML] Add pmml export for k-means in Spark ML #20907
[SPARK-11237][ML] Add pmml export for k-means in Spark ML #20907
Conversation
Test build #88599 has finished for PR 20907 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG, one question
// Save model data: cluster centers | ||
val data: Array[ClusterData] = instance.clusterCenters.zipWithIndex.map { | ||
case (center, idx) => | ||
ClusterData(idx, center) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't this type change Data
-> ClusterData
change the schema of the output parquet file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure. I'll manually test we can load the old format first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait no this shouldn't change anything, were saving this with a DataFrame and the schema is the same.
See the schema from 1: res3: org.apache.spark.sql.types.StructType = StructType(StructField(clusterIdx,IntegerType,false), StructField(clusterCenter,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
and the new one org.apache.spark.sql.types.StructType = StructType(StructField(clusterIdx,IntegerType,false), StructField(clusterCenter,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
// model. | ||
val pmmlClusteringModel = pmml.getModels.get(0).asInstanceOf[ClusteringModel] | ||
assert(pmmlClusteringModel.getNumberOfClusters === clusterCenters.length) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this missing a call to testPMMLWrite
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah :( Thanks for catching that.
Test build #88766 has finished for PR 20907 at commit
|
Test build #88823 has finished for PR 20907 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, is this still a wip? Just noticed you changed the title, so is this ready?
Yup, this is ready. If you think its good feel free to merge it otherwise I'll merge it during my regular Friday review time :) |
LGTM |
One question I have is, how do users know if a model (e.g. KMeansModel after this change) supports pmml & internal formats without looking into source code? I did a search on the current docs, but didn't find any info. |
LGTM too! +1 on the documentation which can be a followup PR. Merged into master, and thanks. DB Tsai | Siri Open Source Technologies | Apple, Inc |
What changes were proposed in this pull request?
Adding PMML export to Spark ML's KMeans Model.
How was this patch tested?
New unit test for Spark ML PMML export based on the old Spark MLlib unit test.