[SPARK-3573][MLLIB] Make MLlib's Vector compatible with SQL's SchemaRDD #3070
Conversation
Test build #22805 has started for PR 3070 at commit
Test build #22805 has finished for PR 3070 at commit
Test FAILed.
Test build #22806 has started for PR 3070 at commit
    @@ -46,6 +46,11 @@
          <version>${project.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-sql_${scala.binary.version}</artifactId>
This still feels weird to me, MLlib depending on SQL. It seems like they both want to depend on a SchemaRDD that is specific to neither. I'm afraid of making the jar hell in Spark worse by attaching more subprojects together. That said, the SQL module itself doesn't, for instance, bring in Hive. Is this going to add much to the MLlib deps? Or can the commonality be factored out into Core?
@srowen Yes, it feels weird if we say ML depends on SQL, the "query language". Spark SQL provides RDDs with schema support and execution plan optimization, both of which are needed by MLlib. We need flexible table-like datasets and I/O support, and operations that "carry over" additional columns during the training phase. It is natural to say that ML depends on RDDs with schema support and execution plan optimization.
I agree that we should factor the common part out or make SchemaRDD a first-class citizen in Core, but that definitely takes time for both design and development. This dependency change has no effect on the content we deliver to users, and UDTs are internal to Spark.
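[Editor's note: to make the "carry over" point concrete, here is a minimal sketch against the Spark 1.2-era API; the data values and variable names are illustrative, not taken from the PR itself, and it assumes an existing SparkContext named sc.]

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.sql.{Row, SQLContext, SchemaRDD}

    val sqlContext = new SQLContext(sc)
    import sqlContext._ // implicit RDD-to-SchemaRDD conversion and the column DSL

    val points = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.1, 0.2)),
      LabeledPoint(0.0, Vectors.sparse(2, Array(1), Array(0.4)))))

    // With Vector registered as a UDT, a case-class RDD picks up a schema,
    // and a projection on one column leaves the others available for later.
    val dataset: SchemaRDD = points
    val labels = dataset.select('label).map { case Row(label: Double) => label }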
I think it would be pretty difficult to have a SchemaRDD that didn't at least depend on Catalyst, and even then there would be no way to execute the projections and structured data input/output that MLlib wants. I think the problem might really be in naming. Catalyst / Spark SQL core are really about manipulating structured data using Spark, and we actually considered not even having SQL in the name (unfortunately, Spark Schema doesn't have the same ring to it).
The SQL project has already been carefully factored into pieces to minimize the number of dependencies, so I believe the only additional dependency we are bringing in here is Parquet (which is kind of the point of this example).
Test build #22806 has finished for PR 3070 at commit
Test PASSed.
    val meanLabel = labels.fold(0.0)(_ + _) / numLabels
    println(s"Selected label column with average value $meanLabel")

    val featuresSchemaRDD: SchemaRDD = origData.select('features)
What's the right way to select a column within "features"?
Either of the following is okay: select("features".attr) or select('features).
Does this also work for an arbitrary column name? I.e., if I am taking in the features column name as a command-line argument, how would it look?
select(colName.attr) works if colName is a String. The column name needs to be legal for SQL/Catalyst.
When using the DSL, as we are in this example, any String column name is legal. The SQL/HiveQL parsers are a little more restrictive about what they consider legal, but with backticks you can access just about anything.
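[Editor's note: a hedged sketch of the runtime-name case discussed above; the helper name selectColumn is hypothetical, and it assumes the same import sqlContext._ DSL implicits as the example under review.]

    // `colName.attr` lifts an arbitrary String into an attribute reference;
    // "features".attr and 'features resolve to the same column.
    def selectColumn(data: SchemaRDD, colName: String): SchemaRDD =
      data.select(colName.attr)

    val byName = selectColumn(origData, "features") // same as origData.select('features)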
LGTM, though I'll depend on @davies for feedback on the Python API on the other PR [https://github.com//pull/3068]
Test build #22858 has started for PR 3070 at commit
Test FAILed.
Test build #22859 has started for PR 3070 at commit
Test build #22858 has finished for PR 3070 at commit
Test PASSed.
Test build #22863 has started for PR 3070 at commit
Test build #22859 has finished for PR 3070 at commit
Test PASSed.
    def summarize(dataset):
        print "schema: %s" % dataset.schema().json()
dataset.print_schema() will be better.
dataset.printSchema() doesn't output JSON, which contains more information:
    {
      "type" : "struct",
      "fields" : [ {
        "name" : "label",
        "type" : "double",
        "nullable" : false,
        "metadata" : { }
      }, {
        "name" : "features",
        "type" : {
          "type" : "udt",
          "class" : "org.apache.spark.mllib.linalg.VectorUDT",
          "pyClass" : "pyspark.mllib.linalg.VectorUDT",
          "sqlType" : {
            "type" : "struct",
            "fields" : [ {
              "name" : "type",
              "type" : "byte",
              "nullable" : false,
              "metadata" : { }
            }, {
              "name" : "size",
              "type" : "integer",
              "nullable" : true,
              "metadata" : { }
            }, {
              "name" : "indices",
              "type" : {
                "type" : "array",
                "elementType" : "integer",
                "containsNull" : false
              },
              "nullable" : true,
              "metadata" : { }
            }, {
              "name" : "values",
              "type" : {
                "type" : "array",
                "elementType" : "double",
                "containsNull" : false
              },
              "nullable" : true,
              "metadata" : { }
            } ]
          }
        },
        "nullable" : true,
        "metadata" : { }
      } ]
    }
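[Editor's note: for reference, the same contrast on the Scala side; a sketch assuming `dataset` is a SchemaRDD carrying the schema quoted above.]

    dataset.printSchema()        // human-readable tree; omits the UDT details
    println(dataset.schema.json) // full JSON, as quoted above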
Just checked the updated storage format for dense/sparse & the new test. LGTM
Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map an RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file. Examples in Scala/Python are attached. The Scala code was copied from jkbradley.

~~This PR contains the changes from #3068. I will rebase after #3068 is merged.~~ marmbrus jkbradley

Author: Xiangrui Meng <[email protected]>

Closes #3070 from mengxr/SPARK-3573 and squashes the following commits:

3a0b6e5 [Xiangrui Meng] organize imports
236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples

(cherry picked from commit 1a9c6cd)
Signed-off-by: Xiangrui Meng <[email protected]>
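[Editor's note: a hedged sketch of the Parquet round trip the description mentions, continuing from the earlier sketch; the path and variable names are illustrative.]

    // The Vector column survives the round trip because its UDT tells
    // Spark SQL how to store and rebuild it.
    dataset.saveAsParquetFile("/tmp/points.parquet")
    val loaded: SchemaRDD = sqlContext.parquetFile("/tmp/points.parquet")
    loaded.printSchema()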
Thanks all for reviewing the code! I've merged this into master and branch-1.2.
Test build #22863 has finished for PR 3070 at commit
Test PASSed.