-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-11622][MLLIB] Make LibSVMRelation extends HadoopFsRelation and… #9595
Conversation
Test build #45516 has finished for PR 9595 at commit
|
cc @Lewuathe |
StructField("label", DoubleType, nullable = false) :: | ||
StructField("features", new VectorUDT(), nullable = false) :: Nil | ||
) | ||
extends HadoopFsRelation with Logging with Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really necessary to mixin Logging
trait here? HadoopFsRelation
already does it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, will correct that
Test build #45593 has finished for PR 9595 at commit
|
@Lewuathe Would you mind help review this ? Thanks |
import org.apache.spark.mllib.linalg.{Vector, VectorUDT} | ||
import org.apache.spark.mllib.util.MLUtils | ||
import org.apache.spark.rdd.RDD | ||
import org.apache.spark.sql.{DataFrameReader, DataFrame, Row, SQLContext} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to change to under score? We can keep this.
import org.apache.spark.sql.{DataFrameReader, DataFrame, Row, SQLContext}
Test build #46952 has finished for PR 9595 at commit
|
@zjffdu LGTM. Could you create another JIRA for supporting multiple input path on |
Sure, create SPARK-12086 for that |
: BaseRelation = { | ||
val path = parameters.getOrElse("path", | ||
throw new IllegalArgumentException("'path' must be specified")) | ||
override def createRelation(sqlContext: SQLContext, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chop down arguments and use 4-space indentation
@zjffdu Sorry for the delay! I made a pass and left some comments inline. You also need to rebase master to resolve conflicts. Please let me know whether you have time to update this PR. Btw, another follow-up work would be exposing options to format the output values. Now we use default format, which outputs 16 digits per double value. It might be too long for common use cases. Could you create a JIRA for this? Thanks! |
Thanks @mengxr for review, will update the patch and create a followup jira. |
df.write.save(writepath) | ||
|
||
val df2 = sqlContext.read.format("libsvm").load(writepath) | ||
val row1 = df.first() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is bug of test code, I am verifying df rather than df2 so that the test passed
Test build #49797 has finished for PR 9595 at commit
|
It's weird that I pass the scala style check in my local box. |
import com.google.common.base.Objects | ||
import org.apache.hadoop.fs.{Path, FileStatus} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know we are checking the ordering of imports now. F
should come before P
. You can use Scala Import Organizer with IntelliJ to organize imports quickly.
@mengxr any way to retrigger the build ? BTW what do you mean "exposing options to format the output values", is there a new feature of Spark for encoding double using compact way ? I may miss something here. |
test this please |
LIBSVM is a text format and hence we need to consider the cost of storing numerical values. In the current implementation, the output could be some text like |
test this please |
Test build #49813 has finished for PR 9595 at commit
|
Test build #49814 has finished for PR 9595 at commit
|
Thanks for clarifying. In that case, we may lose precision when reading. Maybe make it for libsvm specific is better, anyway we can discuss it in the jira. |
import org.apache.hadoop.fs.{FileStatus, Path} | ||
import org.apache.hadoop.io.{NullWritable, Text} | ||
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat | ||
import org.apache.hadoop.mapreduce.{RecordWriter, TaskAttemptContext} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{
should come before l
. Please try Scala Import Organizer:) This is what I got:
import java.io.IOException
import com.google.common.base.Objects
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.{RecordWriter, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.annotation.Since
import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, DataFrameReader, Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._
Test build #49822 has finished for PR 9595 at commit
|
test this please |
Test build #49831 has finished for PR 9595 at commit
|
Test build #49851 has finished for PR 9595 at commit
|
Test build #49853 has finished for PR 9595 at commit
|
Test build #49857 has finished for PR 9595 at commit
|
test this please |
Test build #50007 has finished for PR 9595 at commit
|
Thanks @mengxr. Not sure why the test fails. Will take a look at it when I have time. |
The failed test is irrelevant to this PR, which is tracked here: https://issues.apache.org/jira/browse/SPARK-10086. I will ask Jenkins to make another try. |
test this please |
Test build #50032 has finished for PR 9595 at commit
|
Let's wait for #10909 first. |
test this please |
Test build #50078 has finished for PR 9595 at commit
|
LGTM. Merged into master. Thanks! |
… Add LibSVMOutputWriter
The behavior of LibSVMRelation is not changed except adding LibSVMOutputWriter