[Spark-7983] [MLlib] Add require for one-based indices in loadLibSVMFile #6538

hhbyyh · 2015-05-31T11:23:40Z

jira: https://issues.apache.org/jira/browse/SPARK-7983

Customers frequently use zero-based indices in their LIBSVM files. Spark reports no warnings or errors during the subsequent computation, and this usually leads to weird results for many algorithms (like GBDT).

Add a quick check.

srowen · 2015-05-31T12:37:33Z

mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala

@@ -82,6 +82,8 @@ object MLUtils {
          val value = indexAndValue(1).toDouble
          (index, value)
        }.unzip
+        require(indices.size == 0 || indices(0) >= 0,


Don't you mean >= 1?

Thanks for the review. I think the index has already been converted to zero-based on line 81.

Ah right, missed that in the fold. Would it make more sense to check the value where it's read rather than after the -1?

I'm afraid adding it to line 81 will raise a performance concern.

Just another conditional? That seems vanishingly small compared to the other work done here. The current check assumes that the line is in correct libsvm format and that the indices are in ascending order. The code doesn't actually rely on this, though, and works even with missorted indices. However, the current check wouldn't work if "0:..." occurred later in the line.

That's true.
And if we add the one-based check inside the iteration over each index-value pair, do you think we should also check the ascending order?
I thought the original code avoided it intentionally for performance.

I think the sparse vector constructor doesn't verify its input since it's created so frequently. But this seems like a reasonable place to check input, when loading from libsvm. I was mistaken; this will silently fail if the libsvm format is wrong and the indices aren't sorted. So I suppose your original check is fine, but only if the ordering of the indices is also checked.
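To illustrate the point about "0:..." occurring later in the line, here is a minimal sketch contrasting the two placements of the check. `ReadSiteCheck`, `firstIndexOnly`, and `parseChecked` are hypothetical names used for illustration only; they are not part of `MLUtils`, and the parsing is reduced to the index half of each `index:value` item.

```scala
object ReadSiteCheck {
  // Mirrors the first version of the patch: only the first index is
  // inspected, after the one-based to zero-based shift.
  def firstIndexOnly(shifted: Array[Int]): Boolean =
    shifted.length == 0 || shifted(0) >= 0

  // Hypothetical alternative: validate each raw index where it is read,
  // before subtracting 1.
  def parseChecked(items: Array[String]): Array[Int] =
    items.map { item =>
      val rawIndex = item.split(':')(0).toInt
      require(rawIndex >= 1, s"indices should be one-based, found $rawIndex")
      rawIndex - 1
    }
}
```

With the items `"1:0.5"` and `"0:2.0"`, the shifted indices are `0` and `-1`, so `firstIndexOnly` accepts the line while `parseChecked` rejects it. Neither variant verifies ordering, which is why an ascending-order check is discussed next.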

SparkQA · 2015-05-31T13:15:11Z

Test build #33851 has finished for PR 6538 at commit 5bd1f9a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2015-05-31T13:17:50Z

In fact I'd make a test?

hhbyyh · 2015-05-31T13:39:44Z

@srowen Great idea. I will add a unit test.

SparkQA · 2015-05-31T17:12:55Z

Test build #33855 has finished for PR 6538 at commit 9956365.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2015-06-01T13:53:11Z

Test build #33887 has finished for PR 6538 at commit 6e4f8ca.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

srowen · 2015-06-01T21:27:59Z

mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala

@@ -82,6 +82,19 @@ object MLUtils {
          val value = indexAndValue(1).toDouble
          (index, value)
        }.unzip
+
+        // check if indices is one-based and in ascending order


indices is -> indices are
You can also use require I suppose.
Haven't we lost the check for an index of 0 now, though?

You could write this check in one line as something like
require(indices.sliding(2).forall(p => p(0) < p(1)))

indices(0) is compared with -1 in the current implementation.
Your way looks much more compact. My only concern is whether it will create many small Iterators. I'll run a benchmark and get back to you. Thanks for the great suggestion.

There seems to be some performance difference.

    val indices = 1 to 200000000

    var start = System.nanoTime()
    var previous = -1
    var i = 0
    val indicesLength = indices.size
    while (i < indicesLength) {
      if (indices(i) <= previous) {
        throw new IllegalArgumentException("indices should be one-based and in ascending order")
      }
      previous = indices(i)
      i += 1
    }
    println("while: " + (System.nanoTime() - start).toDouble / 1e9)

    start = System.nanoTime()
    val g = indices.sliding(2).forall(p => p(0) < p(1))
    println("sliding: " + (System.nanoTime() - start).toDouble / 1e9)

while: 0.088418226
sliding: 37.128311352

I also used jconsole to collect the memory usage. The "sliding" way consumes significantly more memory and triggers GC frequently.
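Apart from speed, the two forms differ at the edges. The sketch below is an illustration, not code from the patch: it shows what the compact form needs in order to agree with the loop. For a one-element array, `sliding(2)` yields a single window of length 1, so `p(1)` would be out of bounds without a guard, and the sliding form alone never compares the first index against -1. `AscendingCheck`, `viaSliding`, and `viaLoop` are hypothetical names.

```scala
object AscendingCheck {
  // Compact form. The length guard covers the single short window that
  // sliding(2) yields for a one-element array; the headOption test stands
  // in for the loop's initial comparison against -1.
  def viaSliding(indices: Array[Int]): Boolean =
    indices.headOption.forall(_ >= 0) &&
      indices.sliding(2).forall(p => p.length < 2 || p(0) < p(1))

  // Loop form, following the benchmark above: starting from -1 also
  // rejects a negative first index.
  def viaLoop(indices: Array[Int]): Boolean = {
    var previous = -1
    var i = 0
    while (i < indices.length) {
      if (indices(i) <= previous) return false
      previous = indices(i)
      i += 1
    }
    true
  }
}
```

Both functions return the same answer for empty, single-element, and unsorted inputs; no claim is made here about their relative performance beyond the numbers reported in the thread.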

Yeah, that's fair enough. I don't think this is a performance-critical block anyway, but the cleverness is probably a step too far in complexity.

This needs a little rebase, and I think you can still use require for the check.
This allows indices(0) == 0. Do you mean to initialize previous = 0?

I will use require for the check.

indices is -> indices are.

And since there's a -1 on line 81, it's acceptable to have indices(0) == 0 here (the indices have already been changed to zero-based).
The first unit test covers the check for a zero-based LIBSVM file.
Let me know if I should change anything else. Thanks a lot for the careful check.

Oh you're right. I keep mixing up how the check is implemented.
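The off-by-one bookkeeping in this exchange can be sketched as follows. This is a simplified illustration under stated assumptions, not the merged code: `LibSvmIndices` and `parseIndices` are hypothetical names, and only the index half of each `index:value` item is parsed. Because 1 is subtracted at read time, a valid first index becomes 0, and a zero-based file produces -1, which fails the comparison against `previous = -1`.

```scala
object LibSvmIndices {
  def parseIndices(items: Array[String]): Array[Int] = {
    // Convert one-based indices to zero-based, as the loader does.
    val indices = items.map(item => item.split(':')(0).toInt - 1)

    // After the shift, 0 is the smallest valid index, so start from -1.
    var previous = -1
    var i = 0
    val indicesLength = indices.length
    while (i < indicesLength) {
      val current = indices(i)
      require(current > previous, "indices should be one-based and in ascending order")
      previous = current
      i += 1
    }
    indices
  }
}
```

A line starting with `"1:..."` passes with a shifted first index of 0, whereas `"0:..."` is rejected, as is an out-of-order pair such as `"3:..."` followed by `"2:..."`.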

SparkQA · 2015-06-02T12:34:34Z

Test build #33979 timed out for PR 6538 at commit 20a2811 after a configured wait of 150m.

mengxr · 2015-06-02T16:48:20Z

test this please

SparkQA · 2015-06-02T19:22:42Z

Test build #34000 has finished for PR 6538 at commit 20a2811.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

srowen · 2015-06-02T20:58:38Z

@hhbyyh this LGTM but can you rebase to resolve the merge conflict?

SparkQA · 2015-06-03T01:41:35Z

Test build #34045 has finished for PR 6538 at commit 96460f1.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-06-03T03:38:06Z

Test build #34046 has finished for PR 6538 at commit 4310710.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2015-06-03T06:22:26Z

mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala

+        // check if indices are one-based and in ascending order
+        var previous = -1
+        var i = 0
+        val indicesLength = indices.size


nit: indices.length (size introduces one extra method call, and the Scala compiler doesn't optimize it away.)

SparkQA · 2015-06-03T09:17:08Z

Test build #34075 has finished for PR 6538 at commit 79d9c11.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jira: https://issues.apache.org/jira/browse/SPARK-7983

Customers frequently use zero-based indices in their LIBSVM files. No warnings or errors from Spark will be reported during their computation afterwards, and usually it will lead to weird results for many algorithms (like GBDT).

Add a quick check.

Author: Yuhao Yang <[email protected]>

Closes apache#6538 from hhbyyh/loadSVM and squashes the following commits:

79d9c11 [Yuhao Yang] optimization as respond to comments
4310710 [Yuhao Yang] merge conflict
96460f1 [Yuhao Yang] merge conflict
20a2811 [Yuhao Yang] use require
6e4f8ca [Yuhao Yang] add check for ascending order
9956365 [Yuhao Yang] add ut for 0-based loadlibsvm exception
5bd1f9a [Yuhao Yang] add require for one-based in loadLIBSVM

add require for one-based in loadLIBSVM

5bd1f9a

hhbyyh changed the title ~~[Spark-7983] [MLlib] add require for one-based in loadLIBSVM~~ [Spark-7983] [MLlib] Add require for one-based indices in loadLibSVMFile May 31, 2015

srowen reviewed May 31, 2015

add ut for 0-based loadlibsvm exception

9956365

add check for ascending order

6e4f8ca

srowen reviewed Jun 1, 2015

use require

20a2811

hhbyyh added 2 commits June 3, 2015 09:32

merge conflict

96460f1

merge conflict

4310710

mengxr reviewed Jun 3, 2015

optimization as respond to comments

79d9c11

asfgit closed this in 28dbde3 Jun 3, 2015
