Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spark-7983] [MLlib] Add require for one-based indices in loadLibSVMFile #6538

Closed
wants to merge 7 commits into from

Conversation

hhbyyh
Copy link
Contributor

@hhbyyh hhbyyh commented May 31, 2015

jira: https://issues.apache.org/jira/browse/SPARK-7983

Customers frequently use zero-based indices in their LIBSVM files. No warnings or errors from Spark will be reported during their computation afterwards, and usually it will lead to wired result for many algorithms (like GBDT).

add a quick check.

@hhbyyh hhbyyh changed the title [Spark-7983] [MLlib] add require for one-based in loadLIBSVM [Spark-7983] [MLlib] Add require for one-based indices in loadLibSVMFile May 31, 2015
@@ -82,6 +82,8 @@ object MLUtils {
val value = indexAndValue(1).toDouble
(index, value)
}.unzip
require(indices.size == 0 || indices(0) >= 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you mean >= 1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review . I'm thinking it has already been converted to zero based on line 81.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, missed that in the fold. Would it make more sense to check the value where it's read rather than after the -1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid adding it to line 81 will raise a performance concern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just another conditional? seems vanishingly small compared to other work done here. The current check assumes that it's in correct libsvm format and that the indices are ordered ascending. The code doesn't actually rely on this though and works even with missorted indices. However the current check wouldn't work if "0:..." occurred later in the line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true.
And if we're adding the one-based check inside the iteration for each indice-value pair, do you think we should check the ascending orders also?
I thought the original code avoid it intentionally for performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the sparse vector constructor doesn't verify its input since it's created so frequently. But this seems like a reasonable place to check input, when loading from libsvm. I was mistaken; this will silently fail if the libsvm format is wrong and the indices aren't sorted. So I suppose your original check is fine but if the ordering of the indices is also checked.

@SparkQA
Copy link

SparkQA commented May 31, 2015

Test build #33851 has finished for PR 6538 at commit 5bd1f9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented May 31, 2015

In fact I'd make a test?

@hhbyyh
Copy link
Contributor Author

hhbyyh commented May 31, 2015

@srowen. Great idea. I will add an unit test.

@SparkQA
Copy link

SparkQA commented May 31, 2015

Test build #33855 has finished for PR 6538 at commit 9956365.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 1, 2015

Test build #33887 has finished for PR 6538 at commit 6e4f8ca.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@@ -82,6 +82,19 @@ object MLUtils {
val value = indexAndValue(1).toDouble
(index, value)
}.unzip

// check if indices is one-based and in ascending order
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indices is -> indices are
You can also use require I suppose.
Haven't we lost the check for a index of 0 though now?

You could write this check in one line as something like
require(indices.sliding(2).forall(p => p(0) < p(1)))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indices(0) is compared with -1 in current implementation.
Your way looks much more compact. The only concern is if it will create many small Iterators. I'll run some benchmark and get back. Thanks for the great suggestion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be some performance difference.

    val indices = 1 to 200000000
    var start = System.nanoTime()

    var previous = -1
    var i = 0
    val indicesLength = indices.size
    while (i < indicesLength) {
      if (indices(i) <= previous) {
        throw new IllegalArgumentException("indices should be one-based and in ascending order")
      }
      previous = indices(i)
      i += 1
    }

    println("while: " + (System.nanoTime() - start).toDouble / 1e9)
    start = System.nanoTime()

    val g = indices.sliding(2).forall(p => p(0) < p(1))
    println("sliding: " + (System.nanoTime() - start).toDouble / 1e9)

while: 0.088418226
sliding: 37.128311352

I also used jconsole to collect the memory usage. The "sliding" way consumes significantly higher memory and triggers GC frequently/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's fair enough. I don't think this is a performance-critical block anyway, but the cleverness is probably a step too far in complexity.

This needs a little rebase, and i think you can still use require for the check.
This allows indices(0) == 0. Do you mean to initialize previous = 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I will use require for the check.
  2. indices is -> indices are.

And since there's a -1 in line 81, it's acceptable to have indices(0) == 0 here. (changed to zero-based).
The first ut has covered the check for zero-based LIBSVM file.
Let me know if I should change more. Thanks a lot for the careful check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh you're right. I keep mixing up how the check is implemented.

@SparkQA
Copy link

SparkQA commented Jun 2, 2015

Test build #33979 timed out for PR 6538 at commit 20a2811 after a configured wait of 150m.

@mengxr
Copy link
Contributor

mengxr commented Jun 2, 2015

test this please

@SparkQA
Copy link

SparkQA commented Jun 2, 2015

Test build #34000 has finished for PR 6538 at commit 20a2811.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jun 2, 2015

@hhbyyh this LGTM but can you rebase to resolve the merge conflict?

@SparkQA
Copy link

SparkQA commented Jun 3, 2015

Test build #34045 has finished for PR 6538 at commit 96460f1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 3, 2015

Test build #34046 has finished for PR 6538 at commit 4310710.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// check if indices are one-based and in ascending order
var previous = -1
var i = 0
val indicesLength = indices.size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indices.length (size introduces one extra method call, and scala compiler doesn't optimize it.)

@SparkQA
Copy link

SparkQA commented Jun 3, 2015

Test build #34075 has finished for PR 6538 at commit 79d9c11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 28dbde3 Jun 3, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
jira: https://issues.apache.org/jira/browse/SPARK-7983

Customers frequently use zero-based indices in their LIBSVM files. No warnings or errors from Spark will be reported during their computation afterwards, and usually it will lead to wired result for many algorithms (like GBDT).

add a quick check.

Author: Yuhao Yang <[email protected]>

Closes apache#6538 from hhbyyh/loadSVM and squashes the following commits:

79d9c11 [Yuhao Yang] optimization as respond to comments
4310710 [Yuhao Yang] merge conflict
96460f1 [Yuhao Yang] merge conflict
20a2811 [Yuhao Yang] use require
6e4f8ca [Yuhao Yang] add check for ascending order
9956365 [Yuhao Yang] add ut for 0-based loadlibsvm exception
5bd1f9a [Yuhao Yang] add require for one-based in loadLIBSVM
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
jira: https://issues.apache.org/jira/browse/SPARK-7983

Customers frequently use zero-based indices in their LIBSVM files. No warnings or errors from Spark will be reported during their computation afterwards, and usually it will lead to wired result for many algorithms (like GBDT).

add a quick check.

Author: Yuhao Yang <[email protected]>

Closes apache#6538 from hhbyyh/loadSVM and squashes the following commits:

79d9c11 [Yuhao Yang] optimization as respond to comments
4310710 [Yuhao Yang] merge conflict
96460f1 [Yuhao Yang] merge conflict
20a2811 [Yuhao Yang] use require
6e4f8ca [Yuhao Yang] add check for ascending order
9956365 [Yuhao Yang] add ut for 0-based loadlibsvm exception
5bd1f9a [Yuhao Yang] add require for one-based in loadLIBSVM
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants