-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spark-7983] [MLlib] Add require for one-based indices in loadLibSVMFile #6538
Conversation
@@ -82,6 +82,8 @@ object MLUtils { | |||
val value = indexAndValue(1).toDouble | |||
(index, value) | |||
}.unzip | |||
require(indices.size == 0 || indices(0) >= 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you mean >= 1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for review . I'm thinking it has already been converted to zero based on line 81.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, missed that in the fold. Would it make more sense to check the value where it's read rather than after the -1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid adding it to line 81 will raise a performance concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just another conditional? seems vanishingly small compared to other work done here. The current check assumes that it's in correct libsvm format and that the indices are ordered ascending. The code doesn't actually rely on this though and works even with missorted indices. However the current check wouldn't work if "0:..." occurred later in the line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true.
And if we're adding the one-based check inside the iteration for each indice-value pair, do you think we should check the ascending orders also?
I thought the original code avoid it intentionally for performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the sparse vector constructor doesn't verify its input since it's created so frequently. But this seems like a reasonable place to check input, when loading from libsvm. I was mistaken; this will silently fail if the libsvm format is wrong and the indices aren't sorted. So I suppose your original check is fine but if the ordering of the indices is also checked.
Test build #33851 has finished for PR 6538 at commit
|
In fact I'd make a test? |
@srowen. Great idea. I will add an unit test. |
Test build #33855 has finished for PR 6538 at commit
|
Test build #33887 has finished for PR 6538 at commit
|
@@ -82,6 +82,19 @@ object MLUtils { | |||
val value = indexAndValue(1).toDouble | |||
(index, value) | |||
}.unzip | |||
|
|||
// check if indices is one-based and in ascending order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indices is -> indices are
You can also use require
I suppose.
Haven't we lost the check for a index of 0 though now?
You could write this check in one line as something like
require(indices.sliding(2).forall(p => p(0) < p(1)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indices(0) is compared with -1 in current implementation.
Your way looks much more compact. The only concern is if it will create many small Iterators. I'll run some benchmark and get back. Thanks for the great suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be some performance difference.
val indices = 1 to 200000000
var start = System.nanoTime()
var previous = -1
var i = 0
val indicesLength = indices.size
while (i < indicesLength) {
if (indices(i) <= previous) {
throw new IllegalArgumentException("indices should be one-based and in ascending order")
}
previous = indices(i)
i += 1
}
println("while: " + (System.nanoTime() - start).toDouble / 1e9)
start = System.nanoTime()
val g = indices.sliding(2).forall(p => p(0) < p(1))
println("sliding: " + (System.nanoTime() - start).toDouble / 1e9)
while: 0.088418226
sliding: 37.128311352
I also used jconsole to collect the memory usage. The "sliding" way consumes significantly higher memory and triggers GC frequently/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's fair enough. I don't think this is a performance-critical block anyway, but the cleverness is probably a step too far in complexity.
This needs a little rebase, and i think you can still use require
for the check.
This allows indices(0) == 0
. Do you mean to initialize previous = 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I will use
require
for the check. - indices is -> indices are.
And since there's a -1
in line 81, it's acceptable to have indices(0) == 0 here. (changed to zero-based).
The first ut has covered the check for zero-based LIBSVM file.
Let me know if I should change more. Thanks a lot for the careful check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh you're right. I keep mixing up how the check is implemented.
Test build #33979 timed out for PR 6538 at commit |
test this please |
Test build #34000 has finished for PR 6538 at commit
|
@hhbyyh this LGTM but can you rebase to resolve the merge conflict? |
Test build #34045 has finished for PR 6538 at commit
|
Test build #34046 has finished for PR 6538 at commit
|
// check if indices are one-based and in ascending order | ||
var previous = -1 | ||
var i = 0 | ||
val indicesLength = indices.size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indices.length
(size
introduces one extra method call, and scala compiler doesn't optimize it.)
Test build #34075 has finished for PR 6538 at commit
|
jira: https://issues.apache.org/jira/browse/SPARK-7983 Customers frequently use zero-based indices in their LIBSVM files. No warnings or errors from Spark will be reported during their computation afterwards, and usually it will lead to wired result for many algorithms (like GBDT). add a quick check. Author: Yuhao Yang <[email protected]> Closes apache#6538 from hhbyyh/loadSVM and squashes the following commits: 79d9c11 [Yuhao Yang] optimization as respond to comments 4310710 [Yuhao Yang] merge conflict 96460f1 [Yuhao Yang] merge conflict 20a2811 [Yuhao Yang] use require 6e4f8ca [Yuhao Yang] add check for ascending order 9956365 [Yuhao Yang] add ut for 0-based loadlibsvm exception 5bd1f9a [Yuhao Yang] add require for one-based in loadLIBSVM
jira: https://issues.apache.org/jira/browse/SPARK-7983 Customers frequently use zero-based indices in their LIBSVM files. No warnings or errors from Spark will be reported during their computation afterwards, and usually it will lead to wired result for many algorithms (like GBDT). add a quick check. Author: Yuhao Yang <[email protected]> Closes apache#6538 from hhbyyh/loadSVM and squashes the following commits: 79d9c11 [Yuhao Yang] optimization as respond to comments 4310710 [Yuhao Yang] merge conflict 96460f1 [Yuhao Yang] merge conflict 20a2811 [Yuhao Yang] use require 6e4f8ca [Yuhao Yang] add check for ascending order 9956365 [Yuhao Yang] add ut for 0-based loadlibsvm exception 5bd1f9a [Yuhao Yang] add require for one-based in loadLIBSVM
jira: https://issues.apache.org/jira/browse/SPARK-7983
Customers frequently use zero-based indices in their LIBSVM files. No warnings or errors from Spark will be reported during their computation afterwards, and usually it will lead to wired result for many algorithms (like GBDT).
add a quick check.