-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-10906][MLlib] More efficient SparseMatrix.equals #8960
[SPARK-10906][MLlib] More efficient SparseMatrix.equals #8960
Conversation
Can one of the admins verify this patch? |
case m: Matrix => | ||
val thisIteratorSet = toBreeze.activeIterator.toSet | ||
val mIteratorSet = m.toBreeze.activeIterator.toSet.filter(p => p._2 != 0.0) | ||
thisIteratorSet == mIteratorSet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than converting to sets (fairly expensive), how about comparing the iterators directly? It's a bit more complex (handling zeros) but would then be linear time, with a small constant. It would also allow early stopping as soon as a non-matching value was found.
Also, this method should compare the matrix dimensions before looking at the values.
You're right about them not using the same order in activeIterator; that's annoying. I like the tests you added before using the iterators. But I don't think the use of the iterators will handle zero values properly. Can you go ahead and add one or more unit tests, particularly where there are both implicit and explicit zeros, in order to catch such issues? |
In a sparse representation, an implicit zero is not in values or indices arrays, but an explicit zero is (with value 0). I'm sending your PR a PR to demonstrate the problem. |
Ping! Will you have a chance to update this soon? |
…ithub.com/jkbradley/spark into jkbradley-rahulpalamuttam-RahulP_SparseMatrixEquals Conflicts: mllib/src/test/scala/org/apache/spark/mllib/linalg/MatricesSuite.scala
@jkbradley |
Are there any more edge conditions left to consider in this patch? |
@rahulpalamuttam May I ask your use case for Btw, a better solution would be updating the implementation in breeze instead of Spark, which benefits both projects. |
@rahulpalamuttam Sorry for the delay in reviewing! How would you feel about updating the implementation in Breeze, rather than in Spark? I expect you could use much of the code you've already written, and I'd be happy to help review your PR to Breeze if it's helpful. |
@jkbradley Do you want me to update the PR with the test code and remove the implementation code? |
@jkbradley @mengxr |
@rahulpalamuttam Yes, please, it'd be great to update this PR to be a test only (assuming it fails without the Breeze fix?). I'll test and comment on the Breeze PR. |
Actually the test passes regardless of whether the fix is there or not. |
Right, sorry, I was losing track of things. In that case, let's close this issue and focus on improving the Breeze method. Thanks! |
I closed this PR based on the discussion. |
@jkbradley
Calls toBreeze.activeIterator which, for CSCMatrix, returns an iterator
with only non zero values. The iterators are then converted to sets and
checked for equality. Since the input parameter can be of type DenseMatrix
the second set is filtered for non-zero values. This is because activeIterator
for DenseMatrix returns all values.