Bug fix of sparse vector conversion #661

Closed
wants to merge 5 commits into from
Changes from 3 commits
@@ -136,7 +136,11 @@ object Vectors {
         new DenseVector(v.toArray)  // Can't use underlying array directly, so make a new one
       }
     case v: BSV[Double] =>
-      new SparseVector(v.length, v.index, v.data)
+      if (v.index.length == v.used) {
+        new SparseVector(v.length, v.index, v.data)
+      } else {
+        new SparseVector(v.length, v.index.slice(0, v.used), v.data.slice(0, v.used))
+      }
     case v: BV[_] =>
       sys.error("Unsupported Breeze vector type: " + v.getClass.getName)
   }
@@ -19,7 +19,7 @@ package org.apache.spark.mllib.linalg

 import org.scalatest.FunSuite

-import breeze.linalg.{DenseVector => BDV, SparseVector => BSV}
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, VectorBuilder => BVB}

 /**
  * Test Breeze vector conversions.
@@ -55,4 +55,16 @@ class BreezeVectorConversionSuite extends FunSuite {
     assert(vec.indices.eq(indices), "should not copy data")
     assert(vec.values.eq(values), "should not copy data")
   }
+
+  test("sparse breeze by vector builder to vector") {
Contributor

To ensure we can convert a BSV to a SparseVector correctly when there are unused elements, we should use

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/SparseVector.scala#L54

with activeSize < index.length.

Contributor Author

@mengxr A Breeze SparseVector always has activeSize < index.length, so I don't see the problem. Do you mean we should also check this in the test case?
By the way, maybe we could keep the same activeSize in our SparseVector? That would avoid copying data during conversion.

Contributor

The test case tests whether fromBreeze can map a BSV created by the builder correctly. But the PR handles the case when index.length > used. To test it, you should create a BSV with index.length > used and verify that fromBreeze works correctly. Try to construct a SparseArray directly and use it to construct a BSV.
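
A minimal sketch of the case described above, assuming that only the first `used` entries of the backing arrays are meaningful. `PartiallyUsed` and `TrimSketch` are hypothetical stand-ins, not Breeze or MLlib classes, so the snippet runs without either library. It only mirrors the branch added in the diff.

```scala
// Hypothetical stand-in for a sparse vector whose backing arrays may be
// longer than the number of active entries (not Breeze's actual class).
final case class PartiallyUsed(length: Int, index: Array[Int], data: Array[Double], used: Int)

object TrimSketch {
  // Mirrors the branch in the diff: reuse the arrays when they are fully
  // used, otherwise copy only the first `used` entries.
  def trim(v: PartiallyUsed): (Array[Int], Array[Double]) =
    if (v.index.length == v.used) (v.index, v.data)
    else (v.index.slice(0, v.used), v.data.slice(0, v.used))
}
```

A test in this spirit would build an input with `index.length > used` (for example four slots with three in use) and check that only the first three indices and values survive, while a fully used input is passed through without copying.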

Contributor

Btw, we don't carry unused data in MLlib because we need to serialize/deserialize the data.
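
To illustrate the serialization point, here is a rough sketch that compares the serialized size of a padded array with its trimmed copy. It assumes plain Java serialization, which the thread does not specify, and `SerializedSizeSketch` is a hypothetical helper rather than part of MLlib.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object SerializedSizeSketch {
  // Number of bytes produced by plain Java serialization of a double array.
  // Assumption: Java serialization stands in for whatever serializer is used.
  def serializedSize(values: Array[Double]): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(values)
    out.close()
    bytes.size()
  }
}
```

Under this assumption, every unused slot in a padded array is still written out, so trimming to the used prefix before storing the arrays keeps the serialized form smaller.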

+    val builder = new BVB[Double](n)
+    for (i <- 0 until indices.length) {
+      builder.add(indices(i), values(i))
+    }
+    val breeze = builder.toSparseVector
+    val vec = Vectors.fromBreeze(breeze).asInstanceOf[SparseVector]
+    assert(vec.size === n)
+    assert(vec.indices === indices)
+    assert(vec.values === values)
+  }
 }