
[Spark-5406][MLlib] LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound #4200

Closed
hhbyyh wants to merge 3 commits into apache:master from hhbyyh:rowMatrix

Conversation

@hhbyyh
Contributor

hhbyyh commented Jan 26, 2015

JIRA link: https://issues.apache.org/jira/browse/SPARK-5406

The code in breeze's svd imposes the upper bound for LocalLAPACK in RowMatrix.computeSVD.
Code from breeze svd (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala):

    val workSize = ( 3
      * scala.math.min(m, n)
      * scala.math.min(m, n)
      + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
        * scala.math.min(m, n) + 4 * scala.math.min(m, n))
    )
    val work = new Array[Double](workSize)

As a result, we need at least 7 * n * n + 4 * n < Int.MaxValue for a square n x n matrix (the exact maximum array size also depends on the JVM).

In some worse cases, e.g. n = 25000, the Int arithmetic overflows and the work size wraps around to a positive value again (80032704), causing weird behavior.
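
A minimal sketch of the wrap-around (an illustration, not part of the patch): for a square n x n matrix, workSize reduces to 7 * n * n + 4 * n.

    object WorkSizeOverflow extends App {
      val n = 25000
      // 7 * 25000 * 25000 = 4375000000 exceeds Int.MaxValue (2147483647),
      // so 32-bit Int arithmetic silently wraps back to a positive value.
      println(7 * n * n)           // prints 80032704
      println(7L * n * n + 4L * n) // the true size: 4375100000
    }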

This PR is only the beginning. The goal is to support Genbase (an important biological benchmark that could help promote Spark for genetic applications, http://www.paradigm4.com/wp-content/uploads/2014/06/Genomics-Benchmark-Technical-Report.pdf), which needs to compute the SVD of matrices up to 60K * 70K. I found many potential issues along the way and would like to know whether there is any ongoing plan to expand the range of matrix computation on Spark.
Thanks.

@SparkQA

SparkQA commented Jan 26, 2015

Test build #26087 has started for PR 4200 at commit e48a6e4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 26, 2015

Test build #26087 has finished for PR 4200 at commit e48a6e4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26087/
Test FAILed.

@SparkQA

SparkQA commented Jan 26, 2015

Test build #26088 has started for PR 4200 at commit 23860e4.

  • This patch merges cleanly.

@hhbyyh
Contributor Author

hhbyyh commented Jan 26, 2015

Sorry for the style violation; the offending comment line was added after the test build started.

@SparkQA

SparkQA commented Jan 26, 2015

Test build #26088 has finished for PR 4200 at commit 23860e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26088/
Test PASSed.

@srowen
Member

srowen commented Jan 26, 2015

The evaluation of this criterion seems correct, but I don't think this is the solution. As you approach this limit, you're allocating a single ~16GB array. This is already at the kind of scale where this implementation is not sufficient. Shouldn't the code therefore select the distributed version in this case, rather than just fail?
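
(Back-of-envelope for the ~16GB figure, my own arithmetic: near the limit, the work array approaches Int.MaxValue doubles.)

    // A work array of roughly Int.MaxValue doubles, 8 bytes per double:
    val workBytes = Int.MaxValue.toLong * 8 // 17179869176 bytes, about 16 GiB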

@hhbyyh
Contributor Author

hhbyyh commented Jan 26, 2015

Thanks for the review, @srowen.
Indeed, a more proper policy for SVD mode selection is needed. I'm having some trouble with my cluster now; I'll investigate the distributed mode after it's back to normal.
Also, the require can still be useful as a must-have sanity check before invoking brzSvd, right? Something like the sketch below.

BTW, I'd like to know whether anyone has had success computing an SVD on Spark at a scale of about 20K * 20K?
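
(An illustrative sketch of such a check, not the exact patch; 7 * 17515 * 17515 + 4 * 17515 just exceeds Int.MaxValue, so n must stay below 17515:)

    // Guard before invoking brzSvd in LocalLAPACK mode: the LAPACK work
    // array of 7*n*n + 4*n doubles must be indexable by an Int.
    require(n < 17515,
      s"LocalLAPACK mode needs n < 17515 so that the work array fits in " +
      s"an Int-indexed array, but n = $n.")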

@srowen
Member

srowen commented Jan 26, 2015

You can at least make this part of the logic under the "auto" mode above, right? Yes, if the user explicitly asks for local mode, it will either have to be an error or automatically change modes. I have not run the SVD at any scale, but the mode you're choosing is not appropriate for anything that large; use the distributed version.

@hhbyyh
Contributor Author

hhbyyh commented Jan 26, 2015

Well, a clear problem with the current auto mode is that it chooses local mode as long as k > n / 2, regardless of matrix scale (see the paraphrase below).
I haven't fully evaluated the capacity and constraints of LocalARPACK and DistARPACK (e.g. whether DistARPACK would actually be better), and until then, rewriting auto seems rash.
The purpose of this PR is to make the limitation from brzSvd loud and clear; I'll post the update for auto mode when it's ready.
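
(For readers following along, the selection being discussed looks roughly like this; a simplified paraphrase, not the exact Spark source:)

    // Simplified paraphrase of the "auto" branch of RowMatrix.computeSVD,
    // where n is the number of columns and k the requested singular values.
    val mode =
      if (n < 100 || k > n / 2) {
        // Local: the Gramian is built and decomposed on the driver,
        // regardless of how large n is (the problem described above).
        if (k < n / 3) "LocalARPACK" else "LocalLAPACK"
      } else {
        // k is small relative to n: distributed ARPACK, multiple passes.
        "DistARPACK"
      }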

@mengxr
Contributor

mengxr commented Jan 26, 2015

@hhbyyh It really depends on whether you need the full SVD or a truncated SVD. For the 60k * 70k problem, do you need its full SVD? If so, I would say that is not yet supported. If you only need the top 50 singular values, the dist mode should be able to handle it.

The limit is correct, but I agree with @srowen that this is already beyond what we can really handle on a single machine. I would keep this hard limit and set a soft limit at 5000 (or 10000), and output a warning message if n is greater than the soft limit.
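
(A sketch of this policy; the thresholds come from this comment, and logWarning is assumed to be available as in MLlib's Logging trait:)

    // Hard limit: the LAPACK work array must stay Int-indexable.
    require(n < 17515, s"n = $n is too large for LocalLAPACK mode.")
    // Soft limit: warn that a local full SVD at this size is questionable.
    if (n > 5000) {
      logWarning(s"Computing the full SVD locally for n = $n may be very " +
        "slow; consider a truncated SVD in distributed mode.")
    }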

@mengxr
Contributor

mengxr commented Jan 26, 2015

@hhbyyh Just curious: the commits from you always appear at the very bottom of the PR page, which makes it look like you just updated your PR. Is it a GitHub bug, or did you do something special?

@hhbyyh
Contributor Author

hhbyyh commented Jan 28, 2015

@mengxr Sorry, I was on something else yesterday.
Are you suggesting putting a soft limit in the auto mode for local, and keeping the hard limit in the LocalLAPACK case?
I agree with the general idea.
When I tried on my local machine, it took only about 2 hours to compute the full SVD of a 10K * 10K matrix, and I haven't even installed the native system BLAS. So I guess the practical limit for a single machine will be quite near the hard limit (17515). I'll try other scales and the distributed mode today.
BTW, about the commits appearing at the page bottom: probably because the clock on my local machine is set forward by 1 day ...
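
(The benchmark above amounts to a call like this; a sketch with the data loading elided and the dimension illustrative:)

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Full SVD: request k = n singular values of a square RowMatrix.
    def fullSvd(mat: RowMatrix) = {
      val n = mat.numCols().toInt // e.g. 10000 in the run above
      mat.computeSVD(n, computeU = true)
    }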

@mengxr
Contributor

mengxr commented Jan 28, 2015

10000 x 10000 is definitely doable with multi-threaded native BLAS on a single machine. But usually the full SVD is not necessary for the application. This is why I want to put in a soft limit and emit a warning message, which might help users reconsider whether they really need a full SVD.

That's interesting. Did GitHub show "added some commits 20 hours in the future"?

@SparkQA

SparkQA commented Jan 29, 2015

Test build #26318 has started for PR 4200 at commit f7864d0.

  • This patch merges cleanly.

@hhbyyh
Contributor Author

hhbyyh commented Jan 29, 2015

PR updated. Several findings:

  1. LocalLAPACK and LocalARPACK share a similar upper bound: "Requested array size exceeds VM limit" at n = 17000. n = 15000 takes more than 5 hours, but is doable.
  2. k is actually ignored in LocalLAPACK mode; it always computes the full SVD.
  3. computeGramianMatrix also has an upper limit somewhere below 17000, which is actually why finding 1 fails at 17000. I'll look into it, but I need more time to locate the root cause.
  4. Under DistARPACK mode, the 17K * 17K full SVD produced a lot of "Futures timed out" errors and the job failed. I'm now trying k = 10 on the 17K * 17K matrix (1 hour so far), and all worker CPUs seem to be idle. (A sketch of this truncated run follows below.)

My intention is to expand the range of matrix computation for Spark...
I'll probably try to optimize the DistARPACK mode.
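
(The truncated run in finding 4 corresponds to a call like this, reusing the mat sketched earlier; with k small relative to n, auto mode selects DistARPACK per the selection paraphrase above:)

    // Truncated SVD: only the top 10 singular values and vectors, computed
    // by distributed ARPACK with multiple passes over the rows.
    val truncated = mat.computeSVD(10, computeU = false)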

@SparkQA

SparkQA commented Jan 29, 2015

Test build #26318 has finished for PR 4200 at commit f7864d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26318/
Test PASSed.

@hhbyyh
Contributor Author

hhbyyh commented Feb 2, 2015

@mengxr Sorry to disturb you. I know you are probably quite busy with many PRs in review.

Could you please provide some comments if you have a minute? I will close the PR if it's regarded as unnecessary for now. Thanks.

asfgit closed this in d85cd4e on Feb 2, 2015
@mengxr
Contributor

mengxr commented Feb 2, 2015

The changes look good to me. We may want to investigate the limits further, but the current setting is certainly better than master. I've merged it. Thanks for testing!

hhbyyh deleted the rowMatrix branch on February 2, 2015 06:02
@hhbyyh
Contributor Author

hhbyyh commented Feb 2, 2015

Thanks for the review.

dongjoon-hyun pushed a commit that referenced this pull request Dec 27, 2023
### What changes were proposed in this pull request?
This PR aims to upgrade Jackson from 2.16.0 to 2.16.1.

### Why are the changes needed?
The new version brings some fixes:

- [#4200](FasterXML/jackson-databind#4200): JsonSetter(contentNulls = FAIL) is ignored in delegating JsonCreator argument
- [#4216](FasterXML/jackson-databind#4216): Primitive array deserializer not being captured by DeserializerModifier
- [#4219](FasterXML/jackson-databind#4219): JsonNode.findValues() and findParents() missing expected values in 2.16.0

The full release notes are as follows:

- https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.16.1

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44494 from LuciferYang/SPARK-46508.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Aug 7, 2024