
[Spark-5406][MLlib] LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound #4200

Closed
hhbyyh wants to merge 3 commits into apache:master from hhbyyh:rowMatrix

Conversation

@hhbyyh
Contributor

hhbyyh commented Jan 26, 2015

JIRA link: https://issues.apache.org/jira/browse/SPARK-5406

The code in breeze's svd imposes the upper bound for LocalLAPACK in RowMatrix.computeSVD.
Code from breeze svd (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala):

    val workSize = ( 3
      * scala.math.min(m, n)
      * scala.math.min(m, n)
      + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
        * scala.math.min(m, n) + 4 * scala.math.min(m, n))
    )
    val work = new Array[Double](workSize)

As a result, we need at least 7 * n * n + 4 * n < Int.MaxValue for a square n x n matrix (the exact maximum array size also depends on the JVM).

In some worse cases, e.g. n = 25000, the Int arithmetic overflows and the work size wraps around to a positive value again (80032704), causing weird behavior.
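
A minimal sketch of the wrap-around (an illustration, not part of the patch): for a square n x n matrix, workSize reduces to 7 * n * n + 4 * n.

    object WorkSizeOverflow extends App {
      val n = 25000
      // 7 * 25000 * 25000 = 4375000000 exceeds Int.MaxValue (2147483647),
      // so 32-bit Int arithmetic silently wraps back to a positive value.
      println(7 * n * n)           // prints 80032704
      println(7L * n * n + 4L * n) // the true size: 4375100000
    }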

This PR is only the beginning. The goal is to support Genbase (an important biological benchmark that could help promote Spark for genetic applications, http://www.paradigm4.com/wp-content/uploads/2014/06/Genomics-Benchmark-Technical-Report.pdf), which needs to compute the SVD of matrices up to 60K * 70K. I found many potential issues along the way and would like to know whether there is any ongoing plan to expand the range of matrix computation on Spark.
Thanks.

@SparkQA

SparkQA commented Jan 26, 2015

Test build #26087 has started for PR 4200 at commit e48a6e4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 26, 2015

Test build #26087 has finished for PR 4200 at commit e48a6e4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26087/
Test FAILed.

@SparkQA

SparkQA commented Jan 26, 2015

Test build #26088 has started for PR 4200 at commit 23860e4.

  • This patch merges cleanly.

@hhbyyh
Contributor Author

hhbyyh commented Jan 26, 2015

Sorry for the style violation; the offending comment line was added after the test build started.

@SparkQA

SparkQA commented Jan 26, 2015

Test build #26088 has finished for PR 4200 at commit 23860e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26088/
Test PASSed.

@srowen
Member

srowen commented Jan 26, 2015

The evaluation of this criterion seems correct, but I don't think this is the solution. As you approach this limit, you're allocating a single ~16GB array. This is already at the kind of scale where this implementation is not sufficient. Shouldn't the code therefore select the distributed version in this case, rather than just fail?
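
(Back-of-envelope for the ~16GB figure, my own arithmetic: near the limit, the work array approaches Int.MaxValue doubles.)

    // A work array of roughly Int.MaxValue doubles, 8 bytes per double:
    val workBytes = Int.MaxValue.toLong * 8 // 17179869176 bytes, about 16 GiB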

@hhbyyh
Contributor Author

hhbyyh commented Jan 26, 2015

Thanks for the review, @srowen.
Indeed, a more proper policy for SVD mode selection is needed. I'm having some trouble with my cluster now; I'll investigate the distributed mode after it's back to normal.
Also, the require can still be useful as a must-have sanity check before invoking brzSvd, right? Something like the sketch below.

BTW, I'd like to know whether anyone has had success computing an SVD on Spark at a scale of about 20K * 20K?
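
(An illustrative sketch of such a check, not the exact patch; 7 * 17515 * 17515 + 4 * 17515 just exceeds Int.MaxValue, so n must stay below 17515:)

    // Guard before invoking brzSvd in LocalLAPACK mode: the LAPACK work
    // array of 7*n*n + 4*n doubles must be indexable by an Int.
    require(n < 17515,
      s"LocalLAPACK mode needs n < 17515 so that the work array fits in " +
      s"an Int-indexed array, but n = $n.")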

@srowen
Member

srowen commented Jan 26, 2015

You can at least make this part of the logic under the "auto" mode above, right? Yes, if the user explicitly asks for local mode, it will either have to be an error or automatically change modes. I have not run the SVD at any scale, but the mode you're choosing is not appropriate for anything that large; use the distributed version.

@hhbyyh
Contributor Author

hhbyyh commented Jan 26, 2015

Well, a clear problem with the current auto mode is that it chooses local mode as long as k > n / 2, regardless of matrix scale (see the paraphrase below).
I haven't fully evaluated the capacity and constraints of LocalARPACK and DistARPACK (e.g. whether DistARPACK would actually be better), and until then, rewriting auto seems rash.
The purpose of this PR is to make the limitation from brzSvd loud and clear; I'll post the update for auto mode when it's ready.
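
(For readers following along, the selection being discussed looks roughly like this; a simplified paraphrase, not the exact Spark source:)

    // Simplified paraphrase of the "auto" branch of RowMatrix.computeSVD,
    // where n is the number of columns and k the requested singular values.
    val mode =
      if (n < 100 || k > n / 2) {
        // Local: the Gramian is built and decomposed on the driver,
        // regardless of how large n is (the problem described above).
        if (k < n / 3) "LocalARPACK" else "LocalLAPACK"
      } else {
        // k is small relative to n: distributed ARPACK, multiple passes.
        "DistARPACK"
      }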

@mengxr
Contributor

mengxr commented Jan 26, 2015

@hhbyyh It really depends on whether you need the full SVD or a truncated SVD. For the 60k * 70k problem, do you need its full SVD? If so, I would say that is not yet supported. If you only need the top 50 singular values, the dist mode should be able to handle it.

The limit is correct, but I agree with @srowen that this is already beyond what we can really handle on a single machine. I would keep this hard limit and set a soft limit at 5000 (or 10000), and output a warning message if n is greater than the soft limit.
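
(A sketch of this policy; the thresholds come from this comment, and logWarning is assumed to be available as in MLlib's Logging trait:)

    // Hard limit: the LAPACK work array must stay Int-indexable.
    require(n < 17515, s"n = $n is too large for LocalLAPACK mode.")
    // Soft limit: warn that a local full SVD at this size is questionable.
    if (n > 5000) {
      logWarning(s"Computing the full SVD locally for n = $n may be very " +
        "slow; consider a truncated SVD in distributed mode.")
    }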

@mengxr
Contributor

mengxr commented Jan 26, 2015

@hhbyyh Just curious: the commits from you always appear at the very bottom of the PR page, which makes it look like you just updated your PR. Is it a GitHub bug, or did you do something special?

@hhbyyh
Contributor Author

hhbyyh commented Jan 28, 2015

@mengxr Sorry, I was on something else yesterday.
Are you suggesting putting a soft limit in the auto mode for local, and keeping the hard limit in the LocalLAPACK case?
I agree with the general idea.
When I tried on my local machine, it took only about 2 hours to compute the full SVD of a 10K * 10K matrix, and I haven't even installed the native system BLAS. So I guess the practical limit for a single machine will be quite near the hard limit (17515). I'll try other scales and the distributed mode today.
BTW, about the commits appearing at the page bottom: probably because the clock on my local machine is set forward by 1 day ...
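
(The benchmark above amounts to a call like this; a sketch with the data loading elided and the dimension illustrative:)

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Full SVD: request k = n singular values of a square RowMatrix.
    def fullSvd(mat: RowMatrix) = {
      val n = mat.numCols().toInt // e.g. 10000 in the run above
      mat.computeSVD(n, computeU = true)
    }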

@mengxr
Contributor

mengxr commented Jan 28, 2015

10000 x 10000 is definitely doable with multi-threaded native BLAS on a single machine. But usually the full SVD is not necessary for the application. This is why I want to put in a soft limit and emit a warning message, which might help users reconsider whether they really need a full SVD.

That's interesting. Did GitHub show "added some commits 20 hours in the future"?

@SparkQA

SparkQA commented Jan 29, 2015

Test build #26318 has started for PR 4200 at commit f7864d0.

  • This patch merges cleanly.

@hhbyyh
Contributor Author

hhbyyh commented Jan 29, 2015

PR updated. Several findings:

  1. LocalLAPACK and LocalARPACK share a similar upper bound: "Requested array size exceeds VM limit" at n = 17000. n = 15000 takes more than 5 hours, but is doable.
  2. k is actually ignored in LocalLAPACK mode; it always computes the full SVD.
  3. computeGramianMatrix also has an upper limit somewhere below 17000, which is actually why finding 1 fails at 17000. I'll look into it, but I need more time to locate the root cause.
  4. Under DistARPACK mode, the 17K * 17K full SVD produced a lot of "Futures timed out" errors and the job failed. I'm now trying k = 10 on the 17K * 17K matrix (1 hour so far), and all worker CPUs seem to be idle. (A sketch of this truncated run follows below.)

My intention is to expand the range of matrix computation for Spark...
I'll probably try to optimize the DistARPACK mode.
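
(The truncated run in finding 4 corresponds to a call like this, reusing the mat sketched earlier; with k small relative to n, auto mode selects DistARPACK per the selection paraphrase above:)

    // Truncated SVD: only the top 10 singular values and vectors, computed
    // by distributed ARPACK with multiple passes over the rows.
    val truncated = mat.computeSVD(10, computeU = false)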

@SparkQA

SparkQA commented Jan 29, 2015

Test build #26318 has finished for PR 4200 at commit f7864d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26318/
Test PASSed.

@hhbyyh
Contributor Author

hhbyyh commented Feb 2, 2015

@mengxr Sorry to disturb you. I know you are probably quite busy with many PRs in review.

Could you please provide some comments if you have a minute? I will close the PR if it's regarded as unnecessary for now. Thanks.

asfgit closed this in d85cd4e on Feb 2, 2015
@mengxr
Contributor

mengxr commented Feb 2, 2015

The changes look good to me. We may want to investigate the limits further, but the current setting is certainly better than master. I've merged it. Thanks for testing!

hhbyyh deleted the rowMatrix branch on February 2, 2015 06:02
@hhbyyh
Contributor Author

hhbyyh commented Feb 2, 2015

Thanks for the review.

dongjoon-hyun pushed a commit that referenced this pull request Dec 27, 2023
### What changes were proposed in this pull request?
This PR aims to upgrade Jackson from 2.16.0 to 2.16.1.

### Why are the changes needed?
The new version brings some fixes:

- [#4200](FasterXML/jackson-databind#4200): JsonSetter(contentNulls = FAIL) is ignored in delegating JsonCreator argument
- [#4216](FasterXML/jackson-databind#4216): Primitive array deserializer not being captured by DeserializerModifier
- [#4219](FasterXML/jackson-databind#4219): JsonNode.findValues() and findParents() missing expected values in 2.16.0

The full release notes are as follows:

- https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.16.1

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44494 from LuciferYang/SPARK-46508.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Aug 7, 2024