[Spark-5406][MLlib] LocalLAPACK mode in RowMatrix.computeSVD should have much smaller upper bound #4200
Conversation
Test build #26087 has started for PR 4200 at commit
Test build #26087 has finished for PR 4200 at commit
Test FAILed.
Test build #26088 has started for PR 4200 at commit
Sorry for the style violation; the comment line was added after the test build.
Test build #26088 has finished for PR 4200 at commit
Test PASSed.
The evaluation of this criterion seems correct, but I don't think this is the solution. As you approach this limit, you're allocating a single ~16GB array, which is already at a scale where this implementation is not sufficient. Therefore, shouldn't the code select the distributed version in this case rather than just fail?
Thanks for the review, @srowen. BTW, I'd like to know whether there has been any successful experience computing SVD on Spark at a scale of about 20K * 20K?
You can at least make this part of the logic under "auto" mode above, right? Yes, if the user asks for local mode, it will either have to be an error or automatically change modes. I have not run the SVD at any scale, but the mode you're choosing is not appropriate for anything that large; use the distributed version.
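For illustration, a minimal sketch of the kind of "auto" dispatch being discussed here; the mode names, method name, and the localUpperBound parameter are hypothetical placeholders, not the actual RowMatrix code:

sealed trait SvdMode
case object LocalLAPACK extends SvdMode
case object DistARPACK extends SvdMode

// "auto" silently falls back to the distributed implementation for large n,
// while an explicitly requested local mode that cannot be honored fails fast
// instead of failing later inside LAPACK.
def chooseMode(requested: String, n: Int, localUpperBound: Int): SvdMode = requested match {
  case "auto" if n <= localUpperBound => LocalLAPACK
  case "auto" => DistARPACK
  case "localLAPACK" =>
    require(n <= localUpperBound, s"n = $n exceeds the local LAPACK limit $localUpperBound")
    LocalLAPACK
  case _ => DistARPACK
}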
Well, a clear problem for current
@hhbyyh It really depends on whether you need the full SVD or the truncated SVD. For the 60k * 70k problem, do you need its full SVD? If that is the case, I would say this is not yet supported. If you only need the top 50 singular values, the dist mode should be able to handle it. The limit is correct, but I agree with @srowen that this is already beyond what we can really handle on a single machine. I would keep this hard limit and set a soft limit at
@hhbyyh Just curious: the commits from you always appear at the very bottom of the PR page. It made me feel that you just updated your PR. Is it a GitHub bug or did you do something special?
@mengxr Sorry, I was on something else yesterday.
10000x10000 is definitely doable with multi-threaded native BLAS on a single machine. But usually the full SVD is not necessary for the application. This is why I want to put in a soft limit and throw a warning message, which might help users reconsider whether they need a full SVD. That's interesting. Did GitHub show "added some commits 20 hours in the future"?
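As a rough sketch of the hard-limit-plus-soft-limit idea, assuming a square n x n matrix; the soft-limit value, names, and messages below are illustrative assumptions, not the merged Spark code:

// Hard limit: largest n for which the breeze workspace 7*n*n + 4*n still fits in an Int.
// Computed in Long arithmetic so the check itself cannot overflow.
val hardLimit: Int = Iterator.from(1).takeWhile(n => 7L * n * n + 4L * n <= Int.MaxValue).max

// Soft limit (assumed value): beyond this, a full local SVD is rarely what the user wants.
val softLimit: Int = 10000

def checkLocalSvd(n: Int): Unit = {
  require(n <= hardLimit,
    s"n = $n exceeds the breeze svd workspace limit ($hardLimit) in LocalLAPACK mode")
  if (n > softLimit) {
    val gb = 7L * n * n * 8.0 / (1L << 30)
    Console.err.println(f"WARNING: a full local SVD of an $n x $n matrix needs about $gb%.1f GB " +
      "of LAPACK workspace; consider whether a truncated, distributed SVD is sufficient")
  }
}

At the hard limit the workspace alone is roughly 16 GB of doubles, which matches the concern raised above about allocating a single huge array.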
Test build #26318 has started for PR 4200 at commit
PR updated. Several findings:
My intention is to expand the range of matrix computation for Spark...
Test build #26318 has finished for PR 4200 at commit
Test PASSed.
@mengxr Sorry to disturb. I know you are probably quite busy with many PRs in review. Could you please provide some comments if you get a minute? I will close the PR if it's regarded as unnecessary for now~ Thanks.
The changes look good to me. We may want to investigate the limits further, but the current setting is certainly better than master. I've merged it. Thanks for testing!
Thanks for the review.
JIRA link: https://issues.apache.org/jira/browse/SPARK-5406
The workspace allocation in breeze's svd imposes the upper bound for the LocalLAPACK mode in RowMatrix.computeSVD.
Code from breeze svd (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala):
val workSize = ( 3
  * scala.math.min(m, n)
  * scala.math.min(m, n)
  + scala.math.max(scala.math.max(m, n), 4 * scala.math.min(m, n)
    * scala.math.min(m, n) + 4 * scala.math.min(m, n))
  )
val work = new Array[Double](workSize)
As a result, for a square n x n matrix the workspace size is 3 * n * n + max(n, 4 * n * n + 4 * n) = 7 * n * n + 4 * n, which must at least satisfy 7 * n * n + 4 * n < Int.MaxValue (the exact maximum array size also depends on the JVM).
In worse cases, e.g. n = 25000, the work size overflows Int and becomes positive again (80032704), which leads to weird behavior.
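A quick standalone check (not part of the PR itself) makes the overflow visible by evaluating the simplified bound for n = 25000 in both Long and Int arithmetic:

val n = 25000
val exact = 7L * n * n + 4L * n   // 4,375,100,000 -- larger than Int.MaxValue (2,147,483,647)
val wrapped = exact.toInt         // wraps modulo 2^32 and comes out as a small positive number (~8e7)
println(s"exact = $exact, Int.MaxValue = ${Int.MaxValue}, wrapped = $wrapped")
// LAPACK is then handed a workspace far smaller than it actually needs.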
The PR is only the beginning, to support Genbase (an important biological benchmark that would help promote Spark in genetic applications, http://www.paradigm4.com/wp-content/uploads/2014/06/Genomics-Benchmark-Technical-Report.pdf),
which needs to compute the SVD of matrices up to 60K * 70K. I found many potential issues and would like to know whether there is any ongoing plan to expand the range of matrix computation based on Spark.
Thanks.