[MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square feature selection #1484
QA tests have started for PR 1484. This patch merges cleanly. |
QA results for PR 1484: |
@mengxr Could you review or comment on this? Thanks! |
Sure. We had some transformers implemented under
and we can hide the implementation from public interfaces. Please let me know whether this sounds good to you. |
@avulanov I have the same concern about calling I want to add another candidate to what you proposed:
We can discuss the class hierarchy later since they are not user-facing. A problem with all the candidates here is we cannot apply the same transformation on
@mengxr Btw., discretization is needed for feature selection. Do you plan to merge this https://issues.apache.org/jira/browse/SPARK-1303 ? |
Btw, I will re-visit the discretization PR after v1.1 to make sure it doesn't have performance issues. |
@avulanov In 1.1, we have For the transformer name, |
@mengxr Sure! Thanks for suggestion. |
Test build #23232 has started for PR 1484 at commit
|
@mengxr Just to clarify: I'll implement |
Test build #23232 has finished for PR 1484 at commit
|
Test PASSed. |
@avulanov We have ChiSq tests implemented under "mllib.stat.Statistics": Could you please call the method there and select the top features based on the test statistics? That way we would have a single place for the ChiSq implementation. |
@mengxr |
No, |
Ok, thanks! Sorry, I didn't understand the API at first sight :) |
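For readers following the exchange above, here is a minimal sketch of the idea being discussed: rank features by their chi-square test statistic against the label and keep the top k. It is plain Scala with no Spark dependency, so the statistic is computed by hand; in MLlib the per-feature statistics would instead come from the ChiSq test in `mllib.stat.Statistics`, as suggested in the thread. The object name `ChiSqTopK` and the methods `chiSq` and `selectTopK` are hypothetical names chosen for this illustration, not the code merged in this PR, and the sketch assumes features are already discretized into integer categories.

```scala
// Illustrative sketch only: names below are hypothetical, not the PR's API.
object ChiSqTopK {

  // Pearson chi-square statistic for one categorical feature against the label.
  def chiSq(feature: Seq[Int], label: Seq[Int]): Double = {
    require(feature.length == label.length, "feature and label must align")
    val n = feature.length.toDouble

    // Observed count per (feature value, label) cell, plus both marginals.
    val observed = feature.zip(label).groupBy(identity).map { case (k, v) => k -> v.size.toDouble }
    val featTotals = feature.groupBy(identity).map { case (k, v) => k -> v.size.toDouble }
    val labelTotals = label.groupBy(identity).map { case (k, v) => k -> v.size.toDouble }

    // Sum (observed - expected)^2 / expected over every cell of the contingency table.
    // toSeq keeps one term per cell even when two cells contribute the same value.
    val terms = for (f <- featTotals.keys.toSeq; l <- labelTotals.keys.toSeq) yield {
      val expected = featTotals(f) * labelTotals(l) / n
      val obs = observed.getOrElse((f, l), 0.0)
      (obs - expected) * (obs - expected) / expected
    }
    terms.sum
  }

  // Indices of the k features with the largest statistic, returned in ascending order.
  def selectTopK(columns: IndexedSeq[Seq[Int]], label: Seq[Int], k: Int): Seq[Int] =
    columns.indices
      .map(i => (i, chiSq(columns(i), label)))
      .sortBy { case (_, stat) => -stat }
      .take(k)
      .map(_._1)
      .sorted
}
```

Calling `ChiSqTopK.selectTopK(columns, label, k)` with one `Seq[Int]` per feature column returns the indices to keep; filtering each feature vector down to those indices is then a separate, purely mechanical step.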
Test build #23329 has started for PR 1484 at commit
|
Test build #23329 has finished for PR 1484 at commit
|
Test FAILed. |
@mengxr For some reason I cannot see the trace of the build. It seems that I need to log in to Jenkins, but I don't have an account there. |
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23329/console I saw |
@mengxr Could you suggest why the test fails? |
Test build #559 has started for PR 1484 at commit
|
test this please |
Test build #26451 has started for PR 1484 at commit
|
Test build #26451 has finished for PR 1484 at commit
|
Test PASSed. |
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

import scala.collection.mutable.ArrayBuilder
Organize imports (if you use IntelliJ IDEA, there is a useful plugin: https://plugins.jetbrains.com/plugin/7350)
Test build #26524 has started for PR 1484 at commit
|
@mengxr Thank you for your comments! Done! Do you have any plans to add feature discretization capabilities to MLlib? There are a few links at the top of this thread. |
LGTM pending Jenkins ... |
Test build #26524 has finished for PR 1484 at commit
|
Test PASSed. |
Yes, it would be nice to add feature discretization to MLlib. We had a PR, but as you found when you tried it, it doesn't scale well. I don't have concrete scalable algorithms in mind now. We can discuss more on the JIRA page. |
Merged into master. Thanks! |
The following is implemented:
Needs some optimization in matrix operations.
This request is an attempt to implement feature selection for MLlib; the previous work by the issue author @izendejas was not finished (https://issues.apache.org/jira/browse/SPARK-1473). This request is also related to the data discretization issues https://issues.apache.org/jira/browse/SPARK-1303 and https://issues.apache.org/jira/browse/SPARK-1216, which weren't merged.