[SPARK-20040][ML][python] pyspark wrapper for ChiSquareTest #17421
Conversation
add to whitelist
Test build #75192 has finished for PR 17421 at commit
Just remembered: you'll also need to update python/docs/pyspark.ml.rst for doc gen
RAT tests are for checking that the Apache license appears at the top of each file
Test build #75195 has finished for PR 17421 at commit
Test build #75198 has finished for PR 17421 at commit
Force-pushed from b71caef to 32a0b0c
Thanks for the PR! I made a first review pass.
python/pyspark/ml/tests.py (Outdated)

@@ -1692,6 +1692,23 @@ def test_new_java_array(self):
        self.assertEqual(_java2py(self.sc, java_array), [])


class ChiSquareTestTests(SparkSessionTestCase):

    def test_ChiSquareTest(self):
This is a little arbitrary, but to follow other examples, write this as: test_chisquaretest
from pyspark.ml.wrapper import _jvm


class ChiSquareTest(object):
Mark as Experimental (search for other examples of this)
Also, we put the triple-quotes on their own line elsewhere in pyspark
python/pyspark/ml/tests.py (Outdated)

    def test_ChiSquareTest(self):
        labels = [1, 2, 0]
        vectors = [_convert_to_vector([0, 1, 2]),
Use DenseVector, not _convert_to_vector. (Use public APIs wherever possible.)
python/pyspark/ml/tests.py
Outdated
vectors = [_convert_to_vector([0, 1, 2]), | ||
_convert_to_vector([1, 1, 1]), | ||
_convert_to_vector([2, 1, 0])] | ||
data = zip(labels, vectors) |
It can also be nicer to write this in a per-row format, rather than zipping labels and vectors which are defined separately. See other examples of createDataFrame in this file.
Same for the doc test
python/pyspark/ml/tests.py (Outdated)

        data = zip(labels, vectors)
        df = self.spark.createDataFrame(data, ['label', 'feat'])
        res = ChiSquareTest.test(df, 'feat', 'label')
        # pValues = res.select("pValues").collect())
(Noting that this can be updated once the Spark SQL bug is fixed)
python/pyspark/ml/stat.py (Outdated)


class ChiSquareTest(object):
    """ Conduct Pearson's independence test for every feature against the label. For each feature,
I just saw you changed this from the Scala doc b/c I left "RDD" there. Would you mind correcting the Scala doc too?
    The null hypothesis is that the occurrence of the outcomes is statistically independent.

    :param dataset:
Copy param text from the Scala doc, unless there's a need to customize it for Python
Same for the return value text
Test build #75199 has finished for PR 17421 at commit
Test build #3612 has finished for PR 17421 at commit
Test build #3615 has finished for PR 17421 at commit
add to whitelist
LGTM pending tests
Force-pushed from e00fc49 to 60d268c
Force-pushed from 60d268c to 3e7163c
Test build #75269 has finished for PR 17421 at commit
Test build #75270 has finished for PR 17421 at commit
@@ -431,6 +431,7 @@ def __hash__(self):
    "pyspark.ml.linalg.__init__",
    "pyspark.ml.recommendation",
    "pyspark.ml.regression",
    "pyspark.ml.stat",
We just took it out in 314cf51, but since this adds ml.stat back in, we also need to update setup.py (you might need to update your branch from the latest master to see this).
@holdenk thanks for catching that, should be fixed now.
Wait, do we need to update setup.py? This is creating a module, not a package, right?
Sub-modules aren't automatically packaged so we do need to explicitly add it.
Thanks @jkbradley, I reverted setup.py.
@holdenk If we need to add pyspark.ml.stat to setup.py, then why are we not adding the other analogous modules: pyspark.ml.{classification, clustering, regression,...}?
Oh yeah, sorry: it applies to anything that is a new sub-directory. When I read this PR yesterday I thought it added a new directory, but looking at it today that isn't the case. Sorry.
OK, no problem, I just wanted to check.
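To make the packaging rule above concrete, here is a hypothetical excerpt of a setuptools packages list (the entries are illustrative, not the real setup.py): a sub-package (its own directory) must be listed explicitly, while a plain module file such as pyspark/ml/stat.py ships automatically with its parent package.

```python
# Hypothetical excerpt of a setup.py packages list (illustrative names).
packages = [
    "pyspark",
    "pyspark.ml",
    "pyspark.ml.linalg",  # a sub-package (its own directory): must be listed
    # pyspark/ml/stat.py is a module inside pyspark.ml, so it needs no entry;
    # it is installed along with the "pyspark.ml" package.
]
```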
Force-pushed from 114baf0 to 3e7163c
Quick read through, thanks for working on this :)
python/pyspark/ml/stat.py (Outdated)

    globs['spark'] = spark
    import tempfile

    temp_path = tempfile.mkdtemp()
I don't think this test is using the temp path?
from numpy import (
    abs, all, arange, array, array_equal, dot, exp, inf, mean, ones, random, tile, zeros)
from numpy import sum as array_sum
from numpy import abs, all, arange, array, array_equal, inf, ones, tile, zeros
Thanks for cleaning up the numpy imports :) +1
Test build #75278 has finished for PR 17421 at commit
Test build #75280 has finished for PR 17421 at commit
Test build #75281 has finished for PR 17421 at commit
Test build #3617 has finished for PR 17421 at commit
Force-pushed from 1ce5966 to e79f968
LGTM pending tests
Test build #75329 has finished for PR 17421 at commit
Test build #75330 has finished for PR 17421 at commit
Merging with master
What changes were proposed in this pull request?
A pyspark wrapper for spark.ml.stat.ChiSquareTest
How was this patch tested?
unit tests
doctests