[SPARK-20040][ML][python] pyspark wrapper for ChiSquareTest #17421
Conversation
add to whitelist
Test build #75192 has finished for PR 17421 at commit
Just remembered: you'll also need to update python/docs/pyspark.ml.rst for doc gen
RAT tests are for checking that the Apache license appears at the top of each file
Test build #75195 has finished for PR 17421 at commit
Test build #75198 has finished for PR 17421 at commit
Force-pushed from b71caef to 32a0b0c
Thanks for the PR! I made a first review pass.
python/pyspark/ml/tests.py (Outdated)

@@ -1692,6 +1692,23 @@ def test_new_java_array(self):
        self.assertEqual(_java2py(self.sc, java_array), [])


class ChiSquareTestTests(SparkSessionTestCase):

    def test_ChiSquareTest(self):
This is a little arbitrary, but to follow other examples, write this as: test_chisquaretest
from pyspark.ml.wrapper import _jvm


class ChiSquareTest(object):
Mark as Experimental (search for other examples of this)
Also, we put the triple-quotes on their own line elsewhere in pyspark
python/pyspark/ml/tests.py (Outdated)

    def test_ChiSquareTest(self):
        labels = [1, 2, 0]
        vectors = [_convert_to_vector([0, 1, 2]),
Use DenseVector, not _convert_to_vector. (Use public APIs wherever possible.)
python/pyspark/ml/tests.py
Outdated
vectors = [_convert_to_vector([0, 1, 2]), | ||
_convert_to_vector([1, 1, 1]), | ||
_convert_to_vector([2, 1, 0])] | ||
data = zip(labels, vectors) |
It can also be nicer to write this in a per-row format, rather than zipping labels and vectors which are defined separately. See other examples of createDataFrame in this file.
Same for the doc test
python/pyspark/ml/tests.py (Outdated)

        data = zip(labels, vectors)
        df = self.spark.createDataFrame(data, ['label', 'feat'])
        res = ChiSquareTest.test(df, 'feat', 'label')
        # pValues = res.select("pValues").collect())
(Noting that this can be updated once the Spark SQL bug is fixed)
python/pyspark/ml/stat.py (Outdated)


class ChiSquareTest(object):
    """ Conduct Pearson's independence test for every feature against the label. For each feature,
I just saw you changed this from the Scala doc b/c I left "RDD" there. Would you mind correcting the Scala doc too?
    The null hypothesis is that the occurrence of the outcomes is statistically independent.

    :param dataset:
Copy param text from the Scala doc, unless there's a need to customize it for Python
Same for the return value text
Test build #75199 has finished for PR 17421 at commit
Test build #3612 has finished for PR 17421 at commit
Test build #3615 has finished for PR 17421 at commit
add to whitelist
LGTM pending tests
Force-pushed from e00fc49 to 60d268c
Force-pushed from 60d268c to 3e7163c
Test build #75269 has finished for PR 17421 at commit
Test build #75270 has finished for PR 17421 at commit
@@ -431,6 +431,7 @@ def __hash__(self):
    "pyspark.ml.linalg.__init__",
    "pyspark.ml.recommendation",
    "pyspark.ml.regression",
    "pyspark.ml.stat",
We just took it out in 314cf51, but since this adds ml.stat back in, we also need to update setup.py (you might need to update your branch from the latest master to see this).
@holdenk thanks for catching that, should be fixed now.
Wait, do we need to update setup.py? This is creating a module, not a package, right?
Sub-modules aren't automatically packaged so we do need to explicitly add it.
Thanks @jkbradley, I reverted setup.py.
@holdenk If we need to add pyspark.ml.stat to setup.py, then why are we not adding the other analogous modules: pyspark.ml.{classification, clustering, regression,...}?
Oh yeah, sorry: it applies to anything that is a new sub-directory. When I read this PR yesterday I thought it added a new directory, but looking at it today that isn't the case. Sorry.
OK, no problem, I just wanted to check.
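To make the packaging rule above concrete, here is a hypothetical excerpt of a setuptools packages list (the entries are illustrative, not the real setup.py): a sub-package (its own directory) must be listed explicitly, while a plain module file such as pyspark/ml/stat.py ships automatically with its parent package.

```python
# Hypothetical excerpt of a setup.py packages list (illustrative names).
packages = [
    "pyspark",
    "pyspark.ml",
    "pyspark.ml.linalg",  # a sub-package (its own directory): must be listed
    # pyspark/ml/stat.py is a module inside pyspark.ml, so it needs no entry;
    # it is installed along with the "pyspark.ml" package.
]
```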
Force-pushed from 114baf0 to 3e7163c
Quick read through, thanks for working on this :)
python/pyspark/ml/stat.py (Outdated)

    globs['spark'] = spark
    import tempfile

    temp_path = tempfile.mkdtemp()
I don't think this test is using the temp path?
from numpy import (
    abs, all, arange, array, array_equal, dot, exp, inf, mean, ones, random, tile, zeros)
from numpy import sum as array_sum
from numpy import abs, all, arange, array, array_equal, inf, ones, tile, zeros
Thanks for cleaning up the numpy imports :) +1
Test build #75278 has finished for PR 17421 at commit
Test build #75280 has finished for PR 17421 at commit
Test build #75281 has finished for PR 17421 at commit
Test build #3617 has finished for PR 17421 at commit
Force-pushed from 1ce5966 to e79f968
LGTM pending tests
Test build #75329 has finished for PR 17421 at commit
Test build #75330 has finished for PR 17421 at commit
Merging with master
What changes were proposed in this pull request?
A pyspark wrapper for spark.ml.stat.ChiSquareTest
How was this patch tested?
unit tests
doctests