Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-20040][ML][python] pyspark wrapper for ChiSquareTest #17421

Closed
wants to merge 5 commits into from

Conversation

MrBago
Copy link
Contributor

@MrBago MrBago commented Mar 25, 2017

What changes were proposed in this pull request?

A pyspark wrapper for spark.ml.stat.ChiSquareTest

How was this patch tested?

unit tests
doctests

@jkbradley
Copy link
Member

add to whitelist

@SparkQA
Copy link

SparkQA commented Mar 25, 2017

Test build #75192 has finished for PR 17421 at commit a6bc10c.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ChiSquareTest(object):

@jkbradley
Copy link
Member

Just remembered: you'll also need to update python/docs/pyspark.ml.rst for doc gen

@jkbradley
Copy link
Member

RAT tests are for checking that the Apache license appears at the top of each file

@SparkQA
Copy link

SparkQA commented Mar 25, 2017

Test build #75195 has finished for PR 17421 at commit 37e187b.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 25, 2017

Test build #75198 has finished for PR 17421 at commit b71caef.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MrBago MrBago force-pushed the chiSquareTestWrapper branch from b71caef to 32a0b0c Compare March 25, 2017 02:16
Copy link
Member

@jkbradley jkbradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I made a first review pass.

@@ -1692,6 +1692,23 @@ def test_new_java_array(self):
self.assertEqual(_java2py(self.sc, java_array), [])


class ChiSquareTestTests(SparkSessionTestCase):

def test_ChiSquareTest(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little arbitrary, but to follow other examples, write this as: test_chisquaretest

from pyspark.ml.wrapper import _jvm


class ChiSquareTest(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mark as Experimental (Search for other example of this)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we put the triple-quotes on their own line elsewhere in pyspark


def test_ChiSquareTest(self):
labels = [1, 2, 0]
vectors = [_convert_to_vector([0, 1, 2]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use DenseVector, not _convert_to_vector. (use public APIs wherever possible)

vectors = [_convert_to_vector([0, 1, 2]),
_convert_to_vector([1, 1, 1]),
_convert_to_vector([2, 1, 0])]
data = zip(labels, vectors)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can also be nicer to write this in a per-row format, rather than zipping labels and vectors which are defined separately. See other examples of createDataFrame in this file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for the doc test

data = zip(labels, vectors)
df = self.spark.createDataFrame(data, ['label', 'feat'])
res = ChiSquareTest.test(df, 'feat', 'label')
# pValues = res.select("pValues").collect())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Noting that this can be updated once the Spark SQL bug is fixed)



class ChiSquareTest(object):
""" Conduct Pearson's independence test for every feature against the label. For each feature,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just saw you changed this from the Scala doc b/c I left "RDD" there. Would you mind correcting the Scala doc too?


The null hypothesis is that the occurrence of the outcomes is statistically independent.

:param dataset:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy param text from the Scala doc, unless there's a need to customize it for Python

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for the return value text

@SparkQA
Copy link

SparkQA commented Mar 25, 2017

Test build #75199 has finished for PR 17421 at commit 32a0b0c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 26, 2017

Test build #3612 has finished for PR 17421 at commit 32a0b0c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 26, 2017

Test build #3615 has finished for PR 17421 at commit 32a0b0c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

add to whitelist

@jkbradley
Copy link
Member

LGTM pending tests

@MrBago MrBago force-pushed the chiSquareTestWrapper branch from e00fc49 to 60d268c Compare March 27, 2017 20:09
@MrBago MrBago force-pushed the chiSquareTestWrapper branch from 60d268c to 3e7163c Compare March 27, 2017 20:11
@SparkQA
Copy link

SparkQA commented Mar 27, 2017

Test build #75269 has finished for PR 17421 at commit 32a0b0c.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 27, 2017

Test build #75270 has finished for PR 17421 at commit e00fc49.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@@ -431,6 +431,7 @@ def __hash__(self):
"pyspark.ml.linalg.__init__",
"pyspark.ml.recommendation",
"pyspark.ml.regression",
"pyspark.ml.stat",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just took it out in 314cf51 , but since this is adding back in ml.stat we also need to update setup.py (you might need to update your branch from the latest master to see this).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@holdenk thanks for catching that, should be fixed now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, do we need to update setup.py? This is creating a module, not a package, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sub-modules aren't automatically packaged so we do need to explicitly add it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jkbradley, I reverted setup.py.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@holdenk If we need to add pyspark.ml.stat to setup.py, then why are we not adding the other analogous modules: pyspark.ml.{classification, clustering, regression,...}?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yah sorry, its anything which is a new sub-directory and when I was reading this PR yesterday I thought this was a new directory, but looking it today that isn't the case, sorry.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, no problem, I just wanted to check.

@MrBago MrBago force-pushed the chiSquareTestWrapper branch from 114baf0 to 3e7163c Compare March 27, 2017 22:38
Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick read through, thanks for working on this :)

globs['spark'] = spark
import tempfile

temp_path = tempfile.mkdtemp()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this test is using the temp path?

from numpy import (
abs, all, arange, array, array_equal, dot, exp, inf, mean, ones, random, tile, zeros)
from numpy import sum as array_sum
from numpy import abs, all, arange, array, array_equal, inf, ones, tile, zeros
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for cleaning up the numpy imports :) +1

@@ -431,6 +431,7 @@ def __hash__(self):
"pyspark.ml.linalg.__init__",
"pyspark.ml.recommendation",
"pyspark.ml.regression",
"pyspark.ml.stat",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sub-modules aren't automatically packaged so we do need to explicitly add it.

@SparkQA
Copy link

SparkQA commented Mar 27, 2017

Test build #75278 has finished for PR 17421 at commit 3e7163c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 28, 2017

Test build #75280 has finished for PR 17421 at commit 114baf0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 28, 2017

Test build #75281 has finished for PR 17421 at commit 3e7163c.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 28, 2017

Test build #3617 has finished for PR 17421 at commit 3e7163c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MrBago MrBago force-pushed the chiSquareTestWrapper branch from 1ce5966 to e79f968 Compare March 28, 2017 22:19
@jkbradley
Copy link
Member

LGTM pending tests

@SparkQA
Copy link

SparkQA commented Mar 29, 2017

Test build #75329 has finished for PR 17421 at commit 1ce5966.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 29, 2017

Test build #75330 has finished for PR 17421 at commit e79f968.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

Merging with master
Thanks!

@asfgit asfgit closed this in a5c8770 Mar 29, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants