
[SPARK-15194] [ML] Add Python ML API for MultivariateGaussian #13248

Closed

Conversation


@praveendareddy21 commented on May 22, 2016

What changes were proposed in this pull request?

Added MultivariateGaussian to PySpark ML to match Scala's ML API.

How was this patch tested?

Tested locally and also added test cases ported from Scala's test suite.

@AmplabJenkins

Can one of the admins verify this patch?


from pyspark.ml.linalg import DenseVector, DenseMatrix, Vector
import numpy as np

This import should be moved above.

@MechCoder
Contributor

@praveendareddy21 Just made a first pass. Also, please run PEP8 checks on your code.

This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution. In
the event that the covariance matrix is singular, the density will be computed in a
reduced dimensional subspace under which the distribution is supported.
(see [[http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case]])

You could use

`<http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case>`_

to make sure the link is displayed correctly in the documentation.
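For background on the degenerate case described in the quoted docstring, here is a minimal NumPy sketch (not code from this PR) of evaluating the density on the supported subspace via a pseudo-inverse and pseudo-determinant of the covariance; it only illustrates the math, not Spark's implementation:

```python
import numpy as np

def degenerate_gaussian_pdf(x, mu, sigma, tol=1e-9):
    """Density of N(mu, sigma) at x, allowing a singular covariance matrix."""
    d = np.asarray(x) - np.asarray(mu)
    vals, vecs = np.linalg.eigh(np.asarray(sigma))   # sigma is symmetric
    keep = vals > tol                                # non-degenerate directions
    rank = int(keep.sum())
    pseudo_det = np.prod(vals[keep])                 # product of non-zero eigenvalues
    inv_vals = np.where(keep, 1.0 / np.where(keep, vals, 1.0), 0.0)
    pseudo_inv = (vecs * inv_vals).dot(vecs.T)       # Moore-Penrose pseudo-inverse
    norm_const = 1.0 / np.sqrt((2.0 * np.pi) ** rank * pseudo_det)
    return norm_const * np.exp(-0.5 * d.dot(pseudo_inv).dot(d))
```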

@vectorijk
Contributor

@praveendareddy21 To generate the documentation for this API correctly, you could include this in spark/python/docs/pyspark.ml.rst:

pyspark.ml.stat module
----------------------------

.. automodule:: pyspark.ml.stat
    :members:
    :undoc-members:
    :inherited-members:

Also, just like @MechCoder said, you could run spark/dev/lint-python to make sure you pass all the PEP8 checks.

@vectorijk
Contributor

ping @praveendareddy21 Is this still active? If not, I could help with this.

@praveendareddy21
Author

@vectorijk I will be pushing the changes in a few days.

jjthomas and others added 12 commits June 28, 2016 16:13
## What changes were proposed in this pull request?

Network word count example for structured streaming

## How was this patch tested?

Run locally

Author: James Thomas <[email protected]>

Closes apache#13816 from jjthomas/master.

(cherry picked from commit 3554713)
Signed-off-by: Tathagata Das <[email protected]>
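For reference, here is a minimal PySpark sketch of such a structured network word count (the socket host/port and the overall shape are illustrative assumptions, not the exact example added by that commit):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()

# Read lines from a socket source (localhost:9999 is just an example).
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

# Split each line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
word_counts = words.groupBy("word").count()

# Print the running counts to the console until the query is stopped.
query = word_counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```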
…Writer` and `DataStreamWriter`

## What changes were proposed in this pull request?

Fixes a couple of old references from `DataFrameWriter.startStream` to `DataStreamWriter.start`.

Author: Burak Yavuz <[email protected]>

Closes apache#13952 from brkyvz/minor-doc-fix.

(cherry picked from commit 5545b79)
Signed-off-by: Shixiong Zhu <[email protected]>
## What changes were proposed in this pull request?

Add unit tests for CSV data for SparkR.

## How was this patch tested?

unit tests

Author: Felix Cheung <[email protected]>

Closes apache#13904 from felixcheung/rcsv.

(cherry picked from commit 823518c)
Signed-off-by: Shivaram Venkataraman <[email protected]>
## What changes were proposed in this pull request?

Fixed the following error:
```
>>> sqlContext.readStream
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...", line 442, in readStream
    return DataStreamReader(self._wrapped)
NameError: global name 'DataStreamReader' is not defined
```

## How was this patch tested?

The added test.

Author: Shixiong Zhu <[email protected]>

Closes apache#13958 from zsxwing/fix-import.

(cherry picked from commit 5bf8881)
Signed-off-by: Tathagata Das <[email protected]>
## What changes were proposed in this pull request?
This patch removes the blind fallback into Hive for functions. Instead, it creates a whitelist and adds only a small number of functions to the whitelist, i.e. the ones we intend to support in the long run in Spark.

## How was this patch tested?
Updated tests to reflect the change.

Author: Reynold Xin <[email protected]>

Closes apache#13939 from rxin/hive-whitelist.

(cherry picked from commit 363bced)
Signed-off-by: Reynold Xin <[email protected]>
….PCA

## What changes were proposed in this pull request?
Model loading backward compatibility for ml.feature.PCA.

## How was this patch tested?
Existing unit tests and a manual test for loading models saved by Spark 1.6.

Author: Yanbo Liang <[email protected]>

Closes apache#13937 from yanboliang/spark-16245.

(cherry picked from commit 0df5ce1)
Signed-off-by: Xiangrui Meng <[email protected]>
## What changes were proposed in this pull request?

There is some duplicated code for options in the DataFrame reader/writer API; this PR cleans it up and also fixes a bug with `escapeQuotes` in csv().

## How was this patch tested?

Existing tests.

Author: Davies Liu <[email protected]>

Closes apache#13948 from davies/csv_options.
…rk.sql to pyspark.sql.streaming

## What changes were proposed in this pull request?

- Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming to make them consistent with Scala packaging
- Exposed the necessary classes in sql.streaming package so that they appear in the docs
- Added pyspark.sql.streaming module to the docs

## How was this patch tested?
- updated unit tests.
- generated docs for testing visibility of pyspark.sql.streaming classes.

Author: Tathagata Das <[email protected]>

Closes apache#13955 from tdas/SPARK-16266.
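A small illustration of the new packaging (this is not code from that commit):

```python
# The streaming reader/writer classes now live in pyspark.sql.streaming,
# mirroring the Scala package org.apache.spark.sql.streaming.
from pyspark.sql.streaming import DataStreamReader, DataStreamWriter

# In practice they are obtained from the session or a DataFrame rather than
# constructed directly:
#   spark.readStream   -> DataStreamReader
#   df.writeStream     -> DataStreamWriter
```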
…doc is incorrect for toJavaRDD, …

## What changes were proposed in this pull request?

Change the return type mentioned in the JavaDoc for `toJavaRDD` / `javaRDD` to match the actual return type & be consistent with the scala rdd return type.

## How was this patch tested?

Docs only change.

Author: Holden Karau <[email protected]>

Closes apache#13954 from holdenk/trivial-streaming-tojavardd-doc-fix.

(cherry picked from commit 757dc2c)
Signed-off-by: Tathagata Das <[email protected]>
…tions that reference no input attributes

## What changes were proposed in this pull request?

`MAX(COUNT(*))` is invalid since aggregate expression can't be nested within another aggregate expression. This case should be captured at analysis phase, but somehow sneaks off to runtime.

The reason is that when checking aggregate expressions in `CheckAnalysis`, a checking branch treats all expressions that reference no input attributes as valid ones. However, `MAX(COUNT(*))` is translated into `MAX(COUNT(1))` at analysis phase and also references no input attribute.

This PR fixes this issue by removing the aforementioned branch.

## How was this patch tested?

New test case added in `AnalysisErrorSuite`.

Author: Cheng Lian <[email protected]>

Closes apache#13968 from liancheng/spark-16291-nested-agg-functions.

(cherry picked from commit d1e8108)
Signed-off-by: Wenchen Fan <[email protected]>
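To illustrate the class of query this catches (a made-up example, not the test added in `AnalysisErrorSuite`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["key", "value"]) \
    .createOrReplaceTempView("t")

# Nesting one aggregate inside another is invalid and should now be rejected
# at analysis time instead of failing at runtime:
#   spark.sql("SELECT MAX(COUNT(*)) FROM t")   # raises AnalysisException

# The usual rewrite computes the inner aggregate per group first, then takes the max:
spark.sql("SELECT MAX(cnt) FROM (SELECT COUNT(*) AS cnt FROM t GROUP BY key) tmp").show()
```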
## What changes were proposed in this pull request?

Some appNames in ML examples are incorrect, mostly in PySpark but one in Scala.  This corrects the names.

## How was this patch tested?
Style, local tests

Author: Bryan Cutler <[email protected]>

Closes apache#13949 from BryanCutler/pyspark-example-appNames-fix-SPARK-16261.

(cherry picked from commit 21385d0)
Signed-off-by: Nick Pentreath <[email protected]>
## What changes were proposed in this pull request?
Fix wrong arguments description of ```survreg``` in SparkR.

## How was this patch tested?
```Arguments``` section of ```survreg``` doc before this PR (with wrong description for ```path``` and missing ```overwrite```):
![image](https://cloud.githubusercontent.com/assets/1962026/16447548/fe7a5ed4-3da1-11e6-8b96-b5bf2083b07e.png)

After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/16447617/368e0b18-3da2-11e6-8277-45640fb11859.png)

Author: Yanbo Liang <[email protected]>

Closes apache#13970 from yanboliang/spark-16143-followup.

(cherry picked from commit c6a220d)
Signed-off-by: Xiangrui Meng <[email protected]>
dongjoon-hyun and others added 12 commits July 8, 2016 17:07
## What changes were proposed in this pull request?

This PR implements the `sentences` SQL function.

## How was this patch tested?

Pass the Jenkins tests with a new test case.

Author: Dongjoon Hyun <[email protected]>

Closes apache#14004 from dongjoon-hyun/SPARK_16285.

(cherry picked from commit a54438c)
Signed-off-by: Wenchen Fan <[email protected]>
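As a rough illustration of the function (a hypothetical call, not the commit's test case):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sentences(text[, lang, country]) splits text into an array of sentences,
# each of which is an array of words.
spark.sql("SELECT sentences('Hi there! Good morning.')").show(truncate=False)
# Expected shape of the result: [[Hi, there], [Good, morning]]
```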
…partition

## What changes were proposed in this pull request?

tallSkinnyQR of RowMatrix should be aware of empty partitions, which could cause an exception from the Breeze QR decomposition.

See the [archived dev mail](https://mail-archives.apache.org/mod_mbox/spark-dev/201510.mbox/%3CCAF7ADNrycvPL3qX-VZJhq4OYmiUUhoscut_tkOm63Cm18iK1tQmail.gmail.com%3E) for more details.

## How was this patch tested?

Scala unit test.

Author: Xusen Yin <[email protected]>

Closes apache#14049 from yinxusen/SPARK-16369.

(cherry picked from commit 255d74f)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?

Adds a quoteAll option for writing CSV which will quote all fields.
See https://issues.apache.org/jira/browse/SPARK-13638

## How was this patch tested?

Added a test to verify that the output columns are quoted for all fields in the DataFrame.

Author: Jurriaan Pruis <[email protected]>

Closes apache#13374 from jurriaan/csv-quote-all.

(cherry picked from commit 38cf8f2)
Signed-off-by: Reynold Xin <[email protected]>
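A quick sketch of the option from PySpark (the output path is just an example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2)], ["name", "n"])

# With quoteAll enabled, every field is quoted, not only the ones that need it.
df.write.option("quoteAll", "true").mode("overwrite").csv("/tmp/quoted_csv_example")
```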
## What changes were proposed in this pull request?

This uses the try/finally pattern to ensure streams are closed after use. `UnsafeShuffleWriter` wasn't closing compression streams, causing them to leak resources until garbage collected. This was causing a problem with codecs that use off-heap memory.

## How was this patch tested?

Current tests are sufficient. This should not change behavior.

Author: Ryan Blue <[email protected]>

Closes apache#14093 from rdblue/SPARK-16420-unsafe-shuffle-writer-leak.

(cherry picked from commit 67e085e)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?

This PR adds the parse_url SQL function in order to remove the Hive fallback.

A new implementation of apache#13999

## How was this patch tested?

Pass the existing tests, including new test cases.

Author: wujian <[email protected]>

Closes apache#14008 from janplus/SPARK-16281.

(cherry picked from commit f5fef69)
Signed-off-by: Reynold Xin <[email protected]>
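For illustration, `parse_url(url, part[, key])` extracts components of a URL (hypothetical values, not the commit's test cases):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql(
    "SELECT parse_url('http://spark.apache.org/docs?latest=true', 'HOST') AS host, "
    "parse_url('http://spark.apache.org/docs?latest=true', 'QUERY', 'latest') AS latest"
).show(truncate=False)
# host -> spark.apache.org, latest -> true
```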
…r scala 2.10

## What changes were proposed in this pull request?
This PR adds the hive-thriftserver profile to the Scala 2.10 build created by release-build.sh.

Author: Yin Huai <[email protected]>

Closes apache#14108 from yhuai/SPARK-16453.

(cherry picked from commit 60ba436)
Signed-off-by: Yin Huai <[email protected]>
## What changes were proposed in this pull request?

Currently, the JDBC writer uses dialects to get data types, but doesn't use them to quote field names. This PR uses dialects to quote the field names, too.

**Reported Error Scenario (MySQL case)**
```scala
scala> val url="jdbc:mysql://localhost:3306/temp"
scala> val prop = new java.util.Properties
scala> prop.setProperty("user","root")
scala> val df = spark.createDataset(Seq("a","b","c")).toDF("order")
scala> df.write.mode("overwrite").jdbc(url, "temptable", prop)
...MySQLSyntaxErrorException: ... near 'order TEXT )
```

## How was this patch tested?

Pass the Jenkins tests and manually do the above case.

Author: Dongjoon Hyun <[email protected]>

Closes apache#14107 from dongjoon-hyun/SPARK-16387.

(cherry picked from commit 3b22291)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?
Allow for Kafka topic subscriptions based on a regex pattern.

## How was this patch tested?
Unit tests, manual tests

Author: cody koeninger <[email protected]>

Closes apache#14026 from koeninger/SPARK-13569.

(cherry picked from commit fd6e8f0)
Signed-off-by: Tathagata Das <[email protected]>
…rest api "/applications//jobs" if array "stageIds" is empty

## What changes were proposed in this pull request?

Avoid an error finding the max of an empty Seq when stageIds is empty. It does fix the immediate problem; I don't know if it results in meaningful output, but at least it is not an error.

## How was this patch tested?

Jenkins tests

Author: Sean Owen <[email protected]>

Closes apache#14105 from srowen/SPARK-16376.

(cherry picked from commit 6cef018)
Signed-off-by: Reynold Xin <[email protected]>
…ByteBuffer

## What changes were proposed in this pull request?

It's possible to also change the callers to not pass in empty chunks, but it seems cleaner to just allow `ChunkedByteBuffer` to handle empty arrays. cc JoshRosen

## How was this patch tested?

Unit tests, also checked that the original reproduction case in apache#11748 (comment) is resolved.

Author: Eric Liang <[email protected]>

Closes apache#14099 from ericl/spark-16432.

(cherry picked from commit d8b06f1)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?

Documentation changes to indicate that fine-grained mode is now deprecated.  No code changes were made, and all fine-grained mode instructions were left in place.  We can remove all of that once the deprecation cycle completes (Does Spark have a standard deprecation cycle?  One major version?)

Blocked on apache#14059

## How was this patch tested?

Viewed in GitHub

Author: Michael Gummelt <[email protected]>

Closes apache#14078 from mgummelt/deprecate-fine-grained.

(cherry picked from commit b1db26a)
Signed-off-by: Reynold Xin <[email protected]>
… and CreatableRelationProvider without Extending SchemaRelationProvider

#### What changes were proposed in this pull request?
When users try to implement a data source API by extending only `RelationProvider` and `CreatableRelationProvider`, they will hit an error when resolving the relation.
```Scala
spark.read
  .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
  .load()
  .write
  .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
  .save()
```

The error they hit looks like this:
```
org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema does not allow user-specified schemas.;
org.apache.spark.sql.AnalysisException: org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema does not allow user-specified schemas.;
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:319)
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:494)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
```

Actually, the bug fix is simple. [`DataSource.createRelation(sparkSession.sqlContext, mode, options, data)`](https://github.com/gatorsmile/spark/blob/dd644f8117e889cebd6caca58702a7c7e3d88bef/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L429) already returns a BaseRelation. We should not assign schema to `userSpecifiedSchema`. That schema assignment only makes sense for the data sources that extend `FileFormat`.

#### How was this patch tested?
Added a test case.

Author: gatorsmile <[email protected]>

Closes apache#14075 from gatorsmile/dataSource.

(cherry picked from commit 7374e51)
Signed-off-by: Wenchen Fan <[email protected]>
>>> x = Vectors.dense([1.0, 1.0])
>>> m = MultivariateGaussian(mu, sigma)
>>> m.pdf(x)
0.0682586811486

To run the doctests, I think we need to call doctest.testmod() explicitly like other modules do; check mllib/util.py.

Also, this module needs to be added to the python_test_goals of the pyspark_ml module object in dev/sparktestsupport/modules.py.
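As a rough sketch of that pattern (the module name pyspark.ml.stat is an assumption based on the suggested packaging, not this PR's actual code), the module would typically end with something like:

```python
import doctest
import sys


def _test():
    # Run this module's doctests, following the pattern used by other
    # PySpark modules such as mllib/util.py.
    import pyspark.ml.stat  # assumed module name for this PR
    globs = pyspark.ml.stat.__dict__.copy()
    (failure_count, test_count) = doctest.testmod(
        pyspark.ml.stat, globs=globs, optionflags=doctest.ELLIPSIS)
    if failure_count:
        sys.exit(-1)


if __name__ == "__main__":
    _test()
```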

@praveendareddy21
Author

@MechCoder @drcrallen @jjthomas @vectorijk
Kindly review my new changes.

@MechCoder
Contributor

Can you please reopen the pull request against the Spark master branch?

@praveendareddy21
Author

Reopened the pull request on the master branch: #14375

@holdenk
Contributor

holdenk commented Aug 3, 2016

Can you close this one then?
