
[SPARK-15194] [ML] Add Python ML API for MultivariateGaussian #13248

Closed

Conversation


@praveendareddy21 commented on May 22, 2016

What changes were proposed in this pull request?

Added MultivariateGaussian to PySpark ML to match Scala's ML API.

How was this patch tested?

Tested locally and also added test cases ported from Scala's test suite.

@AmplabJenkins

Can one of the admins verify this patch?


from pyspark.ml.linalg import DenseVector, DenseMatrix, Vector
import numpy as np

This import should be moved above.

@MechCoder
Contributor

@praveendareddy21 Just made a first pass. Also, please run PEP8 checks on your code.

This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution. In
the event that the covariance matrix is singular, the density will be computed in a
reduced dimensional subspace under which the distribution is supported.
(see [[http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case]])

You could use

`<http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Degenerate_case>`_

to make sure the link is displayed correctly in the documentation.
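For background on the degenerate case described in the quoted docstring, here is a minimal NumPy sketch (not code from this PR) of evaluating the density on the supported subspace via a pseudo-inverse and pseudo-determinant of the covariance; it only illustrates the math, not Spark's implementation:

```python
import numpy as np

def degenerate_gaussian_pdf(x, mu, sigma, tol=1e-9):
    """Density of N(mu, sigma) at x, allowing a singular covariance matrix."""
    d = np.asarray(x) - np.asarray(mu)
    vals, vecs = np.linalg.eigh(np.asarray(sigma))   # sigma is symmetric
    keep = vals > tol                                # non-degenerate directions
    rank = int(keep.sum())
    pseudo_det = np.prod(vals[keep])                 # product of non-zero eigenvalues
    inv_vals = np.where(keep, 1.0 / np.where(keep, vals, 1.0), 0.0)
    pseudo_inv = (vecs * inv_vals).dot(vecs.T)       # Moore-Penrose pseudo-inverse
    norm_const = 1.0 / np.sqrt((2.0 * np.pi) ** rank * pseudo_det)
    return norm_const * np.exp(-0.5 * d.dot(pseudo_inv).dot(d))
```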

@vectorijk
Contributor

@praveendareddy21 To generate the documentation for this API correctly, you could include this in spark/python/docs/pyspark.ml.rst:

pyspark.ml.stat module
----------------------------

.. automodule:: pyspark.ml.stat
    :members:
    :undoc-members:
    :inherited-members:

Also, just like @MechCoder said, you could run spark/dev/lint-python to make sure you pass all the PEP8 checks.

@vectorijk
Contributor

ping @praveendareddy21 Is this still active? If not, I could help with this.

@praveendareddy21
Author

@vectorijk I will be pushing the changes in a few days.

jjthomas and others added 12 commits June 28, 2016 16:13
## What changes were proposed in this pull request?

Network word count example for structured streaming

## How was this patch tested?

Run locally

Author: James Thomas <[email protected]>

Closes apache#13816 from jjthomas/master.

(cherry picked from commit 3554713)
Signed-off-by: Tathagata Das <[email protected]>
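For reference, here is a minimal PySpark sketch of such a structured network word count (the socket host/port and the overall shape are illustrative assumptions, not the exact example added by that commit):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()

# Read lines from a socket source (localhost:9999 is just an example).
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

# Split each line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
word_counts = words.groupBy("word").count()

# Print the running counts to the console until the query is stopped.
query = word_counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```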
…Writer` and `DataStreamWriter`

## What changes were proposed in this pull request?

Fixes a couple of old references from `DataFrameWriter.startStream` to `DataStreamWriter.start`.

Author: Burak Yavuz <[email protected]>

Closes apache#13952 from brkyvz/minor-doc-fix.

(cherry picked from commit 5545b79)
Signed-off-by: Shixiong Zhu <[email protected]>
## What changes were proposed in this pull request?

Add unit tests for CSV data for SparkR.

## How was this patch tested?

unit tests

Author: Felix Cheung <[email protected]>

Closes apache#13904 from felixcheung/rcsv.

(cherry picked from commit 823518c)
Signed-off-by: Shivaram Venkataraman <[email protected]>
## What changes were proposed in this pull request?

Fixed the following error:
```
>>> sqlContext.readStream
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...", line 442, in readStream
    return DataStreamReader(self._wrapped)
NameError: global name 'DataStreamReader' is not defined
```

## How was this patch tested?

The added test.

Author: Shixiong Zhu <[email protected]>

Closes apache#13958 from zsxwing/fix-import.

(cherry picked from commit 5bf8881)
Signed-off-by: Tathagata Das <[email protected]>
## What changes were proposed in this pull request?
This patch removes the blind fallback into Hive for functions. Instead, it creates a whitelist and adds only a small number of functions to the whitelist, i.e. the ones we intend to support in the long run in Spark.

## How was this patch tested?
Updated tests to reflect the change.

Author: Reynold Xin <[email protected]>

Closes apache#13939 from rxin/hive-whitelist.

(cherry picked from commit 363bced)
Signed-off-by: Reynold Xin <[email protected]>
….PCA

## What changes were proposed in this pull request?
Model loading backward compatibility for ml.feature.PCA.

## How was this patch tested?
Existing unit tests and a manual test for loading models saved by Spark 1.6.

Author: Yanbo Liang <[email protected]>

Closes apache#13937 from yanboliang/spark-16245.

(cherry picked from commit 0df5ce1)
Signed-off-by: Xiangrui Meng <[email protected]>
## What changes were proposed in this pull request?

There is some duplicated code for options in the DataFrame reader/writer API; this PR cleans it up and also fixes a bug with `escapeQuotes` in csv().

## How was this patch tested?

Existing tests.

Author: Davies Liu <[email protected]>

Closes apache#13948 from davies/csv_options.
…rk.sql to pyspark.sql.streaming

## What changes were proposed in this pull request?

- Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming to make them consistent with Scala packaging
- Exposed the necessary classes in sql.streaming package so that they appear in the docs
- Added pyspark.sql.streaming module to the docs

## How was this patch tested?
- updated unit tests.
- generated docs for testing visibility of pyspark.sql.streaming classes.

Author: Tathagata Das <[email protected]>

Closes apache#13955 from tdas/SPARK-16266.
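A small illustration of the new packaging (this is not code from that commit):

```python
# The streaming reader/writer classes now live in pyspark.sql.streaming,
# mirroring the Scala package org.apache.spark.sql.streaming.
from pyspark.sql.streaming import DataStreamReader, DataStreamWriter

# In practice they are obtained from the session or a DataFrame rather than
# constructed directly:
#   spark.readStream   -> DataStreamReader
#   df.writeStream     -> DataStreamWriter
```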
…doc is incorrect for toJavaRDD, …

## What changes were proposed in this pull request?

Change the return type mentioned in the JavaDoc for `toJavaRDD` / `javaRDD` to match the actual return type & be consistent with the scala rdd return type.

## How was this patch tested?

Docs only change.

Author: Holden Karau <[email protected]>

Closes apache#13954 from holdenk/trivial-streaming-tojavardd-doc-fix.

(cherry picked from commit 757dc2c)
Signed-off-by: Tathagata Das <[email protected]>
…tions that reference no input attributes

## What changes were proposed in this pull request?

`MAX(COUNT(*))` is invalid since aggregate expression can't be nested within another aggregate expression. This case should be captured at analysis phase, but somehow sneaks off to runtime.

The reason is that when checking aggregate expressions in `CheckAnalysis`, a checking branch treats all expressions that reference no input attributes as valid ones. However, `MAX(COUNT(*))` is translated into `MAX(COUNT(1))` at analysis phase and also references no input attribute.

This PR fixes this issue by removing the aforementioned branch.

## How was this patch tested?

New test case added in `AnalysisErrorSuite`.

Author: Cheng Lian <[email protected]>

Closes apache#13968 from liancheng/spark-16291-nested-agg-functions.

(cherry picked from commit d1e8108)
Signed-off-by: Wenchen Fan <[email protected]>
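To illustrate the class of query this catches (a made-up example, not the test added in `AnalysisErrorSuite`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["key", "value"]) \
    .createOrReplaceTempView("t")

# Nesting one aggregate inside another is invalid and should now be rejected
# at analysis time instead of failing at runtime:
#   spark.sql("SELECT MAX(COUNT(*)) FROM t")   # raises AnalysisException

# The usual rewrite computes the inner aggregate per group first, then takes the max:
spark.sql("SELECT MAX(cnt) FROM (SELECT COUNT(*) AS cnt FROM t GROUP BY key) tmp").show()
```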
## What changes were proposed in this pull request?

Some appNames in ML examples are incorrect, mostly in PySpark but one in Scala.  This corrects the names.

## How was this patch tested?
Style, local tests

Author: Bryan Cutler <[email protected]>

Closes apache#13949 from BryanCutler/pyspark-example-appNames-fix-SPARK-16261.

(cherry picked from commit 21385d0)
Signed-off-by: Nick Pentreath <[email protected]>
## What changes were proposed in this pull request?
Fix wrong arguments description of ```survreg``` in SparkR.

## How was this patch tested?
```Arguments``` section of ```survreg``` doc before this PR (with wrong description for ```path``` and missing ```overwrite```):
![image](https://cloud.githubusercontent.com/assets/1962026/16447548/fe7a5ed4-3da1-11e6-8b96-b5bf2083b07e.png)

After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/16447617/368e0b18-3da2-11e6-8277-45640fb11859.png)

Author: Yanbo Liang <[email protected]>

Closes apache#13970 from yanboliang/spark-16143-followup.

(cherry picked from commit c6a220d)
Signed-off-by: Xiangrui Meng <[email protected]>
dongjoon-hyun and others added 12 commits July 8, 2016 17:07
## What changes were proposed in this pull request?

This PR implements the `sentences` SQL function.

## How was this patch tested?

Pass the Jenkins tests with a new test case.

Author: Dongjoon Hyun <[email protected]>

Closes apache#14004 from dongjoon-hyun/SPARK_16285.

(cherry picked from commit a54438c)
Signed-off-by: Wenchen Fan <[email protected]>
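As a rough illustration of the function (a hypothetical call, not the commit's test case):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sentences(text[, lang, country]) splits text into an array of sentences,
# each of which is an array of words.
spark.sql("SELECT sentences('Hi there! Good morning.')").show(truncate=False)
# Expected shape of the result: [[Hi, there], [Good, morning]]
```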
…partition

## What changes were proposed in this pull request?

tallSkinnyQR of RowMatrix should be aware of empty partitions, which could cause an exception from the Breeze QR decomposition.

See the [archived dev mail](https://mail-archives.apache.org/mod_mbox/spark-dev/201510.mbox/%3CCAF7ADNrycvPL3qX-VZJhq4OYmiUUhoscut_tkOm63Cm18iK1tQmail.gmail.com%3E) for more details.

## How was this patch tested?

Scala unit test.

Author: Xusen Yin <[email protected]>

Closes apache#14049 from yinxusen/SPARK-16369.

(cherry picked from commit 255d74f)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?

Adds a quoteAll option for writing CSV which will quote all fields.
See https://issues.apache.org/jira/browse/SPARK-13638

## How was this patch tested?

Added a test to verify that the output columns are quoted for all fields in the DataFrame.

Author: Jurriaan Pruis <[email protected]>

Closes apache#13374 from jurriaan/csv-quote-all.

(cherry picked from commit 38cf8f2)
Signed-off-by: Reynold Xin <[email protected]>
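A quick sketch of the option from PySpark (the output path is just an example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2)], ["name", "n"])

# With quoteAll enabled, every field is quoted, not only the ones that need it.
df.write.option("quoteAll", "true").mode("overwrite").csv("/tmp/quoted_csv_example")
```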
## What changes were proposed in this pull request?

This uses the try/finally pattern to ensure streams are closed after use. `UnsafeShuffleWriter` wasn't closing compression streams, causing them to leak resources until garbage collected. This was causing a problem with codecs that use off-heap memory.

## How was this patch tested?

Current tests are sufficient. This should not change behavior.

Author: Ryan Blue <[email protected]>

Closes apache#14093 from rdblue/SPARK-16420-unsafe-shuffle-writer-leak.

(cherry picked from commit 67e085e)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?

This PR adds the parse_url SQL function in order to remove the Hive fallback.

A new implementation of apache#13999

## How was this patch tested?

Pass the existing tests, including new test cases.

Author: wujian <[email protected]>

Closes apache#14008 from janplus/SPARK-16281.

(cherry picked from commit f5fef69)
Signed-off-by: Reynold Xin <[email protected]>
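For illustration, `parse_url(url, part[, key])` extracts components of a URL (hypothetical values, not the commit's test cases):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql(
    "SELECT parse_url('http://spark.apache.org/docs?latest=true', 'HOST') AS host, "
    "parse_url('http://spark.apache.org/docs?latest=true', 'QUERY', 'latest') AS latest"
).show(truncate=False)
# host -> spark.apache.org, latest -> true
```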
…r scala 2.10

## What changes were proposed in this pull request?
This PR adds the hive-thriftserver profile to the Scala 2.10 build created by release-build.sh.

Author: Yin Huai <[email protected]>

Closes apache#14108 from yhuai/SPARK-16453.

(cherry picked from commit 60ba436)
Signed-off-by: Yin Huai <[email protected]>
## What changes were proposed in this pull request?

Currently, the JDBC writer uses dialects to get data types, but doesn't use them to quote field names. This PR uses dialects to quote the field names, too.

**Reported Error Scenario (MySQL case)**
```scala
scala> val url="jdbc:mysql://localhost:3306/temp"
scala> val prop = new java.util.Properties
scala> prop.setProperty("user","root")
scala> val df = spark.createDataset(Seq("a","b","c")).toDF("order")
scala> df.write.mode("overwrite").jdbc(url, "temptable", prop)
...MySQLSyntaxErrorException: ... near 'order TEXT )
```

## How was this patch tested?

Pass the Jenkins tests and manually do the above case.

Author: Dongjoon Hyun <[email protected]>

Closes apache#14107 from dongjoon-hyun/SPARK-16387.

(cherry picked from commit 3b22291)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?
Allow for Kafka topic subscriptions based on a regex pattern.

## How was this patch tested?
Unit tests, manual tests

Author: cody koeninger <[email protected]>

Closes apache#14026 from koeninger/SPARK-13569.

(cherry picked from commit fd6e8f0)
Signed-off-by: Tathagata Das <[email protected]>
…rest api "/applications//jobs" if array "stageIds" is empty

## What changes were proposed in this pull request?

Avoid an error finding the max of an empty Seq when stageIds is empty. It does fix the immediate problem; I don't know if it results in meaningful output, but at least it is not an error.

## How was this patch tested?

Jenkins tests

Author: Sean Owen <[email protected]>

Closes apache#14105 from srowen/SPARK-16376.

(cherry picked from commit 6cef018)
Signed-off-by: Reynold Xin <[email protected]>
…ByteBuffer

## What changes were proposed in this pull request?

It's possible to also change the callers to not pass in empty chunks, but it seems cleaner to just allow `ChunkedByteBuffer` to handle empty arrays. cc JoshRosen

## How was this patch tested?

Unit tests, also checked that the original reproduction case in apache#11748 (comment) is resolved.

Author: Eric Liang <[email protected]>

Closes apache#14099 from ericl/spark-16432.

(cherry picked from commit d8b06f1)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?

Documentation changes to indicate that fine-grained mode is now deprecated.  No code changes were made, and all fine-grained mode instructions were left in place.  We can remove all of that once the deprecation cycle completes (Does Spark have a standard deprecation cycle?  One major version?)

Blocked on apache#14059

## How was this patch tested?

Viewed in GitHub

Author: Michael Gummelt <[email protected]>

Closes apache#14078 from mgummelt/deprecate-fine-grained.

(cherry picked from commit b1db26a)
Signed-off-by: Reynold Xin <[email protected]>
… and CreatableRelationProvider without Extending SchemaRelationProvider

#### What changes were proposed in this pull request?
When users try to implement a data source API by extending only `RelationProvider` and `CreatableRelationProvider`, they will hit an error when resolving the relation.
```Scala
spark.read
  .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
  .load()
  .write
  .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
  .save()
```

The error they hit looks like this:
```
org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema does not allow user-specified schemas.;
org.apache.spark.sql.AnalysisException: org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema does not allow user-specified schemas.;
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:319)
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:494)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
```

Actually, the bug fix is simple. [`DataSource.createRelation(sparkSession.sqlContext, mode, options, data)`](https://github.com/gatorsmile/spark/blob/dd644f8117e889cebd6caca58702a7c7e3d88bef/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L429) already returns a BaseRelation. We should not assign schema to `userSpecifiedSchema`. That schema assignment only makes sense for the data sources that extend `FileFormat`.

#### How was this patch tested?
Added a test case.

Author: gatorsmile <[email protected]>

Closes apache#14075 from gatorsmile/dataSource.

(cherry picked from commit 7374e51)
Signed-off-by: Wenchen Fan <[email protected]>
>>> x = Vectors.dense([1.0, 1.0])
>>> m = MultivariateGaussian(mu, sigma)
>>> m.pdf(x)
0.0682586811486

To run the doctests, I think we need to call doctest.testmod() explicitly like other modules do; check mllib/util.py.

Also, this module needs to be added to the python_test_goals of the pyspark_ml module object in dev/sparktestsupport/modules.py.
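As a rough sketch of that pattern (the module name pyspark.ml.stat is an assumption based on the suggested packaging, not this PR's actual code), the module would typically end with something like:

```python
import doctest
import sys


def _test():
    # Run this module's doctests, following the pattern used by other
    # PySpark modules such as mllib/util.py.
    import pyspark.ml.stat  # assumed module name for this PR
    globs = pyspark.ml.stat.__dict__.copy()
    (failure_count, test_count) = doctest.testmod(
        pyspark.ml.stat, globs=globs, optionflags=doctest.ELLIPSIS)
    if failure_count:
        sys.exit(-1)


if __name__ == "__main__":
    _test()
```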

@praveendareddy21
Author

@MechCoder @drcrallen @jjthomas @vectorijk
Kindly review my new changes.

@MechCoder
Contributor

Can you please reopen the pull request against the Spark master branch?

@praveendareddy21
Author

Reopened the pull request on the master branch: #14375

@holdenk
Contributor

holdenk commented Aug 3, 2016

Can you close this one then?
