Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow schema calculation to be lazy, but ensure it is available on executors. #1

Merged
merged 1 commit into from
Oct 23, 2014

Conversation

marmbrus
Copy link

No description provided.

jkbradley added a commit that referenced this pull request Oct 23, 2014
Allow schema calculation to be lazy, but ensure it is available on executors.
@jkbradley jkbradley merged commit c2cc23d into jkbradley:sql-udt Oct 23, 2014
@jkbradley
Copy link
Owner

@marmbrus Thank you very much!

jkbradley pushed a commit that referenced this pull request Oct 27, 2014
Given that a lot of users are trying to use hive 0.13 in spark, and the incompatibility between hive-0.12 and hive-0.13 on the API level I want to propose following approach, which has no or minimum impact on existing hive-0.12 support, but be able to jumpstart the development of hive-0.13 and future version support.

Approach: Introduce “hive-version” property,  and manipulate pom.xml files to support different hive version at compiling time through shim layer, e.g., hive-0.12.0 and hive-0.13.1. More specifically,

1. For each different hive version, there is a very light layer of shim code to handle API differences, sitting in sql/hive/hive-version, e.g., sql/hive/v0.12.0 or sql/hive/v0.13.1

2. Add a new profile hive-default active by default, which picks up all existing configuration and hive-0.12.0 shim (v0.12.0)  if no hive.version is specified.

3. If user specifies different version (currently only 0.13.1 by -Dhive.version = 0.13.1), hive-versions profile will be activated, which pick up hive-version specific shim layer and configuration, mainly the hive jars and hive-version shim, e.g., v0.13.1.

4. With this approach, nothing is changed with current hive-0.12 support.

No change by default: sbt/sbt -Phive
For example: sbt/sbt -Phive -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly

To enable hive-0.13: sbt/sbt -Dhive.version=0.13.1
For example: sbt/sbt -Dhive.version=0.13.1 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly

Note that in hive-0.13, hive-thriftserver is not enabled, which should be fixed by other Jira, and we don’t need -Phive with -Dhive.version in building (probably we should use -Phive -Dhive.version=xxx instead after thrift server is also supported in hive-0.13.1).

Author: Zhan Zhang <[email protected]>
Author: zhzhan <[email protected]>
Author: Patrick Wendell <[email protected]>

Closes apache#2241 from zhzhan/spark-2706 and squashes the following commits:

3ece905 [Zhan Zhang] minor fix
410b668 [Zhan Zhang] solve review comments
cbb4691 [Zhan Zhang] change run-test for new options
0d4d2ed [Zhan Zhang] rebase
497b0f4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
8fad1cf [Zhan Zhang] change the pom file and make hive-0.13.1 as the default
ab028d1 [Zhan Zhang] rebase
4a2e36d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
4cb1b93 [zhzhan] Merge pull request #1 from pwendell/pr-2241
b0478c0 [Patrick Wendell] Changes to simplify the build of SPARK-2706
2b50502 [Zhan Zhang] rebase
a72c0d4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
cb22863 [Zhan Zhang] correct the typo
20f6cf7 [Zhan Zhang] solve compatability issue
f7912a9 [Zhan Zhang] rebase and solve review feedback
301eb4a [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
10c3565 [Zhan Zhang] address review comments
6bc9204 [Zhan Zhang] rebase and remove temparory repo
d3aa3f2 [Zhan Zhang] Merge branch 'master' into spark-2706
cedcc6f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
3ced0d7 [Zhan Zhang] rebase
d9b981d [Zhan Zhang] rebase and fix error due to rollback
adf4924 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
3dd50e8 [Zhan Zhang] solve conflicts and remove unnecessary implicts
d10bf00 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
dc7bdb3 [Zhan Zhang] solve conflicts
7e0cc36 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
d7c3e1e [Zhan Zhang] Merge branch 'master' into spark-2706
68deb11 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
d48bd18 [Zhan Zhang] address review comments
3ee3b2b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
57ea52e [Zhan Zhang] Merge branch 'master' into spark-2706
2b0d513 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
9412d24 [Zhan Zhang] address review comments
f4af934 [Zhan Zhang] rebase
1ccd7cc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
128b60b [Zhan Zhang] ignore 0.12.0 test cases for the time being
af9feb9 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
5f5619f [Zhan Zhang] restructure the directory and different hive version support
05d3683 [Zhan Zhang] solve conflicts
e4c1982 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
94b4fdc [Zhan Zhang] Spark-2706: hive-0.13.1 support on spark
87ebf3b [Zhan Zhang] Merge branch 'master' into spark-2706
921e914 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
f896b2a [Zhan Zhang] Merge branch 'master' into spark-2706
789ea21 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
cb53a2c [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
f6a8a40 [Zhan Zhang] revert
ba14f28 [Zhan Zhang] test
dbedff3 [Zhan Zhang] Merge remote-tracking branch 'upstream/master'
70964fe [Zhan Zhang] revert
fe0f379 [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark
70ffd93 [Zhan Zhang] revert
42585ec [Zhan Zhang] test
7d5fce2 [Zhan Zhang] test
jkbradley pushed a commit that referenced this pull request Nov 2, 2014
This adds a Streaming KMeans algorithm to MLlib. It uses an update rule that generalizes the mini-batch KMeans update to incorporate a decay factor, which allows past data to be forgotten. The decay factor can be specified explicitly, or via a more intuitive "fractional decay" setting, in units of either data points or batches.

The PR includes:
- StreamingKMeans algorithm with decay factor settings
- Usage example
- Additions to documentation clustering page
- Unit tests of basic behavior and decay behaviors

tdas mengxr rezazadeh

Author: freeman <[email protected]>
Author: Jeremy Freeman <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes apache#2942 from freeman-lab/streaming-kmeans and squashes the following commits:

b2e5b4a [freeman] Fixes to docs / examples
078617c [Jeremy Freeman] Merge pull request #1 from mengxr/SPARK-3254
2e682c0 [Xiangrui Meng] take discount on previous weights; use BLAS; detect dying clusters
0411bf5 [freeman] Change decay parameterization
9f7aea9 [freeman] Style fixes
374a706 [freeman] Formatting
ad9bdc2 [freeman] Use labeled points and predictOnValues in examples
77dbd3f [freeman] Make initialization check an assertion
9cfc301 [freeman] Make random seed an argument
44050a9 [freeman] Simpler constructor
c7050d5 [freeman] Fix spacing
2899623 [freeman] Use pattern matching for clarity
a4a316b [freeman] Use collect
1472ec5 [freeman] Doc formatting
ea22ec8 [freeman] Fix imports
2086bdc [freeman] Log cluster center updates
ea9877c [freeman] More documentation
9facbe3 [freeman] Bug fix
5db7074 [freeman] Example usage for StreamingKMeans
f33684b [freeman] Add explanation and example to docs
b5b5f8d [freeman] Add better documentation
a0fd790 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans
9fd9c15 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans
b93350f [freeman] Streaming KMeans with decay
jkbradley pushed a commit that referenced this pull request Nov 2, 2014
The additions add the abstract BinaryFileInputFormat and BinaryRecordReader classes for reading in data as a byte stream and converting it to another format using the ```def parseByteArray(inArray: Array[Byte]): T``` function.
As a trivial example ```ByteInputFormat``` and ```ByteRecordReader``` are included which just return the Array[Byte] from a given file.
Finally a RDD for ```BinaryFileInputFormat``` (to allow for easier partitioning changes as was done for WholeFileInput) was added and the appropriate byteFiles to the ```SparkContext``` so the functions can be easily used by others.
A common use case might be to read in a folder
```
sc.byteFiles("s3://mydrive/tif/*.tif").map(rawData => ReadTiffFromByteArray(rawData))
```

Author: Kevin Mader <[email protected]>
Author: Kevin Mader <[email protected]>

Closes apache#1658 from kmader/master and squashes the following commits:

3c49a30 [Kevin Mader] fixing wholetextfileinput to it has the same setMinPartitions function as in BinaryData files
359a096 [Kevin Mader] making the final corrections suggested by @mateiz and renaming a few functions to make their usage clearer
6379be4 [Kevin Mader] reorganizing code
7b9d181 [Kevin Mader] removing developer API, cleaning up imports
8ac288b [Kevin Mader] fixed a single slightly over 100 character line
92bda0d [Kevin Mader] added new tests, renamed files, fixed several of the javaapi functions, formatted code more nicely
a32fef7 [Kevin Mader] removed unneeded classes added DeveloperApi note to portabledatastreams since the implementation might change
49174d9 [Kevin Mader] removed unneeded classes added DeveloperApi note to portabledatastreams since the implementation might change
c27a8f1 [Kevin Mader] jenkins crashed before running anything last time, so making minor change
b348ce1 [Kevin Mader] fixed order in check (prefix only appears on jenkins not when I run unit tests locally)
0588737 [Kevin Mader] filename check in "binary file input as byte array" test now ignores prefixes and suffixes which might get added by Hadoop
4163e38 [Kevin Mader] fixing line length and output from FSDataInputStream to DataInputStream to minimize sensitivity to Hadoop API changes
19812a8 [Kevin Mader] Fixed the serialization issue with PortableDataStream since neither CombineFileSplit nor TaskAttemptContext implement the Serializable interface, by using ByteArrays for storing both and then recreating the objects from these bytearrays as needed.
238c83c [Kevin Mader] fixed several scala-style issues, changed structure of binaryFiles, removed excessive classes added new tests. The caching tests still have a serialization issue, but that should be easily fixed as well.
932a206 [Kevin Mader] Update RawFileInput.scala
a01c9cf [Kevin Mader] Update RawFileInput.scala
441f79a [Kevin Mader] fixed a few small comments and dependency
12e7be1 [Kevin Mader] removing imglib from maven (definitely not ready yet)
5deb79e [Kevin Mader] added new portabledatastream to code so that it can be serialized correctly
f032bc0 [Kevin Mader] fixed bug in path name, renamed tests
bc5c0b9 [Kevin Mader] made minor stylistic adjustments from mateiz
df8e528 [Kevin Mader] fixed line lengths and changed java test
9a313d5 [Kevin Mader] making classes that needn't be public private, adding automatic file closure, adding new tests
edf5829 [Kevin Mader] fixing line lengths, adding new lines
f4841dc [Kevin Mader] un-optimizing imports, silly intellij
eacfaa6 [Kevin Mader] Added FixedLengthBinaryInputFormat and RecordReader from freeman-lab and added them to both the JavaSparkContext and the SparkContext as fixedLengthBinaryFile
1622935 [Kevin Mader] changing the line lengths to make jenkins happy
1cfa38a [Kevin Mader] added apache headers, added datainputstream directly as an output option for more complicated readers (HDF5 perhaps), and renamed several of the functions and files to be more consistent. Also added parallel functions to the java api
84035f1 [Kevin Mader] adding binary and byte file support spark
81c5f12 [Kevin Mader] Merge pull request #1 from apache/master
jkbradley pushed a commit that referenced this pull request Nov 2, 2014
Add `metadata: Metadata` to `StructField` to store extra information of columns. `Metadata` is a simple wrapper over `Map[String, Any]` with value types restricted to Boolean, Long, Double, String, Metadata, and arrays of those types. SerDe is via JSON.

Metadata is preserved through simple operations like `SELECT`.

marmbrus liancheng

Author: Xiangrui Meng <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes apache#2701 from mengxr/structfield-metadata and squashes the following commits:

dedda56 [Xiangrui Meng] merge remote
5ef930a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
c35203f [Xiangrui Meng] Merge pull request #1 from marmbrus/pr/2701
886b85c [Michael Armbrust] Expose Metadata and MetadataBuilder through the public scala and java packages.
589f314 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
1e2abcf [Xiangrui Meng] change default value of metadata to None in python
611d3c2 [Xiangrui Meng] move metadata from Expr to NamedExpr
ddfcfad [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
a438440 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
4266f4d [Xiangrui Meng] add StructField.toString back for backward compatibility
3f49aab [Xiangrui Meng] remove StructField.toString
24a9f80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
473a7c5 [Xiangrui Meng] merge master
c9d7301 [Xiangrui Meng] organize imports
1fcbf13 [Xiangrui Meng] change metadata type in StructField for Scala/Java
60cc131 [Xiangrui Meng] add doc and header
60614c7 [Xiangrui Meng] add metadata
e42c452 [Xiangrui Meng] merge master
93518fb [Xiangrui Meng] support metadata in python
905bb89 [Xiangrui Meng] java conversions
618e349 [Xiangrui Meng] make tests work in scala
61b8e0f [Xiangrui Meng] merge master
7e5a322 [Xiangrui Meng] do not output metadata in StructField.toString
c41a664 [Xiangrui Meng] merge master
d8af0ed [Xiangrui Meng] move tests to SQLQuerySuite
67fdebb [Xiangrui Meng] add test on join
d65072e [Xiangrui Meng] remove Map.empty
367d237 [Xiangrui Meng] add test
c194d5e [Xiangrui Meng] add metadata field to StructField and Attribute
jkbradley pushed a commit that referenced this pull request Nov 18, 2014
This is a pull request regarding SPARK-2510 at https://issues.apache.org/jira/browse/SPARK-2510. Word2Vec creates vector representation of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.

To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed.

To investigate the vector representations is to find the closest words for a query word. For example, the top 20 closest words to "china" are for 1 partition and 1 iteration :

taiwan 0.8077646146334014
korea 0.740913304563621
japan 0.7240667798885471
republic 0.7107151279078352
thailand 0.6953217332072862
tibet 0.6916782118129544
mongolia 0.6800858715972612
macau 0.6794925677480378
singapore 0.6594048695593799
manchuria 0.658989931844148
laos 0.6512978726001666
nepal 0.6380792327845325
mainland 0.6365469459587788
myanmar 0.6358614338840394
macedonia 0.6322366180313249
xinjiang 0.6285291551708028
russia 0.6279951236068411
india 0.6272874944023487
shanghai 0.6234544135576999
macao 0.6220588462925876

The result with 10 partitions and 5 iterations is:
taiwan 0.8310495079388313
india 0.7737171315919039
japan 0.756777901233668
korea 0.7429767187102452
indonesia 0.7407557427278356
pakistan 0.712883426985585
mainland 0.7053379963140822
thailand 0.696298191073948
mongolia 0.693690656871415
laos 0.6913069680735292
macau 0.6903427690029617
republic 0.6766381604813666
malaysia 0.676460699141784
singapore 0.6728790997360923
malaya 0.672345232966194
manchuria 0.6703732292753156
macedonia 0.6637955686322028
myanmar 0.6589462882439646
kazakhstan 0.657017801081494
cambodia 0.6542383836451932

Author: Liquan Pei <[email protected]>
Author: Xiangrui Meng <[email protected]>
Author: Liquan Pei <[email protected]>

Closes apache#1719 from Ishiihara/master and squashes the following commits:

2ba9483 [Liquan Pei] minor fix for Word2Vec test
e248441 [Liquan Pei] minor style change
26a948d [Liquan Pei] Merge pull request #1 from mengxr/Ishiihara-master
c14da41 [Xiangrui Meng] fix styles
384c771 [Xiangrui Meng] remove minCount and window from constructor change model to use float instead of double
e93e726 [Liquan Pei] use treeAggregate instead of aggregate
1a8fb41 [Liquan Pei] use weighted sum in combOp
7efbb6f [Liquan Pei] use broadcast version of vocab in aggregate
6bcc8be [Liquan Pei] add multiple iteration support
720b5a3 [Liquan Pei] Add test for Word2Vec algorithm, minor fixes
2e92b59 [Liquan Pei] modify according to feedback
57dc50d [Liquan Pei] code formatting
e4a04d3 [Liquan Pei] minor fix
0aafb1b [Liquan Pei] Add comments, minor fixes
8d6befe [Liquan Pei] initial commit

(cherry picked from commit e053c55)
Signed-off-by: Xiangrui Meng <[email protected]>
jkbradley pushed a commit that referenced this pull request Nov 18, 2014
As described in [SPARK-2627](https://issues.apache.org/jira/browse/SPARK-2627), we'd like Python code to automatically be checked for PEP 8 compliance by Jenkins. This pull request aims to do that.

Notes:
* We may need to install [`pep8`](https://pypi.python.org/pypi/pep8) on the build server.
* I'm expecting tests to fail now that PEP 8 compliance is being checked as part of the build. I'm fine with cleaning up any remaining PEP 8 violations as part of this pull request.
* I did not understand why the RAT and scalastyle reports are saved to text files. I did the same for the PEP 8 check, but only so that the console output style can match those for the RAT and scalastyle checks. The PEP 8 report is removed right after the check is complete.
* Updates to the ["Contributing to Spark"](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) guide will be submitted elsewhere, as I don't believe that text is part of the Spark repo.

Author: Nicholas Chammas <[email protected]>
Author: nchammas <[email protected]>

Closes apache#1744 from nchammas/master and squashes the following commits:

274b238 [Nicholas Chammas] [SPARK-2627] [PySpark] minor indentation changes
983d963 [nchammas] Merge pull request #5 from apache/master
1db5314 [nchammas] Merge pull request #4 from apache/master
0e0245f [Nicholas Chammas] [SPARK-2627] undo erroneous whitespace fixes
bf30942 [Nicholas Chammas] [SPARK-2627] PEP8: comment spacing
6db9a44 [nchammas] Merge pull request #3 from apache/master
7b4750e [Nicholas Chammas] merge upstream changes
91b7584 [Nicholas Chammas] [SPARK-2627] undo unnecessary line breaks
44e3e56 [Nicholas Chammas] [SPARK-2627] use tox.ini to exclude files
b09fae2 [Nicholas Chammas] don't wrap comments unnecessarily
bfb9f9f [Nicholas Chammas] [SPARK-2627] keep up with the PEP 8 fixes
9da347f [nchammas] Merge pull request #2 from apache/master
aa5b4b5 [Nicholas Chammas] [SPARK-2627] follow Spark bash style for if blocks
d0a83b9 [Nicholas Chammas] [SPARK-2627] check that pep8 downloaded fine
dffb5dd [Nicholas Chammas] [SPARK-2627] download pep8 at runtime
a1ce7ae [Nicholas Chammas] [SPARK-2627] space out test report sections
21da538 [Nicholas Chammas] [SPARK-2627] it's PEP 8, not PEP8
6f4900b [Nicholas Chammas] [SPARK-2627] more misc PEP 8 fixes
fe57ed0 [Nicholas Chammas] removing merge conflict backups
9c01d4c [nchammas] Merge pull request #1 from apache/master
9a66cb0 [Nicholas Chammas] resolving merge conflicts
a31ccc4 [Nicholas Chammas] [SPARK-2627] miscellaneous PEP 8 fixes
beaa9ac [Nicholas Chammas] [SPARK-2627] fail check on non-zero status
723ed39 [Nicholas Chammas] always delete the report file
0541ebb [Nicholas Chammas] [SPARK-2627] call Python linter from run-tests
12440fa [Nicholas Chammas] [SPARK-2627] add Scala linter
61c07b9 [Nicholas Chammas] [SPARK-2627] add Python linter
75ad552 [Nicholas Chammas] make check output style consistent

(cherry picked from commit d614967)
Signed-off-by: Reynold Xin <[email protected]>
jkbradley pushed a commit that referenced this pull request Nov 18, 2014
See compiled doc at
http://people.apache.org/~rxin/tmp/openstack-swift/_site/storage-openstack-swift.html

This is based on apache#1010. Closes apache#1010.

Author: Reynold Xin <[email protected]>
Author: Gil Vernik <[email protected]>

Closes apache#2298 from rxin/openstack-swift and squashes the following commits:

ff4e394 [Reynold Xin] Two minor comments from Patrick.
279f6de [Reynold Xin] core-sites -> core-site
dfb8fea [Reynold Xin] Updated based on Gil's suggestion.
846f5cb [Reynold Xin] Added a link from overview page.
0447c9f [Reynold Xin] Removed sample code.
e9c3761 [Reynold Xin] Merge pull request apache#1010 from gilv/master
9233fef [Gil Vernik] Fixed typos
6994827 [Gil Vernik] Merge pull request #1 from rxin/openstack
ac0679e [Reynold Xin] Fixed an unclosed tr.
47ce99d [Reynold Xin] Merge branch 'master' into openstack
cca7192 [Gil Vernik] Removed white spases from pom.xml
99f095d [Reynold Xin] Pending openstack changes.
eb22295 [Reynold Xin] Merge pull request apache#1010 from gilv/master
39a9737 [Gil Vernik] Spark integration with Openstack Swift
c977658 [Gil Vernik] Merge branch 'master' of https://github.com/gilv/spark
2aba763 [Gil Vernik] Fix to docs/openstack-integration.md
9b625b5 [Gil Vernik] Merge branch 'master' of https://github.com/gilv/spark
eff538d [Gil Vernik] SPARK-938 - Openstack Swift object storage support
ce483d7 [Gil Vernik] SPARK-938 - Openstack Swift object storage support
b6c37ef [Gil Vernik] Openstack Swift support
(cherry picked from commit eddfedd)

Signed-off-by: Patrick Wendell <[email protected]>
jkbradley pushed a commit that referenced this pull request Nov 18, 2014
VertexRDD.apply had a bug where it ignored the merge function for
duplicate vertices and instead used whichever vertex attribute occurred
first. This commit fixes the bug by passing the merge function through
to ShippableVertexPartition.apply, which merges any duplicates using the
merge function and then fills in missing vertices using the specified
default vertex attribute. This commit also adds a unit test for
VertexRDD.apply.

Author: Larry Xiao <[email protected]>
Author: Blie Arkansol <[email protected]>
Author: Ankur Dave <[email protected]>

Closes apache#1903 from larryxiao/2062 and squashes the following commits:

625aa9d [Blie Arkansol] Merge pull request #1 from ankurdave/SPARK-2062
476770b [Ankur Dave] ShippableVertexPartition.initFrom: Don't run mergeFunc on default values
614059f [Larry Xiao] doc update: note about the default null value vertices construction
dfdb3c9 [Larry Xiao] minor fix
1c70366 [Larry Xiao] scalastyle check: wrap line, parameter list indent 4 spaces
e4ca697 [Larry Xiao] [TEST] VertexRDD.apply mergeFunc
6a35ea8 [Larry Xiao] [TEST] VertexRDD.apply mergeFunc
4fbc29c [Blie Arkansol] undo unnecessary change
efae765 [Larry Xiao] fix mistakes: should be able to call with or without mergeFunc
b2422f9 [Larry Xiao] Merge branch '2062' of github.com:larryxiao/spark into 2062
52dc7f7 [Larry Xiao] pass mergeFunc to VertexPartitionBase, where merge is handled
581e9ee [Larry Xiao] TODO: VertexRDDSuite
20d80a3 [Larry Xiao] [SPARK-2062][GraphX] VertexRDD.apply does not use the mergeFunc

(cherry picked from commit 3bbbdd8)
Signed-off-by: Ankur Dave <[email protected]>
jkbradley pushed a commit that referenced this pull request Nov 18, 2014
…link to HistoryPage's page #1

There is a Spark logo on the header of HistoryPage.
We can have too many HistoryPages if we run 20+ applications. So I think, it's useful if the logo is as a link to the HistoryPage's page number 1.

Author: Kousuke Saruta <[email protected]>

Closes apache#2690 from sarutak/SPARK-3829 and squashes the following commits:

908c109 [Kousuke Saruta] Removed extra space.
00bfbd7 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3829
dd87480 [Kousuke Saruta] Made header Spark log image as a link to History Server's top page.

(cherry picked from commit b69c9fb)
Signed-off-by: Andrew Or <[email protected]>
jkbradley pushed a commit that referenced this pull request Nov 18, 2014
…ion sometimes

JobProgressPage could not show Fair Scheduler Pools section sometimes.
SparkContext starts webui and then postEnvironmentUpdate. Sometimes JobProgressPage is accessed between webui starting and postEnvironmentUpdate, then the lazy val isFairScheduler will be false. The Fair Scheduler Pools section will not display any more.

Author: yantangzhai <[email protected]>
Author: YanTangZhai <[email protected]>

Closes apache#1966 from YanTangZhai/SPARK-3067 and squashes the following commits:

d4323f8 [yantangzhai] update [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
b6391cc [yantangzhai] revert [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
d2226cd [yantangzhai] [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
aac7f7b [yantangzhai] [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
cdef539 [YanTangZhai] Merge pull request #1 from apache/master

(cherry picked from commit dedace8)
Signed-off-by: Andrew Or <[email protected]>
jkbradley pushed a commit that referenced this pull request Nov 21, 2014
Just submit another solution for apache#395

Author: Daoyuan <[email protected]>
Author: Michael Armbrust <[email protected]>
Author: Daoyuan Wang <[email protected]>

Closes apache#837 from adrian-wang/left-semi-join-support and squashes the following commits:

d39cd12 [Daoyuan Wang] Merge pull request #1 from marmbrus/pr/837
6713c09 [Michael Armbrust] Better debugging for failed query tests.
035b73e [Michael Armbrust] Add test for left semi that can't be done with a hash join.
5ec6fa4 [Michael Armbrust] Add left semi to SQL Parser.
4c726e5 [Daoyuan] improvement according to Michael
8d4a121 [Daoyuan] add golden files for leftsemijoin
83a3c8a [Daoyuan] scala style fix
14cff80 [Daoyuan] add support for left semi join

(cherry picked from commit 0cf6002)
Signed-off-by: Michael Armbrust <[email protected]>
jkbradley pushed a commit that referenced this pull request Nov 21, 2014
Thanks goes to @marmbrus for his implementation.

Author: Michael Armbrust <[email protected]>
Author: Zongheng Yang <[email protected]>

Closes apache#1074 from concretevitamin/option-treenode and squashes the following commits:

ef27b85 [Zongheng Yang] Merge pull request #1 from marmbrus/pr/1074
73133c2 [Michael Armbrust] TreeNodes can't be inner classes.
ab78420 [Zongheng Yang] Add a test.
2ccb721 [Michael Armbrust] Add support for transformation of optional children.

(cherry picked from commit 269fc62)
Signed-off-by: Michael Armbrust <[email protected]>
jkbradley pushed a commit that referenced this pull request Nov 22, 2014
In RDDSampler, it try use numpy to gain better performance for possion(), but the number of call of random() is only (1+faction) * N in the pure python implementation of possion(), so there is no much performance gain from numpy.

numpy is not a dependent of pyspark, so it maybe introduce some problem, such as there is no numpy installed in slaves, but only installed master, as reported in SPARK-927.

It also complicate the code a lot, so we may should remove numpy from RDDSampler.

I also did some benchmark to verify that:
```
>>> from pyspark.mllib.random import RandomRDDs
>>> rdd = RandomRDDs.uniformRDD(sc, 1 << 20, 1).cache()
>>> rdd.count()  # cache it
>>> rdd.sample(True, 0.9).count()    # measure this line
```
the results:

|withReplacement      |  random  | numpy.random |
 ------- | ------------ |  -------
|True | 1.5 s|  1.4 s|
|False|  0.6 s | 0.8 s|

closes apache#2313

Note: this patch including some commits that not mirrored to github, it will be OK after it catches up.

Author: Davies Liu <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes apache#3351 from davies/numpy and squashes the following commits:

5c438d7 [Davies Liu] fix comment
c5b9252 [Davies Liu] Merge pull request #1 from mengxr/SPARK-4477
98eb31b [Xiangrui Meng] make poisson sampling slightly faster
ee17d78 [Davies Liu] remove = for float
13f7b05 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into numpy
f583023 [Davies Liu] fix tests
51649f5 [Davies Liu] remove numpy in RDDSampler
78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
f5fdf63 [Davies Liu] fix bug with int in weights
4dfa2cd [Davies Liu] refactor
f866bcf [Davies Liu] remove unneeded change
c7a2007 [Davies Liu] switch to python implementation
95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
0d9b256 [Davies Liu] refactor
1715ee3 [Davies Liu] address comments
41fce54 [Davies Liu] randomSplit()

(cherry picked from commit d39f2e9)
Signed-off-by: Xiangrui Meng <[email protected]>
jkbradley pushed a commit that referenced this pull request Nov 22, 2014
In RDDSampler, it try use numpy to gain better performance for possion(), but the number of call of random() is only (1+faction) * N in the pure python implementation of possion(), so there is no much performance gain from numpy.

numpy is not a dependent of pyspark, so it maybe introduce some problem, such as there is no numpy installed in slaves, but only installed master, as reported in SPARK-927.

It also complicate the code a lot, so we may should remove numpy from RDDSampler.

I also did some benchmark to verify that:
```
>>> from pyspark.mllib.random import RandomRDDs
>>> rdd = RandomRDDs.uniformRDD(sc, 1 << 20, 1).cache()
>>> rdd.count()  # cache it
>>> rdd.sample(True, 0.9).count()    # measure this line
```
the results:

|withReplacement      |  random  | numpy.random |
 ------- | ------------ |  -------
|True | 1.5 s|  1.4 s|
|False|  0.6 s | 0.8 s|

closes apache#2313

Note: this patch including some commits that not mirrored to github, it will be OK after it catches up.

Author: Davies Liu <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes apache#3351 from davies/numpy and squashes the following commits:

5c438d7 [Davies Liu] fix comment
c5b9252 [Davies Liu] Merge pull request #1 from mengxr/SPARK-4477
98eb31b [Xiangrui Meng] make poisson sampling slightly faster
ee17d78 [Davies Liu] remove = for float
13f7b05 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into numpy
f583023 [Davies Liu] fix tests
51649f5 [Davies Liu] remove numpy in RDDSampler
78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
f5fdf63 [Davies Liu] fix bug with int in weights
4dfa2cd [Davies Liu] refactor
f866bcf [Davies Liu] remove unneeded change
c7a2007 [Davies Liu] switch to python implementation
95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
0d9b256 [Davies Liu] refactor
1715ee3 [Davies Liu] address comments
41fce54 [Davies Liu] randomSplit()
jkbradley pushed a commit that referenced this pull request Dec 3, 2014
…if sql has null

val jsc = new org.apache.spark.api.java.JavaSparkContext(sc)
val jhc = new org.apache.spark.sql.hive.api.java.JavaHiveContext(jsc)
val nrdd = jhc.hql("select null from spark_test.for_test")
println(nrdd.schema)
Then the error is thrown as follows:
scala.MatchError: NullType (of class org.apache.spark.sql.catalyst.types.NullType$)
at org.apache.spark.sql.types.util.DataTypeConversions$.asJavaDataType(DataTypeConversions.scala:43)

Author: YanTangZhai <[email protected]>
Author: yantangzhai <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes apache#3538 from YanTangZhai/MatchNullType and squashes the following commits:

e052dff [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
4b4bb34 [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
896c7b7 [yantangzhai] fix NullType MatchError in JavaSchemaRDD when sql has null
6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
e249846 [YanTangZhai] Merge pull request #10 from apache/master
d26d982 [YanTangZhai] Merge pull request #9 from apache/master
76d4027 [YanTangZhai] Merge pull request #8 from apache/master
03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
jkbradley pushed a commit that referenced this pull request Dec 3, 2014
…the lineage

The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672

Iterative GraphX applications always have long lineage, while checkpoint() on EdgeRDD and VertexRDD themselves cannot shorten the lineage. In contrast, if we perform checkpoint() on their ParitionsRDD, the long lineage can be cut off. Moreover, the existing operations such as cache() in this code is performed on the PartitionsRDD, so checkpoint() should do the same way. More details and explanation can be found in the JIRA.

Author: JerryLead <[email protected]>
Author: Lijie Xu <[email protected]>

Closes apache#3549 from JerryLead/my_graphX_checkpoint and squashes the following commits:

d1aa8d8 [JerryLead] Perform checkpoint() on PartitionsRDD not VertexRDD and EdgeRDD themselves
ff08ed4 [JerryLead] Merge branch 'master' of https://github.com/apache/spark
c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
52799e3 [Lijie Xu] Merge pull request #1 from apache/master
jkbradley pushed a commit that referenced this pull request Dec 3, 2014
…erflow error

The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672

In a nutshell, if `val partitionsRDD` in EdgeRDDImpl and VertexRDDImpl are non-transient, the serialization chain can become very long in iterative algorithms and finally lead to the StackOverflow error. More details and explanation can be found in the JIRA.

Author: JerryLead <[email protected]>
Author: Lijie Xu <[email protected]>

Closes apache#3544 from JerryLead/my_graphX and squashes the following commits:

628f33c [JerryLead] set PartitionsRDD to be transient in EdgeRDDImpl and VertexRDDImpl
c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
52799e3 [Lijie Xu] Merge pull request #1 from apache/master
jkbradley pushed a commit that referenced this pull request Dec 3, 2014
…ation chain

The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672

The f closure of `PartitionsRDD(ZippedPartitionsRDD2)` contains a `$outer` that references EdgeRDD/VertexRDD, which causes task's serialization chain become very long in iterative GraphX applications. As a result, StackOverflow error will occur. If we set "f = null" in `clearDependencies()`, checkpoint() can cut off the long serialization chain. More details and explanation can be found in the JIRA.

Author: JerryLead <[email protected]>
Author: Lijie Xu <[email protected]>

Closes apache#3545 from JerryLead/my_core and squashes the following commits:

f7faea5 [JerryLead] checkpoint() should clear the f to avoid StackOverflow error
c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
52799e3 [Lijie Xu] Merge pull request #1 from apache/master
jkbradley pushed a commit that referenced this pull request Dec 5, 2014
…if sql has null

val jsc = new org.apache.spark.api.java.JavaSparkContext(sc)
val jhc = new org.apache.spark.sql.hive.api.java.JavaHiveContext(jsc)
val nrdd = jhc.hql("select null from spark_test.for_test")
println(nrdd.schema)
Then the error is thrown as follows:
scala.MatchError: NullType (of class org.apache.spark.sql.catalyst.types.NullType$)
at org.apache.spark.sql.types.util.DataTypeConversions$.asJavaDataType(DataTypeConversions.scala:43)

Author: YanTangZhai <[email protected]>
Author: yantangzhai <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes apache#3538 from YanTangZhai/MatchNullType and squashes the following commits:

e052dff [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
4b4bb34 [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
896c7b7 [yantangzhai] fix NullType MatchError in JavaSchemaRDD when sql has null
6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
e249846 [YanTangZhai] Merge pull request #10 from apache/master
d26d982 [YanTangZhai] Merge pull request #9 from apache/master
76d4027 [YanTangZhai] Merge pull request #8 from apache/master
03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master

(cherry picked from commit 1066427)
Signed-off-by: Michael Armbrust <[email protected]>
jkbradley pushed a commit that referenced this pull request Dec 5, 2014
…the lineage

The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672

Iterative GraphX applications always have long lineage, while checkpoint() on EdgeRDD and VertexRDD themselves cannot shorten the lineage. In contrast, if we perform checkpoint() on their ParitionsRDD, the long lineage can be cut off. Moreover, the existing operations such as cache() in this code is performed on the PartitionsRDD, so checkpoint() should do the same way. More details and explanation can be found in the JIRA.

Author: JerryLead <[email protected]>
Author: Lijie Xu <[email protected]>

Closes apache#3549 from JerryLead/my_graphX_checkpoint and squashes the following commits:

d1aa8d8 [JerryLead] Perform checkpoint() on PartitionsRDD not VertexRDD and EdgeRDD themselves
ff08ed4 [JerryLead] Merge branch 'master' of https://github.com/apache/spark
c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
52799e3 [Lijie Xu] Merge pull request #1 from apache/master

(cherry picked from commit fc0a147)
Signed-off-by: Ankur Dave <[email protected]>
jkbradley pushed a commit that referenced this pull request Dec 5, 2014
…erflow error

The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672

In a nutshell, if `val partitionsRDD` in EdgeRDDImpl and VertexRDDImpl are non-transient, the serialization chain can become very long in iterative algorithms and finally lead to the StackOverflow error. More details and explanation can be found in the JIRA.

Author: JerryLead <[email protected]>
Author: Lijie Xu <[email protected]>

Closes apache#3544 from JerryLead/my_graphX and squashes the following commits:

628f33c [JerryLead] set PartitionsRDD to be transient in EdgeRDDImpl and VertexRDDImpl
c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
52799e3 [Lijie Xu] Merge pull request #1 from apache/master

(cherry picked from commit 17c162f)
Signed-off-by: Ankur Dave <[email protected]>
jkbradley pushed a commit that referenced this pull request Dec 5, 2014
…ation chain

The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672

The f closure of `PartitionsRDD(ZippedPartitionsRDD2)` contains a `$outer` that references EdgeRDD/VertexRDD, which causes task's serialization chain become very long in iterative GraphX applications. As a result, StackOverflow error will occur. If we set "f = null" in `clearDependencies()`, checkpoint() can cut off the long serialization chain. More details and explanation can be found in the JIRA.

Author: JerryLead <[email protected]>
Author: Lijie Xu <[email protected]>

Closes apache#3545 from JerryLead/my_core and squashes the following commits:

f7faea5 [JerryLead] checkpoint() should clear the f to avoid StackOverflow error
c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
52799e3 [Lijie Xu] Merge pull request #1 from apache/master

(cherry picked from commit 77be8b9)
Signed-off-by: Ankur Dave <[email protected]>
jkbradley pushed a commit that referenced this pull request Dec 18, 2014
Added predict() to return the cluster labels and membership values
jkbradley pushed a commit that referenced this pull request Dec 30, 2014
…ins an empty AttributeSet() references

The sql "select * from spark_test::for_test where abs(20141202) is not null" has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)) and
partitionKeyIds=AttributeSet(). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)). Then the exception "java.lang.IllegalArgumentException: requirement failed: Partition pruning predicates only supported for partitioned tables." is thrown.
The sql "select * from spark_test::for_test_partitioned_table where abs(20141202) is not null and type_id=11 and platform = 3" with partitioned key insert_date has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202), (type_id#12 = 11), (platform#8 = 3)) and partitionKeyIds=AttributeSet(insert_date#24). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)).

Author: YanTangZhai <[email protected]>
Author: yantangzhai <[email protected]>

Closes apache#3556 from YanTangZhai/SPARK-4693 and squashes the following commits:

620ebe3 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
37cfdf5 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
70a3544 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
efa9b03 [YanTangZhai] Update HiveQuerySuite.scala
72accf1 [YanTangZhai] Update HiveQuerySuite.scala
e572b9a [YanTangZhai] Update HiveStrategies.scala
6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
e249846 [YanTangZhai] Merge pull request #10 from apache/master
d26d982 [YanTangZhai] Merge pull request #9 from apache/master
76d4027 [YanTangZhai] Merge pull request #8 from apache/master
03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
jkbradley pushed a commit that referenced this pull request Dec 30, 2014
…askTracker to reduce the chance of the communicating problem

Using AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the chance of the communicating problem

Author: YanTangZhai <[email protected]>
Author: yantangzhai <[email protected]>

Closes apache#3785 from YanTangZhai/SPARK-4946 and squashes the following commits:

9ca6541 [yantangzhai] [SPARK-4946] [CORE] Using AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the chance of the communicating problem
e4c2c0a [YanTangZhai] Merge pull request #15 from apache/master
718afeb [YanTangZhai] Merge pull request #12 from apache/master
6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
e249846 [YanTangZhai] Merge pull request #10 from apache/master
d26d982 [YanTangZhai] Merge pull request #9 from apache/master
76d4027 [YanTangZhai] Merge pull request #8 from apache/master
03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
jkbradley pushed a commit that referenced this pull request Dec 30, 2014
Addition of a very limited number of local matrix manipulation and generation methods that would be helpful in the further development for algorithms on top of BlockMatrix (SPARK-3974), such as Randomized SVD, and Multi Model Training (SPARK-1486).
The proposed methods for addition are:

For `Matrix`
 - map: maps the values in the matrix with a given function. Produces a new matrix.
 - update: the values in the matrix are updated with a given function. Occurs in place.

Factory methods for `DenseMatrix`:
 - *zeros: Generate a matrix consisting of zeros
 - *ones: Generate a matrix consisting of ones
 - *eye: Generate an identity matrix
 - *rand: Generate a matrix consisting of i.i.d. uniform random numbers
 - *randn: Generate a matrix consisting of i.i.d. gaussian random numbers
 - *diag: Generate a diagonal matrix from a supplied vector
*These methods already exist in the factory methods for `Matrices`, however for cases where we require a `DenseMatrix`, you constantly have to add `.asInstanceOf[DenseMatrix]` everywhere, which makes the code "dirtier". I propose moving these functions to factory methods for `DenseMatrix` where the putput will be a `DenseMatrix` and the factory methods for `Matrices` will call these functions directly and output a generic `Matrix`.

Factory methods for `SparseMatrix`:
 - speye: Identity matrix in sparse format. Saves a ton of memory when dimensions are large, especially in Multi Model Training, where each row requires being multiplied by a scalar.
 - sprand: Generate a sparse matrix with a given density consisting of i.i.d. uniform random numbers.
 - sprandn: Generate a sparse matrix with a given density consisting of i.i.d. gaussian random numbers.
 - diag: Generate a diagonal matrix from a supplied vector, but is memory efficient, because it just stores the diagonal. Again, very helpful in Multi Model Training.

Factory methods for `Matrices`:
 - Include all the factory methods given above, but return a generic `Matrix` rather than `SparseMatrix` or `DenseMatrix`.
 - horzCat: Horizontally concatenate matrices to form one larger matrix. Very useful in both Multi Model Training, and for the repartitioning of BlockMatrix.
 - vertCat: Vertically concatenate matrices to form one larger matrix. Very useful for the repartitioning of BlockMatrix.

The names for these methods were selected from MATLAB

Author: Burak Yavuz <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes apache#3319 from brkyvz/SPARK-4409 and squashes the following commits:

b0354f6 [Burak Yavuz] [SPARK-4409] Incorporated mengxr's code
04c4829 [Burak Yavuz] Merge pull request #1 from mengxr/SPARK-4409
80cfa29 [Xiangrui Meng] minor changes
ecc937a [Xiangrui Meng] update sprand
4e95e24 [Xiangrui Meng] simplify fromCOO implementation
10a63a6 [Burak Yavuz] [SPARK-4409] Fourth pass of code review
f62d6c7 [Burak Yavuz] [SPARK-4409] Modified genRandMatrix
3971c93 [Burak Yavuz] [SPARK-4409] Third pass of code review
75239f8 [Burak Yavuz] [SPARK-4409] Second pass of code review
e4bd0c0 [Burak Yavuz] [SPARK-4409] Modified horzcat and vertcat
65c562e [Burak Yavuz] [SPARK-4409] Hopefully fixed Java Test
d8be7bc [Burak Yavuz] [SPARK-4409] Organized imports
065b531 [Burak Yavuz] [SPARK-4409] First pass after code review
a8120d2 [Burak Yavuz] [SPARK-4409] Finished updates to API according to SPARK-4614
f798c82 [Burak Yavuz] [SPARK-4409] Updated API according to SPARK-4614
c75f3cd [Burak Yavuz] [SPARK-4409] Added JavaAPI Tests, and fixed a couple of bugs
d662f9d [Burak Yavuz] [SPARK-4409] Modified according to remote repo
83dfe37 [Burak Yavuz] [SPARK-4409] Scalastyle error fixed
a14c0da [Burak Yavuz] [SPARK-4409] Initial commit to add methods
jkbradley pushed a commit that referenced this pull request May 12, 2015
…cala and pySpark

Author: Joshi <[email protected]>
Author: Rekha Joshi <[email protected]>

Closes apache#5989 from rekhajoshm/fix/SPARK-7435 and squashes the following commits:

cfc9e02 [Joshi] Spark-7435[R]: updated patch for review comments
62becc1 [Joshi] SPARK-7435: Update to DataFrame
e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
jkbradley pushed a commit that referenced this pull request May 13, 2015
…at has space in its path

escape spaces in the arguments.

Author: Masayoshi TSUZUKI <[email protected]>
Author: Kousuke Saruta <[email protected]>

Closes apache#5447 from tsudukim/feature/SPARK-6568-2 and squashes the following commits:

3f9a188 [Masayoshi TSUZUKI] modified some errors.
ed46047 [Masayoshi TSUZUKI] avoid scalastyle errors.
1784239 [Masayoshi TSUZUKI] removed Utils.formatPath.
e03f289 [Masayoshi TSUZUKI] removed testWindows from Utils.resolveURI and Utils.resolveURIs. replaced SystemUtils.IS_OS_WINDOWS to Utils.isWindows. removed Utils.formatPath from PythonRunner.scala.
84c33d0 [Masayoshi TSUZUKI] - use resolveURI in nonLocalPaths - run tests for Windows path only on Windows
016128d [Masayoshi TSUZUKI] fixed to use File.toURI()
2c62e3b [Masayoshi TSUZUKI] Merge pull request #1 from sarutak/SPARK-6568-2
7019a8a [Masayoshi TSUZUKI] Merge branch 'master' of https://github.com/apache/spark into feature/SPARK-6568-2
45946ee [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-6568-2
10f1c73 [Kousuke Saruta] Added a comment
93c3c40 [Kousuke Saruta] Merge branch 'classpath-handling-fix' of github.com:sarutak/spark into SPARK-6568-2
649da82 [Kousuke Saruta] Fix classpath handling
c7ba6a7 [Masayoshi TSUZUKI] [SPARK-6568] spark-shell.cmd --jars option does not accept the jar that has space in its path
jkbradley pushed a commit that referenced this pull request Jun 16, 2015
…into a single batch.

SQL
```
select * from tableA join tableB on (a > 3 and b = d) or (a > 3 and b = e)
```
Plan before modify
```
== Optimized Logical Plan ==
Project [a#293,b#294,c#295,d#296,e#297]
 Join Inner, Some(((a#293 > 3) && ((b#294 = d#296) || (b#294 = e#297))))
  MetastoreRelation default, tablea, None
  MetastoreRelation default, tableb, None
```
Plan after modify
```
== Optimized Logical Plan ==
Project [a#293,b#294,c#295,d#296,e#297]
 Join Inner, Some(((b#294 = d#296) || (b#294 = e#297)))
  Filter (a#293 > 3)
   MetastoreRelation default, tablea, None
  MetastoreRelation default, tableb, None
```

CombineLimits ==> Limit(If(LessThan(ne, le), ne, le), grandChild) and LessThan is in BooleanSimplification ,  so CombineLimits  must before BooleanSimplification and BooleanSimplification must before PushPredicateThroughJoin.

Author: Zhongshuai Pei <[email protected]>
Author: DoingDone9 <[email protected]>

Closes apache#6351 from DoingDone9/master and squashes the following commits:

20de7be [Zhongshuai Pei] Update Optimizer.scala
7bc7d28 [Zhongshuai Pei] Merge pull request #17 from apache/master
0ba5f42 [Zhongshuai Pei] Update Optimizer.scala
f8b9314 [Zhongshuai Pei] Update FilterPushdownSuite.scala
c529d9f [Zhongshuai Pei] Update FilterPushdownSuite.scala
ae3af6d [Zhongshuai Pei] Update FilterPushdownSuite.scala
a04ffae [Zhongshuai Pei] Update Optimizer.scala
11beb61 [Zhongshuai Pei] Update FilterPushdownSuite.scala
f2ee5fe [Zhongshuai Pei] Update Optimizer.scala
be6b1d5 [Zhongshuai Pei] Update Optimizer.scala
b01e622 [Zhongshuai Pei] Merge pull request #15 from apache/master
8df716a [Zhongshuai Pei] Update FilterPushdownSuite.scala
d98bc35 [Zhongshuai Pei] Update FilterPushdownSuite.scala
fa65718 [Zhongshuai Pei] Update Optimizer.scala
ab8e9a6 [Zhongshuai Pei] Merge pull request #14 from apache/master
14952e2 [Zhongshuai Pei] Merge pull request #13 from apache/master
f03fe7f [Zhongshuai Pei] Merge pull request #12 from apache/master
f12fa50 [Zhongshuai Pei] Merge pull request #10 from apache/master
f61210c [Zhongshuai Pei] Merge pull request #9 from apache/master
34b1a9a [Zhongshuai Pei] Merge pull request #8 from apache/master
802261c [DoingDone9] Merge pull request #7 from apache/master
d00303b [DoingDone9] Merge pull request #6 from apache/master
98b134f [DoingDone9] Merge pull request #5 from apache/master
161cae3 [DoingDone9] Merge pull request #4 from apache/master
c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master
jkbradley pushed a commit that referenced this pull request Jun 24, 2015
…columns are found

This PR improves the error message shown when conflicting partition column names are detected.  This can be particularly annoying and confusing when there are a large number of partitions while a handful of them happened to contain unexpected temporary file(s).  Now all suspicious directories are listed as below:

```
java.lang.AssertionError: assertion failed: Conflicting partition column names detected:

        Partition column name list #0: b, c, d
        Partition column name list #1: b, c
        Partition column name list #2: b

For partitioned table directories, data files should only live in leaf directories. Please check the following directories for unexpected files:

        file:/tmp/foo/b=0
        file:/tmp/foo/b=1
        file:/tmp/foo/b=1/c=1
        file:/tmp/foo/b=0/c=0
```

Author: Cheng Lian <[email protected]>

Closes apache#6610 from liancheng/part-errmsg and squashes the following commits:

7d05f2c [Cheng Lian] Fixes Scala style issue
a149250 [Cheng Lian] Adds test case for the error message
6b74dd8 [Cheng Lian] Also lists suspicious non-leaf partition directories
a935eb8 [Cheng Lian] Improves error message when conflicting partition columns are found
jkbradley pushed a commit that referenced this pull request Jun 30, 2015
… to ShuffleReader

This commit updates the shuffle read path to enable ShuffleReader implementations more control over the deserialization process.

The BlockStoreShuffleFetcher.fetch() method has been renamed to BlockStoreShuffleFetcher.fetchBlockStreams(). Previously, this method returned a record iterator; now, it returns an iterator of (BlockId, InputStream). Deserialization of records is now handled in the ShuffleReader.read() method.

This change creates a cleaner separation of concerns and allows implementations of ShuffleReader more flexibility in how records are retrieved.

Author: Matt Massie <[email protected]>
Author: Kay Ousterhout <[email protected]>

Closes apache#6423 from massie/shuffle-api-cleanup and squashes the following commits:

8b0632c [Matt Massie] Minor Scala style fixes
d0a1b39 [Matt Massie] Merge pull request #1 from kayousterhout/massie_shuffle-api-cleanup
290f1eb [Kay Ousterhout] Added test for HashShuffleReader.read()
5186da0 [Kay Ousterhout] Revert "Add test to ensure HashShuffleReader is freeing resources"
f98a1b9 [Matt Massie] Add test to ensure HashShuffleReader is freeing resources
a011bfa [Matt Massie] Use PrivateMethodTester on check that delegate stream is closed
4ea1712 [Matt Massie] Small code cleanup for readability
7429a98 [Matt Massie] Update tests to check that BufferReleasingStream is closing delegate InputStream
f458489 [Matt Massie] Remove unnecessary map() on return Iterator
4abb855 [Matt Massie] Consolidate metric code. Make it clear why InterrubtibleIterator is needed.
5c30405 [Matt Massie] Return visibility of BlockStoreShuffleFetcher to private[hash]
7eedd1d [Matt Massie] Small Scala import cleanup
28f8085 [Matt Massie] Small import nit
f93841e [Matt Massie] Update shuffle read metrics in ShuffleReader instead of BlockStoreShuffleFetcher.
7e8e0fe [Matt Massie] Minor Scala style fixes
01e8721 [Matt Massie] Explicitly cast iterator in branches for type clarity
7c8f73e [Matt Massie] Close Block InputStream immediately after all records are read
208b7a5 [Matt Massie] Small code style changes
b70c945 [Matt Massie] Make BlockStoreShuffleFetcher visible to shuffle package
19135f2 [Matt Massie] [SPARK-7884] Allow Spark shuffle APIs to be more customizable
jkbradley pushed a commit that referenced this pull request Jun 30, 2015
Fix for incorrect memory in Spark UI as per SPARK-5768

Author: Joshi <[email protected]>
Author: Rekha Joshi <[email protected]>

Closes apache#6972 from rekhajoshm/SPARK-5768 and squashes the following commits:

b678a91 [Joshi] Fix for incorrect memory in Spark UI
2fe53d9 [Joshi] Fix for incorrect memory in Spark UI
eb823b8 [Joshi] SPARK-5768: Fix for incorrect memory in Spark UI
0be142d [Rekha Joshi] Merge pull request #3 from apache/master
106fd8e [Rekha Joshi] Merge pull request #2 from apache/master
e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
jkbradley pushed a commit that referenced this pull request Jun 30, 2015
… without side effects.

Fix for SparkContext stop behavior - Allow sc.stop() to be called multiple times without side effects.

Author: Joshi <[email protected]>
Author: Rekha Joshi <[email protected]>

Closes apache#6973 from rekhajoshm/SPARK-2645 and squashes the following commits:

277043e [Joshi] Fix for SparkContext stop behavior
446b0a4 [Joshi] Fix for SparkContext stop behavior
2ce5760 [Joshi] Fix for SparkContext stop behavior
c97839a [Joshi] Fix for SparkContext stop behavior
1aff39c [Joshi] Fix for SparkContext stop behavior
12f66b5 [Joshi] Fix for SparkContext stop behavior
72bb484 [Joshi] Fix for SparkContext stop behavior
a5a7d7f [Joshi] Fix for SparkContext stop behavior
9193a0c [Joshi] Fix for SparkContext stop behavior
58dba70 [Joshi] SPARK-2645: Fix for SparkContext stop behavior
380c5b0 [Joshi] SPARK-2645: Fix for SparkContext stop behavior
b566b66 [Joshi] SPARK-2645: Fix for SparkContext stop behavior
0be142d [Rekha Joshi] Merge pull request #3 from apache/master
106fd8e [Rekha Joshi] Merge pull request #2 from apache/master
e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
jkbradley pushed a commit that referenced this pull request Jul 9, 2015
…nfo if needed

Author: Joshi <[email protected]>
Author: Rekha Joshi <[email protected]>

Closes apache#5992 from rekhajoshm/fix/SPARK-7137 and squashes the following commits:

8c42b57 [Joshi] update checkInputColumn to print more info if needed
33ddd2e [Joshi] update checkInputColumn to print more info if needed
acf3e17 [Joshi] update checkInputColumn to print more info if needed
8993c0e [Joshi] SPARK-7137: Add checkInputColumn back to Params and print more info
e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
jkbradley pushed a commit that referenced this pull request Jul 16, 2015
…mands

This will allow problems with piped commands to be detected.
This will also allow tasks to be retried where errors are rare (such as network problems in piped commands).

Author: Scott Taylor <[email protected]>

Closes apache#6262 from megatron-me-uk/patch-2 and squashes the following commits:

04ae1d5 [Scott Taylor] Remove spurious empty line
98fa101 [Scott Taylor] fix blank line style error
574b564 [Scott Taylor] Merge pull request #2 from megatron-me-uk/patch-4
0c1e762 [Scott Taylor] Update rdd pipe method for checkCode
ab9a2e1 [Scott Taylor] Update rdd pipe tests for checkCode
eb4801c [Scott Taylor] fix fail_condition
b0ac3a4 [Scott Taylor] Merge pull request #1 from megatron-me-uk/megatron-me-uk-patch-1
a307d13 [Scott Taylor] update rdd tests to test pipe modes
34fcdc3 [Scott Taylor] add optional argument 'mode' for rdd.pipe
a0c0161 [Scott Taylor] fix generator issue
8a9ef9c [Scott Taylor] make check_return_code an iterator
0486ae3 [Scott Taylor] style fixes
8ed89a6 [Scott Taylor] Chain generators to prevent potential deadlock
4153b02 [Scott Taylor] fix list.sort returns None
491d3fc [Scott Taylor] Pass a function handle to assertRaises
3344a21 [Scott Taylor] wrap assertRaises with QuietTest
3ab8c7a [Scott Taylor] remove whitespace for style
cc1a73d [Scott Taylor] fix style issues in pipe test
8db4073 [Scott Taylor] Add a test for rdd pipe functions
1b3dc4e [Scott Taylor] fix missing space around operator style
0974f98 [Scott Taylor] add space between words in multiline string
45f4977 [Scott Taylor] fix line too long style error
5745d85 [Scott Taylor] Remove space to fix style
f552d49 [Scott Taylor] Catch non-zero exit from pipe commands
jkbradley pushed a commit that referenced this pull request Jul 16, 2015
Improves the performance of LocalPrefixSpan by implementing optimizations proposed in [SPARK-8997](https://issues.apache.org/jira/browse/SPARK-8997)

Author: Feynman Liang <[email protected]>
Author: Feynman Liang <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes apache#7360 from feynmanliang/SPARK-8997-improve-prefixspan and squashes the following commits:

59db2f5 [Feynman Liang] Merge pull request #1 from mengxr/SPARK-8997
91e4357 [Xiangrui Meng] update LocalPrefixSpan impl
9212256 [Feynman Liang] MengXR code review comments
f055d82 [Feynman Liang] Fix failing scalatest
2e00cba [Feynman Liang] Depth first projections
70b93e3 [Feynman Liang] Performance improvements in LocalPrefixSpan, fix tests
jkbradley pushed a commit that referenced this pull request Jul 20, 2015
This makes sure attempts are listed in the order they were executed, and that the
app's state matches the state of the most current attempt.

Author: Joshi <[email protected]>
Author: Rekha Joshi <[email protected]>

Closes apache#7253 from rekhajoshm/SPARK-8593 and squashes the following commits:

874dd80 [Joshi] History Server: updated order for multiple attempts(logcleaner)
716e0b1 [Joshi] History Server: updated order for multiple attempts(descending start time works everytime)
548c753 [Joshi] History Server: updated order for multiple attempts(descending start time works everytime)
83306a8 [Joshi] History Server: updated order for multiple attempts(descending start time)
b0fc922 [Joshi] History Server: updated order for multiple attempts(updated comment)
cc0fda7 [Joshi] History Server: updated order for multiple attempts(updated test)
304cb0b [Joshi] History Server: updated order for multiple attempts(reverted HistoryPage)
85024e8 [Joshi] History Server: updated order for multiple attempts
a41ac4b [Joshi] History Server: updated order for multiple attempts
ab65fa1 [Joshi] History Server: some attempt completed to work with showIncomplete
0be142d [Rekha Joshi] Merge pull request #3 from apache/master
106fd8e [Rekha Joshi] Merge pull request #2 from apache/master
e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
jkbradley pushed a commit that referenced this pull request Jul 20, 2015
Implement IntArrayParam in mllib

Author: Rekha Joshi <[email protected]>
Author: Joshi <[email protected]>

Closes apache#7481 from rekhajoshm/SPARK-9118 and squashes the following commits:

d3b1766 [Joshi] Implement IntArrayParam
0be142d [Rekha Joshi] Merge pull request #3 from apache/master
106fd8e [Rekha Joshi] Merge pull request #2 from apache/master
e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
jkbradley pushed a commit that referenced this pull request Jul 20, 2015
Modifying Vector, DenseVector, and SparseVector to implement argmax functionality. This work is to set the stage for changes to be done in Spark-7423.

Author: George Dittmar <[email protected]>
Author: George <[email protected]>
Author: dittmarg <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes apache#6112 from GeorgeDittmar/SPARK-7422 and squashes the following commits:

3e0a939 [George Dittmar] Merge pull request #1 from mengxr/SPARK-7422
127dec5 [Xiangrui Meng] update argmax impl
2ea6a55 [George Dittmar] Added MimaExcludes for Vectors.argmax
98058f4 [George Dittmar] Merge branch 'master' of github.com:apache/spark into SPARK-7422
5fd9380 [George Dittmar] fixing style check error
42341fb [George Dittmar] refactoring arg max check to better handle zero values
b22af46 [George Dittmar] Fixing spaces between commas in unit test
f2eba2f [George Dittmar] Cleaning up unit tests to be fewer lines
aa330e3 [George Dittmar] Fixing some last if else spacing issues
ac53c55 [George Dittmar] changing dense vector argmax unit test to be one line call vs 2
d5b5423 [George Dittmar] Fixing code style and updating if logic on when to check for zero values
ee1a85a [George Dittmar] Cleaning up unit tests a bit and modifying a few cases
3ee8711 [George Dittmar] Fixing corner case issue with zeros in the active values of the sparse vector. Updated unit tests
b1f059f [George Dittmar] Added comment before we start arg max calculation. Updated unit tests to cover corner cases
f21dcce [George Dittmar] commit
af17981 [dittmarg] Initial work fixing bug that was made clear in pr
eeda560 [George] Fixing SparseVector argmax function to ignore zero values while doing the calculation.
4526acc [George] Merge branch 'master' of github.com:apache/spark into SPARK-7422
df9538a [George] Added argmax to sparse vector and added unit test
3cffed4 [George] Adding unit tests for argmax functions for Dense and Sparse vectors
04677af [George] initial work on adding argmax to Vector and SparseVector
jkbradley pushed a commit that referenced this pull request Jul 30, 2015
…ected databases

Continuation of work by zhangjiajin

Closes apache#7412

Author: zhangjiajin <[email protected]>
Author: Feynman Liang <[email protected]>
Author: zhang jiajin <[email protected]>

Closes apache#7783 from feynmanliang/SPARK-8998-improve-distributed and squashes the following commits:

a61943d [Feynman Liang] Collect small patterns to local
4ddf479 [Feynman Liang] Parallelize freqItemCounts
ad23aa9 [zhang jiajin] Merge pull request #1 from feynmanliang/SPARK-8998-collectBeforeLocal
87fa021 [Feynman Liang] Improve extend prefix readability
c2caa5c [Feynman Liang] Readability improvements and comments
1235cfc [Feynman Liang] Use Iterable[Array[_]] over Array[Array[_]] for database
da0091b [Feynman Liang] Use lists for prefixes to reuse data
cb2a4fc [Feynman Liang] Inline code for readability
01c9ae9 [Feynman Liang] Add getters
6e149fa [Feynman Liang] Fix splitPrefixSuffixPairs
64271b3 [zhangjiajin] Modified codes according to comments.
d2250b7 [zhangjiajin] remove minPatternsBeforeLocalProcessing, add maxSuffixesBeforeLocalProcessing.
b07e20c [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark into CollectEnoughPrefixes
095aa3a [zhangjiajin] Modified the code according to the review comments.
baa2885 [zhangjiajin] Modified the code according to the review comments.
6560c69 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixeSpan
a8fde87 [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark
4dd1c8a [zhangjiajin] initialize file before rebase.
078d410 [zhangjiajin] fix a scala style error.
22b0ef4 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixSpan.
ca9c4c8 [zhangjiajin] Modified the code according to the review comments.
574e56c [zhangjiajin] Add new object LocalPrefixSpan, and do some optimization.
ba5df34 [zhangjiajin] Fix a Scala style error.
4c60fb3 [zhangjiajin] Fix some Scala style errors.
1dd33ad [zhangjiajin] Modified the code according to the review comments.
89bc368 [zhangjiajin] Fixed a Scala style error.
a2eb14c [zhang jiajin] Delete PrefixspanSuite.scala
951fd42 [zhang jiajin] Delete Prefixspan.scala
575995f [zhangjiajin] Modified the code according to the review comments.
91fd7e6 [zhangjiajin] Add new algorithm PrefixSpan and test file.
jkbradley pushed a commit that referenced this pull request Aug 6, 2015
…" into true or false directly

SQL
```
select key from src where 3 in (4, 5);
```
Before
```
== Optimized Logical Plan ==
Project [key#12]
 Filter 3 INSET (5,4)
  MetastoreRelation default, src, None
```

After
```
== Optimized Logical Plan ==
LocalRelation [key#228], []
```

Author: Zhongshuai Pei <[email protected]>
Author: DoingDone9 <[email protected]>

Closes apache#5972 from DoingDone9/InToFalse and squashes the following commits:

4c722a2 [Zhongshuai Pei] Update predicates.scala
abe2bbb [Zhongshuai Pei] Update Optimizer.scala
fa461a5 [Zhongshuai Pei] Update Optimizer.scala
e34c28a [Zhongshuai Pei] Update predicates.scala
24739bd [Zhongshuai Pei] Update ConstantFoldingSuite.scala
f4dbf50 [Zhongshuai Pei] Update ConstantFoldingSuite.scala
35ceb7a [Zhongshuai Pei] Update Optimizer.scala
36c194e [Zhongshuai Pei] Update Optimizer.scala
2e8f6ca [Zhongshuai Pei] Update Optimizer.scala
14952e2 [Zhongshuai Pei] Merge pull request #13 from apache/master
f03fe7f [Zhongshuai Pei] Merge pull request #12 from apache/master
f12fa50 [Zhongshuai Pei] Merge pull request #10 from apache/master
f61210c [Zhongshuai Pei] Merge pull request #9 from apache/master
34b1a9a [Zhongshuai Pei] Merge pull request #8 from apache/master
802261c [DoingDone9] Merge pull request #7 from apache/master
d00303b [DoingDone9] Merge pull request #6 from apache/master
98b134f [DoingDone9] Merge pull request #5 from apache/master
161cae3 [DoingDone9] Merge pull request #4 from apache/master
c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master

(cherry picked from commit 4b5e1fe)
Signed-off-by: Michael Armbrust <[email protected]>
jkbradley pushed a commit that referenced this pull request Aug 6, 2015
…cala and pySpark

Author: Joshi <[email protected]>
Author: Rekha Joshi <[email protected]>

Closes apache#5989 from rekhajoshm/fix/SPARK-7435 and squashes the following commits:

cfc9e02 [Joshi] Spark-7435[R]: updated patch for review comments
62becc1 [Joshi] SPARK-7435: Update to DataFrame
e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master

(cherry picked from commit b94a933)
Signed-off-by: Shivaram Venkataraman <[email protected]>
jkbradley pushed a commit that referenced this pull request Aug 6, 2015
…at has space in its path

escape spaces in the arguments.

Author: Masayoshi TSUZUKI <[email protected]>
Author: Kousuke Saruta <[email protected]>

Closes apache#5447 from tsudukim/feature/SPARK-6568-2 and squashes the following commits:

3f9a188 [Masayoshi TSUZUKI] modified some errors.
ed46047 [Masayoshi TSUZUKI] avoid scalastyle errors.
1784239 [Masayoshi TSUZUKI] removed Utils.formatPath.
e03f289 [Masayoshi TSUZUKI] removed testWindows from Utils.resolveURI and Utils.resolveURIs. replaced SystemUtils.IS_OS_WINDOWS to Utils.isWindows. removed Utils.formatPath from PythonRunner.scala.
84c33d0 [Masayoshi TSUZUKI] - use resolveURI in nonLocalPaths - run tests for Windows path only on Windows
016128d [Masayoshi TSUZUKI] fixed to use File.toURI()
2c62e3b [Masayoshi TSUZUKI] Merge pull request #1 from sarutak/SPARK-6568-2
7019a8a [Masayoshi TSUZUKI] Merge branch 'master' of https://github.com/apache/spark into feature/SPARK-6568-2
45946ee [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-6568-2
10f1c73 [Kousuke Saruta] Added a comment
93c3c40 [Kousuke Saruta] Merge branch 'classpath-handling-fix' of github.com:sarutak/spark into SPARK-6568-2
649da82 [Kousuke Saruta] Fix classpath handling
c7ba6a7 [Masayoshi TSUZUKI] [SPARK-6568] spark-shell.cmd --jars option does not accept the jar that has space in its path

(cherry picked from commit 50c7270)
Signed-off-by: Sean Owen <[email protected]>
jkbradley pushed a commit that referenced this pull request Aug 6, 2015
Fix for incorrect memory in Spark UI as per SPARK-5768

Author: Joshi <[email protected]>
Author: Rekha Joshi <[email protected]>

Closes apache#6972 from rekhajoshm/SPARK-5768 and squashes the following commits:

b678a91 [Joshi] Fix for incorrect memory in Spark UI
2fe53d9 [Joshi] Fix for incorrect memory in Spark UI
eb823b8 [Joshi] SPARK-5768: Fix for incorrect memory in Spark UI
0be142d [Rekha Joshi] Merge pull request #3 from apache/master
106fd8e [Rekha Joshi] Merge pull request #2 from apache/master
e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master

(cherry picked from commit 085a721)
Signed-off-by: Kousuke Saruta <[email protected]>
jkbradley pushed a commit that referenced this pull request Aug 6, 2015
This makes sure attempts are listed in the order they were executed, and that the
app's state matches the state of the most current attempt.

Author: Joshi <[email protected]>
Author: Rekha Joshi <[email protected]>

Closes apache#7253 from rekhajoshm/SPARK-8593 and squashes the following commits:

874dd80 [Joshi] History Server: updated order for multiple attempts(logcleaner)
716e0b1 [Joshi] History Server: updated order for multiple attempts(descending start time works everytime)
548c753 [Joshi] History Server: updated order for multiple attempts(descending start time works everytime)
83306a8 [Joshi] History Server: updated order for multiple attempts(descending start time)
b0fc922 [Joshi] History Server: updated order for multiple attempts(updated comment)
cc0fda7 [Joshi] History Server: updated order for multiple attempts(updated test)
304cb0b [Joshi] History Server: updated order for multiple attempts(reverted HistoryPage)
85024e8 [Joshi] History Server: updated order for multiple attempts
a41ac4b [Joshi] History Server: updated order for multiple attempts
ab65fa1 [Joshi] History Server: some attempt completed to work with showIncomplete
0be142d [Rekha Joshi] Merge pull request #3 from apache/master
106fd8e [Rekha Joshi] Merge pull request #2 from apache/master
e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master

(cherry picked from commit 42d8a01)
Signed-off-by: Sean Owen <[email protected]>
jkbradley pushed a commit that referenced this pull request Aug 11, 2015
This PR is based on apache#4229, thanks prabeesh.

Closes apache#4229

Author: Prabeesh K <[email protected]>
Author: zsxwing <[email protected]>
Author: prabs <[email protected]>
Author: Prabeesh K <[email protected]>

Closes apache#7833 from zsxwing/pr4229 and squashes the following commits:

9570bec [zsxwing] Fix the variable name and check null in finally
4a9c79e [zsxwing] Fix pom.xml indentation
abf5f18 [zsxwing] Merge branch 'master' into pr4229
935615c [zsxwing] Fix the flaky MQTT tests
47278c5 [zsxwing] Include the project class files
478f844 [zsxwing] Add unpack
5f8a1d4 [zsxwing] Make the maven build generate the test jar for Python MQTT tests
734db99 [zsxwing] Merge branch 'master' into pr4229
126608a [Prabeesh K] address the comments
b90b709 [Prabeesh K] Merge pull request #1 from zsxwing/pr4229
d07f454 [zsxwing] Register StreamingListerner before starting StreamingContext; Revert unncessary changes; fix the python unit test
a6747cb [Prabeesh K] wait for starting the receiver before publishing data
87fc677 [Prabeesh K] address the comments:
97244ec [zsxwing] Make sbt build the assembly test jar for streaming mqtt
80474d1 [Prabeesh K] fix
1f0cfe9 [Prabeesh K] python style fix
e1ee016 [Prabeesh K] scala style fix
a5a8f9f [Prabeesh K] added Python test
9767d82 [Prabeesh K] implemented Python-friendly class
a11968b [Prabeesh K] fixed python style
795ec27 [Prabeesh K] address comments
ee387ae [Prabeesh K] Fix assembly jar location of mqtt-assembly
3f4df12 [Prabeesh K] updated version
b34c3c1 [prabs] adress comments
3aa7fff [prabs] Added Python streaming mqtt word count example
b7d42ff [prabs] Mqtt streaming support in Python
jkbradley pushed a commit that referenced this pull request Oct 12, 2015
This PR is based on apache#4229, thanks prabeesh.

Closes apache#4229

Author: Prabeesh K <[email protected]>
Author: zsxwing <[email protected]>
Author: prabs <[email protected]>
Author: Prabeesh K <[email protected]>

Closes apache#7833 from zsxwing/pr4229 and squashes the following commits:

9570bec [zsxwing] Fix the variable name and check null in finally
4a9c79e [zsxwing] Fix pom.xml indentation
abf5f18 [zsxwing] Merge branch 'master' into pr4229
935615c [zsxwing] Fix the flaky MQTT tests
47278c5 [zsxwing] Include the project class files
478f844 [zsxwing] Add unpack
5f8a1d4 [zsxwing] Make the maven build generate the test jar for Python MQTT tests
734db99 [zsxwing] Merge branch 'master' into pr4229
126608a [Prabeesh K] address the comments
b90b709 [Prabeesh K] Merge pull request #1 from zsxwing/pr4229
d07f454 [zsxwing] Register StreamingListerner before starting StreamingContext; Revert unncessary changes; fix the python unit test
a6747cb [Prabeesh K] wait for starting the receiver before publishing data
87fc677 [Prabeesh K] address the comments:
97244ec [zsxwing] Make sbt build the assembly test jar for streaming mqtt
80474d1 [Prabeesh K] fix
1f0cfe9 [Prabeesh K] python style fix
e1ee016 [Prabeesh K] scala style fix
a5a8f9f [Prabeesh K] added Python test
9767d82 [Prabeesh K] implemented Python-friendly class
a11968b [Prabeesh K] fixed python style
795ec27 [Prabeesh K] address comments
ee387ae [Prabeesh K] Fix assembly jar location of mqtt-assembly
3f4df12 [Prabeesh K] updated version
b34c3c1 [prabs] adress comments
3aa7fff [prabs] Added Python streaming mqtt word count example
b7d42ff [prabs] Mqtt streaming support in Python

(cherry picked from commit 853809e)
Signed-off-by: Tathagata Das <[email protected]>
jkbradley pushed a commit that referenced this pull request Dec 12, 2015
jkbradley pushed a commit that referenced this pull request Sep 6, 2016
## What changes were proposed in this pull request?
This patch introduces SQLQueryTestSuite, a basic framework for end-to-end SQL test cases defined in spark/sql/core/src/test/resources/sql-tests. This is a more standard way to test SQL queries end-to-end in different open source database systems, because it is more manageable to work with files.

This is inspired by HiveCompatibilitySuite, but simplified for general Spark SQL tests. Once this is merged, I can work towards porting SQLQuerySuite over, and eventually also move the existing HiveCompatibilitySuite to use this framework.

Unlike HiveCompatibilitySuite, SQLQueryTestSuite compares both the output schema and the output data (in string form).

When there is a mismatch, the error message looks like the following:

```
[info] - blacklist.sql !!! IGNORED !!!
[info] - number-format.sql *** FAILED *** (2 seconds, 405 milliseconds)
[info]   Expected "...147483648	-214748364[8]", but got "...147483648	-214748364[9]" Result should match for query #1 (SQLQueryTestSuite.scala:171)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at org.scalatest.Assertions$class.assertResult(Assertions.scala:1171)
```

## How was this patch tested?
This is a test infrastructure change.

Author: petermaxlee <[email protected]>

Closes apache#14472 from petermaxlee/SPARK-16866.
jkbradley pushed a commit that referenced this pull request Sep 22, 2016
## What changes were proposed in this pull request?
This patch introduces SQLQueryTestSuite, a basic framework for end-to-end SQL test cases defined in spark/sql/core/src/test/resources/sql-tests. This is a more standard way to test SQL queries end-to-end in different open source database systems, because it is more manageable to work with files.

This is inspired by HiveCompatibilitySuite, but simplified for general Spark SQL tests. Once this is merged, I can work towards porting SQLQuerySuite over, and eventually also move the existing HiveCompatibilitySuite to use this framework.

Unlike HiveCompatibilitySuite, SQLQueryTestSuite compares both the output schema and the output data (in string form).

When there is a mismatch, the error message looks like the following:

```
[info] - blacklist.sql !!! IGNORED !!!
[info] - number-format.sql *** FAILED *** (2 seconds, 405 milliseconds)
[info]   Expected "...147483648	-214748364[8]", but got "...147483648	-214748364[9]" Result should match for query #1 (SQLQueryTestSuite.scala:171)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at org.scalatest.Assertions$class.assertResult(Assertions.scala:1171)
```

## How was this patch tested?
This is a test infrastructure change.

Author: petermaxlee <[email protected]>

Closes apache#14472 from petermaxlee/SPARK-16866.

(cherry picked from commit b9f8a11)
Signed-off-by: Wenchen Fan <[email protected]>
jkbradley pushed a commit that referenced this pull request Feb 13, 2018
## What changes were proposed in this pull request?

There were two related fixes regarding `from_json`, `get_json_object` and `json_tuple` ([Fix #1](apache@c8803c0),
 [Fix #2](apache@86174ea)), but they weren't comprehensive it seems. I wanted to extend those fixes to all the parsers, and add tests for each case.

## How was this patch tested?

Regression tests

Author: Burak Yavuz <[email protected]>

Closes apache#20302 from brkyvz/json-invfix.
jkbradley pushed a commit that referenced this pull request Mar 1, 2018
## What changes were proposed in this pull request?

Solved two bugs to enable stream-stream self joins.

### Incorrect analysis due to missing MultiInstanceRelation trait
Streaming leaf nodes did not extend MultiInstanceRelation, which is necessary for the catalyst analyzer to convert the self-join logical plan DAG into a tree (by creating new instances of the leaf relations). This was causing the error `Failure when resolving conflicting references in Join:` (see JIRA for details).

### Incorrect attribute rewrite when splicing batch plans in MicroBatchExecution
When splicing the source's batch plan into the streaming plan (by replacing the StreamingExecutionPlan), we were rewriting the attribute reference in the streaming plan with the new attribute references from the batch plan. This was incorrectly handling the scenario when multiple StreamingExecutionRelation point to the same source, and therefore eventually point to the same batch plan returned by the source. Here is an example query, and its corresponding plan transformations.
```
val df = input.toDF
val join =
      df.select('value % 5 as "key", 'value).join(
        df.select('value % 5 as "key", 'value), "key")
```
Streaming logical plan before splicing the batch plan
```
Project [key#6, value#1, value#12]
+- Join Inner, (key#6 = key#9)
   :- Project [(value#1 % 5) AS key#6, value#1]
   :  +- StreamingExecutionRelation Memory[#1], value#1
   +- Project [(value#12 % 5) AS key#9, value#12]
      +- StreamingExecutionRelation Memory[#1], value#12  // two different leaves pointing to same source
```
Batch logical plan after splicing the batch plan and before rewriting
```
Project [key#6, value#1, value#12]
+- Join Inner, (key#6 = key#9)
   :- Project [(value#1 % 5) AS key#6, value#1]
   :  +- LocalRelation [value#66]           // replaces StreamingExecutionRelation Memory[#1], value#1
   +- Project [(value#12 % 5) AS key#9, value#12]
      +- LocalRelation [value#66]           // replaces StreamingExecutionRelation Memory[#1], value#12
```
Batch logical plan after rewriting the attributes. Specifically, for spliced, the new output attributes (value#66) replace the earlier output attributes (value#12, and value#1, one for each StreamingExecutionRelation).
```
Project [key#6, value#66, value#66]       // both value#1 and value#12 replaces by value#66
+- Join Inner, (key#6 = key#9)
   :- Project [(value#66 % 5) AS key#6, value#66]
   :  +- LocalRelation [value#66]
   +- Project [(value#66 % 5) AS key#9, value#66]
      +- LocalRelation [value#66]
```
This causes the optimizer to eliminate value#66 from one side of the join.
```
Project [key#6, value#66, value#66]
+- Join Inner, (key#6 = key#9)
   :- Project [(value#66 % 5) AS key#6, value#66]
   :  +- LocalRelation [value#66]
   +- Project [(value#66 % 5) AS key#9]   // this does not generate value, incorrect join results
      +- LocalRelation [value#66]
```

**Solution**: Instead of rewriting attributes, use a Project to introduce aliases between the output attribute references and the new reference generated by the spliced plans. The analyzer and optimizer will take care of the rest.
```
Project [key#6, value#1, value#12]
+- Join Inner, (key#6 = key#9)
   :- Project [(value#1 % 5) AS key#6, value#1]
   :  +- Project [value#66 AS value#1]   // solution: project with aliases
   :     +- LocalRelation [value#66]
   +- Project [(value#12 % 5) AS key#9, value#12]
      +- Project [value#66 AS value#12]    // solution: project with aliases
         +- LocalRelation [value#66]
```

## How was this patch tested?
New unit test

Author: Tathagata Das <[email protected]>

Closes apache#20598 from tdas/SPARK-23406.
jkbradley pushed a commit that referenced this pull request Apr 11, 2018
## What changes were proposed in this pull request?

There were two related fixes regarding `from_json`, `get_json_object` and `json_tuple` ([Fix #1](apache@c8803c0),
 [Fix #2](apache@86174ea)), but they weren't comprehensive it seems. I wanted to extend those fixes to all the parsers, and add tests for each case.

## How was this patch tested?

Regression tests

Author: Burak Yavuz <[email protected]>

Closes apache#20302 from brkyvz/json-invfix.

(cherry picked from commit e01919e)
Signed-off-by: hyukjinkwon <[email protected]>
jkbradley pushed a commit that referenced this pull request Apr 11, 2018
This is a backport of apache#20598.

## What changes were proposed in this pull request?

Solved two bugs to enable stream-stream self joins.

### Incorrect analysis due to missing MultiInstanceRelation trait
Streaming leaf nodes did not extend MultiInstanceRelation, which is necessary for the catalyst analyzer to convert the self-join logical plan DAG into a tree (by creating new instances of the leaf relations). This was causing the error `Failure when resolving conflicting references in Join:` (see JIRA for details).

### Incorrect attribute rewrite when splicing batch plans in MicroBatchExecution
When splicing the source's batch plan into the streaming plan (by replacing the StreamingExecutionPlan), we were rewriting the attribute reference in the streaming plan with the new attribute references from the batch plan. This was incorrectly handling the scenario when multiple StreamingExecutionRelation point to the same source, and therefore eventually point to the same batch plan returned by the source. Here is an example query, and its corresponding plan transformations.
```
val df = input.toDF
val join =
      df.select('value % 5 as "key", 'value).join(
        df.select('value % 5 as "key", 'value), "key")
```
Streaming logical plan before splicing the batch plan
```
Project [key#6, value#1, value#12]
+- Join Inner, (key#6 = key#9)
   :- Project [(value#1 % 5) AS key#6, value#1]
   :  +- StreamingExecutionRelation Memory[#1], value#1
   +- Project [(value#12 % 5) AS key#9, value#12]
      +- StreamingExecutionRelation Memory[#1], value#12  // two different leaves pointing to same source
```
Batch logical plan after splicing the batch plan and before rewriting
```
Project [key#6, value#1, value#12]
+- Join Inner, (key#6 = key#9)
   :- Project [(value#1 % 5) AS key#6, value#1]
   :  +- LocalRelation [value#66]           // replaces StreamingExecutionRelation Memory[#1], value#1
   +- Project [(value#12 % 5) AS key#9, value#12]
      +- LocalRelation [value#66]           // replaces StreamingExecutionRelation Memory[#1], value#12
```
Batch logical plan after rewriting the attributes. Specifically, for spliced, the new output attributes (value#66) replace the earlier output attributes (value#12, and value#1, one for each StreamingExecutionRelation).
```
Project [key#6, value#66, value#66]       // both value#1 and value#12 replaces by value#66
+- Join Inner, (key#6 = key#9)
   :- Project [(value#66 % 5) AS key#6, value#66]
   :  +- LocalRelation [value#66]
   +- Project [(value#66 % 5) AS key#9, value#66]
      +- LocalRelation [value#66]
```
This causes the optimizer to eliminate value#66 from one side of the join.
```
Project [key#6, value#66, value#66]
+- Join Inner, (key#6 = key#9)
   :- Project [(value#66 % 5) AS key#6, value#66]
   :  +- LocalRelation [value#66]
   +- Project [(value#66 % 5) AS key#9]   // this does not generate value, incorrect join results
      +- LocalRelation [value#66]
```

**Solution**: Instead of rewriting attributes, use a Project to introduce aliases between the output attribute references and the new reference generated by the spliced plans. The analyzer and optimizer will take care of the rest.
```
Project [key#6, value#1, value#12]
+- Join Inner, (key#6 = key#9)
   :- Project [(value#1 % 5) AS key#6, value#1]
   :  +- Project [value#66 AS value#1]   // solution: project with aliases
   :     +- LocalRelation [value#66]
   +- Project [(value#12 % 5) AS key#9, value#12]
      +- Project [value#66 AS value#12]    // solution: project with aliases
         +- LocalRelation [value#66]
```

## How was this patch tested?
New unit test

Author: Tathagata Das <[email protected]>

Closes apache#20765 from tdas/SPARK-23406-2.3.
jkbradley pushed a commit that referenced this pull request Oct 24, 2019
…comparison assertions

## What changes were proposed in this pull request?

This PR removes a few hardware-dependent assertions which can cause a failure in `aarch64`.

**x86_64**
```
rootdonotdel-openlab-allinone-l00242678:/home/ubuntu# uname -a
Linux donotdel-openlab-allinone-l00242678 4.4.0-154-generic apache#181-Ubuntu SMP Tue Jun 25 05:29:03 UTC
2019 x86_64 x86_64 x86_64 GNU/Linux

scala> import java.lang.Float.floatToRawIntBits
import java.lang.Float.floatToRawIntBits
scala> floatToRawIntBits(0.0f/0.0f)
res0: Int = -4194304
scala> floatToRawIntBits(Float.NaN)
res1: Int = 2143289344
```

**aarch64**
```
[rootarm-huangtianhua spark]# uname -a
Linux arm-huangtianhua 4.14.0-49.el7a.aarch64 #1 SMP Tue Apr 10 17:22:26 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux

scala> import java.lang.Float.floatToRawIntBits
import java.lang.Float.floatToRawIntBits
scala> floatToRawIntBits(0.0f/0.0f)
res1: Int = 2143289344
scala> floatToRawIntBits(Float.NaN)
res2: Int = 2143289344
```

## How was this patch tested?

Pass the Jenkins (This removes the test coverage).

Closes apache#25186 from huangtianhua/special-test-case-for-aarch64.

Authored-by: huangtianhua <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
jkbradley pushed a commit that referenced this pull request Jan 2, 2020
### What changes were proposed in this pull request?
`org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite` failed lately. After had a look at the logs it just shows the following fact without any details:
```
Caused by: sbt.ForkMain$ForkError: sun.security.krb5.KrbException: Server not found in Kerberos database (7) - Server not found in Kerberos database
```
Since the issue is intermittent and not able to reproduce it we should add more debug information and wait for reproduction with the extended logs.

### Why are the changes needed?
Failing test doesn't give enough debug information.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
I've started the test manually and checked that such additional debug messages show up:
```
>>> KrbApReq: APOptions are 00000000 00000000 00000000 00000000
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Looking for keys for: kafka/localhostEXAMPLE.COM
Added key: 17version: 0
Added key: 23version: 0
Added key: 16version: 0
Found unsupported keytype (3) for kafka/localhostEXAMPLE.COM
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Using builtin default etypes for permitted_enctypes
default etypes for permitted_enctypes: 17 16 23.
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
MemoryCache: add 1571936500/174770/16C565221B70AAB2BEFE31A83D13A2F4/client/localhostEXAMPLE.COM to client/localhostEXAMPLE.COM|kafka/localhostEXAMPLE.COM
MemoryCache: Existing AuthList:
#3: 1571936493/200803/8CD70D280B0862C5DA1FF901ECAD39FE/client/localhostEXAMPLE.COM
#2: 1571936499/985009/BAD33290D079DD4E3579A8686EC326B7/client/localhostEXAMPLE.COM
#1: 1571936499/995208/B76B9D78A9BE283AC78340157107FD40/client/localhostEXAMPLE.COM
```

Closes apache#26252 from gaborgsomogyi/SPARK-29580.

Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants