Branch 1.2 #10485
Closed
Conversation
spark.locality.wait was set to 100000 in examples/graphx/Analytics.scala; this should be left to the user. Author: Ernest <[email protected]> Closes #3730 from Earne/SPARK-4880 and squashes the following commits: d79ed04 [Ernest] remove spark.locality.wait in Analytics (cherry picked from commit a7ed6f3) Signed-off-by: Reynold Xin <[email protected]>
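For reference, a minimal sketch of what "left to the user" means in practice after this change: anyone who wants the old behavior can still set the property explicitly (the value is in milliseconds).

```scala
import org.apache.spark.SparkConf

// Opt in to the old behavior explicitly instead of having the example force it.
val conf = new SparkConf().set("spark.locality.wait", "100000")
```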
Rewording was based on this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-data-flow-td9804.html This is the associated JIRA ticket: https://issues.apache.org/jira/browse/SPARK-4884 Author: Madhu Siddalingaiah <[email protected]> Closes #3722 from msiddalingaiah/master and squashes the following commits: 79e679f [Madhu Siddalingaiah] [DOC]: improve documentation 51d14b9 [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master' 38faca4 [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master' cbccbfe [Madhu Siddalingaiah] Documentation: replace <b> with <code> (again) 332f7a2 [Madhu Siddalingaiah] Documentation: replace <b> with <code> cd2b05a [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master' 0fc12d7 [Madhu Siddalingaiah] Documentation: add description for repartitionAndSortWithinPartitions (cherry picked from commit d5a596d) Signed-off-by: Josh Rosen <[email protected]>
NettyBlockTransferService should use the spark.blockManager.port config. This is used in NioBlockTransferService here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/network/nio/NioBlockTransferService.scala#L66 Author: Aaron Davidson <[email protected]> Closes #3688 from aarondav/SPARK-4837 and squashes the following commits: ebd2007 [Aaron Davidson] [SPARK-4837] NettyBlockTransferService should use spark.blockManager.port config (cherry picked from commit 105293a) Signed-off-by: Josh Rosen <[email protected]>
This is so that the `ExecutorAllocationManager` does not take the `SparkContext`, with all of its dependencies, as an argument. This prevents future developers from tying this class down further to the `SparkContext`, which has really become quite a monstrous object. cc'ing pwendell, who originally suggested this, and JoshRosen, who may have thoughts about the trait mix-in style of `SparkContext`. Author: Andrew Or <[email protected]> Closes #3614 from andrewor14/dynamic-allocation-sc and squashes the following commits: 187070d [Andrew Or] Merge branch 'master' of github.com:apache/spark into dynamic-allocation-sc 59baf6c [Andrew Or] Merge branch 'master' of github.com:apache/spark into dynamic-allocation-sc 347a348 [Andrew Or] Refactor SparkContext into ExecutorAllocationClient (cherry picked from commit 9804a75) Signed-off-by: Andrew Or <[email protected]> Conflicts: core/src/main/scala/org/apache/spark/SparkContext.scala
Author: Sandy Ryza <[email protected]> Closes #3684 from sryza/sandy-spark-3428 and squashes the following commits: cb827fe [Sandy Ryza] SPARK-3428. TaskMetrics for running tasks is missing GC time metrics (cherry picked from commit 283263f) Signed-off-by: Josh Rosen <[email protected]>
Author: Ryan Williams <[email protected]> Closes #3736 from ryan-williams/hist and squashes the following commits: 421d8ff [Ryan Williams] add another random typo fix 76d6a4c [Ryan Williams] remove hdfs example a2d0f82 [Ryan Williams] code review feedback 9ca7629 [Ryan Williams] [SPARK-4889] update history server example cmds (cherry picked from commit cdb2c64) Signed-off-by: Andrew Or <[email protected]>
Author: Ryan Williams <[email protected]> Closes #2848 from ryan-williams/fetch-file and squashes the following commits: c14daff [Ryan Williams] Fix copy that was changed to a move inadvertently 8e39c16 [Ryan Williams] code review feedback 788ed41 [Ryan Williams] don’t redundantly overwrite executor JAR deps (cherry picked from commit 7981f96) Signed-off-by: Josh Rosen <[email protected]>
The signature of registerKryoClasses is actually Array[Class[_]], not Seq. Author: Eran Medan <[email protected]> Closes #3747 from eranation/patch-1 and squashes the following commits: ee9885d [Eran Medan] change signature of example to match released code
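A minimal sketch of the corrected call shape, using hypothetical placeholder classes:

```scala
import org.apache.spark.SparkConf

// MyClass1 and MyClass2 are hypothetical placeholders for user classes.
class MyClass1
class MyClass2

val conf = new SparkConf()
  .setAppName("kryo-example")
  // Takes an Array[Class[_]], not a Seq:
  .registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
```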
…file. Since we can set spark executor memory and executor cores using a properties file, we must also be allowed to set the executor instances. Author: Kanwaljit Singh <[email protected]> Closes #1657 from kjsingh/branch-1.0 and squashes the following commits: d8a5a12 [Kanwaljit Singh] SPARK-2641: Fixing how spark arguments are loaded from properties file for num executors Conflicts: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala
Once the external shuffle service is also documented, the dynamic allocation section will link to it. Let me know if the whole dynamic allocation section should be moved to its own page; I personally think the organization might be cleaner that way. This patch builds on top of oza's work in #3689. aarondav pwendell Author: Andrew Or <[email protected]> Author: Tsuyoshi Ozawa <[email protected]> Closes #3731 from andrewor14/document-dynamic-allocation and squashes the following commits: 1281447 [Andrew Or] Address a few comments b9843f2 [Andrew Or] Document the configs as well 246fb44 [Andrew Or] Merge branch 'SPARK-4839' of github.com:oza/spark into document-dynamic-allocation 8c64004 [Andrew Or] Add documentation for dynamic allocation (without configs) 6827b56 [Tsuyoshi Ozawa] Fixing a documentation of spark.dynamicAllocation.enabled. 53cff58 [Tsuyoshi Ozawa] Adding a documentation about dynamic resource allocation. (cherry picked from commit 15c03e1) Signed-off-by: Andrew Or <[email protected]>
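As a hedged sketch of the documented setup (the config keys are the ones the documentation describes; the min/max bounds are arbitrary example values):

```scala
import org.apache.spark.SparkConf

// Dynamic allocation requires the external shuffle service so that
// executors can be removed without losing their shuffle files.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")   // example bound
  .set("spark.dynamicAllocation.maxExecutors", "20")  // example bound
```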
Mvn build failed: value defaultProperties not found. Maybe related to this PR: 1d64812. andrewor14, can you look at this problem? Author: huangzhaowei <[email protected]> Closes #3749 from SaintBacchus/Mvn-Build-Fail and squashes the following commits: 8e2917c [huangzhaowei] Build Failed: value defaultProperties not found (cherry picked from commit a764960) Signed-off-by: Josh Rosen <[email protected]>
…oop 1.+ and Hadoop 2.+. `NullWritable` is a `Comparable` rather than a `Comparable[NullWritable]` in Hadoop 1.+, so the compiler cannot find an implicit Ordering for it. It will generate different anonymous classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+. Therefore, here we provide an Ordering for NullWritable so that the compiler will generate the same code. I used the following commands to confirm the generated byte codes are the same.
```
mvn -Dhadoop.version=1.2.1 -DskipTests clean package -pl core -am
javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop1.txt
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package -pl core -am
javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop2.txt
diff ~/hadoop1.txt ~/hadoop2.txt
```
However, the compiler will generate different code for classes which call methods of `JobContext`/`TaskAttemptContext`: `JobContext`/`TaskAttemptContext` is a class in Hadoop 1.+, and calling its methods uses `invokevirtual`, while it's an interface in Hadoop 2.+ and uses `invokeinterface`. To fix it, we can use reflection to call `JobContext/TaskAttemptContext.getConfiguration`. Author: zsxwing <[email protected]> Closes #3740 from zsxwing/SPARK-2075 and squashes the following commits: 39d9df2 [zsxwing] Fix the code style e4ad8b5 [zsxwing] Use null for the implicit Ordering 734bac9 [zsxwing] Explicitly set the implicit parameters ca03559 [zsxwing] Use reflection to access JobContext/TaskAttemptContext.getConfiguration fa40db0 [zsxwing] Add an Ordering for NullWritable to make the compiler generate same byte codes for RDD (cherry picked from commit 6ee6aa7) Signed-off-by: Reynold Xin <[email protected]>
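A sketch of the Ordering idea (note the squashed commits show the final patch ended up passing null for the implicit parameter; this illustrates the explicit-Ordering approach described above):

```scala
import org.apache.hadoop.io.NullWritable

// All NullWritables are equal, so a constant comparison is safe. Having one
// fixed Ordering in scope lets the compiler resolve the same implicit under
// both Hadoop 1.+ and Hadoop 2.+, producing identical anonymous classes.
implicit val nullWritableOrdering: Ordering[NullWritable] =
  new Ordering[NullWritable] {
    override def compare(x: NullWritable, y: NullWritable): Int = 0
  }
```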
backport #3740 for branch-1.2 Author: zsxwing <[email protected]> Closes #3758 from zsxwing/SPARK-2075-branch-1.2 and squashes the following commits: b57d440 [zsxwing] SPARK-2075 backport for branch-1.2
Fix classname to be specified for external shuffle service. Author: Tsuyoshi Ozawa <[email protected]> Closes #3757 from oza/SPARK-4915 and squashes the following commits: 3b0d6d6 [Tsuyoshi Ozawa] Fix classname to be specified for external shuffle service. (cherry picked from commit 96606f6) Signed-off-by: Andrew Or <[email protected]>
Author: zsxwing <[email protected]> Closes #3734 from zsxwing/SPARK-4883 and squashes the following commits: e6f2b61 [zsxwing] Fix the name cc74727 [zsxwing] Add a name to the directoryCleaner thread (cherry picked from commit 8773705) Signed-off-by: Andrew Or <[email protected]>
Use
```scala
val arr1 = (0 until num).toArray
```
instead of
```scala
val arr1 = new Array[Int](num)
for (i <- 0 until arr1.length) {
  arr1(i) = i
}
```
for brevity. Author: carlmartin <[email protected]> Closes #3750 from SaintBacchus/BroadcastTest and squashes the following commits: 43adb70 [carlmartin] Improve some code in BroadcastTest for short
Author: Aaron Davidson <[email protected]> Closes #3713 from aarondav/netty-configs and squashes the following commits: 8a8b373 [Aaron Davidson] Address Patrick's comments 3b1f84e [Aaron Davidson] [SPARK-4864] Add documentation to Netty-based configs (cherry picked from commit fbca6b6) Signed-off-by: Patrick Wendell <[email protected]>
It is not convenient to see the Spark version in the UI. We can keep the same style as the Spark website. Author: genmao.ygm <[email protected]> Closes #3763 from uncleGen/master-clean-141222 and squashes the following commits: 0dcb9a9 [genmao.ygm] [SPARK-4920][UI]:current spark version in UI is not striking. (cherry picked from commit de9d7d2) Signed-off-by: Andrew Or <[email protected]>
In Scala, `map` and `flatMap` of `Iterable` will copy the contents of the `Iterable` to a new `Seq`. For example:
```scala
val iterable = Seq(1, 2, 3).map(v => {
  println(v)
  v
})
println("Iterable map done")

val iterator = Seq(1, 2, 3).iterator.map(v => {
  println(v)
  v
})
println("Iterator map done")
```
outputs
```
1
2
3
Iterable map done
Iterator map done
```
So we should use `iterator` to reduce the memory consumed by join. Found by Johannes Simon in http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3C5BE70814-9D03-4F61-AE2C-0D63F2DE4446%40mail.de%3E Author: zsxwing <[email protected]> Closes #3671 from zsxwing/SPARK-4824 and squashes the following commits: 48ee7b9 [zsxwing] Remove the explicit types 95d59d6 [zsxwing] Add 'iterator' to reduce memory consumed by join (cherry picked from commit c233ab3) Signed-off-by: Josh Rosen <[email protected]>
Author: Nicholas Chammas <[email protected]> Closes #3772 from nchammas/patch-1 and squashes the following commits: b7d9083 [Nicholas Chammas] [Docs] Minor typo fixes (cherry picked from commit 0e532cc) Signed-off-by: Patrick Wendell <[email protected]>
Currently, the formatting of the log4j section in running-on-yarn.md is a bit messy. Author: zsxwing <[email protected]> Closes #3774 from zsxwing/SPARK-4931 and squashes the following commits: 4a5f853 [zsxwing] Fix the format of running-on-yarn.md (cherry picked from commit 2d215ae) Signed-off-by: Josh Rosen <[email protected]>
Commit 7aacb7b added support for sharing downloaded files among multiple executors of the same app. That works great in Yarn, since the app's directory is cleaned up after the app is done. But Spark standalone mode didn't do that, so the lock/cache files created by that change were left around and could eventually fill up the disk hosting /tmp. To solve that, create app-specific directories under the local dirs when launching executors. Multiple executors launched by the same Worker will use the same app directories, so they should be able to share the downloaded files. When the application finishes, a new message is sent to all workers telling them the application has finished; once that message has been received, and all executors registered for the application shut down, then those directories will be cleaned up by the Worker. Note: Unit testing this is hard (if even possible), since local-cluster mode doesn't seem to leave the Master/Worker daemons running long enough after `sc.stop()` is called for the clean up protocol to take effect. Author: Marcelo Vanzin <[email protected]> Closes #3705 from vanzin/SPARK-4834 and squashes the following commits: b430534 [Marcelo Vanzin] Remove seemingly unnecessary synchronization. 50eb4b9 [Marcelo Vanzin] Review feedback. c0e5ea5 [Marcelo Vanzin] [SPARK-4834] [standalone] Clean up application files after app finishes. (cherry picked from commit dd15536) Signed-off-by: Josh Rosen <[email protected]>
Trivial modifications for usability. Author: Takeshi Yamamuro <[email protected]> Closes #3775 from maropu/AddHelpCommentInAnalytics and squashes the following commits: fbea8f5 [Takeshi Yamamuro] Add help comments in Analytics (cherry picked from commit 9c251c5) Signed-off-by: Josh Rosen <[email protected]>
This PR tries to fix the Hive tests failure encountered in PR #3157 by cleaning `lib_managed` before building the assembly jar against Hive 0.13.1 in `dev/run-tests`. Otherwise two sets of datanucleus jars would be left in `lib_managed` and may mess up class paths while executing Hive test suites. Please refer to [this thread] [1] for details. A clean build would be even safer, but we only clean `lib_managed` here to save build time. This PR also takes the chance to clean up some minor typos and formatting issues in the comments. [1]: #3157 (comment) Author: Cheng Lian <[email protected]> Closes #3756 from liancheng/clean-lib-managed and squashes the following commits: e2bd21d [Cheng Lian] Adds lib_managed to clean set c9f2f3e [Cheng Lian] Cleans lib_managed before compiling with Hive 0.13.1 (cherry picked from commit 395b771) Signed-off-by: Josh Rosen <[email protected]>
See https://issues.apache.org/jira/browse/SPARK-4730. Author: Andrew Or <[email protected]> Closes #3590 from andrewor14/yarn-settings and squashes the following commits: 36e0753 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-settings dcd1316 [Andrew Or] Warn against deprecated YARN settings (cherry picked from commit 27c5399) Signed-off-by: Josh Rosen <[email protected]>
Remove receiverInfo once the receiver is de-registered. Once the streaming receiver is de-registered at the executor, the `ReceiverTrackerActor` needs to remove the corresponding receiverInfo entry from the `receiverInfo` map at `ReceiverTracker`. Author: Ilayaperumal Gopinathan <[email protected]> Closes #3647 from ilayaperumalg/receiverInfo-RTracker and squashes the following commits: 6eb97d5 [Ilayaperumal Gopinathan] Polishing based on the review 3640c86 [Ilayaperumal Gopinathan] Remove receiverInfo once receiver is de-registered (cherry picked from commit 10d69e9) Signed-off-by: Tathagata Das <[email protected]>
Do not replicate streaming block when WAL is enabled. Currently a streaming block will be replicated when a specific storage level is set; since the WAL is already fault tolerant, replication is needless and will hurt the throughput of the streaming application. Hi tdas, as per our discussion about this issue, I fixed it with this implementation. I'm not sure if this is the way you want it; would you mind taking a look at it? Thanks a lot. Author: jerryshao <[email protected]> Closes #3534 from jerryshao/SPARK-4671 and squashes the following commits: 500b456 [jerryshao] Do not replicate streaming block when WAL is enabled (cherry picked from commit 3f5f4cc) Signed-off-by: Tathagata Das <[email protected]>
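A minimal sketch of the idea (the helper name is hypothetical, not the patch's actual code): when the WAL already persists received data, the in-memory replica adds no fault tolerance, so the effective storage level drops replication to 1.

```scala
import org.apache.spark.storage.StorageLevel

// Downgrade replication when a write-ahead log is enabled.
def effectiveStorageLevel(requested: StorageLevel, walEnabled: Boolean): StorageLevel =
  if (walEnabled && requested.replication > 1)
    StorageLevel(requested.useDisk, requested.useMemory, requested.deserialized, 1)
  else requested
```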
Author: Marcelo Vanzin <[email protected]> Closes #3460 from vanzin/SPARK-4606 and squashes the following commits: 031207d [Marcelo Vanzin] [SPARK-4606] Send EOF to child JVM when there's no more data to read. (cherry picked from commit 7e2deb7) Signed-off-by: Josh Rosen <[email protected]>
Use `Future.zip` instead of `Future.flatMap` (for-loop) in WriteAheadLogBasedBlockHandler. `zip` implies these two Futures will run concurrently, while `flatMap` usually means one Future depends on the other one. Author: zsxwing <[email protected]> Closes #3721 from zsxwing/SPARK-4873 and squashes the following commits: 46a2cd9 [zsxwing] Use Future.zip instead of Future.flatMap(for-loop) (cherry picked from commit b4d0db8) Signed-off-by: Tathagata Das <[email protected]>
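A small sketch of the difference, with hypothetical stand-ins for the two writes:

```scala
import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global

def storeInBlockManager(): Future[Unit] = Future { /* store the block */ }
def writeToWriteAheadLog(): Future[Unit] = Future { /* append to the WAL */ }

// flatMap sequences the effects: the WAL write only starts after the
// block-manager write completes.
val sequential: Future[(Unit, Unit)] =
  storeInBlockManager().flatMap(r1 => writeToWriteAheadLog().map(r2 => (r1, r2)))

// zip starts both futures immediately and completes when both do,
// which is the concurrent behavior wanted here.
val concurrent: Future[(Unit, Unit)] =
  storeInBlockManager().zip(writeToWriteAheadLog())
```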
Corrected link to the Building Spark with Maven page from its original (http://spark.apache.org/docs/latest/building-with-maven.html) to the current page (http://spark.apache.org/docs/latest/building-spark.html) Author: Denny Lee <[email protected]> Closes #3802 from dennyglee/patch-1 and squashes the following commits: 15f601a [Denny Lee] Update README.md (cherry picked from commit 08b18c7) Signed-off-by: Josh Rosen <[email protected]>
Currently, the created broadcast object will have the same life cycle as the RDD in Python. For multistage jobs, a PythonRDD will be created in the JVM, and the RDD in Python may be GCed; then the broadcast will be destroyed in the JVM before the PythonRDD. This PR changes to using the PythonRDD to track the lifecycle of the broadcast object. It also includes a refactoring of getNumPartitions() to avoid unnecessary creation of PythonRDD, which could be heavy. cc JoshRosen Author: Davies Liu <[email protected]> Closes #5496 from davies/big_closure and squashes the following commits: 9a0ea4c [Davies Liu] fix big closure with shuffle Conflicts: python/pyspark/rdd.py
sbin/spark-daemon.sh used `ps -p "$TARGET_PID" -o args=` to figure out whether the process running with the expected PID is actually a Spark daemon. When running with a large classpath, the output of `ps` gets truncated and the check fails spuriously. This weakens the check to see if it's a java command (which is something we do in other parts of the script) rather than looking for the specific main class name. This means that SPARK-4832 might happen under a slightly broader range of circumstances (a java program happened to reuse the same PID), but it seems worthwhile compared to failing consistently with a large classpath. Author: Punya Biswal <[email protected]> Closes #5535 from punya/feature/SPARK-6952 and squashes the following commits: 7ea12d1 [Punya Biswal] Handle long args when detecting PID reuse
If `StreamingKMeans` is not `Serializable`, we cannot checkpoint applications that use `StreamingKMeans`, so we should make it `Serializable`. Author: zsxwing <[email protected]> Closes #5582 from zsxwing/SPARK-6998 and squashes the following commits: 67c2a14 [zsxwing] Make StreamingKMeans 'Serializable' (cherry picked from commit fa73da0) Signed-off-by: Reynold Xin <[email protected]>
Warn users who try to cache RDDs with dynamic allocation on. Author: Marcelo Vanzin <[email protected]> Closes #5751 from vanzin/cached-rdd-warning and squashes the following commits: 554cc07 [Marcelo Vanzin] Change message. 9efb9da [Marcelo Vanzin] [minor] [core] Warn users who try to cache RDDs with dynamic allocation on.
Revert "Warn users who try to cache RDDs with dynamic allocation on." This reverts commit d208047.
Fix infinite loop in ExternalSorter's mergeWithAggregation. See [SPARK-7181](https://issues.apache.org/jira/browse/SPARK-7181). Author: Qiping Li <[email protected]> Closes #5737 from chouqin/externalsorter and squashes the following commits: 2924b93 [Qiping Li] fix inifite loop in Externalsorter's mergeWithAggregation (cherry picked from commit 7f4b583) Signed-off-by: Sean Owen <[email protected]>
Re-use HiveConf in HiveQl Author: nitin2goyal <[email protected]> Closes #6036 from nitin2goyal/dev-nitin-1.2 and squashes the following commits: 7ff1f9e [nitin2goyal] [SPARK-7331][SQL] Re-use HiveConf in HiveQl
Applying this fix to branch 1.3, mengxr Author: Bryan Cutler <[email protected]> Closes #6111 from BryanCutler/dataFormat-option-1_3-7522 and squashes the following commits: 1a4c814 [Bryan Cutler] [SPARK-7522] Removed angle brackets from dataFormat option (cherry picked from commit 9445814) Signed-off-by: Sean Owen <[email protected]>
Pass args to start-master.sh through to start-daemon.sh, as other scripts do, so that things like --host take effect on start-master.sh, per the documentation. Author: Sean Owen <[email protected]> Closes #6185 from srowen/SPARK-5412 and squashes the following commits: b3ce9da [Sean Owen] Pass args to start-master.sh through to start-daemon.sh, as other scripts do, so that things like --host have effect on start-master.sh as per docs (cherry picked from commit 8ab1450) Signed-off-by: Andrew Or <[email protected]>
This patch wraps `SnappyOutputStream` to ensure that `close()` is idempotent and to guard against write-after-`close()` bugs. This is a workaround for xerial/snappy-java#107, a bug where a non-idempotent `close()` method can lead to stream corruption. We can remove this workaround if we upgrade to a snappy-java version that contains my fix for this bug, but in the meantime this patch offers a backportable Spark fix. Author: Josh Rosen <[email protected]> Closes #6176 from JoshRosen/SPARK-7660-wrap-snappy and squashes the following commits: 8b77aae [Josh Rosen] Wrap SnappyOutputStream to fix SPARK-7660 (cherry picked from commit f2cc6b5) Signed-off-by: Josh Rosen <[email protected]> Conflicts: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala core/src/test/java/org/apache/spark/shuffle/unsafe/UnsafeShuffleWriterSuite.java
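A minimal sketch of the wrapper idea (the class name is hypothetical; the real patch wraps `SnappyOutputStream` inside Spark's compression codec): make `close()` a no-op after the first call and fail fast on writes after close.

```scala
import java.io.{IOException, OutputStream}

// Guards a delegate stream against double-close and write-after-close.
class CloseShieldedOutputStream(out: OutputStream) extends OutputStream {
  private var closed = false

  private def ensureOpen(): Unit =
    if (closed) throw new IOException("Stream is closed")

  override def write(b: Int): Unit = { ensureOpen(); out.write(b) }
  override def write(b: Array[Byte], off: Int, len: Int): Unit = {
    ensureOpen(); out.write(b, off, len)
  }
  override def flush(): Unit = { ensureOpen(); out.flush() }
  // Idempotent: the second and later calls do nothing.
  override def close(): Unit = if (!closed) { closed = true; out.close() }
}
```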
Fix broken trainImplicit Scala example in the MLlib Collaborative Filtering documentation to match one of the possible ALS.trainImplicit function signatures. Author: Mike Dusenberry <[email protected]> Closes #6422 from dusenberrymw/Fix_MLlib_Collab_Filtering_trainImplicit_Example and squashes the following commits: 36492f4 [Mike Dusenberry] Fixing broken trainImplicit example in MLlib Collaborative Filtering documentation to match one of the possible ALS.trainImplicit function signatures. (cherry picked from commit 0463428) Signed-off-by: Xiangrui Meng <[email protected]>
Fix the bug where entering only 1 arg causes an array-out-of-bounds exception in the PageRank example. Author: Li Yao <[email protected]> Closes #6455 from lastland/patch-1 and squashes the following commits: de06128 [Li Yao] Fix the bug that entering only 1 arg will cause array out of bounds exception.
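A hedged sketch of the guard (the usage string and the default of 10 iterations are assumptions, not taken from the example's actual source):

```scala
// Fail with a usage message instead of an ArrayIndexOutOfBoundsException
// when the optional second argument is omitted.
def parseArgs(args: Array[String]): (String, Int) = {
  if (args.length < 1) {
    System.err.println("Usage: PageRank <file> [<iterations>]")
    sys.exit(1)
  }
  val file = args(0)
  val iters = if (args.length > 1) args(1).toInt else 10 // assumed default
  (file, iters)
}
```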
Author: MechCoder <[email protected]> Closes #6497 from MechCoder/spark-7946 and squashes the following commits: 2fdd0a3 [MechCoder] Add non-regression test 8c988c6 [MechCoder] [SPARK-7946] DecayFactor wrongly set in StreamingKMeans (cherry picked from commit 6181937) Signed-off-by: Xiangrui Meng <[email protected]>
Make version checking for NumPy in MLlib more robust. The current check treats version `1.x` as less than `1.4` by string comparison; this fails whenever x has more than one digit. It fails on my system, which has version `1.10`: x = 10 > 4, yet the string `1.10` compares less than `1.4`. :P Author: MechCoder <[email protected]> Closes #6579 from MechCoder/np_ver and squashes the following commits: 15430f8 [MechCoder] fix syntax error 893fb7e [MechCoder] remove equal to e35f0d4 [MechCoder] minor e89376c [MechCoder] Better checking 22703dd [MechCoder] [SPARK-8032] Make version checking for NumPy in MLlib more robust (cherry picked from commit 452eb82) Signed-off-by: Xiangrui Meng <[email protected]>
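The actual fix lives in PySpark's Python code; as a language-neutral illustration of the principle (compare version components numerically, not lexicographically), a small Scala sketch:

```scala
// Returns true when `version` >= `required`, comparing dotted components
// as integers so that "1.10" correctly outranks "1.4".
def versionAtLeast(version: String, required: String): Boolean = {
  val v = version.split("\\.").map(_.takeWhile(_.isDigit)).filter(_.nonEmpty).map(_.toInt)
  val r = required.split("\\.").map(_.toInt)
  v.zipAll(r, 0, 0)
    .find { case (a, b) => a != b }    // first differing component decides
    .forall { case (a, b) => a > b }   // no difference means versions are equal
}

// versionAtLeast("1.10", "1.4") == true, although "1.10" < "1.4" as strings.
```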
NullPointerException in SparkHadoopUtil.getFileSystemThreadStatistics (branch-1.2). This patch adds a regression test for an extremely rare bug where `SparkHadoopUtil.getFileSystemThreadStatistics` would fail with a `NullPointerException` if the Hadoop `FileSystem.statisticsTable` contained a `Statistics` entry without a schema. I'm not sure exactly how Hadoop gets into such a state, but this patch's regression test forces that state in order to reproduce this bug. The fix is to add additional null-checking. I debated adding an additional try-catch block around this entire metrics code to just ignore exceptions and keep going in the case of errors, but decided against that approach for now because it seemed overly conservative and might mask other bugs. We can revisit this in followup patches. Author: Josh Rosen <[email protected]> Closes #6618 from JoshRosen/SPARK-8062-branch-1.2 and squashes the following commits: 652fa3c [Josh Rosen] Re-name test and reapply fix 66fc600 [Josh Rosen] Fix and minimize regression test (verified that it still fails) 1d8d125 [Josh Rosen] Fix SPARK-8062 with additional null checks b6430f0 [Josh Rosen] Add failing regression test for SPARK-8062
Fix LabeledPoint parser when there is a whitespace between label and features vector, e.g. (y, [x1, x2, x3]). Author: Oleksiy Dyagilev <[email protected]> Closes #6954 from fe2s/SPARK-8525 and squashes the following commits: 0755b9d [Oleksiy Dyagilev] [SPARK-8525][MLLIB] addressing comment, removing dep on commons-lang c1abc2b [Oleksiy Dyagilev] [SPARK-8525][MLLIB] fix LabeledPoint parser when there is a whitespace on specific position (cherry picked from commit a803118) Signed-off-by: Xiangrui Meng <[email protected]>
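Assuming the standard MLlib entry point, a one-line sketch of the now-accepted input (note the space after the comma):

```scala
import org.apache.spark.mllib.regression.LabeledPoint

// Whitespace between the label and the features vector now parses cleanly.
val point = LabeledPoint.parse("(1.0, [1.0, 0.0, 3.0])")
```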
Fixed a bug so that IndexedRowMatrix.computeSVD().U.numCols = k. I'm sorry that I closed #6949 by mistake; I pushed the code again and added a test. > There is a bug where `U.numCols() = self.nCols` in `IndexedRowMatrix.computeSVD()`. It should have been `U.numCols() = k = svd.U.numCols()`. > ```
self    = U       * sigma   * V.transpose
(m x n) = (m x n) * (k x k) * (k x n)  // as-is (wrong)
(m x n) = (m x k) * (k x k) * (k x n)  // to-be (correct)
``` Author: lee19 <[email protected]> Closes #6953 from lee19/MLlibBugfix and squashes the following commits: c1812a0 [lee19] [SPARK-8563] [MLlib] Used nRows instead of numRows() to reduce a burden. 4b9803b [lee19] [SPARK-8563] [MLlib] Fixed a build error. c2ccd89 [lee19] Added a unit test that validates matrix sizes of svd for [SPARK-8563][MLlib] 8373424 [lee19] [SPARK-8563][MLlib] Fixed a bug so that IndexedRowMatrix.computeSVD().U.numCols = k (cherry picked from commit e725262) Signed-off-by: Xiangrui Meng <[email protected]>
Fixed typo in PySpark SparseVector doc tests. Several places in the PySpark SparseVector docs have one defined as:
```
SparseVector(4, [2, 4], [1.0, 2.0])
```
The index 4 goes out of bounds (but this is not checked). CC: mengxr Author: Joseph K. Bradley <[email protected]> Closes #7541 from jkbradley/sparsevec-doc-typo-fix and squashes the following commits: c806a65 [Joseph K. Bradley] fixed doc test e2dcb23 [Joseph K. Bradley] Fixed typo in pyspark sparsevector doc tests (cherry picked from commit a5d0581) Signed-off-by: Xiangrui Meng <[email protected]>
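The same pitfall, shown for contrast in MLlib's Scala API (indices must be strictly less than the vector size):

```scala
import org.apache.spark.mllib.linalg.Vectors

// Valid: indices 2 and 3 fit a size-4 vector (valid indices are 0..3).
val ok = Vectors.sparse(4, Array(2, 3), Array(1.0, 2.0))
// The doc's original indices [2, 4] reference index 4, which is out of
// bounds for size 4 — exactly the typo the patch fixes.
```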
Fix BLAS.gemm to update matrix C when alpha==0 and beta!=1. Also include unit tests to verify the fix. mengxr brkyvz Author: Meihua Wu <[email protected]> Closes #7503 from rotationsymmetry/fix_BLAS_gemm and squashes the following commits: fce199c [Meihua Wu] Fix BLAS.gemm to update C when alpha==0 and beta!=1 (cherry picked from commit ff3c72d) Signed-off-by: Xiangrui Meng <[email protected]>
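gemm computes C := alpha * A * B + beta * C, so with alpha == 0 the correct result is still C := beta * C; the bug was treating alpha == 0 as a complete no-op. A sketch of the corrected guard (the helper name and shape are hypothetical, not MLlib's actual code):

```scala
// Returns true when the full product term still needs to be computed.
def gemmPrologue(alpha: Double, beta: Double, c: Array[Double]): Boolean = {
  if (alpha == 0.0) {
    if (beta != 1.0) {
      var i = 0
      while (i < c.length) { c(i) *= beta; i += 1 } // C := beta * C, not a no-op
    }
    false // alpha * A * B contributes nothing; skip the product
  } else {
    true
  }
}
```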
Remove 2 defunct SBT download URLs and replace with the 1 known download URL. Also, use https. Follow up on #7792 Author: Sean Owen <[email protected]> Closes #7956 from srowen/SPARK-9633 and squashes the following commits: caa40bd [Sean Owen] Remove 2 defunct SBT download URLs and replace with the 1 known download URL. Also, use https. Conflicts: sbt/sbt-launch-lib.bash
…ld be 0.0 (original: 1.0). Small typo in the example for `LabeledPoint` in the MLlib docs. Author: Sean Paradiso <[email protected]> Closes #8680 from sparadiso/docs_mllib_smalltypo. (cherry picked from commit 1dc7548) Signed-off-by: Xiangrui Meng <[email protected]>
…rialization. Python time values are floating point, and need to be cast to an integer before serializing with `struct.pack('!q', value)`. https://issues.apache.org/jira/browse/SPARK-6931 Author: Bryan Cutler <[email protected]> Closes #8594 from BryanCutler/py-write_long-backport-6931-1.2.
…keys. JIRA: https://issues.apache.org/jira/browse/SPARK-10642 When calling `rdd.lookup()` on an RDD with tuple keys, `portable_hash` will return a long. That causes `DAGScheduler.submitJob` to throw `java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer`. Author: Liang-Chi Hsieh <[email protected]> Closes #8796 from viirya/fix-pyrdd-lookup. (cherry picked from commit 136c77d) Signed-off-by: Davies Liu <[email protected]>
As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to use our custom SCP-based mechanism for archiving Jenkins logs on the master machine; this has been superseded by the use of a Jenkins plugin which archives the logs and provides public links to view them. Per shaneknapp, we should remove this log syncing mechanism if it is no longer necessary; removing the need to SCP from the Jenkins workers to the masters is a desired step as part of some larger Jenkins infra refactoring. Author: Josh Rosen <[email protected]> Closes #8793 from JoshRosen/remove-jenkins-ssh-to-master. (cherry picked from commit f1c9115) Signed-off-by: Josh Rosen <[email protected]>
The created decimal is wrong when using `Decimal(unscaled, precision, scale)` with unscaled > 1e18, precision > 18, and scale > 0. This bug has existed since the beginning. Author: Davies Liu <[email protected]> Closes #9014 from davies/fix_decimal. (cherry picked from commit 37526ac) Signed-off-by: Davies Liu <[email protected]> Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/decimal/Decimal.scala
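A sketch of the invariant behind the fix (names are illustrative, not Spark's actual Decimal internals): values whose precision exceeds 18 digits cannot use the compact unscaled-Long representation and must go through BigDecimal.

```scala
// Up to 18 decimal digits always fit an unscaled Long; beyond that the
// compact representation silently overflows.
val MaxLongDigits = 18

def mkDecimal(unscaled: BigInt, precision: Int, scale: Int): BigDecimal = {
  require(unscaled.abs < BigInt(10).pow(precision), "unscaled value exceeds precision")
  if (precision <= MaxLongDigits) BigDecimal(unscaled.toLong, scale) // compact path
  else BigDecimal(unscaled, scale)                                   // wide path
}
```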
jira: https://issues.apache.org/jira/browse/SPARK-11813 I found the problem while training on a large corpus. Avoiding serialization of the vocab in Word2Vec has 2 benefits: 1. Performance improvement through less serialization. 2. It increases the capacity of Word2Vec a lot. Currently in Word2Vec's fit, the closure mainly includes the serialization of Word2Vec and 2 global tables. The main part of Word2Vec is the vocab, of size vocab * 40 * 2 * 4 = 320*vocab bytes; each of the 2 global tables is vocab * vectorSize * 8 bytes. If vectorSize = 20, that's 160*vocab bytes. Their sum cannot exceed Int.max due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of the vocab helps decrease the size of the closure serialization, especially when vectorSize is small, and thus allows a larger vocabulary. Actually there's another possible fix: make local copies of the fields to avoid including Word2Vec in the closure. Let me know if that's preferred. Author: Yuhao Yang <[email protected]> Closes #9803 from hhbyyh/w2vVocab. (cherry picked from commit e391abd) Signed-off-by: Xiangrui Meng <[email protected]>
Can one of the admins verify this patch?
@ErlangZ please close this PR