Spark 1271 (1320): cogroup and groupBy should pass Iterator[X] #242
Conversation
Merged build triggered.
Merged build started.
Merged build finished.
One or more automated tests failed
Merged build triggered.
Merged build started.
Merged build finished.
One or more automated tests failed
Jenkins, retest this please. (sent by email in reply to the UCB AMPLab build notification, Wednesday, March 26, 2014)
What is the immediate need for this change?
We are trying to cement the API for 1.0. This will allow us to later provide an actual iterator-based implementation that does not materialize all the values for a single group-by key at once. With the current Seq-based signature, both the current implementation and any future one have essentially no choice but to OOM when the values for a key are too large to fit in memory.
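The distinction can be sketched in Python (a hypothetical illustration, not Spark's actual implementation): a list-based group holds every value for a key in memory at once, while an iterator-based group can be consumed one element at a time.

```python
from itertools import groupby

pairs = [("a", 1), ("a", 2), ("b", 3)]  # already sorted by key

# Eager grouping: every value for a key is materialized as a list,
# so a huge group means a huge in-memory collection (the OOM risk).
eager = {k: [v for _, v in g] for k, g in groupby(pairs, key=lambda kv: kv[0])}

# Lazy grouping: each group is a generator; values stream one at a time
# and are never all resident at once.
def grouped(pairs):
    for k, g in groupby(pairs, key=lambda kv: kv[0]):
        yield k, (v for _, v in g)

totals = {k: sum(vs) for k, vs in grouped(pairs)}
```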
Fair enough; I just wanted to ensure it was being done in preparation for a disk-backed iterator (or at least that such an implementation becomes possible with this change). By the way, this is an incompatible API change, right? Any attempt to preserve the existing behaviour (with a deprecation warning)?
It would be a bit difficult to preserve the other return type since the call signatures are the same. It's probably worth calling out in the release notes, given that I had to change some code which compiled fine against the new API but failed at run time (mostly because in Scala the iterator type has a size method which, while it works, consumes all of the elements).
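The run-time pitfall described here can be reproduced in a small Python sketch (Scala's `Iterator.size` behaves analogously): counting the elements of an iterator works, but only by consuming them, so the values are gone afterwards.

```python
values = iter([1, 2, 3])

# "size" on an iterator works, but only by consuming every element.
count = sum(1 for _ in values)

# The iterator is now exhausted; code that compiled fine against the
# new API silently sees no data at run time.
leftover = list(values)
```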
Merged build triggered. One or more automated tests failed
Merged build started. One or more automated tests failed
@mridulm I think type erasure will make it difficult to preserve compatibility.
Do we also want these changes for the Python API?
Merged build finished. All automated tests passed.
All automated tests passed.
@pwendell Agreed, I was not referring to method overloading.
@mridulm If not method overloading, do you have a specific proposal in mind? Downstream consumers will have to add toSeq.
Can one of the admins verify this patch?
So I made the Python API match as well.
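The commit log mentions switching the Python result to a "result iterable" because plain Python iterators cannot be pickled. A hypothetical sketch of such a wrapper (not PySpark's actual implementation) is a thin list-backed class that iterates like an iterator but still round-trips through pickle:

```python
import pickle

# Hypothetical "result iterable": wraps the data in a list so it can be
# pickled for shipping between processes, while still being iterable.
class ResultIterable:
    def __init__(self, data):
        self.data = list(data)

    def __iter__(self):
        return iter(self.data)

    def __len__(self):
        return len(self.data)

# A raw iterator fails to pickle, but the wrapper round-trips fine.
wrapped = ResultIterable(iter([1, 2, 3]))
restored = pickle.loads(pickle.dumps(wrapped))
```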
@mridulm Did you want something like a legacyGroupByKey method that keeps the old behaviour?
Merged build triggered. One or more automated tests failed.
I think moving your function to …
Merged build started. One or more automated tests failed.
@pwendell True, but it's also probably easier to do quickly if you have a lot of code, and the Java people can't use toSeq. That being said, I'd rather not add the legacy functions unless there is a strong use case for them.
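For downstream code that indexed into or re-traversed the grouped values, the quick migration discussed here looks roughly like the following Python sketch (a hypothetical snippet; the Scala equivalent would be a `.toSeq` call):

```python
# Hypothetical downstream code: the grouped values used to arrive as a
# list, but now arrive as an iterator.
grouped = [("k", iter([10, 20, 30]))]

for key, values in grouped:
    values = list(values)                # one-line migration: materialize explicitly
    first, last = values[0], values[-1]  # random access works again
    size = len(values)                   # and so does len()
```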
@holdenk this needs to be up-merged to master
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13910/
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
All automated tests passed.
Thanks @holdenk - merged!
Author: Holden Karau <[email protected]>

Closes apache#242 from holdenk/spark-1320-cogroupandgroupshouldpassiterator and squashes the following commits:

f289536 [Holden Karau] Fix bad merge, should have been Iterable rather than Iterator
77048f8 [Holden Karau] Fix merge up to master
d3fe909 [Holden Karau] use toSeq instead
7a092a3 [Holden Karau] switch resultitr to resultiterable
eb06216 [Holden Karau] maybe I should have had a coffee first. use correct import for guava iterables
c5075aa [Holden Karau] If guava 14 had iterables
2d06e10 [Holden Karau] Fix Java 8 cogroup tests for the new API
11e730c [Holden Karau] Fix streaming tests
66b583d [Holden Karau] Fix the core test suite to compile
4ed579b [Holden Karau] Refactor from iterator to iterable
d052c07 [Holden Karau] Python tests now pass with iterator pandas
3bcd81d [Holden Karau] Revert "Try and make pickling list iterators work"
cd1e81c [Holden Karau] Try and make pickling list iterators work
c60233a [Holden Karau] Start investigating moving to iterators for python API like the Java/Scala one. tl;dr: We will have to write our own iterator since the default one doesn't pickle well
88a5cef [Holden Karau] Fix cogroup test in JavaAPISuite for streaming
a5ee714 [Holden Karau] oops, was checking wrong iterator
e687f21 [Holden Karau] Fix groupbykey test in JavaAPISuite of streaming
ec8cc3e [Holden Karau] Fix test issues!
4b0eeb9 [Holden Karau] Switch cast in PairDStreamFunctions
fa395c9 [Holden Karau] Revert "Add a join based on the problem in SVD"
ec99e32 [Holden Karau] Revert "Revert this but for now put things in list pandas"
b692868 [Holden Karau] Revert
7e533f7 [Holden Karau] Fix the bug
8a5153a [Holden Karau] Revert me, but we have some stuff to debug
b4e86a9 [Holden Karau] Add a join based on the problem in SVD
c4510e2 [Holden Karau] Revert this but for now put things in list pandas
b4e0b1d [Holden Karau] Fix style issues
71e8b9f [Holden Karau] I really need to stop calling size on iterators, it is the path of sadness.
b1ae51a [Holden Karau] Fix some of the types in the streaming JavaAPI suite. Probably still needs more work
37888ec [Holden Karau] core/tests now pass
249abde [Holden Karau] org.apache.spark.rdd.PairRDDFunctionsSuite passes
6698186 [Holden Karau] Revert "I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy"
fe992fe [Holden Karau] hmmm try and fix up basic operation suite
172705c [Holden Karau] Fix Java API suite
caafa63 [Holden Karau] I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy
88b3329 [Holden Karau] Fix groupbykey to actually give back an iterator
4991af6 [Holden Karau] Fix some tests
be50246 [Holden Karau] Calling size on an iterator is not so good if we want to use it after
687ffbc [Holden Karau] This is the it compiles point of replacing Seq with Iterator and JList with JIterator in the groupby and cogroup signatures
[SPARKR-92] Phase 2: implement sum(rdd)
This PR pulls in recent changes in SparkR-pkg, including cartesian, intersection, sampleByKey, subtract, subtractByKey, except, and some API for StructType and StructField.

Author: cafreeman <[email protected]>
Author: Davies Liu <[email protected]>
Author: Zongheng Yang <[email protected]>
Author: Shivaram Venkataraman <[email protected]>
Author: Shivaram Venkataraman <[email protected]>
Author: Sun Rui <[email protected]>

Closes #5436 from davies/R3 and squashes the following commits:

c2b09be [Davies Liu] SQLTypes -> schema
a5a02f2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R3
168b7fe [Davies Liu] sort generics
b1fe460 [Davies Liu] fix conflict in README.md
e74c04e [Davies Liu] fix schema.R
4f5ac09 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R5
41f8184 [Davies Liu] rm man
ae78312 [Davies Liu] Merge pull request #237 from sun-rui/SPARKR-154_3
1bdcb63 [Zongheng Yang] Updates to README.md.
5a553e7 [cafreeman] Use object attribute instead of argument
71372d9 [cafreeman] Update docs and examples
8526d2e [cafreeman] Remove `tojson` functions
6ef5f2d [cafreeman] Fix spacing
7741d66 [cafreeman] Rename the SQL DataType function
141efd8 [Shivaram Venkataraman] Merge pull request #245 from hqzizania/upstream
9387402 [Davies Liu] fix style
40199eb [Shivaram Venkataraman] Move except into sorted position
07d0dbc [Sun Rui] [SPARKR-244] Fix test failure after integration of subtract() and subtractByKey() for RDD.
7e8caa3 [Shivaram Venkataraman] Merge pull request #246 from hlin09/fixCombineByKey
ed66c81 [cafreeman] Update `subtract` to work with `generics.R`
f3ba785 [cafreeman] Fixed duplicate export
275deb4 [cafreeman] Update `NAMESPACE` and tests
1a3b63d [cafreeman] new version of `CreateDF`
836c4bf [cafreeman] Update `createDataFrame` and `toDF`
be5d5c1 [cafreeman] refactor schema functions
40338a4 [Zongheng Yang] Merge pull request #244 from sun-rui/SPARKR-154_5
20b97a6 [Zongheng Yang] Merge pull request #234 from hqzizania/assist
ba54e34 [Shivaram Venkataraman] Merge pull request #238 from sun-rui/SPARKR-154_4
c9497a3 [Shivaram Venkataraman] Merge pull request #208 from lythesia/master
b317aa7 [Zongheng Yang] Merge pull request #243 from hqzizania/master
136a07e [Zongheng Yang] Merge pull request #242 from hqzizania/stats
cd66603 [cafreeman] new line at EOF
8b76e81 [Shivaram Venkataraman] Merge pull request #233 from redbaron/fail-early-on-missing-dep
7dd81b7 [cafreeman] Documentation
0e2a94f [cafreeman] Define functions for schema and fields
….sql

## What changes were proposed in this pull request?

This patch moves all Vacuum-related code from `org.apache.spark` to `com.databricks.sql` as part of the general task of clearly separating Edge issues in order to reduce merge conflicts with OSS. `AclCommandParser` is renamed to a more general `DatabricksSqlParser`, to be used for all DB-specific syntax, and is moved to a new package called `com.databricks.sql.parser`. `VacuumTableCommand` is moved from `org.apache.spark.sql.execution.command` to `com.databricks.sql.transaction`.

## How was this patch tested?

Tests in project `spark-sql` pass.

Author: Adrian Ionescu <[email protected]>

Closes apache#242 from adrian-ionescu/SC-5840.
The problem with PR apache#242 was that it renamed, but didn't completely decouple, `DatabricksSqlParser` from Acls. As such, the Vacuum command was only recognized if Acl support was enabled (via `spark.session.extensions = AclExtensions` and `spark.databricks.acl.enabled = true`).

## What changes were proposed in this pull request?

- extract out all Acl-related commands from `DatabricksSqlCommandBuilder` into `AclCommandBuilder`
- separate the related test suites accordingly
- make the Acl client optional for `DatabricksSqlParser`
- create a new `DatabricksExtensions` class that injects `DatabricksSqlParser` without Acl support
- apply `DatabricksExtensions` by default

## How was this patch tested?

Ran all tests in `spark-sql`. Manually tested `Vacuum` from `sparkShell` and `sparkShellAcl`.

Author: Adrian Ionescu <[email protected]>

Closes apache#248 from adrian-ionescu/db-parser.
* MapR [SPARK-194] Redirect to Spark History server
Add install-k8s role
No description provided.