[WIP] Simplify the build with sbt 0.13.2 features #706
Conversation
Can one of the admins verify this patch?
Hey @jaceklaskowski - this is nice to have as a cleanup. Just a heads up: we're likely to restructure the sbt build to read dependencies from Maven, since right now we are maintaining two builds. So we might wait until we see what that looks like before we consider other major changes to the build.
Hey @pwendell, thanks for the prompt response! I'd love to be engaged in the effort if possible. Where could we discuss how much I could do regarding sbt? (I think I might be quite helpful here and there.) Is there a JIRA issue where I could follow the progress of the task? Please guide me so the project can benefit from some of the spare time I can devote.
```diff
 import sbtassembly.Plugin._
 import AssemblyKeys._
-import scala.util.Properties
+import util.Properties
```
Thanks! Will keep that in mind when sending patches in the future.
Fixed
As to the changes I proposed in the PR: whatever the future steps with sbt-pom-reader turn out to be, I think these changes remain easily applicable to the build. They are meant to simplify the build definition to leverage the sbt 0.13.2 macros, and are not expected to interfere with the upcoming changes; they're slated for 1.1.0, after all. I'm writing all this hoping to convince you guys, the committers, to accept the PR so I can propose others ;-) I'd also like to play a bit with the optional project inclusions, since I think there are too many env vars involved in working with the different versions of Hadoop (see the sketch below).
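For readers unfamiliar with the pattern being referred to, the build at the time gated optional subprojects on environment variables, roughly like this (a simplified sketch with illustrative variable and project names, not the exact SparkBuild code):

```scala
import scala.util.Properties

object EnvToggles extends App {
  // Each optional subproject is gated by its own environment variable,
  // e.g. SPARK_YARN=true sbt assembly
  val isYarnEnabled =
    Properties.envOrNone("SPARK_YARN").exists(_.equalsIgnoreCase("true"))

  // The aggregated project list is then assembled conditionally.
  val maybeYarn: Seq[String] = if (isYarnEnabled) Seq("yarn") else Seq.empty
  println(s"optional projects: ${maybeYarn.mkString(", ")}")
}
```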
…-refactoring (Conflicts: project/SparkBuild.scala)
Knock, knock. Could I ask you to approve the changes, or to let me know what needs to change to get this in?
```diff
@@ -297,7 +273,7 @@ object SparkBuild extends Build {
   val chillVersion = "0.3.6"
   val codahaleMetricsVersion = "3.0.0"
   val jblasVersion = "1.2.3"
-  val jets3tVersion = if ("^2\\.[3-9]+".r.findFirstIn(hadoopVersion).isDefined) "0.9.0" else "0.7.1"
+  val jets3tVersion = "^2\\.[3-9]+".r.findFirstIn(hadoopVersion).fold("0.7.1")(_ => "0.9.0")
```
This is a tangential question: is this more idiomatic Scala? I've seen `fold` as an alternative to if-else or other uses of Option, but had thought it harder to understand.
I myself learnt it not so long ago, and noticed it has spurred some discussion in the Scala community (with Martin Odersky himself). Functional Programming in Scala reads on page 69 (in the PDF version):

"It's fine to use pattern matching, though you should be able to implement all the functions besides map and getOrElse without resorting to pattern matching."

So I followed the advice and applied `Option.map(...).getOrElse(...)` quite intensively, but... just before I committed the change, IntelliJ IDEA suggested replacing it with `fold`. I thought I needed to change my habits once more and sent the PR with what IDEA offered.

With all that said, it's not clear which is the most idiomatic approach, but pattern matching is in my opinion a step back from map/getOrElse, and there's no need for it here. I'd appreciate being corrected, and could even replace `Option.fold` with `Option.map`/`Option.getOrElse` if that would make the PR better for more committers.
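For context, the three formulations discussed in this thread are equivalent. Here is a minimal, runnable sketch (the `hadoopVersion` value is hardcoded purely for illustration):

```scala
object Jets3tVersionDemo extends App {
  val hadoopVersion = "2.4.0" // hardcoded purely for illustration

  val matched = "^2\\.[3-9]+".r.findFirstIn(hadoopVersion) // Option[String]

  // 1. if/else on isDefined: the original formulation in SparkBuild
  val v1 = if (matched.isDefined) "0.9.0" else "0.7.1"

  // 2. map/getOrElse: the style the committers preferred
  val v2 = matched.map(_ => "0.9.0").getOrElse("0.7.1")

  // 3. fold: what IntelliJ IDEA suggested; note the default ("empty") case comes first
  val v3 = matched.fold("0.7.1")(_ => "0.9.0")

  assert(v1 == v2 && v2 == v3) // all three yield "0.9.0" for Hadoop 2.4.0
  println(v1)
}
```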
We discussed this a while ago, soon after Spark development moved to Scala 2.10. The overwhelming preference was that we use map/getOrElse instead of fold. When working on Spark, consider turning off that particular inspection and warning in the IntelliJ preferences.
This is what Patrick was trying to say in https://issues.apache.org/jira/browse/SPARK-1776. So while we have this on our minds: even if we merge this patch, it will eventually have to be replaced entirely. We are still experimenting, and let's see if we all agree to this. OTOH, we should retain the comments in case we decide to merge this patch. Jenkins, test this please.
Thanks! I've said it before (when @pwendell asked to hold off) and now I'll say it again, since my changes don't seem likely to find a home before the "We are still experimenting" phase is over. When is the experimentation happening? Is there a branch for it? Is there a discussion on the mailing list(s) about how it's going to be done? I'd appreciate more openness in this regard (to avoid Kafka's case, where they moved to Gradle for no apparent reason other than that they didn't seem to have cared to learn sbt well enough).
Here is a link to our mailing list discussion on the topic: … Note also that the biggest time sink right now is preparing the 1.0 release.
```diff
@@ -55,81 +58,77 @@ object SparkBuild extends Build {
   val SCALAC_JVM_VERSION = "jvm-1.6"
   val JAVAC_JVM_VERSION = "1.6"

-  lazy val root = Project("root", file("."), settings = rootSettings) aggregate(allProjects: _*)
+  lazy val root = project in file(".") settings (rootSettings: _*) aggregate(allProjects: _*)
```
Hm, actually, let me step back. Is this change purporting to make the style better, or is there a functionality change?
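For reference, a minimal sketch contrasting the two styles (project names illustrative, assuming sbt 0.13.x). As far as I can tell there is no functional difference in this particular line, since the `project` macro infers the project ID from the `val` name:

```scala
import sbt._
import Keys._

object StyleComparison extends Build {
  val rootSettings = Seq(name := "spark-root")
  lazy val core = project in file("core")

  // sbt 0.12-era style: explicit constructor with an explicit project ID
  // lazy val root = Project("root", file("."), settings = rootSettings).aggregate(core)

  // sbt 0.13.2 style: the `project` macro infers the ID ("root") from the val name
  lazy val root = (project in file("."))
    .settings(rootSettings: _*)
    .aggregate(core)
}
```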
I think we should close this issue now, because we separately refactored this build.
It's a WIP, but I'm opening the pull request with the current changes in the hope that someone from the dev team will have a look and guide me on how to merge them into the project.
Ultimately I'd like to refactor the build to make it easier to understand.