[WIP] Simplify the build with sbt 0.13.2 features #706
Conversation
Can one of the admins verify this patch?
Hey @jaceklaskowski - this is nice to have as a cleanup. Just a heads up: we're likely to restructure the sbt build to read dependencies from Maven, since right now we are maintaining two builds. So we might wait until we see what that looks like before we consider other major changes to the build.
Hey @pwendell, thanks for the prompt response! I'd love to be engaged in the effort if possible. Where could we discuss how much I could do regarding sbt? (I think I might be quite helpful here and there.) Is there a JIRA issue where I could follow the progress of the task? Please guide me so the project can benefit from some of the spare time I can devote.
```diff
 import sbtassembly.Plugin._
 import AssemblyKeys._
-import scala.util.Properties
+import util.Properties
```
Thanks! Will keep that in mind when sending patches in the future.
Fixed
As to the changes I proposed in the PR: whatever the future steps with sbt-pom-reader turn out to be, I think these changes remain easily applicable to the build. They are meant to simplify the build definition to leverage the sbt 0.13.2 macros, and are not expected to interfere with the upcoming changes; they're slated for 1.1.0, after all. I'm writing all this hoping to convince you guys, the committers, to accept the PR so I can propose others ;-) I'd also like to play a bit with the optional project inclusions, since I think there are too many env vars involved in working with the different versions of Hadoop (see the sketch below).
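For readers unfamiliar with the pattern being referred to, the build at the time gated optional subprojects on environment variables, roughly like this (a simplified sketch with illustrative variable and project names, not the exact SparkBuild code):

```scala
import scala.util.Properties

object EnvToggles extends App {
  // Each optional subproject is gated by its own environment variable,
  // e.g. SPARK_YARN=true sbt assembly
  val isYarnEnabled =
    Properties.envOrNone("SPARK_YARN").exists(_.equalsIgnoreCase("true"))

  // The aggregated project list is then assembled conditionally.
  val maybeYarn: Seq[String] = if (isYarnEnabled) Seq("yarn") else Seq.empty
  println(s"optional projects: ${maybeYarn.mkString(", ")}")
}
```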
…-refactoring (Conflicts: project/SparkBuild.scala)
Knock, knock. Could I ask you to approve the changes, or to let me know what needs to change to get this in?
```diff
@@ -297,7 +273,7 @@ object SparkBuild extends Build {
   val chillVersion = "0.3.6"
   val codahaleMetricsVersion = "3.0.0"
   val jblasVersion = "1.2.3"
-  val jets3tVersion = if ("^2\\.[3-9]+".r.findFirstIn(hadoopVersion).isDefined) "0.9.0" else "0.7.1"
+  val jets3tVersion = "^2\\.[3-9]+".r.findFirstIn(hadoopVersion).fold("0.7.1")(_ => "0.9.0")
```
This is a tangential question: is this more idiomatic Scala? I've seen `fold` as an alternative to if-else or other uses of Option, but had thought it harder to understand.
I myself learnt it not so long ago, and noticed it has spurred some discussion in the Scala community (with Martin Odersky himself). Functional Programming in Scala reads on page 69 (in the PDF version):

"It's fine to use pattern matching, though you should be able to implement all the functions besides map and getOrElse without resorting to pattern matching."

So I followed the advice and applied `Option.map(...).getOrElse(...)` quite intensively, but... just before I committed the change, IntelliJ IDEA suggested replacing it with `fold`. I thought I needed to change my habits once more and sent the PR with what IDEA offered.

With all that said, it's not clear which is the most idiomatic approach, but pattern matching is in my opinion a step back from map/getOrElse, and there's no need for it here. I'd appreciate being corrected, and could even replace `Option.fold` with `Option.map`/`Option.getOrElse` if that would make the PR better for more committers.
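For context, the three formulations discussed in this thread are equivalent. Here is a minimal, runnable sketch (the `hadoopVersion` value is hardcoded purely for illustration):

```scala
object Jets3tVersionDemo extends App {
  val hadoopVersion = "2.4.0" // hardcoded purely for illustration

  val matched = "^2\\.[3-9]+".r.findFirstIn(hadoopVersion) // Option[String]

  // 1. if/else on isDefined: the original formulation in SparkBuild
  val v1 = if (matched.isDefined) "0.9.0" else "0.7.1"

  // 2. map/getOrElse: the style the committers preferred
  val v2 = matched.map(_ => "0.9.0").getOrElse("0.7.1")

  // 3. fold: what IntelliJ IDEA suggested; note the default ("empty") case comes first
  val v3 = matched.fold("0.7.1")(_ => "0.9.0")

  assert(v1 == v2 && v2 == v3) // all three yield "0.9.0" for Hadoop 2.4.0
  println(v1)
}
```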
We discussed this a while ago, soon after Spark development moved to Scala 2.10. The overwhelming preference was that we use map/getOrElse instead of fold. When working on Spark, consider turning off that particular inspection and warning in the IntelliJ preferences.
This is what Patrick was trying to say in https://issues.apache.org/jira/browse/SPARK-1776. So while we have this on our minds: even if we merge this patch, it will eventually have to be replaced entirely. We are still experimenting, and let's see if we all agree to this. OTOH, we should retain the comments in case we decide to merge this patch. Jenkins, test this please.
Thanks! I've said it before (when @pwendell asked to hold off) and now I'll say it again, since my changes don't seem likely to find a home before the "We are still experimenting" phase is over. When is the experimentation happening? Is there a branch for it? Is there a discussion on the mailing list(s) about how it's going to be done? I'd appreciate more openness in this regard (to avoid Kafka's case, where they moved to Gradle for no apparent reason other than that they didn't seem to have cared to learn sbt well enough).
Here is a link to our mailing list discussion on the topic: … Note also that the biggest time sink right now is preparing the 1.0 release.
```diff
@@ -55,81 +58,77 @@ object SparkBuild extends Build {
   val SCALAC_JVM_VERSION = "jvm-1.6"
   val JAVAC_JVM_VERSION = "1.6"

-  lazy val root = Project("root", file("."), settings = rootSettings) aggregate(allProjects: _*)
+  lazy val root = project in file(".") settings (rootSettings: _*) aggregate(allProjects: _*)
```
Hm, actually, let me step back. Is this change purporting to make the style better, or is there a functionality change?
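For reference, a minimal sketch contrasting the two styles (project names illustrative, assuming sbt 0.13.x). As far as I can tell there is no functional difference in this particular line, since the `project` macro infers the project ID from the `val` name:

```scala
import sbt._
import Keys._

object StyleComparison extends Build {
  val rootSettings = Seq(name := "spark-root")
  lazy val core = project in file("core")

  // sbt 0.12-era style: explicit constructor with an explicit project ID
  // lazy val root = Project("root", file("."), settings = rootSettings).aggregate(core)

  // sbt 0.13.2 style: the `project` macro infers the ID ("root") from the val name
  lazy val root = (project in file("."))
    .settings(rootSettings: _*)
    .aggregate(core)
}
```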
I think we should close this issue now, because we separately refactored this build.
It's a WIP, but I'm opening the pull request with the current changes in the hope that someone from the dev team will have a look and guide me on how to merge them into the project.
Ultimately I'd like to refactor the build to make it easier to understand.