[SPARK-11691][SQL] Allow to specify compression codec in HadoopFsRela… #9657
Conversation
Test build #45721 has finished for PR 9657 at commit
Test build #45728 has finished for PR 9657 at commit
@Lewuathe Could you help review this? This is a dependency issue for refactoring CsvRelation to extend HadoopFsRelation: CsvRelation currently supports writing to a compressed format, while HadoopFsRelation does not.
@@ -28,7 +28,7 @@ import org.apache.spark.sql.catalyst.plans.logical.{Project, InsertIntoTable}
import org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils
import org.apache.spark.sql.execution.datasources.{CreateTableUsingAsSelect, ResolvedDataSource}
import org.apache.spark.sql.sources.HadoopFsRelation

import org.apache.hadoop.io.compress.CompressionCodec
Third-party modules should be put above org.apache.spark.* modules.
See: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports
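For reference, a minimal sketch of the grouping the style guide describes (the file and imports here are illustrative, not the actual PR diff): standard library first, then third-party modules such as Hadoop, then org.apache.spark.*, with blank lines between groups.

```scala
// Illustrative import grouping per the Spark style guide:
import java.io.File                                    // 1. Java/Scala standard library

import org.apache.hadoop.io.compress.CompressionCodec  // 2. Third-party modules

import org.apache.spark.sql.sources.HadoopFsRelation   // 3. org.apache.spark.* last
```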
@zjffdu Do you intend to remove the compression codec from …? In any case, the compression codec option in …
@@ -30,7 +30,7 @@ import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.{RunnableCommand, SQLExecution}
import org.apache.spark.sql.sources._
import org.apache.spark.util.Utils

import org.apache.hadoop.io.compress.CompressionCodec
Same as the DataFrameWriter import.
@Lewuathe Yes, I'd also like to make CsvRelation extend HadoopFsRelation after this change. Since CsvRelation currently supports compression, I want to keep that compatibility.
test("compression") {
  val tempDirPath = Files.createTempDir().getAbsolutePath
  val df = sqlContext.read.text(testFile)
  df.show()
Was this written for debugging? We can remove the show().
Right, will correct it.
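A possible revision of the test with the debug show() dropped, as a sketch only: testFile, the option key, and the count-based check are assumptions drawn from the surrounding diff, not the final PR code.

```scala
// Sketch of the revised test (requires import org.apache.hadoop.io.compress.GzipCodec)
test("compression") {
  val tempDirPath = Files.createTempDir().getAbsolutePath
  val df = sqlContext.read.text(testFile)
  // Write compressed output, then read it back and compare row counts
  // (hypothetical assertion; the actual check in the PR may differ)
  df.write
    .option("compression.codec", classOf[GzipCodec].getCanonicalName)
    .text(tempDirPath)
  assert(sqlContext.read.text(tempDirPath).count() === df.count())
}
```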
@Lewuathe Thanks for the review. I pushed another commit to address the comments. I also changed the compression feature's version annotation to 1.6.0.
Test build #47034 has finished for PR 9657 at commit
Test build #47036 has finished for PR 9657 at commit
Never mind, I changed it back to 1.7.0 since 1.6 is already at rc1.
Test build #47118 has finished for PR 9657 at commit
 * @since 1.7.0
 */
def compress(codec: Class[_ <: CompressionCodec]): DataFrameWriter = {
  this.extraOptions += ("compression.codec" -> codec.getCanonicalName)
The amount of code changes should be small, so we do not need this additional interface.
If we add an interface for each additional option, the number of interfaces blows up.
Agreed that we should not add an interface for every configuration, but since compression is a very common property, I feel it would be better to keep this interface. We also expose a compression API in RDD.saveAsXX, so I think it would be better to be consistent here in DataFrame.
okay
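To make the trade-off above concrete, here is a hedged sketch of the two API shapes under discussion (the output paths and the GzipCodec choice are placeholders, not code from the PR):

```scala
// (a) Dedicated method, as proposed in this PR:
df.write
  .compress(classOf[GzipCodec])
  .text("/tmp/out-a")

// (b) Generic option, which keeps DataFrameWriter's surface small;
//     compress() above is just sugar for setting this key:
df.write
  .option("compression.codec", classOf[GzipCodec].getCanonicalName)
  .text("/tmp/out-b")
```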
@zjffdu Any updates? If you are still working on this, please check my comments.
Will update this PR.
@maropu Thanks for the review. I updated the PR to address part of your comments; please check my replies inline.
Test build #50345 has finished for PR 9657 at commit
Test build #50347 has finished for PR 9657 at commit
/*
 * Specify the compression codec when saving it on hdfs
 *
 * @since 1.7.0
Not 1.7.0, but 2.0.0.
@rxin @liancheng Could you review this?
 *
 * @since 2.0.0
 */
def compress(codec: Class[_ <: CompressionCodec]): DataFrameWriter = {
Can this just be a normal option?
Also, we shouldn't depend on Hadoop APIs in options, which are a user-facing API. Nobody outside the Hadoop world knows how to use the CompressionCodec API.
Agreed.
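For context, the string-valued option suggested here, with no Hadoop types in the user-facing API, is essentially the shape later Spark releases adopted (the text/csv/json writers in Spark 2.x accept a short codec name). A usage sketch, with a placeholder output path:

```scala
// A plain string option: the user names a codec; no Hadoop
// CompressionCodec class appears in the user-facing API
df.write
  .option("compression", "gzip")
  .text("/tmp/output")
```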
Test build #50364 has finished for PR 9657 at commit
@zjffdu ping
Sorry for the late response; I will update the patch tomorrow.
@zjffdu ping
@zjffdu If you have no time to take this, is it okay if I rework it?
This is resolved by #11384, so could you close this?
…tion when saving