
[SPARK-11691][SQL] Allow to specify compression codec in HadoopFsRelation when saving #9657

Closed
zjffdu wants to merge 8 commits

Conversation

zjffdu
Contributor

@zjffdu zjffdu commented Nov 12, 2015

Allow to specify compression codec in HadoopFsRelation when saving.

@SparkQA

SparkQA commented Nov 12, 2015

Test build #45721 has finished for PR 9657 at commit 09d39e5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 12, 2015

Test build #45728 has finished for PR 9657 at commit 5269f4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu
Contributor Author

zjffdu commented Dec 1, 2015

@Lewuathe Could you help review this? This is a dependency for refactoring CsvRelation to extend HadoopFsRelation: CsvRelation currently supports writing to a compressed format, while HadoopFsRelation does not.

@@ -28,7 +28,7 @@ import org.apache.spark.sql.catalyst.plans.logical.{Project, InsertIntoTable}
import org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils
import org.apache.spark.sql.execution.datasources.{CreateTableUsingAsSelect, ResolvedDataSource}
import org.apache.spark.sql.sources.HadoopFsRelation

import org.apache.hadoop.io.compress.CompressionCodec
Contributor

Third party modules should be put above org.apache.spark.* modules.
see: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports
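
Applied to the hunk above, the style-guide ordering would look roughly like this (a sketch; the surrounding file contains more imports than shown):

```scala
// Third-party (non-Spark) imports go above org.apache.spark.* imports.
import org.apache.hadoop.io.compress.CompressionCodec

import org.apache.spark.sql.catalyst.plans.logical.{InsertIntoTable, Project}
import org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils
import org.apache.spark.sql.execution.datasources.{CreateTableUsingAsSelect, ResolvedDataSource}
import org.apache.spark.sql.sources.HadoopFsRelation
```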

@Lewuathe
Contributor

Lewuathe commented Dec 2, 2015

@zjffdu Do you intend to remove the compression codec from CsvRelation after changing HadoopFsRelation and inheriting from it? Since spark-csv is an external library, it is not reasonable to change Spark according to spark-csv's requirements.

That said, a compression codec option in HadoopFsRelation seems reasonable to me on its own.

@@ -30,7 +30,7 @@ import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.{RunnableCommand, SQLExecution}
import org.apache.spark.sql.sources._
import org.apache.spark.util.Utils

import org.apache.hadoop.io.compress.CompressionCodec
Contributor

Same as DataFrameWriter import.

@zjffdu
Contributor Author

zjffdu commented Dec 2, 2015

@Lewuathe Yes, I'd like CsvRelation to extend HadoopFsRelation after this change as well; since CsvRelation currently supports compression, I want to keep that compatibility.
Although CsvRelation is an external library, CSV is a very commonly used format. If it also extends HadoopFsRelation, then other third-party libraries can use HadoopFsRelation to support many formats. Actually, SPARK-10038 is a case that needs this.

test("compression") {
  val tempDirPath = Files.createTempDir().getAbsolutePath
  val df = sqlContext.read.text(testFile)
  df.show()
Contributor

Was this written for debugging? We can remove the show() call.

Contributor Author

Right, will correct it.
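
A sketch of what the corrected test could look like, with the debug show() removed; the option name `compression.codec` is taken from the compress method added in this PR, and the gzip round-trip assertion is a hypothetical addition:

```scala
import com.google.common.io.Files
import org.apache.hadoop.io.compress.GzipCodec

test("compression") {
  val tempDirPath = Files.createTempDir().getAbsolutePath
  val df = sqlContext.read.text(testFile)
  // Write compressed output using the codec option introduced by this PR.
  df.write
    .option("compression.codec", classOf[GzipCodec].getCanonicalName)
    .text(tempDirPath)
  // Reading the compressed files back should yield the same rows.
  assert(sqlContext.read.text(tempDirPath).collect().toSet === df.collect().toSet)
}
```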

@zjffdu
Contributor Author

zjffdu commented Dec 2, 2015

@Lewuathe Thanks for the review. I pushed another commit to address the comments. I also changed the compression feature's @since version to 1.6.0.

@SparkQA

SparkQA commented Dec 2, 2015

Test build #47034 has finished for PR 9657 at commit 896440a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 2, 2015

Test build #47036 has finished for PR 9657 at commit db31b8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zjffdu
Contributor Author

zjffdu commented Dec 3, 2015

Never mind, I changed it back to 1.7.0 since 1.6 is at RC1.

@SparkQA

SparkQA commented Dec 3, 2015

Test build #47118 has finished for PR 9657 at commit 657ba5a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

 * @since 1.7.0
 */
def compress(codec: Class[_ <: CompressionCodec]): DataFrameWriter = {
  this.extraOptions += ("compression.codec" -> codec.getCanonicalName)
Member

The amount of code changes should be small, so we do not need this additional interface.
If we add an interface for each additional option, the number of interfaces blows up.

Contributor Author

Agreed that we should not add an interface for every configuration, but considering that compression is a very common property, I feel it would be better to keep this interface. We also expose the compression API in RDD.saveAsXX, so I think it would be better to be consistent here for DataFrame.
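
For context, the RDD-side API referred to here accepts the Hadoop codec class directly, for example:

```scala
import org.apache.hadoop.io.compress.GzipCodec

// Existing RDD API: the compression codec class is passed to saveAsTextFile.
sc.parallelize(Seq("a", "b", "c"))
  .saveAsTextFile("/tmp/words-gz", classOf[GzipCodec])
```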

Member

okay

@maropu
Member

maropu commented Jan 29, 2016

@zjffdu Any updates? If you are still working on this, please check my comments.

@zjffdu
Contributor Author

zjffdu commented Jan 29, 2016

Will update this PR

@zjffdu
Contributor Author

zjffdu commented Jan 29, 2016

@maropu Thanks for the review. I updated the PR to address part of your comments. Please check my replies inline.

@SparkQA

SparkQA commented Jan 29, 2016

Test build #50345 has finished for PR 9657 at commit ea70b40.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 29, 2016

Test build #50347 has finished for PR 9657 at commit a8ed421.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
 * Specify the compression codec to use when saving to HDFS.
 *
 * @since 1.7.0
Member

not 1.7.0, but 2.0.0.

@maropu
Member

maropu commented Jan 29, 2016

@rxin @liancheng Could you review this?

*
* @since 2.0.0
*/
def compress(codec: Class[_ <: CompressionCodec]): DataFrameWriter = {
Contributor

Can this just be a normal option?

Contributor

Also we shouldn't depend on Hadoop APIs in options, which is a user facing API. Nobody outside the Hadoop world knows how to use the CompressionCodec API.

Member

Agreed.
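
The option-based direction agreed on here would keep Hadoop classes out of the user-facing API; a rough sketch (the option name is illustrative — a string-valued compression option for these sources landed later, via the PR that superseded this one):

```scala
// String-valued option: users name a codec without touching Hadoop APIs.
df.write
  .option("compression", "gzip")
  .text("/tmp/out")
```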

@SparkQA

SparkQA commented Jan 29, 2016

Test build #50364 has finished for PR 9657 at commit 3645dbc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jan 30, 2016

@zjffdu ping

@maropu
Member

maropu commented Feb 2, 2016

@zjffdu ping

@zjffdu
Contributor Author

zjffdu commented Feb 2, 2016

Sorry for the late response; I will update the patch tomorrow.

@maropu
Member

maropu commented Feb 9, 2016

@zjffdu ping

@maropu
Member

maropu commented Feb 18, 2016

@zjffdu If you have no time to work on this, is it okay if I rework it?

@maropu
Member

maropu commented Feb 29, 2016

This is resolved by #11384, so could you close this?

@zjffdu zjffdu closed this Feb 29, 2016
5 participants