[SPARK-11691][SQL] Allow to specify compression codec in HadoopFsRela… #9657
Conversation
Test build #45721 has finished for PR 9657 at commit
Test build #45728 has finished for PR 9657 at commit
@Lewuathe Could you help review this? This is a dependency issue for refactoring CsvRelation to extend HadoopFsRelation: CsvRelation currently supports writing to a compressed format, while HadoopFsRelation does not.
@@ -28,7 +28,7 @@ import org.apache.spark.sql.catalyst.plans.logical.{Project, InsertIntoTable}
import org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils
import org.apache.spark.sql.execution.datasources.{CreateTableUsingAsSelect, ResolvedDataSource}
import org.apache.spark.sql.sources.HadoopFsRelation

import org.apache.hadoop.io.compress.CompressionCodec
Third-party modules should be put above org.apache.spark.* modules.
See: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports
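For reference, a minimal sketch of the grouping the style guide describes (the file and imports here are illustrative, not the actual PR diff): standard library first, then third-party modules such as Hadoop, then org.apache.spark.*, with blank lines between groups.

```scala
// Illustrative import grouping per the Spark style guide:
import java.io.File                                    // 1. Java/Scala standard library

import org.apache.hadoop.io.compress.CompressionCodec  // 2. Third-party modules

import org.apache.spark.sql.sources.HadoopFsRelation   // 3. org.apache.spark.* last
```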
@zjffdu Do you intend to remove the compression codec from …? In any case, the compression codec option in …
@@ -30,7 +30,7 @@ import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.{RunnableCommand, SQLExecution}
import org.apache.spark.sql.sources._
import org.apache.spark.util.Utils

import org.apache.hadoop.io.compress.CompressionCodec
Same as the DataFrameWriter import.
@Lewuathe Yes, I'd also like to make CsvRelation extend HadoopFsRelation after this change. Since CsvRelation currently supports compression, I want to keep that compatibility.
test("compression") {
  val tempDirPath = Files.createTempDir().getAbsolutePath
  val df = sqlContext.read.text(testFile)
  df.show()
Was this written for debugging? We can remove the show().
Right, will correct it.
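A possible revision of the test with the debug show() dropped, as a sketch only: testFile, the option key, and the count-based check are assumptions drawn from the surrounding diff, not the final PR code.

```scala
// Sketch of the revised test (requires import org.apache.hadoop.io.compress.GzipCodec)
test("compression") {
  val tempDirPath = Files.createTempDir().getAbsolutePath
  val df = sqlContext.read.text(testFile)
  // Write compressed output, then read it back and compare row counts
  // (hypothetical assertion; the actual check in the PR may differ)
  df.write
    .option("compression.codec", classOf[GzipCodec].getCanonicalName)
    .text(tempDirPath)
  assert(sqlContext.read.text(tempDirPath).count() === df.count())
}
```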
@Lewuathe Thanks for the review. I pushed another commit to address the comments. I also changed the compression feature's version annotation to 1.6.0.
Test build #47034 has finished for PR 9657 at commit
Test build #47036 has finished for PR 9657 at commit
Never mind, I changed it back to 1.7.0 since 1.6 is already at rc1.
Test build #47118 has finished for PR 9657 at commit
 * @since 1.7.0
 */
def compress(codec: Class[_ <: CompressionCodec]): DataFrameWriter = {
  this.extraOptions += ("compression.codec" -> codec.getCanonicalName)
The amount of code changes should be small, so we do not need this additional interface.
If we add an interface for each additional option, the number of interfaces blows up.
Agreed that we should not add an interface for every configuration, but since compression is a very common property, I feel it would be better to keep this interface. We also expose a compression API in RDD.saveAsXX, so I think it would be better to be consistent here in DataFrame.
okay
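To make the trade-off above concrete, here is a hedged sketch of the two API shapes under discussion (the output paths and the GzipCodec choice are placeholders, not code from the PR):

```scala
// (a) Dedicated method, as proposed in this PR:
df.write
  .compress(classOf[GzipCodec])
  .text("/tmp/out-a")

// (b) Generic option, which keeps DataFrameWriter's surface small;
//     compress() above is just sugar for setting this key:
df.write
  .option("compression.codec", classOf[GzipCodec].getCanonicalName)
  .text("/tmp/out-b")
```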
@zjffdu Any updates? If you are still working on this, please check my comments.
Will update this PR.
@maropu Thanks for the review. I updated the PR to address part of your comments; please check my replies inline.
Test build #50345 has finished for PR 9657 at commit
Test build #50347 has finished for PR 9657 at commit
/*
 * Specify the compression codec when saving it on hdfs
 *
 * @since 1.7.0
Not 1.7.0, but 2.0.0.
@rxin @liancheng Could you review this?
 *
 * @since 2.0.0
 */
def compress(codec: Class[_ <: CompressionCodec]): DataFrameWriter = {
Can this just be a normal option?
Also, we shouldn't depend on Hadoop APIs in options, which are a user-facing API. Nobody outside the Hadoop world knows how to use the CompressionCodec API.
Agreed.
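For context, the string-valued option suggested here, with no Hadoop types in the user-facing API, is essentially the shape later Spark releases adopted (the text/csv/json writers in Spark 2.x accept a short codec name). A usage sketch, with a placeholder output path:

```scala
// A plain string option: the user names a codec; no Hadoop
// CompressionCodec class appears in the user-facing API
df.write
  .option("compression", "gzip")
  .text("/tmp/output")
```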
Test build #50364 has finished for PR 9657 at commit
@zjffdu ping
Sorry for the late response; I will update the patch tomorrow.
@zjffdu ping
@zjffdu If you have no time to take this, is it okay if I rework it?
This is resolved by #11384, so could you close this?
…tion when saving