
[SPARK-13509][SPARK-13507][SQL] Support for writing CSV with a single function call #11389

Closed
HyukjinKwon wants to merge 9 commits into apache:master from HyukjinKwon:SPARK-13507-13509

Conversation

HyukjinKwon
Member

https://issues.apache.org/jira/browse/SPARK-13507
https://issues.apache.org/jira/browse/SPARK-13509

What changes were proposed in this pull request?

This PR adds support for writing CSV data directly to a given path with a single function call.

Several unit tests were added for each piece of functionality.
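
For readers skimming the thread, this is roughly what the new single-call write path looks like. This sketch is not part of the PR description itself; it assumes spark-shell, where `sqlContext` is predefined, and the paths and column name are illustrative:

```scala
// Minimal sketch of the single-call CSV write added by this PR.
// Assumes spark-shell, where `sqlContext` is already defined.
val df = sqlContext.range(10).toDF("id")

// New in this PR: write CSV directly with one call.
df.write.csv("/tmp/spark-13509-csv")

// Equivalent to the pre-existing generic form:
df.write.format("csv").save("/tmp/spark-13509-csv-generic")
```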

How was this patch tested?

This was tested with unit tests and with `dev/run_tests` for coding style.

@HyukjinKwon
Member Author

@rxin I opened this PR because it looks like writing csv() should be added anyway.
If I got the documentation changes here wrong, I will move them back.

@SparkQA

SparkQA commented Feb 26, 2016

Test build #52042 has finished for PR 11389 at commit a97a0a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -464,6 +464,12 @@ final class DataFrameWriter private[sql](df: DataFrame) {
* format("parquet").save(path)
* }}}
*
* You can set the following JSON-specific options for writing JSON files:
Member

This looks like it's in the wrong place?

@SparkQA

SparkQA commented Feb 26, 2016

Test build #52047 has finished for PR 11389 at commit f82a2f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -453,6 +453,12 @@ final class DataFrameWriter private[sql](df: DataFrame) {
* format("json").save(path)
* }}}
*
* You can set the following JSON-specific options for writing JSON files:
* <li>`compression` or `codec` (default `null`): compression codec to use when saving to file.
Contributor

just say compression, and don't mention codec.

Contributor

actually i'd remove codec support from the underlying source code, and only keep it for csv as an undocumented option for backward compatibility.
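
For context, a hedged sketch of what passing the documented compression option to the new CSV writer would look like. The option name follows the doc line under review above; `"gzip"` is just one illustrative codec value and is not specified in this thread:

```scala
// Sketch: writing CSV with the documented `compression` option.
// Assumes an existing DataFrame `df` (as in the earlier sketch).
// "gzip" is an illustrative codec name; the exact set of supported values
// comes from the underlying compression codec helper, not this thread.
df.write
  .option("compression", "gzip")
  .csv("/tmp/spark-13509-csv-gzip")
```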

@HyukjinKwon
Member Author

@rxin Actually, do you think we need the compression option for Parquet and ORC as well? (I am not going to deal with them in this PR even if we do.)

@rxin
Contributor

rxin commented Feb 29, 2016

It'd be great to fix that in a future PR.

For this one, let's also fix Python?

@HyukjinKwon
Member Author

@rxin Sure.

@SparkQA

SparkQA commented Feb 29, 2016

Test build #52151 has finished for PR 11389 at commit 9fe8fca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Feb 29, 2016

LGTM pending tests

@SparkQA

SparkQA commented Feb 29, 2016

Test build #52158 has finished for PR 11389 at commit 8fbad40.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Feb 29, 2016

Test build #52154 has finished for PR 11389 at commit 8efb0e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Hm... it looks a bit weird. In ParquetHadoopFsRelationSuite, the test about types (test all data types - ByteType) keeps failing. This also happens sometimes on other PRs I made.

I was going to submit a hot-fix, but I found it actually works fine locally.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Feb 29, 2016

Test build #52160 has finished for PR 11389 at commit 9ca920b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

As this test passes sometimes (e.g. in #11016), I will restart the build.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Feb 29, 2016

Test build #52161 has finished for PR 11389 at commit 9ca920b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 29, 2016

Test build #52163 has finished for PR 11389 at commit 9ca920b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 29, 2016

Test build #52167 has finished for PR 11389 at commit 9ca920b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 29, 2016

Test build #52175 has finished for PR 11389 at commit fea9df8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 29, 2016

Test build #52176 has finished for PR 11389 at commit cec8442.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

I see, that's a problem in the new vectorized reader. I missed the exception message. Looking into it more deeply.

@HyukjinKwon
Member Author

@rxin Anyway, would you merge this if it looks good?

@rxin
Contributor

rxin commented Feb 29, 2016

Thanks - merging this in master.

asfgit closed this in 02aa499 on Feb 29, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request on Mar 22, 2016
[SPARK-13509][SPARK-13507][SQL] Support for writing CSV with a single function call

https://issues.apache.org/jira/browse/SPARK-13507
https://issues.apache.org/jira/browse/SPARK-13509

## What changes were proposed in this pull request?
This PR adds support for writing CSV data directly to a given path with a single function call.

Several unit tests were added for each piece of functionality.
## How was this patch tested?

This was tested with unittests and with `dev/run_tests` for coding style

Author: hyukjinkwon <[email protected]>
Author: Hyukjin Kwon <[email protected]>

Closes apache#11389 from HyukjinKwon/SPARK-13507-13509.
HyukjinKwon deleted the SPARK-13507-13509 branch on September 23, 2016 at 18:28