
[SPARK-13108][SQL] Support for ascii compatible encodings at CSV data source #11016

Closed

Conversation

HyukjinKwon
Member

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13108

The CSV data source currently does not support non-ascii compatible encodings.

I tested this on Mac OS. I converted cars_iso-8859-1.csv into cars_utf-16.csv as below:

iconv -f iso-8859-1 -t utf-16 < cars_iso-8859-1.csv > cars_utf-16.csv

and ran the code below:

val cars = "cars_utf-16.csv"
sqlContext.read
  .format("csv")
  .option("charset", "utf-16")
  .option("delimiter", "þ")
  .load(cars)
  .show()

This produces the wrong results below:

+----+-----+-----+--------------------+------+
|year| make|model|             comment|blank�|
+----+-----+-----+--------------------+------+
|2012|Tesla|    S|          No comment|     �|
|   �| null| null|                null|  null|
|1997| Ford| E350|Go get one now th...|     �|
|2015|Chevy|Volt�|                null|  null|
|   �| null| null|                null|  null|
+----+-----+-----+--------------------+------+

(For more details, please check the Jira issue ticket).
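The corruption above is consistent with byte-oriented line splitting. As a minimal sketch (plain JVM code, not part of this patch): Hadoop's TextInputFormat scans raw bytes for 0x0A (`\n`), which only works when the charset encodes `\n` as exactly that single byte. UTF-16 does not:

```scala
import java.nio.charset.Charset

// Sketch: why byte-oriented line splitting breaks for UTF-16.
// Java's "UTF-16" charset writes a byte-order mark plus two bytes
// per character, so the newline is never the lone byte 0x0A.
val newlineUtf16 = "\n".getBytes(Charset.forName("UTF-16"))
val newlineAscii = "\n".getBytes(Charset.forName("US-ASCII"))

println(newlineUtf16.map(b => f"$b%02x").mkString(" ")) // fe ff 00 0a
println(newlineAscii.map(b => f"$b%02x").mkString(" ")) // 0a
```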

So, this PR adds support for these encodings to the CSV data source.
Unit tests were added to CSVSuite as an end-to-end test.

How was this patch tested?

This was tested with unit tests and with dev/run_tests for coding style.

@SparkQA

SparkQA commented Feb 2, 2016

Test build #50533 has finished for PR 11016 at commit 9f3735c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

cc @rxin

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Feb 4, 2016

Test build #50739 has finished for PR 11016 at commit 00fe23c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2016

Test build #50741 has finished for PR 11016 at commit 00fe23c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@falaki
Contributor

falaki commented Feb 24, 2016

I think the approach looks good if this were only an issue for CSV. But if you want to fix it in a more general way, it makes more sense to move EncodingTextInputFormat to a general location (with its own unit tests) and then use it everywhere: JSON, TEXT, CSV.

@HyukjinKwon
Member Author

@falaki Thanks. Then I will try to generalize this and update the title as well with some more commits.

@HyukjinKwon
Member Author

@falaki It looks like the JSON and TEXT data sources don't support the encoding option. May I do this in another PR (or a follow-up), so that both data sources support the option with a generalized EncodingTextInputFormat?

@SparkQA

SparkQA commented Feb 25, 2016

Test build #51919 has finished for PR 11016 at commit 4fa2ed3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 29, 2016

Test build #52153 has finished for PR 11016 at commit a6f6023.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Feb 29, 2016

Test build #52157 has finished for PR 11016 at commit a6f6023.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-13108][SQL] Validate ascii compatible encodings [SPARK-13108][SQL] Support for ascii compatible encodings at CSV data source Mar 3, 2016
@SparkQA

SparkQA commented Mar 8, 2016

Test build #52635 has finished for PR 11016 at commit 264a1dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

I just found a similar issue to this one, SPARK-1849.

I think we might have to not support non-ascii compatible encodings. This PR will support encodings in general, but I cannot guarantee it supports all of them; some encodings write a BOM-like header, for instance.

Since Spark CSV already supports the encoding option, I cannot come up with more than the three options below:

  • The CSV data source alone supports some encodings, for backward compatibility, but excludes non-ascii compatible encodings and throws an exception when the encoding is not ascii compatible.
  • The CSV data source supports other encodings in this way, with documentation mentioning that not all encodings are guaranteed.
  • Support all encodings and add tests for all of them (maybe with the encoding list in Java).
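For the first option, a hedged sketch of one possible check (the predicate name and approach are illustrative, not from this patch): a charset can be treated as ascii compatible for line splitting when it encodes `\n` as the single byte 0x0A, which is what Hadoop's line splitting relies on.

```scala
import java.nio.charset.Charset

// Illustrative predicate (not from this patch): a charset is treated
// as ascii compatible when '\n' encodes to the single byte 0x0A.
def isAsciiCompatible(charsetName: String): Boolean = {
  val bytes = "\n".getBytes(Charset.forName(charsetName))
  bytes.length == 1 && bytes(0) == 0x0a.toByte
}

println(isAsciiCompatible("UTF-8"))      // true
println(isAsciiCompatible("ISO-8859-1")) // true
println(isAsciiCompatible("UTF-16"))     // false
```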

@srowen Would you maybe give some feedback please?

@HyukjinKwon
Member Author

@falaki @rxin @srowen Sorry, this will not work for files written on Windows (\r\n). I am closing this. If you intend to just block non-ascii compatible encodings, I will create a new PR or reopen this.
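For context, a sketch of why \r\n is a problem for a single fixed record delimiter (plain JVM code, not part of this patch): under UTF-16LE each of `\r` and `\n` carries an extra zero byte, so a reader configured with only the encoded `\n` would leave a two-byte encoded `\r` at the end of every Windows-written record, which a single-byte trailing-\r strip cannot remove.

```scala
import java.nio.charset.Charset

// Sketch: record delimiters under UTF-16LE (not part of this patch).
val utf16le = Charset.forName("UTF-16LE")
println("\n".getBytes(utf16le).map(b => f"$b%02x").mkString(" "))   // 0a 00
println("\r\n".getBytes(utf16le).map(b => f"$b%02x").mkString(" ")) // 0d 00 0a 00
```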

@HyukjinKwon HyukjinKwon closed this Apr 1, 2016
@HyukjinKwon HyukjinKwon deleted the SPARK-13108-non-ascii-encodings branch October 1, 2016 06:42