
[SPARK-13108][SQL] Support for ascii compatible encodings at CSV data source #11016

Closed

Conversation

HyukjinKwon
Member

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13108

The CSV data source currently does not support non-ascii compatible encodings.

I tested this on Mac OS. I converted cars_iso-8859-1.csv into cars_utf-16.csv as below:

iconv -f iso-8859-1 -t utf-16 < cars_iso-8859-1.csv > cars_utf-16.csv

and ran the code below:

val cars = "cars_utf-16.csv"
sqlContext.read
  .format("csv")
  .option("charset", "utf-16")
  .option("delimiter", "þ")
  .load(cars)
  .show()

This produces the wrong results below:

+----+-----+-----+--------------------+------+
|year| make|model|             comment|blank�|
+----+-----+-----+--------------------+------+
|2012|Tesla|    S|          No comment|     �|
|   �| null| null|                null|  null|
|1997| Ford| E350|Go get one now th...|     �|
|2015|Chevy|Volt�|                null|  null|
|   �| null| null|                null|  null|
+----+-----+-----+--------------------+------+

(For more details, please check the Jira issue ticket).
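The corruption above is consistent with byte-oriented line splitting. As a minimal sketch (plain JVM code, not part of this patch): Hadoop's TextInputFormat scans raw bytes for 0x0A (`\n`), which only works when the charset encodes `\n` as exactly that single byte. UTF-16 does not:

```scala
import java.nio.charset.Charset

// Sketch: why byte-oriented line splitting breaks for UTF-16.
// Java's "UTF-16" charset writes a byte-order mark plus two bytes
// per character, so the newline is never the lone byte 0x0A.
val newlineUtf16 = "\n".getBytes(Charset.forName("UTF-16"))
val newlineAscii = "\n".getBytes(Charset.forName("US-ASCII"))

println(newlineUtf16.map(b => f"$b%02x").mkString(" ")) // fe ff 00 0a
println(newlineAscii.map(b => f"$b%02x").mkString(" ")) // 0a
```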

So, this PR adds support for these encodings to the CSV data source.
Unit tests were added to CSVSuite as an end-to-end test.

How was this patch tested?

This was tested with unit tests and with dev/run_tests for coding style.

@SparkQA

SparkQA commented Feb 2, 2016

Test build #50533 has finished for PR 11016 at commit 9f3735c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

cc @rxin

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Feb 4, 2016

Test build #50739 has finished for PR 11016 at commit 00fe23c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2016

Test build #50741 has finished for PR 11016 at commit 00fe23c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@falaki
Contributor

falaki commented Feb 24, 2016

I think the approach looks good if this were only an issue for CSV. But if you want to fix it in a more general way, it makes more sense to move EncodingTextInputFormat to a general location (with its own unit tests) and then use it everywhere: JSON, TEXT, CSV.

@HyukjinKwon
Member Author

@falaki Thanks. Then I will try to generalize this and update the title as well with some more commits.

@HyukjinKwon
Member Author

@falaki It looks like the JSON and TEXT data sources don't support the encoding option. May I do this in another PR (or a follow-up), so that both data sources support the option with a generalized EncodingTextInputFormat?

@SparkQA

SparkQA commented Feb 25, 2016

Test build #51919 has finished for PR 11016 at commit 4fa2ed3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 29, 2016

Test build #52153 has finished for PR 11016 at commit a6f6023.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Feb 29, 2016

Test build #52157 has finished for PR 11016 at commit a6f6023.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-13108][SQL] Validate ascii compatible encodings [SPARK-13108][SQL] Support for ascii compatible encodings at CSV data source Mar 3, 2016
@SparkQA

SparkQA commented Mar 8, 2016

Test build #52635 has finished for PR 11016 at commit 264a1dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

I just found a similar issue to this one, SPARK-1849.

I think we might have to not support non-ascii compatible encodings. This PR will support encodings in general, but I cannot guarantee it supports all of them; some encodings write a BOM-like header, for instance.

Since Spark CSV already supports the encoding option, I cannot come up with more than the three options below:

  • The CSV data source alone supports some encodings, for backward compatibility, but excludes non-ascii compatible encodings and throws an exception when the encoding is not ascii compatible.
  • The CSV data source supports other encodings in this way, with documentation mentioning that not all encodings are guaranteed.
  • Support all encodings and add tests for all of them (maybe with the encoding list in Java).
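For the first option, a hedged sketch of one possible check (the predicate name and approach are illustrative, not from this patch): a charset can be treated as ascii compatible for line splitting when it encodes `\n` as the single byte 0x0A, which is what Hadoop's line splitting relies on.

```scala
import java.nio.charset.Charset

// Illustrative predicate (not from this patch): a charset is treated
// as ascii compatible when '\n' encodes to the single byte 0x0A.
def isAsciiCompatible(charsetName: String): Boolean = {
  val bytes = "\n".getBytes(Charset.forName(charsetName))
  bytes.length == 1 && bytes(0) == 0x0a.toByte
}

println(isAsciiCompatible("UTF-8"))      // true
println(isAsciiCompatible("ISO-8859-1")) // true
println(isAsciiCompatible("UTF-16"))     // false
```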

@srowen Would you maybe give some feedback please?

@HyukjinKwon
Member Author

@falaki @rxin @srowen Sorry, this will not work for files written on Windows (\r\n). I am closing this. If you intend to just block non-ascii compatible encodings, I will create a new PR or reopen this.
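For context, a sketch of why \r\n is a problem for a single fixed record delimiter (plain JVM code, not part of this patch): under UTF-16LE each of `\r` and `\n` carries an extra zero byte, so a reader configured with only the encoded `\n` would leave a two-byte encoded `\r` at the end of every Windows-written record, which a single-byte trailing-\r strip cannot remove.

```scala
import java.nio.charset.Charset

// Sketch: record delimiters under UTF-16LE (not part of this patch).
val utf16le = Charset.forName("UTF-16LE")
println("\n".getBytes(utf16le).map(b => f"$b%02x").mkString(" "))   // 0a 00
println("\r\n".getBytes(utf16le).map(b => f"$b%02x").mkString(" ")) // 0d 00 0a 00
```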

@HyukjinKwon HyukjinKwon closed this Apr 1, 2016
@HyukjinKwon HyukjinKwon deleted the SPARK-13108-non-ascii-encodings branch October 1, 2016 06:42