[SPARK-23577][SQL] Supports custom line separator for text datasource #20727
Conversation
@@ -400,6 +400,10 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
 * <ul>
 * <li>`maxFilesPerTrigger` (default: no max limit): sets the maximum number of new files to be
 * considered in every trigger.</li>
 * <li>`wholetext` (default `false`): If true, read a file as a single row and not split by "\n".
I also added some changes missed for `wholetext` too while I am here.
Test build #87929 has finished for PR 20727 at commit
@@ -39,9 +39,12 @@ private[text] class TextOptions(@transient private val parameters: CaseInsensiti
 */
val wholeText = parameters.getOrElse(WHOLETEXT, "false").toBoolean

val lineSeparator: String = parameters.getOrElse(LINE_SEPARATOR, "\n")
require(lineSeparator.nonEmpty, s"'$LINE_SEPARATOR' cannot be an empty string.")
This `require` looks redundant after a `getOrElse` call.
Nope, it's `nonEmpty` on `String`.
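A quick plain-Scala check of the point above (a sketch with an illustrative helper, not the actual `TextOptions` code): `getOrElse` only supplies the default when the key is absent, so an explicitly empty user value still reaches the `require`.

```scala
object RequireCheck {
  // Mirrors the pattern under discussion: default via getOrElse,
  // then a non-emptiness check on the resulting String.
  def lineSeparator(params: Map[String, String]): String = {
    val sep = params.getOrElse("lineSep", "\n")
    require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
    sep
  }

  def main(args: Array[String]): Unit = {
    assert(lineSeparator(Map.empty) == "\n") // default applies, require passes
    // An explicitly empty value bypasses the default and trips the require:
    val failed =
      try { lineSeparator(Map("lineSep" -> "")); false }
      catch { case _: IllegalArgumentException => true }
    assert(failed)
    println("ok")
  }
}
```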
looks good, also cc @MaxGekk
*/
class HadoopFileLinesReader(
    file: PartitionedFile, conf: Configuration) extends Iterator[Text] with Closeable {
    file: PartitionedFile,
    lineSeparator: String,
A constructor of `LineRecordReader` has the same param, but its type is an array of bytes. I would propose to expose the same type here: `Array[Byte]`. Also, you handle a special value, `'\n'`. If you define the param as `lineSeparator: Option[Array[Byte]]`, `None` would indicate the default behavior: `'\n'`, `'\r'` or `'\r\n'`.
I would suggest using `String` here and controlling the encoding stuff in one place. For the option thing, see #20727 (comment). Will bring it back.
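As a sketch of the trade-off being discussed (illustrative names only, not the actual Spark internals): a string-typed option can be normalized to bytes once, in one place, with `None` standing for the default newline behavior.

```scala
// Hypothetical helper, assuming the separator arrives as a Unicode string
// from the datasource options; None means "use the default \n/\r/\r\n rule".
object LineSepSketch {
  def toDelimiterBytes(lineSeparator: Option[String]): Option[Array[Byte]] =
    lineSeparator.map(_.getBytes("UTF-8"))

  def main(args: Array[String]): Unit = {
    assert(toDelimiterBytes(None).isEmpty)                      // default behavior
    assert(new String(toDelimiterBytes(Some("::")).get, "UTF-8") == "::")
    println("ok")
  }
}
```

This keeps the charset decision at a single call site, which is the "control the encoding stuff in one place" argument above.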
@@ -42,7 +52,12 @@ class HadoopFileLinesReader(
      Array.empty)
    val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
    val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId)
    val reader = new LineRecordReader()
    val reader = if (lineSeparator != "\n") {
If it has type `Option[Array[Byte]]`:

```scala
val reader = lineSeparator match {
  case Some(sep) => new LineRecordReader(sep)
  case _ => new LineRecordReader()
}
```
Ditto for the option thing. Will bring it back.
@@ -42,7 +52,12 @@ class HadoopFileLinesReader(
      Array.empty)
    val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
    val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId)
    val reader = new LineRecordReader()
    val reader = if (lineSeparator != "\n") {
      new LineRecordReader(lineSeparator.getBytes("UTF-8"))
It would be better not to depend on a particular charset.
Would you have a suggestion?
My suggestion is to pass `Array[Byte]` into the class. If charsets other than UTF-8 are supported in the future, this place will have to change for sure. You can make this class more tolerant to input charsets right now. Just as an example, the JSON reader (the Jackson JSON parser) is able to read JSON in any standard charset. To fix its per-line mode, we would need to support `lineSep` in any charset and convert `lineSep` to an array of bytes before using the class. If you restrict the charset of `lineSep` to UTF-8, you just put up a wall for other datasources.
I mean, it's initially a Unicode string via the datasource interface, and we need to somehow convert it to bytes once, as the reader takes bytes. Do you mean adding another option for specifying the charset, or did I maybe miss something?
Why do you think this class is responsible for converting the string separator to an array of bytes? In particular, the restriction to one charset is not clear. The purpose of the class is to provide the `Iterator` interface of records/lines to datasources, and this class doesn't have to know about the datasource's charset. I would not stick to a particular charset here, and would expose the separator parameter as `Option[Array[Byte]]`, like `LineReader` provides a constructor with `byte[] recordDelimiter`.
OK. Let me try to address this one.
val reader = if (lineSeparator != "\n") {
  new LineRecordReader(lineSeparator.getBytes("UTF-8"))
} else {
  // This behavior follows Hive. `\n` covers `\r`, `\r\n` and `\n`.
Having the `lineSeparator = '\n'` case also cover `'\r'` and `'\r\n'` does not look so good.
+1. It seems cleaner to use `None` as default, which indicates `\r`, `\r\n` and `\n`.
I initially did this but reverted it back - #18581 (comment). Let me bring the option back.
file: PartitionedFile, conf: Configuration) extends Iterator[Text] with Closeable {
file: PartitionedFile,
lineSeparator: String,
conf: Configuration) extends Iterator[Text] with Closeable {
I cannot predict where the class could be used, but I believe we should keep source backward compatibility. I mean, it would be better to add the new parameter after `conf: Configuration`, with a default value that preserves the old behavior.
That's preserved by `def this(file: PartitionedFile, conf: Configuration) = this(file, "\n", conf)`, I believe. The `execution` package is meant to be an internal package anyway.
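A minimal sketch of the compatibility point being made, with stand-in types rather than the real `PartitionedFile`/`Configuration`: an auxiliary constructor keeps the old two-argument signature compiling for existing callers.

```scala
// Stand-in types; the real class takes PartitionedFile and Configuration.
final case class FileStub(path: String)
final case class ConfStub(entries: Map[String, String])

class LinesReaderSketch(
    val file: FileStub,
    val lineSeparator: String,
    val conf: ConfStub) {
  // Auxiliary constructor preserving the original (file, conf) signature.
  def this(file: FileStub, conf: ConfStub) = this(file, "\n", conf)
}

object LinesReaderSketch {
  def main(args: Array[String]): Unit = {
    val old = new LinesReaderSketch(FileStub("part-0"), ConfStub(Map.empty))
    assert(old.lineSeparator == "\n") // old callers keep the old behavior
    println("ok")
  }
}
```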
}
}

Seq("|", "^", "::", "!!!@3").foreach { lineSep =>
Please check invisible and control chars as well.
Sure, sounds like a good idea.
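A sketch of what the extended test matrix could look like, using a plain-Scala split/join round trip as a stand-in for the datasource round trip; `\u0001` is an assumed example of a control character, not one from the actual test.

```scala
import java.util.regex.Pattern

object SeparatorMatrixSketch {
  // Visible separators from the test above, plus a control character.
  val separators: Seq[String] = Seq("|", "^", "::", "!!!@3", "\u0001")

  // Join records with the separator, then split them back out.
  def roundTrip(lines: Seq[String], sep: String): Seq[String] =
    lines.mkString(sep).split(Pattern.quote(sep), -1).toSeq

  def main(args: Array[String]): Unit = {
    separators.foreach { sep =>
      assert(roundTrip(Seq("a", "b", "c"), sep) == Seq("a", "b", "c"))
    }
    println("ok")
  }
}
```

`Pattern.quote` matters here because `String.split` takes a regex, and separators like `|` or `^` are regex metacharacters.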
}

private[text] object TextOptions {
  val COMPRESSION = "compression"
  val WHOLETEXT = "wholetext"
  val LINE_SEPARATOR = "lineSep"
Why is it not "lineSeparator"? I would propose another name for the option: `recordSeparator`. Imagine you have the text file:

```
id: 123
cmd: ls -l
---
id: 456
cmd: rm -rf
```

where the separator is `---`. If the separator is not a newline delimiter, the records don't look like lines, and `recordSeparator` would be closer to Hadoop's terminology. Besides that, we will probably introduce a similar option for other datasources like JSON; `recordSeparator` for JSON records (not lines) sounds better from my point of view.
+1
It was taken after the Univocity parser's `setLineSeparator`. I think Hadoop calls it `textinputformat.record.delimiter`, but the class name is `LineRecordReader` or `LineReader`.
The name doesn't matter as much as having the same name for the option across all supported datasources - text, CSV and JSON - so as not to confuse users.
To be clear, the name "lineSep" is taken after Python, R's `sep` and our supported option `sep`. Here was another discussion - #18581 (comment). "Line" is taken after the Univocity CSV parser and seems to make sense in Hadoop too. I think we documented that JSON is in lines, and I think it makes sense for CSV and text as well.
One example might sound counterintuitive to you, but it looks less consistent with the other places I usually refer to.
I don't really care what we are going to call it. I think it is important that we are consistent across datasources. IMO, since we can define anything to be a separator, not only newlines, `records` seems to fit a bit better.
The definition of records was introduced in the classical paper: https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf . I don't see any reason to replace it with "line" right now.
My reason is to refer to other places so that, in practice, other users feel comfortable, which I usually put more importance on. I really don't want to spend time researching why the other references used the term "line".
If we think about plain text, CSV or JSON, the term "line" can be correct in a way. We documented http://jsonlines.org/ (even this reference uses the term "line"). I think, for example, a line can be defined by its separator.
spark/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLParserSuite.scala, line 1645 in c36fecc: `|LINES TERMINATED BY '\n' |`
We already use the term "line" everywhere in the docs. We could just say that lines are separated by a character, and minimise the doc fix, etc.
Test build #88001 has finished for PR 20727 at commit

Test build #88002 has finished for PR 20727 at commit
file: PartitionedFile, conf: Configuration) extends Iterator[Text] with Closeable {
file: PartitionedFile,
lineSeparator: Option[String],
conf: Configuration) extends Iterator[Text] with Closeable {
Note that it's an internal API for datasources, and Hadoop's `Text` already has an assumption of UTF-8. I don't think we should call `getBytes` with UTF-8 at each caller side.
Some methods of Hadoop's `Text` have such an assumption about UTF-8 encoding. In general, a datasource could eliminate the restriction by using the `Text` class as a container of raw bytes and calling methods like `getBytes()` and `getLength()`.
Yup, I am sorry if I wasn't clear. I mean the doc describes: "This class stores text using standard UTF8 encoding." I was wondering if that's an official way to use `Text` without the restriction, because that sounds like an informal workaround.
Test build #88158 has finished for PR 20727 at commit

Force-pushed from 97a8422 to d6e9160.

Test build #88159 has finished for PR 20727 at commit
@cloud-fan and @MaxGekk, I believe this is ready for another look.
LGTM besides the option name.
retest this please

Test build #88408 has finished for PR 20727 at commit

retest this please

Test build #88415 has finished for PR 20727 at commit
*
* @param file A part (i.e. "block") of a single file that should be read line by line.
* @param lineSeparator A line separator that should be used for each line. If the value is `None`,
*                      it covers `\r`, `\r\n` and `\n`.
We should mention that this default rule is not defined by us, but by hadoop.
Sure.
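The resolution of this thread (an `Option` parameter whose `None` default documents Hadoop's `\r`/`\r\n`/`\n` rule) can be sketched with stubbed readers, showing only the dispatch; the real code constructs Hadoop's `LineRecordReader` instead of these stand-ins.

```scala
// Stub readers; Hadoop's no-arg LineRecordReader splits on \r, \r\n and \n
// by its own definition (that default rule comes from Hadoop, not Spark).
sealed trait ReaderSketch
final case class CustomDelimiterReader(delim: Array[Byte]) extends ReaderSketch
case object DefaultDelimiterReader extends ReaderSketch

object ReaderDispatch {
  def pick(lineSeparator: Option[Array[Byte]]): ReaderSketch =
    lineSeparator match {
      case Some(sep) => CustomDelimiterReader(sep) // explicit custom delimiter
      case None      => DefaultDelimiterReader     // Hadoop's default rule
    }

  def main(args: Array[String]): Unit = {
    assert(pick(None) == DefaultDelimiterReader)
    assert(pick(Some("::".getBytes("UTF-8"))).isInstanceOf[CustomDelimiterReader])
    println("ok")
  }
}
```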
LGTM

Test build #88451 has finished for PR 20727 at commit

retest this please

Test build #88460 has finished for PR 20727 at commit

retest this please

Test build #88467 has finished for PR 20727 at commit
thanks, merging to master!

Thanks for reviewing/merging this @cloud-fan, @MaxGekk and @hvanhovell.
## What changes were proposed in this pull request?

This PR proposes to add a `lineSep` option for a configurable line separator in the text datasource. It supports this option by using `LineRecordReader`'s functionality, passing the separator to the constructor. The approach is similar to #20727; however, one main difference is that it uses the text datasource's `lineSep` option to parse line by line in JSON's schema inference.

## How was this patch tested?

Manually tested and unit tests were added.

Author: hyukjinkwon <[email protected]>
Author: hyukjinkwon <[email protected]>

Closes #20877 from HyukjinKwon/linesep-json.
What changes were proposed in this pull request?

This PR proposes to add a `lineSep` option for a configurable line separator in the text datasource. It supports this option by using `LineRecordReader`'s functionality, passing the separator to the constructor.

How was this patch tested?

Manual tests and unit tests were added.
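As an end-to-end illustration of what the `lineSep` option enables, here is a plain-Scala stand-in (no SparkSession; file name and helper are illustrative) that treats each chunk between custom separators as one row, the way the text datasource does after this change:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import java.util.regex.Pattern

object LineSepDemo {
  // Rough stand-in for spark.read.option("lineSep", sep).text(path):
  // returns one "row" per separator-delimited record in the file.
  def readText(path: Path, lineSep: String): Seq[String] = {
    val content = new String(Files.readAllBytes(path), StandardCharsets.UTF_8)
    content.split(Pattern.quote(lineSep), -1).toSeq
  }

  def main(args: Array[String]): Unit = {
    val tmp = Files.createTempFile("linesep-demo", ".txt")
    Files.write(tmp, "a^b^c".getBytes(StandardCharsets.UTF_8))
    assert(readText(tmp, "^") == Seq("a", "b", "c"))
    Files.deleteIfExists(tmp)
    println("ok")
  }
}
```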