
[SPARK-28366][CORE] Logging in driver when loading single large unsplittable file #25134

Closed
wants to merge 13 commits

Conversation

@WeichenXu123 (Contributor) commented Jul 12, 2019

What changes were proposed in this pull request?

Log a warning in the driver when loading a single large unsplittable file via `sc.textFile` or the CSV/JSON datasources.
The warning is currently triggered when all of the following hold (see the sketch after this list):

  • only one partition is generated
  • the file is unsplittable, possibly because:
    • it is compressed with an unsplittable compression codec such as gzip
    • multiLine mode is enabled in the CSV/JSON datasource
    • wholeText mode is enabled in the text datasource
  • the file size exceeds the config threshold spark.io.warning.largeFileThreshold (default: 1 GB)
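
A minimal sketch of the trigger logic, assembled from the HadoopRDD excerpts quoted in the review below (names such as `inputSplits`, `jobConf`, and `IO_WARNING_LARGEFILETHRESHOLD` are taken from those excerpts; treat this as an illustration, not the verbatim merged code):

    // Inside HadoopRDD.getPartitions (sketch): warn when the scan degenerates
    // to a single FileSplit larger than the configured threshold.
    if (inputSplits.length == 1 && inputSplits(0).isInstanceOf[FileSplit]) {
      val fileSplit = inputSplits(0).asInstanceOf[FileSplit]
      val path = fileSplit.getPath
      if (fileSplit.getLength > conf.get(IO_WARNING_LARGEFILETHRESHOLD)) {
        val codecFactory = new CompressionCodecFactory(jobConf)
        if (Utils.isFileSplittable(path, codecFactory)) {
          // Splittable, yet still one partition: suggest raising minPartitions.
          logWarning(s"Loading one large file $path with only one partition, " +
            "we can increase partition numbers by the `minPartitions` argument " +
            "in method `sc.textFile`")
        } else {
          // Unsplittable: only one partition is possible, so explain why.
          logWarning(s"Loading one large unsplittable file $path with only one " +
            "partition, because the file is compressed by unsplittable " +
            "compression codec")
        }
      }
    }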

How was this patch tested?

Manually tested. Generate a gzip file exceeding 1 GB:

base64 -b 50 /dev/urandom | head -c 2000000000 > file1.txt
cat file1.txt | gzip > file1.gz
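(Note: `-b 50` is the BSD/macOS wrap flag for base64; with GNU coreutils on Linux the equivalent is `base64 -w 50`.)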

Then launch spark-shell and run:

sc.textFile("file:///path/to/file1.gz").count()

This prints a warning like:

WARN HadoopRDD: Loading one large unsplittable file file:/.../f1.gz with only one partition, because the file is compressed by unsplittable compression codec

Run:

sc.textFile("file:///path/to/file1.txt").count()

This prints a warning like:

WARN HadoopRDD: Loading one large file file:/.../f1.gz with only one partition, we can increase partition numbers by the `minPartitions` argument in method `sc.textFile`

Run:

spark.read.csv("file:///path/to/file1.gz").count

This prints a warning like:

WARN CSVScan: Loading one large unsplittable file file:/.../f1.gz with only one partition, the reason is: the file is compressed by unsplittable compression codec

Run:

spark.read.option("multiLine", true).csv("file:///path/to/file1.gz").count

This prints a warning like:

WARN CSVScan: Loading one large unsplittable file file:/.../f1.gz with only one partition, the reason is: the csv datasource is set multiLine mode

The JSON and text datasources were also tested with similar cases.



@SparkQA commented Jul 12, 2019

Test build #107594 has finished for PR 25134 at commit 34a9a25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a comment:

I think the problem is that it's pretty commonly known that an unsplittable codec can't be processed in parallel.

Spark itself writes multiple files, one per partition, so arguably Spark users won't hit this issue often. So this logging mostly applies when we read a big file from an external source.

If we add the logging here, we should add the warning in other places too, for instance for the multiLine option of CSV and JSON.

@WeichenXu123 (Contributor, Author)

@HyukjinKwon Yeah, but some users complain that loading a large unsplittable file without any logging is confusing...

@SparkQA commented Jul 16, 2019

Test build #107749 has finished for PR 25134 at commit b48ced1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 16, 2019

Test build #107751 has finished for PR 25134 at commit 4ee25d6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 (Contributor, Author)

@HyukjinKwon I have now handled all the cases in which the file is unsplittable.

@WeichenXu123 changed the title from "[SPARK-28366][CORE] Logging in driver when loading single large unsplittable file via sc.textFile" to "[SPARK-28366][CORE] Logging in driver when loading single large unsplittable file" on Jul 17, 2019
@SparkQA commented Jul 17, 2019

Test build #107768 has finished for PR 25134 at commit 736587b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 18, 2019

Test build #107812 has finished for PR 25134 at commit 3da440b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (inputSplits.length == 1 && inputSplits(0).isInstanceOf[FileSplit]) {
  val fileSplit = inputSplits(0).asInstanceOf[FileSplit]
  val path = fileSplit.getPath
  if (Utils.isFileSplittable(path, codecFactory)
@cloud-fan (Contributor) commented Jul 18, 2019:

do we really need to know if it's splittable or not? If Spark is scanning files with a single giant partition, it's going to be very slow.

@WeichenXu123 (Contributor, Author) commented Jul 18, 2019:

@cloud-fan Yes, but we'd better tell the user why only one partition is generated. So I prefer:

  • If the file is unsplittable, the log tells the user that the file is unsplittable (and includes the reason).
  • If the file is splittable, the log tells the user that parallelism can be increased via the `minPartitions` argument of `sc.textFile`.

What do you think?

@WeichenXu123 (Contributor, Author):

@cloud-fan Any thoughts?

@cloud-fan (Contributor):

SGTM, let's include the reason in the message.

@SparkQA commented Jul 18, 2019

Test build #107840 has finished for PR 25134 at commit 0c2ce85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Retest this please.

@SparkQA commented Jul 23, 2019

Test build #108025 has finished for PR 25134 at commit 0c2ce85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 23, 2019

Test build #108029 has finished for PR 25134 at commit e6cf714.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 23, 2019

Test build #108030 has finished for PR 25134 at commit feb8dd0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 30, 2019

Test build #108361 has finished for PR 25134 at commit 4ce0d33.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 (Contributor, Author)

Jenkins, retest this please.

@SparkQA commented Jul 30, 2019

Test build #108370 has finished for PR 25134 at commit 4ce0d33.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

retest this please

@SparkQA commented Jul 30, 2019

Test build #108374 has finished for PR 25134 at commit 4ce0d33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 30, 2019

Test build #108393 has finished for PR 25134 at commit 9442948.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (!isSplitable(path)) {
  Some("the file is compressed by unsplittable compression codec")
} else {
  None
Contributor:

when will we hit this branch?

@WeichenXu123 (Contributor, Author):

Removed all branches returning `None` and added an assert.
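
A rough sketch of the resulting shape, based on the excerpt above and the warning strings shown earlier (the `multiLine` accessor and exact wiring are assumptions, not the verbatim merged code):

    // Callers are expected to check isSplitable first, so instead of branches
    // returning None this method asserts unsplittability and always produces
    // a reason string.
    override def getFileUnSplittableReason(path: Path): String = {
      assert(!isSplitable(path))
      if (options.multiLine) { // hypothetical accessor for the multiLine option
        "the csv datasource is set multiLine mode"
      } else {
        "the file is compressed by unsplittable compression codec"
      }
    }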

@SparkQA commented Jul 31, 2019

Test build #108476 has finished for PR 25134 at commit 801c6e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master!

@cloud-fan closed this in 26d03b6 on Aug 1, 2019
 * If a file with `path` is unsplittable, return the unsplittable reason,
 * otherwise return `None`.
 */
def getFileUnSplittableReason(path: Path): String = {
@HyukjinKwon (Member) commented Aug 2, 2019:

@cloud-fan, is it really worth exposing another internal API in our common source trait?

@cloud-fan (Contributor):

We have `isSplittable`, and it makes sense to explain why a file is unsplittable. Maybe there is a way to merge these two methods, but I can't think of one now.
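
One conceivable merged shape (purely illustrative, not something this PR implements): a single method returning `Option[String]`, where `None` means splittable and `Some(reason)` explains why not.

    // Hypothetical merged API combining isSplitable and
    // getFileUnSplittableReason into one call.
    def unsplittableReason(path: Path): Option[String]

    // A caller could then log conditionally:
    // unsplittableReason(path).foreach(r => logWarning(s"... the reason is: $r"))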

.doc("When spark loading one single large file, if file size exceed this " +
"threshold, then log warning with possible reasons.")
.longConf
.createWithDefault(1024 * 1024 * 1024)
@HyukjinKwon (Member):

I don't think it's worth adding a config; it looks like overkill.

@cloud-fan (Contributor) commented Aug 2, 2019:

  1. This is an internal config.
  2. "Large file" is vague, and I don't think we can hardcode a value and say that's a "large file".

@HyukjinKwon (Member):

The problem is that this warning is trivial and not actually important.

We can just pick any reasonable number. Who will configure this? I won't. This information shouldn't be job-based, either.

@cloud-fan (Contributor):

People may set it to `Long.MaxValue` to disable the warning. Besides, an internal config doesn't hurt; we have many internal configs that users will never set.
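
For example, assuming the internal config can be set like any other Spark property, raising the threshold to the maximum value would effectively silence the warning (illustrative only):

    import org.apache.spark.SparkConf

    // Effectively disables the large-file warning by making the
    // threshold unreachable (Long.MaxValue bytes).
    val conf = new SparkConf()
      .set("spark.io.warning.largeFileThreshold", Long.MaxValue.toString)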

if (Utils.isFileSplittable(path, codecFactory)) {
  logWarning(s"Loading one large file ${path.toString} with only one partition, " +
    s"we can increase partition numbers by the `minPartitions` argument in method " +
    "`sc.textFile`")
Member:

Is it always `sc.textFile`? Many datasource V1 implementations still use `hadoopFile` or `newHadoopFile`.

if (fileSplit.getLength > conf.get(IO_WARNING_LARGEFILETHRESHOLD)) {
  val codecFactory = new CompressionCodecFactory(jobConf)
  if (Utils.isFileSplittable(path, codecFactory)) {
    logWarning(s"Loading one large file ${path.toString} with only one partition, " +
Member:

nit: `toString` isn't needed here, since string interpolation calls it implicitly (`$path` is enough).

val codecFactory = new CompressionCodecFactory(jobConf)
if (Utils.isFileSplittable(path, codecFactory)) {
  logWarning(s"Loading one large file ${path.toString} with only one partition, " +
    s"we can increase partition numbers by the `minPartitions` argument in method " +
Member:

nit: and the `s` prefix isn't needed either, since this line interpolates nothing.

@HyukjinKwon (Member)

@cloud-fan, this looks like overkill. Can we simply mention it in the DataFrame(Reader|Writer)/DataStream(Reader|Writer) docs for our datasources?

For the Hadoop ones, the Hadoop input format (or somewhere similar) should describe it.

@cloud-fan (Contributor)

Even if we document it, how would users know which codec their data files use? It's better to give a warning before running a foreseeably long job.

@HyukjinKwon (Member) commented Aug 2, 2019

They specify it via the compression option when they write data out from Spark, and the codec is detected from the file extension. We can simply leave a note rather than a warning. This info is static rather than job-based, and users will likely know which codec was used for their files.

private[spark] val IO_WARNING_LARGEFILETHRESHOLD =
  ConfigBuilder("spark.io.warning.largeFileThreshold")
    .internal()
    .doc("When spark loading one single large file, if file size exceed this " +
@gatorsmile (Member) commented Nov 26, 2019:

Please update the description to:

If the size in bytes of a file loaded by Spark exceeds this threshold, a warning is logged with the possible reasons.

.internal()
.doc("When spark loading one single large file, if file size exceed this " +
  "threshold, then log warning with possible reasons.")
.longConf
@gatorsmile (Member) commented Nov 26, 2019:

Please update it to `.bytesConf(ByteUnit.BYTE)`.
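
Applying both suggestions, the config definition would look roughly like this (a sketch consistent with the follow-up PR #26691 described below):

    private[spark] val IO_WARNING_LARGEFILETHRESHOLD =
      ConfigBuilder("spark.io.warning.largeFileThreshold")
        .internal()
        .doc("If the size in bytes of a file loaded by Spark exceeds this threshold, " +
          "a warning is logged with the possible reasons.")
        .bytesConf(ByteUnit.BYTE)
        .createWithDefault(1024 * 1024 * 1024)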

@gatorsmile (Member)

cc @Ngone51 @cloud-fan

cloud-fan pushed a commit that referenced this pull request Nov 27, 2019
…HRESHOLD

### What changes were proposed in this pull request?

Improve conf `IO_WARNING_LARGEFILETHRESHOLD` (a.k.a `spark.io.warning.largeFileThreshold`):

* reword documentation

* change type from `long` to `bytes`

### Why are the changes needed?

Improvements according to #25134 (comment) & #25134 (comment).

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #26691 from Ngone51/SPARK-28366-followup.

Authored-by: wuyi <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request on Dec 6, 2019 (same follow-up commit as above).